Hadoop Ecosystem Frameworks: Pig, Hive, HBase



Hadoop Ecosystem Frameworks

Hadoop Ecosystem frameworks provide higher-level tools to process, analyze, and store big data without writing low-level MapReduce code.

Key Frameworks:

  • Pig – For data transformation and ETL
  • Hive – For SQL-like queries on big data
  • HBase – For real-time access to large-scale data

Apache Pig

Pig is a high-level scripting language (Pig Latin) for processing large data sets on Hadoop.

  • Simplifies MapReduce programming
  • Designed for data transformation, aggregation, and ETL tasks

Key Components

  • Pig Latin – Scripting language
  • Pig Engine – Translates Pig Latin into MapReduce jobs

Execution Modes:

  • Local mode – Runs on a single machine using the local file system
  • Hadoop (MapReduce) mode – Runs on a Hadoop cluster using HDFS

Example (Pig Latin Script)

-- Load data
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, course:chararray);

-- Filter students above 20
adults = FILTER students BY age > 20;

-- Group by course
grouped = GROUP adults BY course;

-- Count per course
result = FOREACH grouped GENERATE group, COUNT(adults);

-- Store result
STORE result INTO 'output';

Real-Life Example: Processing website logs to find daily visitors or filtering users by age.

Apache Hive

Hive is a data warehouse framework on Hadoop that allows SQL-like queries on large data stored in HDFS.

  • Uses HiveQL (SQL-like language)
  • Converts queries to MapReduce, Tez, or Spark jobs

Key Features

Feature        | Description
Query Language | HiveQL (SQL-like)
Storage        | HDFS or HBase
Schema         | Defined on read (flexible)
Partitioning   | Splits tables into partitions for faster queries
Indexing       | Improves query speed

Hive Example

-- Create table
CREATE TABLE students (name STRING, age INT, course STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

-- Load data
LOAD DATA INPATH '/user/hadoop/students.txt' INTO TABLE students;

-- Query data
SELECT course, COUNT(*) FROM students GROUP BY course;

Real-Life Example: Analyzing sales data to calculate total sales per region.

Apache HBase

HBase is a NoSQL column-oriented database built on HDFS for real-time access to large datasets.

  • Modeled after Google Bigtable
  • Best for sparse data and random reads/writes

Key Features

Feature         | Description
Column-Oriented | Stores data in column families
Scalable        | Handles petabytes of data
Real-Time       | Fast read/write operations
Integration     | Works with Hive, Pig, Spark

HBase Concepts

Term          | Meaning
Table         | Collection of column families
Row           | Identified by a unique row key
Column Family | Group of related columns
Cell          | Intersection of row and column (stores value)

HBase Example (CLI Commands)

# Create table
create 'students', 'info'

# Insert data
put 'students', 'row1', 'info:name', 'Jay'
put 'students', 'row1', 'info:age', '24'

# Retrieve data
get 'students', 'row1'

# Scan table
scan 'students'

Real-Life Example: Storing user profiles for social media platforms with fast read/write access.

Comparison: Pig vs Hive vs HBase

Feature     | Pig                      | Hive                      | HBase
Language    | Pig Latin                | SQL-like (HiveQL)         | Column-based NoSQL
Use Case    | ETL, data transformation | Data analysis & reporting | Real-time random access
Execution   | MapReduce                | MapReduce/Tez/Spark       | HDFS-backed
Ease of Use | Medium                   | Easy (SQL knowledge)      | Medium (NoSQL knowledge)
Real-Time   | No                       | No                        | Yes

Applications on Big Data

Framework | Application Example
Pig       | Cleaning and transforming log files
Hive      | Sales reports, trend analysis using SQL
HBase     | Real-time recommendation system, user profile storage

Exam-Ready Short Notes

  • Pig – ETL tool, Pig Latin script, converts to MapReduce
  • Hive – SQL-like querying on HDFS, uses HiveQL
  • HBase – Column-oriented NoSQL, real-time access
  • Pig vs Hive – Pig for data processing, Hive for data querying
  • HBase vs Hive – HBase for random reads/writes, Hive for batch queries

Pig

Apache Pig is a high-level platform used to process large datasets in Hadoop.

  • Uses a scripting language called Pig Latin
  • Converts Pig Latin scripts into MapReduce jobs
  • Ideal for ETL, data transformation, and data processing tasks

Real-Life Example: Processing website log files to calculate page visits or filter users by region.

Execution Modes of Pig

Pig can run in three modes, selected with the -x flag (for example, pig -x local):

Mode                    | Description                                        | Use Case
Local Mode              | Runs on a single machine using local files         | Small datasets, testing
Hadoop / MapReduce Mode | Runs on a Hadoop cluster, uses HDFS                | Large datasets, production
Tez Mode                | Runs on the Apache Tez engine for faster execution | Faster processing than MapReduce

Comparison of Pig with Traditional Databases

Feature               | Pig                       | RDBMS
Language              | Pig Latin (script)        | SQL
Schema                | Optional (schema-on-read) | Fixed schema
Processing            | Batch/parallel            | Sequential or limited parallelism
Flexibility           | High                      | Low
Suitable for Big Data | Yes                       | Limited

Summary: Pig is more flexible and scalable than traditional databases for big data.

Grunt – Pig Interactive Shell

  • Grunt is the interactive shell for Pig
  • Allows executing Pig Latin commands interactively
  • Useful for testing and debugging scripts

Example Commands in Grunt:

grunt> A = LOAD 'data.txt' USING PigStorage(',') AS (name:chararray, age:int);
grunt> DUMP A;

Pig Latin – Scripting Language

Pig Latin is a data flow language for processing large datasets.

Basic Structure

  • LOAD – Load data into Pig
  • TRANSFORM – Apply transformations (filter, join, group)
  • STORE / DUMP – Store or display output

Example:

-- Load data
students = LOAD 'students.txt' USING PigStorage(',') AS (name:chararray, age:int, course:chararray);

-- Filter students older than 20
adults = FILTER students BY age > 20;

-- Store result
STORE adults INTO 'output';

User Defined Functions (UDFs)

  • Pig allows custom functions called UDFs
  • Written in Java, in Python (via Jython), or in other JVM languages such as JRuby or Groovy
  • Used when built-in functions are not sufficient

Example: Creating a UDF to calculate student grades from marks, as sketched below.
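
A minimal Java sketch of such a grading UDF, using Pig's EvalFunc base class; the package name (myudfs), class name (GradeUdf), and grade cut-offs are illustrative assumptions:

package myudfs;

import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

public class GradeUdf extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        // Propagate nulls the way built-in functions do
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        // Assumes the field is declared int in the load schema
        int marks = (Integer) input.get(0);
        if (marks >= 75) return "A";     // cut-offs are illustrative
        if (marks >= 60) return "B";
        if (marks >= 40) return "C";
        return "F";
    }
}

After packaging the class into a jar, a script would load it with REGISTER myudfs.jar; and call it as myudfs.GradeUdf(marks) inside a FOREACH ... GENERATE.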

Data Processing Operators in Pig

1. LOAD & STORE

  • LOAD – Load data into Pig
  • STORE – Save data to HDFS

students = LOAD 'students.txt' USING PigStorage(',');
STORE students INTO 'output';

2. FOREACH

  • Iterate over each record

names = FOREACH students GENERATE name;

3. FILTER

  • Select records based on condition

adults = FILTER students BY age > 20;

4. GROUP

  • Group records by field

grouped = GROUP students BY course;

5. JOIN

  • Join two datasets

joined = JOIN students BY id, marks BY student_id;

6. ORDER / SORT

  • Sort records

sorted = ORDER students BY age DESC;

7. DISTINCT

  • Remove duplicate records (DISTINCT operates on a whole relation, so project the field first)

courses = FOREACH students GENERATE course;
unique_courses = DISTINCT courses;

8. UNION

  • Combine two datasets

all_students = UNION students1, students2;

Real-Life Applications of Pig

Task             | Example
ETL              | Cleaning and transforming raw log files
Data Analytics   | Counting clicks on a website
Data Preparation | Filtering customer data for analysis
Integration      | Works with Hive, HBase, and Spark

Exam-Ready Short Notes

  • Pig – High-level data processing framework on Hadoop
  • Pig Latin – Scripting language for data transformations
  • Grunt – Interactive Pig shell
  • UDF – Custom functions for complex tasks
  • Operators – LOAD, STORE, FILTER, JOIN, GROUP, FOREACH, ORDER

Hive

Apache Hive is a data warehouse software built on top of Hadoop that allows SQL-like querying of large datasets stored in HDFS.

  • Uses HiveQL (SQL-like language)
  • Converts queries into MapReduce, Tez, or Spark jobs
  • Best for batch data analysis and reporting

Real-Life Example: An e-commerce company analyzing sales transactions or customer behavior across millions of records.

Hive Architecture

Hive architecture consists of the following key components:

+-----------------------+
|      User Layer       |
|  (Hive Shell / UI /   |
|  JDBC / ODBC clients) |
+-----------+-----------+
            |
+-----------v-----------+
|   Driver / Compiler   |
|  (HiveQL -> MR Jobs)  |
+-----------+-----------+
            |
+-----------v-----------+
|   Execution Engine    |
|  (Manages MapReduce,  |
|   Tez, Spark Jobs)    |
+-----------+-----------+
            |
+-----------v-----------+
|       Metastore       |
|  (Metadata storage,   |
| DB for tables/schemas)|
+-----------+-----------+
            |
+-----------v-----------+
|     Storage Layer     |
|        (HDFS)         |
+-----------------------+

Key Components

Component         | Role
Hive Shell / CLI  | User interface to run HiveQL
Driver / Compiler | Compiles HiveQL into an execution plan
Execution Engine  | Runs jobs on Hadoop (MapReduce/Tez/Spark)
Metastore         | Stores metadata about tables, partitions, schema
Storage Layer     | HDFS or HBase stores the actual data

Hive Installation

Requirements

  • Hadoop installed
  • Java JDK
  • Hive binary package

Installation Steps 

  • Download Hive from Apache site
  • Extract Hive package
  • Set environment variables (HIVE_HOME)
  • Configure hive-site.xml
  • Connect to Hive Metastore (embedded or remote DB)
  • Start Hive shell: hive

Hive Shell & Services

Hive Shell

  • Interactive CLI for executing HiveQL queries
  • Similar to SQL console

Example:

hive> SHOW DATABASES;
hive> USE sales_db;

Hive Services

Service       | Purpose
CLI           | Command-line interface
JDBC / ODBC   | Connect external applications
Web UI        | Monitor Hive jobs
Thrift Server | Execute queries remotely
Metastore     | Stores table metadata
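
As a concrete illustration of the JDBC service, here is a minimal Java sketch; it assumes a running HiveServer2 at localhost:10000, the default database, and the students table from earlier (URL and credentials are illustrative):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveJdbcExample {
    public static void main(String[] args) throws Exception {
        // HiveServer2 JDBC driver, shipped with Hive
        Class.forName("org.apache.hive.jdbc.HiveDriver");
        try (Connection conn = DriverManager.getConnection(
                 "jdbc:hive2://localhost:10000/default", "hadoop", "");
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery(
                 "SELECT course, COUNT(*) FROM students GROUP BY course")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}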

Hive Metastore

  • Stores metadata (tables, columns, partitions, data types)
  • Can use embedded Derby DB or MySQL/PostgreSQL for production
  • Essential for query compilation and optimization

Hive vs Traditional Databases

Feature        | Hive              | RDBMS
Schema         | Schema-on-read    | Schema-on-write
Query Language | HiveQL (SQL-like) | SQL
Transactions   | Limited           | Full ACID support
Storage        | HDFS              | Local disk / storage engine
Best for       | Batch analytics   | OLTP / small datasets

HiveQL – SQL-Like Language

HiveQL allows you to create tables, query data, and manipulate data:

Creating a Table

CREATE TABLE students (
  id INT,
  name STRING,
  age INT,
  course STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Loading Data

LOAD DATA INPATH '/user/hadoop/students.txt' INTO TABLE students;

Querying Data

SELECT * FROM students WHERE age > 20;

SELECT course, COUNT(*) FROM students GROUP BY course;

User Defined Functions (UDFs)

  • Custom functions for special operations
  • Typically written in Java; Python scripts can be plugged in through the TRANSFORM clause
  • Can be used in HiveQL

Example: SELECT my_custom_udf(name) FROM students; (the function must first be registered from a user-supplied class, as sketched below)
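
A minimal sketch of what my_custom_udf could look like, using Hive's classic UDF base class; the package, class, and jar names and the upper-casing behavior are illustrative assumptions:

package myudfs;

import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

public class UpperCaseUdf extends UDF {
    public Text evaluate(Text input) {
        // Hive expects UDFs to pass nulls through unchanged
        if (input == null) {
            return null;
        }
        return new Text(input.toString().toUpperCase());
    }
}

It would be registered in a session with ADD JAR myudfs.jar; followed by CREATE TEMPORARY FUNCTION my_custom_udf AS 'myudfs.UpperCaseUdf';.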

Sorting and Aggregating

  • Sorting: ORDER BY, SORT BY
  • Aggregation: COUNT, SUM, AVG, MIN, MAX

Example:

SELECT course, COUNT(*) AS total
FROM students
GROUP BY course
ORDER BY total DESC;

Hive and MapReduce Scripts

  • Hive converts queries to MapReduce jobs automatically
  • Users don’t need to write MapReduce manually
  • Custom map/reduce scripts can also be embedded in queries via the TRANSFORM clause

Joins & Subqueries

Join Example

SELECT s.name, m.marks
FROM students s
JOIN marks m ON s.id = m.student_id;

Subquery Example

SELECT name, age
FROM students
WHERE id IN (SELECT student_id FROM marks WHERE marks > 80);

Real-Life Applications of Hive

Task             | Example
Data Analysis    | Sales reports, clickstream analysis
Data Warehousing | Aggregating historical data for BI
ETL              | Transform raw logs into structured format
Integration      | Works with Spark, HBase, Pig

Exam-Ready Short Notes

  • Hive – SQL-like data warehouse on Hadoop
  • HiveQL – Language to query data in Hive
  • Metastore – Stores metadata of tables/partitions
  • Tables – Managed (Hive-controlled) or External (HDFS-controlled)
  • Operators – Sorting (ORDER BY), Aggregation (COUNT, SUM)
  • Joins & Subqueries – Combine tables and nested queries

HBase

Apache HBase is a NoSQL column-oriented database built on Hadoop for real-time random read/write access to very large datasets.

  • Modeled after Google Bigtable
  • Works on HDFS for storage
  • Best for sparse, large-scale data

Real-Life Example:

  • Storing user profiles for social media applications
  • Online transaction systems with fast read/write

HBase Concepts

Concept       | Description
Table         | Collection of column families
Row           | Identified by a unique row key
Column Family | Group of related columns
Column        | Stores data (cells)
Cell          | Intersection of row and column; stores value and timestamp

HBase Clients

HBase provides different clients to interact with data:

Client     | Purpose
Java API   | Build applications in Java
REST API   | Access HBase via HTTP
Thrift API | Cross-language support
Shell      | Interactive CLI for commands
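
For the Java API, a minimal sketch that mirrors the shell example below; it assumes the students table with column family info already exists and that hbase-site.xml is on the classpath:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class StudentsClient {
    public static void main(String[] args) throws IOException {
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("students"))) {
            // Insert one cell: row key "row1", column info:name
            Put put = new Put(Bytes.toBytes("row1"));
            put.addColumn(Bytes.toBytes("info"), Bytes.toBytes("name"),
                          Bytes.toBytes("Jay"));
            table.put(put);

            // Read the row back and print the stored name
            Result result = table.get(new Get(Bytes.toBytes("row1")));
            byte[] name = result.getValue(Bytes.toBytes("info"),
                                          Bytes.toBytes("name"));
            System.out.println(Bytes.toString(name));
        }
    }
}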

HBase Example (CLI)

# Create table with column family 'info'
create 'students', 'info'

# Insert data
put 'students', 'row1', 'info:name', 'Jay'
put 'students', 'row1', 'info:age', '24'

# Retrieve data
get 'students', 'row1'

# Scan table
scan 'students'

HBase vs RDBMS

Feature      | HBase                  | RDBMS
Data Model   | Column-oriented        | Row-oriented
Schema       | Flexible               | Fixed
Transactions | Limited (no full ACID) | Full ACID
Storage      | HDFS                   | Local disk / DB engine
Query        | APIs / MapReduce       | SQL
Use Case     | Real-time large data   | OLTP / structured data

Advanced HBase Usage

1. Schema Design

  • Column families group related data
  • Minimize column families for efficiency
  • Design row keys carefully for write distribution and query performance (see the sketch below)
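
A small sketch of the row-key advice: for write-heavy tables with sequential IDs, prefixing a hash "salt" spreads writes across regions and avoids hot-spotting a single RegionServer (the bucket count and key format here are illustrative assumptions):

public class RowKeys {
    // Map a sequential ID into one of 10 buckets (count is illustrative)
    static String saltedKey(String userId) {
        int bucket = Math.floorMod(userId.hashCode(), 10);
        return bucket + "_" + userId;   // e.g. "3_user000123"
    }

    public static void main(String[] args) {
        System.out.println(saltedKey("user000123"));
    }
}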

2. Advanced Indexing

  • Secondary indexes for faster queries
  • Coprocessors for custom server-side processing

3. Integration

  • Works with Hive, Pig, Spark for analytics

Zookeeper

  • Centralized service to coordinate distributed applications
  • Provides configuration management, synchronization, and naming services

How Zookeeper Helps HBase

  • Monitors HBase cluster
  • Manages Master and RegionServer state
  • Ensures failover and high availability

Building Applications with Zookeeper

  • Applications use Zookeeper for coordination

Common tasks:

  • Leader election
  • Configuration updates
  • Distributed locks

Real-Life Example: A Spark or HBase cluster uses Zookeeper to monitor nodes and handle failures automatically.
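
A minimal Java sketch of this coordination pattern, assuming a Zookeeper ensemble at localhost:2181 and an existing /workers parent znode (host, paths, and timeout are illustrative):

import java.util.List;
import java.util.concurrent.CountDownLatch;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.Watcher;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkerRegistration {
    public static void main(String[] args) throws Exception {
        // Block until the session is actually established
        CountDownLatch connected = new CountDownLatch(1);
        ZooKeeper zk = new ZooKeeper("localhost:2181", 15000, event -> {
            if (event.getState() == Watcher.Event.KeeperState.SyncConnected) {
                connected.countDown();
            }
        });
        connected.await();

        // An ephemeral znode disappears automatically when this client's
        // session ends; the same mechanism lets HBase detect failed
        // RegionServers
        zk.create("/workers/worker-1", new byte[0],
                  ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);

        // Any client can list the currently live workers
        List<String> live = zk.getChildren("/workers", false);
        System.out.println("Live workers: " + live);
    }
}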

IBM Big Data Strategy

IBM provides enterprise-level Big Data solutions to manage and analyze large datasets efficiently.

Key Tools

Tool        | Description
InfoSphere  | Data integration, governance, and warehousing
BigInsights | Enterprise Hadoop platform for analytics
BigSheets   | Excel-like interface on BigInsights for non-programmers
Big SQL     | SQL engine to query Hadoop data using standard SQL

IBM Big Data Ecosystem Use Case

  • Retail: Analyze customer transactions in real-time using Big SQL and BigInsights
  • Banking: Fraud detection using HBase and Spark on BigInsights

Exam-Ready Short Notes

  • HBase – Column-oriented NoSQL database on Hadoop
  • Zookeeper – Coordinates distributed applications, monitors cluster
  • Advanced HBase – Schema design, secondary indexing, coprocessors
  • IBM Big Data Tools – InfoSphere, BigInsights, BigSheets, Big SQL
  • HBase vs RDBMS – HBase: real-time, flexible, Hadoop-backed; RDBMS: structured, transactional

Conclusion 

HBase provides real-time access to large-scale distributed data, while Zookeeper ensures cluster reliability, coordination, and failover. IBM’s Big Data tools, including InfoSphere, BigInsights, BigSheets, and Big SQL, offer enterprise-grade data analytics, integration, and SQL-on-Hadoop capabilities. Together, these technologies provide a robust framework for modern Big Data management and analytics.