Hadoop Ecosystem Frameworks: Pig, Hive, HBase
Hadoop Ecosystem Frameworks
Hadoop Ecosystem frameworks provide higher-level tools to process, analyze, and store big data without writing low-level MapReduce code.
Key Frameworks:
- Pig – For data transformation and ETL
- Hive – For SQL-like queries on big data
- HBase – For real-time access to large-scale data
Apache Pig
Pig is a high-level scripting language (Pig Latin) for processing large data sets on Hadoop.
- Simplifies MapReduce programming
- Designed for data transformation, aggregation, and ETL tasks
Key Components
- Pig Latin – Scripting language
- Pig Engine – Translates Pig Latin into MapReduce jobs
Execution Modes:
- Local mode – Runs on a single machine
- Hadoop mode – Runs on HDFS cluster
Example (Pig Latin Script)
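A minimal sketch; the file path and schema are assumptions for illustration:

```pig
-- Load comma-separated user records from HDFS
users = LOAD '/data/users.txt' USING PigStorage(',')
        AS (name:chararray, age:int, city:chararray);

-- Keep only adult users
adults = FILTER users BY age >= 18;

-- Count users per city
by_city = GROUP adults BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;

-- Display the result
DUMP counts;
```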
Real-Life Example: Processing website logs to find daily visitors or filtering users by age.
Apache Hive
Hive is a data warehouse framework on Hadoop that allows SQL-like queries on large data stored in HDFS.
- Uses HiveQL (SQL-like language)
- Converts queries to MapReduce, Tez, or Spark jobs
Key Features
| Feature | Description |
|---|---|
| Query Language | HiveQL (SQL-like) |
| Storage | HDFS or HBase |
| Schema | Defined on read (flexible) |
| Partitioning | Splits tables into partitions for faster queries |
| Indexing | Improves query speed |
Hive Example
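A minimal HiveQL sketch; the `sales` table and its columns are assumptions:

```sql
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;
```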
Real-Life Example: Analyzing sales data to calculate total sales per region.
Apache HBase
HBase is a NoSQL column-oriented database built on HDFS for real-time access to large datasets.
- Works like Google Bigtable
- Best for sparse data and random reads/writes
Key Features
| Feature | Description |
|---|---|
| Column-Oriented | Stores data in column families |
| Scalable | Handles petabytes of data |
| Real-Time | Fast read/write operations |
| Integration | Works with Hive, Pig, Spark |
HBase Concepts
| Term | Meaning |
|---|---|
| Table | Collection of column families |
| Row | Record identified by a unique row key |
| Column Family | Group of related columns |
| Cell | Intersection of row and column (stores value) |
HBase Example (CLI Commands)
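A few representative shell commands; the table and column names are assumptions:

```
create 'users', 'profile'
put 'users', 'user1', 'profile:name', 'Alice'
get 'users', 'user1'
```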
Real-Life Example: Storing user profiles for social media platforms with fast read/write access.
Comparison: Pig vs Hive vs HBase
| Feature | Pig | Hive | HBase |
|---|---|---|---|
| Language | Pig Latin | SQL-like (HiveQL) | None (shell/API access) |
| Use Case | ETL, Data Transformation | Data Analysis & Reporting | Real-time random access |
| Execution | MapReduce | MapReduce/Tez/Spark | Direct reads/writes on HDFS |
| Ease of Use | Medium | Easy (SQL knowledge) | Medium (NoSQL knowledge) |
| Real-Time | No | No | Yes |
Applications on Big Data
| Framework | Application Example |
|---|---|
| Pig | Cleaning and transforming log files |
| Hive | Sales reports, trend analysis using SQL |
| HBase | Real-time recommendation system, user profile storage |
Exam-Ready Short Notes
- Pig – ETL tool, Pig Latin script, converts to MapReduce
- Hive – SQL-like querying on HDFS, uses HiveQL
- HBase – Column-oriented NoSQL, real-time access
- Pig vs Hive – Pig for data processing, Hive for data querying
- HBase vs Hive – HBase for random reads/writes, Hive for batch queries
Pig
Apache Pig is a high-level platform used to process large datasets in Hadoop.
- Uses a scripting language called Pig Latin
- Converts Pig Latin scripts into MapReduce jobs
- Ideal for ETL, data transformation, and data processing tasks
Real-Life Example: Processing website log files to calculate page visits or filter users by region.
Execution Modes of Pig
Pig can run in three modes:
| Mode | Description | Use Case |
|---|---|---|
| Local Mode | Runs on a single machine using local files | Small datasets, testing |
| Hadoop Mode / MapReduce Mode | Runs on Hadoop cluster, uses HDFS | Large datasets, production |
| Tez Mode | Runs on Apache Tez engine for faster execution | Faster processing than MapReduce |
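The `pig` command selects the mode with the `-x` flag, for example:

```
pig -x local script.pig        # local mode
pig -x mapreduce script.pig    # Hadoop/MapReduce mode (the default)
pig -x tez script.pig          # Tez mode
```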
Comparison of Pig with Traditional Databases
| Feature | Pig | RDBMS |
|---|---|---|
| Language | Pig Latin (Script) | SQL |
| Schema | Optional (Schema-on-read) | Fixed schema |
| Processing | Batch/Parallel | Sequential or limited parallelism |
| Flexibility | High | Low |
| Suitable for Big Data | Yes | Limited |
Summary: Pig is more flexible and scalable than traditional databases for big data.
Grunt – Pig Interactive Shell
- Grunt is the interactive shell for Pig
- Allows executing Pig Latin commands interactively
- Useful for testing and debugging scripts
Example Commands in Grunt:
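For example (the file path and schema are assumptions):

```pig
grunt> A = LOAD '/data/students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
grunt> B = FILTER A BY marks > 40;
grunt> DUMP B;
```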
Pig Latin – Scripting Language
Pig Latin is a data flow language for processing large datasets.
Basic Structure
- LOAD – Load data into Pig
- TRANSFORM – Apply transformations (filter, join, group)
- STORE / DUMP – Store or display output
Example:
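A sketch following that structure; paths and fields are assumptions:

```pig
-- LOAD: read input data
logs = LOAD '/data/access_logs' USING PigStorage('\t')
       AS (user:chararray, url:chararray);

-- TRANSFORM: group and count visits per URL
grouped = GROUP logs BY url;
visits  = FOREACH grouped GENERATE group AS url, COUNT(logs) AS hits;

-- STORE: write the result back to HDFS
STORE visits INTO '/output/url_visits';
```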
User Defined Functions (UDFs)
- Pig allows custom functions called UDFs
- Written in Java, Python, or other supported languages
- Used when built-in functions are not sufficient
Example: Creating a UDF to calculate student grades from marks.
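A minimal Java sketch of such a UDF; the class name and grade boundaries are assumptions:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Maps a numeric marks field to a letter grade (boundaries are illustrative)
public class GradeUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        int marks = ((Number) input.get(0)).intValue();
        if (marks >= 75) return "A";
        if (marks >= 60) return "B";
        if (marks >= 40) return "C";
        return "F";
    }
}
```

Once packaged into a jar, it could be used from Pig Latin with `REGISTER grade-udf.jar;` and then `FOREACH students GENERATE name, GradeUDF(marks);`.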
Data Processing Operators in Pig
1. LOAD & STORE – `LOAD` reads data into Pig; `STORE` saves results to HDFS
2. FOREACH – Iterate over each record
3. FILTER – Select records based on a condition
4. GROUP – Group records by a field
5. JOIN – Join two datasets
6. ORDER / SORT – Sort records
7. DISTINCT – Remove duplicates
8. UNION – Combine two datasets
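A short script combining several of these operators; the file paths and schemas are assumptions for illustration:

```pig
orders    = LOAD '/data/orders.csv'    USING PigStorage(',') AS (id:int, cust:chararray, amount:double);
customers = LOAD '/data/customers.csv' USING PigStorage(',') AS (cust:chararray, city:chararray);

big     = FILTER orders BY amount > 100.0;                -- FILTER
joined  = JOIN big BY cust, customers BY cust;            -- JOIN
grouped = GROUP joined BY customers::city;                -- GROUP
totals  = FOREACH grouped GENERATE group AS city,         -- FOREACH
              COUNT(joined) AS order_count;
sorted  = ORDER totals BY order_count DESC;               -- ORDER
STORE sorted INTO '/output/orders_by_city';               -- STORE
```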
Real-Life Applications of Pig
| Task | Example |
|---|---|
| ETL | Cleaning and transforming raw log files |
| Data Analytics | Counting clicks on a website |
| Data Preparation | Filtering customer data for analysis |
| Integration | Works with Hive, HBase, and Spark |
Exam-Ready Short Notes
- Pig – High-level data processing framework on Hadoop
- Pig Latin – Scripting language for data transformations
- Grunt – Interactive Pig shell
- UDF – Custom functions for complex tasks
- Operators – LOAD, STORE, FILTER, JOIN, GROUP, FOREACH, ORDER
Hive
Apache Hive is a data warehouse software built on top of Hadoop that allows SQL-like querying of large datasets stored in HDFS.
- Uses HiveQL (SQL-like language)
- Converts queries into MapReduce, Tez, or Spark jobs
- Best for batch data analysis and reporting
Real-Life Example: An e-commerce company analyzing sales transactions or customer behavior across millions of records.
Hive Architecture
Hive architecture consists of the following key components:
Key Components
| Component | Role |
|---|---|
| Hive Shell / CLI | User interface to run HiveQL |
| Driver / Compiler | Compiles HiveQL into execution plan |
| Execution Engine | Runs jobs on Hadoop (MapReduce/Tez/Spark) |
| Metastore | Stores metadata about tables, partitions, schema |
| Storage Layer | HDFS or HBase stores the actual data |
Hive Installation
Requirements
- Hadoop installed
- Java JDK
- Hive binary package
Installation Steps
- Download Hive from the Apache site
- Extract the Hive package
- Set environment variables (`HIVE_HOME`)
- Configure `hive-site.xml`
- Connect to the Hive Metastore (embedded or remote DB)
- Start the Hive shell: `hive`
Hive Shell & Services
Hive Shell
- Interactive CLI for executing HiveQL queries
- Similar to SQL console
Example:
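For instance (the `students` table is an assumption):

```sql
hive> SHOW TABLES;
hive> SELECT * FROM students LIMIT 5;
```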
Hive Services
| Service | Purpose |
|---|---|
| CLI | Command-line interface |
| JDBC / ODBC | Connect external applications |
| Web UI | Monitor Hive jobs |
| Thrift Server | Execute queries remotely |
| Metastore | Stores table metadata |
Hive Metastore
- Stores metadata (tables, columns, partitions, data types)
- Can use embedded Derby DB or MySQL/PostgreSQL for production
- Essential for query compilation and optimization
Hive vs Traditional Databases
| Feature | Hive | RDBMS |
|---|---|---|
| Schema | Schema-on-read | Schema-on-write |
| Query Language | HiveQL (SQL-like) | SQL |
| Transactions | Limited | Full ACID support |
| Storage | HDFS | Local disk / Storage engine |
| Best for | Batch analytics | OLTP / Small datasets |
HiveQL – SQL-Like Language
HiveQL allows you to create tables, query data, and manipulate data:
Creating a Table
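A minimal sketch; the table and column names are assumptions:

```sql
CREATE TABLE students (
  id INT,
  name STRING,
  marks INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```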
Loading Data
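Assuming a local CSV file matching the schema above:

```sql
LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE students;
```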
Querying Data
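For example:

```sql
SELECT name, marks
FROM students
WHERE marks > 40;
```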
User Defined Functions (UDFs)
- Custom functions for special operations
- Written in Java or Python
- Can be used in HiveQL
Example: `SELECT my_custom_udf(name) FROM students;`
Sorting and Aggregating
- Sorting: `ORDER BY`, `SORT BY`
- Aggregation: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
Example:
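A sketch combining both (the `class` column is an assumption):

```sql
SELECT class, AVG(marks) AS avg_marks, COUNT(*) AS num_students
FROM students
GROUP BY class
ORDER BY avg_marks DESC;
```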
Hive and MapReduce Scripts
- Hive converts queries to MapReduce jobs automatically
- Users don’t need to write MapReduce manually
- Can also embed custom MapReduce scripts in Hive queries
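As an illustration, Hive's `TRANSFORM` clause streams rows through an external script; here `clean_logs.py` and the `raw_logs` table are hypothetical:

```sql
ADD FILE clean_logs.py;

SELECT TRANSFORM (line)
USING 'python clean_logs.py'
AS (ip STRING, url STRING)
FROM raw_logs;
```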
Joins & Subqueries
Join Example
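A minimal sketch (the `orders` and `customers` tables are assumptions):

```sql
SELECT c.name, o.amount
FROM orders o
JOIN customers c
  ON o.customer_id = c.id;
```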
Subquery Example
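Using the same assumed tables:

```sql
SELECT name
FROM customers
WHERE id IN (
  SELECT customer_id
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 1000
);
```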
Real-Life Applications of Hive
| Task | Example |
|---|---|
| Data Analysis | Sales reports, clickstream analysis |
| Data Warehousing | Aggregating historical data for BI |
| ETL | Transform raw logs into structured format |
| Integration | Works with Spark, HBase, Pig |
Exam-Ready Short Notes
- Hive – SQL-like data warehouse on Hadoop
- HiveQL – Language to query data in Hive
- Metastore – Stores metadata of tables/partitions
- Tables – Managed (Hive-controlled) or External (HDFS-controlled)
- Operators – Sorting (`ORDER BY`), Aggregation (`COUNT`, `SUM`)
- Joins & Subqueries – Combine tables and nested queries
HBase
Apache HBase is a NoSQL column-oriented database built on Hadoop for real-time random read/write access to very large datasets.
- Modeled after Google Bigtable
- Works on HDFS for storage
- Best for sparse, large-scale data
Real-Life Example:
- Storing user profiles for social media applications
- Online transaction systems with fast read/write
HBase Concepts
| Concept | Description |
|---|---|
| Table | Collection of column families |
| Row | Unique row key identifies a record |
| Column Family | Group of related columns |
| Column | Stores data (cells) |
| Cell | Intersection of row and column; stores value and timestamp |
HBase Clients
HBase provides different clients to interact with data:
| Client | Purpose |
|---|---|
| Java API | Build applications in Java |
| REST API | Access HBase via HTTP |
| Thrift API | Cross-language support |
| Shell | Interactive CLI for commands |
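As a sketch of the Java API route, a minimal put/get round trip; the `users` table and `profile` column family are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", family "profile", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("profile"),
                                           Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```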
HBase Example (CLI)
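A representative shell session; the table and column names are assumptions:

```
create 'users', 'profile', 'activity'
put 'users', 'u100', 'profile:name', 'Asha'
put 'users', 'u100', 'activity:last_login', '2024-01-15'
get 'users', 'u100'
scan 'users'
disable 'users'
drop 'users'
```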
HBase vs RDBMS
| Feature | HBase | RDBMS |
|---|---|---|
| Data Model | Column-oriented | Row-oriented |
| Schema | Flexible | Fixed |
| Transactions | Limited (No full ACID) | Full ACID |
| Storage | HDFS | Local disk/DB engine |
| Query | APIs / MapReduce | SQL |
| Use Case | Real-time large data | OLTP / structured data |
Advanced HBase Usage
1. Schema Design
- Column families group related data
- Minimize column families for efficiency
- Use row keys carefully for query performance
2. Advanced Indexing
- Secondary indexes for faster queries (not built in; typically added via coprocessors or Apache Phoenix)
- Coprocessors for custom server-side processing
3. Integration
- Works with Hive, Pig, and Spark for analytics
Zookeeper
- Centralized service to coordinate distributed applications
- Provides configuration management, synchronization, and naming services
How Zookeeper Helps HBase
- Monitors HBase cluster
- Manages Master and RegionServer state
- Ensures failover and high availability
Building Applications with Zookeeper
- Applications use Zookeeper for coordination
- Common tasks (see the sketch after this list):
- Leader election
- Configuration updates
- Distributed locks
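For instance, the classic leader-election recipe uses ephemeral sequential znodes. A minimal Java sketch, assuming a local ensemble at `localhost:2181` and a hypothetical `/election` parent node:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (production code would wait for the
        // SyncConnected event before issuing requests)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Make sure the election parent node exists
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each candidate registers an ephemeral, sequential znode
        String myPath = zk.create("/election/candidate-", new byte[0],
                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                  CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the lowest sequence number leads
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean isLeader = myPath.endsWith(children.get(0));
        System.out.println(isLeader ? "I am the leader" : "I am a follower");
    }
}
```

Because the znode is ephemeral, Zookeeper deletes it automatically if the process dies, and the remaining candidates can re-elect a leader.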
Real-Life Example: Spark or HBase cluster uses Zookeeper to monitor nodes and handle failures automatically
IBM Big Data Strategy
IBM provides enterprise-level Big Data solutions to manage and analyze large datasets efficiently.
Key Tools
| Tool | Description |
|---|---|
| InfoSphere | Data integration, governance, and warehousing |
| BigInsights | Enterprise Hadoop platform for analytics |
| BigSheets | Excel-like interface on BigInsights for non-programmers |
| Big SQL | SQL engine to query Hadoop data using standard SQL |
IBM Big Data Ecosystem Use Case
- Retail: Analyze customer transactions in real-time using Big SQL and BigInsights
- Banking: Fraud detection using HBase and Spark on BigInsights
Exam-Ready Short Notes
- HBase – Column-oriented NoSQL database on Hadoop
- Zookeeper – Coordinates distributed applications, monitors cluster
- Advanced HBase – Schema design, secondary indexing, coprocessors
- IBM Big Data Tools – InfoSphere, BigInsights, BigSheets, Big SQL
- HBase vs RDBMS – HBase: real-time, flexible, Hadoop-backed; RDBMS: structured, transactional
Conclusion
HBase provides real-time access to large-scale distributed data, while Zookeeper ensures cluster reliability, coordination, and failover. IBM’s Big Data tools, including InfoSphere, BigInsights, BigSheets, and Big SQL, offer enterprise-grade data analytics, integration, and SQL-on-Hadoop capabilities. Together, these technologies provide a robust framework for modern Big Data management and analytics.