Hadoop Ecosystem Frameworks: Pig, Hive, HBase
Hadoop Ecosystem Frameworks
Hadoop Ecosystem frameworks provide higher-level tools to process, analyze, and store big data without writing low-level MapReduce code.
Key Frameworks:
- Pig – For data transformation and ETL
- Hive – For SQL-like queries on big data
- HBase – For real-time access to large-scale data
Apache Pig
Pig is a high-level scripting language (Pig Latin) for processing large data sets on Hadoop.
- Simplifies MapReduce programming
- Designed for data transformation, aggregation, and ETL tasks
Key Components
- Pig Latin – Scripting language
- Pig Engine – Translates Pig Latin into MapReduce jobs
Execution Modes:
- Local mode – Runs on a single machine
- Hadoop mode – Runs on HDFS cluster
Example (Pig Latin Script)
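A minimal sketch; the file path and schema are assumptions for illustration:

```pig
-- Load comma-separated user records from HDFS
users = LOAD '/data/users.txt' USING PigStorage(',')
        AS (name:chararray, age:int, city:chararray);

-- Keep only adult users
adults = FILTER users BY age >= 18;

-- Count users per city
by_city = GROUP adults BY city;
counts  = FOREACH by_city GENERATE group AS city, COUNT(adults) AS total;

-- Display the result
DUMP counts;
```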
Real-Life Example: Processing website logs to find daily visitors or filtering users by age.
Apache Hive
Hive is a data warehouse framework on Hadoop that allows SQL-like queries on large data stored in HDFS.
- Uses HiveQL (SQL-like language)
- Converts queries to MapReduce, Tez, or Spark jobs
Key Features
| Feature | Description |
|---|---|
| Query Language | HiveQL (SQL-like) |
| Storage | HDFS or HBase |
| Schema | Defined on read (flexible) |
| Partitioning | Splits tables into partitions for faster queries |
| Indexing | Improves query speed |
Hive Example
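A minimal HiveQL sketch; the `sales` table and its columns are assumptions:

```sql
SELECT region, SUM(amount) AS total_sales
FROM sales
GROUP BY region;
```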
Real-Life Example: Analyzing sales data to calculate total sales per region.
Apache HBase
HBase is a NoSQL column-oriented database built on HDFS for real-time access to large datasets.
- Works like Google Bigtable
- Best for sparse data and random reads/writes
Key Features
| Feature | Description |
|---|---|
| Column-Oriented | Stores data in column families |
| Scalable | Handles petabytes of data |
| Real-Time | Fast read/write operations |
| Integration | Works with Hive, Pig, Spark |
HBase Concepts
| Term | Meaning |
|---|---|
| Table | Collection of column families |
| Row | Record identified by a unique row key |
| Column Family | Group of related columns |
| Cell | Intersection of row and column (stores value) |
HBase Example (CLI Commands)
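A few representative shell commands; the table and column names are assumptions:

```
create 'users', 'profile'
put 'users', 'user1', 'profile:name', 'Alice'
get 'users', 'user1'
```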
Real-Life Example: Storing user profiles for social media platforms with fast read/write access.
Comparison: Pig vs Hive vs HBase
| Feature | Pig | Hive | HBase |
|---|---|---|---|
| Language | Pig Latin | SQL-like (HiveQL) | None (shell/API access) |
| Use Case | ETL, Data Transformation | Data Analysis & Reporting | Real-time random access |
| Execution | MapReduce | MapReduce/Tez/Spark | Direct reads/writes on HDFS |
| Ease of Use | Medium | Easy (SQL knowledge) | Medium (NoSQL knowledge) |
| Real-Time | No | No | Yes |
Applications on Big Data
| Framework | Application Example |
|---|---|
| Pig | Cleaning and transforming log files |
| Hive | Sales reports, trend analysis using SQL |
| HBase | Real-time recommendation system, user profile storage |
Exam-Ready Short Notes
- Pig – ETL tool, Pig Latin script, converts to MapReduce
- Hive – SQL-like querying on HDFS, uses HiveQL
- HBase – Column-oriented NoSQL, real-time access
- Pig vs Hive – Pig for data processing, Hive for data querying
- HBase vs Hive – HBase for random reads/writes, Hive for batch queries
Pig
Apache Pig is a high-level platform used to process large datasets in Hadoop.
- Uses a scripting language called Pig Latin
- Converts Pig Latin scripts into MapReduce jobs
- Ideal for ETL, data transformation, and data processing tasks
Real-Life Example: Processing website log files to calculate page visits or filter users by region.
Execution Modes of Pig
Pig can run in three modes:
| Mode | Description | Use Case |
|---|---|---|
| Local Mode | Runs on a single machine using local files | Small datasets, testing |
| Hadoop Mode / MapReduce Mode | Runs on Hadoop cluster, uses HDFS | Large datasets, production |
| Tez Mode | Runs on Apache Tez engine for faster execution | Faster processing than MapReduce |
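The `pig` command selects the mode with the `-x` flag, for example:

```
pig -x local script.pig        # local mode
pig -x mapreduce script.pig    # Hadoop/MapReduce mode (the default)
pig -x tez script.pig          # Tez mode
```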
Comparison of Pig with Traditional Databases
| Feature | Pig | RDBMS |
|---|---|---|
| Language | Pig Latin (Script) | SQL |
| Schema | Optional (Schema-on-read) | Fixed schema |
| Processing | Batch/Parallel | Sequential or limited parallelism |
| Flexibility | High | Low |
| Suitable for Big Data | Yes | Limited |
Summary: Pig is more flexible and scalable than traditional databases for big data.
Grunt – Pig Interactive Shell
- Grunt is the interactive shell for Pig
- Allows executing Pig Latin commands interactively
- Useful for testing and debugging scripts
Example Commands in Grunt:
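For example (the file path and schema are assumptions):

```pig
grunt> A = LOAD '/data/students.txt' USING PigStorage(',') AS (name:chararray, marks:int);
grunt> B = FILTER A BY marks > 40;
grunt> DUMP B;
```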
Pig Latin – Scripting Language
Pig Latin is a data flow language for processing large datasets.
Basic Structure
- LOAD – Load data into Pig
- TRANSFORM – Apply transformations (filter, join, group)
- STORE / DUMP – Store or display output
Example:
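A sketch following that structure; paths and fields are assumptions:

```pig
-- LOAD: read input data
logs = LOAD '/data/access_logs' USING PigStorage('\t')
       AS (user:chararray, url:chararray);

-- TRANSFORM: group and count visits per URL
grouped = GROUP logs BY url;
visits  = FOREACH grouped GENERATE group AS url, COUNT(logs) AS hits;

-- STORE: write the result back to HDFS
STORE visits INTO '/output/url_visits';
```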
User Defined Functions (UDFs)
- Pig allows custom functions called UDFs
- Written in Java, Python, or other supported languages
- Used when built-in functions are not sufficient
Example: Creating a UDF to calculate student grades from marks.
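A minimal Java sketch of such a UDF; the class name and grade boundaries are assumptions:

```java
import java.io.IOException;
import org.apache.pig.EvalFunc;
import org.apache.pig.data.Tuple;

// Maps a numeric marks field to a letter grade (boundaries are illustrative)
public class GradeUDF extends EvalFunc<String> {
    @Override
    public String exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) {
            return null;
        }
        int marks = ((Number) input.get(0)).intValue();
        if (marks >= 75) return "A";
        if (marks >= 60) return "B";
        if (marks >= 40) return "C";
        return "F";
    }
}
```

Once packaged into a jar, it could be used from Pig Latin with `REGISTER grade-udf.jar;` and then `FOREACH students GENERATE name, GradeUDF(marks);`.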
Data Processing Operators in Pig
1. LOAD & STORE – `LOAD` reads data into Pig; `STORE` saves results to HDFS
2. FOREACH – Iterate over each record
3. FILTER – Select records based on a condition
4. GROUP – Group records by a field
5. JOIN – Join two datasets
6. ORDER / SORT – Sort records
7. DISTINCT – Remove duplicates
8. UNION – Combine two datasets
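A short script combining several of these operators; the file paths and schemas are assumptions for illustration:

```pig
orders    = LOAD '/data/orders.csv'    USING PigStorage(',') AS (id:int, cust:chararray, amount:double);
customers = LOAD '/data/customers.csv' USING PigStorage(',') AS (cust:chararray, city:chararray);

big     = FILTER orders BY amount > 100.0;                -- FILTER
joined  = JOIN big BY cust, customers BY cust;            -- JOIN
grouped = GROUP joined BY customers::city;                -- GROUP
totals  = FOREACH grouped GENERATE group AS city,         -- FOREACH
              COUNT(joined) AS order_count;
sorted  = ORDER totals BY order_count DESC;               -- ORDER
STORE sorted INTO '/output/orders_by_city';               -- STORE
```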
Real-Life Applications of Pig
| Task | Example |
|---|---|
| ETL | Cleaning and transforming raw log files |
| Data Analytics | Counting clicks on a website |
| Data Preparation | Filtering customer data for analysis |
| Integration | Works with Hive, HBase, and Spark |
Exam-Ready Short Notes
- Pig – High-level data processing framework on Hadoop
- Pig Latin – Scripting language for data transformations
- Grunt – Interactive Pig shell
- UDF – Custom functions for complex tasks
- Operators – LOAD, STORE, FILTER, JOIN, GROUP, FOREACH, ORDER
Hive
Apache Hive is a data warehouse software built on top of Hadoop that allows SQL-like querying of large datasets stored in HDFS.
- Uses HiveQL (SQL-like language)
- Converts queries into MapReduce, Tez, or Spark jobs
- Best for batch data analysis and reporting
Real-Life Example: An e-commerce company analyzing sales transactions or customer behavior across millions of records.
Hive Architecture
Hive architecture consists of the following key components:
Key Components
| Component | Role |
|---|---|
| Hive Shell / CLI | User interface to run HiveQL |
| Driver / Compiler | Compiles HiveQL into execution plan |
| Execution Engine | Runs jobs on Hadoop (MapReduce/Tez/Spark) |
| Metastore | Stores metadata about tables, partitions, schema |
| Storage Layer | HDFS or HBase stores the actual data |
Hive Installation
Requirements
- Hadoop installed
- Java JDK
- Hive binary package
Installation Steps
- Download Hive from the Apache site
- Extract the Hive package
- Set environment variables (`HIVE_HOME`)
- Configure `hive-site.xml`
- Connect to the Hive Metastore (embedded or remote DB)
- Start the Hive shell: `hive`
Hive Shell & Services
Hive Shell
- Interactive CLI for executing HiveQL queries
- Similar to SQL console
Example:
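For instance (the `students` table is an assumption):

```sql
hive> SHOW TABLES;
hive> SELECT * FROM students LIMIT 5;
```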
Hive Services
| Service | Purpose |
|---|---|
| CLI | Command-line interface |
| JDBC / ODBC | Connect external applications |
| Web UI | Monitor Hive jobs |
| Thrift Server | Execute queries remotely |
| Metastore | Stores table metadata |
Hive Metastore
- Stores metadata (tables, columns, partitions, data types)
- Can use embedded Derby DB or MySQL/PostgreSQL for production
- Essential for query compilation and optimization
Hive vs Traditional Databases
| Feature | Hive | RDBMS |
|---|---|---|
| Schema | Schema-on-read | Schema-on-write |
| Query Language | HiveQL (SQL-like) | SQL |
| Transactions | Limited | Full ACID support |
| Storage | HDFS | Local disk / Storage engine |
| Best for | Batch analytics | OLTP / Small datasets |
HiveQL – SQL-Like Language
HiveQL allows you to create tables, query data, and manipulate data:
Creating a Table
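A minimal sketch; the table and column names are assumptions:

```sql
CREATE TABLE students (
  id INT,
  name STRING,
  marks INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
```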
Loading Data
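Assuming a local CSV file matching the schema above:

```sql
LOAD DATA LOCAL INPATH '/tmp/students.csv' INTO TABLE students;
```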
Querying Data
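For example:

```sql
SELECT name, marks
FROM students
WHERE marks > 40;
```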
User Defined Functions (UDFs)
- Custom functions for special operations
- Written in Java or Python
- Can be used in HiveQL
Example: `SELECT my_custom_udf(name) FROM students;`
Sorting and Aggregating
- Sorting: `ORDER BY`, `SORT BY`
- Aggregation: `COUNT`, `SUM`, `AVG`, `MIN`, `MAX`
Example:
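A sketch combining both (the `class` column is an assumption):

```sql
SELECT class, AVG(marks) AS avg_marks, COUNT(*) AS num_students
FROM students
GROUP BY class
ORDER BY avg_marks DESC;
```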
Hive and MapReduce Scripts
- Hive converts queries to MapReduce jobs automatically
- Users don’t need to write MapReduce manually
- Can also embed custom MapReduce scripts in Hive queries
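As an illustration, Hive's `TRANSFORM` clause streams rows through an external script; here `clean_logs.py` and the `raw_logs` table are hypothetical:

```sql
ADD FILE clean_logs.py;

SELECT TRANSFORM (line)
USING 'python clean_logs.py'
AS (ip STRING, url STRING)
FROM raw_logs;
```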
Joins & Subqueries
Join Example
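A minimal sketch (the `orders` and `customers` tables are assumptions):

```sql
SELECT c.name, o.amount
FROM orders o
JOIN customers c
  ON o.customer_id = c.id;
```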
Subquery Example
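Using the same assumed tables:

```sql
SELECT name
FROM customers
WHERE id IN (
  SELECT customer_id
  FROM orders
  GROUP BY customer_id
  HAVING SUM(amount) > 1000
);
```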
Real-Life Applications of Hive
| Task | Example |
|---|---|
| Data Analysis | Sales reports, clickstream analysis |
| Data Warehousing | Aggregating historical data for BI |
| ETL | Transform raw logs into structured format |
| Integration | Works with Spark, HBase, Pig |
Exam-Ready Short Notes
- Hive – SQL-like data warehouse on Hadoop
- HiveQL – Language to query data in Hive
- Metastore – Stores metadata of tables/partitions
- Tables – Managed (Hive-controlled) or External (HDFS-controlled)
- Operators – Sorting (`ORDER BY`), Aggregation (`COUNT`, `SUM`)
- Joins & Subqueries – Combine tables and nested queries
HBase
Apache HBase is a NoSQL column-oriented database built on Hadoop for real-time random read/write access to very large datasets.
- Modeled after Google Bigtable
- Works on HDFS for storage
- Best for sparse, large-scale data
Real-Life Example:
- Storing user profiles for social media applications
- Online transaction systems with fast read/write
HBase Concepts
| Concept | Description |
|---|---|
| Table | Collection of column families |
| Row | Unique row key identifies a record |
| Column Family | Group of related columns |
| Column | Stores data (cells) |
| Cell | Intersection of row and column; stores value and timestamp |
HBase Clients
HBase provides different clients to interact with data:
| Client | Purpose |
|---|---|
| Java API | Build applications in Java |
| REST API | Access HBase via HTTP |
| Thrift API | Cross-language support |
| Shell | Interactive CLI for commands |
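As a sketch of the Java API route, a minimal put/get round trip; the `users` table and `profile` column family are assumptions:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseClientExample {
    public static void main(String[] args) throws Exception {
        // Reads hbase-site.xml from the classpath for cluster settings
        Configuration conf = HBaseConfiguration.create();
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("users"))) {

            // Write one cell: row "user1", family "profile", qualifier "name"
            Put put = new Put(Bytes.toBytes("user1"));
            put.addColumn(Bytes.toBytes("profile"), Bytes.toBytes("name"),
                          Bytes.toBytes("Alice"));
            table.put(put);

            // Read the same cell back
            Get get = new Get(Bytes.toBytes("user1"));
            Result result = table.get(get);
            byte[] value = result.getValue(Bytes.toBytes("profile"),
                                           Bytes.toBytes("name"));
            System.out.println(Bytes.toString(value));
        }
    }
}
```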
HBase Example (CLI)
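A representative shell session; the table and column names are assumptions:

```
create 'users', 'profile', 'activity'
put 'users', 'u100', 'profile:name', 'Asha'
put 'users', 'u100', 'activity:last_login', '2024-01-15'
get 'users', 'u100'
scan 'users'
disable 'users'
drop 'users'
```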
HBase vs RDBMS
| Feature | HBase | RDBMS |
|---|---|---|
| Data Model | Column-oriented | Row-oriented |
| Schema | Flexible | Fixed |
| Transactions | Limited (No full ACID) | Full ACID |
| Storage | HDFS | Local disk/DB engine |
| Query | APIs / MapReduce | SQL |
| Use Case | Real-time large data | OLTP / structured data |
Advanced HBase Usage
1. Schema Design
- Column families group related data
- Minimize column families for efficiency
- Use row keys carefully for query performance
2. Advanced Indexing
- Secondary indexes for faster queries (not built in; typically added via coprocessors or Apache Phoenix)
- Coprocessors for custom server-side processing
3. Integration
- Works with Hive, Pig, and Spark for analytics
Zookeeper
- Centralized service to coordinate distributed applications
- Provides configuration management, synchronization, and naming services
How Zookeeper Helps HBase
- Monitors HBase cluster
- Manages Master and RegionServer state
- Ensures failover and high availability
Building Applications with Zookeeper
- Applications use Zookeeper for coordination
- Common tasks (see the sketch after this list):
- Leader election
- Configuration updates
- Distributed locks
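For instance, the classic leader-election recipe uses ephemeral sequential znodes. A minimal Java sketch, assuming a local ensemble at `localhost:2181` and a hypothetical `/election` parent node:

```java
import java.util.Collections;
import java.util.List;
import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class LeaderElection {
    public static void main(String[] args) throws Exception {
        // Connect to the ensemble (production code would wait for the
        // SyncConnected event before issuing requests)
        ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, event -> {});

        // Make sure the election parent node exists
        if (zk.exists("/election", false) == null) {
            zk.create("/election", new byte[0],
                      ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
        }

        // Each candidate registers an ephemeral, sequential znode
        String myPath = zk.create("/election/candidate-", new byte[0],
                                  ZooDefs.Ids.OPEN_ACL_UNSAFE,
                                  CreateMode.EPHEMERAL_SEQUENTIAL);

        // The candidate whose znode has the lowest sequence number leads
        List<String> children = zk.getChildren("/election", false);
        Collections.sort(children);
        boolean isLeader = myPath.endsWith(children.get(0));
        System.out.println(isLeader ? "I am the leader" : "I am a follower");
    }
}
```

Because the znode is ephemeral, Zookeeper deletes it automatically if the process dies, and the remaining candidates can re-elect a leader.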
Real-Life Example: Spark or HBase cluster uses Zookeeper to monitor nodes and handle failures automatically
IBM Big Data Strategy
IBM provides enterprise-level Big Data solutions to manage and analyze large datasets efficiently.
Key Tools
| Tool | Description |
|---|---|
| InfoSphere | Data integration, governance, and warehousing |
| BigInsights | Enterprise Hadoop platform for analytics |
| BigSheets | Excel-like interface on BigInsights for non-programmers |
| Big SQL | SQL engine to query Hadoop data using standard SQL |
IBM Big Data Ecosystem Use Case
- Retail: Analyze customer transactions in real-time using Big SQL and BigInsights
- Banking: Fraud detection using HBase and Spark on BigInsights
Exam-Ready Short Notes
- HBase – Column-oriented NoSQL database on Hadoop
- Zookeeper – Coordinates distributed applications, monitors cluster
- Advanced HBase – Schema design, secondary indexing, coprocessors
- IBM Big Data Tools – InfoSphere, BigInsights, BigSheets, Big SQL
- HBase vs RDBMS – HBase: real-time, flexible, Hadoop-backed; RDBMS: structured, transactional
Conclusion
HBase provides real-time access to large-scale distributed data, while Zookeeper ensures cluster reliability, coordination, and failover. IBM’s Big Data tools, including InfoSphere, BigInsights, BigSheets, and Big SQL, offer enterprise-grade data analytics, integration, and SQL-on-Hadoop capabilities. Together, these technologies provide a robust framework for modern Big Data management and analytics.