Hadoop and Map-Reduce
HADOOP
History of Hadoop
Hadoop was developed to handle huge volumes of data that traditional systems could not process.
Timeline
| Year | Development |
|---|---|
| 2003 | Google published the Google File System (GFS) paper |
| 2004 | Google published the MapReduce paper |
| 2006 | Doug Cutting created Hadoop, spinning it out of the Nutch project |
| 2008 | Hadoop became a top-level Apache project |
| 2010+ | Widely used in industries |
Real-Life Example: Google needed to index billions of web pages; the systems it built for this inspired Hadoop.
Apache Hadoop
Apache Hadoop is an open-source framework used to store and process Big Data across multiple computers.
Key Features
- Distributed storage
- Parallel processing
- Fault tolerance
- Scalable and cost-effective
Real-Life Example: Facebook uses Hadoop to analyze user activity.
Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop.
Key Characteristics
- Stores data in blocks
- Data is replicated
- Works on commodity hardware
HDFS Architecture
| Component | Function |
|---|---|
| NameNode | Master; manages metadata |
| DataNode | Slave; stores actual data |
| Secondary NameNode | Performs periodic metadata checkpoints (not a hot backup) |
Example: A large video file is divided into blocks and stored on multiple machines.
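As a rough worked example (the 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here rather than stated in this section), the storage footprint of a file can be estimated like this:

```python
import math

# Common HDFS defaults; real values come from cluster configuration.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb: int):
    """Return (number of blocks, total replicated storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

blocks, storage_mb = hdfs_footprint(1024)  # a 1 GB video file
print(blocks, storage_mb)  # 8 blocks, 3072 MB across the cluster
```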
Components of Hadoop
Core Components
| Component | Purpose |
|---|---|
| HDFS | Data storage |
| YARN | Resource management |
| MapReduce | Data processing |
| Hadoop Common | Libraries & utilities |
Real-Life Example: Hadoop works like a factory:
- HDFS = Warehouse
- MapReduce = Workers
- YARN = Manager
Data Format in Hadoop
Hadoop can process multiple data formats.
| Format | Example |
|---|---|
| Text | Log files |
| CSV | Student records |
| JSON | Web data |
| XML | Config files |
| Sequence Files | Binary data |
| Avro | Row-based data |
| Parquet | Column-based data |
Example: Website click logs stored as JSON files.
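For instance, one click event in such a log might look like the record below (the field names are invented for illustration):

```python
import json

# A hypothetical click-log record; the fields are illustrative only.
raw = '{"user_id": 42, "page": "/products/17", "time": "2024-01-15T10:32:00Z"}'

event = json.loads(raw)
print(event["page"])  # /products/17
```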
Analyzing Data with Hadoop
Hadoop uses MapReduce to analyze data.
MapReduce Phases
- Map – Converts input data into key-value pairs
- Shuffle – Groups intermediate pairs by key
- Reduce – Aggregates the values for each key
Real-Life Example: Counting word frequency in documents (see the sketch below):
- Map → Emit (word, 1) for each word
- Reduce → Sum the counts for each word
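A minimal pure-Python simulation of the three phases (this mimics the model only; it is not the Hadoop API):

```python
from collections import defaultdict

documents = ["hadoop stores data", "hadoop processes data"]

# Map: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share a key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```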
Scaling Out in Hadoop
Scaling out means adding more machines to the cluster rather than upgrading a single machine to bigger hardware (scaling up).
Why Scaling Out?
- Cost-effective
- Better performance
- High availability
Example: If data increases, add 10 more nodes instead of upgrading one server.
Hadoop Streaming
Hadoop Streaming allows users to write MapReduce programs in any language that can read from standard input and write to standard output.
Supported Languages
- Python
- Perl
- Shell scripts
Example: A Python script processes log files using Hadoop Streaming.
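A minimal word-count pair of Streaming scripts might look like the sketch below. Streaming delivers the reducer's input sorted by key, so the reducer only has to watch for key changes; the submission command in the final comment is an assumption, since the jar path varies by installation.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text on stdin, emits "word<TAB>1" lines.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so a running total works.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

# Submitted roughly as (jar path is installation-specific):
# hadoop jar hadoop-streaming.jar -input /logs -output /out \
#   -mapper mapper.py -reducer reducer.py
```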
Hadoop Pipes
Hadoop Pipes is the C++ interface for writing MapReduce programs.
Key Features
- High performance
- Suitable for compute-intensive tasks
Example: Scientific data processing using C++ with Hadoop Pipes.
Hadoop Ecosystem
The Hadoop Ecosystem is a collection of tools that work with Hadoop.
Major Hadoop Ecosystem Components
| Tool | Purpose |
|---|---|
| Hive | SQL-like queries |
| Pig | Data flow scripting |
| HBase | NoSQL database |
| Sqoop | Transfers data between RDBMS and Hadoop |
| Flume | Log collection |
| Kafka | Streaming data |
| Oozie | Workflow scheduling |
| Zookeeper | Coordination |
Real-Life Example: E-commerce company:
- Flume → collect logs
- HDFS → store data
- Hive → analyze data
- HBase → real-time access
Hadoop Advantages
| Advantage | Description |
|---|---|
| Cost-effective | Uses low-cost hardware |
| Scalable | Easy to expand |
| Fault tolerant | Data replication |
| Flexible | Handles all data types |
Hadoop Limitations
| Limitation | Explanation |
|---|---|
| Not real-time | Batch processing |
| Complex setup | Requires expertise |
| High latency | Slow for small tasks |
Exam-Ready Short Definitions
- Hadoop – Open-source Big Data framework.
- HDFS – Distributed storage system.
- MapReduce – Parallel processing model.
- YARN – Resource manager.
- Hadoop Ecosystem – Supporting tools of Hadoop.
Conclusion (Exam Style)
Apache Hadoop is a powerful framework for storing and processing large volumes of data efficiently. Its distributed nature, fault tolerance, and scalability make it suitable for Big Data applications across various industries.
MAP-REDUCE
Map-Reduce Framework and Basics
Map-Reduce is a programming model used in Hadoop to process large datasets in parallel across multiple machines.
Basic Idea
- Break big data into small parts
- Process them in parallel
- Combine the results
Two Main Functions
- Map – Processes input data and creates key-value pairs
- Reduce – Combines values for the same key
Example: Counting words in 1 lakh (100,000) documents:
- Map → Counts words in each document
- Reduce → Adds all counts together
How Map-Reduce Works (Step by Step)
Working Flow
- Input data stored in HDFS
- Data split into blocks
- Mapper processes each split
- Shuffle & Sort groups similar keys
- Reducer aggregates data
- Final output stored in HDFS
Real-Life Example: Exam result processing (see the sketch below):
- Map → Emit (student, marks) pairs
- Reduce → Compute total and average marks per student
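A toy simulation of this flow (student names and marks are invented):

```python
from collections import defaultdict

# Map phase output: each record becomes a (student, marks) pair.
records = [("asha", 80), ("ravi", 70), ("asha", 90), ("ravi", 60)]

# Shuffle & Sort: group all marks belonging to the same student.
groups = defaultdict(list)
for student, marks in records:
    groups[student].append(marks)

# Reduce: compute total and average marks per student.
for student, marks in sorted(groups.items()):
    print(student, sum(marks), sum(marks) / len(marks))
    # asha 170 85.0 / ravi 130 65.0
```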
Developing a Map-Reduce Application
Main Steps
- Write Mapper class
- Write Reducer class
- Configure Driver class
- Specify input and output paths
- Run the job
Languages Used
- Java (most common)
- Python (using Hadoop Streaming)
Example: Log analysis to count number of website visitors.
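For the log-analysis example, the Mapper and Reducer can be sketched as plain Python functions, with a hand-written local driver standing in for the Driver class (the log format, with the visitor IP as the first field, is an assumption):

```python
from collections import defaultdict

def mapper(log_line: str):
    """Emit (ip, 1) for each access-log line; IP-first format assumed."""
    ip = log_line.split()[0]
    yield ip, 1

def reducer(ip: str, counts):
    """Sum the visit counts for one IP."""
    yield ip, sum(counts)

# Driver: wire mapper -> shuffle -> reducer by hand.
logs = ["1.2.3.4 GET /home", "5.6.7.8 GET /cart", "1.2.3.4 GET /buy"]
groups = defaultdict(list)
for line in logs:
    for key, value in mapper(line):
        groups[key].append(value)
for ip, counts in groups.items():
    print(list(reducer(ip, counts)))  # [('1.2.3.4', 2)] [('5.6.7.8', 1)]
```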
Unit Tests with MRUnit
MRUnit is a Java testing framework used to test Map-Reduce programs.
Why MRUnit?
- Tests Mapper and Reducer separately
- Finds errors early
- Saves execution time
Example: Testing if Mapper correctly outputs (word, 1) pairs.
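MRUnit itself targets Java MapReduce classes; for the Python sketches in these notes, an analogous check with the standard unittest module (a substitute for MRUnit, not MRUnit itself) verifies the same property:

```python
import unittest

def mapper(line: str):
    """Word-count mapper under test: emit (word, 1) pairs."""
    for word in line.split():
        yield word, 1

class MapperTest(unittest.TestCase):
    def test_emits_word_one_pairs(self):
        self.assertEqual(list(mapper("big data big")),
                         [("big", 1), ("data", 1), ("big", 1)])

if __name__ == "__main__":
    unittest.main()
```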
Test Data and Local Tests
Test Data: A small sample of data used for testing before running on the full cluster.
Local Testing
- Runs Map-Reduce on local system
- No Hadoop cluster required
Example: Testing Map-Reduce on 5 small text files before processing 5 TB of data.
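For Streaming scripts, the classic local test is the shell pipeline `cat sample.txt | python3 mapper.py | sort | python3 reducer.py`. The sketch below reproduces that pipeline in-process, assuming the mapper.py and reducer.py from the Streaming section are in the current directory:

```python
import subprocess

# Emulate: cat sample.txt | mapper.py | sort | reducer.py
sample = "hadoop is fast\nhadoop is scalable\n"
mapped = subprocess.run(["python3", "mapper.py"], input=sample,
                        capture_output=True, text=True).stdout
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced)  # fast 1, hadoop 2, is 2, scalable 1 (tab-separated)
```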
Anatomy of a Map-Reduce Job Run
A Map-Reduce job consists of multiple components working together.
Main Components
| Component | Role |
|---|---|
| Job Client | Submits job |
| Resource Manager | Allocates resources |
| Application Master | Manages job |
| Mapper | Processes data |
| Reducer | Aggregates data |
Failures in Map-Reduce
Failures are common in distributed systems.
Types of Failures
- Node failure
- Network failure
- Task failure
How Hadoop Handles Failures
- Task re-execution
- Data replication
- Automatic recovery
Example: If one DataNode fails, Hadoop reruns the affected tasks on another node holding a replica of the data.
Job Scheduling
Job scheduling decides which job runs first.
Scheduling Types
| Scheduler | Description |
|---|---|
| FIFO | First job runs first |
| Fair Scheduler | Equal resource sharing |
| Capacity Scheduler | Guaranteed capacity per queue |
Shuffle and Sort
- Shuffle: Transfers intermediate data from Mapper to Reducer.
- Sort: Orders intermediate data by key so each Reducer receives its values grouped together.
Example: All (“Hadoop”, 1) pairs go to the same Reducer.
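The framework's behaviour between Map and Reduce can be mimicked with a sort followed by itertools.groupby:

```python
from itertools import groupby
from operator import itemgetter

# Unordered intermediate output from several Mappers.
pairs = [("Hadoop", 1), ("Spark", 1), ("Hadoop", 1), ("Hive", 1), ("Hadoop", 1)]

pairs.sort(key=itemgetter(0))  # Sort: order the pairs by key
for key, group in groupby(pairs, key=itemgetter(0)):  # one group per key
    values = [v for _, v in group]  # everything one Reducer call receives
    print(key, values)  # Hadoop [1, 1, 1] / Hive [1] / Spark [1]
```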
Task Execution
Execution Steps
- Map tasks run first
- Reduce tasks start after shuffle
- Tasks executed in parallel
Example: 100 Map tasks + 10 Reduce tasks for faster processing.
Map-Reduce Types
| Type | Description |
|---|---|
| Map-only Job | No reducer required |
| Map-Reduce Job | Uses both |
| Reduce-only Job | Identity Mapper feeding Reducers; rarely used |
Example: Data filtering → Map-only job
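A map-only filter in Streaming style: the Mapper emits only the matching lines, and the job is run with zero reducers (for example via the Hadoop property mapreduce.job.reduces=0):

```python
#!/usr/bin/env python3
# filter_mapper.py -- map-only job: keep ERROR lines, no Reducer needed.
# Run with zero reducers, e.g. -D mapreduce.job.reduces=0 in Streaming.
import sys

for line in sys.stdin:
    if "ERROR" in line:
        sys.stdout.write(line)
```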
Input Formats
InputFormat decides how data is read.
| Input Format | Use Case |
|---|---|
| TextInputFormat | Text files |
| KeyValueTextInputFormat | Key-value text data |
| SequenceFileInputFormat | Binary data |
| NLineInputFormat | Fixed lines per mapper |
Output Formats
OutputFormat decides how output is stored.
| Output Format | Description |
|---|---|
| TextOutputFormat | Plain text |
| SequenceFileOutputFormat | Binary |
| MapFileOutputFormat | Indexed output |
Map-Reduce Features
| Feature | Explanation |
|---|---|
| Scalability | Handles petabytes of data |
| Fault tolerance | Automatic recovery |
| Parallelism | Faster processing |
| Data locality | Moves computation to data |
| Reliability | Consistent results |
Real-World Map-Reduce Applications
| Industry | Application |
|---|---|
| Search Engines | Indexing web pages |
| E-commerce | Customer behavior analysis |
| Banking | Fraud detection |
| Social Media | Trend analysis |
| Telecom | Call data analysis |
Real-Life Example: Amazon uses Map-Reduce to analyze customer purchase history.
Short Exam-Ready Definitions
- Map-Reduce – Distributed data processing model.
- Mapper – Converts input into key-value pairs.
- Reducer – Aggregates results.
- Shuffle – Transfers intermediate data.
- InputFormat – Defines input structure.