Hadoop and Map-Reduce
HADOOP
History of Hadoop
Hadoop was developed to handle huge volumes of data that traditional systems could not process.
Timeline
| Year | Development |
|---|---|
| 2003 | Google published the Google File System (GFS) paper |
| 2004 | Google published the MapReduce paper |
| 2006 | Doug Cutting created Hadoop, spinning it out of the Nutch project |
| 2008 | Hadoop became a top-level Apache project |
| 2010+ | Widely used in industries |
Real-Life Example: Google needed to index billions of web pages; the systems it built for this inspired Hadoop.
Apache Hadoop
Apache Hadoop is an open-source framework used to store and process Big Data across multiple computers.
Key Features
- Distributed storage
- Parallel processing
- Fault tolerance
- Scalable and cost-effective
Real-Life Example: Facebook uses Hadoop to analyze user activity.
Hadoop Distributed File System (HDFS)
HDFS is the storage layer of Hadoop.
Key Characteristics
- Stores data in blocks
- Data is replicated
- Works on commodity hardware
HDFS Architecture
| Component | Function |
|---|---|
| NameNode | Master; manages metadata |
| DataNode | Slave; stores actual data |
| Secondary NameNode | Performs periodic metadata checkpoints (not a hot backup) |
Example: A large video file is divided into blocks and stored on multiple machines.
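As a rough worked example (the 128 MB block size and replication factor of 3 are common HDFS defaults, assumed here rather than stated in this section), the storage footprint of a file can be estimated like this:

```python
import math

# Common HDFS defaults; real values come from cluster configuration.
BLOCK_SIZE_MB = 128
REPLICATION = 3

def hdfs_footprint(file_size_mb: int):
    """Return (number of blocks, total replicated storage in MB)."""
    blocks = math.ceil(file_size_mb / BLOCK_SIZE_MB)
    return blocks, file_size_mb * REPLICATION

blocks, storage_mb = hdfs_footprint(1024)  # a 1 GB video file
print(blocks, storage_mb)  # 8 blocks, 3072 MB across the cluster
```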
Components of Hadoop
Core Components
| Component | Purpose |
|---|---|
| HDFS | Data storage |
| YARN | Resource management |
| MapReduce | Data processing |
| Hadoop Common | Libraries & utilities |
Real-Life Example: Hadoop works like a factory:
- HDFS = Warehouse
- MapReduce = Workers
- YARN = Manager
Data Format in Hadoop
Hadoop can process multiple data formats.
| Format | Example |
|---|---|
| Text | Log files |
| CSV | Student records |
| JSON | Web data |
| XML | Config files |
| Sequence Files | Binary data |
| Avro | Row-based data |
| Parquet | Column-based data |
Example: Website click logs stored as JSON files.
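For instance, one click event in such a log might look like the record below (the field names are invented for illustration):

```python
import json

# A hypothetical click-log record; the fields are illustrative only.
raw = '{"user_id": 42, "page": "/products/17", "time": "2024-01-15T10:32:00Z"}'

event = json.loads(raw)
print(event["page"])  # /products/17
```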
Analyzing Data with Hadoop
Hadoop uses MapReduce to analyze data.
MapReduce Phases
- Map – Converts input data into key-value pairs
- Shuffle – Groups intermediate pairs by key
- Reduce – Aggregates the values for each key
Real-Life Example: Counting word frequency in documents (see the sketch below):
- Map → Emit (word, 1) for each word
- Reduce → Sum the counts for each word
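A minimal pure-Python simulation of the three phases (this mimics the model only; it is not the Hadoop API):

```python
from collections import defaultdict

documents = ["hadoop stores data", "hadoop processes data"]

# Map: emit a (word, 1) pair for every word.
mapped = [(word, 1) for doc in documents for word in doc.split()]

# Shuffle: group all values that share a key.
groups = defaultdict(list)
for key, value in mapped:
    groups[key].append(value)

# Reduce: aggregate the values for each key.
counts = {word: sum(values) for word, values in groups.items()}
print(counts)  # {'hadoop': 2, 'stores': 1, 'data': 2, 'processes': 1}
```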
Scaling Out in Hadoop
Scaling out means adding more machines to the cluster rather than upgrading a single machine to bigger hardware (scaling up).
Why Scaling Out?
- Cost-effective
- Better performance
- High availability
Example: If data increases, add 10 more nodes instead of upgrading one server.
Hadoop Streaming
Hadoop Streaming allows users to write MapReduce programs in any language that can read from standard input and write to standard output.
Supported Languages
- Python
- Perl
- Shell scripts
Example: A Python script processes log files using Hadoop Streaming.
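A minimal word-count pair of Streaming scripts might look like the sketch below. Streaming delivers the reducer's input sorted by key, so the reducer only has to watch for key changes; the submission command in the final comment is an assumption, since the jar path varies by installation.

```python
#!/usr/bin/env python3
# mapper.py -- reads raw text on stdin, emits "word<TAB>1" lines.
import sys

for line in sys.stdin:
    for word in line.split():
        print(f"{word}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- input arrives sorted by key, so a running total works.
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word != current_word:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, 0
    count += int(value)
if current_word is not None:
    print(f"{current_word}\t{count}")

# Submitted roughly as (jar path is installation-specific):
# hadoop jar hadoop-streaming.jar -input /logs -output /out \
#   -mapper mapper.py -reducer reducer.py
```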
Hadoop Pipes
Hadoop Pipes is the C++ interface for writing MapReduce programs.
Key Features
- High performance
- Suitable for compute-intensive tasks
Example: Scientific data processing using C++ with Hadoop Pipes.
Hadoop Ecosystem
The Hadoop Ecosystem is a collection of tools that work with Hadoop.
Major Hadoop Ecosystem Components
| Tool | Purpose |
|---|---|
| Hive | SQL-like queries |
| Pig | Data flow scripting |
| HBase | NoSQL database |
| Sqoop | Transfers data between RDBMS and Hadoop |
| Flume | Log collection |
| Kafka | Streaming data |
| Oozie | Workflow scheduling |
| Zookeeper | Coordination |
Real-Life Example: E-commerce company:
- Flume → collect logs
- HDFS → store data
- Hive → analyze data
- HBase → real-time access
Hadoop Advantages
| Advantage | Description |
|---|---|
| Cost-effective | Uses low-cost hardware |
| Scalable | Easy to expand |
| Fault tolerant | Data replication |
| Flexible | Handles all data types |
Hadoop Limitations
| Limitation | Explanation |
|---|---|
| Not real-time | Batch processing |
| Complex setup | Requires expertise |
| High latency | Slow for small tasks |
Exam-Ready Short Definitions
- Hadoop – Open-source Big Data framework.
- HDFS – Distributed storage system.
- MapReduce – Parallel processing model.
- YARN – Resource manager.
- Hadoop Ecosystem – Supporting tools of Hadoop.
Conclusion (Exam Style)
Apache Hadoop is a powerful framework for storing and processing large volumes of data efficiently. Its distributed nature, fault tolerance, and scalability make it suitable for Big Data applications across various industries.
MAP-REDUCE
Map-Reduce Framework and Basics
Map-Reduce is a programming model used in Hadoop to process large datasets in parallel across multiple machines.
Basic Idea
- Break big data into small parts
- Process them in parallel
- Combine the results
Two Main Functions
- Map – Processes input data and creates key-value pairs
- Reduce – Combines values for the same key
Example: Counting words in 1 lakh (100,000) documents:
- Map → Counts words in each document
- Reduce → Adds all counts together
How Map-Reduce Works (Step by Step)
Working Flow
- Input data stored in HDFS
- Data split into blocks
- Mapper processes each split
- Shuffle & Sort groups similar keys
- Reducer aggregates data
- Final output stored in HDFS
Real-Life Example: Exam result processing (see the sketch below):
- Map → Emit (student, marks) pairs
- Reduce → Compute total and average marks per student
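A toy simulation of this flow (student names and marks are invented):

```python
from collections import defaultdict

# Map phase output: each record becomes a (student, marks) pair.
records = [("asha", 80), ("ravi", 70), ("asha", 90), ("ravi", 60)]

# Shuffle & Sort: group all marks belonging to the same student.
groups = defaultdict(list)
for student, marks in records:
    groups[student].append(marks)

# Reduce: compute total and average marks per student.
for student, marks in sorted(groups.items()):
    print(student, sum(marks), sum(marks) / len(marks))
    # asha 170 85.0 / ravi 130 65.0
```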
Developing a Map-Reduce Application
Main Steps
- Write Mapper class
- Write Reducer class
- Configure Driver class
- Specify input and output paths
- Run the job
Languages Used
- Java (most common)
- Python (using Hadoop Streaming)
Example: Log analysis to count number of website visitors.
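For the log-analysis example, the Mapper and Reducer can be sketched as plain Python functions, with a hand-written local driver standing in for the Driver class (the log format, with the visitor IP as the first field, is an assumption):

```python
from collections import defaultdict

def mapper(log_line: str):
    """Emit (ip, 1) for each access-log line; IP-first format assumed."""
    ip = log_line.split()[0]
    yield ip, 1

def reducer(ip: str, counts):
    """Sum the visit counts for one IP."""
    yield ip, sum(counts)

# Driver: wire mapper -> shuffle -> reducer by hand.
logs = ["1.2.3.4 GET /home", "5.6.7.8 GET /cart", "1.2.3.4 GET /buy"]
groups = defaultdict(list)
for line in logs:
    for key, value in mapper(line):
        groups[key].append(value)
for ip, counts in groups.items():
    print(list(reducer(ip, counts)))  # [('1.2.3.4', 2)] [('5.6.7.8', 1)]
```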
Unit Tests with MRUnit
MRUnit is a Java testing framework used to test Map-Reduce programs.
Why MRUnit?
- Tests Mapper and Reducer separately
- Finds errors early
- Saves execution time
Example: Testing if Mapper correctly outputs (word, 1) pairs.
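MRUnit itself targets Java MapReduce classes; for the Python sketches in these notes, an analogous check with the standard unittest module (a substitute for MRUnit, not MRUnit itself) verifies the same property:

```python
import unittest

def mapper(line: str):
    """Word-count mapper under test: emit (word, 1) pairs."""
    for word in line.split():
        yield word, 1

class MapperTest(unittest.TestCase):
    def test_emits_word_one_pairs(self):
        self.assertEqual(list(mapper("big data big")),
                         [("big", 1), ("data", 1), ("big", 1)])

if __name__ == "__main__":
    unittest.main()
```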
Test Data and Local Tests
Test Data: A small sample of data used for testing before running on the full cluster.
Local Testing
- Runs Map-Reduce on local system
- No Hadoop cluster required
Example: Testing Map-Reduce on 5 small text files before processing 5 TB of data.
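For Streaming scripts, the classic local test is the shell pipeline `cat sample.txt | python3 mapper.py | sort | python3 reducer.py`. The sketch below reproduces that pipeline in-process, assuming the mapper.py and reducer.py from the Streaming section are in the current directory:

```python
import subprocess

# Emulate: cat sample.txt | mapper.py | sort | reducer.py
sample = "hadoop is fast\nhadoop is scalable\n"
mapped = subprocess.run(["python3", "mapper.py"], input=sample,
                        capture_output=True, text=True).stdout
shuffled = "".join(sorted(mapped.splitlines(keepends=True)))
reduced = subprocess.run(["python3", "reducer.py"], input=shuffled,
                         capture_output=True, text=True).stdout
print(reduced)  # fast 1, hadoop 2, is 2, scalable 1 (tab-separated)
```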
Anatomy of a Map-Reduce Job Run
A Map-Reduce job consists of multiple components working together.
Main Components
| Component | Role |
|---|---|
| Job Client | Submits job |
| Resource Manager | Allocates resources |
| Application Master | Manages job |
| Mapper | Processes data |
| Reducer | Aggregates data |
Failures in Map-Reduce
Failures are common in distributed systems.
Types of Failures
- Node failure
- Network failure
- Task failure
How Hadoop Handles Failures
- Task re-execution
- Data replication
- Automatic recovery
Example: If one DataNode fails, Hadoop reruns the affected tasks on another node holding a replica of the data.
Job Scheduling
Job scheduling decides which job runs first.
Scheduling Types
| Scheduler | Description |
|---|---|
| FIFO | First job runs first |
| Fair Scheduler | Equal resource sharing |
| Capacity Scheduler | Guaranteed capacity per queue |
Shuffle and Sort
- Shuffle: Transfers intermediate data from Mapper to Reducer.
- Sort: Orders intermediate data by key so each Reducer receives its values grouped together.
Example: All (“Hadoop”, 1) pairs go to the same Reducer.
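The framework's behaviour between Map and Reduce can be mimicked with a sort followed by itertools.groupby:

```python
from itertools import groupby
from operator import itemgetter

# Unordered intermediate output from several Mappers.
pairs = [("Hadoop", 1), ("Spark", 1), ("Hadoop", 1), ("Hive", 1), ("Hadoop", 1)]

pairs.sort(key=itemgetter(0))  # Sort: order the pairs by key
for key, group in groupby(pairs, key=itemgetter(0)):  # one group per key
    values = [v for _, v in group]  # everything one Reducer call receives
    print(key, values)  # Hadoop [1, 1, 1] / Hive [1] / Spark [1]
```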
Task Execution
Execution Steps
- Map tasks run first
- Reduce tasks start after shuffle
- Tasks executed in parallel
Example: 100 Map tasks + 10 Reduce tasks for faster processing.
Map-Reduce Types
| Type | Description |
|---|---|
| Map-only Job | No reducer required |
| Map-Reduce Job | Uses both |
| Reduce-only Job | Identity Mapper feeding Reducers; rarely used |
Example: Data filtering → Map-only job
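A map-only filter in Streaming style: the Mapper emits only the matching lines, and the job is run with zero reducers (for example via the Hadoop property mapreduce.job.reduces=0):

```python
#!/usr/bin/env python3
# filter_mapper.py -- map-only job: keep ERROR lines, no Reducer needed.
# Run with zero reducers, e.g. -D mapreduce.job.reduces=0 in Streaming.
import sys

for line in sys.stdin:
    if "ERROR" in line:
        sys.stdout.write(line)
```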
Input Formats
InputFormat decides how data is read.
| Input Format | Use Case |
|---|---|
| TextInputFormat | Text files |
| KeyValueTextInputFormat | Key-value text data |
| SequenceFileInputFormat | Binary data |
| NLineInputFormat | Fixed lines per mapper |
Output Formats
OutputFormat decides how output is stored.
| Output Format | Description |
|---|---|
| TextOutputFormat | Plain text |
| SequenceFileOutputFormat | Binary |
| MapFileOutputFormat | Indexed output |
Map-Reduce Features
| Feature | Explanation |
|---|---|
| Scalability | Handles petabytes of data |
| Fault tolerance | Automatic recovery |
| Parallelism | Faster processing |
| Data locality | Moves computation to data |
| Reliability | Consistent results |
Real-World Map-Reduce Applications
| Industry | Application |
|---|---|
| Search Engines | Indexing web pages |
| E-commerce | Customer behavior analysis |
| Banking | Fraud detection |
| Social Media | Trend analysis |
| Telecom | Call data analysis |
Real-Life Example: Amazon uses Map-Reduce to analyze customer purchase history.
Short Exam-Ready Definitions
- Map-Reduce – Distributed data processing model.
- Mapper – Converts input into key-value pairs.
- Reducer – Aggregates results.
- Shuffle – Transfers intermediate data.
- InputFormat – Defines input structure.