Hadoop and Map-Reduce



HADOOP 

History of Hadoop

Hadoop was developed to handle huge volumes of data that traditional systems could not process.

Timeline

  • 2003 – Google published the Google File System (GFS) paper
  • 2004 – Google introduced MapReduce
  • 2006 – Doug Cutting created Hadoop
  • 2008 – Hadoop became a top-level Apache project
  • 2010+ – Widely adopted across industries

Real-Life Example: Google needed to index billions of web pages; GFS and MapReduce were its answers, and they became the blueprints for Hadoop.

Apache Hadoop

Apache Hadoop is an open-source framework used to store and process Big Data across multiple computers.

Key Features

  • Distributed storage
  • Parallel processing
  • Fault tolerance
  • Scalable and cost-effective

Real-Life Example: Facebook uses Hadoop to analyze user activity.

Hadoop Distributed File System (HDFS)

HDFS is the storage layer of Hadoop.

Key Characteristics

  • Stores data in blocks
  • Data is replicated
  • Works on commodity hardware

HDFS Architecture

  • NameNode – Master; manages the file-system metadata
  • DataNode – Slave; stores the actual data blocks
  • Secondary NameNode – Periodically checkpoints the NameNode's metadata (not a live backup)

Example: A large video file is divided into blocks (128 MB by default in recent Hadoop versions), and the blocks are replicated and stored on multiple machines.
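
To make this concrete, here is a minimal sketch using Hadoop's Java FileSystem API to write a file with an explicit replication factor and block size. The path, replication factor, and block size are illustrative choices for this sketch, not fixed requirements.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);      // handle to the configured file system

        Path file = new Path("/data/videos/movie.bin");  // hypothetical HDFS path
        // create(path, overwrite, bufferSize, replication, blockSize):
        // 3 replicas and 128 MB blocks; HDFS splits the stream into blocks as it is written
        try (FSDataOutputStream out =
                 fs.create(file, true, 4096, (short) 3, 128L * 1024 * 1024)) {
            out.writeUTF("sample content");
        }
    }
}
```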

Components of Hadoop

Core Components

  • HDFS – Data storage
  • YARN – Resource management
  • MapReduce – Data processing
  • Hadoop Common – Libraries & utilities

Real-Life Example: Hadoop works like a factory:

  • HDFS = Warehouse
  • MapReduce = Workers
  • YARN = Manager

Data Format in Hadoop

Hadoop can process multiple data formats.

  • Text – Log files
  • CSV – Student records
  • JSON – Web data
  • XML – Config files
  • Sequence Files – Binary data
  • Avro – Row-based data
  • Parquet – Column-based data

Example: Website click logs stored as JSON files.

Analyzing Data with Hadoop

Hadoop uses MapReduce to analyze data.

MapReduce Phases

  • Map – Converts data into key-value pairs
  • Shuffle – Groups similar keys
  • Reduce – Aggregates results

Real-Life Example: Counting word frequency in documents:

  • Map → Count words
  • Reduce → Sum counts
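
As a concrete illustration, here is a minimal word-count Mapper and Reducer in Java using the org.apache.hadoop.mapreduce API. The class names (TokenizerMapper, IntSumReducer) are our own choices for this sketch; in practice each class would sit in its own source file.

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map phase: emit (word, 1) for every word in a line of input.
public class TokenizerMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        StringTokenizer it = new StringTokenizer(line.toString());
        while (it.hasMoreTokens()) {
            word.set(it.nextToken());
            context.write(word, ONE);  // e.g. ("hadoop", 1)
        }
    }
}

// Reduce phase: after the shuffle, all counts for one word arrive
// together; summing them gives the word's total frequency.
class IntSumReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
            throws IOException, InterruptedException {
        int sum = 0;
        for (IntWritable c : counts) {
            sum += c.get();
        }
        context.write(word, new IntWritable(sum));  // e.g. ("hadoop", 42)
    }
}
```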

Scaling Out in Hadoop

Scaling out means adding more machines to the cluster instead of upgrading a single machine to a bigger one (scaling up).

Why Scaling Out?

  • Cost-effective
  • Better performance
  • High availability

Example: If data increases, add 10 more nodes instead of upgrading one server.

Hadoop Streaming

Hadoop Streaming allows users to write MapReduce programs in any language that can read from standard input and write to standard output.

Commonly Used Languages

  • Python
  • Perl
  • Shell scripts

Example: A Python script processes log files using Hadoop Streaming.

Hadoop Pipes

Hadoop Pipes is the C++ interface to Hadoop; it allows writing MapReduce programs in C++.

Key Features

  • High performance
  • Suitable for compute-intensive tasks

Example: Scientific data processing using C++ with Hadoop Pipes.

Hadoop Ecosystem

The Hadoop Ecosystem is a collection of tools that work with Hadoop.

Major Hadoop Ecosystem Components

  • Hive – SQL-like queries
  • Pig – Data-flow scripting
  • HBase – NoSQL database
  • Sqoop – Data transfer between RDBMS and Hadoop
  • Flume – Log collection
  • Kafka – Streaming data
  • Oozie – Workflow scheduling
  • ZooKeeper – Coordination

Real-Life Example: E-commerce company:

  • Flume → collect logs
  • HDFS → store data
  • Hive → analyze data
  • HBase → real-time access

Hadoop Advantages

  • Cost-effective – Uses low-cost commodity hardware
  • Scalable – Easy to expand by adding nodes
  • Fault tolerant – Data replication
  • Flexible – Handles all data types

Hadoop Limitations

  • Not real-time – Designed for batch processing
  • Complex setup – Requires expertise
  • High latency – Slow for small tasks

Exam-Ready Short Definitions

  • Hadoop – Open-source Big Data framework.
  • HDFS – Distributed storage system.
  • MapReduce – Parallel processing model.
  • YARN – Resource manager.
  • Hadoop Ecosystem – Supporting tools of Hadoop.

Conclusion (Exam Style)

Apache Hadoop is a powerful framework for storing and processing large volumes of data efficiently. Its distributed nature, fault tolerance, and scalability make it suitable for Big Data applications across various industries.

MAP-REDUCE

Map-Reduce Framework and Basics

Map-Reduce is a programming model used in Hadoop to process large datasets in parallel across multiple machines.

Basic Idea

  • Break big data into small parts
  • Process them in parallel
  • Combine the results

Two Main Functions

  • Map – Processes input data and creates key-value pairs
  • Reduce – Combines values for the same key

Example: Counting words in 100,000 (1 lakh) documents:

  • Map → Counts words in each document
  • Reduce → Adds all counts together

How Map-Reduce Works (Step by Step)

Working Flow

  • Input data stored in HDFS
  • Data split into blocks
  • Mapper processes each split
  • Shuffle & Sort groups similar keys
  • Reducer aggregates data
  • Final output stored in HDFS

Real-Life Example: Exam result processing:

  • Map → emits (student, marks) pairs from each result record
  • Reduce → computes the total and average marks

Developing a Map-Reduce Application

Main Steps

  • Write Mapper class
  • Write Reducer class
  • Configure Driver class
  • Specify input and output paths
  • Run the job

Languages Used

  • Java (most common)
  • Python (using Hadoop Streaming)

Example: Log analysis that counts the number of website visitors.
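
A minimal Driver sketch in Java, wiring together the TokenizerMapper and IntSumReducer from the earlier word-count sketch; the class name and the use of command-line arguments for the input and output paths are assumptions of this example.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);   // locate the job jar on the cluster
        job.setMapperClass(TokenizerMapper.class);  // Mapper from the earlier sketch
        job.setReducerClass(IntSumReducer.class);   // Reducer from the earlier sketch
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // input path in HDFS
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
        System.exit(job.waitForCompletion(true) ? 0 : 1);        // submit and wait
    }
}
```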

Unit Tests with MRUnit

MRUnit is a JUnit-based Java testing framework used to test Map-Reduce programs.

Why MRUnit?

  • Tests Mapper and Reducer separately
  • Finds errors early
  • Saves execution time

Example: Testing if Mapper correctly outputs (word, 1) pairs.
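
A sketch of such a test, assuming the TokenizerMapper from the earlier word-count sketch and MRUnit's new-API MapDriver; expected outputs are listed in the order the Mapper emits them.

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class TokenizerMapperTest {
    @Test
    public void mapperEmitsWordOnePairs() throws Exception {
        // Feed one input record to the Mapper and assert its exact output,
        // with no cluster and no file I/O involved.
        MapDriver<LongWritable, Text, Text, IntWritable> driver =
                MapDriver.newMapDriver(new TokenizerMapper());
        driver.withInput(new LongWritable(0), new Text("hadoop map reduce hadoop"))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .withOutput(new Text("map"), new IntWritable(1))
              .withOutput(new Text("reduce"), new IntWritable(1))
              .withOutput(new Text("hadoop"), new IntWritable(1))
              .runTest();
    }
}
```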

Test Data and Local Tests

Test Data: A small sample of data used for testing before running on the full cluster.

Local Testing

  • Runs Map-Reduce on local system
  • No Hadoop cluster required

Example: Testing Map-Reduce using 5 text files before processing 5 TB data.
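
One way to set this up (a sketch; the property names are the standard Hadoop 2+ ones) is to point the job at the in-process LocalJobRunner and the local file system:

```java
import org.apache.hadoop.conf.Configuration;

public class LocalTestConfig {
    // Build a Configuration that runs MapReduce in-process on local files,
    // so no cluster or HDFS daemons are needed while testing.
    public static Configuration local() {
        Configuration conf = new Configuration();
        conf.set("mapreduce.framework.name", "local");  // use the LocalJobRunner
        conf.set("fs.defaultFS", "file:///");           // local file system instead of HDFS
        return conf;
    }
}
```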

Anatomy of a Map-Reduce Job Run

A Map-Reduce job consists of multiple components working together.

Main Components

  • Job Client – Submits the job
  • ResourceManager – Allocates cluster resources
  • ApplicationMaster – Manages the job's tasks
  • Mapper – Processes data
  • Reducer – Aggregates data

Failures in Map-Reduce

Failures are common in distributed systems.

Types of Failures

  • Node failure
  • Network failure
  • Task failure

How Hadoop Handles Failures

  • Task re-execution
  • Data replication
  • Automatic recovery

Example: If a DataNode fails mid-job, Hadoop re-runs the affected tasks on another node that holds a replica of the data.
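
Task re-execution is automatic, and the number of attempts is configurable. A minimal sketch, assuming the standard mapreduce.map.maxattempts / mapreduce.reduce.maxattempts properties (default 4):

```java
import org.apache.hadoop.conf.Configuration;

public class RetryConfig {
    // Hadoop retries a failed task on another node automatically; these
    // properties cap how many attempts are made before the job fails.
    public static Configuration withRetries(int attempts) {
        Configuration conf = new Configuration();
        conf.setInt("mapreduce.map.maxattempts", attempts);     // default is 4
        conf.setInt("mapreduce.reduce.maxattempts", attempts);  // default is 4
        return conf;
    }
}
```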

Job Scheduling

Job scheduling decides which job runs first.

Scheduling Types

  • FIFO – First job submitted runs first
  • Fair Scheduler – Equal resource sharing
  • Capacity Scheduler – Fixed capacity per user/queue
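
The scheduler is a cluster-wide choice that administrators normally make in yarn-site.xml; purely as an illustration, the same property expressed through the Java Configuration API:

```java
import org.apache.hadoop.conf.Configuration;

public class SchedulerConfig {
    // Select the Fair Scheduler; in practice this property lives in
    // yarn-site.xml on the ResourceManager, not in application code.
    public static Configuration fairScheduler() {
        Configuration conf = new Configuration();
        conf.set("yarn.resourcemanager.scheduler.class",
                 "org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler");
        return conf;
    }
}
```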

Shuffle and Sort

  • Shuffle: Transfers intermediate data from Mapper to Reducer.
  • Sort: Groups data by key before reduction.

Example: All (“Hadoop”, 1) pairs go to the same Reducer.
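
The routing of keys to Reducers is done by a Partitioner. The sketch below reproduces the rule of Hadoop's default HashPartitioner, which is why every pair with the same key meets at the same Reducer:

```java
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Same rule as Hadoop's default HashPartitioner: a key's hash picks its
// reducer, so all ("Hadoop", 1) pairs land on the same Reducer.
public class WordPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        // Mask off the sign bit, then take the hash modulo the reducer count
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```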

Task Execution

Execution Steps

  • Map tasks run first
  • Reduce tasks start after shuffle
  • Tasks executed in parallel

Example: 100 Map tasks + 10 Reduce tasks for faster processing.

Map-Reduce Types

  • Map-only job – No Reducer required
  • Map-Reduce job – Uses both phases
  • Reduce-only job – Rare; still needs at least an identity Mapper

Example: Data filtering → Map-only job
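
In the Java API, a job becomes map-only simply by setting the number of reduce tasks to zero; a minimal sketch:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class MapOnlyJob {
    public static Job create() throws Exception {
        Job job = Job.getInstance(new Configuration(), "filter");
        // Zero reducers: the shuffle/sort phase is skipped entirely and
        // mapper output is written straight to the job's output path.
        job.setNumReduceTasks(0);
        return job;
    }
}
```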

Input Formats

InputFormat decides how data is read.

  • TextInputFormat – Plain text files (the default)
  • KeyValueTextInputFormat – Lines of tab-separated key-value pairs
  • SequenceFileInputFormat – Binary sequence files
  • NLineInputFormat – Fixed number of lines per mapper

Output Formats

OutputFormat decides how output is stored.

  • TextOutputFormat – Plain text (the default)
  • SequenceFileOutputFormat – Binary sequence files
  • MapFileOutputFormat – Indexed output
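
Both formats are chosen in the Driver. A sketch, assuming a Job built as in the earlier driver example, that pairs NLineInputFormat (100 lines per mapper is an arbitrary example value) with SequenceFileOutputFormat:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class FormatConfig {
    public static void configure(Job job) {
        // Each mapper receives exactly 100 lines of the input file
        job.setInputFormatClass(NLineInputFormat.class);
        NLineInputFormat.setNumLinesPerSplit(job, 100);

        // Store the output as a binary, splittable sequence file
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
    }
}
```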

Map-Reduce Features

  • Scalability – Handles petabytes of data
  • Fault tolerance – Automatic recovery
  • Parallelism – Faster processing
  • Data locality – Moves computation to the data
  • Reliability – Consistent results

Real-World Map-Reduce Applications

  • Search Engines – Indexing web pages
  • E-commerce – Customer behavior analysis
  • Banking – Fraud detection
  • Social Media – Trend analysis
  • Telecom – Call data analysis

Real-Life Example: Amazon uses Map-Reduce to analyze customer purchase history.

Short Exam-Ready Definitions

  • Map-Reduce – Distributed data processing model.
  • Mapper – Converts input into key-value pairs.
  • Reducer – Aggregates results.
  • Shuffle – Transfers intermediate data.
  • InputFormat – Defines input structure.