HDFS – HADOOP DISTRIBUTED FILE SYSTEM
Introduction to HDFS
HDFS (Hadoop Distributed File System) is the storage system of Hadoop designed to store very large files across multiple machines in a reliable and fault-tolerant manner.
HDFS breaks big files into smaller blocks and stores them across many computers.
Real-Life Example: A 5 GB video is split and stored across many servers instead of one hard disk.
Design of HDFS
HDFS is designed with the following goals:
Key Design Principles
| Principle | Explanation |
|---|---|
| Scalability | Can store petabytes of data |
| Fault Tolerance | Works even if nodes fail |
| High Throughput | Fast data access |
| Commodity Hardware | Uses low-cost systems |
| Write Once, Read Many | Files are not frequently modified |
HDFS Concepts
Main Components
| Component | Role |
|---|---|
| NameNode | Master node; stores metadata |
| DataNode | Stores actual data blocks |
| Secondary NameNode | Performs periodic metadata checkpoints (not a live failover backup) |
| Client | Accesses HDFS |
Metadata Includes
- File name
- Block location
- Permissions
Benefits and Challenges of HDFS
Benefits
| Benefit | Description |
|---|---|
| Fault Tolerant | Data replication |
| High Scalability | Add nodes easily |
| Cost-Effective | Uses cheap hardware |
| Large File Support | Handles TB-PB files |
Challenges
| Challenge | Explanation |
|---|---|
| Small File Problem | Not efficient for small files |
| Not Real-Time | Batch processing |
| Single NameNode (older versions) | Single point of failure |
| High Latency | Slow for small reads |
File Sizes, Block Sizes and Block Abstraction
- File Sizes: HDFS handles very large files (GB to PB).
- Block Size: Default block size = 128 MB (Hadoop 2.x and later; 64 MB in Hadoop 1.x)
Block Abstraction
- Files are split into fixed-size blocks.
- Blocks stored independently.
Example: 300 MB file → 3 blocks (128 + 128 + 44 MB)
Data Replication in HDFS
Replication means storing multiple copies of data blocks.
Replication Factor
- Default replication factor = 3
Benefits
- Fault tolerance
- Data availability
Example: One block stored on 3 different DataNodes.
How HDFS Stores Files
Write Process
- Client requests NameNode
- NameNode provides DataNode list
- Client writes data to DataNodes
- Blocks are replicated
Example: Uploading a video to HDFS.
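A minimal client-side write sketch using the Java API, assuming a cluster reachable through the fs.defaultFS set in core-site.xml; the path and contents are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // contacts the NameNode
        Path target = new Path("/user/demo/notes.txt");  // hypothetical HDFS path

        // The NameNode only hands out DataNode locations; the bytes below are
        // streamed to the DataNodes and replicated by the write pipeline.
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}
```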
How HDFS Reads Files
Read Process
- Client requests NameNode
- NameNode provides block locations
- Client reads from nearest DataNode
Example: Streaming stored log files.
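A matching read sketch, again with an illustrative path; FileSystem.open() asks the NameNode for block locations, and the bytes themselves come from the nearest DataNode:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path source = new Path("/user/demo/notes.txt");  // hypothetical path

        // Reads are served block by block from the closest DataNode holding a replica
        try (FSDataInputStream in = fs.open(source);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```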
How HDFS Writes Files
Write Flow
- File divided into blocks
- Blocks written sequentially
- Replication applied
- Metadata updated
Java Interfaces to HDFS
HDFS provides Java APIs to interact with files.
Common Classes
| Class | Purpose |
|---|---|
| FileSystem | Access HDFS |
| Path | File path |
| FSDataInputStream | Read data |
| FSDataOutputStream | Write data |
Example: Java program to upload files to HDFS.
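A small sketch of such an upload program, using hypothetical local and HDFS paths; copyFromLocalFile() is the programmatic counterpart of hdfs dfs -put:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // reads fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/tmp/report.csv");         // hypothetical local file
        Path hdfsFile  = new Path("/user/demo/report.csv");   // hypothetical HDFS destination

        fs.copyFromLocalFile(localFile, hdfsFile);            // file is split into blocks and replicated
        System.out.println("Uploaded: " + fs.exists(hdfsFile));
        fs.close();
    }
}
```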
Command Line Interface (CLI)
HDFS provides command-line tools.
Common Commands
| Command | Description |
|---|---|
| hdfs dfs -ls | List files |
| hdfs dfs -put | Upload file |
| hdfs dfs -get | Download file |
| hdfs dfs -rm | Delete file |
| hdfs dfs -df | Show capacity and free space |
Hadoop File System Interfaces
Hadoop supports multiple file systems.
| File System | Description |
|---|---|
| HDFS | Distributed file system |
| Local FS | Local disk |
| HAR FS | Archived files (Hadoop Archives) |
| S3 FS | Cloud storage |
Data Flow in HDFS
Data Flow Steps
- Client → NameNode → DataNodes
- DataNodes → Client (read)
Key Feature
- Data locality (processing near data)
Data Ingest with Flume and Sqoop
Apache Flume
Used to ingest streaming data.
| Feature | Use |
|---|---|
| Real-time ingestion | Logs |
| Reliable | Event-based |
Example: Collecting web server logs.
Apache Sqoop
Used to import/export RDBMS data.
| Feature | Use |
|---|---|
| Structured data | MySQL, Oracle |
| Bulk transfer | Fast ingestion |
Example: Importing student database from MySQL to HDFS.
Hadoop Archives (HAR)
HAR reduces small file problems.
Purpose
- Combine small files into one archive
- Reduce NameNode load
Example: Thousands of small images merged into one HAR file.
Exam-Ready Short Definitions
- HDFS – Distributed storage system of Hadoop.
- NameNode – Stores metadata.
- DataNode – Stores actual data.
- Replication – Multiple copies of data.
- Block Size – Fixed data chunk (default 128 MB).
Hadoop I/O (Input / Output)
Hadoop I/O deals with how data is stored, transferred, compressed, and processed efficiently in Hadoop.
Main I/O concepts:
- Compression
- Serialization
- Avro
- File-based data structures
Compression in Hadoop
Compression reduces the size of data files so that:
- Less storage is used
- Data transfer is faster
- Network cost is reduced
Why Compression is Important in Hadoop
- Hadoop processes huge data
- Smaller files → faster MapReduce jobs
Common Compression Techniques
| Compression Type | Description |
|---|---|
| Gzip | High compression, slow |
| Bzip2 | High compression, splittable |
| Snappy | Fast, low compression |
| LZO | Very fast, splittable (when indexed) |
Real-Life Example: Zipping a large folder before emailing it.
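A short sketch of Hadoop's codec API, compressing standard input to standard output with GzipCodec (classes are from org.apache.hadoop.io.compress; the program name is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressToStdout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec through ReflectionUtils so it receives the Hadoop configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap stdout in a compressing stream and copy stdin through it
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed data without closing the underlying stream
    }
}
```

Run as a filter, the output is ordinary gzip data and can be decompressed with standard tools.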
Serialization in Hadoop
Serialization converts objects into byte streams so they can be:
- Stored in HDFS
- Transferred over network
Why Serialization is Needed
- Faster data exchange
- Less memory usage
How Hadoop Serializes Data
- Uses the Writable interface
- More compact and faster than Java's default serialization
Example: Converting student objects into bytes for storage.
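A sketch of that idea: a hypothetical StudentWritable implementing Hadoop's Writable interface, whose write()/readFields() methods define how the object becomes a byte stream and back:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record type showing the Writable serialization contract
public class StudentWritable implements Writable {
    private int rollNo;
    private String name;

    public StudentWritable() { }                 // Hadoop needs a no-arg constructor

    public StudentWritable(int rollNo, String name) {
        this.rollNo = rollNo;
        this.name = name;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // object -> byte stream
        out.writeInt(rollNo);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // byte stream -> object
        rollNo = in.readInt();
        name = in.readUTF();
    }
}
```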
Apache Avro
Avro is a row-based data serialization framework used in Hadoop.
Key Features of Avro
- Schema stored with data
- Language independent
- Compact and fast
Advantages
| Feature | Benefit |
|---|---|
| Schema evolution | Easy updates |
| Compact format | Less storage |
| Fast processing | High performance |
Real-Life Example: Sharing structured data between Java and Python programs.
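A small sketch using Avro's generic Java API (assumes the avro library on the classpath; the schema and file name are illustrative):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroStudentExample {
    public static void main(String[] args) throws Exception {
        // The schema travels with the data, so readers in any language can interpret the file
        String schemaJson = "{\"type\":\"record\",\"name\":\"Student\","
                + "\"fields\":[{\"name\":\"rollNo\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord student = new GenericData.Record(schema);
        student.put("rollNo", 1);
        student.put("name", "Asha");

        // Write a container file that embeds the schema alongside the records
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("students.avro"));
            writer.append(student);
        }
    }
}
```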
File-Based Data Structures in Hadoop
Hadoop provides file formats optimized for Big Data.
Common File-Based Structures
| File Type | Use |
|---|---|
| Sequence File | Binary key-value pairs |
| Map File | Indexed sequence file |
| Avro File | Structured row-based data |
| Parquet | Column-based analytics |
| ORC | Optimized column storage |
Example: Parquet files used in analytics queries for faster results.
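As a concrete illustration, a minimal sketch that writes a few binary key-value pairs to a Sequence File using the Hadoop 2.x createWriter options API (the output path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/pairs.seq");   // hypothetical output path

        // A SequenceFile stores binary key-value pairs; here keys are ints, values are text
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 1; i <= 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}
```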
Hadoop Environment
The Hadoop environment includes:
- Hardware
- Software
- Configuration
- Security
- Monitoring
Setting Up a Hadoop Cluster
Types of Hadoop Clusters
| Mode | Description |
|---|---|
| Standalone | Single machine |
| Pseudo-distributed | One machine, multi-daemon |
| Fully distributed | Multiple machines |
Real-Life Example
- Small college lab → Pseudo-distributed
- Large company → Fully distributed
Cluster Specification
Key Specifications
| Component | Requirement |
|---|---|
| CPU | Multi-core processors |
| RAM | 8 GB or more |
| Storage | High-capacity HDD/SSD |
| Network | High bandwidth |
| OS | Linux preferred |
Cluster Setup and Installation
Installation Steps
- Install Java
- Install Hadoop
- Configure SSH
- Set environment variables
- Configure core-site.xml, hdfs-site.xml
- Format NameNode
- Start Hadoop services
Example: Installing Hadoop on Linux virtual machines.
Hadoop Configuration
Main Configuration Files
| File | Purpose |
|---|---|
| core-site.xml | Core settings |
| hdfs-site.xml | HDFS properties |
| yarn-site.xml | Resource management |
| mapred-site.xml | MapReduce settings |
Example: Setting block size or replication factor.
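These properties normally live in hdfs-site.xml; as a sketch of the client-side equivalent, the standard property names dfs.replication and dfs.blocksize can be overridden programmatically through the Configuration API (values here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfigOverrideExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml

        // Override cluster defaults for files written by this client only
        conf.setInt("dfs.replication", 2);                   // replication factor
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB block size

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
        fs.close();
    }
}
```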
Security in Hadoop
Why Security is Needed
- Sensitive data
- Multiple users
- Distributed access
Security Features
| Feature | Purpose |
|---|---|
| Kerberos | Authentication |
| Access Control Lists | Authorization |
| Encryption | Data protection |
| Audit logs | Tracking access |
Example: Only authorized employees can access customer data.
Administering Hadoop
Hadoop Administration Tasks
- User management
- Resource allocation
- Job monitoring
- Backup & recovery
- Log management
Real-Life Example: Hadoop admin manages cluster health in an IT company.
HDFS Monitoring & Maintenance
Monitoring Tools
- Web UI
- Logs
- Metrics
- Alerts
Maintenance Activities
- Disk replacement
- Node addition/removal
- Data balancing
Example: Replacing a failed DataNode without data loss.
Hadoop Benchmarks
Benchmarks measure Hadoop performance.
Popular Benchmarks
| Benchmark | Purpose |
|---|---|
| TeraSort | Sorting performance |
| TestDFSIO (DFSIO) | HDFS read/write throughput |
| HiBench | Workload testing |
Example: Testing how fast Hadoop can sort 1 TB data.
Hadoop in the Cloud
Why Use Hadoop in Cloud?
- No hardware cost
- Easy scalability
- Pay-as-you-use
Cloud Hadoop Platforms
| Platform | Service |
|---|---|
| AWS | EMR |
| Google Cloud | Dataproc |
| Azure | HDInsight |
Real-Life Example: Startups using AWS EMR instead of building clusters.
Advantages of Cloud-Based Hadoop
| Advantage | Description |
|---|---|
| Scalability | Add/remove nodes easily |
| Cost-effective | Pay only for usage |
| High availability | Managed services |
| Faster setup | Minutes instead of weeks |
Exam-Ready Short Definitions
- Compression – Reducing data size.
- Serialization – Converting objects into bytes.
- Avro – Schema-based data format.
- Hadoop Cluster – Group of machines running Hadoop.
- Kerberos – Authentication protocol used by Hadoop.
Conclusion
Hadoop I/O techniques such as compression, serialization, and Avro improve storage efficiency and processing speed. A properly configured Hadoop environment with security, monitoring, and cloud deployment ensures scalable, reliable, and cost-effective Big Data processing.