HDFS – HADOOP DISTRIBUTED FILE SYSTEM

Introduction to HDFS

HDFS (Hadoop Distributed File System) is the storage layer of Hadoop, designed to store very large files across multiple machines in a reliable and fault-tolerant manner.

HDFS breaks big files into smaller blocks and stores them across many computers.

Real-Life Example: A 5 GB video is split and stored across many servers instead of one hard disk.

Design of HDFS

HDFS is designed with the following goals:

Key Design Principles

  • Scalability – Can store petabytes of data
  • Fault Tolerance – Works even if nodes fail
  • High Throughput – Fast access to large data sets
  • Commodity Hardware – Uses low-cost systems
  • Write Once, Read Many – Files are not frequently modified

HDFS Concepts

Main Components

  • NameNode – Master node; stores metadata
  • DataNode – Stores the actual data blocks
  • Secondary NameNode – Periodically checkpoints metadata (not a live backup)
  • Client – Application that reads from and writes to HDFS

Metadata Includes

  • File name
  • Block location
  • Permissions

Benefits and Challenges of HDFS

Benefits

  • Fault Tolerant – Data replication
  • High Scalability – Nodes can be added easily
  • Cost-Effective – Uses cheap commodity hardware
  • Large File Support – Handles TB to PB files

Challenges

  • Small File Problem – Not efficient for large numbers of small files
  • Not Real-Time – Designed for batch processing
  • Single NameNode (older versions) – Single point of failure
  • High Latency – Slow for small, random reads

File Sizes, Block Sizes and Block Abstraction

  • File Sizes: HDFS handles very large files (GB to PB).
  • Block Size: Default block size is 128 MB (Hadoop 2.x and later; 64 MB in Hadoop 1.x)

Block Abstraction

  • Files are split into fixed-size blocks.
  • Blocks stored independently.

Example: 300 MB file → 3 blocks (128 + 128 + 44 MB)
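
A minimal sketch of how a client can see this block abstraction through the Java API, assuming a running cluster and the Hadoop client libraries on the classpath; the class name and the /data/movie.mp4 path are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class ListBlocks {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();   // reads core-site.xml / hdfs-site.xml
          FileSystem fs = FileSystem.get(conf);       // connect to the default file system
          FileStatus status = fs.getFileStatus(new Path("/data/movie.mp4"));
          BlockLocation[] blocks = fs.getFileBlockLocations(status, 0, status.getLen());
          for (BlockLocation b : blocks) {
              // each entry is one block and the DataNodes holding its replicas
              System.out.println(b.getOffset() + " len=" + b.getLength()
                      + " hosts=" + String.join(",", b.getHosts()));
          }
          fs.close();
      }
  }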

Data Replication in HDFS

Replication means storing multiple copies of data blocks.

Replication Factor

  • Default = 3

Benefits

  • Fault tolerance
  • Data availability

Example: One block stored on 3 different DataNodes.
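
The replication factor can also be changed per file through the Java API. A minimal sketch, assuming a running cluster; the class name and the file path are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class SetReplication {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // Ask HDFS to keep 3 copies of this file's blocks
          boolean ok = fs.setReplication(new Path("/data/logs/app.log"), (short) 3);
          System.out.println("Replication request accepted: " + ok);
          fs.close();
      }
  }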

How HDFS Stores Files

Write Process

  • Client requests NameNode
  • NameNode provides DataNode list
  • Client writes data to DataNodes
  • Blocks are replicated

Example: Uploading a video to HDFS.
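
A minimal sketch of the write path using the Java API, assuming the Hadoop client configuration is on the classpath; the class name, path and contents are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import java.nio.charset.StandardCharsets;

  public class WriteToHdfs {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // The NameNode chooses DataNodes; the client then streams bytes to them
          try (FSDataOutputStream out = fs.create(new Path("/data/notes.txt"))) {
              out.write("hello hdfs".getBytes(StandardCharsets.UTF_8));
          }
          fs.close();
      }
  }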

How HDFS Reads Files

Read Process

  • Client requests NameNode
  • NameNode provides block locations
  • Client reads from nearest DataNode

Example: Streaming stored log files.
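
A minimal sketch of the read path using the Java API; the class name and the /data/notes.txt path are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FSDataInputStream;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IOUtils;

  public class ReadFromHdfs {
      public static void main(String[] args) throws Exception {
          FileSystem fs = FileSystem.get(new Configuration());
          // The NameNode returns block locations; data is read from the nearest DataNodes
          try (FSDataInputStream in = fs.open(new Path("/data/notes.txt"))) {
              IOUtils.copyBytes(in, System.out, 4096, false);  // print file contents
          }
          fs.close();
      }
  }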

How HDFS Writes Files

Write Flow

  • File divided into blocks
  • Blocks written sequentially
  • Replication applied
  • Metadata updated

Java Interfaces to HDFS

HDFS provides Java APIs to interact with files.

Common Classes

  • FileSystem – Entry point for accessing HDFS
  • Path – Represents a file or directory path
  • FSDataInputStream – Reads data
  • FSDataOutputStream – Writes data

Example: Java program to upload files to HDFS.
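
A minimal sketch of such an upload program, assuming the Hadoop client configuration is on the classpath; the class name and both paths are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class UploadToHdfs {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          FileSystem fs = FileSystem.get(conf);
          // Copy a local file into HDFS
          fs.copyFromLocalFile(new Path("/home/user/video.mp4"),
                               new Path("/data/video.mp4"));
          fs.close();
      }
  }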

Command Line Interface (CLI)

HDFS provides command-line tools.

Common Commands

  • hdfs dfs -ls – List files
  • hdfs dfs -put – Upload a file
  • hdfs dfs -get – Download a file
  • hdfs dfs -rm – Delete a file
  • hdfs dfs -df – Show filesystem capacity and free space

Hadoop File System Interfaces

Hadoop supports multiple file systems.

  • HDFS – Hadoop's distributed file system
  • Local FS – Local disk
  • HBase FS – NoSQL storage
  • S3 FS – Cloud storage (Amazon S3)

Data Flow in HDFS

Data Flow Steps

  • Client → NameNode → DataNodes
  • DataNodes → Client (read)

Key Feature

  • Data locality (processing near data)

Data Ingest with Flume and Sqoop

Apache Flume

Used to ingest streaming data.

  • Real-time ingestion – Logs
  • Reliable – Event-based delivery

Example: Collecting web server logs.

Apache Sqoop

Used to import/export RDBMS data.

  • Structured data – MySQL, Oracle
  • Bulk transfer – Fast ingestion

Example: Importing student database from MySQL to HDFS.

Hadoop Archives (HAR)

HAR reduces small file problems.

Purpose

  • Combine small files into one archive
  • Reduce NameNode load

Example: Thousands of small images merged into one HAR file.

Exam-Ready Short Definitions

  • HDFS – Distributed storage system of Hadoop.
  • NameNode – Stores metadata.
  • DataNode – Stores actual data.
  • Replication – Multiple copies of data.
  • Block Size – Size of each fixed data chunk (default 128 MB).

Hadoop I/O (Input / Output)

Hadoop I/O deals with how data is stored, transferred, compressed, and processed efficiently in Hadoop.

Main I/O concepts:

  • Compression
  • Serialization
  • Avro
  • File-based data structures

Compression in Hadoop

Compression reduces the size of data files so that:

  • Less storage is used
  • Data transfer is faster
  • Network cost is reduced

Why Compression is Important in Hadoop

  • Hadoop processes huge data
  • Smaller files → faster MapReduce jobs

Common Compression Techniques

  • Gzip – High compression, slower; not splittable
  • Bzip2 – High compression; splittable
  • Snappy – Very fast, lower compression
  • LZO – Very fast; splittable when indexed

Real-Life Example: Zipping a large folder before emailing it.
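
A minimal sketch using Hadoop's GzipCodec through the CompressionCodec interface; the class name, output file name and text are illustrative only.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.io.compress.CompressionCodec;
  import org.apache.hadoop.io.compress.GzipCodec;
  import org.apache.hadoop.util.ReflectionUtils;
  import java.io.FileOutputStream;
  import java.io.OutputStream;
  import java.nio.charset.StandardCharsets;

  public class GzipExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          // Instantiate the Gzip codec through Hadoop's codec framework
          CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);
          try (OutputStream out =
                   codec.createOutputStream(new FileOutputStream("data.txt.gz"))) {
              out.write("some repeated text ...".getBytes(StandardCharsets.UTF_8));
          }
      }
  }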

Serialization in Hadoop

Serialization converts objects into byte streams so they can be:

  • Stored in HDFS
  • Transferred over network

Why Serialization is Needed

  • Faster data exchange
  • Less memory usage

Hadoop's Writable Serialization

  • Uses the Writable interface
  • More compact and faster than standard Java serialization

Example: Converting student objects into bytes for storage.
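
A minimal sketch of Writable serialization using Hadoop's Text and IntWritable classes; the class name and the student values are invented for illustration.

  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.Text;
  import java.io.ByteArrayOutputStream;
  import java.io.DataOutputStream;

  public class WritableExample {
      public static void main(String[] args) throws Exception {
          ByteArrayOutputStream bytes = new ByteArrayOutputStream();
          DataOutputStream out = new DataOutputStream(bytes);
          // Each Writable knows how to serialize itself to a DataOutput
          new Text("Asha").write(out);         // student name
          new IntWritable(21).write(out);      // student age
          out.flush();
          System.out.println("Serialized size: " + bytes.size() + " bytes");
      }
  }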

Apache Avro

Avro is a row-based data serialization framework used in Hadoop.

Key Features of Avro

  • Schema stored with data
  • Language independent
  • Compact and fast

Advantages

  • Schema evolution – Easy updates
  • Compact format – Less storage
  • Fast processing – High performance

Real-Life Example: Sharing structured data between Java and Python programs.
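
A minimal sketch using Avro's generic Java API; the Student schema, field values, class name and output file are invented for illustration.

  import org.apache.avro.Schema;
  import org.apache.avro.file.DataFileWriter;
  import org.apache.avro.generic.GenericData;
  import org.apache.avro.generic.GenericDatumWriter;
  import org.apache.avro.generic.GenericRecord;
  import java.io.File;

  public class AvroExample {
      public static void main(String[] args) throws Exception {
          // The schema travels with the data file, so other languages can read it back
          Schema schema = new Schema.Parser().parse(
              "{\"type\":\"record\",\"name\":\"Student\",\"fields\":["
            + "{\"name\":\"name\",\"type\":\"string\"},"
            + "{\"name\":\"age\",\"type\":\"int\"}]}");
          GenericRecord student = new GenericData.Record(schema);
          student.put("name", "Asha");
          student.put("age", 21);
          try (DataFileWriter<GenericRecord> writer =
                   new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
              writer.create(schema, new File("students.avro"));
              writer.append(student);
          }
      }
  }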

File-Based Data Structures in Hadoop

Hadoop provides file formats optimized for Big Data.

Common File-Based Structures

  • SequenceFile – Binary key-value pairs
  • MapFile – Indexed SequenceFile
  • Avro file – Structured row-based data
  • Parquet – Column-based analytics
  • ORC – Optimized column storage

Example: Parquet files used in analytics queries for faster results.
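
As an illustration of the first format in the list above, here is a minimal sketch that writes key-value pairs into a SequenceFile; the class name and the /data/pairs.seq path are hypothetical.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.SequenceFile;
  import org.apache.hadoop.io.Text;

  public class SequenceFileExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Path path = new Path("/data/pairs.seq");
          try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                  SequenceFile.Writer.file(path),
                  SequenceFile.Writer.keyClass(IntWritable.class),
                  SequenceFile.Writer.valueClass(Text.class))) {
              // Binary key-value pairs, e.g. record number -> record text
              writer.append(new IntWritable(1), new Text("first record"));
              writer.append(new IntWritable(2), new Text("second record"));
          }
      }
  }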

Hadoop Environment

The Hadoop environment includes:

  • Hardware
  • Software
  • Configuration
  • Security
  • Monitoring

Setting Up a Hadoop Cluster

Types of Hadoop Clusters

  • Standalone – Single machine, single JVM
  • Pseudo-distributed – One machine, each daemon in its own process
  • Fully distributed – Daemons spread across multiple machines

Real-Life Example

  • Small college lab → Pseudo-distributed
  • Large company → Fully distributed

Cluster Specification

Key Specifications

  • CPU – Multi-core processors
  • RAM – 8 GB or more
  • Storage – High-capacity HDD/SSD
  • Network – High bandwidth
  • OS – Linux preferred

Cluster Setup and Installation

Installation Steps

  • Install Java
  • Install Hadoop
  • Configure SSH
  • Set environment variables
  • Configure core-site.xml, hdfs-site.xml
  • Format NameNode
  • Start Hadoop services

Example: Installing Hadoop on Linux virtual machines.

Hadoop Configuration

Main Configuration Files

  • core-site.xml – Core settings
  • hdfs-site.xml – HDFS properties
  • yarn-site.xml – Resource management
  • mapred-site.xml – MapReduce settings

Example: Setting block size or replication factor.
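
These properties can also be overridden programmatically before a client connects. A minimal sketch, assuming the Hadoop client libraries are on the classpath; the class name and override values are illustrative.

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;

  public class ConfigExample {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();          // loads *-site.xml from the classpath
          conf.setInt("dfs.replication", 2);                 // override replication factor
          conf.setLong("dfs.blocksize", 128 * 1024 * 1024L); // 128 MB block size
          FileSystem fs = FileSystem.get(conf);
          System.out.println("Default FS: " + conf.get("fs.defaultFS"));
          fs.close();
      }
  }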

Security in Hadoop

Why Security is Needed

  • Sensitive data
  • Multiple users
  • Distributed access

Security Features

  • Kerberos – Authentication
  • Access Control Lists (ACLs) – Authorization
  • Encryption – Data protection
  • Audit logs – Tracking access

Example: Only authorized employees can access customer data.

Administering Hadoop

Hadoop Administration Tasks

  • User management
  • Resource allocation
  • Job monitoring
  • Backup & recovery
  • Log management

Real-Life Example: Hadoop admin manages cluster health in an IT company.

HDFS Monitoring & Maintenance

Monitoring Tools

  • Web UI
  • Logs
  • Metrics
  • Alerts

Maintenance Activities

  • Disk replacement
  • Node addition/removal
  • Data balancing

Example: Replacing a failed DataNode without data loss.

Hadoop Benchmarks

Benchmarks measure Hadoop performance.

Popular Benchmarks

  • TeraSort – Sorting performance
  • TestDFSIO (DFSIO) – HDFS read/write throughput
  • HiBench – Suite of representative workloads

Example: Testing how fast Hadoop can sort 1 TB data.

Hadoop in the Cloud

Why Use Hadoop in Cloud?

  • No hardware cost
  • Easy scalability
  • Pay-as-you-use

Cloud Hadoop Platforms

  • AWS – EMR
  • Google Cloud – Dataproc
  • Azure – HDInsight

Real-Life Example: Startups using AWS EMR instead of building clusters.

Advantages of Cloud-Based Hadoop

  • Scalability – Add/remove nodes easily
  • Cost-effective – Pay only for usage
  • High availability – Managed services
  • Faster setup – Minutes instead of weeks

Exam-Ready Short Definitions

  • Compression – Reducing data size.
  • Serialization – Converting objects into bytes.
  • Avro – Schema-based data format.
  • Hadoop Cluster – Group of machines running Hadoop.
  • Kerberos – Authentication protocol used by Hadoop.

Conclusion 

Hadoop I/O techniques such as compression, serialization, and Avro improve storage efficiency and processing speed. A properly configured Hadoop environment with security, monitoring, and cloud deployment ensures scalable, reliable, and cost-effective Big Data processing.