HDFS – HADOOP DISTRIBUTED FILE SYSTEM
Introduction to HDFS
HDFS (Hadoop Distributed File System) is the storage system of Hadoop designed to store very large files across multiple machines in a reliable and fault-tolerant manner.
HDFS breaks big files into smaller blocks and stores them across many computers.
Real-Life Example: A 5 GB video is split and stored across many servers instead of one hard disk.
Design of HDFS
HDFS is designed with the following goals:
Key Design Principles
| Principle | Explanation |
|---|---|
| Scalability | Can store petabytes of data |
| Fault Tolerance | Works even if nodes fail |
| High Throughput | Fast data access |
| Commodity Hardware | Uses low-cost systems |
| Write Once, Read Many | Files are not frequently modified |
HDFS Concepts
Main Components
| Component | Role |
|---|---|
| NameNode | Master node; stores metadata |
| DataNode | Stores actual data blocks |
| Secondary NameNode | Performs periodic metadata checkpoints (not a live failover backup) |
| Client | Accesses HDFS |
Metadata Includes
- File name
- Block location
- Permissions
Benefits and Challenges of HDFS
Benefits
| Benefit | Description |
|---|---|
| Fault Tolerant | Data replication |
| High Scalability | Add nodes easily |
| Cost-Effective | Uses cheap hardware |
| Large File Support | Handles TB-PB files |
Challenges
| Challenge | Explanation |
|---|---|
| Small File Problem | Not efficient for small files |
| Not Real-Time | Batch processing |
| Single NameNode (older versions) | Single point of failure |
| High Latency | Slow for small reads |
File Sizes, Block Sizes and Block Abstraction
- File Sizes: HDFS handles very large files (GB to PB).
- Block Size: Default block size = 128 MB (Hadoop 2.x and later; 64 MB in Hadoop 1.x)
Block Abstraction
- Files are split into fixed-size blocks.
- Blocks stored independently.
Example: 300 MB file → 3 blocks (128 + 128 + 44 MB)
Data Replication in HDFS
Replication means storing multiple copies of data blocks.
Replication Factor
- Default replication factor = 3
Benefits
- Fault tolerance
- Data availability
Example: One block stored on 3 different DataNodes.
How HDFS Stores Files
Write Process
- Client requests NameNode
- NameNode provides DataNode list
- Client writes data to DataNodes
- Blocks are replicated
Example: Uploading a video to HDFS.
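A minimal client-side write sketch using the Java API, assuming a cluster reachable through the fs.defaultFS set in core-site.xml; the path and contents are illustrative only:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();        // loads core-site.xml / hdfs-site.xml
        FileSystem fs = FileSystem.get(conf);            // contacts the NameNode
        Path target = new Path("/user/demo/notes.txt");  // hypothetical HDFS path

        // The NameNode only hands out DataNode locations; the bytes below are
        // streamed to the DataNodes and replicated by the write pipeline.
        try (FSDataOutputStream out = fs.create(target, true)) {
            out.write("Hello HDFS\n".getBytes("UTF-8"));
        }
        fs.close();
    }
}
```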
How HDFS Reads Files
Read Process
- Client requests NameNode
- NameNode provides block locations
- Client reads from nearest DataNode
Example: Streaming stored log files.
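A matching read sketch, again with an illustrative path; FileSystem.open() asks the NameNode for block locations, and the bytes themselves come from the nearest DataNode:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);
        Path source = new Path("/user/demo/notes.txt");  // hypothetical path

        // Reads are served block by block from the closest DataNode holding a replica
        try (FSDataInputStream in = fs.open(source);
             BufferedReader reader = new BufferedReader(new InputStreamReader(in, "UTF-8"))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
        fs.close();
    }
}
```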
How HDFS Writes Files
Write Flow
- File divided into blocks
- Blocks written sequentially
- Replication applied
- Metadata updated
Java Interfaces to HDFS
HDFS provides Java APIs to interact with files.
Common Classes
| Class | Purpose |
|---|---|
| FileSystem | Access HDFS |
| Path | File path |
| FSDataInputStream | Read data |
| FSDataOutputStream | Write data |
Example: Java program to upload files to HDFS.
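A small sketch of such an upload program, using hypothetical local and HDFS paths; copyFromLocalFile() is the programmatic counterpart of hdfs dfs -put:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsUploadExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();             // reads fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        Path localFile = new Path("/tmp/report.csv");         // hypothetical local file
        Path hdfsFile  = new Path("/user/demo/report.csv");   // hypothetical HDFS destination

        fs.copyFromLocalFile(localFile, hdfsFile);            // file is split into blocks and replicated
        System.out.println("Uploaded: " + fs.exists(hdfsFile));
        fs.close();
    }
}
```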
Command Line Interface (CLI)
HDFS provides command-line tools.
Common Commands
| Command | Description |
|---|---|
| hdfs dfs -ls | List files |
| hdfs dfs -put | Upload file |
| hdfs dfs -get | Download file |
| hdfs dfs -rm | Delete file |
| hdfs dfs -df | Show capacity and free space |
Hadoop File System Interfaces
Hadoop supports multiple file systems.
| File System | Description |
|---|---|
| HDFS | Distributed file system |
| Local FS | Local disk |
| HAR FS | Archived files (Hadoop Archives) |
| S3 FS | Cloud storage |
Data Flow in HDFS
Data Flow Steps
- Client → NameNode → DataNodes
- DataNodes → Client (read)
Key Feature
- Data locality (processing near data)
Data Ingest with Flume and Sqoop
Apache Flume
Used to ingest streaming data.
| Feature | Use |
|---|---|
| Real-time ingestion | Logs |
| Reliable | Event-based |
Example: Collecting web server logs.
Apache Sqoop
Used to import/export RDBMS data.
| Feature | Use |
|---|---|
| Structured data | MySQL, Oracle |
| Bulk transfer | Fast ingestion |
Example: Importing student database from MySQL to HDFS.
Hadoop Archives (HAR)
HAR reduces small file problems.
Purpose
- Combine small files into one archive
- Reduce NameNode load
Example: Thousands of small images merged into one HAR file.
Exam-Ready Short Definitions
- HDFS – Distributed storage system of Hadoop.
- NameNode – Stores metadata.
- DataNode – Stores actual data.
- Replication – Multiple copies of data.
- Block Size – Fixed data chunk (default 128 MB).
Hadoop I/O (Input / Output)
Hadoop I/O deals with how data is stored, transferred, compressed, and processed efficiently in Hadoop.
Main I/O concepts:
- Compression
- Serialization
- Avro
- File-based data structures
Compression in Hadoop
Compression reduces the size of data files so that:
- Less storage is used
- Data transfer is faster
- Network cost is reduced
Why Compression is Important in Hadoop
- Hadoop processes huge data
- Smaller files → faster MapReduce jobs
Common Compression Techniques
| Compression Type | Description |
|---|---|
| Gzip | High compression, slow |
| Bzip2 | High compression, splittable |
| Snappy | Fast, low compression |
| LZO | Very fast, splittable (when indexed) |
Real-Life Example: Zipping a large folder before emailing it.
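A short sketch of Hadoop's codec API, compressing standard input to standard output with GzipCodec (classes are from org.apache.hadoop.io.compress; the program name is made up):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.CompressionOutputStream;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.util.ReflectionUtils;

public class CompressToStdout {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Instantiate the codec through ReflectionUtils so it receives the Hadoop configuration
        CompressionCodec codec = ReflectionUtils.newInstance(GzipCodec.class, conf);

        // Wrap stdout in a compressing stream and copy stdin through it
        CompressionOutputStream out = codec.createOutputStream(System.out);
        IOUtils.copyBytes(System.in, out, 4096, false);
        out.finish();   // flush the compressed data without closing the underlying stream
    }
}
```

Run as a filter, the output is ordinary gzip data and can be decompressed with standard tools.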
Serialization in Hadoop
Serialization converts objects into byte streams so they can be:
- Stored in HDFS
- Transferred over network
Why Serialization is Needed
- Faster data exchange
- Less memory usage
How Hadoop Serializes Data
- Uses the Writable interface
- More compact and faster than Java's default serialization
Example: Converting student objects into bytes for storage.
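A sketch of that idea: a hypothetical StudentWritable implementing Hadoop's Writable interface, whose write()/readFields() methods define how the object becomes a byte stream and back:

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import org.apache.hadoop.io.Writable;

// Hypothetical record type showing the Writable serialization contract
public class StudentWritable implements Writable {
    private int rollNo;
    private String name;

    public StudentWritable() { }                 // Hadoop needs a no-arg constructor

    public StudentWritable(int rollNo, String name) {
        this.rollNo = rollNo;
        this.name = name;
    }

    @Override
    public void write(DataOutput out) throws IOException {    // object -> byte stream
        out.writeInt(rollNo);
        out.writeUTF(name);
    }

    @Override
    public void readFields(DataInput in) throws IOException { // byte stream -> object
        rollNo = in.readInt();
        name = in.readUTF();
    }
}
```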
Apache Avro
Avro is a row-based data serialization framework used in Hadoop.
Key Features of Avro
- Schema stored with data
- Language independent
- Compact and fast
Advantages
| Feature | Benefit |
|---|---|
| Schema evolution | Easy updates |
| Compact format | Less storage |
| Fast processing | High performance |
Real-Life Example: Sharing structured data between Java and Python programs.
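A small sketch using Avro's generic Java API (assumes the avro library on the classpath; the schema and file name are illustrative):

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroStudentExample {
    public static void main(String[] args) throws Exception {
        // The schema travels with the data, so readers in any language can interpret the file
        String schemaJson = "{\"type\":\"record\",\"name\":\"Student\","
                + "\"fields\":[{\"name\":\"rollNo\",\"type\":\"int\"},"
                + "{\"name\":\"name\",\"type\":\"string\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord student = new GenericData.Record(schema);
        student.put("rollNo", 1);
        student.put("name", "Asha");

        // Write a container file that embeds the schema alongside the records
        try (DataFileWriter<GenericRecord> writer =
                 new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("students.avro"));
            writer.append(student);
        }
    }
}
```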
File-Based Data Structures in Hadoop
Hadoop provides file formats optimized for Big Data.
Common File-Based Structures
| File Type | Use |
|---|---|
| Sequence File | Binary key-value pairs |
| Map File | Indexed sequence file |
| Avro File | Structured row-based data |
| Parquet | Column-based analytics |
| ORC | Optimized column storage |
Example: Parquet files used in analytics queries for faster results.
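As a concrete illustration, a minimal sketch that writes a few binary key-value pairs to a Sequence File using the Hadoop 2.x createWriter options API (the output path is hypothetical):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SequenceFileWriteExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path path = new Path("/user/demo/pairs.seq");   // hypothetical output path

        // A SequenceFile stores binary key-value pairs; here keys are ints, values are text
        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(path),
                SequenceFile.Writer.keyClass(IntWritable.class),
                SequenceFile.Writer.valueClass(Text.class))) {
            for (int i = 1; i <= 3; i++) {
                writer.append(new IntWritable(i), new Text("record-" + i));
            }
        }
    }
}
```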
Hadoop Environment
The Hadoop environment includes:
- Hardware
- Software
- Configuration
- Security
- Monitoring
Setting Up a Hadoop Cluster
Types of Hadoop Clusters
| Mode | Description |
|---|---|
| Standalone | Single machine |
| Pseudo-distributed | One machine, multi-daemon |
| Fully distributed | Multiple machines |
Real-Life Example
- Small college lab → Pseudo-distributed
- Large company → Fully distributed
Cluster Specification
Key Specifications
| Component | Requirement |
|---|---|
| CPU | Multi-core processors |
| RAM | 8 GB or more |
| Storage | High-capacity HDD/SSD |
| Network | High bandwidth |
| OS | Linux preferred |
Cluster Setup and Installation
Installation Steps
- Install Java
- Install Hadoop
- Configure SSH
- Set environment variables
- Configure core-site.xml, hdfs-site.xml
- Format NameNode
- Start Hadoop services
Example: Installing Hadoop on Linux virtual machines.
Hadoop Configuration
Main Configuration Files
| File | Purpose |
|---|---|
| core-site.xml | Core settings |
| hdfs-site.xml | HDFS properties |
| yarn-site.xml | Resource management |
| mapred-site.xml | MapReduce settings |
Example: Setting block size or replication factor.
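These properties normally live in hdfs-site.xml; as a sketch of the client-side equivalent, the standard property names dfs.replication and dfs.blocksize can be overridden programmatically through the Configuration API (values here are illustrative):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;

public class ConfigOverrideExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();            // loads core-site.xml / hdfs-site.xml

        // Override cluster defaults for files written by this client only
        conf.setInt("dfs.replication", 2);                   // replication factor
        conf.setLong("dfs.blocksize", 256L * 1024 * 1024);   // 256 MB block size

        FileSystem fs = FileSystem.get(conf);
        System.out.println("Default FS: " + conf.get("fs.defaultFS"));
        fs.close();
    }
}
```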
Security in Hadoop
Why Security is Needed
- Sensitive data
- Multiple users
- Distributed access
Security Features
| Feature | Purpose |
|---|---|
| Kerberos | Authentication |
| Access Control Lists | Authorization |
| Encryption | Data protection |
| Audit logs | Tracking access |
Example: Only authorized employees can access customer data.
Administering Hadoop
Hadoop Administration Tasks
- User management
- Resource allocation
- Job monitoring
- Backup & recovery
- Log management
Real-Life Example: Hadoop admin manages cluster health in an IT company.
HDFS Monitoring & Maintenance
Monitoring Tools
- Web UI
- Logs
- Metrics
- Alerts
Maintenance Activities
- Disk replacement
- Node addition/removal
- Data balancing
Example: Replacing a failed DataNode without data loss.
Hadoop Benchmarks
Benchmarks measure Hadoop performance.
Popular Benchmarks
| Benchmark | Purpose |
|---|---|
| TeraSort | Sorting performance |
| TestDFSIO (DFSIO) | HDFS read/write throughput |
| HiBench | Workload testing |
Example: Testing how fast Hadoop can sort 1 TB data.
Hadoop in the Cloud
Why Use Hadoop in Cloud?
- No hardware cost
- Easy scalability
- Pay-as-you-use
Cloud Hadoop Platforms
| Platform | Service |
|---|---|
| AWS | EMR |
| Google Cloud | Dataproc |
| Azure | HDInsight |
Real-Life Example: Startups using AWS EMR instead of building clusters.
Advantages of Cloud-Based Hadoop
| Advantage | Description |
|---|---|
| Scalability | Add/remove nodes easily |
| Cost-effective | Pay only for usage |
| High availability | Managed services |
| Faster setup | Minutes instead of weeks |
Exam-Ready Short Definitions
- Compression – Reducing data size.
- Serialization – Converting objects into bytes.
- Avro – Schema-based data format.
- Hadoop Cluster – Group of machines running Hadoop.
- Kerberos – Authentication protocol used by Hadoop.
Conclusion
Hadoop I/O techniques such as compression, serialization, and Avro improve storage efficiency and processing speed. A properly configured Hadoop environment with security, monitoring, and cloud deployment ensures scalable, reliable, and cost-effective Big Data processing.