HDFS – HADOOP DISTRIBUTED FILE SYSTEM
Introduction to HDFS
HDFS (Hadoop Distributed File System) is the storage system of Hadoop designed to store very large files across multiple machines in a reliable and fault-tolerant manner.
HDFS breaks big files into smaller blocks and stores them across many computers.
Real-Life Example: A 5 GB video is split and stored across many servers instead of one hard disk.
Design of HDFS
HDFS is designed with the following goals:
Key Design Principles
| Principle | Explanation |
|---|---|
| Scalability | Can store petabytes of data |
| Fault Tolerance | Works even if nodes fail |
| High Throughput | Optimized for streaming large files rather than low-latency access |
| Commodity Hardware | Uses low-cost systems |
| Write Once, Read Many | Files are not frequently modified |
HDFS Concepts
Main Components
| Component | Role |
|---|---|
| NameNode | Master node; stores metadata |
| DataNode | Stores actual data blocks |
| Secondary NameNode | Periodically merges the edit log into the file system image (checkpointing); not a standby or failover node |
| Client | Accesses HDFS |
Metadata Includes
- File name
- Block location
- Permissions
Benefits and Challenges of HDFS
Benefits
| Benefit | Description |
|---|---|
| Fault Tolerant | Data replication |
| High Scalability | Add nodes easily |
| Cost-Effective | Uses cheap hardware |
| Large File Support | Handles TB-PB files |
Challenges
| Challenge | Explanation |
|---|---|
| Small File Problem | Not efficient for small files |
| Not Real-Time | Batch processing |
| Single NameNode (older versions) | Single point of failure before NameNode high availability was added in Hadoop 2 |
| High Latency | Slow for small reads |
File Sizes, Block Sizes and Block Abstraction
- File Sizes: HDFS handles very large files (GB to PB).
- Block Size: Default block size = 128 MB
Block Abstraction
- Files are split into fixed-size blocks.
- Blocks stored independently.
Example: 300 MB file → 3 blocks (128 + 128 + 44 MB)
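The block arithmetic in the example above can be checked with a short Python sketch. The function name `split_into_blocks` is hypothetical and only illustrates the calculation; it is not part of any Hadoop API.

```python
MB = 1024 * 1024
DEFAULT_BLOCK_SIZE = 128 * MB  # default stated above; configurable per cluster


def split_into_blocks(file_size, block_size=DEFAULT_BLOCK_SIZE):
    """Return the size in bytes of each block for a file of file_size bytes."""
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        # The last block only occupies the space it actually needs.
        sizes.append(remainder)
    return sizes


if __name__ == "__main__":
    blocks = split_into_blocks(300 * MB)
    print([size // MB for size in blocks])  # [128, 128, 44]
```

Note that the final block is smaller than the block size: a 44 MB tail does not consume a full 128 MB on disk.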
Data Replication in HDFS
Replication means storing multiple copies of data blocks.
Replication Factor
- Default = 3
Benefits
- Fault tolerance
- Data availability
Example: One block stored on 3 different DataNodes.
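A minimal sketch of what a replication factor of 3 implies, assuming a simple round-robin placement. Real HDFS uses a rack-aware placement policy, so the hypothetical `place_replicas` function below only illustrates the rule that each replica lands on a different DataNode.

```python
def place_replicas(block_id, datanodes, replication_factor=3):
    """Pick distinct DataNodes for one block (round-robin, NOT rack-aware)."""
    if replication_factor > len(datanodes):
        raise ValueError("not enough DataNodes for the requested replication factor")
    start = block_id % len(datanodes)
    return [datanodes[(start + i) % len(datanodes)] for i in range(replication_factor)]


def raw_storage_needed(logical_bytes, replication_factor=3):
    """Every byte is stored replication_factor times across the cluster."""
    return logical_bytes * replication_factor


if __name__ == "__main__":
    nodes = ["dn1", "dn2", "dn3", "dn4"]  # hypothetical DataNode names
    print(place_replicas(0, nodes))
    print(raw_storage_needed(300))  # a 300 MB file needs 900 MB of raw disk
```

The storage cost is the trade-off for fault tolerance: with the default factor, usable capacity is roughly one third of raw capacity.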
How HDFS Stores Files
Write Process
- Client requests NameNode
- NameNode provides DataNode list
- Client writes data to DataNodes
- Blocks are replicated
Example: Uploading a video to HDFS.
How HDFS Reads Files
Read Process
- Client requests NameNode
- NameNode provides block locations
- Client reads from nearest DataNode
Example: Streaming stored log files.
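The write and read processes described above can be sketched as a toy in-memory model. Everything here (`ToyNameNode`, `write_file`, `read_file`, the DataNode names) is hypothetical and heavily simplified: the NameNode object holds only metadata, while the client moves the actual bytes to and from the DataNodes.

```python
class ToyNameNode:
    """Holds metadata only: which blocks make up a file and where they live."""

    def __init__(self, datanode_names, replication_factor=3):
        if replication_factor > len(datanode_names):
            raise ValueError("replication factor exceeds number of DataNodes")
        self.datanode_names = list(datanode_names)
        self.replication_factor = replication_factor
        self.block_map = {}  # path -> list of (block_id, [DataNode names])
        self.next_block_id = 0

    def create(self, path):
        self.block_map[path] = []

    def allocate_block(self, path):
        block_id = self.next_block_id
        self.next_block_id += 1
        count = len(self.datanode_names)
        # Simplified round-robin choice; real HDFS placement is rack-aware.
        targets = [self.datanode_names[(block_id + i) % count]
                   for i in range(self.replication_factor)]
        self.block_map[path].append((block_id, targets))
        return block_id, targets

    def get_block_locations(self, path):
        return list(self.block_map[path])


def write_file(namenode, datanodes, path, data, block_size):
    """Client asks the NameNode for targets, then writes each block to them."""
    namenode.create(path)
    for offset in range(0, len(data), block_size):
        block_id, targets = namenode.allocate_block(path)
        for name in targets:
            datanodes[name][block_id] = data[offset:offset + block_size]


def read_file(namenode, datanodes, path):
    """Client asks the NameNode for locations, then reads from a live replica."""
    parts = []
    for block_id, locations in namenode.get_block_locations(path):
        for name in locations:
            if block_id in datanodes[name]:
                parts.append(datanodes[name][block_id])
                break
        else:
            raise IOError(f"all replicas of block {block_id} are unavailable")
    return b"".join(parts)


if __name__ == "__main__":
    datanodes = {name: {} for name in ["dn1", "dn2", "dn3", "dn4"]}
    namenode = ToyNameNode(datanodes)
    write_file(namenode, datanodes, "/logs/app.log", b"abcdefghij", block_size=4)
    datanodes["dn1"].clear()  # simulate a failed DataNode
    print(read_file(namenode, datanodes, "/logs/app.log"))
```

Because every block has three replicas, the read still succeeds after one DataNode loses its data.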
How HDFS Writes Files
Write Flow
- File divided into blocks
- Blocks written sequentially
- Replication applied
- Metadata updated
Java Interfaces to HDFS
HDFS provides Java APIs to interact with files.
Common Classes
| Class | Purpose |
|---|---|
| FileSystem | Access HDFS |
| Path | File path |
| FSDataInputStream | Read data |
| FSDataOutputStream | Write data |
Example: Java program to upload files to HDFS.
Command Line Interface (CLI)
HDFS provides command-line tools.
Common Commands
| Command | Description |
|---|---|
| hdfs dfs -ls | List files |
| hdfs dfs -put | Upload file |
| hdfs dfs -get | Download file |
| hdfs dfs -rm | Delete file |
| hdfs dfs -df | Show file system capacity and free space (use -du for per-file disk usage) |
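For scripting, the commands in the table can be wrapped in a small Python helper. This is a sketch under assumptions: the helper names are hypothetical, the file paths in the demo are made up, and `run_hdfs_command` only works on a machine where the `hdfs` executable is installed and on the PATH.

```python
import shutil
import subprocess

# Only the actions listed in the table above are accepted by this sketch.
ALLOWED_ACTIONS = {"ls", "put", "get", "rm", "df"}


def build_hdfs_command(action, *args):
    """Build the argument list for an `hdfs dfs` command without running it."""
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ["hdfs", "dfs", f"-{action}", *args]


def run_hdfs_command(action, *args):
    """Run the command and return its standard output (needs a Hadoop install)."""
    if shutil.which("hdfs") is None:
        raise RuntimeError("the hdfs executable was not found on PATH")
    result = subprocess.run(build_hdfs_command(action, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # Hypothetical local file and HDFS directory, printed rather than executed.
    print(" ".join(build_hdfs_command("put", "video.mp4", "/user/demo/")))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with paths that contain spaces.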
Hadoop File System Interfaces
Hadoop supports multiple file systems.
| File System | Description |
|---|---|
| HDFS | Distributed file system |
| Local FS | Local disk |
| HBase | NoSQL database that stores its data on HDFS (not a file system itself) |
| S3 FS | Cloud storage |
Data Flow in HDFS
Data Flow Steps
- Client → NameNode → DataNodes
- DataNodes → Client (read)
Key Feature
- Data locality (processing near data)
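Data locality can be illustrated with a simplified distance measure: 0 for the same node, 2 for another node on the same rack, 4 for a node on a different rack. The sketch below assumes each machine is described by a hypothetical `(rack, node)` tuple and picks the closest replica for a reader.

```python
def network_distance(a, b):
    """Simplified topology distance between two (rack, node) locations."""
    if a == b:
        return 0  # same machine: data is local
    if a[0] == b[0]:
        return 2  # same rack: one switch away
    return 4      # different rack: traffic crosses racks


def nearest_replica(client, replicas):
    """Return the replica location with the smallest distance to the client."""
    return min(replicas, key=lambda replica: network_distance(client, replica))


if __name__ == "__main__":
    client = ("rack1", "node1")
    replicas = [("rack2", "node5"), ("rack1", "node2"), ("rack1", "node1")]
    print(nearest_replica(client, replicas))  # ('rack1', 'node1')
```

The same idea drives task scheduling: running a computation on a node that already holds the block avoids moving the data over the network.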
Data Ingest with Flume and Sqoop
Apache Flume
Used to ingest streaming data.
| Feature | Use |
|---|---|
| Real-time ingestion | Logs |
| Reliable | Event-based |
Example: Collecting web server logs.
Apache Sqoop
Used to import/export RDBMS data.
| Feature | Use |
|---|---|
| Structured data | MySQL, Oracle |
| Bulk transfer | Fast ingestion |
Example: Importing student database from MySQL to HDFS.
Hadoop Archives (HAR)
HAR reduces small file problems.
Purpose
- Combine small files into one archive
- Reduce NameNode load
Example: Thousands of small images merged into one HAR file.
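A rough estimate shows why many small files strain the NameNode. The sketch assumes the commonly quoted rule of thumb of about 150 bytes of NameNode memory per file or block object; that figure is an approximation, not an exact number, and a real HAR also adds a few index files that are ignored here.

```python
import math

BYTES_PER_OBJECT = 150          # ASSUMPTION: rule-of-thumb cost per file or block
BLOCK_SIZE = 128 * 1024 * 1024  # default block size


def namenode_memory_bytes(num_files, file_size, block_size=BLOCK_SIZE):
    """Estimate NameNode memory for num_files files of file_size bytes each."""
    blocks_per_file = math.ceil(file_size / block_size)
    objects = num_files * (1 + blocks_per_file)  # one file object plus its blocks
    return objects * BYTES_PER_OBJECT


if __name__ == "__main__":
    small_file = 100 * 1024  # 100 KB images
    count = 10_000_000
    separate = namenode_memory_bytes(count, small_file)
    packed = namenode_memory_bytes(1, count * small_file)  # same data in one big file
    print(f"stored separately: {separate / 1e9:.1f} GB of NameNode memory")
    print(f"packed together:   {packed / 1e6:.2f} MB of NameNode memory")
```

The data volume is identical in both cases; only the number of objects the NameNode must track changes.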
Exam-Ready Short Definitions
- HDFS – Distributed storage system of Hadoop.
- NameNode – Stores metadata.
- DataNode – Stores actual data.
- Replication – Multiple copies of data.
- Block Size – Size of the fixed chunks a file is split into (128 MB by default).
Hadoop I/O (Input / Output)
Hadoop I/O deals with how data is stored, transferred, compressed, and processed efficiently in Hadoop.
Main I/O concepts:
- Compression
- Serialization
- Avro
- File-based data structures
Compression in Hadoop
Compression reduces the size of data files so that:
- Less storage is used
- Data transfer is faster
- Network cost is reduced
Why Compression is Important in Hadoop
- Hadoop processes huge data
- Smaller files → faster MapReduce jobs
Common Compression Techniques
| Compression Type | Description |
|---|---|
| Gzip | Good compression, moderate speed, not splittable |
| Bzip2 | High compression, slow, splittable |
| Snappy | Very fast, lower compression, not splittable on its own |
| LZO | Fast, lower compression, splittable only when indexed |
Real-Life Example: Zipping a large folder before emailing it.
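The size reduction can be observed with Python's standard library, which includes gzip and bzip2 codecs. Snappy and LZO need third-party packages and are left out of this sketch. The sample log line is made up, and actual ratios depend heavily on the data.

```python
import bz2
import gzip


def compression_report(data):
    """Return the original size and the compressed size for each codec."""
    codecs = {
        "gzip": (gzip.compress, gzip.decompress),
        "bzip2": (bz2.compress, bz2.decompress),
    }
    report = {"original": len(data)}
    for name, (compress, decompress) in codecs.items():
        packed = compress(data)
        # Compression must be lossless: decompressing restores the input.
        if decompress(packed) != data:
            raise RuntimeError(f"{name} round trip failed")
        report[name] = len(packed)
    return report


if __name__ == "__main__":
    # Hypothetical, highly repetitive log data compresses very well.
    sample = b"2024-01-01 INFO request served in 12 ms\n" * 50_000
    for name, size in compression_report(sample).items():
        print(f"{name:>8}: {size} bytes")
```

Repetitive text such as server logs tends to shrink dramatically, which is why compression is a common first optimization for Hadoop jobs.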
Serialization in Hadoop
Serialization converts objects into byte streams so they can be:
- Stored in HDFS
- Transferred over network
Why Serialization is Needed
- Faster data exchange
- Less memory usage
Serialization in Hadoop
- Hadoop uses the Writable interface
- More compact and faster than default Java serialization
Example: Converting student objects into bytes for storage.
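A Writable-style object writes its fields to a byte stream in a fixed order and reads them back in the same order. The Python sketch below imitates that pattern with the `struct` module; the `Student` class and its fields are hypothetical, and the byte layout is similar in spirit to Hadoop's Writable types rather than byte-compatible with them.

```python
import io
import struct
from dataclasses import dataclass


@dataclass
class Student:
    roll_no: int
    name: str
    marks: float

    def write(self, out):
        """Write fields in a fixed order as big-endian binary."""
        name_bytes = self.name.encode("utf-8")
        out.write(struct.pack(">i", self.roll_no))      # 4-byte int
        out.write(struct.pack(">H", len(name_bytes)))   # 2-byte length prefix
        out.write(name_bytes)
        out.write(struct.pack(">d", self.marks))        # 8-byte double

    @classmethod
    def read_fields(cls, src):
        """Read fields back in exactly the order they were written."""
        (roll_no,) = struct.unpack(">i", src.read(4))
        (name_len,) = struct.unpack(">H", src.read(2))
        name = src.read(name_len).decode("utf-8")
        (marks,) = struct.unpack(">d", src.read(8))
        return cls(roll_no, name, marks)


if __name__ == "__main__":
    buffer = io.BytesIO()
    Student(42, "Asha", 91.5).write(buffer)
    print(len(buffer.getvalue()))  # 18 bytes: 4 + 2 + 4 + 8
    buffer.seek(0)
    print(Student.read_fields(buffer))
```

No class names or field names are stored in the stream, which is what keeps this kind of format compact compared with general-purpose object serialization.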
Apache Avro
Avro is a row-based data serialization framework used in Hadoop.
Key Features of Avro
- Schema stored with data
- Language independent
- Compact and fast
Advantages
| Feature | Benefit |
|---|---|
| Schema evolution | Easy updates |
| Compact format | Less storage |
| Fast processing | High performance |
Real-Life Example: Sharing structured data between Java and Python programs.
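Avro schemas are written in JSON, and schema evolution relies on default values: when a reader's schema has a field that old data lacks, the default fills the gap. The sketch below is a toy illustration in plain Python and does not use the Avro library; the `Student` schema and the `resolve` function are hypothetical.

```python
import json

# Schema the data was originally written with.
WRITER_SCHEMA = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "roll_no", "type": "int"},
        {"name": "name", "type": "string"},
    ],
}

# Newer schema: adds an optional field with a default value.
READER_SCHEMA = {
    "type": "record",
    "name": "Student",
    "fields": [
        {"name": "roll_no", "type": "int"},
        {"name": "name", "type": "string"},
        {"name": "email", "type": ["null", "string"], "default": None},
    ],
}


def resolve(record, reader_schema):
    """Toy schema resolution: keep known fields, fill missing ones from defaults."""
    resolved = {}
    for field in reader_schema["fields"]:
        name = field["name"]
        if name in record:
            resolved[name] = record[name]
        elif "default" in field:
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


if __name__ == "__main__":
    old_record = {"roll_no": 1, "name": "Asha"}  # written with WRITER_SCHEMA
    print(json.dumps(READER_SCHEMA, indent=2))
    print(resolve(old_record, READER_SCHEMA))
```

Fields that the reader's schema no longer lists are simply dropped, so old and new programs can keep exchanging data while the schema changes.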
File-Based Data Structures in Hadoop
Hadoop provides file formats optimized for Big Data.
Common File-Based Structures
| File Type | Use |
|---|---|
| Sequence File | Binary key-value pairs |
| Map File | Indexed sequence file |
| Avro File | Structured row-based data |
| Parquet | Column-based analytics |
| ORC | Optimized column storage |
Example: Parquet files used in analytics queries for faster results.
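The advantage of column-based formats for analytics can be shown with a toy layout comparison. This sketch uses plain Python lists with made-up student data and is not the Parquet or ORC format itself; it only illustrates why a query over one column touches less data in a columnar layout.

```python
# Row-based layout: each record keeps all of its fields together.
ROWS = [
    {"student": "Asha", "city": "Pune", "marks": 91},
    {"student": "Ravi", "city": "Delhi", "marks": 78},
    {"student": "Meena", "city": "Chennai", "marks": 85},
]


def to_columns(rows):
    """Convert row-based records into a column-based layout."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def average(columns, name):
    """An analytics query that only needs to read a single column."""
    values = columns[name]
    return sum(values) / len(values)


if __name__ == "__main__":
    columns = to_columns(ROWS)
    print(columns["marks"])  # [91, 78, 85]
    print(average(columns, "marks"))
```

In the row layout, computing the average would mean scanning every field of every record; in the column layout, the `student` and `city` values are never read.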
Hadoop Environment
The Hadoop environment includes:
- Hardware
- Software
- Configuration
- Security
- Monitoring
Setting Up a Hadoop Cluster
Types of Hadoop Clusters
| Mode | Description |
|---|---|
| Standalone | Single machine |
| Pseudo-distributed | One machine, multi-daemon |
| Fully distributed | Multiple machines |
Real-Life Example
- Small college lab → Pseudo-distributed
- Large company → Fully distributed
Cluster Specification
Key Specifications
| Component | Requirement |
|---|---|
| CPU | Multi-core processors |
| RAM | 8 GB or more |
| Storage | High-capacity HDD/SSD |
| Network | High bandwidth |
| OS | Linux preferred |
Cluster Setup and Installation
Installation Steps
- Install Java
- Install Hadoop
- Configure SSH
- Set environment variables
- Configure core-site.xml, hdfs-site.xml
- Format NameNode
- Start Hadoop services
Example: Installing Hadoop on Linux virtual machines.
Hadoop Configuration
Main Configuration Files
| File | Purpose |
|---|---|
| core-site.xml | Core settings |
| hdfs-site.xml | HDFS properties |
| yarn-site.xml | Resource management |
| mapred-site.xml | MapReduce settings |
Example: Setting block size or replication factor.
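Hadoop configuration files are XML documents made of `<property>` entries with a name and a value. The sketch below parses a small hdfs-site.xml fragment with Python's standard library; `dfs.replication` and `dfs.blocksize` are the HDFS property names for the replication factor and block size, while the values and the helper name `load_properties` are illustrative.

```python
import xml.etree.ElementTree as ET

# Illustrative hdfs-site.xml content; the values are examples, not recommendations.
HDFS_SITE = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>3</value>
  </property>
  <property>
    <name>dfs.blocksize</name>
    <value>134217728</value>
  </property>
</configuration>
"""


def load_properties(xml_text):
    """Return a dict mapping each property name to its (string) value."""
    root = ET.fromstring(xml_text)
    return {prop.findtext("name"): prop.findtext("value")
            for prop in root.iter("property")}


if __name__ == "__main__":
    props = load_properties(HDFS_SITE)
    print("replication factor:", props["dfs.replication"])
    print("block size in MB:", int(props["dfs.blocksize"]) // (1024 * 1024))
```

The block size is given in bytes here, so 134217728 corresponds to the 128 MB default.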
Security in Hadoop
Why Security is Needed
- Sensitive data
- Multiple users
- Distributed access
Security Features
| Feature | Purpose |
|---|---|
| Kerberos | Authentication |
| Access Control Lists | Authorization |
| Encryption | Data protection |
| Audit logs | Tracking access |
Example: Only authorized employees can access customer data.
Administering Hadoop
Hadoop Administration Tasks
- User management
- Resource allocation
- Job monitoring
- Backup & recovery
- Log management
Real-Life Example: Hadoop admin manages cluster health in an IT company.
HDFS Monitoring & Maintenance
Monitoring Tools
- Web UI
- Logs
- Metrics
- Alerts
Maintenance Activities
- Disk replacement
- Node addition/removal
- Data balancing
Example: Replacing a failed DataNode without data loss.
Hadoop Benchmarks
Benchmarks measure Hadoop performance.
Popular Benchmarks
| Benchmark | Purpose |
|---|---|
| TeraSort | Sorting performance |
| TestDFSIO (DFSIO) | HDFS read/write throughput |
| HiBench | Suite of mixed workload tests |
Example: Testing how fast Hadoop can sort 1 TB data.
Hadoop in the Cloud
Why Use Hadoop in Cloud?
- No hardware cost
- Easy scalability
- Pay-as-you-use
Cloud Hadoop Platforms
| Platform | Service |
|---|---|
| AWS | EMR |
| Google Cloud | Dataproc |
| Azure | HDInsight |
Real-Life Example: Startups using AWS EMR instead of building clusters.
Advantages of Cloud-Based Hadoop
| Advantage | Description |
|---|---|
| Scalability | Add/remove nodes easily |
| Cost-effective | Pay only for usage |
| High availability | Managed services |
| Faster setup | Minutes instead of weeks |
Exam-Ready Short Definitions
- Compression – Reducing data size.
- Serialization – Converting objects into bytes.
- Avro – Schema-based data format.
- Hadoop Cluster – Group of machines running Hadoop.
- Kerberos – Network authentication protocol that Hadoop uses to verify users and services.
Conclusion
Hadoop I/O techniques such as compression, serialization, and Avro improve storage efficiency and processing speed. A properly configured Hadoop environment with security, monitoring, and cloud deployment ensures scalable, reliable, and cost-effective Big Data processing.