Hadoop Ecosystem and YARN, NoSQL, Spark, Scala
Hadoop Ecosystem
The Hadoop Ecosystem is a collection of tools and frameworks built around Apache Hadoop to store, process, analyze, and manage big data efficiently.
Why Hadoop Ecosystem is Needed
- HDFS + MapReduce alone are not sufficient for all big data needs
- Tasks such as data ingestion, querying, streaming, security, and coordination each require specialized tools
Major Components of Hadoop Ecosystem
| Category | Tool | Purpose |
|---|---|---|
| Storage | HDFS | Distributed file storage |
| Resource Management | YARN | Cluster resource management |
| Processing | MapReduce, Spark | Data processing |
| SQL on Hadoop | Hive | Query big data using SQL |
| NoSQL DB | HBase | Real-time read/write access |
| Data Ingestion | Flume | Collect log/streaming data |
| Data Transfer | Sqoop | Import/export data from RDBMS |
| Workflow | Oozie | Job scheduling |
| Coordination | ZooKeeper | Distributed coordination |
| Serialization | Avro | Data serialization |
| Metadata | Hive Metastore | Table and schema info |
| Security | Knox, Ranger | Authentication & authorization |
Hadoop Schedulers
Schedulers decide how cluster resources are allocated among jobs.
Types of Schedulers in Hadoop
1. FIFO Scheduler
- Jobs execute in submission order
- Simple but inefficient
- Not suitable for multi-user clusters
2. Fair Scheduler
- Resources are shared equally among running jobs
- Each job gets a fair share of CPU and memory
Features
- Improves cluster utilization
- Supports job priorities
- Suitable for multi-user environments
3. Capacity Scheduler
- Cluster is divided into queues
- Each queue has guaranteed capacity
Features
- Enterprise-level scheduler
- Ensures minimum resources for each team
- Supports multiple users and organizations
Fair vs Capacity Scheduler
| Feature | Fair Scheduler | Capacity Scheduler |
|---|---|---|
| Resource Sharing | Equal | Queue-based |
| Best For | Small clusters | Large enterprises |
| Complexity | Medium | High |
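The difference between the two sharing models can be sketched with a toy calculation. This is only an illustrative model, not the actual YARN scheduler code: the names `SchedulerSketch`, `fairShare`, and `queueGuarantees` are hypothetical, and the sketch ignores priorities, weights, preemption, and queue elasticity.

```scala
object SchedulerSketch {
  // Fair sharing (toy model): every running job gets an equal slice of the cluster.
  def fairShare(totalMemoryMb: Int, runningJobs: Seq[String]): Map[String, Int] =
    if (runningJobs.isEmpty) Map.empty
    else runningJobs.map(job => job -> totalMemoryMb / runningJobs.size).toMap

  // Capacity scheduling (toy model): every queue is guaranteed a fixed
  // percentage of the cluster, regardless of how many jobs other queues run.
  def queueGuarantees(totalMemoryMb: Int, queuePercent: Map[String, Int]): Map[String, Int] = {
    require(queuePercent.values.sum <= 100, "queue capacities must not exceed 100%")
    queuePercent.map { case (queue, percent) => queue -> totalMemoryMb * percent / 100 }
  }
}
```

With three running jobs on a 12,000 MB cluster, `fairShare` gives each job 4,000 MB, while `queueGuarantees` keeps each team's minimum fixed no matter how many jobs are submitted elsewhere.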
Hadoop 2.0 – New Features
Hadoop 2.0 introduced major architectural improvements.
NameNode High Availability (HA)
Problem in Hadoop 1.x
- Single NameNode = Single Point of Failure
Solution in Hadoop 2.0
Two NameNodes:
- Active NameNode
- Standby NameNode
How It Works
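- The Active NameNode writes edit-log changes to a group of JournalNodes; the Standby reads them to keep its metadata in sync
- If the Active fails, the Standby takes over (automatically when ZooKeeper-based failover is configured, otherwise by manual failover)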
Benefit
- Improved reliability
- Minimal downtime when a NameNode fails or is taken down for maintenance
HDFS Federation
What is HDFS Federation?
- Multiple independent NameNodes
- Each NameNode manages its own namespace
Why Federation?
Single NameNode cannot handle:
- Large metadata
- High traffic
Benefits
- Scalability
- Better performance
- Namespace isolation
MRv2 (MapReduce Version 2)
Limitation of MRv1
JobTracker handled:
- Resource management
- Job scheduling
- Monitoring
Problem
- Overloaded JobTracker
- Scalability issues
YARN (Yet Another Resource Negotiator)
YARN is the cluster resource management layer introduced in Hadoop 2.x.
What YARN Does
- Separates resource management from data processing
Key Components of YARN
| Component | Role |
|---|---|
| ResourceManager (RM) | Global resource manager |
| NodeManager (NM) | Manages node resources |
| ApplicationMaster (AM) | Manages application lifecycle |
| Containers | Allocated resources |
How YARN Works (Simple Flow)
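- Client submits the job to the ResourceManager
- ResourceManager allocates a container to launch the ApplicationMaster
- ApplicationMaster starts and registers with the ResourceManager
- ApplicationMaster requests containers for the job's tasks
- NodeManagers launch those containers and execute the tasks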
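The flow above can be modeled in a few lines of Scala. This is a toy simulation under stated assumptions, not the YARN API: `YarnFlowSketch`, `Container`, `ResourceManager`, and `runJob` are hypothetical names, and containers are simply handed out round-robin across nodes.

```scala
object YarnFlowSketch {
  // A container is modeled as an id, the node it runs on, and what runs inside it.
  final case class Container(id: Int, node: String, role: String)

  // Toy ResourceManager: hands out containers round-robin across the given nodes.
  final class ResourceManager(nodes: Vector[String]) {
    private var nextId = 0
    def allocate(role: String): Container = {
      val container = Container(nextId, nodes(nextId % nodes.size), role)
      nextId += 1
      container
    }
  }

  // The first container hosts the ApplicationMaster; the ApplicationMaster
  // then asks for one container per task.
  def runJob(rm: ResourceManager, taskCount: Int): Vector[Container] = {
    val appMaster = rm.allocate("ApplicationMaster")
    val tasks = Vector.tabulate(taskCount)(i => rm.allocate(s"task-$i"))
    appMaster +: tasks
  }
}
```

Calling `runJob` with a two-node ResourceManager and three tasks yields four containers: one for the ApplicationMaster and three for tasks, spread across both nodes.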
Advantages of YARN
- Supports multiple processing frameworks
- Better scalability
- Efficient resource utilization
- Enables Spark, Tez, Flink on Hadoop
Running MRv1 in YARN
Is it Possible? Yes.
How?
- MapReduce runs as an application on YARN
JobTracker replaced by:
- ResourceManager
- ApplicationMaster
Benefit
- Backward compatibility
- Smooth migration from Hadoop 1.x to 2.x
Summary Table
| Feature | Hadoop 1.x | Hadoop 2.x |
|---|---|---|
| Resource Mgmt | JobTracker | YARN |
| Scalability | Limited | High |
| HA | No | Yes |
| Multi-Framework | No | Yes |
Real-World Example
E-commerce Company
- HDFS stores user logs
- Sqoop imports order data
- Hive analyzes sales
- YARN allocates resources
- Spark runs near-real-time analytics
Scala (Scalable Language)
Scala is a high-level programming language that combines:
- Object-Oriented Programming (OOP)
- Functional Programming (FP)
Scala runs on the Java Virtual Machine (JVM) and is widely used with Apache Spark.
Why Scala is Important
- Short and clean syntax
- Faster development
- Native language of Spark (Spark itself is written in Scala)
Real-Life Example: Writing big data programs in Spark using fewer lines of code than Java.
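As a rough illustration of that conciseness, here is a word count written with plain Scala collections. It is a sketch that assumes no Spark dependency; the Spark API uses similarly named operations (such as flatMap and map) on distributed data, but this code runs on an ordinary in-memory Seq. The name `WordCountSketch` is hypothetical.

```scala
object WordCountSketch {
  // Count how often each word occurs, ignoring case and extra whitespace.
  def wordCount(lines: Seq[String]): Map[String, Int] =
    lines
      .flatMap(_.toLowerCase.split("\\s+")) // split every line into words
      .filter(_.nonEmpty)                   // drop empty tokens from blank lines
      .groupBy(identity)                    // group identical words together
      .map { case (word, occurrences) => word -> occurrences.size }
}
```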
Features of Scala
| Feature | Explanation |
|---|---|
| JVM Based | Runs on JVM |
| Functional | Supports lambda functions |
| OOP | Uses classes & objects |
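| Type Inference | Compiler infers most types, so explicit declarations are often optional |
| Immutability | Values declared with val and the default collections cannot be changed after creation |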
Classes and Objects
Class
A class is a blueprint to create objects.
Object
An object, declared with the object keyword, is a singleton: exactly one instance exists, and it is created without using new.
Difference Between Class and Object
| Class | Object |
|---|---|
| Blueprint | Instance |
| Multiple objects | Single instance |
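A minimal sketch of the difference, using hypothetical names (`Counter`, `IdGenerator`): the class can be instantiated many times, each instance with its own state, while the object exists exactly once.

```scala
// A class is a blueprint: every `new Counter(...)` creates a separate instance.
class Counter(start: Int) {
  private var value = start
  def increment(): Int = { value += 1; value }
}

// An `object` is a singleton: one shared instance, used directly by name.
object IdGenerator {
  private var lastId = 0
  def nextId(): Int = { lastId += 1; lastId }
}

object ClassVsObjectDemo {
  def main(args: Array[String]): Unit = {
    val a = new Counter(0)
    val b = new Counter(10)
    println(a.increment())        // a and b keep independent state
    println(b.increment())
    println(IdGenerator.nextId()) // no `new`: there is only one IdGenerator
  }
}
```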
Basic Types and Operators
Basic Data Types in Scala
| Type | Example |
|---|---|
| Int | 10 |
| Double | 10.5 |
| Boolean | true |
| Char | 'A' |
| String | "Scala" |
Operators
| Operator Type | Example |
|---|---|
| Arithmetic | +, -, *, / |
| Relational | >, <, >=, <=, ==, != |
| Logical | &&, \|\|, ! |
| Assignment | = |
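The types and operators above can be tried out together in a short sketch. All names here (`TypesAndOperators`, `count`, `price`, and so on) are illustrative only.

```scala
object TypesAndOperators {
  val count: Int = 10      // explicit type annotation
  val price = 10.5         // inferred as Double
  val active = true        // inferred as Boolean
  val grade = 'A'          // inferred as Char
  val language = "Scala"   // inferred as String

  // Arithmetic: the Int is widened to Double before multiplying.
  val total: Double = count * price

  // Relational: comparison produces a Boolean.
  val isExpensive: Boolean = total >= 100.0

  // Logical: && binds tighter than ||.
  val shouldShip: Boolean = active && !isExpensive || grade == 'A'

  // Assignment: a `var` can be reassigned, a `val` cannot.
  var stock = 5
  stock = stock - 1
}
```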
Built-In Control Structures
1. If-Else
2. While Loop
3. For Loop
4. Match Case (Switch Alternative)
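A short sketch of each of the four structures follows. The function names (`sign`, `sumWhile`, `squares`, `dayType`) are made up for illustration.

```scala
object ControlStructures {
  // 1. If-Else is an expression in Scala, so it returns a value.
  def sign(n: Int): String =
    if (n > 0) "positive" else if (n < 0) "negative" else "zero"

  // 2. While loop: add up 1..n using mutable variables.
  def sumWhile(n: Int): Int = {
    var i = 1
    var total = 0
    while (i <= n) {
      total += i
      i += 1
    }
    total
  }

  // 3. For loop: `yield` collects one result per iteration into a new sequence.
  def squares(n: Int): Seq[Int] =
    for (i <- 1 to n) yield i * i

  // 4. Match: patterns are tried top to bottom; `_` is the default case.
  def dayType(day: String): String = day match {
    case "Sat" | "Sun" => "weekend"
    case _             => "weekday"
  }
}
```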
Functions and Closures
Function
A function is a block of code that performs a task.
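A minimal sketch of defining and calling named functions with def; the names `add` and `greet` are illustrative.

```scala
object FunctionDemo {
  // def name(parameter: Type, ...): ReturnType = body
  def add(a: Int, b: Int): Int = a + b

  // Parameters may have default values that the caller can omit.
  def greet(name: String, greeting: String = "Hello"): String =
    s"$greeting, $name!"
}
```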
Anonymous Function (Lambda)
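An anonymous function has no name and is usually written inline with the => arrow. A small sketch, with illustrative names:

```scala
object LambdaDemo {
  // A lambda stored in a value of function type Int => Int.
  val twice: Int => Int = x => x * 2

  val numbers = List(1, 2, 3)

  // Lambdas are most often passed directly to collection methods.
  val doubled = numbers.map(twice)       // List(2, 4, 6)
  val evens = numbers.filter(_ % 2 == 0) // `_` is shorthand for the single argument
}
```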
Closure
A closure is a function that uses variables outside its scope.
Real-Life Example: Using external configuration values in Spark functions.
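A minimal sketch of a closure in plain Scala (no Spark dependency assumed). The configuration value `taxRate` and the other names are hypothetical; the point is that the function body refers to a variable that is not one of its parameters.

```scala
object ClosureDemo {
  // Hypothetical configuration value defined outside the function.
  val taxRate = 0.25

  // `withTax` closes over `taxRate`: it uses a variable that is not a parameter.
  val withTax: Double => Double = price => price + price * taxRate

  // A function factory: each returned function remembers its own `factor`.
  def multiplier(factor: Int): Int => Int = x => x * factor
}
```

Here `ClosureDemo.multiplier(3)` returns a function that still remembers `factor = 3` after `multiplier` itself has returned.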
Inheritance in Scala
Inheritance allows a class to reuse properties of another class.
Example
Usage
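The original listing is not reproduced here, so the following is a small illustrative sketch with hypothetical class names (`Animal`, `Dog`). The child class reuses `name` and `describe` from the parent and overrides only `sound`; the `main` method shows the usage.

```scala
// Parent class: defines shared state and behavior.
class Animal(val name: String) {
  def sound: String = "..."
  def describe: String = s"$name says $sound"
}

// Child class: `extends` inherits everything from Animal,
// `override` replaces only the `sound` method.
class Dog(name: String) extends Animal(name) {
  override def sound: String = "Woof"
}

object InheritanceDemo {
  def main(args: Array[String]): Unit = {
    val pet: Animal = new Dog("Rex") // a Dog can be used wherever an Animal is expected
    println(pet.describe)            // prints "Rex says Woof"
  }
}
```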
Advantages of Inheritance
- Code reuse
- Easy maintenance
- Hierarchical design
Scala vs Java
| Feature | Scala | Java |
|---|---|---|
| Lines of Code | Fewer | More |
| Functional Support | Yes | Limited |
| Syntax | Simple | Verbose |
| Use with Spark | Native (Spark is written in Scala) | Supported, but more verbose |
Exam-Ready Short Notes
- Scala – JVM-based language combining OOP & FP
- Class – Blueprint of object
- Object – Singleton instance
- Closure – Function using external variables
- Inheritance – Reuse of parent class