Hadoop Ecosystem and YARN, NoSQL, Spark, Scala



Hadoop Ecosystem

The Hadoop Ecosystem is a collection of tools and frameworks built around Apache Hadoop to store, process, analyze, and manage big data efficiently.


Why Hadoop Ecosystem is Needed

  • HDFS + MapReduce alone are not sufficient for all big data needs
  • Different tasks like data ingestion, querying, streaming, security, coordination require specialized tools

Major Components of Hadoop Ecosystem

Category | Tool | Purpose
Storage | HDFS | Distributed file storage
Resource Management | YARN | Cluster resource management
Processing | MapReduce, Spark | Data processing
SQL on Hadoop | Hive | Query big data using SQL
NoSQL DB | HBase | Real-time read/write access
Data Ingestion | Flume | Collect log/streaming data
Data Transfer | Sqoop | Import/export data from RDBMS
Workflow | Oozie | Job scheduling
Coordination | ZooKeeper | Distributed coordination
Serialization | Avro | Data serialization
Metadata | Hive Metastore | Table and schema info
Security | Knox, Ranger | Authentication & authorization

Hadoop Schedulers

Schedulers decide how cluster resources are allocated among jobs.

Types of Schedulers in Hadoop

1. FIFO Scheduler

  • Jobs execute in submission order
  • Simple but inefficient
  • Not suitable for multi-user clusters

2. Fair Scheduler

  • Resources are shared equally among running jobs
  • Each job gets a fair share of CPU and memory

Features

  • Improves cluster utilization
  • Supports job priorities
  • Suitable for multi-user environments

3. Capacity Scheduler

  • Cluster is divided into queues
  • Each queue has guaranteed capacity

Features

  • Enterprise-level scheduler
  • Ensures minimum resources for each team
  • Supports multiple users and organizations

Fair vs Capacity Scheduler

Feature | Fair Scheduler | Capacity Scheduler
Resource Sharing | Equal among running jobs | Queue-based
Best For | Small clusters | Large enterprises
Complexity | Medium | High

Hadoop 2.0 – New Features

Hadoop 2.0 introduced major architectural improvements.

NameNode High Availability (HA)

Problem in Hadoop 1.x

  • Single NameNode = Single Point of Failure

Solution in Hadoop 2.0

Two NameNodes:

  • Active NameNode
  • Standby NameNode

How It Works

  • The Active NameNode writes its edit log to a quorum of JournalNodes, and the Standby replays it to stay in sync
  • If the Active fails, the Standby takes over automatically

Benefit

  • Improved reliability
  • Minimal downtime during failover

HDFS Federation

What is HDFS Federation?

  • Multiple independent NameNodes
  • Each NameNode manages its own namespace

Why Federation?

Single NameNode cannot handle:

  • Large metadata
  • High traffic

Benefits

  • Scalability
  • Better performance
  • Namespace isolation

MRv2 (MapReduce Version 2)

Limitation of MRv1

JobTracker handled:

  • Resource management
  • Job scheduling
  • Monitoring

Problem

  • Overloaded JobTracker
  • Scalability issues

YARN (Yet Another Resource Negotiator)

YARN is the core component of Hadoop 2.x.

What YARN Does

  • Separates resource management from data processing

Key Components of YARN

Component | Role
ResourceManager (RM) | Global resource manager
NodeManager (NM) | Manages resources on each node
ApplicationMaster (AM) | Manages the application lifecycle
Containers | Allocated bundles of CPU and memory

How YARN Works (Simple Flow)

  • Client submits a job
  • ResourceManager allocates a container for the ApplicationMaster
  • ApplicationMaster starts and requests containers for tasks
  • NodeManagers launch the containers and execute the tasks
  • Results are reported back to the client

Advantages of YARN

  • Supports multiple processing frameworks
  • Better scalability
  • Efficient resource utilization
  • Enables Spark, Tez, Flink on Hadoop

Running MRv1 in YARN

Is it Possible? Yes.

How?

  • MapReduce runs as an application on YARN

JobTracker replaced by:

  • ResourceManager
  • ApplicationMaster

Benefit

  • Backward compatibility
  • Smooth migration from Hadoop 1.x to 2.x

Summary Table

Feature | Hadoop 1.x | Hadoop 2.x
Resource Mgmt | JobTracker | YARN
Scalability | Limited | High
HA | No | Yes
Multi-Framework | No | Yes

Real-World Example

E-commerce Company

  • HDFS stores user logs
  • Sqoop imports order data
  • Hive analyzes sales
  • YARN allocates resources
  • Spark runs real-time analytics

NoSQL Databases & MongoDB

NoSQL (Not Only SQL) databases are used to store large volumes of unstructured or semi-structured data efficiently.

Why NoSQL?

Traditional relational databases:

  • Become slow at very large data volumes
  • Require fixed schema
  • Do not scale easily

Key Features of NoSQL

  • Schema-less
  • Horizontally scalable
  • High performance
  • Suitable for Big Data

Types of NoSQL Databases

Type | Example | Use Case
Document-based | MongoDB | JSON-like documents
Key-Value | Redis | Caching
Column-based | Cassandra | Large-scale analytics
Graph-based | Neo4j | Social networks

Introduction to MongoDB

MongoDB is a document-oriented NoSQL database that stores data in BSON (Binary JSON) format.

Key Characteristics

  • Schema-less
  • High performance
  • Scalable
  • Open source

Real-Life Example

  • User profiles in social media apps
  • Product catalogs in e-commerce websites

MongoDB Data Model

Basic Terminology

RDBMS | MongoDB
Database | Database
Table | Collection
Row | Document
Column | Field

MongoDB Data Types

Common Data Types

Data Type | Description | Example
String | Text data | "Jay"
Number | Integer / Float | 25
Boolean | True/False | true
Array | List of values | ["Java", "Python"]
Object | Embedded document | {city: "Lucknow"}
Date | Date value | ISODate()
Null | Empty value | null

Creating Documents in MongoDB

Insert a Document

db.students.insertOne({ name: "Jay", age: 24, course: "MCA" })

Insert Multiple Documents

db.students.insertMany([ {name:"Amit", age:23}, {name:"Rohit", age:25} ])

Explanation

  • Documents are stored in collections
  • Fields can vary across documents

Updating Documents

Update One Document

db.students.updateOne( {name:"Jay"}, {$set:{age:25}} )

Update Multiple Documents

db.students.updateMany( {course:"MCA"}, {$set:{college:"AKTU"}} )

Deleting Documents

Delete One Document

db.students.deleteOne({name:"Jay"})

Delete Multiple Documents

db.students.deleteMany({course:"MCA"})

Querying Documents

Find All Documents

db.students.find()

Find with Condition

db.students.find({age:{$gt:23}})

Projection (Select Fields)

db.students.find({}, {name:1, age:1})

Introduction to Indexing in MongoDB

Indexing improves query performance by reducing search time.

Why Indexing is Needed

  • Faster data retrieval
  • Efficient searching
  • Improves performance on large collections

Create an Index

db.students.createIndex({name:1})

Types of Indexes

  • Single field index
  • Compound index
  • Unique index
  • Text index

Real-Life Example: A phone contact list uses indexing (alphabetical ordering) to find names quickly.

Capped Collections

A capped collection is a fixed-size collection that automatically removes the oldest documents when the size limit is reached.

Key Features

  • Maintains insertion order
  • High performance
  • Automatically overwrites old data

Create a Capped Collection

db.createCollection("logs", { capped: true, size: 100000, max: 100 })

Use Case of Capped Collections

  • Log files
  • Chat messages
  • Streaming data

MongoDB vs RDBMS 

Feature | MongoDB | RDBMS
Schema | Flexible | Fixed
Scalability | Horizontal | Vertical
Data Format | BSON documents | Tables
Performance | High | Moderate

Advantages of MongoDB

  • Flexible schema
  • High speed
  • Easy scaling
  • Suitable for Big Data

Disadvantages of MongoDB

  • Limited join support (only via the $lookup aggregation stage)
  • High memory usage
  • Weaker fit for complex multi-document transactions

Exam-Ready Short Notes

  • NoSQL – Non-relational database for Big Data.
  • MongoDB – Document-based NoSQL database.
  • Document – JSON-like data unit.
  • Index – Improves query speed.
  • Capped Collection – Fixed-size collection.

Conclusion 

MongoDB is a popular NoSQL database designed to handle large volumes of unstructured data. Its schema-less design, high scalability, and indexing features make it suitable for modern web and Big Data applications.

APACHE SPARK

Apache Spark is a fast, in-memory data processing framework used for Big Data analytics.

Spark processes large data faster than Hadoop MapReduce by keeping data in memory.

Real-Life Example: An e-commerce company analyzing millions of customer clicks in real time.

Installing Apache Spark

System Requirements

  • Java (JDK 8 or above)
  • Scala / Python
  • Hadoop (optional, for YARN mode)

Steps to Install Spark (Standalone Mode)

  • Download Spark from Apache website
  • Extract Spark package
  • Set environment variables
  • Start Spark shell

Common Commands

  • spark-shell → Scala
  • pyspark → Python
  • spark-submit → Run applications

Spark Applications

A Spark application is a program written using Spark APIs to process data.

Components of a Spark Application

Component | Description
Driver Program | Controls execution
SparkContext | Entry point to Spark
Executors | Run tasks
Cluster Manager | Manages resources

Example Spark Application

  • Word count program (sketched below)
  • Log analysis
  • Sales data processing
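As an illustration, here is a minimal word count application in Scala; the object name and the input/output paths are placeholders, not from the original notes:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)        // driver creates the entry point
    val lines = sc.textFile("input.txt")   // placeholder input path
    val counts = lines
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))              // pair every word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word
    counts.saveAsTextFile("output")        // action: triggers execution
    sc.stop()
  }
}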

Spark Jobs, Stages, and Tasks

Spark Job: A job is triggered when an action is called (e.g., count(), collect()).

Spark Stage: A stage is a group of operations that can run together without moving data across the cluster.

  • Jobs are divided into stages at shuffle boundaries

Spark Task: A task is the smallest unit of work executed on a partition.

Relationship

Level | Meaning
Job | The entire computation triggered by an action
Stage | Part of a job, bounded by shuffles
Task | The smallest unit, one per partition

Example

  • Calling an action such as count() → Job
  • The operations between two shuffles → Stage
  • Running those operations on one partition → Task
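A small sketch of this relationship, assuming an existing SparkContext named sc: the shuffle introduced by reduceByKey splits the job into two stages, and each stage runs one task per partition.

val words = sc.parallelize(Seq("a", "b", "a"))  // sc is assumed to exist already
val pairs = words.map(w => (w, 1))              // runs in the first stage
val counts = pairs.reduceByKey(_ + _)           // shuffle here starts a new stage
println(counts.count())                         // action: submits one job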

Resilient Distributed Datasets (RDDs)

RDD is the core data structure of Spark.

RDD is an immutable, distributed collection of objects processed in parallel.

Key Features of RDD

Feature | Explanation
Resilient | Fault tolerant; lost partitions can be recomputed from lineage
Distributed | Spread across nodes
Immutable | Cannot be changed after creation
Lazy Evaluation | Executes only when an action is called

RDD Operations

Transformations

  • map()
  • filter()
  • flatMap()

Actions

  • count()
  • collect()
  • saveAsTextFile()
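A short sketch contrasting lazy transformations with eager actions, again assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 10)
val filtered = nums.filter(_ > 5)   // transformation: nothing executes yet
val doubled = filtered.map(_ * 2)   // still lazy: only the lineage is recorded
println(doubled.count())            // action: computation runs now, prints 5
doubled.saveAsTextFile("out")       // action: writes results ("out" is a placeholder path)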

Real-Life Example: Student records spread across several servers can be processed in parallel, like the partitions of an RDD.

Anatomy of a Spark Job Run

Step-by-Step Execution

  • Application starts
  • SparkContext created
  • RDD transformations defined
  • Action triggers job
  • Job divided into stages
  • Stages divided into tasks
  • Tasks executed by executors
  • Results returned to driver

Spark Execution Flow (Text Diagram)

Driver Program
  ↓
SparkContext
  ↓
Job
  ↓
Stages
  ↓
Tasks
  ↓
Executors

Spark on YARN

YARN is Hadoop’s resource management system.

Why Run Spark on YARN?

  • Better resource utilization
  • Integration with Hadoop ecosystem
  • Multi-tenant support

Spark on YARN Modes

Mode | Description
Client Mode | Driver runs on the client machine
Cluster Mode | Driver runs inside the cluster

How Spark Runs on YARN

  • Spark application submitted
  • YARN allocates resources
  • ApplicationMaster starts
  • Executors launched
  • Tasks executed
  • Results returned
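For illustration, a typical submission command in cluster mode; the class and jar names below are placeholders:

spark-submit --master yarn --deploy-mode cluster --class com.example.SalesJob sales-job.jar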

Benefits of Spark on YARN

  • Dynamic resource allocation
  • Fault tolerance
  • Supports MapReduce, Spark, Hive together

Spark vs Hadoop MapReduce 

Feature | Spark | MapReduce
Processing Speed | Very fast | Slow
Storage | In-memory | Disk-based
Real-Time | Yes | No
Iterative Tasks | Efficient | Inefficient

Advantages of Spark

  • High performance
  • In-memory computation
  • Supports SQL, ML, Streaming
  • Easy APIs (Java, Python, Scala)

Limitations of Spark

  • High memory usage
  • Needs machines with large RAM (costlier hardware)
  • Complex tuning

Exam-Ready Short Notes

  • Spark – In-memory Big Data processing framework
  • RDD – Core data structure of Spark
  • Job – Triggered by action
  • Stage – Group of tasks
  • Task – Smallest execution unit
  • YARN – Resource manager

SCALA (Scalable Language)

Scala is a high-level programming language that combines:

  • Object-Oriented Programming (OOP)
  • Functional Programming (FP)

Scala runs on the Java Virtual Machine (JVM) and is widely used with Apache Spark.

Why Scala is Important

  • Short and clean syntax
  • Faster development
  • Best language for Spark

Real-Life Example: Writing big data programs in Spark using fewer lines of code than Java.

Features of Scala

Feature | Explanation
JVM Based | Runs on the JVM
Functional | Supports lambda functions
OOP | Uses classes & objects
Type Inference | Compiler infers the type; no need to declare it
Immutability | Encourages immutable values (val)
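A small sketch of type inference and immutability in practice (the values are illustrative):

val name = "Jay"   // type String is inferred; a val cannot be reassigned
var age = 24       // type Int is inferred; a var can be reassigned
age = 25           // allowed, because age is a var
// name = "Amit"   // compile error: name is a val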

Classes and Objects

Class

A class is a blueprint to create objects.

class Student(name: String, age: Int) {
  def display() = println(name + " " + age)
}

Object

An object is a singleton instance (only one instance).

object Main {
  def main(args: Array[String]): Unit = {
    val s = new Student("Jay", 24)
    s.display()
  }
}

Difference Between Class and Object

Class | Object
Blueprint for creating objects | Singleton instance
Many objects can be created | Only one instance exists

Basic Types and Operators

Basic Data Types in Scala

Type | Example
Int | 10
Double | 10.5
Boolean | true
Char | 'A'
String | "Scala"

Operators

Operator Type | Example
Arithmetic | +, -, *, /
Relational | >, <, >=
Logical | &&, ||, !
Assignment | =

val a = 10
val b = 5
println(a + b)

Built-In Control Structures

1. If-Else

val marks = 75
if (marks >= 40) println("Pass") else println("Fail")

2. While Loop

var i = 1
while (i <= 5) {
  println(i)
  i += 1
}

3. For Loop

for (i <- 1 to 5) println(i)

4. Match Case (Switch Alternative)

val day = 1
day match {
  case 1 => println("Monday")
  case 2 => println("Tuesday")
  case _ => println("Invalid")
}

Functions and Closures

Function

A function is a block of code that performs a task.

def add(a: Int, b: Int): Int = {
  a + b
}

Anonymous Function (Lambda)

val sum = (a: Int, b: Int) => a + b

Closure

A closure is a function that captures variables defined outside its own scope.

var x = 10
val addX = (y: Int) => y + x
println(addX(5)) // Output: 15

Real-Life Example: Using external configuration values in Spark functions.

Inheritance in Scala

Inheritance allows a class to reuse properties of another class.

Example

class Person {
  def speak() = println("I am a person")
}

class Student extends Person {
  def study() = println("I am studying")
}

Usage

val s = new Student
s.speak()
s.study()
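Scala also supports method overriding with the override keyword; a short illustrative extension of the Person example above:

class Teacher extends Person {
  override def speak() = println("I am a teacher")  // replaces the inherited behavior
}

val t = new Teacher
t.speak()  // prints "I am a teacher"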

Advantages of Inheritance

  • Code reuse
  • Easy maintenance
  • Hierarchical design

Scala vs Java 

Feature | Scala | Java
Lines of Code | Fewer | More
Functional Support | Yes | Limited (lambdas only since Java 8)
Syntax | Concise | Verbose
Used in Spark | Yes | Less common

Exam-Ready Short Notes

  • Scala – JVM-based language combining OOP & FP
  • Class – Blueprint of object
  • Object – Singleton instance
  • Closure – Function using external variables
  • Inheritance – Reuse of parent class