Hadoop Ecosystem and YARN, NoSQL, Spark, Scala



Hadoop Ecosystem

The Hadoop Ecosystem is a collection of tools and frameworks built around Apache Hadoop to store, process, analyze, and manage big data efficiently.


Why Hadoop Ecosystem is Needed

  • HDFS + MapReduce alone are not sufficient for all big data needs
  • Different tasks like data ingestion, querying, streaming, security, coordination require specialized tools

Major Components of Hadoop Ecosystem

Category | Tool | Purpose
Storage | HDFS | Distributed file storage
Resource Management | YARN | Cluster resource management
Processing | MapReduce, Spark | Data processing
SQL on Hadoop | Hive | Query big data using SQL
NoSQL DB | HBase | Real-time read/write access
Data Ingestion | Flume | Collect log/streaming data
Data Transfer | Sqoop | Import/export data from RDBMS
Workflow | Oozie | Job scheduling
Coordination | ZooKeeper | Distributed coordination
Serialization | Avro | Data serialization
Metadata | Hive Metastore | Table and schema info
Security | Knox, Ranger | Authentication & authorization

Hadoop Schedulers

Schedulers decide how cluster resources are allocated among jobs.

Types of Schedulers in Hadoop

1. FIFO Scheduler

  • Jobs execute in submission order
  • Simple but inefficient
  • Not suitable for multi-user clusters

2. Fair Scheduler

  • Resources are shared equally among running jobs
  • Each job gets a fair share of CPU and memory

Features

  • Improves cluster utilization
  • Supports job priorities
  • Suitable for multi-user environments

3. Capacity Scheduler

  • Cluster is divided into queues
  • Each queue has guaranteed capacity

Features

  • Enterprise-level scheduler
  • Ensures minimum resources for each team
  • Supports multiple users and organizations

Fair vs Capacity Scheduler

Feature | Fair Scheduler | Capacity Scheduler
Resource Sharing | Equal among running jobs | Queue-based
Best For | Small clusters | Large enterprises
Complexity | Medium | High

Hadoop 2.0 – New Features

Hadoop 2.0 introduced major architectural improvements.

NameNode High Availability (HA)

Problem in Hadoop 1.x

  • Single NameNode = Single Point of Failure

Solution in Hadoop 2.0

Two NameNodes:

  • Active NameNode
  • Standby NameNode

How It Works

  • The Active NameNode writes its edit log to a quorum of JournalNodes, and the Standby replays it to stay in sync
  • If the Active fails, the Standby takes over automatically

Benefit

  • Improved reliability
  • Minimal downtime during failover

HDFS Federation

What is HDFS Federation?

  • Multiple independent NameNodes
  • Each NameNode manages its own namespace

Why Federation?

Single NameNode cannot handle:

  • Large metadata
  • High traffic

Benefits

  • Scalability
  • Better performance
  • Namespace isolation

MRv2 (MapReduce Version 2)

Limitation of MRv1

JobTracker handled:

  • Resource management
  • Job scheduling
  • Monitoring

Problem

  • Overloaded JobTracker
  • Scalability issues

YARN (Yet Another Resource Negotiator)

YARN is the core component of Hadoop 2.x.

What YARN Does

  • Separates resource management from data processing

Key Components of YARN

Component | Role
ResourceManager (RM) | Global resource manager
NodeManager (NM) | Manages resources on each node
ApplicationMaster (AM) | Manages the application lifecycle
Containers | Allocated bundles of CPU and memory

How YARN Works (Simple Flow)

  • Client submits a job
  • ResourceManager allocates a container for the ApplicationMaster
  • ApplicationMaster starts and requests containers for tasks
  • NodeManagers launch the containers and execute the tasks
  • Results are reported back to the client

Advantages of YARN

  • Supports multiple processing frameworks
  • Better scalability
  • Efficient resource utilization
  • Enables Spark, Tez, Flink on Hadoop

Running MRv1 in YARN

Is it Possible? Yes.

How?

  • MapReduce runs as an application on YARN

JobTracker replaced by:

  • ResourceManager
  • ApplicationMaster

Benefit

  • Backward compatibility
  • Smooth migration from Hadoop 1.x to 2.x

Summary Table

Feature | Hadoop 1.x | Hadoop 2.x
Resource Mgmt | JobTracker | YARN
Scalability | Limited | High
HA | No | Yes
Multi-Framework | No | Yes

Real-World Example

E-commerce Company

  • HDFS stores user logs
  • Sqoop imports order data
  • Hive analyzes sales
  • YARN allocates resources
  • Spark runs real-time analytics

NoSQL Databases & MongoDB

NoSQL (Not Only SQL) databases are used to store large volumes of unstructured or semi-structured data efficiently.

Why NoSQL?

Traditional relational databases:

  • Become slow at very large data volumes
  • Require fixed schema
  • Do not scale easily

Key Features of NoSQL

  • Schema-less
  • Horizontally scalable
  • High performance
  • Suitable for Big Data

Types of NoSQL Databases

Type | Example | Use Case
Document-based | MongoDB | JSON-like documents
Key-Value | Redis | Caching
Column-based | Cassandra | Large-scale analytics
Graph-based | Neo4j | Social networks

Introduction to MongoDB

MongoDB is a document-oriented NoSQL database that stores data in BSON (Binary JSON) format.

Key Characteristics

  • Schema-less
  • High performance
  • Scalable
  • Open source

Real-Life Example

  • User profiles in social media apps
  • Product catalogs in e-commerce websites

MongoDB Data Model

Basic Terminology

RDBMS | MongoDB
Database | Database
Table | Collection
Row | Document
Column | Field

MongoDB Data Types

Common Data Types

Data Type | Description | Example
String | Text data | "Jay"
Number | Integer / Float | 25
Boolean | True/False | true
Array | List of values | ["Java", "Python"]
Object | Embedded document | {city: "Lucknow"}
Date | Date value | ISODate()
Null | Empty value | null

Creating Documents in MongoDB

Insert a Document

db.students.insertOne({ name: "Jay", age: 24, course: "MCA" })

Insert Multiple Documents

db.students.insertMany([ {name:"Amit", age:23}, {name:"Rohit", age:25} ])

Explanation

  • Documents are stored in collections
  • Fields can vary across documents

Updating Documents

Update One Document

db.students.updateOne( {name:"Jay"}, {$set:{age:25}} )

Update Multiple Documents

db.students.updateMany( {course:"MCA"}, {$set:{college:"AKTU"}} )

Deleting Documents

Delete One Document

db.students.deleteOne({name:"Jay"})

Delete Multiple Documents

db.students.deleteMany({course:"MCA"})

Querying Documents

Find All Documents

db.students.find()

Find with Condition

db.students.find({age:{$gt:23}})

Projection (Select Fields)

db.students.find({}, {name:1, age:1})

Introduction to Indexing in MongoDB

Indexing improves query performance by reducing search time.

Why Indexing is Needed

  • Faster data retrieval
  • Efficient searching
  • Improves performance on large collections

Create an Index

db.students.createIndex({name:1})

Types of Indexes

  • Single field index
  • Compound index
  • Unique index
  • Text index

Real-Life Example: A phone contact list uses indexing (alphabetical ordering) to find names quickly.

Capped Collections

A capped collection is a fixed-size collection that automatically removes the oldest documents when the size limit is reached.

Key Features

  • Maintains insertion order
  • High performance
  • Automatically overwrites old data

Create a Capped Collection

db.createCollection("logs", { capped: true, size: 100000, max: 100 })

Use Case of Capped Collections

  • Log files
  • Chat messages
  • Streaming data

MongoDB vs RDBMS 

Feature | MongoDB | RDBMS
Schema | Flexible | Fixed
Scalability | Horizontal | Vertical
Data Format | BSON documents | Tables
Performance | High | Moderate

Advantages of MongoDB

  • Flexible schema
  • High speed
  • Easy scaling
  • Suitable for Big Data

Disadvantages of MongoDB

  • Limited join support (only via the $lookup aggregation stage)
  • High memory usage
  • Weaker fit for complex multi-document transactions

Exam-Ready Short Notes

  • NoSQL – Non-relational database for Big Data.
  • MongoDB – Document-based NoSQL database.
  • Document – JSON-like data unit.
  • Index – Improves query speed.
  • Capped Collection – Fixed-size collection.

Conclusion 

MongoDB is a popular NoSQL database designed to handle large volumes of unstructured data. Its schema-less design, high scalability, and indexing features make it suitable for modern web and Big Data applications.

APACHE SPARK

Apache Spark is a fast, in-memory data processing framework used for Big Data analytics.

Spark processes large data faster than Hadoop MapReduce by keeping data in memory.

Real-Life Example: An e-commerce company analyzing millions of customer clicks in real time.

Installing Apache Spark

System Requirements

  • Java (JDK 8 or above)
  • Scala / Python
  • Hadoop (optional, for YARN mode)

Steps to Install Spark (Standalone Mode)

  • Download Spark from Apache website
  • Extract Spark package
  • Set environment variables
  • Start Spark shell

Common Commands

  • spark-shell → Scala
  • pyspark → Python
  • spark-submit → Run applications

Spark Applications

A Spark application is a program written using Spark APIs to process data.

Components of a Spark Application

Component | Description
Driver Program | Controls execution
SparkContext | Entry point to Spark
Executors | Run tasks
Cluster Manager | Manages resources

Example Spark Application

  • Word count program (sketched below)
  • Log analysis
  • Sales data processing
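As an illustration, here is a minimal word count application in Scala; the object name and the input/output paths are placeholders, not from the original notes:

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("WordCount").setMaster("local[*]")
    val sc = new SparkContext(conf)        // driver creates the entry point
    val lines = sc.textFile("input.txt")   // placeholder input path
    val counts = lines
      .flatMap(_.split("\\s+"))            // split each line into words
      .map(word => (word, 1))              // pair every word with a count of 1
      .reduceByKey(_ + _)                  // sum the counts per word
    counts.saveAsTextFile("output")        // action: triggers execution
    sc.stop()
  }
}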

Spark Jobs, Stages, and Tasks

Spark Job: A job is triggered when an action is called (e.g., count(), collect()).

Spark Stage: A stage is a group of operations that can run together without moving data across the cluster.

  • Jobs are divided into stages at shuffle boundaries

Spark Task: A task is the smallest unit of work executed on a partition.

Relationship

Level | Meaning
Job | The entire computation triggered by an action
Stage | Part of a job, bounded by shuffles
Task | The smallest unit, one per partition

Example

  • Calling an action such as count() → Job
  • The operations between two shuffles → Stage
  • Running those operations on one partition → Task
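A small sketch of this relationship, assuming an existing SparkContext named sc: the shuffle introduced by reduceByKey splits the job into two stages, and each stage runs one task per partition.

val words = sc.parallelize(Seq("a", "b", "a"))  // sc is assumed to exist already
val pairs = words.map(w => (w, 1))              // runs in the first stage
val counts = pairs.reduceByKey(_ + _)           // shuffle here starts a new stage
println(counts.count())                         // action: submits one job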

Resilient Distributed Datasets (RDDs)

RDD is the core data structure of Spark.

RDD is an immutable, distributed collection of objects processed in parallel.

Key Features of RDD

Feature | Explanation
Resilient | Fault tolerant; lost partitions can be recomputed from lineage
Distributed | Spread across nodes
Immutable | Cannot be changed after creation
Lazy Evaluation | Executes only when an action is called

RDD Operations

Transformations

  • map()
  • filter()
  • flatMap()

Actions

  • count()
  • collect()
  • saveAsTextFile()
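A short sketch contrasting lazy transformations with eager actions, again assuming an existing SparkContext sc:

val nums = sc.parallelize(1 to 10)
val filtered = nums.filter(_ > 5)   // transformation: nothing executes yet
val doubled = filtered.map(_ * 2)   // still lazy: only the lineage is recorded
println(doubled.count())            // action: computation runs now, prints 5
doubled.saveAsTextFile("out")       // action: writes results ("out" is a placeholder path)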

Real-Life Example: Student records spread across several servers can be processed in parallel, like the partitions of an RDD.

Anatomy of a Spark Job Run

Step-by-Step Execution

  • Application starts
  • SparkContext created
  • RDD transformations defined
  • Action triggers job
  • Job divided into stages
  • Stages divided into tasks
  • Tasks executed by executors
  • Results returned to driver

Spark Execution Flow (Text Diagram)

Driver Program
  ↓
SparkContext
  ↓
Job
  ↓
Stages
  ↓
Tasks
  ↓
Executors

Spark on YARN

YARN is Hadoop’s resource management system.

Why Run Spark on YARN?

  • Better resource utilization
  • Integration with Hadoop ecosystem
  • Multi-tenant support

Spark on YARN Modes

Mode | Description
Client Mode | Driver runs on the client machine
Cluster Mode | Driver runs inside the cluster

How Spark Runs on YARN

  • Spark application submitted
  • YARN allocates resources
  • ApplicationMaster starts
  • Executors launched
  • Tasks executed
  • Results returned
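For illustration, a typical submission command in cluster mode; the class and jar names below are placeholders:

spark-submit --master yarn --deploy-mode cluster --class com.example.SalesJob sales-job.jar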

Benefits of Spark on YARN

  • Dynamic resource allocation
  • Fault tolerance
  • Supports MapReduce, Spark, Hive together

Spark vs Hadoop MapReduce 

Feature | Spark | MapReduce
Processing Speed | Very fast | Slow
Storage | In-memory | Disk-based
Real-Time | Yes | No
Iterative Tasks | Efficient | Inefficient

Advantages of Spark

  • High performance
  • In-memory computation
  • Supports SQL, ML, Streaming
  • Easy APIs (Java, Python, Scala)

Limitations of Spark

  • High memory usage
  • Needs machines with large RAM (costlier hardware)
  • Complex tuning

Exam-Ready Short Notes

  • Spark – In-memory Big Data processing framework
  • RDD – Core data structure of Spark
  • Job – Triggered by action
  • Stage – Group of tasks
  • Task – Smallest execution unit
  • YARN – Resource manager

SCALA (Scalable Language)

Scala is a high-level programming language that combines:

  • Object-Oriented Programming (OOP)
  • Functional Programming (FP)

Scala runs on the Java Virtual Machine (JVM) and is widely used with Apache Spark.

Why Scala is Important

  • Short and clean syntax
  • Faster development
  • Best language for Spark

Real-Life Example: Writing big data programs in Spark using fewer lines of code than Java.

Features of Scala

Feature | Explanation
JVM Based | Runs on the JVM
Functional | Supports lambda functions
OOP | Uses classes & objects
Type Inference | Compiler infers the type; no need to declare it
Immutability | Encourages immutable values (val)
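A small sketch of type inference and immutability in practice (the values are illustrative):

val name = "Jay"   // type String is inferred; a val cannot be reassigned
var age = 24       // type Int is inferred; a var can be reassigned
age = 25           // allowed, because age is a var
// name = "Amit"   // compile error: name is a val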

Classes and Objects

Class

A class is a blueprint to create objects.

class Student(name: String, age: Int) {
  def display() = println(name + " " + age)
}

Object

An object is a singleton instance (only one instance).

object Main {
  def main(args: Array[String]): Unit = {
    val s = new Student("Jay", 24)
    s.display()
  }
}

Difference Between Class and Object

Class | Object
Blueprint for creating objects | Singleton instance
Many objects can be created | Only one instance exists

Basic Types and Operators

Basic Data Types in Scala

Type | Example
Int | 10
Double | 10.5
Boolean | true
Char | 'A'
String | "Scala"

Operators

Operator Type | Example
Arithmetic | +, -, *, /
Relational | >, <, >=
Logical | &&, ||, !
Assignment | =

val a = 10
val b = 5
println(a + b)

Built-In Control Structures

1. If-Else

val marks = 75
if (marks >= 40) println("Pass") else println("Fail")

2. While Loop

var i = 1
while (i <= 5) {
  println(i)
  i += 1
}

3. For Loop

for (i <- 1 to 5) println(i)

4. Match Case (Switch Alternative)

val day = 1
day match {
  case 1 => println("Monday")
  case 2 => println("Tuesday")
  case _ => println("Invalid")
}

Functions and Closures

Function

A function is a block of code that performs a task.

def add(a: Int, b: Int): Int = {
  a + b
}

Anonymous Function (Lambda)

val sum = (a: Int, b: Int) => a + b

Closure

A closure is a function that captures variables defined outside its own scope.

var x = 10
val addX = (y: Int) => y + x
println(addX(5)) // Output: 15

Real-Life Example: Using external configuration values in Spark functions.

Inheritance in Scala

Inheritance allows a class to reuse properties of another class.

Example

class Person {
  def speak() = println("I am a person")
}

class Student extends Person {
  def study() = println("I am studying")
}

Usage

val s = new Student
s.speak()
s.study()
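Scala also supports method overriding with the override keyword; a short illustrative extension of the Person example above:

class Teacher extends Person {
  override def speak() = println("I am a teacher")  // replaces the inherited behavior
}

val t = new Teacher
t.speak()  // prints "I am a teacher"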

Advantages of Inheritance

  • Code reuse
  • Easy maintenance
  • Hierarchical design

Scala vs Java 

Feature | Scala | Java
Lines of Code | Fewer | More
Functional Support | Yes | Limited (lambdas only since Java 8)
Syntax | Concise | Verbose
Used in Spark | Yes | Less common

Exam-Ready Short Notes

  • Scala – JVM-based language combining OOP & FP
  • Class – Blueprint of object
  • Object – Singleton instance
  • Closure – Function using external variables
  • Inheritance – Reuse of parent class