Introduction to Distributed Data Processing



Distributed Data Processing means processing data across multiple computers (nodes) connected via a network.

Features

  • Data stored at multiple locations
  • Processing done collaboratively
  • Network-based communication

Diagram: Distributed Data Processing

   Node 1          Node 2          Node 3
(Data + App)    (Data + App)    (Data + App)
        \             |             /
            Network Communication

Advantages

  • Faster processing, since nodes work on data in parallel
  • Resource sharing across nodes
  • Fault tolerance: other nodes keep working if one fails

Distributed Database System (DDBS)

A Distributed Database System is a collection of multiple, logically interrelated databases distributed over a network.

Key Characteristics

  • Data is physically distributed
  • Appears as a single database to users
  • Controlled by Distributed DBMS (DDBMS)

Diagram: Distributed Database

         User
           |
      DDBMS Layer
     /     |     \
  Site1  Site2  Site3
  (DB1)  (DB2)  (DB3)

Promises of Distributed Database Systems

Why is DDBMS important?

Advantages / Promises

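Feature       | Description
--------------+------------------------------------------------
Transparency  | Users see a single database
Reliability   | Failure of one node doesn’t stop the system
Scalability   | Nodes are easy to add
Performance   | Work is processed in parallel across nodes
Availability  | Data stays accessible even if some sites fail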

Types of Transparency

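Type                        | Meaning
----------------------------+------------------------------------------------------
Location Transparency       | User doesn’t need to know where the data is stored
Replication Transparency    | User is unaware that multiple copies exist
Fragmentation Transparency  | User is unaware that the data is split into fragments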

Problem Areas in DDBMS

Challenges

Problems Table

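Problem              | Description
---------------------+------------------------------------------------
Data Consistency     | Maintaining the same data across nodes
Network Failure      | Lost or delayed communication between sites
Security             | Protecting data at each site and in transit
Concurrency Control  | Multiple users accessing the same data at once
Query Optimization   | Executing queries efficiently across sites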

Example Problem

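If one node fails while an update is being applied → some copies hold the new value and others still hold the old one, so the data becomes inconsistent.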
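
A minimal sketch of how this can happen, assuming a hypothetical `Node` class and a naive update that writes to each replica in turn with no recovery step (both are illustrative, not taken from any particular DDBMS):

```python
class Node:
    """Hypothetical site holding one replica of a small key-value store."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.up = True

    def write(self, key, value):
        if not self.up:
            raise ConnectionError(f"{self.name} is unreachable")
        self.data[key] = value


def naive_update(nodes, key, value):
    """Write to each replica in turn; skip and report nodes that are down."""
    failed = []
    for node in nodes:
        try:
            node.write(key, value)
        except ConnectionError:
            failed.append(node.name)
    return failed


nodes = [Node("Site1"), Node("Site2"), Node("Site3")]
naive_update(nodes, "balance", 100)      # all three replicas agree

nodes[1].up = False                      # Site2 fails
failed = naive_update(nodes, "balance", 250)

values = {n.name: n.data["balance"] for n in nodes}
print(failed)   # ['Site2']
print(values)   # {'Site1': 250, 'Site2': 100, 'Site3': 250}
```

Site2 still holds the old value when it comes back, which is why a real system needs a protocol for bringing failed replicas up to date.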

Distributed DBMS Architecture

Architecture defines the structure and interaction of components in DDBMS.

Architectural Models for Distributed DBMS

Client-Server Architecture

Diagram

Client → Request → Server → Database

Features

  • Clients send requests
  • Server processes data

Peer-to-Peer Architecture

Diagram

Node1 ↔ Node2 ↔ Node3
(All equal nodes)

Features

  • No central server
  • Each node acts as a client & server

Multi-tier Architecture

Diagram

Client → Application Server → Database Server

Features

  • Better security
  • Scalable design

Comparison of Architectures

Architecture   | Advantage | Disadvantage
---------------+-----------+-----------------
Client-Server  | Simple    | Server overload
Peer-to-Peer   | Flexible  | Complex
Multi-tier     | Secure    | Costly

DDBMS Architecture

Components

Architecture Diagram

User Interface
|
Global Query Processor
|
Local Query Processor
|
Local Databases

Layers Explanation

1. Global Query Processor

  • Converts a user query into sub-queries for the sites that hold the data

2. Local Query Processor

  • Executes queries at local sites

3. Data Manager

  • Handles storage & retrieval
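
The three layers can be sketched roughly as follows. This is only an illustration: the in-memory SQLite databases stand in for local sites (SQLite plays the data manager role), and the function names `make_site`, `local_query`, and `global_query`, along with the sample rows, are assumptions made for this example.

```python
import sqlite3


def make_site(rows):
    """Create an in-memory SQLite database standing in for one local site."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT, marks INTEGER)"
    )
    conn.executemany("INSERT INTO students VALUES (?, ?, ?)", rows)
    return conn


def local_query(site, min_marks):
    """Local query processor: run the sub-query against one site's data."""
    cur = site.execute(
        "SELECT id, name, marks FROM students WHERE marks >= ?", (min_marks,)
    )
    return cur.fetchall()


def global_query(sites, min_marks):
    """Global query processor: send the sub-query to every site, merge results."""
    results = []
    for site in sites:
        results.extend(local_query(site, min_marks))
    return sorted(results)  # order by id so the answer reads like one table


sites = [
    make_site([(1, "Asha", 82), (2, "Ravi", 45)]),
    make_site([(51, "Meera", 91), (52, "John", 67)]),
]
print(global_query(sites, 60))
# [(1, 'Asha', 82), (51, 'Meera', 91), (52, 'John', 67)]
```

The caller sees one result list and never learns which site each row came from.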

Types of Distributed DBMS

Based on Homogeneity

Type           | Description
---------------+------------------------------
Homogeneous    | All sites use the same DBMS
Heterogeneous  | Sites use different DBMSs

Based on Data Distribution

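Type        | Description
------------+------------------------------------------
Replicated  | Copies of data stored at several sites
Fragmented  | Data split across sites
Hybrid      | Both fragmentation and replication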

Data Fragmentation

Breaking a database into smaller pieces (fragments).

Types

Type        | Description
------------+------------------
Horizontal  | Rows divided
Vertical    | Columns divided
Hybrid      | Both

Fragmentation Diagram

Table: Students
----------------------
| ID | Name | Marks |
----------------------

Horizontal:
Site1 → ID 1–50
Site2 → ID 51–100

Vertical:
Site1 → ID, Name
Site2 → ID, Marks
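
The split shown in the diagram can be sketched in a few lines. The sample rows are made up for illustration; the ID ranges and column groups follow the diagram.

```python
students = [
    {"ID": 1, "Name": "Asha", "Marks": 82},
    {"ID": 50, "Name": "Ravi", "Marks": 45},
    {"ID": 51, "Name": "Meera", "Marks": 91},
    {"ID": 100, "Name": "John", "Marks": 67},
]

# Horizontal fragmentation: each site keeps whole rows for a range of IDs.
horizontal = {
    "Site1": [r for r in students if 1 <= r["ID"] <= 50],
    "Site2": [r for r in students if 51 <= r["ID"] <= 100],
}

# Vertical fragmentation: each site keeps some columns; ID is repeated as the key.
vertical = {
    "Site1": [{"ID": r["ID"], "Name": r["Name"]} for r in students],
    "Site2": [{"ID": r["ID"], "Marks": r["Marks"]} for r in students],
}

# Rebuild the table: union for horizontal fragments, join on ID for vertical ones.
from_horizontal = horizontal["Site1"] + horizontal["Site2"]
marks_by_id = {r["ID"]: r["Marks"] for r in vertical["Site2"]}
from_vertical = [{**r, "Marks": marks_by_id[r["ID"]]} for r in vertical["Site1"]]

print(from_horizontal == students)  # True
print(from_vertical == students)    # True
```

ID appears in both vertical fragments because it is the only way to match a name with its marks again.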

Data Replication

Storing multiple copies of data at different locations.

Types

Type     | Description
---------+-------------------
Full     | Entire DB copied
Partial  | Some data copied

Advantages

  • High availability
  • Faster access

Disadvantages

  • Update complexity
  • Storage cost
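
A small sketch of this trade-off, assuming full replication with a write-all rule and reads served by the first reachable copy. The `Replica` class and the `read` and `write_all` functions are hypothetical names for this example only.

```python
class Replica:
    """Hypothetical site holding a full copy of the data."""

    def __init__(self, name, data):
        self.name = name
        self.data = dict(data)
        self.up = True


def read(replicas, key):
    """Return (site name, value) from the first reachable replica."""
    for r in replicas:
        if r.up:
            return r.name, r.data[key]
    raise RuntimeError("no replica available")


def write_all(replicas, key, value):
    """Update every copy; refuse the update if any copy is unreachable."""
    if not all(r.up for r in replicas):
        raise RuntimeError("a replica is down; write-all cannot complete")
    for r in replicas:
        r.data[key] = value
    return len(replicas)  # physical writes needed for one logical update


replicas = [Replica(f"Site{i}", {"stock": 10}) for i in (1, 2, 3)]
print(write_all(replicas, "stock", 9))   # 3

replicas[0].up = False                   # Site1 fails
print(read(replicas, "stock"))           # ('Site2', 9)
```

Reads survive the failure of Site1 (high availability), while one logical update costs three physical writes and three stored copies (update complexity and storage cost).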

Combined DDBMS Working

User Query
|
Global Processor
|
Fragmentation / Replication
|
Local Sites Execution
|
Result Combined

Important Exam Questions

Short Questions

  • Define DDBMS.
  • What is data fragmentation?
  • What is replication?

Long Questions

  • Explain the architecture of DDBMS.
  • Describe the advantages and problems of DDBMS.
  • Compare Client-Server and Peer-to-Peer.

Practical/Theory Mix

  • Explain fragmentation with an example.
  • Draw an architecture diagram of DDBMS.
  • Discuss transparency in DDBMS.

Final Summary

  • DDBMS = distributed + single system view
  • Architectures → Client-Server, P2P, Multi-tier
  • Fragmentation & Replication → core concepts
  • Problems → consistency, security, network

Database Design in Distributed DBMS

Overview of Distributed Database Design

Database design in DDBMS focuses on how data is divided, stored, and managed across multiple sites.

Objectives

  • Efficient data access
  • High availability
  • Minimum communication cost
  • Data consistency

Design Process Diagram

Global Database Design
|
Fragmentation
|
Allocation
|
Local Database Design

Alternative Design Strategies

These define how we design a distributed database.

Top-Down Approach

Design starts from a global schema, which is then divided into fragments.

Diagram

Global Schema
|
Fragmentation
|
Allocation

Advantages

  • Better control
  • Uniform design
  • Suitable for new systems

Disadvantages

  • Complex
  • Time-consuming

Bottom-Up Approach

Existing databases are integrated into one distributed system.

Diagram

Local Databases
|
Integration
|
Global Schema

Advantages

  • Easy for existing systems
  • Faster implementation

Disadvantages

  • Data inconsistency
  • Integration issues
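
The integration step can be pictured with a small sketch. Everything here is hypothetical: two local databases (`branch_a`, `branch_b`) that name the same facts differently, and a hand-written mapping from each local schema to an assumed global schema.

```python
# Two hypothetical local databases with different column names for the same facts.
branch_a = [{"cust_id": 1, "cust_name": "Asha"}]
branch_b = [{"id": 7, "full_name": "Ravi"}]

# Hand-written mapping from each local schema to the global schema (assumed).
mappings = {
    "branch_a": {"cust_id": "customer_id", "cust_name": "name"},
    "branch_b": {"id": "customer_id", "full_name": "name"},
}


def to_global(rows, mapping, source):
    """Rename local columns to global ones and record where each row came from."""
    return [
        {**{mapping[col]: val for col, val in row.items()}, "source": source}
        for row in rows
    ]


global_view = (
    to_global(branch_a, mappings["branch_a"], "branch_a")
    + to_global(branch_b, mappings["branch_b"], "branch_b")
)
print(global_view)
# [{'customer_id': 1, 'name': 'Asha', 'source': 'branch_a'},
#  {'customer_id': 7, 'name': 'Ravi', 'source': 'branch_b'}]
```

The integration issues listed above show up in the mapping itself: someone has to decide that `cust_id` and `id` mean the same thing, and nothing here detects the same customer recorded at both branches.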

Comparison Table

Feature      | Top-Down       | Bottom-Up
-------------+----------------+------------------
Start Point  | Global schema  | Local DBs
Use Case     | New system     | Existing system
Complexity   | High           | Moderate

Distribution Design Issues

These are key challenges in distributing data.

Important Issues

  • Data Distribution: Where to store data?
  • Replication: How many copies of data?
  • Fragmentation: How to divide data?
  • Allocation: Where to place fragments?
  • Transparency: Hide complexity from users

Design Issues Diagram

Data Distribution
|
Fragmentation
|
Replication
|
Allocation

Fragmentation

Breaking a database into smaller parts (fragments).

Why Fragmentation?

  • Improve performance
  • Reduce data transfer
  • Increase parallelism

Types of Fragmentation

1. Horizontal Fragmentation

Rows are divided.

Students Table:
ID | Name | City

Site1 → City = Delhi
Site2 → City = Mumbai

2. Vertical Fragmentation

Columns are divided.

Site1 → ID, Name
Site2 → ID, Marks

3. Hybrid Fragmentation

A combination of both.

Fragmentation Comparison

Type        | Based On | Example
------------+----------+-------------------
Horizontal  | Rows     | Region-wise data
Vertical    | Columns  | Sensitive data
Hybrid      | Both     | Complex systems

Fragmentation Rules

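  • Completeness → Every data item of the original table appears in at least one fragment (no data loss)
  • Reconstruction → The original table can be rebuilt from its fragments
  • Disjointness → A data item appears in only one fragment (in vertical fragmentation, only the key column is repeated)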
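
For horizontal fragments, the three rules can be checked mechanically. The sketch below uses made-up rows and hypothetical helper names (`is_complete`, `is_reconstructible`, `is_disjoint`), and it assumes every row is unique.

```python
students = [
    (1, "Asha", "Delhi"),
    (2, "Ravi", "Mumbai"),
    (3, "Meera", "Delhi"),
]
fragments = {
    "Site1": [r for r in students if r[2] == "Delhi"],
    "Site2": [r for r in students if r[2] == "Mumbai"],
}


def stored_rows(frags):
    """All distinct rows held by any fragment."""
    return set().union(*map(set, frags.values()))


def is_complete(table, frags):
    """Every row of the table appears in at least one fragment."""
    return set(table) <= stored_rows(frags)


def is_reconstructible(table, frags):
    """The union of the fragments gives back exactly the original rows."""
    return stored_rows(frags) == set(table)


def is_disjoint(frags):
    """No row is stored in more than one fragment."""
    return sum(len(f) for f in frags.values()) == len(stored_rows(frags))


print(is_complete(students, fragments))         # True
print(is_reconstructible(students, fragments))  # True
print(is_disjoint(fragments))                   # True

students.append((4, "John", "Pune"))            # no fragment covers Pune
print(is_complete(students, fragments))         # False
```

The last line shows why the fragmentation predicates must cover every possible value: a row for a new city would be lost.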

Allocation

Placing fragments at different sites.

Types of Allocation

1. Centralized Allocation

All data → One site
  • Simple but less reliable

2. Distributed Allocation

Fragment1 → Site1
Fragment2 → Site2
  • Better performance

3. Replicated Allocation

Same data → Multiple sites
  • High availability
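
The three strategies can be compared with a small sketch. The allocation maps and the helper names `available_fragments` and `copies_stored` are assumptions for illustration: each map records which sites hold each fragment, and the helpers show what survives when Site1 fails and how many copies must be stored.

```python
# Which sites hold each fragment under each strategy (illustrative only).
allocations = {
    "centralized": {"Fragment1": ["Site1"], "Fragment2": ["Site1"]},
    "distributed": {"Fragment1": ["Site1"], "Fragment2": ["Site2"]},
    "replicated": {"Fragment1": ["Site1", "Site2"], "Fragment2": ["Site1", "Site2"]},
}


def available_fragments(allocation, failed_site):
    """Fragments still reachable when one site is down."""
    return sorted(
        frag
        for frag, sites in allocation.items()
        if any(site != failed_site for site in sites)
    )


def copies_stored(allocation):
    """Total number of fragment copies kept across all sites."""
    return sum(len(sites) for sites in allocation.values())


for name, alloc in allocations.items():
    print(name, available_fragments(alloc, "Site1"), copies_stored(alloc))
# centralized [] 2
# distributed ['Fragment2'] 2
# replicated ['Fragment1', 'Fragment2'] 4
```

Replicated allocation keeps everything reachable after the failure, but it stores twice as many copies.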

Allocation Comparison

Type         | Advantage    | Disadvantage
-------------+--------------+--------------------------
Centralized  | Simple       | Single point of failure
Distributed  | Fast access  | Complex
Replicated   | Reliable     | High cost

Fragmentation vs Allocation

Feature  | Fragmentation  | Allocation
---------+----------------+----------------------------
Meaning  | Divide data    | Place data
Purpose  | Efficiency     | Availability
Example  | Split a table  | Store a fragment at a site

Combined Design Workflow

Global Schema
|
Fragmentation
|
Allocation
|
Replication
|
Execution

Important Exam Questions

Short Questions

  • Define fragmentation.
  • What is allocation?
  • What is the difference between horizontal and vertical fragmentation?

Long Questions

  • Explain database design strategies in DDBMS.
  • Discuss fragmentation types with examples.
  • Explain allocation strategies.

Case-Based Question

  • Design fragmentation for a student database.
  • Suggest an allocation strategy for a banking system.

Final Summary

  • Top-Down → Start global → divide
  • Bottom-Up → Merge local DBs
  • Fragmentation → split data
  • Allocation → place data
  • Goal → performance + availability + efficiency