Query Processing & Decomposition in Distributed DBM



Query Processing 

Query Processing & Decomposition in Distributed DBM

Query processing is the process of translating a user query into an efficient execution plan and retrieving data from distributed databases.

Example

SELECT name FROM Students WHERE marks > 80;

👉 In DDBMS, this query may run across multiple sites.

Query Processing Flow

User Query (SQL)

Query Parser

Query Optimizer

Execution Plan

Distributed Execution

Result

Objectives of Query Processing

Main Goals

ObjectiveDescription
Minimize Response TimeFast query results
Reduce Communication CostLess data transfer
Efficient Resource UseCPU, memory optimization
Parallel ProcessingExecute at multiple sites

Key Focus: Communication cost is most important in distributed systems.

Characterisation of Query Processors

Describes features and behaviour of query processors in DDBMS.

Characteristics

FeatureDescription
Distributed ExecutionRuns on multiple nodes
ParallelismSimultaneous execution
OptimizationFinds the best execution plan
TransparencyThe user is unaware of the distribution

Types of Query Processors

TypeDescription
CentralizedSingle-site processing
DistributedMultiple sites
ParallelSimultaneous execution

Layers of Query Processing

Query processing is divided into multiple layers for efficiency.

Layered Architecture

User Query

Query Decomposition Layer

Data Localization Layer

Global Optimization Layer

Local Optimization Layer

Execution Layer

Explanation of Layers

1. Query Decomposition Layer

  • Breaks query into smaller parts
  • Converts SQL to relational algebra

2. Data Localisation Layer

  • Identifies where data is stored

3. Global Optimisation Layer

  • Creates an efficient distributed plan

4. Local Optimisation Layer

  • Optimises queries at each site

5. Execution Layer

  • Executes the query and returns the result

Query Decomposition

Process of breaking a high-level query into simpler operations.

Steps in Query Decomposition

Decomposition Process

SQL Query

Parsing

Normalization

Relational Algebra Conversion

Optimization

Explanation

1. Parsing

  • Check syntax & correctness

2. Normalisation

  • Simplify query

3. Relational Algebra

  • Convert into operations like:
    • Selection (σ)
    • Projection (π)
    • Join (⨝)

Example

SQL: SELECT name FROM Students WHERE marks > 80;

Relational Algebra:π name (σ marks > 80 (Students))

Localisation of Distributed Data

Process of mapping a global query to local data fragments.

Why Needed?

  • Data is distributed across sites
  • Query must be executed where data exists

Localization Diagram

Global Query

Fragment Mapping

Local Queries

Execution at Sites

Example

Global Table: Students

Fragmented as:

  • Site1 → Students with marks > 50
  • Site2 → Students with marks ≤ 50

Query:

SELECT name FROM Students;

👉 Converted to:

  • Query Site1
  • Query Site2
  • Combine results

Query Optimisation in Distributed Systems

Key Factors

FactorDescription
Data TransferMinimise network usage
Join StrategyEfficient joins
Parallel ExecutionFaster processing

Strategies

  • Move query to data
  • Move data to query
  • Hybrid approach

Query Processing Example (Complete)

Full Workflow

User Query

Query Decomposition

Data Localization

Global Optimization

Local Execution

Final Result

Centralised vs Distributed Query Processing

FeatureCentralizedDistributed
Data LocationSingleMultiple
ComplexityLowHigh
SpeedModerateHigh (parallel)

Important Exam Questions

Short Questions

  • Define query processing.
  • What is query decomposition?
  • What is data localisation?

Long Questions

  • Explain layers of query processing.
  • Describe query decomposition steps with an example.
  • Explain the localisation of distributed data.

Numerical/Case Questions

  • Convert SQL to relational algebra.
  • Explain the execution plan for the distributed query.

Final Summary

  • Query Processing → convert SQL → execution plan
  • Decomposition → break query
  • Localisation → map to sites
  • Goal → minimise cost + fast execution

Distributed Query Optimisation

Query optimisation is the process of selecting the most efficient execution plan for a query.

In a distributed DBMS, the goal is to minimise total cost (especially communication cost).

Why Needed?

  • The same query can have multiple execution plans
  • Choose the plan with:
    • Minimum time
    • Minimum data transfer
    • Efficient resource usage

Query Optimisation Flow

SQL Query

Query Parser

Multiple Execution Plans

Cost Estimation

Best Plan Selected

Cost Factors

FactorDescription
CPU CostProcessing time
I/O CostDisk access
Communication CostData transfer between sites
Response TimeTotal time

Most important in DDBMS → Communication Cost

Centralised Query Optimisation

Optimisation performed as if the data were located at a single site.

Features

  • Ignores data distribution
  • Uses traditional optimisation techniques
  • Simpler than distributed optimisation

Steps

  1. Parse query
  2. Convert to relational algebra
  3. Apply optimisation rules
  4. Generate execution plan

Centralized Optimization Diagram

Query

Relational Algebra

Optimization Rules

Execution Plan

Optimization Techniques

1. Selection Pushdown

Apply selection early.

σ marks > 80 (Students)

2. Projection Pushdown

Select only required columns.

3. Join Reordering

Change the order of joins to reduce cost.

Example

Instead of: (Students' ⨝ Marks) ⨝ Department

Use: Students ⨝ (Marks ⨝ Department)

Choose a smaller intermediate result.

Limitations

  • Does not consider network cost
  • Not suitable for distributed systems

Distributed Query Optimisation

Optimisation considering data distribution across multiple sites.

Goals

GoalDescription
Minimise Communication CostReduce data transfer
Parallel ExecutionFaster results
Load BalancingEfficient use of nodes

Distributed Optimisation Flow

Global Query

Fragment Mapping

Multiple Distributed Plans

Cost Evaluation

Best Distributed Plan

Key Decisions

  1. Where to execute the query?
  2. Where to perform joins?
  3. Whether to move data or query?

Strategies

1. Query Shipping

Send a query to the data site.

Query → Site → Execution → Result
2. Data Shipping

Bring data to the query site.

Data → Site → Query Execution
3. Hybrid Approach

Combine both.

Comparison

StrategyAdvantageDisadvantage
Query ShippingLess data transferRemote processing
Data ShippingSimpleHigh network cost
HybridBalancedComplex

Distributed Query Optimisation Algorithms

1. Heuristic-Based Optimisation

Uses rules (heuristics) instead of full cost calculation.

Rules

  • Apply selection early
  • Reduce intermediate results
  • Perform joins efficiently

Advantages

  • Fast
  • Simple

Disadvantages

  • Not always optimal

2. Cost-Based Optimisation

Evaluates multiple query plans using cost functions.

Steps

  1. Generate possible plans
  2. Estimate cost
  3. Select the minimum cost plan

Cost Formula

Total Cost = CPU + I/O + Communication

Advantages

  • Accurate
  • Optimal plan

Disadvantages

  • Time-consuming
  • Complex

3. Dynamic Programming Algorithm

Breaks the query into smaller sub-problems.

Features

  • Stores intermediate results
  • Avoids recomputation

Working

Small Plans → Combine → Optimal Plan

Advantage

  • Efficient for complex queries

4. Genetic Algorithm (Advanced)

Uses evolutionary techniques to find an optimal plan.

Features

  • Works for a large search space
  • Near-optimal solutions

Algorithm Comparison

AlgorithmSpeedAccuracy
HeuristicFastMedium
Cost-BasedSlowHigh
Dynamic ProgrammingMediumHigh
GeneticMediumVery High

Join Optimisation in Distributed Systems

Join Strategies

  • Nested Loop Join: Simple but slow
  • Hash Join: Faster for large data

Semi-Join 

Semi-Join Process

Site1 → Send join attribute → Site2
Site2 → Filter data → Send back

Advantage

  • Reduces data transfer

Example of Distributed Optimisation

Query

SELECT name FROM Students S, Marks M
WHERE S.id = M.id;

Optimization Steps

  1. Identify data location
  2. Choose a join site
  3. Apply semi-join
  4. Minimize transfer

Centralised vs Distributed Optimisation

FeatureCentralizedDistributed
Data LocationSingleMultiple
Cost FactorCPU, I/O+ Communication
ComplexityLowHigh

Final Workflow

Query

Decomposition

Localization

Optimization

Execution

Important Exam Questions

Short Questions

  • Define query optimization
  • What is semi-join?
  • The difference between heuristic and cost-based

Long Questions

  • Explain distributed query optimisation
  • Describe optimisation algorithms
  • Compare centralised vs. distributed optimisation

Case-Based

  • Optimise a distributed join query
  • Explain query shipping vs data shipping

Final Summary

  • Optimisation → choose the best plan
  • Centralised → ignores distribution
  • Distributed → minimises communication cost
  • Algorithms → Heuristic, Cost-based, DP, Genetic
  • Semi-join → key technique