Query Processing & Decomposition in Distributed DBM
Query Processing
Query processing is the process of translating a user query into an efficient execution plan and retrieving data from distributed databases.
Example
SELECT name FROM Students WHERE marks > 80;
👉 In DDBMS, this query may run across multiple sites.
Query Processing Flow
User Query (SQL)
↓
Query Parser
↓
Query Optimizer
↓
Execution Plan
↓
Distributed Execution
↓
Result
Objectives of Query Processing
Main Goals
| Objective | Description |
|---|---|
| Minimize Response Time | Fast query results |
| Reduce Communication Cost | Less data transfer |
| Efficient Resource Use | CPU, memory optimization |
| Parallel Processing | Execute at multiple sites |
Key Focus: Communication cost is most important in distributed systems.
Characterisation of Query Processors
Describes features and behaviour of query processors in DDBMS.
Characteristics
| Feature | Description |
|---|---|
| Distributed Execution | Runs on multiple nodes |
| Parallelism | Simultaneous execution |
| Optimization | Finds the best execution plan |
| Transparency | The user is unaware of the distribution |
Types of Query Processors
| Type | Description |
|---|---|
| Centralized | Single-site processing |
| Distributed | Multiple sites |
| Parallel | Simultaneous execution |
Layers of Query Processing
Query processing is divided into multiple layers for efficiency.
Layered Architecture
User Query
↓
Query Decomposition Layer
↓
Data Localization Layer
↓
Global Optimization Layer
↓
Local Optimization Layer
↓
Execution Layer
Explanation of Layers
1. Query Decomposition Layer
- Breaks query into smaller parts
- Converts SQL to relational algebra
2. Data Localisation Layer
- Identifies where data is stored
3. Global Optimisation Layer
- Creates an efficient distributed plan
4. Local Optimisation Layer
- Optimises queries at each site
5. Execution Layer
- Executes the query and returns the result
Query Decomposition
Process of breaking a high-level query into simpler operations.
Steps in Query Decomposition
Decomposition Process
SQL Query
↓
Parsing
↓
Normalization
↓
Relational Algebra Conversion
↓
Optimization
Explanation
1. Parsing
- Check syntax & correctness
2. Normalisation
- Simplify query
3. Relational Algebra
-
Convert into operations like:
- Selection (σ)
- Projection (π)
- Join (⨝)
Example
SQL: SELECT name FROM Students WHERE marks > 80;
Relational Algebra:π name (σ marks > 80 (Students))
Localisation of Distributed Data
Process of mapping a global query to local data fragments.
Why Needed?
- Data is distributed across sites
- Query must be executed where data exists
Localization Diagram
Global Query
↓
Fragment Mapping
↓
Local Queries
↓
Execution at Sites
Example
Global Table: Students
Fragmented as:
- Site1 → Students with marks > 50
- Site2 → Students with marks ≤ 50
Query:
SELECT name FROM Students;
👉 Converted to:
- Query Site1
- Query Site2
- Combine results
Query Optimisation in Distributed Systems
Key Factors
| Factor | Description |
|---|---|
| Data Transfer | Minimise network usage |
| Join Strategy | Efficient joins |
| Parallel Execution | Faster processing |
Strategies
- Move query to data
- Move data to query
- Hybrid approach
Query Processing Example (Complete)
Full Workflow
User Query
↓
Query Decomposition
↓
Data Localization
↓
Global Optimization
↓
Local Execution
↓
Final Result
Centralised vs Distributed Query Processing
| Feature | Centralized | Distributed |
|---|---|---|
| Data Location | Single | Multiple |
| Complexity | Low | High |
| Speed | Moderate | High (parallel) |
Important Exam Questions
Short Questions
- Define query processing.
- What is query decomposition?
- What is data localisation?
Long Questions
- Explain layers of query processing.
- Describe query decomposition steps with an example.
- Explain the localisation of distributed data.
Numerical/Case Questions
- Convert SQL to relational algebra.
- Explain the execution plan for the distributed query.
Final Summary
- Query Processing → convert SQL → execution plan
- Decomposition → break query
- Localisation → map to sites
- Goal → minimise cost + fast execution
Distributed Query Optimisation
Query optimisation is the process of selecting the most efficient execution plan for a query.
In a distributed DBMS, the goal is to minimise total cost (especially communication cost).
Why Needed?
- The same query can have multiple execution plans
-
Choose the plan with:
- Minimum time
- Minimum data transfer
- Efficient resource usage
Query Optimisation Flow
SQL Query
↓
Query Parser
↓
Multiple Execution Plans
↓
Cost Estimation
↓
Best Plan Selected
Cost Factors
| Factor | Description |
|---|---|
| CPU Cost | Processing time |
| I/O Cost | Disk access |
| Communication Cost | Data transfer between sites |
| Response Time | Total time |
Most important in DDBMS → Communication Cost
Centralised Query Optimisation
Optimisation performed as if the data were located at a single site.
Features
- Ignores data distribution
- Uses traditional optimisation techniques
- Simpler than distributed optimisation
Steps
- Parse query
- Convert to relational algebra
- Apply optimisation rules
- Generate execution plan
Centralized Optimization Diagram
Query
↓
Relational Algebra
↓
Optimization Rules
↓
Execution Plan
Optimization Techniques
1. Selection Pushdown
Apply selection early.
σ marks > 80 (Students)
2. Projection Pushdown
Select only required columns.
3. Join Reordering
Change the order of joins to reduce cost.
Example
Instead of: (Students' ⨝ Marks) ⨝ Department
Use: Students ⨝ (Marks ⨝ Department)
Choose a smaller intermediate result.
Limitations
- Does not consider network cost
- Not suitable for distributed systems
Distributed Query Optimisation
Optimisation considering data distribution across multiple sites.
Goals
| Goal | Description |
|---|---|
| Minimise Communication Cost | Reduce data transfer |
| Parallel Execution | Faster results |
| Load Balancing | Efficient use of nodes |
Distributed Optimisation Flow
Global Query
↓
Fragment Mapping
↓
Multiple Distributed Plans
↓
Cost Evaluation
↓
Best Distributed Plan
Key Decisions
- Where to execute the query?
- Where to perform joins?
- Whether to move data or query?
Strategies
1. Query Shipping
Send a query to the data site.
2. Data ShippingQuery → Site → Execution → Result
Bring data to the query site.
3. Hybrid ApproachData → Site → Query Execution
Combine both.
Comparison
| Strategy | Advantage | Disadvantage |
|---|---|---|
| Query Shipping | Less data transfer | Remote processing |
| Data Shipping | Simple | High network cost |
| Hybrid | Balanced | Complex |
Distributed Query Optimisation Algorithms
1. Heuristic-Based Optimisation
Uses rules (heuristics) instead of full cost calculation.
Rules
- Apply selection early
- Reduce intermediate results
- Perform joins efficiently
Advantages
- Fast
- Simple
Disadvantages
- Not always optimal
2. Cost-Based Optimisation
Evaluates multiple query plans using cost functions.
Steps
- Generate possible plans
- Estimate cost
- Select the minimum cost plan
Cost Formula
Total Cost = CPU + I/O + Communication
Advantages
- Accurate
- Optimal plan
Disadvantages
- Time-consuming
- Complex
3. Dynamic Programming Algorithm
Breaks the query into smaller sub-problems.
Features
- Stores intermediate results
- Avoids recomputation
Working
Small Plans → Combine → Optimal Plan
Advantage
- Efficient for complex queries
4. Genetic Algorithm (Advanced)
Uses evolutionary techniques to find an optimal plan.
Features
- Works for a large search space
- Near-optimal solutions
Algorithm Comparison
| Algorithm | Speed | Accuracy |
|---|---|---|
| Heuristic | Fast | Medium |
| Cost-Based | Slow | High |
| Dynamic Programming | Medium | High |
| Genetic | Medium | Very High |
Join Optimisation in Distributed Systems
Join Strategies
- Nested Loop Join: Simple but slow
- Hash Join: Faster for large data
Semi-Join
Semi-Join Process
Site1 → Send join attribute → Site2
Site2 → Filter data → Send back
Advantage
- Reduces data transfer
Example of Distributed Optimisation
Query
SELECT name FROM Students S, Marks M
WHERE S.id = M.id;
Optimization Steps
- Identify data location
- Choose a join site
- Apply semi-join
- Minimize transfer
Centralised vs Distributed Optimisation
| Feature | Centralized | Distributed |
|---|---|---|
| Data Location | Single | Multiple |
| Cost Factor | CPU, I/O | + Communication |
| Complexity | Low | High |
Final Workflow
Query
↓
Decomposition
↓
Localization
↓
Optimization
↓
Execution
Important Exam Questions
Short Questions
- Define query optimization
- What is semi-join?
- The difference between heuristic and cost-based
Long Questions
- Explain distributed query optimisation
- Describe optimisation algorithms
- Compare centralised vs. distributed optimisation
Case-Based
- Optimise a distributed join query
- Explain query shipping vs data shipping
Final Summary
- Optimisation → choose the best plan
- Centralised → ignores distribution
- Distributed → minimises communication cost
- Algorithms → Heuristic, Cost-based, DP, Genetic
- Semi-join → key technique