Data Warehouse Process and Technology

Data Warehouse Process

The Data Warehouse Process refers to the complete lifecycle involved in planning, designing, building, managing, and maintaining a data warehouse so that it effectively supports business decision-making.

In simple words: It is how a data warehouse is planned, built, operated, and supported.

Warehousing Strategy

A Warehousing Strategy defines why, what, and how data will be stored in the warehouse.

Objectives of Warehousing Strategy

Improve decision-making
Provide a single version of truth
Reduce reporting time
Support business growth

Types of Warehousing Strategies

Strategy	Explanation
Enterprise Data Warehouse (EDW)	Centralized warehouse for entire organization
Data Mart Strategy	Department-specific warehouses
Top-Down Approach	Build EDW first, then data marts
Bottom-Up Approach	Build data marts first, then integrate
Hybrid Approach	Combination of top-down and bottom-up

For exams:

Top-down = costly but integrated
Bottom-up = faster and cheaper

Warehouse Management and Support Processes

Once the warehouse is built, it must be managed and supported continuously.

Key Warehouse Management Processes

Process	Purpose
Data Loading	Regular ETL data refresh
Metadata Management	Maintain data definitions
Performance Monitoring	Ensure fast query response
Security Management	Control user access
Backup & Recovery	Prevent data loss
Error Handling	Fix data inconsistencies

Support Processes

Support Process	Description
User Training	Teach users how to use tools
Query Support	Assist in complex queries
Change Management	Handle schema or requirement changes
System Auditing	Track usage and performance

Warehouse Planning and Implementation

This is the most important practical section for MCA students.

Steps in Warehouse Planning & Implementation

Step 1: Business Requirement Analysis

Identify business goals
Decide KPIs and reports

Step 2: Technical Assessment

Data source analysis
Hardware and software selection

Step 3: Data Modeling

Choose schema (Star/Snowflake)
Design fact and dimension tables

Step 4: ETL Design

Extract data
Clean and transform
Load warehouse

Step 5: Testing

Data accuracy testing
Performance testing

Step 6: Deployment & Maintenance

Release for users
Continuous monitoring

Planning Challenges

Challenge	Reason
Data quality	Inconsistent source data
Cost	Hardware and tools are expensive
Scalability	Growing data volume
User adoption	Lack of training

Hardware for Data Warehousing

A data warehouse handles huge volumes of data, so powerful hardware is required.

Hardware Components

Component	Role
CPU	Handles query processing
Memory (RAM)	Faster data access
Storage	Stores historical data
Network	Connects data sources
Backup Devices	Data protection

Hardware Architecture Types

Architecture	Description
SMP (Shared Memory)	Multiple CPUs share memory
MPP (Shared Nothing)	Each processor has own memory
NUMA	Hybrid memory architecture

MPP architecture is most popular for large warehouses.

Operating Systems for Data Warehousing

The Operating System (OS) controls how hardware resources are used.

OS Requirements for Data Warehousing

Requirement	Reason
Scalability	Handle growing data
Reliability	24×7 availability
Parallel Processing	Faster query execution
Security	Protect sensitive data

Common Operating Systems Used

OS	Usage
Linux	Most widely used (cost-effective)
UNIX	High-end enterprise systems
Windows Server	Small to medium warehouses

Combined Summary Table (Exam Revision)

Topic	Key Points
Warehousing Strategy	Defines approach & scope
Management Processes	Monitoring, security, metadata
Support Processes	User training, change management
Planning & Implementation	Step-by-step lifecycle
Hardware	High-performance systems
Operating Systems	Linux & UNIX preferred

Exam-Oriented Conclusion

Data Warehouse process covers strategy to maintenance
Proper planning ensures success
Management and support processes keep system reliable
Hardware and OS selection directly affect performance
Parallel processing is critical for large datasets

Client / Server Computing Model & Data Warehousing

The Client/Server Computing Model is the foundation on which most data warehouse systems are built.

What is Client/Server Computing?

Client: Requests data (user interface, reports, dashboards)
Server: Processes requests and manages data

Role in Data Warehousing

Clients run OLAP tools
Servers store and process warehouse data
Enables centralized storage and distributed access

Types of Client/Server Architecture

Type	Description
2-Tier	Client ↔ Database Server
3-Tier	Client ↔ Application Server ↔ Database
N-Tier	Multiple layers for scalability

Data warehouses mostly use 3-tier architecture.

Parallel Processors & Cluster Systems

Large data warehouses require high-speed processing, achieved using parallel processors and cluster systems.

Parallel Processing (Simple Explanation)

Tasks are divided into smaller parts
Multiple processors work simultaneously
Results are combined

Types of Parallel Architectures

Architecture	Description
Shared Memory	All CPUs share same memory
Shared Disk	Each CPU has private memory but common disk
Shared Nothing	Each CPU has own memory & disk

Shared Nothing (MPP) is best for large warehouses.

Cluster Systems

Group of independent computers
Work together as a single system
High availability and fault tolerance

Distributed DBMS Implementations

A Distributed DBMS (DDBMS) stores data across multiple locations but appears as a single database to users.

Key Features

Feature	Explanation
Data Distribution	Data stored at multiple sites
Transparency	User sees one database
Scalability	Easy to expand
Fault Tolerance	Failure at one site doesn’t stop system

Implementation Techniques

Technique	Description
Fragmentation	Data split into parts
Replication	Copies of data at multiple sites
Allocation	Placement of data fragments

Warehousing Software

Warehousing Software refers to tools used to build, manage, and analyze data warehouses.

Categories of Warehousing Software

Software Type	Purpose
ETL Tools	Extract, transform, load data
Database Software	Store warehouse data
OLAP Tools	Multidimensional analysis
Reporting Tools	Reports and dashboards
Metadata Tools	Manage data definitions

Popular Examples (For Theory)

ETL: Informatica, Talend
DB: Oracle, SQL Server, Teradata
OLAP: Cognos, SSAS

Warehouse Schema Design

Warehouse Schema Design defines how data is structured inside the warehouse.

Basic Components

Component	Meaning
Fact Table	Stores numeric measures
Dimension Table	Stores descriptive data
Measure	Sales, profit, quantity

Star Schema

Feature	Description
Structure	One fact table, many dimensions
Complexity	Simple
Performance	Fast queries
Usage	Most common

Snowflake Schema

Feature	Description
Structure	Normalized dimensions
Complexity	High
Storage	Less redundancy
Performance	Slower than star

Fact Constellation (Galaxy Schema)

Feature	Description
Fact Tables	Multiple
Dimensions	Shared
Complexity	Very high
Usage	Enterprise-level warehouses

Comparative Summary (Exam-Ready Table)

Topic	Key Idea
Client/Server Model	Centralized server, distributed clients
Parallel Processing	Faster query execution
Cluster Systems	High availability
Distributed DBMS	Data across multiple locations
Warehousing Software	ETL, OLAP, reporting
Schema Design	Star, Snowflake, Galaxy

Exam-Friendly Conclusion

Client/Server model supports scalable warehousing
Parallel processors improve performance
Cluster systems ensure reliability
Distributed DBMS improves availability
Proper schema design is critical for speed and analysis

Data Warehouse Process and Technology