Data Warehouse Process and Technology



Data Warehouse Process 

The Data Warehouse Process refers to the complete lifecycle involved in planning, designing, building, managing, and maintaining a data warehouse so that it effectively supports business decision-making.

In simple words: It is how a data warehouse is planned, built, operated, and supported.

Warehousing Strategy

A Warehousing Strategy defines why, what, and how data will be stored in the warehouse.

Objectives of Warehousing Strategy

  • Improve decision-making
  • Provide a single version of truth
  • Reduce reporting time
  • Support business growth

Types of Warehousing Strategies

StrategyExplanation
Enterprise Data Warehouse (EDW)Centralized warehouse for entire organization
Data Mart StrategyDepartment-specific warehouses
Top-Down ApproachBuild EDW first, then data marts
Bottom-Up ApproachBuild data marts first, then integrate
Hybrid ApproachCombination of top-down and bottom-up

For exams:

  • Top-down = costly but integrated
  • Bottom-up = faster and cheaper

Warehouse Management and Support Processes

Once the warehouse is built, it must be managed and supported continuously.


Key Warehouse Management Processes

ProcessPurpose
Data LoadingRegular ETL data refresh
Metadata ManagementMaintain data definitions
Performance MonitoringEnsure fast query response
Security ManagementControl user access
Backup & RecoveryPrevent data loss
Error HandlingFix data inconsistencies

Support Processes

Support ProcessDescription
User TrainingTeach users how to use tools
Query SupportAssist in complex queries
Change ManagementHandle schema or requirement changes
System AuditingTrack usage and performance

Warehouse Planning and Implementation

This is the most important practical section for MCA students.

Steps in Warehouse Planning & Implementation

Step 1: Business Requirement Analysis

  • Identify business goals
  • Decide KPIs and reports

Step 2: Technical Assessment

  • Data source analysis
  • Hardware and software selection

Step 3: Data Modeling

  • Choose schema (Star/Snowflake)
  • Design fact and dimension tables

Step 4: ETL Design

  • Extract data
  • Clean and transform
  • Load warehouse

Step 5: Testing

  • Data accuracy testing
  • Performance testing

Step 6: Deployment & Maintenance

  • Release for users
  • Continuous monitoring

Planning Challenges

ChallengeReason
Data qualityInconsistent source data
CostHardware and tools are expensive
ScalabilityGrowing data volume
User adoptionLack of training

Hardware for Data Warehousing

A data warehouse handles huge volumes of data, so powerful hardware is required.

Hardware Components

ComponentRole
CPUHandles query processing
Memory (RAM)Faster data access
StorageStores historical data
NetworkConnects data sources
Backup DevicesData protection

Hardware Architecture Types

ArchitectureDescription
SMP (Shared Memory)Multiple CPUs share memory
MPP (Shared Nothing)Each processor has own memory
NUMAHybrid memory architecture

MPP architecture is most popular for large warehouses.

Operating Systems for Data Warehousing

The Operating System (OS) controls how hardware resources are used.

OS Requirements for Data Warehousing

RequirementReason
ScalabilityHandle growing data
Reliability24×7 availability
Parallel ProcessingFaster query execution
SecurityProtect sensitive data

Common Operating Systems Used

OSUsage
LinuxMost widely used (cost-effective)
UNIXHigh-end enterprise systems
Windows ServerSmall to medium warehouses

Combined Summary Table (Exam Revision)

TopicKey Points
Warehousing StrategyDefines approach & scope
Management ProcessesMonitoring, security, metadata
Support ProcessesUser training, change management
Planning & ImplementationStep-by-step lifecycle
HardwareHigh-performance systems
Operating SystemsLinux & UNIX preferred

Exam-Oriented Conclusion

  • Data Warehouse process covers strategy to maintenance
  • Proper planning ensures success
  • Management and support processes keep system reliable
  • Hardware and OS selection directly affect performance
  • Parallel processing is critical for large datasets

Client / Server Computing Model & Data Warehousing

The Client/Server Computing Model is the foundation on which most data warehouse systems are built.

What is Client/Server Computing?

  • Client: Requests data (user interface, reports, dashboards)
  • Server: Processes requests and manages data

Role in Data Warehousing

  • Clients run OLAP tools
  • Servers store and process warehouse data
  • Enables centralized storage and distributed access

Types of Client/Server Architecture

TypeDescription
2-TierClient ↔ Database Server
3-TierClient ↔ Application Server ↔ Database
N-TierMultiple layers for scalability

Data warehouses mostly use 3-tier architecture.

Parallel Processors & Cluster Systems

Large data warehouses require high-speed processing, achieved using parallel processors and cluster systems.

Parallel Processing (Simple Explanation)

  • Tasks are divided into smaller parts
  • Multiple processors work simultaneously
  • Results are combined

Types of Parallel Architectures

ArchitectureDescription
Shared MemoryAll CPUs share same memory
Shared DiskEach CPU has private memory but common disk
Shared NothingEach CPU has own memory & disk

Shared Nothing (MPP) is best for large warehouses.

Cluster Systems

  • Group of independent computers
  • Work together as a single system
  • High availability and fault tolerance

Distributed DBMS Implementations

A Distributed DBMS (DDBMS) stores data across multiple locations but appears as a single database to users.

Key Features

FeatureExplanation
Data DistributionData stored at multiple sites
TransparencyUser sees one database
ScalabilityEasy to expand
Fault ToleranceFailure at one site doesn’t stop system

Implementation Techniques

TechniqueDescription
FragmentationData split into parts
ReplicationCopies of data at multiple sites
AllocationPlacement of data fragments

Warehousing Software

Warehousing Software refers to tools used to build, manage, and analyze data warehouses.

Categories of Warehousing Software

Software TypePurpose
ETL ToolsExtract, transform, load data
Database SoftwareStore warehouse data
OLAP ToolsMultidimensional analysis
Reporting ToolsReports and dashboards
Metadata ToolsManage data definitions

Popular Examples (For Theory)

  • ETL: Informatica, Talend
  • DB: Oracle, SQL Server, Teradata
  • OLAP: Cognos, SSAS

Warehouse Schema Design

Warehouse Schema Design defines how data is structured inside the warehouse.

Basic Components

ComponentMeaning
Fact TableStores numeric measures
Dimension TableStores descriptive data
MeasureSales, profit, quantity

Star Schema

FeatureDescription
StructureOne fact table, many dimensions
ComplexitySimple
PerformanceFast queries
UsageMost common

Snowflake Schema

FeatureDescription
StructureNormalized dimensions
ComplexityHigh
StorageLess redundancy
PerformanceSlower than star

Fact Constellation (Galaxy Schema)

FeatureDescription
Fact TablesMultiple
DimensionsShared
ComplexityVery high
UsageEnterprise-level warehouses

Comparative Summary (Exam-Ready Table)

TopicKey Idea
Client/Server ModelCentralized server, distributed clients
Parallel ProcessingFaster query execution
Cluster SystemsHigh availability
Distributed DBMSData across multiple locations
Warehousing SoftwareETL, OLAP, reporting
Schema DesignStar, Snowflake, Galaxy

Exam-Friendly Conclusion

  • Client/Server model supports scalable warehousing
  • Parallel processors improve performance
  • Cluster systems ensure reliability
  • Distributed DBMS improves availability
  • Proper schema design is critical for speed and analysis