Data Warehouse



Overview of Data Warehousing

In today’s world, organizations generate a huge amount of data every day from different sources such as sales systems, banking software, websites, mobile apps, and ERP systems. This data is scattered, unorganized, and stored in different formats.

Data Warehousing is a technique used to collect, store, integrate, and manage large volumes of data from multiple sources in one central place so that meaningful analysis and decision-making can be done.

Simply put: Operational systems run the business, while data warehouses help analyze the business.

Definition of Data Warehouse

A Data Warehouse is a centralized repository that stores integrated, historical, and subject-oriented data to support decision-making and reporting.

Key Characteristics

FeatureMeaning (Simple Language)
Subject-orientedData is organized by topics like sales, customers, finance
IntegratedData comes from multiple sources and is made consistent
Time-variantStores historical data (years/months/days)
Non-volatileData is not frequently changed or deleted

Data Warehousing Components

A data warehouse system is made up of several important components that work together.

Major Components Explained

ComponentDescription
Data SourcesOLTP systems, databases, ERP, CRM, files, APIs
ETL ToolsExtract, Transform, Load data into the warehouse
Staging AreaTemporary storage before data is cleaned
Data Warehouse DatabaseCentral storage of processed data
Metadata RepositoryStores data definitions and rules
OLAP ToolsUsed for analysis and reporting
Front-End ToolsDashboards, reports, data visualization

Building a Data Warehouse (Step-by-Step)

Building a data warehouse is a systematic process.

Step 1: Requirement Analysis

  • Identify business needs
  • Decide what data is required
  • Understand reporting requirements

Step 2: Data Source Identification

  • Identify operational databases
  • Understand data formats and structures

Step 3: ETL Process

  • Extract data from sources
  • Transform data (clean, remove duplicates, standardize)
  • Load data into the warehouse

Step 4: Warehouse Design

  • Choose schema (Star or Snowflake)
  • Design fact tables and dimension tables

Step 5: Data Storage

  • Store historical and summarized data
  • Optimize for fast querying

Step 6: Testing and Deployment

  • Validate data accuracy
  • Deploy warehouse for users

Warehouse Database

The Warehouse Database is the core of the data warehouse where all cleaned and integrated data is stored.

Types of Tables Used

Table TypePurpose
Fact TableStores quantitative data (sales amount, quantity)
Dimension TableStores descriptive data (customer, time, product)
Aggregate TableStores summarized data for faster queries

Schema Types

https://miro.medium.com/0%2A0suSNUOuzRO0VYN2.png

SchemaDescription
Star SchemaSimple design, one fact table connected to dimensions
Snowflake SchemaNormalized dimensions, complex structure
Galaxy SchemaMultiple fact tables

Mapping the Data Warehouse to a Multiprocessor Architecture

To handle large volumes of data efficiently, data warehouses are mapped to multiprocessor systems.

Why Multiprocessor Architecture?

  • Faster query processing
  • High scalability
  • Better performance for large datasets

Types of Multiprocessor Architectures

ArchitectureDescription
Shared MemoryAll processors share the same memory
Shared DiskEach processor has its own memory but shares disk
Shared NothingEach processor has its own memory and disk

How Mapping Works

  • Data is partitioned across processors
  • Queries are executed in parallel
  • Results are combined for final output

Advantages of Data Warehousing

BenefitExplanation
Better decision-makingUses historical and integrated data
Faster queriesOptimized for analysis
Data consistencySingle version of truth
Business intelligenceSupports forecasting and trends

Real-Life Example

Example: Retail Company

  • Sales data from multiple branches
  • Customer data from CRM
  • Inventory data from ERP

All data is stored in a data warehouse
Managers analyze sales trends, customer behavior, and inventory needs

Summary (Exam-Friendly)

  • Data Warehousing stores large historical data
  • Helps in analysis, not daily transactions
  • Uses ETL process
  • Organized using fact and dimension tables
  • Supports parallel processing using multiprocessor architecture

Difference Between Database System and Data Warehouse

Multidimensional Data Model, Data Cubes, Star, Snowflake, Fact Constellation Concepts

Difference Between Database System and Data Warehouse

This is a very important MCA exam question and is often asked in table form.

BasisDatabase System (OLTP)Data Warehouse (OLAP)
PurposeDaily operationsAnalysis & decision making
Data TypeCurrent, real-time dataHistorical data
Data SizeSmall to mediumVery large
UpdatesFrequent insert, update, deleteRarely updated
QueriesSimple queriesComplex analytical queries
UsersClerks, staffManagers, analysts
ExampleBanking transactionSales trend analysis
PerformanceOptimized for speed of transactionsOptimized for reporting
Data StructureHighly normalizedDenormalized

In simple words:

  • Database = runs the business
  • Data Warehouse = analyzes the business

Multidimensional Data Model

The Multidimensional Data Model is used in data warehouses to represent data in multiple dimensions (views).

Key Elements

ElementMeaning
FactNumeric data (sales, profit)
DimensionDescriptive data (time, product, location)
MeasureValues stored in fact table

Example

Sales can be analyzed by:

  • Time (Year, Month)
  • Product
  • Region

Data Cubes

A Data Cube is a logical representation of multidimensional data.

Explanation in Simple Language

Think of a cube where:

  • X-axis = Product
  • Y-axis = Location
  • Z-axis = Time

Each cell shows sales value.

Operations on Data Cube

OperationMeaning
Roll-upSummarize data
Drill-downView detailed data
SliceSelect one dimension
DiceSelect multiple dimensions
PivotRotate cube view

Star Schema Concept

The Star Schema is the simplest and most commonly used schema in data warehouses.

Structure

  • One Fact Table in the center
  • Multiple Dimension Tables around it
  • Shape looks like a ⭐ star

Advantages

AdvantageReason
Simple designEasy to understand
Fast queriesFewer joins
Easy maintenanceSimple structure

Snowflake Schema Concept

The Snowflake Schema is an extension of the star schema with normalized dimensions.

Key Features

FeatureDescription
NormalizedDimension tables are split
ComplexMore joins required
StorageLess redundancy

Star vs Snowflake (Quick View)

BasisStar SchemaSnowflake Schema
ComplexityLowHigh
JoinsFewerMore
StorageMoreLess

Fact Constellation (Galaxy Schema)

A Fact Constellation schema contains multiple fact tables sharing common dimension tables.

Example

  • Fact Tables: Sales, Shipping
  • Shared Dimensions: Time, Product, Location

Characteristics

FeatureDescription
Fact TablesMore than one
ComplexityHigh
UsageLarge enterprises
FlexibilityVery high

Concept Summary (One-Look Revision Table)

ConceptMeaning
Database SystemHandles daily transactions
Data WarehouseSupports analysis
Multidimensional ModelData viewed from many angles
Data CubeLogical multidimensional structure
Star SchemaSimple fact-dimension design
Snowflake SchemaNormalized dimension design
Fact ConstellationMultiple fact tables

Exam-Friendly Conclusion

  • Data Warehouse is read-oriented
  • Uses multidimensional data model
  • Data cubes improve analytical queries
  • Star schema is simplest
  • Snowflake reduces redundancy
  • Fact constellation supports complex analysis