Unit 2: Data Warehouse Modeling and Implementation



Data Warehouse Modeling and Implementation

Data warehouse modeling is about organizing data so it can be analyzed easily for business decisions.
The most common approach is Multidimensional Data Modeling.

Multidimensional Data Modeling (MDM)

MDM organizes data into facts and dimensions:

  • Fact = numeric data related to business processes (e.g., sales amount, profit, quantity)
  • Dimensions = descriptive attributes (e.g., time, product, region, customer)

Together, they form a cube that allows slicing, dicing, drill-down, and roll-up.

Star Schema

  • The fact table is in the center.
  • All dimension tables directly connect to it.
  • Simple, fast, widely used.

Text Diagram (Star Schema)

Product Dimension ↑ Customer Dimension ← Fact TableTime Dimension ↓ Store Dimension

Business Use Case

A retail company analyzing:

  • Monthly sales
  • Best-selling products
  • Regional performance

Simple structure → quick reporting.

Snowflake Schema

  • An extension of the star schema
  • Dimension tables are normalized (split into sub-tables)
  • More complex but saves storage

Text Diagram (Snowflake)

Product CategoryProduct DimensionFact TableTime DimensionCalendarSupplier Dimension

Business Use Case

Used when:

  • Dimensions have hierarchical data
  • Business wants better data accuracy and less redundancy Example: Product → Category → Department → Brand

Fact Constellation Schema (Galaxy Schema)

  • Multiple fact tables share common dimension tables
  • Used in large enterprises with many business processes

Text Diagram (Fact Constellation)

Product Dimension ↑ ↑ Sales Fact ←––– –––→ Inventory Fact ↓ ↓ Time Dimension Store Dimension

Business Use Case

Used by companies with:

  • Sales data
  • Inventory data
  • Shipping data
  • Finance data

All linked through shared dimensions like Time, Product, Region.

Example: Amazon, Walmart, Big retail chains

Dimensional Modeling

A design technique that structures data into facts and dimensions to support business decision-making efficiently.

Developed by Ralph Kimball.

Elements of Dimensional Modeling

  • Fact tables
  • Dimension tables
  • Star, snowflake, or constellation schema

Steps in Dimensional Modeling (Business-Focused)

Step 1: Identify Business Process

Example:

  • Sales
  • Purchase
  • Inventory
  • Marketing campaign

Step 2: Identify Grain (Granularity)

Level of detail:

  • Daily sales
  • Monthly sales
  • Per transaction

Step 3: Choose Dimensions

Examples:

  • Time
  • Customer
  • Store
  • Product
  • Agent

Step 4: Identify Facts

Examples:

  • Sales amount
  • Profit
  • Quantity sold
  • Discount

Step 5: Build Schema

Choose star/snowflake depending on complexity.

Business Use Cases of Dimensional Modeling

Retail

  • Sales forecasting
  • Customer behavior analysis
  • Inventory optimization

Banking

  • Loan performance
  • ATM transaction monitoring
  • Fraud detection

E-commerce

  • Customer lifetime value
  • Web traffic analysis
  • Recommendation engines

Telecom

  • Call records analysis
  • Churn prediction
  • Profitability per user

Healthcare

  • Patient history analysis
  • Disease trend mapping

Quick Comparison Table

SchemaStructurePerformanceUse Case
StarSimple, denormalizedFastRetail, dashboards
SnowflakeComplex, normalizedModerateData with hierarchies
Fact ConstellationMultiple factsSlower but scalableLarge enterprises

Very Simple Real-Life Example (Retail Store)

Fact Table: Sales Fact

  • Sale amount
  • Quantity
  • Profit

Dimensions:

  • Time (Year, Month, Day)
  • Product (Name, Category, Brand)
  • Customer (Gender, Age, Location)
  • Store (City, Region)

Managers can now ask:

  • Which product sells the most?
  • Which region performs the best?
  • What is the monthly trend?

This is the power of multidimensional modeling.

OLAP (Online Analytical Processing)

OLAP is a technology used in data warehouse systems to help managers and analysts analyze data quickly from multiple viewpoints (e.g., by time, product, region, customer).

It allows fast reporting, forecasting, trend analysis, and decision-making.

What is an OLAP Cube?

An OLAP cube is a multi-dimensional structure that stores data in a way that makes it easy to analyze.

Key Components

  • Dimensions → descriptive categories (Time, Product, Geography, Customer, etc.)
  • Measures (Facts) → numeric values (Sales amount, Profit, Quantity)

Simple Text Diagram (3D Cube)

Product Dimension ↑ | Time Dimension → | ← Geography Dimension | [ OLAP CUBE ]

Inside the cube, you can view:

  • Sales by product
  • Sales by time
  • Sales by location
  • Or any combination (e.g., monthly sales of mobiles in Delhi)

OLAP Operations 

These operations help managers explore data from different angles.

1. Roll-Up 

Roll-up means moving from detailed data to summarized data.

Example:

  • From daily sales → monthly sales
  • From city → state → country

Think: Zooming out on Google Maps.

2. Drill-Down (More Detail)

Drill-down is the opposite of roll-up.
It means moving from summary to details.

Example:

  • From yearly sales → quarterly → monthly → daily
  • From country level → state → district → city

Think: Zooming in on Google Maps.

3. Slice (Single Layer View)

Slice means selecting one value of a dimension to view only one section of the cube.

Example:

Show only:

  • Sales for 2025
  • Sales for Mobiles
  • Sales for Delhi

Think: Cutting a single slice of bread from a loaf.

4. Dice (Multi-Dimensional Filtering)

Dice means selecting multiple values across dimensions.

Example:

Show only:

  • Sales for Products = Mobiles + Laptops
  • Time = Q1 + Q2
  • Region = Delhi + Mumbai

Think: Cutting a cube into a smaller cube.

5. Pivot (Rotate View)

Pivot means changing the orientation of the cube to see data differently.

Example:

  • Rows = Product
  • Columns = Month → Pivot to

  • Rows = Month
  • Columns = Product

Think: Rotating a cube to see another side.

OLAP vs. OLTP 

FeatureOLAPOLTP
PurposeAnalysis & decisionsDaily operations
DataHistorical, large volumeReal-time, transactional
UsersManagers, analystsFrontline workers
OperationsRead-heavy (queries, reports)Write-heavy (insert/update)
ExamplesSales reports, dashboardsATM transactions, order booking
SpeedSlower for writes, fast for readsVery fast transactions
DesignStar/Snowflake schemaER schema

Simple Explanation:

  • OLTP = daily business operations
  • OLAP = business intelligence & analysis

Applications of OLAP in Business Analytics

OLAP is widely used for decision-making across industries.

A. Marketing

  • Customer segmentation
  • Campaign performance
  • Customer lifetime value
  • Pricing analytics

Example: Flipkart analyzes monthly category sales via OLAP cubes.

B. Finance & Banking

  • Credit risk analysis
  • Fraud detection
  • Profitability analysis

Example: Banks monitor loan performance by region and time.

C. Retail

  • Best-selling products
  • Store-wise performance
  • Inventory planning

Example: Walmart uses OLAP for real-time inventory analysis.

D. E-Commerce

  • Conversion rate analysis
  • Abandoned cart analysis
  • Product recommendation insights

E. Supply Chain & Operations

  • Vendor performance
  • Demand forecasting
  • Logistics optimization

F. HR Analytics

  • Headcount trend
  • Employee productivity
  • Attrition forecasting

Simple OLAP Cube Example (For Class Use)

Dimensions:

  • Time → Year, Quarter, Month
  • Product → Category, Brand
  • Geography → State, City

Measure:

  • Sales Amount

Managers can ask:

  • What were the sales of smartphones in Delhi in Q3?
  • What is the bestselling product category in 2025?
  • Which region has declining sales?

All answered instantly using OLAP.

ETL Processes in Data Warehousing

ETL stands for:

  • E = Extraction
  • T = Transformation
  • L = Loading

It is the backbone of any data warehouse.

1. Data Extraction (E)

Extracting data from different source systems.

Sources include:

  • Databases (Oracle, MySQL)
  • ERP systems (SAP)
  • CRM systems (Salesforce)
  • Excel files
  • E-commerce platforms
  • Social media

Types of Extraction:

  1. Full Extraction – all data extracted at once
  2. Incremental Extraction – only new or updated data is extracted

Example: Flipkart extracts sales, customer, product, and delivery data daily from multiple systems.

Data Transformation (T)

Convert extracted data into clean, consistent, usable form.

Major activities:

  • Data cleaning (remove duplicates, nulls, errors)
  • Standardization (formatting dates, names)
  • Integration (combining datasets)
  • Filtering (remove unnecessary data)
  • Aggregation (daily → monthly)
  • Validation (check accuracy)
  • Encoding/conversion (e.g., "M" = "Male")

Example: “UP”, “U.P.” and “Uttar Pradesh” → standardized to “Uttar Pradesh”.

Data Loading (L)

Loading transformed data into:

  • Data warehouse (historical storage)
  • Data marts (department-level data)

Loading types:

  • Initial Load – first full load
  • Incremental Load – periodic updates
  • Full Refresh – replace old data entirely

Example: Daily sales data is loaded at midnight into Amazon’s warehouse.

Simple ETL Diagram

Data Sources (ERP, CRM, Excel, Web Apps) ↓ [ Extraction ][ Transformation ][ Loading ] ↓ Data Warehouse (EDW)

Data Integration & Metadata Management

Data Integration

Combining data from multiple sources into a single, unified view.

Why important?

  • Removes data silos
  • Creates “single version of truth”
  • Helps management make accurate decisions

Example:

Bank integrates data from:

  • Credit cards
  • Savings accounts
  • Loans

To build a complete customer profile.

Metadata Management

Metadata = Data about Data

Types of metadata:

  1. Technical metadata – table names, column types
  2. Business metadata – meaning of data (e.g., "Net Sales = Revenue − Returns")
  3. Operational metadata – load times, data lineage, job logs

Why metadata is important?

  • Helps understand how data is created
  • Improves data governance
  • Supports data auditing and quality assurance

Data Quality and Improvement Methods

High-quality data is accurate, complete, consistent, reliable, and timely.

Methods to improve data quality

MethodExplanationExample
Validation rulesCheck for errorsMinimum salary = 8000
StandardizationSame format everywhereDate = YYYY-MM-DD
DeduplicationRemove duplicatesSame customer registered twice
Data cleansing toolsUse tools like Informatica, TalendAutomated data correction
Master Data Management (MDM)Create single master recordsOne customer profile
Data profilingAnalyze data before loadingIdentify missing values

Warehouse Implementation Architectures

There are three main architectures:

A. Centralized Data Warehouse

Features:

  • One large enterprise data warehouse (EDW)
  • All departments fetch data from one place

Benefits:

  • Single version of truth
  • Easy to manage
  • Better data governance

Drawback:

  • Expensive and time-consuming to implement

B. Federated Data Warehouse

Features:

  • No single central warehouse
  • Data is accessed from multiple distributed sources
  • Uses virtual integration

Benefits:

  • Lower cost
  • Faster implementation

Drawbacks:

  • Data inconsistency
  • Performance issues

Example: A company keeps data in separate department databases but connects them using middleware.

C. Real-Time Data Warehouse (Active Warehouse)

Features:

  • Data is updated continuously (near real-time)
  • Supports real-time dashboards

Use Cases:

  • Stock markets
  • E-commerce personalization
  • Fraud detection

Example: Swiggy updates delivery and order tracking data in real-time.

Implementation Challenges

Major challenges:

  • Data quality issues
  • Legacy systems incompatibility
  • High cost and time to implement
  • Integration complexity
  • Lack of skilled staff
  • Resistance to change
  • Security and privacy issues

Best Practices in Warehouse Implementation

Best practices for success:

  • Start small → grow gradually (agile approach)
  • Use strong ETL tools (Informatica, Talend, SSIS)
  • Ensure data governance framework
  • Maintain high-quality metadata
  • Include business users early
  • Build scalable architecture
  • Regular data auditing
  • Implement strong security controls
  • Provide user training and documentation

Quick Revision Table

TopicExplanationBusiness Benefit
ETLExtract–Transform–Load processClean & reliable data
Data IntegrationCombine data from sourcesUnified analytics
MetadataData about dataBetter governance
Data QualityAccuracy & consistencyTrusted decisions
Centralized warehouseOne EDWStrong governance
Federated warehouseDistributed dataCost saving
Real-time warehouseLive data updatesFast decisions