Unit 2: Data Warehouse Modeling and Implementation
Data Warehouse Modeling and Implementation
Data warehouse modeling is about organizing data so it can be analyzed easily for business decisions.
The most common approach is Multidimensional Data Modeling.
Multidimensional Data Modeling (MDM)
MDM organizes data into facts and dimensions:
- Fact = numeric data related to business processes (e.g., sales amount, profit, quantity)
- Dimensions = descriptive attributes (e.g., time, product, region, customer)
Together, they form a cube that allows slicing, dicing, drill-down, and roll-up.
Star Schema
- The fact table is in the center.
- All dimension tables directly connect to it.
- Simple, fast, widely used.
Text Diagram (Star Schema)
Business Use Case
A retail company analyzing:
- Monthly sales
- Best-selling products
- Regional performance
Simple structure → quick reporting.
Snowflake Schema
- An extension of the star schema
- Dimension tables are normalized (split into sub-tables)
- More complex but saves storage
Text Diagram (Snowflake)
Business Use Case
Used when:
- Dimensions have hierarchical data
- Business wants better data accuracy and less redundancy Example: Product → Category → Department → Brand
Fact Constellation Schema (Galaxy Schema)
- Multiple fact tables share common dimension tables
- Used in large enterprises with many business processes
Text Diagram (Fact Constellation)
Business Use Case
Used by companies with:
- Sales data
- Inventory data
- Shipping data
- Finance data
All linked through shared dimensions like Time, Product, Region.
Example: Amazon, Walmart, Big retail chains
Dimensional Modeling
A design technique that structures data into facts and dimensions to support business decision-making efficiently.
Developed by Ralph Kimball.
Elements of Dimensional Modeling
- Fact tables
- Dimension tables
- Star, snowflake, or constellation schema
Steps in Dimensional Modeling (Business-Focused)
Step 1: Identify Business Process
Example:
- Sales
- Purchase
- Inventory
- Marketing campaign
Step 2: Identify Grain (Granularity)
Level of detail:
- Daily sales
- Monthly sales
- Per transaction
Step 3: Choose Dimensions
Examples:
- Time
- Customer
- Store
- Product
- Agent
Step 4: Identify Facts
Examples:
- Sales amount
- Profit
- Quantity sold
- Discount
Step 5: Build Schema
Choose star/snowflake depending on complexity.
Business Use Cases of Dimensional Modeling
Retail
- Sales forecasting
- Customer behavior analysis
- Inventory optimization
Banking
- Loan performance
- ATM transaction monitoring
- Fraud detection
E-commerce
- Customer lifetime value
- Web traffic analysis
- Recommendation engines
Telecom
- Call records analysis
- Churn prediction
- Profitability per user
Healthcare
- Patient history analysis
- Disease trend mapping
Quick Comparison Table
| Schema | Structure | Performance | Use Case |
|---|---|---|---|
| Star | Simple, denormalized | Fast | Retail, dashboards |
| Snowflake | Complex, normalized | Moderate | Data with hierarchies |
| Fact Constellation | Multiple facts | Slower but scalable | Large enterprises |
Very Simple Real-Life Example (Retail Store)
Fact Table: Sales Fact
- Sale amount
- Quantity
- Profit
Dimensions:
- Time (Year, Month, Day)
- Product (Name, Category, Brand)
- Customer (Gender, Age, Location)
- Store (City, Region)
Managers can now ask:
- Which product sells the most?
- Which region performs the best?
- What is the monthly trend?
This is the power of multidimensional modeling.
OLAP (Online Analytical Processing)
OLAP is a technology used in data warehouse systems to help managers and analysts analyze data quickly from multiple viewpoints (e.g., by time, product, region, customer).
It allows fast reporting, forecasting, trend analysis, and decision-making.
What is an OLAP Cube?
An OLAP cube is a multi-dimensional structure that stores data in a way that makes it easy to analyze.
Key Components
- Dimensions → descriptive categories (Time, Product, Geography, Customer, etc.)
- Measures (Facts) → numeric values (Sales amount, Profit, Quantity)
Simple Text Diagram (3D Cube)
Inside the cube, you can view:
- Sales by product
- Sales by time
- Sales by location
- Or any combination (e.g., monthly sales of mobiles in Delhi)
OLAP Operations
These operations help managers explore data from different angles.
1. Roll-Up
Roll-up means moving from detailed data to summarized data.
Example:
- From daily sales → monthly sales
- From city → state → country
Think: Zooming out on Google Maps.
2. Drill-Down (More Detail)
Drill-down is the opposite of roll-up.
It means moving from summary to details.
Example:
- From yearly sales → quarterly → monthly → daily
- From country level → state → district → city
Think: Zooming in on Google Maps.
3. Slice (Single Layer View)
Slice means selecting one value of a dimension to view only one section of the cube.
Example:
Show only:
- Sales for 2025
- Sales for Mobiles
- Sales for Delhi
Think: Cutting a single slice of bread from a loaf.
4. Dice (Multi-Dimensional Filtering)
Dice means selecting multiple values across dimensions.
Example:
Show only:
- Sales for Products = Mobiles + Laptops
- Time = Q1 + Q2
- Region = Delhi + Mumbai
Think: Cutting a cube into a smaller cube.
5. Pivot (Rotate View)
Pivot means changing the orientation of the cube to see data differently.
Example:
- Rows = Product
- Columns = Month → Pivot to
- Rows = Month
- Columns = Product
Think: Rotating a cube to see another side.
OLAP vs. OLTP
| Feature | OLAP | OLTP |
|---|---|---|
| Purpose | Analysis & decisions | Daily operations |
| Data | Historical, large volume | Real-time, transactional |
| Users | Managers, analysts | Frontline workers |
| Operations | Read-heavy (queries, reports) | Write-heavy (insert/update) |
| Examples | Sales reports, dashboards | ATM transactions, order booking |
| Speed | Slower for writes, fast for reads | Very fast transactions |
| Design | Star/Snowflake schema | ER schema |
Simple Explanation:
- OLTP = daily business operations
- OLAP = business intelligence & analysis
Applications of OLAP in Business Analytics
OLAP is widely used for decision-making across industries.
A. Marketing
- Customer segmentation
- Campaign performance
- Customer lifetime value
- Pricing analytics
Example: Flipkart analyzes monthly category sales via OLAP cubes.
B. Finance & Banking
- Credit risk analysis
- Fraud detection
- Profitability analysis
Example: Banks monitor loan performance by region and time.
C. Retail
- Best-selling products
- Store-wise performance
- Inventory planning
Example: Walmart uses OLAP for real-time inventory analysis.
D. E-Commerce
- Conversion rate analysis
- Abandoned cart analysis
- Product recommendation insights
E. Supply Chain & Operations
- Vendor performance
- Demand forecasting
- Logistics optimization
F. HR Analytics
- Headcount trend
- Employee productivity
- Attrition forecasting
Simple OLAP Cube Example (For Class Use)
Dimensions:
- Time → Year, Quarter, Month
- Product → Category, Brand
- Geography → State, City
Measure:
-
Sales Amount
Managers can ask:
- What were the sales of smartphones in Delhi in Q3?
- What is the bestselling product category in 2025?
- Which region has declining sales?
All answered instantly using OLAP.
ETL Processes in Data Warehousing
ETL stands for:
- E = Extraction
- T = Transformation
- L = Loading
It is the backbone of any data warehouse.
1. Data Extraction (E)
Extracting data from different source systems.
Sources include:
- Databases (Oracle, MySQL)
- ERP systems (SAP)
- CRM systems (Salesforce)
- Excel files
- E-commerce platforms
- Social media
Types of Extraction:
- Full Extraction – all data extracted at once
- Incremental Extraction – only new or updated data is extracted
Example: Flipkart extracts sales, customer, product, and delivery data daily from multiple systems.
Data Transformation (T)
Convert extracted data into clean, consistent, usable form.
Major activities:
- Data cleaning (remove duplicates, nulls, errors)
- Standardization (formatting dates, names)
- Integration (combining datasets)
- Filtering (remove unnecessary data)
- Aggregation (daily → monthly)
- Validation (check accuracy)
- Encoding/conversion (e.g., "M" = "Male")
Example: “UP”, “U.P.” and “Uttar Pradesh” → standardized to “Uttar Pradesh”.
Data Loading (L)
Loading transformed data into:
- Data warehouse (historical storage)
- Data marts (department-level data)
Loading types:
- Initial Load – first full load
- Incremental Load – periodic updates
- Full Refresh – replace old data entirely
Example: Daily sales data is loaded at midnight into Amazon’s warehouse.
Simple ETL Diagram
Data Integration & Metadata Management
Data Integration
Combining data from multiple sources into a single, unified view.
Why important?
- Removes data silos
- Creates “single version of truth”
- Helps management make accurate decisions
Example:
Bank integrates data from:
- Credit cards
- Savings accounts
- Loans
Metadata Management
Metadata = Data about Data
Types of metadata:
- Technical metadata – table names, column types
- Business metadata – meaning of data (e.g., "Net Sales = Revenue − Returns")
- Operational metadata – load times, data lineage, job logs
Why metadata is important?
- Helps understand how data is created
- Improves data governance
- Supports data auditing and quality assurance
Data Quality and Improvement Methods
High-quality data is accurate, complete, consistent, reliable, and timely.
Methods to improve data quality
| Method | Explanation | Example |
|---|---|---|
| Validation rules | Check for errors | Minimum salary = 8000 |
| Standardization | Same format everywhere | Date = YYYY-MM-DD |
| Deduplication | Remove duplicates | Same customer registered twice |
| Data cleansing tools | Use tools like Informatica, Talend | Automated data correction |
| Master Data Management (MDM) | Create single master records | One customer profile |
| Data profiling | Analyze data before loading | Identify missing values |
Warehouse Implementation Architectures
There are three main architectures:
A. Centralized Data Warehouse
Features:
- One large enterprise data warehouse (EDW)
- All departments fetch data from one place
Benefits:
- Single version of truth
- Easy to manage
- Better data governance
Drawback:
-
Expensive and time-consuming to implement
B. Federated Data Warehouse
Features:
- No single central warehouse
- Data is accessed from multiple distributed sources
- Uses virtual integration
Benefits:
- Lower cost
- Faster implementation
Drawbacks:
- Data inconsistency
- Performance issues
Example: A company keeps data in separate department databases but connects them using middleware.
C. Real-Time Data Warehouse (Active Warehouse)
Features:
- Data is updated continuously (near real-time)
- Supports real-time dashboards
Use Cases:
- Stock markets
- E-commerce personalization
- Fraud detection
Example: Swiggy updates delivery and order tracking data in real-time.
Implementation Challenges
Major challenges:
- Data quality issues
- Legacy systems incompatibility
- High cost and time to implement
- Integration complexity
- Lack of skilled staff
- Resistance to change
- Security and privacy issues
Best Practices in Warehouse Implementation
Best practices for success:
- Start small → grow gradually (agile approach)
- Use strong ETL tools (Informatica, Talend, SSIS)
- Ensure data governance framework
- Maintain high-quality metadata
- Include business users early
- Build scalable architecture
- Regular data auditing
- Implement strong security controls
- Provide user training and documentation
Quick Revision Table
| Topic | Explanation | Business Benefit |
|---|---|---|
| ETL | Extract–Transform–Load process | Clean & reliable data |
| Data Integration | Combine data from sources | Unified analytics |
| Metadata | Data about data | Better governance |
| Data Quality | Accuracy & consistency | Trusted decisions |
| Centralized warehouse | One EDW | Strong governance |
| Federated warehouse | Distributed data | Cost saving |
| Real-time warehouse | Live data updates | Fast decisions |