Unit 1: Introduction to Data Warehousing and Data Mining
Introduction to Data Warehousing and Data Mining
Data Warehousing
A data warehouse is a large, central storage system where an organization keeps all its important data collected from different departments like sales, finance, HR, marketing, operations, etc.
Simple meaning: It is like a big storage room where all company data is cleaned, organized, and stored so managers can use it anytime for decision-making.
Key Features of a Data Warehouse
- Subject-oriented: focuses on key areas like customers, products, sales, cost.
- Integrated: data from different systems is combined in one place.
- Time-variant: stores historical data for months/years to understand trends.
- Non-volatile: once data is stored, it is not frequently changed or deleted.
Business Value of Data Warehousing
A data warehouse helps companies:
- Make better and faster decisions using organised data.
- Find trends and patterns, like sales growth or customer behavior.
- Improve forecasting, budgeting, and planning.
- Reduce data duplication and data errors.
- Make information available to all managers in one place.
Example: Big retail companies like Amazon or Walmart use data warehouses to analyze customer purchases and plan stock levels.
Introduction and Evolution of Data Mining
Data Mining means discovering useful patterns, relationships, and insights from large datasets.
Simple meaning: It is like digging for gold in a huge pile of data to find valuable information.
Evolution of Data Mining
- 1970s–1980s: Basic data collection and simple statistical analysis.
- 1990s: Growth of databases → need for better techniques → birth of “data mining.”
- 2000s: Use of machine learning and advanced algorithms.
- Today: Data mining combined with AI, Big Data, cloud computing, predictive analytics.
Now data mining helps companies predict behaviour, not just study past data.
Goals of Data Mining
The main goals include:
- Prediction - Predict future outcomes Example: predicting sales next month.
- Classification - Group things into categories Example: classifying customers as “loyal,” “new,” “high-value.”
- Clustering - Finding natural groups in data Example: grouping customers by buying habits.
- Association - Finding relationships between items Example: “People who buy bread also buy butter.”
- Anomaly Detection - Finding unusual patterns Example: detecting fraud in credit card transactions.
- Trend Analysis - Studying long-term movement in data Example: analyzing consumer preferences over time.
Myths About Data Mining (Common Misunderstandings)
Myth 1: Data mining automatically gives perfect answers
Reality: It helps discover patterns but still needs expert interpretation.
Myth 2: Data mining works without good data
Reality: If the data is poor, results will also be poor (“garbage in, garbage out”).
Myth 3: Data mining is only for big companies
Reality: Even small businesses use tools like Google Analytics and CRM analytics.
Myth 4: Data mining is only about technology
Reality: It also needs business understanding and human judgment.
Myth 5: Data mining replaces human decision-making
Reality: It supports decisions but does not replace managers.
Myth 6: Data mining is just another name for statistics
Reality: It uses statistics, but also machine learning, AI, and database systems.
Data Mining Process
Data mining follows a systematic process known as the Knowledge Discovery in Databases (KDD) process.
Step 1: Problem Definition - Understand the business problem. Example: “Why are customers leaving?” or “Which products will sell more next month?”
Step 2: Data Collection - Collect data from different sources like CRM, ERP, sales, website, social media, etc.
Step 3: Data Cleaning - Remove errors, duplicates, missing values, Convert raw data into usable format.
Step 4: Data Integration - Combine data from multiple sources into one dataset.
Step 5: Data Selection - Select only relevant variables (e.g., age, income, buying history).
Step 6: Data Transformation - Normalize, convert formats, and prepare data for modelling.
Step 7: Data Mining / Pattern Discovery
Use techniques such as:
- Classification
- Clustering
- Association rules
- Prediction
- Regression
- Anomaly detection
Step 8: Pattern Evaluation - Verify if patterns are meaningful and useful for the business.
Step 9: Knowledge Presentation - Show results in charts, dashboards, reports for managers to take decisions.
Simple example: A bank uses data mining to predict loan defaults by analyzing past customer data.
Business Relevance of Data Mining & Data Warehousing
Why Data Mining is Useful in Business
Data mining helps companies:
- Understand customer behaviour
- Predict sales, risks, and trends
- Improve marketing strategies
- Reduce fraud
- Increase profitability
- Improve customer satisfaction
Examples:
- Amazon uses data mining for product recommendations.
- Banks use it for fraud detection.
- Telecom companies use it to reduce customer churn.
Data Warehousing vs. OLTP Systems
| Feature | Data Warehousing (DW) | OLTP (Online Transaction Processing) |
|---|---|---|
| Purpose | Analysis & decision-making | Day-to-day operations |
| Data Type | Historical + summarized data | Real-time transactional data |
| Users | Managers, analysts | Frontline staff, cashiers, salespeople |
| Operations | Complex queries, reports | Insert, update, delete transactions |
| Speed | Optimized for reading | Optimized for writing |
| Examples | Sales trends, forecasting | ATM withdrawals, order entry |
Simple Explanation:
- OLTP = daily work system (e.g., booking ticket, banking transaction)
- Data Warehouse = thinking system (e.g., which route has most bookings?)
Roles of Data Warehousing and Data Mining in Business Decisions
Role of Data Warehousing
- Stores clean, organized, historical data for analysis.
- Helps managers create dashboards and understand long-term trends.
- Supports strategic planning like pricing, budgeting, forecasting.
Role of Data Mining
- Finds hidden patterns and relationships in warehouse data.
- Helps predict future outcomes and risks.
- Supports tactical decisions like customer targeting, cross-selling, fraud detection.
Together:
Data warehouse = “Data storage brain”
Data mining = “Data analysis intelligence”
Basic Architecture of a Data Warehouse System
Below is a simple architecture in steps:
A. Data Sources
- Operational databases (OLTP)
- Excel files
- CRM, ERP systems
- Social media, sensors, mobile apps
B. ETL Layer (Extract, Transform, Load)
This is the backbone of the data warehouse:
- Extract data from multiple sources
- Transform (clean, filter, convert, check for errors)
- Load into the warehouse for storage
C. Data Storage Area
It includes:
- Data Warehouse (main storage)
- Data Marts (small, department-specific databases) – Example: marketing mart, finance mart
- Metadata repository (information about data)
D. OLAP (Online Analytical Processing) Engine
Used for:
- Fast queries
- Drilling down/up (year → month → day)
- Slicing/dicing reports
- Multi-dimensional analysis (product × region × time)
E. Front-End Tools / User Interface
Used by managers and employees:
- Dashboards
- Reports
- Data visualization
- BI tools (Power BI, Tableau, QlikView)
Simple Diagram
Layers of a Data Warehouse Architecture
A complete data warehouse has four major layers:
A. Staging Layer (Data Staging Area)
What it does:
- Collects and stores raw data temporarily
- Cleans data, removes errors, duplicates, missing values
- Transforms data into a standard format
Think of it as: A “washing machine” where dirty data is cleaned before storage.
Example: Sales data from different branches (Delhi, Mumbai, Jaipur) is collected here and cleaned.
B. Integration Layer (Data Integration or ETL Layer)
What it does:
- Combines data from multiple sources
- Applies business rules
- Loads the integrated data into the main warehouse
Think of it as: A “kitchen” where all raw ingredients are mixed and prepared.
Example: Merging online + offline sales into one unified view.
C. Access Layer (Data Access or Presentation Layer)
What it does:
- Provides access for end users (managers, analysts)
- Supports reporting, dashboards, OLAP queries, data mining
Think of it as: A “restaurant serving area” where finished dishes (reports) are presented.
Example: Marketing manager sees monthly sales trends on a dashboard.
D. Enterprise Data Warehouse (EDW)
What it does:
- Stores integrated, cleaned, historical data
- Acts as the central brain of the organization’s data
- Feeds data marts (Finance, HR, Marketing)
Think of it as: A “central library” of all company data.
Example: EDW contains 10 years of customer purchasing history used for decision-making.
Simple Text Diagram
Knowledge Discovery in Databases (KDD)
KDD is the overall process of discovering useful knowledge from large datasets.
Data Mining is only one step inside the KDD process.
KDD = Data selection + cleaning + transformation + mining + interpretation
Steps in the KDD Process
Step 1: Data Selection - Selecting relevant data from databases. Example: Selecting customer demographics for churn prediction.
Step 2: Data Preprocessing (Cleaning) - Fixing missing values, removing duplicates, correcting errors. Example: Filling missing age values using averages.
Step 3: Data Transformation - Converting data into suitable formats. Example: Normalizing income to ranges (low, medium, high).
Step 4: Data Mining
Applying algorithms to identify patterns:
- Classification
- Clustering
- Prediction
- Association Rules
- Regression
Example: Identifying which customers are likely to buy a new credit card.
Step 5: Interpretation & Evaluation - Understanding discovered patterns and converting them into insights. Example: “Customers aged 25–35 with high online activity are more likely to buy premium products.”
Step 6: Knowledge Utilization - Business uses the results to take decisions. Example: Marketing team targets these customers with personalized ads.
Simple Diagram of KDD Process
Business Applications of KDD
1. Marketing
- Customer segmentation
- Targeted advertising
- Cross-selling & upselling
Example: Amazon suggestions = KDD + data mining.
2. Finance
- Credit scoring
- Fraud detection
- Loan default prediction
Example: Banks detect unusual transactions using anomaly detection.
3. Retail
- Market basket analysis
- Store layout design
- Inventory forecasting
Example: Walmart analyzes purchase patterns to place products together.
4. Telecom
- Churn prediction
- Service quality optimization
5. Healthcare
- Disease prediction
- Patient risk analysis
6. E-Commerce
- Personalized recommendations
- Dynamic pricing
- Customer lifetime value prediction
Quick Summary
| Concept | Meaning | Business Benefit |
|---|---|---|
| Staging Layer | Raw data cleaning | High-quality data |
| Integration Layer | Merge data | Single version of truth |
| EDW | Central data storage | Long-term decision-making |
| Access Layer | Reporting tools | Fast insights |
| KDD | Full knowledge discovery process | Better business strategy |
| Data Mining | Pattern identification | Prediction & automation |