Unit 1: Introduction to Data Warehousing and Data Mining



Introduction to Data Warehousing and Data Mining

Data Warehousing

A data warehouse is a large, central storage system where an organization keeps all its important data collected from different departments like sales, finance, HR, marketing, operations, etc.

Simple meaning: It is like a big storage room where all company data is cleaned, organized, and stored so managers can use it anytime for decision-making.

Key Features of a Data Warehouse

  • Subject-oriented: focuses on key areas like customers, products, sales, cost.
  • Integrated: data from different systems is combined in one place.
  • Time-variant: stores historical data for months/years to understand trends.
  • Non-volatile: once data is stored, it is not frequently changed or deleted.

Business Value of Data Warehousing

A data warehouse helps companies:

  • Make better and faster decisions using organised data.
  • Find trends and patterns, like sales growth or customer behavior.
  • Improve forecasting, budgeting, and planning.
  • Reduce data duplication and data errors.
  • Make information available to all managers in one place.

Example: Big retail companies like Amazon or Walmart use data warehouses to analyze customer purchases and plan stock levels.

Introduction and Evolution of Data Mining

Data Mining means discovering useful patterns, relationships, and insights from large datasets.

Simple meaning: It is like digging for gold in a huge pile of data to find valuable information.

Evolution of Data Mining

  • 1970s–1980s: Basic data collection and simple statistical analysis.
  • 1990s: Growth of databases → need for better techniques → birth of “data mining.”
  • 2000s: Use of machine learning and advanced algorithms.
  • Today: Data mining combined with AI, Big Data, cloud computing, predictive analytics.

Now data mining helps companies predict behaviour, not just study past data.

Goals of Data Mining

The main goals include:

  • Prediction - Predict future outcomes Example: predicting sales next month.
  • Classification - Group things into categories Example: classifying customers as “loyal,” “new,” “high-value.”
  • Clustering - Finding natural groups in data Example: grouping customers by buying habits.
  • Association - Finding relationships between items Example: “People who buy bread also buy butter.”
  • Anomaly Detection - Finding unusual patterns Example: detecting fraud in credit card transactions.
  • Trend Analysis - Studying long-term movement in data Example: analyzing consumer preferences over time.

Myths About Data Mining (Common Misunderstandings)

Myth 1: Data mining automatically gives perfect answers

Reality: It helps discover patterns but still needs expert interpretation.

Myth 2: Data mining works without good data

Reality: If the data is poor, results will also be poor (“garbage in, garbage out”).

Myth 3: Data mining is only for big companies

Reality: Even small businesses use tools like Google Analytics and CRM analytics.

Myth 4: Data mining is only about technology

Reality: It also needs business understanding and human judgment.

Myth 5: Data mining replaces human decision-making

Reality: It supports decisions but does not replace managers.

Myth 6: Data mining is just another name for statistics

Reality: It uses statistics, but also machine learning, AI, and database systems.

Data Mining Process 

Data mining follows a systematic process known as the Knowledge Discovery in Databases (KDD) process.

Step 1: Problem Definition - Understand the business problem. Example: “Why are customers leaving?” or “Which products will sell more next month?”

Step 2: Data Collection - Collect data from different sources like CRM, ERP, sales, website, social media, etc.

Step 3: Data Cleaning - Remove errors, duplicates, missing values, Convert raw data into usable format.

Step 4: Data Integration - Combine data from multiple sources into one dataset.

Step 5: Data Selection - Select only relevant variables (e.g., age, income, buying history).

Step 6: Data Transformation - Normalize, convert formats, and prepare data for modelling.

Step 7: Data Mining / Pattern Discovery

Use techniques such as:

  • Classification
  • Clustering
  • Association rules
  • Prediction
  • Regression
  • Anomaly detection

Step 8: Pattern Evaluation - Verify if patterns are meaningful and useful for the business.

Step 9: Knowledge Presentation - Show results in charts, dashboards, reports for managers to take decisions.

Simple example: A bank uses data mining to predict loan defaults by analyzing past customer data.

Business Relevance of Data Mining & Data Warehousing

Why Data Mining is Useful in Business

Data mining helps companies:

  • Understand customer behaviour
  • Predict sales, risks, and trends
  • Improve marketing strategies
  • Reduce fraud
  • Increase profitability
  • Improve customer satisfaction

Examples:

  • Amazon uses data mining for product recommendations.
  • Banks use it for fraud detection.
  • Telecom companies use it to reduce customer churn.

Data Warehousing vs. OLTP Systems

FeatureData Warehousing (DW)OLTP (Online Transaction Processing)
PurposeAnalysis & decision-makingDay-to-day operations
Data TypeHistorical + summarized dataReal-time transactional data
UsersManagers, analystsFrontline staff, cashiers, salespeople
OperationsComplex queries, reportsInsert, update, delete transactions
SpeedOptimized for readingOptimized for writing
ExamplesSales trends, forecastingATM withdrawals, order entry

Simple Explanation:

  • OLTP = daily work system (e.g., booking ticket, banking transaction)
  • Data Warehouse = thinking system (e.g., which route has most bookings?)

Roles of Data Warehousing and Data Mining in Business Decisions

Role of Data Warehousing

  • Stores clean, organized, historical data for analysis.
  • Helps managers create dashboards and understand long-term trends.
  • Supports strategic planning like pricing, budgeting, forecasting.

Role of Data Mining

  • Finds hidden patterns and relationships in warehouse data.
  • Helps predict future outcomes and risks.
  • Supports tactical decisions like customer targeting, cross-selling, fraud detection.

Together:
Data warehouse = “Data storage brain”
Data mining = “Data analysis intelligence”

Basic Architecture of a Data Warehouse System

Below is a simple architecture in steps:

A. Data Sources

  • Operational databases (OLTP)
  • Excel files
  • CRM, ERP systems
  • Social media, sensors, mobile apps

B. ETL Layer (Extract, Transform, Load)

This is the backbone of the data warehouse:

  1. Extract data from multiple sources
  2. Transform (clean, filter, convert, check for errors)
  3. Load into the warehouse for storage

C. Data Storage Area

It includes:

  • Data Warehouse (main storage)
  • Data Marts (small, department-specific databases) – Example: marketing mart, finance mart
  • Metadata repository (information about data)

D. OLAP (Online Analytical Processing) Engine

Used for:

  • Fast queries
  • Drilling down/up (year → month → day)
  • Slicing/dicing reports
  • Multi-dimensional analysis (product × region × time)

E. Front-End Tools / User Interface

Used by managers and employees:

  • Dashboards
  • Reports
  • Data visualization
  • BI tools (Power BI, Tableau, QlikView)

Simple Diagram

Data Sources → ETL → Data Warehouse → OLAP → Dashboards/Reports/BI Tools

Layers of a Data Warehouse Architecture

A complete data warehouse has four major layers:

A. Staging Layer (Data Staging Area)

What it does:

  • Collects and stores raw data temporarily
  • Cleans data, removes errors, duplicates, missing values
  • Transforms data into a standard format

Think of it as: A “washing machine” where dirty data is cleaned before storage.

Example: Sales data from different branches (Delhi, Mumbai, Jaipur) is collected here and cleaned.

B. Integration Layer (Data Integration or ETL Layer)

What it does:

  • Combines data from multiple sources
  • Applies business rules
  • Loads the integrated data into the main warehouse

Think of it as: A “kitchen” where all raw ingredients are mixed and prepared.

Example: Merging online + offline sales into one unified view.

C. Access Layer (Data Access or Presentation Layer)

What it does:

  • Provides access for end users (managers, analysts)
  • Supports reporting, dashboards, OLAP queries, data mining

Think of it as: A “restaurant serving area” where finished dishes (reports) are presented.

Example: Marketing manager sees monthly sales trends on a dashboard.

D. Enterprise Data Warehouse (EDW)

What it does:

  • Stores integrated, cleaned, historical data
  • Acts as the central brain of the organization’s data
  • Feeds data marts (Finance, HR, Marketing)

Think of it as: A “central library” of all company data.

Example: EDW contains 10 years of customer purchasing history used for decision-making.

Simple Text Diagram

Data Sources ↓ Staging Layer (cleaning + formatting) ↓ Integration Layer (ETL + combining data) ↓ Enterprise Data Warehouse (central storage) ↓ Access Layer (reports, dashboards, OLAP, mining)

Knowledge Discovery in Databases (KDD)

KDD is the overall process of discovering useful knowledge from large datasets.
Data Mining is only one step inside the KDD process.

KDD = Data selection + cleaning + transformation + mining + interpretation

Steps in the KDD Process

Step 1: Data Selection - Selecting relevant data from databases. Example: Selecting customer demographics for churn prediction.

Step 2: Data Preprocessing (Cleaning) - Fixing missing values, removing duplicates, correcting errors. Example: Filling missing age values using averages.

Step 3: Data Transformation - Converting data into suitable formats. Example: Normalizing income to ranges (low, medium, high).

Step 4: Data Mining

Applying algorithms to identify patterns:

  • Classification
  • Clustering
  • Prediction
  • Association Rules
  • Regression

Example: Identifying which customers are likely to buy a new credit card.

Step 5: Interpretation & Evaluation - Understanding discovered patterns and converting them into insights. Example: “Customers aged 25–35 with high online activity are more likely to buy premium products.”

Step 6: Knowledge Utilization - Business uses the results to take decisions. Example: Marketing team targets these customers with personalized ads.

Simple Diagram of KDD Process

Data → Selection → Cleaning → Transformation → Data Mining → Evaluation → Knowledge

Business Applications of KDD

1. Marketing

  • Customer segmentation
  • Targeted advertising
  • Cross-selling & upselling

Example: Amazon suggestions = KDD + data mining.

2. Finance

  • Credit scoring
  • Fraud detection
  • Loan default prediction

Example: Banks detect unusual transactions using anomaly detection.

3. Retail

  • Market basket analysis
  • Store layout design
  • Inventory forecasting

Example: Walmart analyzes purchase patterns to place products together.

4. Telecom

  • Churn prediction
  • Service quality optimization

5. Healthcare

  • Disease prediction
  • Patient risk analysis

6. E-Commerce

  • Personalized recommendations
  • Dynamic pricing
  • Customer lifetime value prediction

Quick Summary

ConceptMeaningBusiness Benefit
Staging LayerRaw data cleaningHigh-quality data
Integration LayerMerge dataSingle version of truth
EDWCentral data storageLong-term decision-making
Access LayerReporting toolsFast insights
KDDFull knowledge discovery processBetter business strategy
Data MiningPattern identificationPrediction & automation