Unit 4: Data Mining Methods
Data Mining Methods: Association Rule Mining
Association Rule Mining is a data mining technique used to find relationships or patterns among items in large datasets.
It is mainly used to answer: “Which items are likely to be purchased or used together?”
Mining Frequent Patterns
Frequent patterns are combinations of items that appear often in a dataset.
Example:
In a supermarket:
- 60% of customers buy bread (the support of {bread} is 60%)
- Of the customers who buy bread, 40% also buy butter (the confidence of bread → butter is 40%)
So “bread → butter” is a frequent pattern.
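As a quick illustration of how such figures translate into support and confidence, here is a minimal sketch on made-up transactions (all data is hypothetical):

```python
# Toy illustration (made-up transactions): computing support and
# confidence for the rule bread -> butter.
transactions = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter"},
    {"milk"}, {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_bread = support({"bread"})           # 4 of 5 transactions = 0.8
sup_both = support({"bread", "butter"})  # 2 of 5 transactions = 0.4
confidence = sup_both / sup_bread        # 0.4 / 0.8 = 0.5
```

Support measures how common the itemset is overall; confidence measures how often butter appears given that bread was bought.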
Market Basket Analysis (MBA)
Market Basket Analysis is the most common application of association rule mining.
It helps a retailer understand:
- Which products are purchased together?
- What to place close on shelves?
- How to create combo offers?
Example:
If many customers buy:
- Chips + Coke
- Shampoo + Conditioner
- Mobile + Back Cover
Retailers can create:
- Bundles
- Discounts
- Better store layout
MBA is widely used by retailers such as Amazon, Flipkart, Big Bazaar, and D-Mart.
Apriori Algorithm (Core Concept)
Apriori is one of the oldest and most widely used algorithms for mining frequent itemsets.
Basic Idea:
- Count how often each item appears
- Generate frequent item combinations
- Remove combinations that occur rarely
- Create strong association rules
Why called Apriori?
Because it uses prior knowledge: all subsets of a frequent itemset must themselves be frequent, so larger candidate patterns are generated only from smaller frequent patterns.
Example:
- Frequent single items: {bread}, {butter}, {milk}
- Frequent pairs: {bread, butter}
- Frequent triplets: {bread, butter, milk}
Apriori eliminates combinations whose support falls below the minimum support threshold.
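The level-wise idea can be sketched in a few lines of Python; the transactions and the 60% minimum support below are made up for illustration:

```python
from itertools import combinations

# Minimal Apriori sketch on made-up transactions (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
min_support = 0.6  # keep itemsets appearing in >= 60% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
k = 1
# Level-wise search: size-k candidates are built only from itemsets that
# survived the previous level (the "prior knowledge" pruning idea).
current = [frozenset(c) for c in combinations(items, 1)]
while current:
    survivors = {c: support(c) for c in current if support(c) >= min_support}
    frequent.update(survivors)
    k += 1
    current = list({a | b for a in survivors for b in survivors
                    if len(a | b) == k})
```

Here the three single items and all three pairs survive, while {bread, butter, milk} (support 40%) is pruned.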
Advanced Techniques (Beyond Apriori)
| Technique | What it Does | Why Useful? |
|---|---|---|
| FP-Growth (Frequent Pattern Growth) | Compresses transactions into a tree (FP-tree) and mines it without generating candidate combinations | Much faster than Apriori |
| Eclat Algorithm | Uses vertical data format | Efficient for large datasets |
| Closed and Max Patterns | Reduce number of patterns | Avoid redundant rules |
| Fuzzy Association Rules | Finds rules with uncertainty | Useful in customer behaviour |
These methods improve performance and reduce computation time.
Constraint-Based Mining
Sometimes we do not want all patterns, only those meeting specific business constraints.
Examples of Constraints:
- Items with price > ₹500
- Transactions from premium customers
- Patterns related to electronics only
- Patterns with profit margin above 10%
Why useful?
It saves time and focuses on business-relevant rules.
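A constraint can be as simple as a filter applied to mined rules; the rules and prices below are hypothetical:

```python
# Hypothetical mined rules with made-up item prices.
# Constraint-based mining keeps only rules meeting a business condition.
prices = {"TV": 30000, "HDMI cable": 400, "phone": 15000, "case": 300}
rules = [("TV", "HDMI cable"), ("phone", "case"), ("case", "HDMI cable")]

# Constraint: keep rules whose antecedent item costs more than 500.
expensive_rules = [(a, b) for a, b in rules if prices[a] > 500]
```

In practice the constraint is pushed into the mining step itself so cheap patterns are never generated, which is where the time saving comes from.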
Correlation Mining
Correlation mining checks if items are truly related or just appear together by chance.
Example: People may buy milk + bread together simply because both are very common purchases, not because one drives the other, so the co-occurrence alone is not a strong relationship.
Correlation mining helps:
- Identify real relationships
- Avoid misleading patterns
- Improve promotional strategies
Metrics used:
- Lift
- Chi-square
- Leverage
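Lift and leverage are straightforward to compute from support values; the numbers below are illustrative and chosen so the two items turn out to be independent:

```python
# Lift and leverage from support values (illustrative numbers).
sup_a = 0.6    # support of milk
sup_b = 0.5    # support of bread
sup_ab = 0.3   # support of {milk, bread}

lift = sup_ab / (sup_a * sup_b)    # 1.0 means the items are independent
leverage = sup_ab - sup_a * sup_b  # 0.0 means the items are independent
```

Here lift is exactly 1.0: milk and bread appear together no more often than chance predicts, so the pattern is not a real relationship.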
Summary Table (Easy to Memorize)
| Topic | Meaning | Example |
|---|---|---|
| Frequent Patterns | Items appearing together often | Bread + Butter |
| Market Basket Analysis | Analyse purchase combinations | Chips + Coke |
| Apriori Algorithm | Step-by-step frequent pattern algorithm | Bread → Butter |
| FP-Growth | Fast pattern mining using trees | Big data transactions |
| Constraint-Based Mining | Patterns under conditions | Items > ₹500 |
| Correlation Mining | Check real vs fake relationships | Milk + bread correlation |
Classification (Easy Overview)
Classification assigns data to predefined categories (classes).
Examples:
- Spam vs. Not Spam
- Fraud vs. Legit Transaction
- High-value vs. Low-value customer
- Approve Loan vs. Reject Loan
Input: Labeled training data
Output: A model that predicts future labels.
Major Classification Techniques
A. Decision Trees
A decision tree is like a flowchart that splits data into branches based on rules (Yes/No decisions).
Why popular?
- Very easy to understand
- Visual representation
- Works well with business data
Real-life example: Bank Loan Decision
Business Uses:
- Credit scoring
- Customer churn prediction
- Medical diagnosis
- Sales forecasting
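A decision tree's flowchart logic can be written out as nested Yes/No conditions; the thresholds in this loan-decision sketch are invented for illustration:

```python
# A hand-written decision "tree" for a hypothetical bank loan decision.
# The rules and thresholds are made up, not learned from data.
def loan_decision(income, credit_score, existing_debt):
    if credit_score < 600:          # first split: credit history
        return "reject"
    if income > 50000:              # second split: income level
        return "approve"
    # Middle band: approve only if existing debt is low.
    return "approve" if existing_debt < 10000 else "reject"
```

A real tree-learning algorithm (e.g. ID3 or CART) chooses these splits automatically from labeled training data; the code just shows the shape of the resulting model.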
B. Bayesian Classifiers (Naive Bayes)
Uses probability to predict class.
Assumption: All features are independent (Naive assumption).
Example: Predict if a customer will buy a smartphone based on:
- Income
- Age
- Online usage
Simple, fast, and works well for:
- Spam filtering
- Sentiment analysis
- Text classification
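The probability calculation can be done by hand; the prior and conditional probabilities below are made-up numbers, not real data:

```python
# Naive Bayes by hand with made-up probabilities: will a customer buy a
# smartphone given high income and heavy online usage?
p_buy = 0.4
p_not = 0.6
# Conditional probabilities of each feature given the class,
# treated as independent (the "naive" assumption).
p_high_income = {"buy": 0.7, "not": 0.3}
p_heavy_usage = {"buy": 0.8, "not": 0.4}

score_buy = p_buy * p_high_income["buy"] * p_heavy_usage["buy"]  # 0.224
score_not = p_not * p_high_income["not"] * p_heavy_usage["not"]  # 0.072
p_buy_given_features = score_buy / (score_buy + score_not)
```

Since score_buy > score_not, the classifier predicts "buy"; normalizing the scores gives the predicted probability.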
C. Support Vector Machines (SVM)
SVM finds the boundary (a line in 2D, a hyperplane in general) that separates the classes with the widest possible margin.
Works well for:
- High-dimensional data
- Complex patterns
Example: Classifying images into:
- Cat vs Dog
- Tumor: Benign vs Malignant
Used in:
- Image recognition
- Bioinformatics
- Fraud detection
D. Rule-Based Classifiers
Rules are generated as:
IF condition THEN class
Example: IF income = high AND age < 30 THEN buys_smartphone = yes
Used in:
- CRM
- Retail decisions
- Targeted advertising
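A rule-based classifier is essentially an ordered list of IF-THEN rules with a default class; the conditions and labels here are hypothetical:

```python
# A tiny rule-based classifier: ordered IF-THEN rules plus a default.
# Rules, thresholds, and labels are made up for illustration.
rules = [
    (lambda c: c["spend"] > 50000, "premium"),
    (lambda c: c["age"] < 25 and c["spend"] > 10000, "young high-spender"),
]

def classify(customer, default="regular"):
    # Fire the first rule whose condition matches the customer.
    for condition, label in rules:
        if condition(customer):
            return label
    return default
```

For example, `classify({"age": 22, "spend": 20000})` fires the second rule. Rule order matters: the first matching rule wins.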
Prediction (Regression Techniques)
Prediction forecasts numerical values.
Types of Regression
A. Linear Regression
Predicts based on straight-line relationships.
Example: Predict monthly sales from advertising spend (one predictor, fitted with a straight line).
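A straight-line fit can be computed with the closed-form least-squares formulas; the spend and sales figures below are made up (and deliberately lie on a line so the fit is exact):

```python
# Simple least-squares fit of sales against advertising spend
# (made-up numbers): y = a*x + b.
xs = [10, 20, 30, 40]   # ad spend
ys = [25, 45, 65, 85]   # monthly sales

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    return a * x + b   # e.g. predicted sales at spend 50
```

Here the data follows y = 2x + 5 exactly, so the fitted slope is 2 and the intercept is 5.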
B. Multiple Regression
Uses multiple variables to predict a target.
Example: Predict property prices using:
- Area
- Location
- Number of rooms
C. Logistic Regression
Predicts category probability (yes/no outcome).
Used for:
- Loan default
- Churn
- Disease prediction
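Logistic regression passes a weighted sum through the sigmoid function to get a probability; the weights in this churn sketch are hypothetical, not fitted to data:

```python
import math

# Logistic regression turns a linear score into a probability via the
# sigmoid function. The weights and bias here are hypothetical.
def churn_probability(monthly_calls, complaints, w=(-0.1, 1.2), bias=-0.5):
    score = w[0] * monthly_calls + w[1] * complaints + bias
    return 1 / (1 + math.exp(-score))   # sigmoid: always between 0 and 1
```

With these weights, more complaints push the churn probability up while heavier usage pushes it down; training would estimate the weights from labeled data.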
Prediction Accuracy & Evaluation
Common metrics:
(A) Confusion Matrix
Shows:
- True positives
- True negatives
- False positives
- False negatives
(B) Accuracy
Correct predictions / total predictions
(C) Precision & Recall
Useful for fraud and spam detection.
(D) F1-Score
Balance between precision and recall.
(E) ROC Curve & AUC
Check model performance across thresholds.
(F) Mean Squared Error (MSE)
For regression models.
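The classification metrics above can all be computed directly from the confusion-matrix counts; the predictions below are a toy example (1 = fraud, 0 = legit):

```python
# Computing evaluation metrics by hand for toy predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)              # of predicted frauds, how many real
recall = tp / (tp + fn)                 # of real frauds, how many caught
f1 = 2 * precision * recall / (precision + recall)
```

For these eight predictions every metric works out to 0.75; on imbalanced data such as fraud, precision and recall are far more informative than accuracy alone.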
Ensemble Methods (Powerful Modern Techniques)
Ensemble = Combining multiple models to improve accuracy.
A. Bagging (Bootstrap Aggregating)
Example: Random Forest
Used for:
- Fraud detection
- Credit scoring
B. Boosting
Often the strongest-performing approach on structured (tabular) business data.
Example: XGBoost, AdaBoost, LightGBM
Used in:
- Kaggle competitions
- Customer churn prediction
- Sales forecasting
C. Stacking
Combines predictions from several different models using a meta-model trained on their outputs.
Business Use Cases of Classification & Prediction
| Industry | Use Case | Technique |
|---|---|---|
| Banking | Loan approval, fraud detection | SVM, Decision Trees |
| E-commerce | Product recommendation | Naive Bayes |
| Insurance | Risk scoring | Logistic regression |
| Retail | Sales forecasting | Linear regression |
| Telecom | Churn prediction | Random Forest |
| Healthcare | Disease detection | SVM, Decision Trees |
| Marketing | Customer segmentation | Rule-based classifiers |
Quick Summary Table
| Method | Purpose | Simple Meaning | Example |
|---|---|---|---|
| Decision Tree | Classification | Yes/No branches | Loan approval |
| Bayesian Classifier | Probability classification | Based on likelihood | Spam filtering |
| SVM | Separation boundary | Draw best dividing line | Cancer detection |
| Rule-Based | IF-THEN rules | Simple conditions | Target marketing |
| Regression | Predict numbers | Forecast quantities | Sales forecast |
| Ensemble | Improve accuracy | Combine models | Fraud analytics |
Clustering (Unsupervised Learning)
Clustering is a data mining method that groups similar data points into clusters without predefined labels.
Think of it as automatically discovering natural groups in data.
What is Clustering?
Clustering groups customers, objects, or data points based on similarity.
Examples:
- Grouping customers by buying behavior
- Grouping cities by weather patterns
- Grouping stocks based on performance
Clustering helps businesses understand patterns, segments, behaviors, and anomalies.
Major Clustering Algorithms
A. K-Means Clustering
Most popular, simple, and widely used.
How it works:
- Select the number of clusters (K)
- Randomly place K centroids
- Assign each point to the nearest centroid
- Recalculate centroids
- Repeat until stable
Example: Retail store wants 3 clusters:
- Cluster 1: Budget buyers
- Cluster 2: Mid-value buyers
- Cluster 3: Premium buyers
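The procedure can be sketched as a minimal one-dimensional k-means; the spend values and the hand-picked starting centroids are illustrative (a real implementation seeds them randomly):

```python
# Minimal 1-D k-means on made-up customer spend values.
# Centroids are seeded by hand instead of randomly for reproducibility.
spend = [100, 120, 90, 1000, 1100, 5000, 5200]
centroids = [100.0, 1000.0, 5000.0]   # K = 3 starting centroids

for _ in range(10):   # repeat until (practically) stable
    # Step 1: assign each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for x in spend:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Step 2: recalculate each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]
```

The three centroids settle near the budget, mid-value, and premium spend levels, matching the retail example above.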
Strengths:
- Fast
- Works well on large datasets
Limitations:
- Must choose K in advance
- Sensitive to outliers
B. Hierarchical Clustering
Builds a tree-like structure (dendrogram).
Types:
- Agglomerative → bottom-up (start with each point, merge groups)
- Divisive → top-down (start with one group, divide gradually)
Example:
Grouping students based on:
- Marks
- Attendance
- Activities
Strengths:
- No need to predefine number of clusters
- Visual (dendrogram)
Limitations:
- Very slow for big data
C. Density-Based Clustering (DBSCAN)
Forms clusters based on dense regions and labels low-density regions as outliers.
Useful for:
- Irregular-shaped clusters
- Outlier detection
Example:
- Detecting fraud transactions:
- A few transactions far away from normal behavior become outliers.
Strengths:
- Excellent for non-linear data
- Naturally detects outliers
Limitations:
- Hard to use in very high-dimensional data
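The density idea behind DBSCAN can be illustrated in one dimension: a point with too few neighbours within a radius eps is flagged as an outlier. The transaction amounts and thresholds below are made up:

```python
# The density idea behind DBSCAN, reduced to one dimension: points with
# fewer than min_pts neighbours within eps are flagged as outliers.
# Values and thresholds are hypothetical.
amounts = [100, 105, 98, 110, 102, 9000]   # transaction amounts
eps, min_pts = 20, 3

def is_outlier(x):
    # Count neighbours within eps, excluding the point itself.
    neighbours = sum(abs(x - y) <= eps for y in amounts) - 1
    return neighbours < min_pts

outliers = [x for x in amounts if is_outlier(x)]
```

The five normal amounts form a dense region while the 9000 transaction has no neighbours at all, so it is flagged, mirroring the fraud example above. Full DBSCAN additionally grows clusters outward from dense "core" points.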
D. Grid-Based Clustering
Divides the entire data space into a finite number of grid cells and clusters the cells rather than the individual points.
Examples:
- STING
- CLIQUE
Strengths:
- Very fast
- Useful for large databases
Applications:
- GIS (Geographical Information Systems)
- Sensor data
- Heatmap-based analysis
Clustering High-Dimensional Data
High-dimensional data = many attributes (e.g., 100+ features)
Examples:
- Genomics
- Text documents
- Customer behavior (100 features)
Challenges:
- Distance becomes less meaningful
- Algorithms become slow
Solutions:
- Principal Component Analysis (PCA)
- Dimensionality reduction
- Subspace clustering
- Feature selection
Outlier Detection
Outliers are data points that deviate strongly from the rest of the data and do not fit into any cluster.
Examples:
- Fraud credit card transactions
- Unusual login patterns
- Extreme medical values
- Fake reviews
Clustering algorithms like DBSCAN automatically detect outliers.
Business Applications of Clustering
A. Customer Segmentation
Retail/e-commerce clusters:
- Value shoppers
- Impulse buyers
- Loyal buyers
Used in:
- Personalized marketing
- Product recommendation
- Pricing strategies
B. Target Marketing
Target specific groups:
- Students
- Working professionals
- High-spending customers
Example: Banks target wealthy customers with premium credit cards.
C. Fraud Detection
Clustering helps detect:
- Transactions that are unusual
- Outlier behavior
- Fake identities
Banks and fintech companies use clustering to detect anomalies.
D. Image Processing
Group similar images for:
- Face recognition
- Medical image analysis
E. Supply Chain & Logistics
Cluster:
- Regions with similar demand
- Delivery locations
- Warehousing groups
Summary Table
| Algorithm | Core Idea | Strength | Limitation | Use Case |
|---|---|---|---|---|
| K-Means | Groups around centroids | Fast, simple | Needs K, sensitive to outliers | Customer segmentation |
| Hierarchical | Builds tree of clusters | No need K | Slow for big data | Social grouping, product clustering |
| DBSCAN | Finds dense areas | Detects outliers | Hard for high dimensions | Fraud detection |
| Grid-Based | Divides data into grids | Very fast | Less accurate | GIS, spatial data |