Unit 4: Data Mining Methods



Data Mining Methods: Association Rule Mining 

Association Rule Mining is a data mining technique used to find relationships or patterns among items in large datasets.

It is mainly used to answer: “Which items are likely to be purchased or used together?”

Mining Frequent Patterns

Frequent patterns are combinations of items that appear often in a dataset.

Example:

In a supermarket:

  • 60% of customers buy bread
  • Out of these, 40% also buy butter

So “bread → butter” is a frequent pattern.
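In rule-mining terms, such a pattern is measured with support (how often the items appear) and confidence (how often butter follows bread). A minimal sketch with made-up transactions:

```python
# Toy example (made-up transactions) showing how support and
# confidence quantify the pattern "bread -> butter".
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread"},
    {"milk"},
    {"bread", "milk"},
]

n = len(transactions)
support_bread = sum("bread" in t for t in transactions) / n             # 4/5 = 0.8
support_both = sum({"bread", "butter"} <= t for t in transactions) / n  # 2/5 = 0.4
confidence = support_both / support_bread                               # 0.4/0.8 = 0.5

print(support_bread, support_both, confidence)
```

Here 40% of all baskets contain both items (support), and half of the bread buyers also take butter (confidence).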

Market Basket Analysis (MBA)

Market Basket Analysis is the most common application of association rule mining.

It helps a retailer understand:

  • Which products are purchased together?
  • Which products should be placed close together on shelves?
  • How to create combo offers?

Example:

If many customers buy:

  • Chips + Coke
  • Shampoo + Conditioner
  • Mobile + Back Cover

Retailers can create:

  • Bundles
  • Discounts
  • Better store layout

MBA is widely used by retailers and e-commerce platforms such as Amazon, Flipkart, Big Bazaar, and D-Mart.

Apriori Algorithm (Core Concept)

Apriori is one of the oldest and most widely used algorithms for mining frequent itemsets.

Basic Idea:

  • Count how often each item appears
  • Generate frequent item combinations
  • Remove combinations that occur rarely
  • Create strong association rules

Why called Apriori?

Because it uses prior knowledge: every subset of a frequent itemset must itself be frequent, so previously found frequent patterns are used to build the next round of candidates.

Example:

  • Frequent single items: {bread}, {butter}, {milk}
  • Frequent pairs: {bread, butter}
  • Frequent triplets: {bread, butter, milk}

Apriori eliminates itemsets whose support falls below the minimum support threshold.
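The candidate-generation and pruning loop described above can be sketched in plain Python (toy transactions and an illustrative minimum support of 60%, not a production implementation):

```python
# Minimal Apriori sketch: generate candidate itemsets level by level
# and prune any candidate below the minimum support threshold.
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
min_support = 0.6  # keep itemsets appearing in >= 60% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = {i for t in transactions for i in t}
frequent = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]

k = 2
all_frequent = list(frequent)
while frequent:
    # Join step: combine frequent (k-1)-itemsets into candidate k-itemsets
    candidates = {a | b for a in frequent for b in frequent if len(a | b) == k}
    # Prune step: keep only candidates meeting minimum support
    frequent = [c for c in candidates if support(c) >= min_support]
    all_frequent.extend(frequent)
    k += 1

print(sorted(tuple(sorted(s)) for s in all_frequent))
```

With this data, all single items and all pairs survive, but the triplet {bread, butter, milk} appears in only 2 of 5 transactions (support 0.4) and is pruned.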

Advanced Techniques (Beyond Apriori)

Technique | What it Does | Why Useful?
FP-Growth (Frequent Pattern Growth) | Uses a tree structure instead of combinations | Much faster than Apriori
Eclat Algorithm | Uses a vertical data format | Efficient for large datasets
Closed and Max Patterns | Reduce the number of patterns | Avoid redundant rules
Fuzzy Association Rules | Finds rules with uncertainty | Useful in customer behaviour

These methods improve performance and reduce computation time.

Constraint-Based Mining

Sometimes we do not want all patterns, only those meeting specific business constraints.

Examples of Constraints:

  • Items with price > ₹500
  • Transactions from premium customers
  • Patterns related to electronics only
  • Patterns with profit margin above 10%

Why useful?

It saves time and focuses on business-relevant rules.

Correlation Mining

Correlation mining checks if items are truly related or just appear together by chance.

Example: People may buy milk and bread together simply because both are everyday staples that appear in most baskets, not because one purchase drives the other.

Correlation mining helps:

  • Identify real relationships
  • Avoid misleading patterns
  • Improve promotional strategies

Metrics used:

  • Lift
  • Chi-square
  • Leverage
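Lift, the most common of these metrics, compares how often two items actually co-occur with how often they would co-occur by chance (lift near 1 means no real relationship, above 1 a positive association). A toy calculation:

```python
# Lift = support(A and B) / (support(A) * support(B)).
# Made-up transactions; here milk and bread co-occur slightly LESS
# than chance (lift < 1), so the "pattern" is not a real relationship.
transactions = [
    {"milk", "bread"},
    {"milk", "bread"},
    {"milk"},
    {"bread"},
    {"milk", "bread", "eggs"},
]
n = len(transactions)

def support(items):
    return sum(items <= t for t in transactions) / n

lift = support({"milk", "bread"}) / (support({"milk"}) * support({"bread"}))
print(round(lift, 4))  # 0.9375
```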

Summary Table (Easy to Memorize)

Topic | Meaning | Example
Frequent Patterns | Items appearing together often | Bread + Butter
Market Basket Analysis | Analyse purchase combinations | Chips + Coke
Apriori Algorithm | Step-by-step frequent pattern algorithm | Bread → Butter
FP-Growth | Fast pattern mining using trees | Big data transactions
Constraint-Based Mining | Patterns under conditions | Items > ₹500
Correlation Mining | Check real vs fake relationships | Milk + eggs correlation

Classification (Easy Overview)

Classification assigns data to predefined categories (classes).

Examples:

  • Spam vs. Not Spam
  • Fraud vs. Legit Transaction
  • High-value vs. Low-value customer
  • Approve Loan vs. Reject Loan

Input: Labeled training data
Output: A model that predicts future labels.

Major Classification Techniques

A. Decision Trees

A decision tree is like a flowchart that splits data into branches based on rules (Yes/No decisions).

Why popular?

  • Very easy to understand

  • Visual representation

  • Works well with business data

Real-life example: Bank Loan Decision

                Income?
               /       \
           High         Low
             |           |
      Credit Score?    Reject
        /       \
     Good       Bad
      |          |
   Approve    Reject
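The tree above translates directly into nested if/else rules (the thresholds and labels are illustrative, not from a real bank):

```python
# The loan-decision tree, written as nested if/else rules.
def loan_decision(income, credit_score):
    if income == "High":
        if credit_score == "Good":
            return "Approve"
        return "Reject"   # High income but bad credit score
    return "Reject"       # Low income

print(loan_decision("High", "Good"))  # Approve
print(loan_decision("Low", "Good"))   # Reject
```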

Business Uses:

  • Credit scoring
  • Customer churn prediction
  • Medical diagnosis
  • Sales forecasting

B. Bayesian Classifiers (Naive Bayes)

Uses probability to predict class.

Assumption: All features are independent (Naive assumption).

Example: Predict if a customer will buy a smartphone based on:

  • Income
  • Age
  • Online usage

Simple, fast, and works well for:

  • Spam filtering
  • Sentiment analysis
  • Text classification
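A minimal Naive Bayes sketch for spam filtering, using made-up training messages: the score of a class is P(class) times the product of per-word probabilities, under the independence assumption, with Laplace smoothing to avoid zero probabilities.

```python
from collections import Counter

# Tiny Naive Bayes sketch (toy data): P(class | words) is proportional
# to P(class) * product of P(word | class), assuming word independence.
training = [
    ("win money now", "spam"),
    ("free money offer", "spam"),
    ("meeting at noon", "ham"),
    ("project report attached", "ham"),
]

word_counts = {"spam": Counter(), "ham": Counter()}
class_counts = Counter()
for text, label in training:
    class_counts[label] += 1
    word_counts[label].update(text.split())

def score(text, label):
    # Laplace smoothing (+1) handles words unseen in a class
    total = sum(word_counts[label].values())
    vocab = len({w for c in word_counts.values() for w in c})
    p = class_counts[label] / sum(class_counts.values())
    for w in text.split():
        p *= (word_counts[label][w] + 1) / (total + vocab)
    return p

prediction = max(("spam", "ham"), key=lambda c: score("free money", c))
print(prediction)  # spam
```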

C. Support Vector Machines (SVM)

SVM finds the best line or boundary that separates different classes.

Works well for:

  • High-dimensional data
  • Complex patterns

Example: Classifying images into:

  • Cat vs Dog
  • Tumor: Benign vs Malignant

Used in:

  • Image recognition
  • Bioinformatics
  • Fraud detection

D. Rule-Based Classifiers

Rules are generated as:

IF condition THEN class

Example:

IF purchase_amount > ₹5000 AND visits > 3 THEN customer = "Premium"

Used in:

  • CRM
  • Retail decisions
  • Targeted advertising
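The rule above translates directly into code (amounts in rupees, thresholds illustrative):

```python
# The IF-THEN rule from the example, as a plain Python function.
def classify_customer(purchase_amount, visits):
    # IF purchase_amount > 5000 AND visits > 3 THEN "Premium"
    if purchase_amount > 5000 and visits > 3:
        return "Premium"
    return "Regular"

print(classify_customer(purchase_amount=8000, visits=5))  # Premium
print(classify_customer(purchase_amount=2000, visits=1))  # Regular
```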

Prediction (Regression Techniques)

Prediction forecasts numerical values.

Types of Regression

A. Linear Regression

Predicts based on straight-line relationships.

Example: Predict monthly sales based on:

  • Advertising spend
  • Price
  • Season
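For a single variable, the least-squares line can be computed in closed form. A sketch with made-up advertising/sales figures (chosen to lie exactly on a line so the fit is easy to check):

```python
# Simple linear regression: slope = covariance / variance,
# intercept = mean_y - slope * mean_x. Toy data, both in thousands.
ad_spend = [10, 20, 30, 40, 50]
sales = [25, 45, 65, 85, 105]

n = len(ad_spend)
mean_x = sum(ad_spend) / n
mean_y = sum(sales) / n
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(ad_spend, sales)) \
        / sum((x - mean_x) ** 2 for x in ad_spend)
intercept = mean_y - slope * mean_x

print(slope, intercept)  # 2.0 5.0  (sales = 5 + 2 * ad_spend)
predicted = intercept + slope * 60  # forecast for ad spend of 60
print(predicted)  # 125.0
```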

B. Multiple Regression

Uses multiple variables to predict a target.

Example: Predict property prices using:

  • Area
  • Location
  • Number of rooms

C. Logistic Regression

Predicts category probability (yes/no outcome).
Used for:

  • Loan default
  • Churn
  • Disease prediction
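A logistic model outputs a probability via the sigmoid function, which is then thresholded into a yes/no class. A sketch with assumed (made-up) coefficients, not a fitted model:

```python
import math

# Logistic regression sketch: a linear score z is squashed by the
# sigmoid into a probability in (0, 1), then thresholded at 0.5.
def churn_probability(monthly_spend, complaints):
    # Hypothetical coefficients: z = -2.0 - 0.001*spend + 1.5*complaints
    z = -2.0 - 0.001 * monthly_spend + 1.5 * complaints
    return 1 / (1 + math.exp(-z))

p = churn_probability(monthly_spend=500, complaints=3)
print(round(p, 3))                      # probability of churn
print("churn" if p > 0.5 else "stay")
```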

Prediction Accuracy & Evaluation

Common metrics:

(A) Confusion Matrix

Shows:

  • True positives
  • True negatives
  • False positives
  • False negatives

(B) Accuracy

Correct predictions / total predictions

(C) Precision & Recall

Useful for fraud and spam detection.

(D) F1-Score

Balance between precision and recall.

(E) ROC Curve & AUC

Check model performance across thresholds.

(F) Mean Squared Error (MSE)

For regression models.
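Metrics (A)–(D) follow directly from the four confusion-matrix counts. A toy calculation:

```python
# Classification metrics from toy confusion-matrix counts.
tp, fp, fn, tn = 40, 10, 5, 45  # true/false positives and negatives

accuracy = (tp + tn) / (tp + fp + fn + tn)   # correct / total = 85/100
precision = tp / (tp + fp)                   # of predicted positives, how many are right
recall = tp / (tp + fn)                      # of actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(accuracy, precision, round(recall, 3), round(f1, 3))
```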

Ensemble Methods (Powerful Modern Techniques)

Ensemble = Combining multiple models to improve accuracy.

A. Bagging (Bootstrap Aggregating)

Example: Random Forest

Used for:

  • Fraud detection
  • Credit scoring

B. Boosting

Builds models sequentially, each new model correcting the errors of the previous one; among the strongest modern methods.
Examples: XGBoost, AdaBoost, LightGBM

Used in:

  • Kaggle competitions
  • Customer churn prediction
  • Sales forecasting

C. Stacking

Combines the predictions of different models using a meta-model that learns how best to weight them.
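The core ensemble idea, several models voting and the majority winning, can be sketched with three toy rule-based classifiers (all rules and thresholds are made up for illustration):

```python
from collections import Counter

# Toy voting ensemble for fraud detection: three simple classifiers
# each cast a vote, and the majority label wins.
def model_a(txn):
    return "fraud" if txn["amount"] > 10000 else "ok"

def model_b(txn):
    return "fraud" if txn["foreign"] else "ok"

def model_c(txn):
    return "fraud" if txn["night"] and txn["amount"] > 5000 else "ok"

def ensemble_predict(txn):
    votes = [m(txn) for m in (model_a, model_b, model_c)]
    return Counter(votes).most_common(1)[0][0]  # majority label

txn = {"amount": 12000, "foreign": False, "night": True}
print(ensemble_predict(txn))  # fraud (2 of 3 models vote "fraud")
```

Bagging additionally trains each model on a different random sample of the data; boosting instead builds the models in sequence.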

Business Use Cases of Classification & Prediction

Industry | Use Case | Technique
Banking | Loan approval, fraud detection | SVM, Decision Trees
E-commerce | Product recommendation | Naive Bayes
Insurance | Risk scoring | Logistic regression
Retail | Sales forecasting | Linear regression
Telecom | Churn prediction | Random Forest
Healthcare | Disease detection | SVM, Decision Trees
Marketing | Customer segmentation | Rule-based classifiers

Quick Summary Table

Method | Purpose | Simple Meaning | Example
Decision Tree | Classification | Yes/No branches | Loan approval
Bayesian Classifier | Probability classification | Based on likelihood | Spam filtering
SVM | Separation boundary | Draw best dividing line | Cancer detection
Rule-Based | IF-THEN rules | Simple conditions | Target marketing
Regression | Predict numbers | Forecast quantities | Sales forecast
Ensemble | Improve accuracy | Combine models | Fraud analytics

Clustering (Unsupervised Learning) 

Clustering is a data mining method that groups similar data points into clusters without predefined labels.

Think of it as automatically discovering natural groups in data.

What is Clustering? 

Clustering groups customers, objects, or data points based on similarity.

Examples:

  • Grouping customers by buying behavior
  • Grouping cities by weather patterns
  • Grouping stocks based on performance

Clustering helps businesses understand patterns, segments, behaviors, and anomalies.

Major Clustering Algorithms

A. K-Means Clustering

Most popular, simple, and widely used.

How it works:

  1. Select the number of clusters (K)
  2. Randomly place K centroids
  3. Assign each point to the nearest centroid
  4. Recalculate centroids
  5. Repeat until stable
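The five steps above can be sketched in plain Python on one-dimensional spending data (starting centroids are fixed by hand rather than placed randomly, so the run is reproducible):

```python
# Minimal 1-D k-means sketch following the five steps above.
spend = [100, 120, 130, 800, 850, 900, 4000, 4200]  # toy monthly spend
centroids = [100.0, 800.0, 4000.0]                  # K = 3, hand-picked starts

for _ in range(10):  # repeat until stable (10 iterations is plenty here)
    # Step 3: assign each point to its nearest centroid
    clusters = [[] for _ in centroids]
    for x in spend:
        nearest = min(range(len(centroids)), key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Step 4: recalculate each centroid as the mean of its cluster
    centroids = [sum(c) / len(c) for c in clusters]

print(centroids)  # one centroid per spending tier
```

The three final centroids land near the budget, mid-value, and premium spending levels, matching the retail example that follows.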

Example: Retail store wants 3 clusters:

  • Cluster 1: Budget buyers
  • Cluster 2: Mid-value buyers
  • Cluster 3: Premium buyers

Strengths:

  • Fast
  • Works well on large datasets

Limitations:

  • Must choose K in advance
  • Sensitive to outliers

B. Hierarchical Clustering

Builds a tree-like structure (dendrogram).

Types:

  1. Agglomerative → bottom-up (start with each point, merge groups)
  2. Divisive → top-down (start with one group, divide gradually)

Example:

Grouping students based on:

  • Marks
  • Attendance
  • Activities

Strengths:

  • No need to predefine number of clusters
  • Visual (dendrogram)

Limitations:

  • Very slow for big data
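The agglomerative (bottom-up) version can be sketched on one-dimensional marks: start with every point as its own cluster and repeatedly merge the two closest clusters (single linkage) until the desired number remains.

```python
# Agglomerative clustering sketch on toy 1-D marks, single linkage.
marks = [35, 38, 40, 85, 88, 92]
clusters = [[m] for m in marks]  # start: every point is its own cluster

def gap(a, b):
    # single-linkage distance: closest pair of points between two clusters
    return min(abs(x - y) for x in a for y in b)

while len(clusters) > 2:
    # find the pair of clusters with the smallest gap and merge them
    i, j = min(
        ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
        key=lambda p: gap(clusters[p[0]], clusters[p[1]]),
    )
    clusters[i] += clusters.pop(j)

print(sorted(sorted(c) for c in clusters))  # low-scoring vs high-scoring group
```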

C. Density-Based Clustering (DBSCAN)

Forms clusters based on dense regions and labels low-density regions as outliers.

Useful for:

  • Irregular-shaped clusters
  • Outlier detection

Example:

  • Detecting fraudulent transactions: a few transactions far from normal behaviour become outliers.

Strengths:

  • Excellent for non-linear data
  • Naturally detects outliers

Limitations:

  • Hard to use in very high-dimensional data
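DBSCAN's density idea can be illustrated in simplified form: a point with too few neighbours within distance eps is treated as noise (an outlier). This sketch shows only that outlier test, not the full cluster-expansion step of real DBSCAN:

```python
# Simplified density-based outlier test on 1-D transaction amounts
# (toy data; eps and min_pts are illustrative parameters).
amounts = [120, 130, 125, 140, 135, 9000]
eps, min_pts = 50, 3

def neighbours(x):
    # all points (including x itself) within distance eps
    return [y for y in amounts if abs(x - y) <= eps]

outliers = [x for x in amounts if len(neighbours(x)) < min_pts]
print(outliers)  # [9000] — the lone far-away transaction
```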

D. Grid-Based Clustering

Divides the entire data space into a finite number of grid cells and clusters the cells rather than individual points.

Examples:

  • STING
  • CLIQUE

Strengths:

  • Very fast
  • Useful for large databases

Applications:

  • GIS (Geographical Information Systems)
  • Sensor data
  • Heatmap-based analysis

Clustering High-Dimensional Data

High-dimensional data = many attributes (e.g., 100+ features)

Examples:

  • Genomics
  • Text documents
  • Customer behavior (100 features)

Challenges:

  • Distance becomes less meaningful
  • Algorithms become slow

Solutions:

  • Principal Component Analysis (PCA)
  • Dimensionality reduction
  • Subspace clustering
  • Feature selection

Outlier Detection

Outliers are data points that don’t belong to any cluster.

Examples:

  • Fraud credit card transactions
  • Unusual login patterns
  • Extreme medical values
  • Fake reviews

Clustering algorithms like DBSCAN automatically detect outliers.

Business Applications of Clustering

A. Customer Segmentation

Retail/e-commerce clusters:

  • Value shoppers
  • Impulse buyers
  • Loyal buyers

Used in:

  • Personalized marketing
  • Product recommendation
  • Pricing strategies

B. Target Marketing

Target specific groups:

  • Students
  • Working professionals
  • High-spending customers

Example: Banks target wealthy customers with premium credit cards.

C. Fraud Detection

Clustering helps detect:

  • Transactions that are unusual
  • Outlier behavior
  • Fake identities

Banks and fintech companies use clustering to detect anomalies.

D. Image Processing

Group similar images for:

  • Face recognition
  • Medical image analysis

E. Supply Chain & Logistics

Cluster:

  • Regions with similar demand
  • Delivery locations
  • Warehousing groups

Summary Table 

Algorithm | Core Idea | Strength | Limitation | Use Case
K-Means | Groups around centroids | Fast, simple | Needs K, sensitive to outliers | Customer segmentation
Hierarchical | Builds tree of clusters | No need to choose K | Slow for big data | Social grouping, product clustering
DBSCAN | Finds dense areas | Detects outliers | Hard for high dimensions | Fraud detection
Grid-Based | Divides data into grids | Very fast | Less accurate | GIS, spatial data