Unit 4: Data Mining Methods
Data Mining Methods: Association Rule Mining
Association Rule Mining is a data mining technique used to find relationships or patterns among items in large datasets.
It is mainly used to answer: “Which items are likely to be purchased or used together?”
Mining Frequent Patterns
Frequent patterns are combinations of items that appear often in a dataset.
Example:
In a supermarket:
- 60% of customers buy bread (the support of {bread} is 60%)
- Of the customers who buy bread, 40% also buy butter (the confidence of bread → butter is 40%)
So “bread → butter” is a frequent pattern.
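As a quick illustration of how such figures translate into support and confidence, here is a minimal sketch on made-up transactions (all data is hypothetical):

```python
# Toy illustration (made-up transactions): computing support and
# confidence for the rule bread -> butter.
transactions = [
    {"bread", "butter"}, {"bread", "milk"}, {"bread", "butter"},
    {"milk"}, {"bread"},
]

def support(itemset):
    # Fraction of transactions containing every item in the itemset.
    return sum(itemset <= t for t in transactions) / len(transactions)

sup_bread = support({"bread"})           # 4 of 5 transactions = 0.8
sup_both = support({"bread", "butter"})  # 2 of 5 transactions = 0.4
confidence = sup_both / sup_bread        # 0.4 / 0.8 = 0.5
```

Support measures how common the itemset is overall; confidence measures how often butter appears given that bread was bought.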
Market Basket Analysis (MBA)
Market Basket Analysis is the most common application of association rule mining.
It helps a retailer understand:
- Which products are purchased together?
- What to place close on shelves?
- How to create combo offers?
Example:
If many customers buy:
- Chips + Coke
- Shampoo + Conditioner
- Mobile + Back Cover
Retailers can create:
- Bundles
- Discounts
- Better store layout
MBA is widely used by retailers such as Amazon, Flipkart, Big Bazaar, and D-Mart.
Apriori Algorithm (Core Concept)
Apriori is one of the oldest and most widely used algorithms for mining frequent itemsets.
Basic Idea:
- Count how often each item appears
- Generate frequent item combinations
- Remove combinations that occur rarely
- Create strong association rules
Why called Apriori?
Because it uses prior knowledge: all subsets of a frequent itemset must themselves be frequent, so larger candidate patterns are generated only from smaller frequent patterns.
Example:
- Frequent single items: {bread}, {butter}, {milk}
- Frequent pairs: {bread, butter}
- Frequent triplets: {bread, butter, milk}
Apriori eliminates combinations whose support falls below the minimum support threshold.
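The level-wise idea can be sketched in a few lines of Python; the transactions and the 60% minimum support below are made up for illustration:

```python
from itertools import combinations

# Minimal Apriori sketch on made-up transactions (illustrative only).
transactions = [
    {"bread", "butter", "milk"},
    {"bread", "butter"},
    {"bread", "milk"},
    {"butter", "milk"},
    {"bread", "butter", "milk"},
]
min_support = 0.6  # keep itemsets appearing in >= 60% of transactions

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

items = sorted({i for t in transactions for i in t})
frequent = {}
k = 1
# Level-wise search: size-k candidates are built only from itemsets that
# survived the previous level (the "prior knowledge" pruning idea).
current = [frozenset(c) for c in combinations(items, 1)]
while current:
    survivors = {c: support(c) for c in current if support(c) >= min_support}
    frequent.update(survivors)
    k += 1
    current = list({a | b for a in survivors for b in survivors
                    if len(a | b) == k})
```

Here the three single items and all three pairs survive, while {bread, butter, milk} (support 40%) is pruned.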
Advanced Techniques (Beyond Apriori)
| Technique | What it Does | Why Useful? |
|---|---|---|
| FP-Growth (Frequent Pattern Growth) | Compresses transactions into a tree (FP-tree) and mines it without generating candidate combinations | Much faster than Apriori |
| Eclat Algorithm | Uses vertical data format | Efficient for large datasets |
| Closed and Max Patterns | Reduce number of patterns | Avoid redundant rules |
| Fuzzy Association Rules | Finds rules with uncertainty | Useful in customer behaviour |
These methods improve performance and reduce computation time.
Constraint-Based Mining
Sometimes we do not want all patterns, only those meeting specific business constraints.
Examples of Constraints:
- Items with price > ₹500
- Transactions from premium customers
- Patterns related to electronics only
- Patterns with profit margin above 10%
Why useful?
It saves time and focuses on business-relevant rules.
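A constraint can be as simple as a filter applied to mined rules; the rules and prices below are hypothetical:

```python
# Hypothetical mined rules with made-up item prices.
# Constraint-based mining keeps only rules meeting a business condition.
prices = {"TV": 30000, "HDMI cable": 400, "phone": 15000, "case": 300}
rules = [("TV", "HDMI cable"), ("phone", "case"), ("case", "HDMI cable")]

# Constraint: keep rules whose antecedent item costs more than 500.
expensive_rules = [(a, b) for a, b in rules if prices[a] > 500]
```

In practice the constraint is pushed into the mining step itself so cheap patterns are never generated, which is where the time saving comes from.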
Correlation Mining
Correlation mining checks if items are truly related or just appear together by chance.
Example: People may buy milk + bread together simply because both are very common purchases, not because one drives the other, so the co-occurrence alone is not a strong relationship.
Correlation mining helps:
- Identify real relationships
- Avoid misleading patterns
- Improve promotional strategies
Metrics used:
- Lift
- Chi-square
- Leverage
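Lift and leverage are straightforward to compute from support values; the numbers below are illustrative and chosen so the two items turn out to be independent:

```python
# Lift and leverage from support values (illustrative numbers).
sup_a = 0.6    # support of milk
sup_b = 0.5    # support of bread
sup_ab = 0.3   # support of {milk, bread}

lift = sup_ab / (sup_a * sup_b)    # 1.0 means the items are independent
leverage = sup_ab - sup_a * sup_b  # 0.0 means the items are independent
```

Here lift is exactly 1.0: milk and bread appear together no more often than chance predicts, so the pattern is not a real relationship.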
Summary Table (Easy to Memorize)
| Topic | Meaning | Example |
|---|---|---|
| Frequent Patterns | Items appearing together often | Bread + Butter |
| Market Basket Analysis | Analyse purchase combinations | Chips + Coke |
| Apriori Algorithm | Step-by-step frequent pattern algorithm | Bread → Butter |
| FP-Growth | Fast pattern mining using trees | Big data transactions |
| Constraint-Based Mining | Patterns under conditions | Items > ₹500 |
| Correlation Mining | Check real vs fake relationships | Milk + bread correlation |
Classification (Easy Overview)
Classification assigns data to predefined categories (classes).
Examples:
- Spam vs. Not Spam
- Fraud vs. Legit Transaction
- High-value vs. Low-value customer
- Approve Loan vs. Reject Loan
Input: Labeled training data
Output: A model that predicts future labels.
Major Classification Techniques
A. Decision Trees
A decision tree is like a flowchart that splits data into branches based on rules (Yes/No decisions).
Why popular?
- Very easy to understand
- Visual representation
- Works well with business data
Real-life example: Bank Loan Decision
Business Uses:
- Credit scoring
- Customer churn prediction
- Medical diagnosis
- Sales forecasting
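A decision tree's flowchart logic can be written out as nested Yes/No conditions; the thresholds in this loan-decision sketch are invented for illustration:

```python
# A hand-written decision "tree" for a hypothetical bank loan decision.
# The rules and thresholds are made up, not learned from data.
def loan_decision(income, credit_score, existing_debt):
    if credit_score < 600:          # first split: credit history
        return "reject"
    if income > 50000:              # second split: income level
        return "approve"
    # Middle band: approve only if existing debt is low.
    return "approve" if existing_debt < 10000 else "reject"
```

A real tree-learning algorithm (e.g. ID3 or CART) chooses these splits automatically from labeled training data; the code just shows the shape of the resulting model.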
B. Bayesian Classifiers (Naive Bayes)
Uses probability to predict class.
Assumption: All features are independent (Naive assumption).
Example: Predict if a customer will buy a smartphone based on:
- Income
- Age
- Online usage
Simple, fast, and works well for:
- Spam filtering
- Sentiment analysis
- Text classification
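The probability calculation can be done by hand; the prior and conditional probabilities below are made-up numbers, not real data:

```python
# Naive Bayes by hand with made-up probabilities: will a customer buy a
# smartphone given high income and heavy online usage?
p_buy = 0.4
p_not = 0.6
# Conditional probabilities of each feature given the class,
# treated as independent (the "naive" assumption).
p_high_income = {"buy": 0.7, "not": 0.3}
p_heavy_usage = {"buy": 0.8, "not": 0.4}

score_buy = p_buy * p_high_income["buy"] * p_heavy_usage["buy"]  # 0.224
score_not = p_not * p_high_income["not"] * p_heavy_usage["not"]  # 0.072
p_buy_given_features = score_buy / (score_buy + score_not)
```

Since score_buy > score_not, the classifier predicts "buy"; normalizing the scores gives the predicted probability.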
C. Support Vector Machines (SVM)
SVM finds the boundary (a line in 2D, a hyperplane in general) that separates the classes with the widest possible margin.
Works well for:
- High-dimensional data
- Complex patterns
Example: Classifying images into:
- Cat vs Dog
- Tumor: Benign vs Malignant
Used in:
- Image recognition
- Bioinformatics
- Fraud detection
D. Rule-Based Classifiers
Rules are generated as:
IF condition THEN class
Example: IF income = high AND age < 30 THEN buys_smartphone = yes
Used in:
- CRM
- Retail decisions
- Targeted advertising
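A rule-based classifier is essentially an ordered list of IF-THEN rules with a default class; the conditions and labels here are hypothetical:

```python
# A tiny rule-based classifier: ordered IF-THEN rules plus a default.
# Rules, thresholds, and labels are made up for illustration.
rules = [
    (lambda c: c["spend"] > 50000, "premium"),
    (lambda c: c["age"] < 25 and c["spend"] > 10000, "young high-spender"),
]

def classify(customer, default="regular"):
    # Fire the first rule whose condition matches the customer.
    for condition, label in rules:
        if condition(customer):
            return label
    return default
```

For example, `classify({"age": 22, "spend": 20000})` fires the second rule. Rule order matters: the first matching rule wins.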
Prediction (Regression Techniques)
Prediction forecasts numerical values.
Types of Regression
A. Linear Regression
Predicts based on straight-line relationships.
Example: Predict monthly sales from advertising spend (one predictor, fitted with a straight line).
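A straight-line fit can be computed with the closed-form least-squares formulas; the spend and sales figures below are made up (and deliberately lie on a line so the fit is exact):

```python
# Simple least-squares fit of sales against advertising spend
# (made-up numbers): y = a*x + b.
xs = [10, 20, 30, 40]   # ad spend
ys = [25, 45, 65, 85]   # monthly sales

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n
# Slope: covariance of x and y divided by variance of x.
a = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) \
    / sum((x - mean_x) ** 2 for x in xs)
b = mean_y - a * mean_x

def predict(x):
    return a * x + b   # e.g. predicted sales at spend 50
```

Here the data follows y = 2x + 5 exactly, so the fitted slope is 2 and the intercept is 5.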
B. Multiple Regression
Uses multiple variables to predict a target.
Example: Predict property prices using:
- Area
- Location
- Number of rooms
C. Logistic Regression
Predicts category probability (yes/no outcome).
Used for:
- Loan default
- Churn
- Disease prediction
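Logistic regression passes a weighted sum through the sigmoid function to get a probability; the weights in this churn sketch are hypothetical, not fitted to data:

```python
import math

# Logistic regression turns a linear score into a probability via the
# sigmoid function. The weights and bias here are hypothetical.
def churn_probability(monthly_calls, complaints, w=(-0.1, 1.2), bias=-0.5):
    score = w[0] * monthly_calls + w[1] * complaints + bias
    return 1 / (1 + math.exp(-score))   # sigmoid: always between 0 and 1
```

With these weights, more complaints push the churn probability up while heavier usage pushes it down; training would estimate the weights from labeled data.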
Prediction Accuracy & Evaluation
Common metrics:
(A) Confusion Matrix
Shows:
- True positives
- True negatives
- False positives
- False negatives
(B) Accuracy
Correct predictions / total predictions
(C) Precision & Recall
Useful for fraud and spam detection.
(D) F1-Score
Balance between precision and recall.
(E) ROC Curve & AUC
Check model performance across thresholds.
(F) Mean Squared Error (MSE)
For regression models.
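The classification metrics above can all be computed directly from the confusion-matrix counts; the predictions below are a toy example (1 = fraud, 0 = legit):

```python
# Computing evaluation metrics by hand for toy predictions.
actual    = [1, 0, 1, 1, 0, 0, 1, 0]
predicted = [1, 0, 0, 1, 0, 1, 1, 0]

# Confusion-matrix counts.
tp = sum(a == 1 and p == 1 for a, p in zip(actual, predicted))
tn = sum(a == 0 and p == 0 for a, p in zip(actual, predicted))
fp = sum(a == 0 and p == 1 for a, p in zip(actual, predicted))
fn = sum(a == 1 and p == 0 for a, p in zip(actual, predicted))

accuracy = (tp + tn) / len(actual)
precision = tp / (tp + fp)              # of predicted frauds, how many real
recall = tp / (tp + fn)                 # of real frauds, how many caught
f1 = 2 * precision * recall / (precision + recall)
```

For these eight predictions every metric works out to 0.75; on imbalanced data such as fraud, precision and recall are far more informative than accuracy alone.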
Ensemble Methods (Powerful Modern Techniques)
Ensemble = Combining multiple models to improve accuracy.
A. Bagging (Bootstrap Aggregating)
Example: Random Forest
Used for:
- Fraud detection
- Credit scoring
B. Boosting
Often the strongest-performing approach on structured (tabular) business data.
Example: XGBoost, AdaBoost, LightGBM
Used in:
- Kaggle competitions
- Customer churn prediction
- Sales forecasting
C. Stacking
Combines predictions from several different models using a meta-model trained on their outputs.
Business Use Cases of Classification & Prediction
| Industry | Use Case | Technique |
|---|---|---|
| Banking | Loan approval, fraud detection | SVM, Decision Trees |
| E-commerce | Product recommendation | Naive Bayes |
| Insurance | Risk scoring | Logistic regression |
| Retail | Sales forecasting | Linear regression |
| Telecom | Churn prediction | Random Forest |
| Healthcare | Disease detection | SVM, Decision Trees |
| Marketing | Customer segmentation | Rule-based classifiers |
Quick Summary Table
| Method | Purpose | Simple Meaning | Example |
|---|---|---|---|
| Decision Tree | Classification | Yes/No branches | Loan approval |
| Bayesian Classifier | Probability classification | Based on likelihood | Spam filtering |
| SVM | Separation boundary | Draw best dividing line | Cancer detection |
| Rule-Based | IF-THEN rules | Simple conditions | Target marketing |
| Regression | Predict numbers | Forecast quantities | Sales forecast |
| Ensemble | Improve accuracy | Combine models | Fraud analytics |
Clustering (Unsupervised Learning)
Clustering is a data mining method that groups similar data points into clusters without predefined labels.
Think of it as automatically discovering natural groups in data.
What is Clustering?
Clustering groups customers, objects, or data points based on similarity.
Examples:
- Grouping customers by buying behavior
- Grouping cities by weather patterns
- Grouping stocks based on performance
Clustering helps businesses understand patterns, segments, behaviors, and anomalies.
Major Clustering Algorithms
A. K-Means Clustering
Most popular, simple, and widely used.
How it works:
- Select the number of clusters (K)
- Randomly place K centroids
- Assign each point to the nearest centroid
- Recalculate centroids
- Repeat until stable
Example: Retail store wants 3 clusters:
- Cluster 1: Budget buyers
- Cluster 2: Mid-value buyers
- Cluster 3: Premium buyers
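The procedure can be sketched as a minimal one-dimensional k-means; the spend values and the hand-picked starting centroids are illustrative (a real implementation seeds them randomly):

```python
# Minimal 1-D k-means on made-up customer spend values.
# Centroids are seeded by hand instead of randomly for reproducibility.
spend = [100, 120, 90, 1000, 1100, 5000, 5200]
centroids = [100.0, 1000.0, 5000.0]   # K = 3 starting centroids

for _ in range(10):   # repeat until (practically) stable
    # Step 1: assign each point to its nearest centroid.
    clusters = [[] for _ in centroids]
    for x in spend:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(x - centroids[i]))
        clusters[nearest].append(x)
    # Step 2: recalculate each centroid as the mean of its cluster.
    centroids = [sum(c) / len(c) for c in clusters]
```

The three centroids settle near the budget, mid-value, and premium spend levels, matching the retail example above.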
Strengths:
- Fast
- Works well on large datasets
Limitations:
- Must choose K in advance
- Sensitive to outliers
B. Hierarchical Clustering
Builds a tree-like structure (dendrogram).
Types:
- Agglomerative → bottom-up (start with each point, merge groups)
- Divisive → top-down (start with one group, divide gradually)
Example:
Grouping students based on:
- Marks
- Attendance
- Activities
Strengths:
- No need to predefine number of clusters
- Visual (dendrogram)
Limitations:
- Very slow for big data
C. Density-Based Clustering (DBSCAN)
Forms clusters based on dense regions and labels low-density regions as outliers.
Useful for:
- Irregular-shaped clusters
- Outlier detection
Example:
- Detecting fraud transactions:
- A few transactions far away from normal behavior become outliers.
Strengths:
- Excellent for non-linear data
- Naturally detects outliers
Limitations:
- Hard to use in very high-dimensional data
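The density idea behind DBSCAN can be illustrated in one dimension: a point with too few neighbours within a radius eps is flagged as an outlier. The transaction amounts and thresholds below are made up:

```python
# The density idea behind DBSCAN, reduced to one dimension: points with
# fewer than min_pts neighbours within eps are flagged as outliers.
# Values and thresholds are hypothetical.
amounts = [100, 105, 98, 110, 102, 9000]   # transaction amounts
eps, min_pts = 20, 3

def is_outlier(x):
    # Count neighbours within eps, excluding the point itself.
    neighbours = sum(abs(x - y) <= eps for y in amounts) - 1
    return neighbours < min_pts

outliers = [x for x in amounts if is_outlier(x)]
```

The five normal amounts form a dense region while the 9000 transaction has no neighbours at all, so it is flagged, mirroring the fraud example above. Full DBSCAN additionally grows clusters outward from dense "core" points.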
D. Grid-Based Clustering
Divides the entire data space into a finite number of grid cells and clusters the cells rather than the individual points.
Examples:
- STING
- CLIQUE
Strengths:
- Very fast
- Useful for large databases
Applications:
- GIS (Geographical Information Systems)
- Sensor data
- Heatmap-based analysis
Clustering High-Dimensional Data
High-dimensional data = many attributes (e.g., 100+ features)
Examples:
- Genomics
- Text documents
- Customer behavior (100 features)
Challenges:
- Distance becomes less meaningful
- Algorithms become slow
Solutions:
- Principal Component Analysis (PCA)
- Dimensionality reduction
- Subspace clustering
- Feature selection
Outlier Detection
Outliers are data points that deviate strongly from the rest of the data and do not fit into any cluster.
Examples:
- Fraud credit card transactions
- Unusual login patterns
- Extreme medical values
- Fake reviews
Clustering algorithms like DBSCAN automatically detect outliers.
Business Applications of Clustering
A. Customer Segmentation
Retail/e-commerce clusters:
- Value shoppers
- Impulse buyers
- Loyal buyers
Used in:
- Personalized marketing
- Product recommendation
- Pricing strategies
B. Target Marketing
Target specific groups:
- Students
- Working professionals
- High-spending customers
Example: Banks target wealthy customers with premium credit cards.
C. Fraud Detection
Clustering helps detect:
- Transactions that are unusual
- Outlier behavior
- Fake identities
Banks and fintech companies use clustering to detect anomalies.
D. Image Processing
Group similar images for:
- Face recognition
- Medical image analysis
E. Supply Chain & Logistics
Cluster:
- Regions with similar demand
- Delivery locations
- Warehousing groups
Summary Table
| Algorithm | Core Idea | Strength | Limitation | Use Case |
|---|---|---|---|---|
| K-Means | Groups around centroids | Fast, simple | Needs K, sensitive to outliers | Customer segmentation |
| Hierarchical | Builds tree of clusters | No need K | Slow for big data | Social grouping, product clustering |
| DBSCAN | Finds dense areas | Detects outliers | Hard for high dimensions | Fraud detection |
| Grid-Based | Divides data into grids | Very fast | Less accurate | GIS, spatial data |