Classification & Clustering
Classification
Classification is a supervised data mining technique used to assign data items to predefined classes or categories.
In simple words: Classification predicts the class label of new data based on past data.
Example
- Email → Spam or Not Spam
- Student → Pass or Fail
- Loan applicant → Approved or Rejected
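A minimal sketch of the idea in plain Python: a simple rule is learned from past labeled records and then used to predict the class of a new record. The pass/fail marks below are made up purely for illustration.

```python
# Toy classification: learn a simple pass/fail rule from past data,
# then predict the class of a new student (illustrative data only).
past_data = [(35, "Fail"), (42, "Pass"), (28, "Fail"), (55, "Pass"), (60, "Pass")]

# "Training": pick the midpoint between the highest failing mark
# and the lowest passing mark as the decision threshold.
fail_marks = [m for m, label in past_data if label == "Fail"]
pass_marks = [m for m, label in past_data if label == "Pass"]
threshold = (max(fail_marks) + min(pass_marks)) / 2

def classify(marks):
    """Predict the class label of a new data item."""
    return "Pass" if marks >= threshold else "Fail"

print(threshold)      # 38.5
print(classify(45))   # Pass
print(classify(30))   # Fail
```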
Data Generalization
Data Generalization replaces low-level data with higher-level concepts using concept hierarchies.
Example
- City → State → Country
- Age → Young / Adult / Senior
Purpose
- Simplifies data
- Improves mining efficiency
- Helps in high-level analysis
| Low-Level Data | Generalized Data |
|---|---|
| Delhi, Mumbai | India |
| 22, 25, 28 | Young |
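A minimal sketch in Python of how concept hierarchies drive generalization; the city-to-country mapping and the age bands below are illustrative assumptions, not fixed standards.

```python
# Generalize low-level values to higher-level concepts (illustrative hierarchies).
city_to_country = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

def generalize_age(age):
    # Assumed bands: Young (< 30), Adult (30-59), Senior (60+).
    if age < 30:
        return "Young"
    elif age < 60:
        return "Adult"
    return "Senior"

records = [("Delhi", 22), ("Mumbai", 25), ("Paris", 64)]
generalized = [(city_to_country[city], generalize_age(age)) for city, age in records]
print(generalized)  # [('India', 'Young'), ('India', 'Young'), ('France', 'Senior')]
```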
Analytical Characterization
Analytical Characterization summarizes the general features of a target class.
It answers: “What are the common characteristics of this class?”
Example
Target class: Premium Customers
- High income
- Frequent purchases
- Urban location
Output
- Descriptive rules
- Summary tables
- Statistical measures
Analysis of Attribute Relevance
Not all attributes are equally important for classification.
Attribute relevance analysis identifies the most useful attributes.
Why It Is Needed
- Reduces noise
- Improves accuracy
- Reduces computation
Techniques Used
| Technique | Explanation |
|---|---|
| Information Gain | Measures reduction in entropy |
| Chi-Square Test | Measures dependency |
| Correlation | Measures relationship strength |
| Feature selection | Removes irrelevant attributes |
Exam Tip: Information Gain is widely used in decision trees.
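A minimal sketch of computing Information Gain for one attribute in plain Python; the tiny dataset below is made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Reduction in entropy obtained by splitting on one attribute."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels) if row[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Columns: (income, student) -> class label "buys"
rows   = [("high", "no"), ("high", "yes"), ("low", "yes"), ("low", "no")]
labels = ["no", "yes", "yes", "no"]
print(information_gain(rows, 1, labels))  # splitting on 'student' -> gain = 1.0
```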
Mining Class Comparisons
Class Comparison compares two or more classes to find differences.
It answers: “How is Class A different from Class B?”
Example
- Buyers vs Non-buyers
- Passed vs Failed students
Output
- Discriminant rules
- Comparative statistics
Statistical Measures in Large Databases
Statistical measures summarize and describe large datasets.
Common Measures
| Measure | Meaning |
|---|---|
| Mean | Average value |
| Median | Middle value |
| Mode | Most frequent value |
| Variance | Spread of data |
| Standard Deviation | Data dispersion |
| Correlation | Relationship between attributes |
Used in characterization, comparison, and prediction.
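A minimal sketch using Python's standard statistics module (statistics.correlation needs Python 3.10 or newer); the numbers are illustrative.

```python
import statistics

ages    = [22, 25, 25, 28, 60, 64]
incomes = [20, 24, 23, 30, 80, 85]

print(statistics.mean(ages))        # Mean: average value
print(statistics.median(ages))      # Median: middle value
print(statistics.mode(ages))        # Mode: most frequent value
print(statistics.pvariance(ages))   # Variance: spread of data
print(statistics.pstdev(ages))      # Standard deviation: data dispersion
print(statistics.correlation(ages, incomes))  # Pearson correlation (Python 3.10+)
```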
Statistical-Based Algorithms
These algorithms use statistical principles for classification.
Examples
| Algorithm | Description |
|---|---|
| Naïve Bayes | Based on Bayes’ theorem |
| Bayesian Networks | Probabilistic relationships |
| Linear Discriminant Analysis | Linear separation |
Advantages
- Fast
- Works well with large datasets
Limitation
- Assumes attribute independence
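A minimal sketch of the Naïve Bayes idea in plain Python on a tiny categorical dataset (made up for illustration); real implementations also apply smoothing to avoid zero probabilities.

```python
from collections import Counter, defaultdict

# Tiny training set: (income, location) -> class label
rows   = [("high", "urban"), ("high", "rural"), ("low", "urban"), ("low", "rural")]
labels = ["buy", "buy", "no", "no"]

# Count class frequencies and per-class attribute-value frequencies.
priors = Counter(labels)
cond = defaultdict(Counter)
for row, lbl in zip(rows, labels):
    for i, value in enumerate(row):
        cond[lbl][(i, value)] += 1

def classify(row):
    """Pick the class with the highest P(class) * prod P(value | class)."""
    scores = {}
    for lbl, prior_count in priors.items():
        score = prior_count / len(labels)
        for i, value in enumerate(row):
            score *= cond[lbl][(i, value)] / prior_count
        scores[lbl] = score
    return max(scores, key=scores.get)

print(classify(("high", "urban")))  # 'buy'
```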
Distance-Based Algorithms
These algorithms classify data based on distance or similarity.
Common Algorithm: k-Nearest Neighbor (k-NN)
- Finds k closest data points
- Assigns class by majority voting
Distance Measures
| Measure | Use |
|---|---|
| Euclidean | Numeric data |
| Manhattan | Grid-based data |
| Cosine | Text data |
Pros & Cons
| Advantage | Disadvantage |
|---|---|
| Simple | Slow for large data |
| No training phase (lazy learner) | High memory use |
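A minimal sketch of k-NN in plain Python using Euclidean distance and majority voting; the points below are made up.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Labeled training points: (feature vector, class)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]

def knn_classify(query, k=3):
    """Find the k closest training points and assign the class by majority vote."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))  # 'A' (two of the three nearest neighbors are A)
```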
Decision Tree-Based Algorithms
Decision Tree algorithms classify data using a tree-like structure.
Popular Algorithms
| Algorithm | Key Idea |
|---|---|
| ID3 | Uses Information Gain |
| C4.5 | Uses Gain Ratio; handles continuous data |
| CART | Uses Gini Index |
Advantages
- Easy to understand
- Rule-based output
- Handles mixed data
Disadvantages
- Overfitting
- Sensitive to noisy data
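A minimal sketch, assuming scikit-learn is installed; DecisionTreeClassifier implements the CART approach (Gini Index by default), and export_text prints the learned rules, showing the rule-based output mentioned above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [income, age] -> approved (1) or rejected (0); values are illustrative.
X = [[50, 25], [80, 40], [20, 22], [90, 35], [30, 50], [85, 28]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["income", "age"]))  # rule-based output
print(tree.predict([[70, 30]]))                            # expected [1] for this toy data
```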
Comparative View of Classification Algorithms
| Basis | Statistical | Distance-Based | Decision Tree |
|---|---|---|---|
| Principle | Probability | Distance | Rules |
| Example | Naïve Bayes | k-NN | ID3 |
| Speed | Fast | Slow | Medium |
| Interpretability | Medium | Low | High |
| Accuracy | Good | Depends on k | High |
Exam-Friendly Summary
- Classification predicts predefined classes
- Data generalization simplifies data
- Characterization describes class features
- Attribute relevance improves accuracy
- Class comparison finds differences
- Statistical, distance-based, and tree-based algorithms are widely used
Clustering
Clustering is an unsupervised data mining technique that groups similar data objects together without using predefined class labels.
In simple words: Clustering = Automatically grouping similar data
Examples
- Grouping customers with similar buying behavior
- Grouping documents by topic
- Grouping cities by climate
Key Characteristics
| Feature | Description |
|---|---|
| Unsupervised | No class labels |
| Similarity-based | Objects in same cluster are similar |
| Exploratory | Used to discover patterns |
Similarity and Distance Measures
Clustering depends on how similarity or distance is measured between data objects.
Common Distance Measures
| Measure | Formula Idea | Used For |
|---|---|---|
| Euclidean Distance | Straight-line distance | Numeric data |
| Manhattan Distance | Grid distance | City-block data |
| Minkowski Distance | Generalized form | Flexible |
| Cosine Similarity | Angle between vectors | Text data |
| Jaccard Coefficient | Shared attributes | Binary data |
Exam Tip:
- Euclidean → most commonly used
- Cosine → text mining
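A minimal sketch in plain Python of the most common measures; the vectors and sets below are illustrative.

```python
from math import dist, sqrt

a, b = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)

euclidean = dist(a, b)                              # straight-line distance
manhattan = sum(abs(x - y) for x, y in zip(a, b))   # city-block distance
cosine = (sum(x * y for x, y in zip(a, b))
          / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))))

# Jaccard coefficient for binary/set data: |intersection| / |union|
s1, s2 = {"milk", "bread", "butter"}, {"milk", "bread", "jam"}
jaccard = len(s1 & s2) / len(s1 | s2)

print(euclidean, manhattan, cosine, jaccard)  # 3.74..., 6.0, 1.0, 0.5
```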
Types of Clustering Algorithms
Clustering algorithms are broadly classified into:
| Type | Idea |
|---|---|
| Hierarchical | Builds a tree of clusters |
| Partitional | Divides data into k clusters |
| Density-Based | Finds dense regions |
| Grid-Based | Uses grid structure |
Hierarchical Clustering
Hierarchical clustering builds clusters step-by-step.
Types
| Type | Description |
|---|---|
| Agglomerative | Bottom-up (merge clusters) |
| Divisive | Top-down (split clusters) |
Advantages
- No need to specify number of clusters
- Easy to visualize (dendrogram)
Disadvantages
- Not scalable for large datasets
- Sensitive to noise
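A minimal sketch of agglomerative (bottom-up) clustering, assuming SciPy is installed; the linkage matrix Z is the same data a dendrogram plot would visualize.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two obvious groups (illustrative values).
points = [[1, 1], [1.2, 0.9], [0.8, 1.1], [5, 5], [5.2, 4.8], [4.9, 5.1]]

# Agglomerative clustering: start with singletons and repeatedly merge closest clusters.
Z = linkage(points, method="average")             # the merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters

print(labels)  # e.g. [1 1 1 2 2 2]
```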
CURE (Clustering Using REpresentatives)
CURE is a hierarchical clustering algorithm designed to handle large datasets and outliers.
Key Idea
- Uses multiple representative points for each cluster
- Shrinks representative points toward cluster center
Features
| Feature | Description |
|---|---|
| Shape | Detects arbitrary shapes |
| Outliers | Handles well |
| Scalability | Better than traditional hierarchical |
CHAMELEON
CHAMELEON is an advanced hierarchical clustering algorithm based on dynamic modeling.
Key Concept
Clusters are merged based on:
- Relative Inter-Connectivity
- Relative Closeness
Advantages
- Adapts to cluster structure
- Handles complex shapes
Limitation
- Computationally expensive
Density-Based Clustering Methods
Density-based methods group points that are closely packed together and treat sparse points as noise.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Key Parameters
| Parameter | Meaning |
|---|---|
| ε (epsilon) | Radius of neighborhood |
| MinPts | Minimum points required to form a dense region |
Advantages
- Detects arbitrary shaped clusters
- Handles noise well
- No need to specify number of clusters
Disadvantages
- Sensitive to ε value
- Poor with varying density
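A minimal sketch, assuming scikit-learn is installed; ε maps to the eps parameter, MinPts to min_samples, and points labelled -1 are treated as noise.

```python
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative values).
points = [[1, 1], [1.1, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.9, 5.1], [10, 10]]

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)  # e.g. [ 0  0  0  1  1  1 -1]  (-1 marks the noise point)
```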
OPTICS (Ordering Points To Identify Clustering Structure)
OPTICS is an extension of DBSCAN.
Key Features
| Feature | Description |
|---|---|
| Density | Handles varying densities |
| Output | Reachability plot |
| Flexibility | More robust than DBSCAN |
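A minimal sketch, assuming scikit-learn is installed; OPTICS produces a cluster ordering with reachability distances (the reachability plot) instead of fixing a single ε.

```python
from sklearn.cluster import OPTICS

points = [[1, 1], [1.1, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.9, 5.1], [10, 10]]

optics = OPTICS(min_samples=3).fit(points)
print(optics.labels_)                          # cluster labels, -1 = noise
print(optics.reachability_[optics.ordering_])  # data behind the reachability plot
```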
Grid-Based Clustering Methods
Grid-based methods divide the data space into a finite number of cells.
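A toy sketch in plain Python of the grid-based idea (not STING or CLIQUE themselves): the space is divided into cells, per-cell point counts are kept as statistics, and only dense cells are reported. Cell size and density threshold are assumed values.

```python
from collections import Counter

points = [(1.0, 1.2), (1.1, 0.8), (0.9, 1.1), (5.0, 5.1), (5.2, 4.9), (9.7, 0.3)]
cell_size = 2.0           # assumed grid resolution
density_threshold = 2     # minimum points for a cell to count as dense

# Map each point to its grid cell and keep per-cell counts (the "statistics").
cells = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
dense_cells = {cell: count for cell, count in cells.items() if count >= density_threshold}

print(dense_cells)  # {(0, 0): 3, (2, 2): 2}
```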
STING (Statistical Information Grid)
Characteristics
| Feature | Description |
|---|---|
| Structure | Hierarchical grid |
| Data | Uses statistical info |
| Speed | Very fast |
Limitation
- Cluster quality depends on grid size
CLIQUE (CLustering In QUEst)
Key Idea
- Finds clusters in subspaces
- Suitable for high-dimensional data
Advantages
- Handles high dimensions
- Efficient
Comparison Table (Very Important for Exams)
| Algorithm | Type | Shape | Noise Handling | Scalability |
|---|---|---|---|---|
| Hierarchical | Tree-based | Limited | Poor | Low |
| CURE | Hierarchical | Arbitrary | Good | Medium |
| CHAMELEON | Hierarchical | Complex | Good | Medium |
| DBSCAN | Density-based | Arbitrary | Excellent | Medium |
| OPTICS | Density-based | Arbitrary | Excellent | High |
| STING | Grid-based | Rectangular | Average | Very High |
| CLIQUE | Grid-based | Subspace | Good | Very High |
Exam-Friendly Summary
- Clustering is unsupervised learning
- Similarity measures define cluster quality
- Hierarchical clustering builds tree structures
- CURE and CHAMELEON improve traditional hierarchical methods
- DBSCAN & OPTICS handle noise and arbitrary shapes