Classification & Clustering
Classification
Classification is a supervised data mining technique used to assign data items to predefined classes or categories.
In simple words: Classification predicts the class label of new data based on past data.
Example
- Email → Spam or Not Spam
- Student → Pass or Fail
- Loan applicant → Approved or Rejected
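A minimal sketch of the idea in plain Python: a simple rule is learned from past labeled records and then used to predict the class of a new record. The pass/fail marks below are made up purely for illustration.

```python
# Toy classification: learn a simple pass/fail rule from past data,
# then predict the class of a new student (illustrative data only).
past_data = [(35, "Fail"), (42, "Pass"), (28, "Fail"), (55, "Pass"), (60, "Pass")]

# "Training": pick the midpoint between the highest failing mark
# and the lowest passing mark as the decision threshold.
fail_marks = [m for m, label in past_data if label == "Fail"]
pass_marks = [m for m, label in past_data if label == "Pass"]
threshold = (max(fail_marks) + min(pass_marks)) / 2

def classify(marks):
    """Predict the class label of a new data item."""
    return "Pass" if marks >= threshold else "Fail"

print(threshold)      # 38.5
print(classify(45))   # Pass
print(classify(30))   # Fail
```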
Data Generalization
Data Generalization replaces low-level data with higher-level concepts using concept hierarchies.
Example
- City → State → Country
- Age → Young / Adult / Senior
Purpose
- Simplifies data
- Improves mining efficiency
- Helps in high-level analysis
| Low-Level Data | Generalized Data |
|---|---|
| Delhi, Mumbai | India |
| 22, 25, 28 | Young |
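A minimal sketch in Python of how concept hierarchies drive generalization; the city-to-country mapping and the age bands below are illustrative assumptions, not fixed standards.

```python
# Generalize low-level values to higher-level concepts (illustrative hierarchies).
city_to_country = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

def generalize_age(age):
    # Assumed bands: Young (< 30), Adult (30-59), Senior (60+).
    if age < 30:
        return "Young"
    elif age < 60:
        return "Adult"
    return "Senior"

records = [("Delhi", 22), ("Mumbai", 25), ("Paris", 64)]
generalized = [(city_to_country[city], generalize_age(age)) for city, age in records]
print(generalized)  # [('India', 'Young'), ('India', 'Young'), ('France', 'Senior')]
```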
Analytical Characterization
Analytical Characterization summarizes the general features of a target class.
It answers: “What are the common characteristics of this class?”
Example
Target class: Premium Customers
- High income
- Frequent purchases
- Urban location
Output
- Descriptive rules
- Summary tables
- Statistical measures
Analysis of Attribute Relevance
Not all attributes are equally important for classification.
Attribute relevance analysis identifies the most useful attributes.
Why It Is Needed
- Reduces noise
- Improves accuracy
- Reduces computation
Techniques Used
| Technique | Explanation |
|---|---|
| Information Gain | Measures reduction in entropy |
| Chi-Square Test | Measures dependency |
| Correlation | Measures relationship strength |
| Feature selection | Removes irrelevant attributes |
Exam Tip: Information Gain is widely used in decision trees.
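A minimal sketch of computing Information Gain for one attribute in plain Python; the tiny dataset below is made up for illustration.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(rows, attr_index, labels):
    """Reduction in entropy obtained by splitting on one attribute."""
    base = entropy(labels)
    remainder = 0.0
    for value in set(row[attr_index] for row in rows):
        subset = [lbl for row, lbl in zip(rows, labels) if row[attr_index] == value]
        remainder += len(subset) / len(labels) * entropy(subset)
    return base - remainder

# Columns: (income, student) -> class label "buys"
rows   = [("high", "no"), ("high", "yes"), ("low", "yes"), ("low", "no")]
labels = ["no", "yes", "yes", "no"]
print(information_gain(rows, 1, labels))  # splitting on 'student' -> gain = 1.0
```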
Mining Class Comparisons
Class Comparison compares two or more classes to find differences.
It answers: “How is Class A different from Class B?”
Example
- Buyers vs Non-buyers
- Passed vs Failed students
Output
- Discriminant rules
- Comparative statistics
Statistical Measures in Large Databases
Statistical measures summarize and describe large datasets.
Common Measures
| Measure | Meaning |
|---|---|
| Mean | Average value |
| Median | Middle value |
| Mode | Most frequent value |
| Variance | Spread of data |
| Standard Deviation | Data dispersion |
| Correlation | Relationship between attributes |
Used in characterization, comparison, and prediction.
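A minimal sketch using Python's standard statistics module (statistics.correlation needs Python 3.10 or newer); the numbers are illustrative.

```python
import statistics

ages    = [22, 25, 25, 28, 60, 64]
incomes = [20, 24, 23, 30, 80, 85]

print(statistics.mean(ages))        # Mean: average value
print(statistics.median(ages))      # Median: middle value
print(statistics.mode(ages))        # Mode: most frequent value
print(statistics.pvariance(ages))   # Variance: spread of data
print(statistics.pstdev(ages))      # Standard deviation: data dispersion
print(statistics.correlation(ages, incomes))  # Pearson correlation (Python 3.10+)
```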
Statistical-Based Algorithms
These algorithms use statistical principles for classification.
Examples
| Algorithm | Description |
|---|---|
| Naïve Bayes | Based on Bayes’ theorem |
| Bayesian Networks | Probabilistic relationships |
| Linear Discriminant Analysis | Linear separation |
Advantages
- Fast
- Works well with large datasets
Limitation
- Assumes attribute independence
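A minimal sketch of the Naïve Bayes idea in plain Python on a tiny categorical dataset (made up for illustration); real implementations also apply smoothing to avoid zero probabilities.

```python
from collections import Counter, defaultdict

# Tiny training set: (income, location) -> class label
rows   = [("high", "urban"), ("high", "rural"), ("low", "urban"), ("low", "rural")]
labels = ["buy", "buy", "no", "no"]

# Count class frequencies and per-class attribute-value frequencies.
priors = Counter(labels)
cond = defaultdict(Counter)
for row, lbl in zip(rows, labels):
    for i, value in enumerate(row):
        cond[lbl][(i, value)] += 1

def classify(row):
    """Pick the class with the highest P(class) * prod P(value | class)."""
    scores = {}
    for lbl, prior_count in priors.items():
        score = prior_count / len(labels)
        for i, value in enumerate(row):
            score *= cond[lbl][(i, value)] / prior_count
        scores[lbl] = score
    return max(scores, key=scores.get)

print(classify(("high", "urban")))  # 'buy'
```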
Distance-Based Algorithms
These algorithms classify data based on distance or similarity.
Common Algorithm: k-Nearest Neighbor (k-NN)
- Finds k closest data points
- Assigns class by majority voting
Distance Measures
| Measure | Use |
|---|---|
| Euclidean | Numeric data |
| Manhattan | Grid-based data |
| Cosine | Text data |
Pros & Cons
| Advantage | Disadvantage |
|---|---|
| Simple | Slow for large data |
| No training phase (lazy learner) | High memory use |
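A minimal sketch of k-NN in plain Python using Euclidean distance and majority voting; the points below are made up.

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

# Labeled training points: (feature vector, class)
training = [((1.0, 1.0), "A"), ((1.2, 0.8), "A"), ((5.0, 5.0), "B"), ((5.5, 4.5), "B")]

def knn_classify(query, k=3):
    """Find the k closest training points and assign the class by majority vote."""
    neighbors = sorted(training, key=lambda item: dist(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

print(knn_classify((1.1, 0.9)))  # 'A' (two of the three nearest neighbors are A)
```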
Decision Tree-Based Algorithms
Decision Tree algorithms classify data using a tree-like structure.
Popular Algorithms
| Algorithm | Key Idea |
|---|---|
| ID3 | Uses Information Gain |
| C4.5 | Uses Gain Ratio; handles continuous data |
| CART | Uses Gini Index |
Advantages
- Easy to understand
- Rule-based output
- Handles mixed data
Disadvantages
- Overfitting
- Sensitive to noisy data
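A minimal sketch, assuming scikit-learn is installed; DecisionTreeClassifier implements the CART approach (Gini Index by default), and export_text prints the learned rules, showing the rule-based output mentioned above.

```python
from sklearn.tree import DecisionTreeClassifier, export_text

# Toy data: [income, age] -> approved (1) or rejected (0); values are illustrative.
X = [[50, 25], [80, 40], [20, 22], [90, 35], [30, 50], [85, 28]]
y = [0, 1, 0, 1, 0, 1]

tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0)
tree.fit(X, y)

print(export_text(tree, feature_names=["income", "age"]))  # rule-based output
print(tree.predict([[70, 30]]))                            # expected [1] for this toy data
```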
Comparative View of Classification Algorithms
| Basis | Statistical | Distance-Based | Decision Tree |
|---|---|---|---|
| Principle | Probability | Distance | Rules |
| Example | Naïve Bayes | k-NN | ID3 |
| Speed | Fast | Slow | Medium |
| Interpretability | Medium | Low | High |
| Accuracy | Good | Depends on k | High |
Exam-Friendly Summary
- Classification predicts predefined classes
- Data generalization simplifies data
- Characterization describes class features
- Attribute relevance improves accuracy
- Class comparison finds differences
- Statistical, distance-based, and tree-based algorithms are widely used
Clustering
Clustering is an unsupervised data mining technique that groups similar data objects together without using predefined class labels.
In simple words: Clustering = Automatically grouping similar data
Examples
- Grouping customers with similar buying behavior
- Grouping documents by topic
- Grouping cities by climate
Key Characteristics
| Feature | Description |
|---|---|
| Unsupervised | No class labels |
| Similarity-based | Objects in same cluster are similar |
| Exploratory | Used to discover patterns |
Similarity and Distance Measures
Clustering depends on how similarity or distance is measured between data objects.
Common Distance Measures
| Measure | Formula Idea | Used For |
|---|---|---|
| Euclidean Distance | Straight-line distance | Numeric data |
| Manhattan Distance | Grid distance | City-block data |
| Minkowski Distance | Generalized form | Flexible |
| Cosine Similarity | Angle between vectors | Text data |
| Jaccard Coefficient | Shared attributes | Binary data |
Exam Tip:
- Euclidean → most commonly used
- Cosine → text mining
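A minimal sketch in plain Python of the most common measures; the vectors and sets below are illustrative.

```python
from math import dist, sqrt

a, b = (1.0, 2.0, 3.0), (2.0, 4.0, 6.0)

euclidean = dist(a, b)                              # straight-line distance
manhattan = sum(abs(x - y) for x, y in zip(a, b))   # city-block distance
cosine = (sum(x * y for x, y in zip(a, b))
          / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))))

# Jaccard coefficient for binary/set data: |intersection| / |union|
s1, s2 = {"milk", "bread", "butter"}, {"milk", "bread", "jam"}
jaccard = len(s1 & s2) / len(s1 | s2)

print(euclidean, manhattan, cosine, jaccard)  # 3.74..., 6.0, 1.0, 0.5
```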
Types of Clustering Algorithms
Clustering algorithms are broadly classified into:
| Type | Idea |
|---|---|
| Hierarchical | Builds a tree of clusters |
| Partitional | Divides data into k clusters |
| Density-Based | Finds dense regions |
| Grid-Based | Uses grid structure |
Hierarchical Clustering
Hierarchical clustering builds clusters step-by-step.
Types
| Type | Description |
|---|---|
| Agglomerative | Bottom-up (merge clusters) |
| Divisive | Top-down (split clusters) |
Advantages
- No need to specify number of clusters
- Easy to visualize (dendrogram)
Disadvantages
- Not scalable for large datasets
- Sensitive to noise
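A minimal sketch of agglomerative (bottom-up) clustering, assuming SciPy is installed; the linkage matrix Z is the same data a dendrogram plot would visualize.

```python
from scipy.cluster.hierarchy import linkage, fcluster

# Toy 2-D points forming two obvious groups (illustrative values).
points = [[1, 1], [1.2, 0.9], [0.8, 1.1], [5, 5], [5.2, 4.8], [4.9, 5.1]]

# Agglomerative clustering: start with singletons and repeatedly merge closest clusters.
Z = linkage(points, method="average")             # the merge tree (dendrogram data)
labels = fcluster(Z, t=2, criterion="maxclust")   # cut the tree into 2 clusters

print(labels)  # e.g. [1 1 1 2 2 2]
```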
CURE (Clustering Using REpresentatives)
CURE is a hierarchical clustering algorithm designed to handle large datasets and outliers.
Key Idea
- Uses multiple representative points for each cluster
- Shrinks representative points toward cluster center
Features
| Feature | Description |
|---|---|
| Shape | Detects arbitrary shapes |
| Outliers | Handles well |
| Scalability | Better than traditional hierarchical |
CHAMELEON
CHAMELEON is an advanced hierarchical clustering algorithm based on dynamic modeling.
Key Concept
Clusters are merged based on:
- Relative Inter-Connectivity
- Relative Closeness
Advantages
- Adapts to cluster structure
- Handles complex shapes
Limitation
- Computationally expensive
Density-Based Clustering Methods
Density-based methods group points that are closely packed together and treat sparse points as noise.
DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
Key Parameters
| Parameter | Meaning |
|---|---|
| ε (epsilon) | Radius of neighborhood |
| MinPts | Minimum points required to form a dense region |
Advantages
- Detects arbitrary shaped clusters
- Handles noise well
- No need to specify number of clusters
Disadvantages
- Sensitive to ε value
- Poor with varying density
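A minimal sketch, assuming scikit-learn is installed; ε maps to the eps parameter, MinPts to min_samples, and points labelled -1 are treated as noise.

```python
from sklearn.cluster import DBSCAN

# Two dense groups plus one isolated point (illustrative values).
points = [[1, 1], [1.1, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.9, 5.1], [10, 10]]

labels = DBSCAN(eps=0.5, min_samples=3).fit_predict(points)
print(labels)  # e.g. [ 0  0  0  1  1  1 -1]  (-1 marks the noise point)
```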
OPTICS (Ordering Points To Identify Clustering Structure)
OPTICS is an extension of DBSCAN.
Key Features
| Feature | Description |
|---|---|
| Density | Handles varying densities |
| Output | Reachability plot |
| Flexibility | More robust than DBSCAN |
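A minimal sketch, assuming scikit-learn is installed; OPTICS produces a cluster ordering with reachability distances (the reachability plot) instead of fixing a single ε.

```python
from sklearn.cluster import OPTICS

points = [[1, 1], [1.1, 0.9], [0.9, 1.1], [5, 5], [5.1, 4.9], [4.9, 5.1], [10, 10]]

optics = OPTICS(min_samples=3).fit(points)
print(optics.labels_)                          # cluster labels, -1 = noise
print(optics.reachability_[optics.ordering_])  # data behind the reachability plot
```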
Grid-Based Clustering Methods
Grid-based methods divide the data space into a finite number of cells.
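A toy sketch in plain Python of the grid-based idea (not STING or CLIQUE themselves): the space is divided into cells, per-cell point counts are kept as statistics, and only dense cells are reported. Cell size and density threshold are assumed values.

```python
from collections import Counter

points = [(1.0, 1.2), (1.1, 0.8), (0.9, 1.1), (5.0, 5.1), (5.2, 4.9), (9.7, 0.3)]
cell_size = 2.0           # assumed grid resolution
density_threshold = 2     # minimum points for a cell to count as dense

# Map each point to its grid cell and keep per-cell counts (the "statistics").
cells = Counter((int(x // cell_size), int(y // cell_size)) for x, y in points)
dense_cells = {cell: count for cell, count in cells.items() if count >= density_threshold}

print(dense_cells)  # {(0, 0): 3, (2, 2): 2}
```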
STING (Statistical Information Grid)
Characteristics
| Feature | Description |
|---|---|
| Structure | Hierarchical grid |
| Data | Uses statistical info |
| Speed | Very fast |
Limitation
- Cluster quality depends on grid size
CLIQUE (CLustering In QUEst)
Key Idea
- Finds clusters in subspaces
- Suitable for high-dimensional data
Advantages
- Handles high dimensions
- Efficient
Comparison Table (Very Important for Exams)
| Algorithm | Type | Shape | Noise Handling | Scalability |
|---|---|---|---|---|
| Hierarchical | Tree-based | Limited | Poor | Low |
| CURE | Hierarchical | Arbitrary | Good | Medium |
| CHAMELEON | Hierarchical | Complex | Good | Medium |
| DBSCAN | Density-based | Arbitrary | Excellent | Medium |
| OPTICS | Density-based | Arbitrary | Excellent | High |
| STING | Grid-based | Rectangular | Average | Very High |
| CLIQUE | Grid-based | Subspace | Good | Very High |
Exam-Friendly Summary
- Clustering is unsupervised learning
- Similarity measures define cluster quality
- Hierarchical clustering builds tree structures
- CURE and CHAMELEON improve traditional hierarchical methods
- DBSCAN & OPTICS handle noise and arbitrary shapes