Classification & Clustering



Classification 

Classification is a supervised data mining technique used to assign data items to predefined classes or categories.

In simple words: Classification predicts the class label of new data based on past data.

Example

  • Email → Spam or Not Spam
  • Student → Pass or Fail
  • Loan applicant → Approved or Rejected
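The "past data → class label of new data" idea can be sketched with a toy one-attribute classifier. The scores, labels, and the midpoint rule below are illustrative assumptions, not a standard algorithm:

```python
def train_threshold(examples):
    """examples: list of (score, label) pairs from past data.
    Learns the midpoint between the highest failing score and the
    lowest passing score (assumes the two classes are separable)."""
    pass_scores = [s for s, y in examples if y == "Pass"]
    fail_scores = [s for s, y in examples if y == "Fail"]
    return (max(fail_scores) + min(pass_scores)) / 2

def predict(score, threshold):
    return "Pass" if score >= threshold else "Fail"

past = [(30, "Fail"), (35, "Fail"), (50, "Pass"), (70, "Pass")]
t = train_threshold(past)  # (35 + 50) / 2 = 42.5
print(predict(60, t))      # Pass
print(predict(40, t))      # Fail
```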

Data Generalization

Data Generalization replaces low-level data with higher-level concepts using concept hierarchies.

Example

  • City → State → Country
  • Age → Young / Adult / Senior

Purpose

  • Simplifies data
  • Improves mining efficiency
  • Helps in high-level analysis

Low-Level Data | Generalized Data
Delhi, Mumbai | India
22, 25, 28 | Young
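Generalization through a concept hierarchy is just a lookup or binning step. A minimal sketch (the city mapping and age cut-offs are illustrative assumptions):

```python
# Concept hierarchy: low-level value -> higher-level concept
CITY_TO_COUNTRY = {"Delhi": "India", "Mumbai": "India", "Paris": "France"}

def generalize_age(age):
    """Bin a raw age into a higher-level concept (assumed cut-offs)."""
    if age < 30:
        return "Young"
    if age < 60:
        return "Adult"
    return "Senior"

print([CITY_TO_COUNTRY[c] for c in ("Delhi", "Mumbai")])  # ['India', 'India']
print([generalize_age(a) for a in (22, 25, 28)])          # ['Young', 'Young', 'Young']
```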

Analytical Characterization

Analytical Characterization summarizes the general features of a target class.

It answers: “What are the common characteristics of this class?”

Example

Target class: Premium Customers

  • High income
  • Frequent purchases
  • Urban location

Output

  • Descriptive rules
  • Summary tables
  • Statistical measures

Analysis of Attribute Relevance

Not all attributes are equally important for classification.

Attribute relevance analysis identifies the most useful attributes.

Why It Is Needed

  • Reduces noise
  • Improves accuracy
  • Reduces computation

Techniques Used

Technique | Explanation
Information Gain | Measures reduction in entropy after a split
Chi-Square Test | Measures dependency between an attribute and the class
Correlation | Measures relationship strength
Feature selection | Removes irrelevant attributes

Exam Tip: Information Gain is widely used in decision trees.
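Information Gain (entropy reduction) can be computed directly. A small sketch on an assumed toy dataset:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy before a split minus the weighted entropy after
    splitting on the attribute at attr_index."""
    n = len(labels)
    parts = {}  # attribute value -> labels falling in that partition
    for row, y in zip(rows, labels):
        parts.setdefault(row[attr_index], []).append(y)
    remainder = sum(len(p) / n * entropy(p) for p in parts.values())
    return entropy(labels) - remainder

# Toy data (assumed): Outlook -> Play?
rows = [("Sunny",), ("Sunny",), ("Rain",), ("Rain",)]
labels = ["No", "No", "Yes", "Yes"]
print(information_gain(rows, labels, 0))  # 1.0: the split removes all uncertainty
```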

Mining Class Comparisons

Class Comparison compares two or more classes to find differences.

It answers: “How is Class A different from Class B?”

Example

  • Buyers vs Non-buyers
  • Passed vs Failed students

Output

  • Discriminant rules
  • Comparative statistics

Statistical Measures in Large Databases

Statistical measures summarize and describe large datasets.

Common Measures

Measure | Meaning
Mean | Average value
Median | Middle value
Mode | Most frequent value
Variance | Spread of data
Standard Deviation | Data dispersion
Correlation | Relationship between attributes

Used in characterization, comparison, and prediction.

Statistical-Based Algorithms

These algorithms use statistical principles for classification.

Examples

Algorithm | Description
Naïve Bayes | Based on Bayes' theorem
Bayesian Networks | Probabilistic relationships
Linear Discriminant Analysis | Linear separation

Advantages

  • Fast
  • Works well with large datasets

Limitation

  • Assumes attribute (feature) independence, which rarely holds exactly
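A minimal categorical Naïve Bayes sketch: count class priors and per-class attribute frequencies, then multiply. The customer data is an assumed toy example, and the per-class add-one (Laplace) smoothing is a simplification:

```python
from collections import Counter, defaultdict

def nb_train(rows, labels):
    """Count class priors and per-class attribute-value frequencies."""
    priors = Counter(labels)
    cond = defaultdict(Counter)  # (class, attr_index) -> value counts
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            cond[(y, i)][v] += 1
    return priors, cond

def nb_predict(row, priors, cond):
    """Pick the class maximising P(class) * product of P(value | class)."""
    n = sum(priors.values())
    best, best_p = None, -1.0
    for y, cy in priors.items():
        p = cy / n
        for i, v in enumerate(row):
            # add-one smoothing so unseen values don't zero the product
            p *= (cond[(y, i)][v] + 1) / (cy + len(cond[(y, i)]))
        if p > best_p:
            best, best_p = y, p
    return best

rows = [("High", "Urban"), ("High", "Urban"), ("Low", "Rural"), ("Low", "Rural")]
labels = ["Premium", "Premium", "Basic", "Basic"]
priors, cond = nb_train(rows, labels)
print(nb_predict(("High", "Urban"), priors, cond))  # Premium
```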

Distance-Based Algorithms

These algorithms classify data based on distance or similarity.

Common Algorithm: k-Nearest Neighbor (k-NN)

  • Finds k closest data points
  • Assigns class by majority voting

Distance Measures

Measure | Use
Euclidean | Numeric data
Manhattan | Grid-based data
Cosine | Text data

Pros & Cons

Advantage | Disadvantage
Simple | Slow for large data
No training phase | High memory use
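The two k-NN steps above (find the k closest points, then majority-vote) fit in a few lines. The training points are an assumed toy example:

```python
from collections import Counter
from math import dist  # Euclidean distance (Python 3.8+)

def knn_predict(train, query, k=3):
    """train: list of (point, label). Classify query by majority vote
    among its k nearest neighbours under Euclidean distance."""
    nearest = sorted(train, key=lambda pl: dist(pl[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

train = [((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
         ((8, 8), "B"), ((9, 8), "B")]
print(knn_predict(train, (1.5, 1.5), k=3))  # A
print(knn_predict(train, (8.5, 8.0), k=3))  # B
```

Note the "no training phase": all work (distance computation and sorting) happens at query time, which is exactly why k-NN is slow and memory-hungry on large data.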

Decision Tree-Based Algorithms

Decision Tree algorithms classify data using a tree-like structure.

Popular Algorithms

Algorithm | Key Idea
ID3 | Uses Information Gain
C4.5 | Uses Gain Ratio; handles continuous data
CART | Uses Gini Index

Advantages

  • Easy to understand
  • Rule-based output
  • Handles mixed data

Disadvantages

  • Overfitting
  • Sensitive to noisy data
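The Gini Index used by CART measures node impurity: 0 for a pure node, higher for mixed nodes. A short sketch on assumed label lists:

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum of squared class proportions."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

print(gini(["Yes", "Yes", "Yes"]))  # 0.0 (pure node)
print(gini(["Yes", "No"]))          # 0.5 (maximally mixed, two classes)
print(gini(["Yes", "Yes", "No", "No", "No"]))  # 0.48
```

CART picks the split that most reduces the weighted Gini of the child nodes, the same way ID3 picks the split with the highest Information Gain.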

Comparative View of Classification Algorithms

Basis | Statistical | Distance-Based | Decision Tree
Principle | Probability | Distance | Rules
Example | Naïve Bayes | k-NN | ID3
Speed | Fast | Slow | Medium
Interpretability | Medium | Low | High
Accuracy | Good | Depends on k | High

Exam-Friendly Summary

  • Classification predicts predefined classes
  • Data generalization simplifies data
  • Characterization describes class features
  • Attribute relevance improves accuracy
  • Class comparison finds differences
  • Statistical, distance-based, and tree-based algorithms are widely used

Clustering 

Clustering is an unsupervised data mining technique that groups similar data objects together without using predefined class labels.

In simple words: Clustering = Automatically grouping similar data

Examples

  • Grouping customers with similar buying behavior
  • Grouping documents by topic
  • Grouping cities by climate

Key Characteristics

Feature | Description
Unsupervised | No class labels
Similarity-based | Objects in same cluster are similar
Exploratory | Used to discover patterns

Similarity and Distance Measures

Clustering depends on how similarity or distance is measured between data objects.

Common Distance Measures

Measure | Formula Idea | Used For
Euclidean Distance | Straight-line distance | Numeric data
Manhattan Distance | Grid distance | City-block data
Minkowski Distance | Generalized form | Flexible
Cosine Similarity | Angle between vectors | Text data
Jaccard Coefficient | Shared attributes | Binary data

Exam Tip:

  • Euclidean → most commonly used
  • Cosine → text mining
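The common measures can be written out directly; the sample points and sets below are assumed for illustration:

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between numeric vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """City-block (grid) distance."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (used for text)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def jaccard(a, b):
    """Shared attributes over total attributes (binary/set data)."""
    return len(a & b) / len(a | b)

p, q = (1, 2), (4, 6)
print(euclidean(p, q))                    # 5.0
print(manhattan(p, q))                    # 7
print(cosine_similarity((1, 0), (0, 1)))  # 0.0 (orthogonal vectors)
print(jaccard({"a", "b"}, {"b", "c"}))    # 0.333...
```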

Types of Clustering Algorithms

Clustering algorithms are broadly classified into:

Type | Idea
Hierarchical | Builds a tree of clusters
Partitional | Divides data into k clusters
Density-Based | Finds dense regions
Grid-Based | Uses grid structure

Hierarchical Clustering

Hierarchical clustering builds clusters step-by-step.

Types

Type | Description
Agglomerative | Bottom-up (merge clusters)
Divisive | Top-down (split clusters)

Advantages

  • No need to specify number of clusters
  • Easy to visualize (dendrogram)

Disadvantages

  • Not scalable for large datasets
  • Sensitive to noise
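The agglomerative (bottom-up) idea can be sketched with single linkage, where cluster distance is the minimum pairwise distance. The points are an assumed toy example, and this brute-force version also shows why plain hierarchical clustering does not scale:

```python
from math import dist

def single_linkage(points, target_k):
    """Agglomerative clustering: start with one cluster per point and
    repeatedly merge the two closest clusters (single linkage) until
    target_k clusters remain."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None  # (distance, i, j) of the closest cluster pair
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] += clusters.pop(j)  # merge cluster j into cluster i
    return clusters

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
print(single_linkage(pts, 2))  # the two low points, the two high points
```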

CURE (Clustering Using REpresentatives)

CURE is a hierarchical clustering algorithm designed to handle large datasets and outliers.

Key Idea

  • Uses multiple representative points for each cluster
  • Shrinks representative points toward cluster center

Features

Feature | Description
Shape | Detects arbitrary shapes
Outliers | Handles well
Scalability | Better than traditional hierarchical

CHAMELEON

CHAMELEON is an advanced hierarchical clustering algorithm based on dynamic modeling.

Key Concept

Clusters are merged based on:

  • Relative Inter-Connectivity
  • Relative Closeness

Advantages

  • Adapts to cluster structure
  • Handles complex shapes

Limitation

  • Computationally expensive

Density-Based Clustering Methods

Density-based methods group points that are closely packed together and treat sparse points as noise.

DBSCAN (Density-Based Spatial Clustering of Applications with Noise)

Key Parameters

Parameter | Meaning
ε (epsilon) | Radius of the neighborhood around a point
MinPts | Minimum points within ε for a point to be a core point

Advantages

  • Detects arbitrary shaped clusters
  • Handles noise well
  • No need to specify number of clusters

Disadvantages

  • Sensitive to ε value
  • Poor with varying density
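A minimal DBSCAN sketch using the two parameters above: a point with at least MinPts neighbours within ε becomes a core point and grows a cluster; points reachable from no core point stay noise. The point set is an assumed toy example:

```python
from math import dist

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN sketch: returns one label per point
    (cluster id starting at 0, or -1 for noise)."""
    labels = [None] * len(points)
    cluster = -1

    def neighbours(i):
        # includes the point itself, as in standard DBSCAN counting
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbours(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # noise (may be claimed as a border point later)
            continue
        cluster += 1  # i is a core point: start a new cluster and expand it
        labels[i] = cluster
        queue = list(nbrs)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point previously marked noise
            if labels[j] is not None:
                continue
            labels[j] = cluster
            j_nbrs = neighbours(j)
            if len(j_nbrs) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_nbrs)
    return labels

pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(pts, eps=2.0, min_pts=3))  # [0, 0, 0, 1, 1, 1, -1]
```

The isolated point (50, 50) has too few neighbours within ε, so it stays labelled -1 (noise), and the number of clusters (two here) is discovered rather than specified.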

OPTICS (Ordering Points To Identify Clustering Structure)

OPTICS is an extension of DBSCAN that produces an ordering of points by reachability distance instead of a single flat clustering, so clusters of varying density can be extracted from one run.

Key Features

Feature | Description
Density | Handles varying densities
Output | Reachability plot
Flexibility | More robust than DBSCAN

Grid-Based Clustering Methods

Grid-based methods divide the data space into a finite number of cells.

STING (Statistical Information Grid)

Characteristics

Feature | Description
Structure | Hierarchical grid
Data | Uses statistical info
Speed | Very fast

Limitation

  • Cluster quality depends on grid size

CLIQUE (CLustering In QUEst)

Key Idea

  • Finds clusters in subspaces
  • Suitable for high-dimensional data

Advantages

  • Handles high dimensions
  • Efficient

Comparison Table (Very Important for Exams)

Algorithm | Type | Shape | Noise Handling | Scalability
Hierarchical | Tree-based | Limited | Poor | Low
CURE | Hierarchical | Arbitrary | Good | Medium
CHAMELEON | Hierarchical | Complex | Good | Medium
DBSCAN | Density-based | Arbitrary | Excellent | Medium
OPTICS | Density-based | Arbitrary | Excellent | High
STING | Grid-based | Rectangular | Average | Very High
CLIQUE | Grid-based | Subspace | Good | Very High

Exam-Friendly Summary

  • Clustering is unsupervised learning
  • Similarity measures define cluster quality
  • Hierarchical clustering builds tree structures
  • CURE and CHAMELEON improve traditional hierarchical methods
  • DBSCAN & OPTICS handle noise and arbitrary shapes
  • STING & CLIQUE are fast grid-based methods