Unsupervised Learning and Clustering
Unsupervised learning is a type of machine learning where the computer learns from data without any given answers (labels). This means no one tells the computer what is right or wrong. The system only looks at the data and tries to find patterns, similarities, or groups by itself.
It is like giving a box of mixed items to a student and asking them to arrange the items in meaningful groups without any instructions. The computer uses the structure of data to understand how things are related.
In daily life, we often do unsupervised learning without realising it. For example, when you arrange photos in your mobile gallery into folders like family, friends, and college, you group similar photos together.
No one tells you how to group them; you decide based on similarity. In the same way, the computer groups similar data items.
Key Ideas
No labelled data is given
Computer finds hidden patterns
Useful when large data has no answers
Exam Tip
Unsupervised learning = learning without a teacher.
Why Unsupervised Learning Matters
Unsupervised learning helps us understand large amounts of data easily. Companies collect huge amounts of data from users every day. Unless this data is grouped or organised, it is of little use.
Unsupervised learning helps companies see trends, user behaviour, and customer interest. This makes better business decisions possible.
For example, online shopping websites group customers based on what they buy. Students who buy programming books are placed in one group. Students who buy novels are placed in another group. These groups help websites suggest correct products.
Key Ideas
Handles big data
Finds hidden structure
Helps in business and research
Remember This
Unsupervised learning turns raw data into useful information.
Clustering
Clustering is the process of dividing data into groups so that similar items stay together. Each group is called a cluster. Items inside one cluster are very similar, while items in different clusters are different from each other. Clustering is one of the most important tasks in unsupervised learning.
Think of your college library. Books are arranged into sections like programming, mathematics, and management. Books with similar topics stay together. This is clustering in real life.
Key Ideas
Groups similar data
Each group = cluster
No predefined labels
Exam Tip
Clustering = grouping similar objects.
Criterion Functions for Clustering
A criterion function is a rule that tells us how good a clustering result is. It measures the quality of clusters. The main aim is to make data inside a cluster as similar as possible and data in different clusters as different as possible.
Imagine you arrange students into study groups. If students in one group have very different subjects, then grouping is poor. If students in one group study the same subject, grouping is good. The criterion function checks this quality.
Key Ideas
Measures cluster quality
Helps compare clustering results
Used to improve clustering
Remember This
Better clustering = high similarity inside, low similarity outside.
Square Error Criterion (Basic Idea)
The square error is the sum of the squared distances between data points and the centre of their cluster. Distance simply means how far two items are from each other. The aim is to keep this total small. A smaller square error means better clustering.
Think of a group of students standing around a class leader. If all students stand close to the leader, grouping is good. If students stand far away, grouping is poor.
Key Ideas
Measures closeness
Smaller value = better clusters
Used in K-means
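The idea above can be sketched in a few lines of code. This is a minimal illustration, not a library function: the `sse` helper and the marks data are made up for the example.

```python
# Sum of squared errors (SSE) for a 1-D clustering (hypothetical marks data).
def sse(clusters):
    """Sum of squared distances from each point to its own cluster mean."""
    total = 0.0
    for points in clusters:
        centre = sum(points) / len(points)          # centre = cluster mean
        total += sum((p - centre) ** 2 for p in points)
    return total

# Two groupings of the same marks; the tighter grouping has the smaller error.
good = [[10, 12, 11], [50, 52, 51]]
poor = [[10, 50, 12], [11, 52, 51]]
print(sse(good) < sse(poor))  # True: tight clusters give a smaller square error
```

Comparing the two values shows how the criterion function judges quality: the grouping that keeps similar marks together scores lower (better).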
Iterative Square-Error Partitional Clustering
This method divides data into a fixed number of clusters and improves the result step by step. The word iterative means repeating steps again and again. The algorithm keeps changing clusters until the error becomes very small.
For example, you first divide students randomly into two groups. After seeing mistakes, you rearrange the students. You repeat this until the groups become correct.
Key Ideas
Fixed number of clusters
Repeats steps
Minimises square error
K-Means Clustering
K-means is the most popular partitional clustering method. K is the number of clusters. The algorithm chooses K centres and assigns each data point to the nearest centre. Then it updates the centres and repeats the process.
Suppose you want to divide students into three groups based on marks. You choose K = 3. The computer groups students into three clusters and keeps improving the groups.
Key Ideas
Choose K value
Find cluster centres
Repeat until stable
Exam Tip
K-means = simple and fast clustering method.
Steps of the K-Means Algorithm
The algorithm first selects K initial centres. Next, each data point goes to the nearest centre. Then new centres are calculated. These steps repeat until clusters stop changing.
Think of organising hostel rooms. You first place students randomly. Then you rearrange students based on habits. You repeat until everyone fits well.
Key Ideas
Select K
Assign data
Update centre
Repeat
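The four steps above can be sketched as a short program. This is a simplified 1-D version with made-up marks data and hypothetical starting centres, not a full library implementation.

```python
# Minimal 1-D K-means sketch: select centres, assign, update, repeat.
def kmeans_1d(data, centres, iters=10):
    for _ in range(iters):
        # Step 2: assign each data point to the nearest centre
        clusters = [[] for _ in centres]
        for x in data:
            nearest = min(range(len(centres)), key=lambda i: abs(x - centres[i]))
            clusters[nearest].append(x)
        # Step 3: recompute each centre as the mean of its cluster
        centres = [sum(c) / len(c) if c else centres[i]
                   for i, c in enumerate(clusters)]
    return centres, clusters

marks = [20, 22, 25, 70, 72, 75]           # made-up marks, two natural groups
centres, clusters = kmeans_1d(marks, centres=[20, 70])
print(sorted(map(sorted, clusters)))        # the two groups of marks
```

Here the clusters stop changing after the first pass, which is exactly the stopping condition described above.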
Agglomerative Hierarchical Clustering
Agglomerative clustering builds clusters step by step. At the beginning, each data point is its own cluster. Then the closest clusters are merged again and again until only one big cluster remains.
Imagine each student stands alone. Then, students who know each other join. Then small groups join into bigger groups.
Key Ideas
Bottom-up approach
Merges clusters
Creates tree structure
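The bottom-up merging can also be sketched in code. This is an illustrative single-linkage version (distance between the two nearest members) on made-up 1-D data; the function name and data are assumptions for the example.

```python
# Agglomerative clustering sketch: start with singletons, merge closest pairs.
def agglomerative(points, stop_at=1):
    clusters = [[p] for p in points]   # each point starts as its own cluster
    merges = []                        # merge order (what a dendrogram draws)
    while len(clusters) > stop_at:
        # find the pair of clusters with the smallest single-linkage distance
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: min(abs(a - b)
                                      for a in clusters[ij[0]]
                                      for b in clusters[ij[1]]))
        merged = clusters[i] + clusters[j]
        merges.append(merged)
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters, merges

clusters, merges = agglomerative([10, 12, 40, 42], stop_at=2)
print(sorted(sorted(c) for c in clusters))  # [[10, 12], [40, 42]]
```

Stopping at `stop_at=2` cuts the merging early; letting it run to `stop_at=1` gives the single big cluster at the top of the tree.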
Dendrogram (Tree Diagram)
A dendrogram is a tree-like diagram that shows how clusters merge. It helps us understand cluster formation visually.
Think of a family tree showing relationships. A dendrogram shows how data groups connect.
Key Ideas
Tree diagram
Shows merging
Used in hierarchical clustering
Differences: K-Means vs Hierarchical
| Feature | K-Means | Hierarchical |
|---|---|---|
| Number of clusters | Must choose K in advance | Not needed (cut the tree later) |
| Speed | Fast | Slow |
| Structure | Flat | Tree |
Cluster Validation
Cluster validation checks whether a clustering result is good or not. It helps ensure that clusters make sense and are useful.
Imagine a teacher checking the quality of project groups. If the groups are poorly formed, the teacher reorganises them.
Key Ideas
Checks quality
Finds best result
Avoids poor clusters
Types of Cluster Validation
Internal validation checks clustering using only the data itself. External validation compares the clusters with a known correct grouping. Relative validation compares the results of different clustering methods or settings.
Example: You compare two ways of grouping students and choose the better one.
Key Ideas
Internal
External
Relative
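One common internal validation measure is the silhouette score: for each point it compares the average distance to its own cluster (a) with the average distance to the nearest other cluster (b). The sketch below is a simplified 1-D version with made-up data, not the standard library routine.

```python
# Internal validation sketch: simplified silhouette score in 1-D.
# score per point = (b - a) / max(a, b); higher average = better clustering.
def silhouette(clusters):
    scores = []
    for ci, cluster in enumerate(clusters):
        for idx, x in enumerate(cluster):
            same = [y for k, y in enumerate(cluster) if k != idx]
            a = sum(abs(x - y) for y in same) / len(same) if same else 0.0
            b = min(sum(abs(x - y) for y in other) / len(other)
                    for cj, other in enumerate(clusters) if cj != ci)
            scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

good = [[10, 12], [40, 42]]   # similar values kept together
poor = [[10, 40], [12, 42]]   # similar values split apart
print(silhouette(good) > silhouette(poor))  # True: the good grouping scores higher
```

This is also how relative validation works in practice: compute the same score for two clusterings and keep the one with the higher value.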
Why This Topic Matters
Clustering helps in customer analysis, medical research, image processing, and recommendation systems. It improves business and technology.
Example: Netflix groups users by movie interest.
Key Ideas
Used in industry
Helps decision making
Possible Exam Questions
Short Questions
Define unsupervised learning
What is clustering?
Explain K-means
Long Questions
Explain the K-means algorithm
Describe hierarchical clustering
Discuss cluster validation
Remember This
Unsupervised learning = no labels
Clustering = grouping
K-means = partitional
Hierarchical = tree-based
Detailed Summary
Unsupervised learning allows machines to learn from data without answers. Clustering is the most important task of unsupervised learning. It groups similar data into clusters. Criterion functions measure cluster quality. K-means is fast and simple, while hierarchical clustering builds tree-like groups.
Cluster validation ensures results are correct. These techniques help companies understand data and improve services.
Key Takeaways
Data can organise itself
Similar items form clusters
Clustering supports real-world systems
These notes are written to build a strong understanding and help students score well in exams.