What Is Unsupervised Learning?
In supervised learning, every training example comes with a correct label y. Unsupervised learning removes that assumption entirely. The algorithm is given only inputs x — no correct answers, no categories, no labels of any kind — and its job is to find structure or patterns hiding in the data on its own.
We call it unsupervised precisely because there is no teacher. No human has pre-labeled the data. The algorithm must discover whatever organisation exists in the data without being told what to look for.
Supervised: given (x, y) — learn f(x) → y
Unsupervised: given x only — discover structure, patterns, groupings

The algorithm decides what the structure is. No labels. No correct answer.
Types of Unsupervised Learning
Unsupervised learning covers several distinct problem types, each asking a different question about the data.
| Type | Question it asks | Example |
|---|---|---|
| Clustering | Which data points naturally belong together? | Google News grouping articles by topic |
| Anomaly Detection | Which data points are unusual or don't fit? | Fraud detection in bank transactions |
| Dimensionality Reduction | Can we compress the data while keeping its structure? | Visualising high-dimensional data in 2D |
Clustering
Clustering divides data into groups — called clusters — where points inside a cluster are similar to each other and different from points in other clusters. The algorithm works out which points belong together — and, with some algorithms, even how many groups to form — entirely from the data itself.
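As a minimal sketch of this idea, the snippet below uses scikit-learn's k-means on hypothetical 2D data with two obvious blobs. Note that the algorithm is never shown any labels — it is only told how many clusters to look for, and it discovers the grouping on its own.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical unlabeled data: two well-separated blobs of 2D points
rng = np.random.default_rng(0)
points = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(50, 2)),
])

# No labels are provided: k-means partitions the points by similarity alone
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)

# Each point gets a cluster id (0 or 1) discovered from the data itself
print(kmeans.labels_)
```

Each point in the first blob ends up with one cluster id and each point in the second blob with the other — without anyone ever telling the algorithm what the two groups mean.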
Google News
Google News receives millions of articles every day from thousands of sources around the world. No team of humans manually reads and categorises each one. Instead, a clustering algorithm continuously groups articles that cover the same story — same event, same topic, same entities mentioned — into a single cluster, which becomes the "story card" you see on the news feed.
For example, every article about a cricket match between India and Australia — written by ESPN, BBC, Times of India, and dozens of other outlets — gets automatically grouped into one story. The algorithm was never told what "cricket" or "India vs Australia" means. It discovered that these articles share similar words, locations, and named entities and assigned them to the same cluster.
Google News groups articles about the same event together without anyone manually labeling them. Which type of unsupervised learning is this?
Anomaly Detection
Anomaly detection identifies data points that are unusual — ones that do not fit the pattern of the rest of the data. The algorithm first learns what normal looks like from the bulk of the data, then flags anything that deviates significantly.
- Fraud detection: the vast majority of bank transactions follow a predictable pattern (location, amount, frequency). A transaction that suddenly deviates — unusual amount, foreign country, odd hour — gets flagged as a potential anomaly.
- Manufacturing: sensors on a production line record normal operating conditions. An anomaly detection model flags readings that deviate from normal before a failure occurs.
- Network security: normal network traffic has recognisable patterns. Intrusion detection systems flag traffic that looks unusual — a potential cyberattack.
The power of anomaly detection is that you do not need labeled examples of fraud or failures. You only need enough normal data for the algorithm to learn what normal is.
A bank's anomaly detection model was never given labeled examples of fraudulent transactions. It only trained on normal transaction patterns. How does it identify fraud?
Dimensionality Reduction
Real-world datasets often have hundreds or thousands of features (dimensions). Dimensionality reduction compresses the data into far fewer dimensions — two or three — while preserving as much of the original structure as possible.
This serves two main purposes. First, it makes data easier to visualise — you cannot plot 500-dimensional data, but you can plot a 2D compression of it and still see whether natural clusters exist. Second, it removes noise and redundant features, which can improve the performance of downstream models.
- PCA (Principal Component Analysis): finds the directions of greatest variance in the data and projects everything onto those axes.
- t-SNE: a non-linear method particularly effective for visualising high-dimensional data like word embeddings or image features in 2D.
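A short sketch of PCA in action, using scikit-learn on hypothetical data: the points genuinely vary along only a few directions but are embedded in 50 dimensions, and PCA compresses them back to 2D while reporting how much of the original variance survives.

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical high-dimensional data: 200 points that truly vary along
# only 3 directions, embedded into 50 dimensions
rng = np.random.default_rng(2)
low_dim = rng.normal(size=(200, 3))     # the underlying structure
mixing = rng.normal(size=(3, 50))
high_dim = low_dim @ mixing             # what we actually observe

# Project onto the 2 directions of greatest variance
pca = PCA(n_components=2)
projected = pca.fit_transform(high_dim)

print(projected.shape)                        # now plottable in 2D
print(pca.explained_variance_ratio_.sum())    # fraction of variance kept
```

The 2D result can be scattered on a plot to check whether natural clusters exist — exactly the visualisation use case described above.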
