Why Accuracy Is Not Enough
Accuracy — the fraction of predictions the model got right — sounds like the obvious metric. But it breaks down whenever classes are imbalanced.
For example, a classifier that always predicts "not fraud" on a dataset where only 5% of cases are fraud will score 95% accuracy while catching zero fraud cases. It has learned nothing useful, yet its accuracy looks excellent. This is why we need metrics that look separately at different types of error.
Accuracy = (TP + TN) / Total — misleading when one class dominates. The real question is: which errors matter more for your problem?
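A minimal sketch of the accuracy paradox, using made-up data with the 5% fraud rate from the example above (the `accuracy` helper is an illustration, not a particular library's API):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 95 legitimate transactions (0) and 5 fraudulent ones (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts "not fraud"

print(accuracy(y_true, y_pred))  # 0.95 — yet it catches zero fraud
```

The 95% score says nothing about the 5 fraud cases it missed, which is exactly the blind spot the metrics below address.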
The Confusion Matrix
Before the metrics, you need the four building blocks. For a binary classifier with a positive class (e.g., "fraud") and a negative class (e.g., "not fraud"):
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP — True Positive | FN — False Negative (missed it) |
| Actually Negative | FP — False Positive (false alarm) | TN — True Negative |
- TP: model said positive, it was positive. Correct catch.
- FP: model said positive, it was negative. False alarm.
- FN: model said negative, it was positive. Missed case.
- TN: model said negative, it was negative. Correct dismissal.
Every metric below is a ratio built from these four numbers.
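The four counts can be tallied directly from paired labels. A small sketch, assuming binary labels where 1 marks the positive class (the function name and example data are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```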
A cancer screening model misses 40 actual cancer cases, labelling them as healthy. What type of error is this?
Precision and Recall
These two metrics capture the tradeoff between false alarms and missed cases.
Precision answers: of everything I flagged as positive, how many were actually positive?
Precision = TP / (TP + FP)
Recall (also called sensitivity) answers: of all the actual positives, how many did I catch?
Recall = TP / (TP + FN)
| Metric | Punishes | Use when… |
|---|---|---|
| Precision | False alarms (FP) | False positives are costly — e.g., spam filter blocking real emails |
| Recall | Missed cases (FN) | False negatives are costly — e.g., cancer screening, fraud detection |
For example, a fraud model flags 100 transactions as fraudulent; 80 are genuinely fraudulent and 20 are not. Precision = 80 / (80 + 20) = 80%. To know recall, you would need the total number of actual frauds in the dataset.
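Both formulas reduce to one-liners over the confusion-matrix counts. The precision figure below matches the worked example; the recall line assumes a hypothetical total of 200 actual frauds (so 120 were missed), purely for illustration:

```python
def precision(tp, fp):
    """Of everything flagged positive, what fraction was actually positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, what fraction was caught?"""
    return tp / (tp + fn)

# The fraud example: 100 flagged, 80 genuine fraud, 20 false alarms.
print(precision(tp=80, fp=20))   # 0.8
# Assumed for illustration: 200 actual frauds in total, so 120 missed.
print(recall(tp=80, fn=120))     # 0.4
```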
A fraud detection model flags 100 transactions. 80 are genuine fraud, 20 are not. What is the model's precision?
F1 Score
Precision and recall pull in opposite directions — boosting one often hurts the other. The F1 Score is their harmonic mean, giving a single number that balances both.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean is used instead of the arithmetic mean because it penalises large imbalances. A model with precision 1.0 and recall 0.0 would have an arithmetic mean of 0.5 but an F1 of 0 — correctly flagging it as useless.
F1 is most useful when classes are imbalanced and both false positives and false negatives carry cost. If one matters far more than the other, use precision or recall directly.
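The harmonic-versus-arithmetic distinction is easy to check numerically. A sketch (the guard for precision + recall = 0 is a conventional choice, since the formula is otherwise undefined there):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(1.0, 0.0))            # 0.0 — harmonic mean exposes the useless model
print((1.0 + 0.0) / 2)         # 0.5 — arithmetic mean hides it
print(round(f1(0.8, 0.5), 3))  # 0.615 — a balanced-ish model
```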
A model has precision 0.90 and recall 0.10. What is its F1 score (approximately)?
ROC-AUC
Most classifiers output a probability score, not a hard label. The threshold you choose (e.g., flag as fraud if score > 0.5) sets the precision-recall tradeoff. The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every possible threshold.
AUC (Area Under the Curve) summarises the entire curve as a single number:
- AUC = 1.0: perfect classifier — separates all positives from all negatives.
- AUC = 0.5: random guessing — no better than a coin flip.
- AUC < 0.5: worse than random (predictions are systematically inverted).
ROC-AUC is threshold-independent, making it useful for comparing models before deciding on a deployment threshold. It can be misleading with extreme class imbalance — in those cases, the Precision-Recall AUC is often more informative.
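AUC has an equivalent interpretation that makes it easy to compute by hand: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counting as half). A pairwise O(P×N) sketch with illustrative scores; production libraries use a sort-based method instead:

```python
def roc_auc(y_true, scores):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(round(roc_auc(y_true, scores), 3))  # 0.889 — 8 of 9 pairs ranked correctly
```

Note that no threshold appears anywhere in the computation, which is exactly why AUC is threshold-independent.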
