Why Accuracy Is Not Enough
Accuracy — the fraction of predictions the model got right — sounds like the obvious metric. But it breaks down whenever classes are imbalanced.
For example, a classifier that always predicts "not fraud" on a dataset where only 5% of cases are fraud will score 95% accuracy while catching zero fraud cases. It has learned nothing useful, yet its accuracy looks excellent. This is why we need metrics that look separately at different types of error.
Accuracy = (TP + TN) / Total — misleading when one class dominates. The real question is: which errors matter more for your problem?
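A minimal sketch of the accuracy paradox, using made-up data with the 5% fraud rate from the example above (the `accuracy` helper is an illustration, not a particular library's API):

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# 95 legitimate transactions (0) and 5 fraudulent ones (1)
y_true = [0] * 95 + [1] * 5
y_pred = [0] * 100  # a "model" that always predicts "not fraud"

print(accuracy(y_true, y_pred))  # 0.95 — yet it catches zero fraud
```

The 95% score says nothing about the 5 fraud cases it missed, which is exactly the blind spot the metrics below address.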
The Confusion Matrix
Before the metrics, you need the four building blocks. For a binary classifier with a positive class (e.g., "fraud") and a negative class (e.g., "not fraud"):
| | Predicted Positive | Predicted Negative |
|---|---|---|
| Actually Positive | TP — True Positive | FN — False Negative (missed it) |
| Actually Negative | FP — False Positive (false alarm) | TN — True Negative |
- TP: model said positive, it was positive. Correct catch.
- FP: model said positive, it was negative. False alarm.
- FN: model said negative, it was positive. Missed case.
- TN: model said negative, it was negative. Correct dismissal.
Every metric below is a ratio built from these four numbers.
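The four counts can be tallied directly from paired labels. A small sketch, assuming binary labels where 1 marks the positive class (the function name and example data are illustrative):

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, FN, TN) for binary labels where 1 is the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn

y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(confusion_counts(y_true, y_pred))  # (2, 1, 1, 2)
```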
A cancer screening model misses 40 actual cancer cases, labelling them as healthy. What type of error is this?
Precision and Recall
These two metrics capture the tradeoff between false alarms and missed cases.
Precision answers: of everything I flagged as positive, how many were actually positive?
Precision = TP / (TP + FP)
Recall (also called sensitivity) answers: of all the actual positives, how many did I catch?
Recall = TP / (TP + FN)
| Metric | Punishes | Use when… |
|---|---|---|
| Precision | False alarms (FP) | False positives are costly — e.g., spam filter blocking real emails |
| Recall | Missed cases (FN) | False negatives are costly — e.g., cancer screening, fraud detection |
For example, a fraud model flags 100 transactions as fraudulent; 80 are genuinely fraudulent and 20 are not. Precision = 80 / (80 + 20) = 80%. To know recall, you would need the total number of actual frauds in the dataset.
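Both formulas reduce to one-liners over the confusion-matrix counts. The precision figure below matches the worked example; the recall line assumes a hypothetical total of 200 actual frauds (so 120 were missed), purely for illustration:

```python
def precision(tp, fp):
    """Of everything flagged positive, what fraction was actually positive?"""
    return tp / (tp + fp)

def recall(tp, fn):
    """Of all actual positives, what fraction was caught?"""
    return tp / (tp + fn)

# The fraud example: 100 flagged, 80 genuine fraud, 20 false alarms.
print(precision(tp=80, fp=20))   # 0.8
# Assumed for illustration: 200 actual frauds in total, so 120 missed.
print(recall(tp=80, fn=120))     # 0.4
```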
A fraud detection model flags 100 transactions. 80 are genuine fraud, 20 are not. What is the model's precision?
F1 Score
Precision and recall pull in opposite directions — boosting one often hurts the other. The F1 Score is their harmonic mean, giving a single number that balances both.
F1 = 2 × (Precision × Recall) / (Precision + Recall)
The harmonic mean is used instead of the arithmetic mean because it penalises large imbalances. A model with precision 1.0 and recall 0.0 would have an arithmetic mean of 0.5 but an F1 of 0 — correctly flagging it as useless.
F1 is most useful when classes are imbalanced and both false positives and false negatives carry cost. If one matters far more than the other, use precision or recall directly.
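The harmonic-versus-arithmetic distinction is easy to check numerically. A sketch (the guard for precision + recall = 0 is a conventional choice, since the formula is otherwise undefined there):

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

print(f1(1.0, 0.0))            # 0.0 — harmonic mean exposes the useless model
print((1.0 + 0.0) / 2)         # 0.5 — arithmetic mean hides it
print(round(f1(0.8, 0.5), 3))  # 0.615 — a balanced-ish model
```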
A model has precision 0.90 and recall 0.10. What is its F1 score (approximately)?
ROC-AUC
Most classifiers output a probability score, not a hard label. The threshold you choose (e.g., flag as fraud if score > 0.5) sets the precision-recall tradeoff. The ROC curve plots the True Positive Rate (recall) against the False Positive Rate at every possible threshold.
AUC (Area Under the Curve) summarises the entire curve as a single number:
- AUC = 1.0: perfect classifier — separates all positives from all negatives.
- AUC = 0.5: random guessing — no better than a coin flip.
- AUC < 0.5: worse than random (predictions are systematically inverted).
ROC-AUC is threshold-independent, making it useful for comparing models before deciding on a deployment threshold. It can be misleading with extreme class imbalance — in those cases, the Precision-Recall AUC is often more informative.
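AUC has an equivalent interpretation that makes it easy to compute by hand: the probability that a randomly chosen positive gets a higher score than a randomly chosen negative (ties counting as half). A pairwise O(P×N) sketch with illustrative scores; production libraries use a sort-based method instead:

```python
def roc_auc(y_true, scores):
    """AUC as the probability a random positive outranks a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

y_true = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
print(round(roc_auc(y_true, scores), 3))  # 0.889 — 8 of 9 pairs ranked correctly
```

Note that no threshold appears anywhere in the computation, which is exactly why AUC is threshold-independent.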
