What Is Overfitting?
Memorisation is not learning. A model that memorises training examples can reproduce them perfectly but has learned nothing it can apply to new data. That is overfitting: strong training performance, poor generalisation.
This course covers how to detect overfitting and the techniques that control it. It builds on Introduction to Machine Learning and connects to Gradient Descent & Optimization for training dynamics.
Telltale sign: very low training loss, but validation loss is much higher and diverging.
Underfitting vs Overfitting
| Regime | Training error | Validation error | Cause |
|---|---|---|---|
| Underfitting (high bias) | High | High | Model too simple |
| Good fit | Low | Low, similar to training | Right complexity |
| Overfitting (high variance) | Low | Much higher than training | Model too complex |
A model achieves 98% accuracy on training data but only 71% on the validation set. What is the most likely problem?
The Bias-Variance Decomposition
Prediction error decomposes into three parts: bias² (from wrong assumptions), variance (from sensitivity to small changes in the training set), and irreducible noise. Reducing bias usually increases variance, and the reverse. This decomposition is exact for mean squared error in regression — for classification tasks, several incompatible formulations exist and there is no single agreed-upon version.
Regularization Techniques
L2 Regularization (Ridge / Weight Decay)
Adds a penalty proportional to the sum of squared weights to the loss. The optimizer shrinks every weight toward zero on each step, discouraging the model from relying too heavily on any single feature.
```python
# lam controls the regularization strength
loss_total = loss_original + lam * sum(w**2 for w in weights)
```
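To see the "shrinks every weight toward zero on each step" behaviour in isolation, here is a toy gradient-descent step with the data gradient zeroed out (all values are illustrative):

```python
# Gradient of the penalty lam * w**2 is 2 * lam * w, so each gradient
# step multiplies w by (1 - 2 * lr * lam) before the data gradient acts.
lr, lam = 0.1, 0.01
w = 5.0
for _ in range(100):
    grad_data = 0.0   # isolate the decay: pretend the data gradient vanishes
    w -= lr * (grad_data + 2 * lam * w)
# w has decayed geometrically toward zero: 5 * (1 - 2*lr*lam)**100
```

In a real training loop `grad_data` pulls the weight toward fitting the data, while the decay term continuously pulls it back toward zero; the balance between the two is what λ controls.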
L1 Regularization (Lasso)
Adds a penalty proportional to the sum of absolute weight values. Unlike L2, L1 drives many weights exactly to zero, effectively performing feature selection. This is useful, for example, when you suspect only a small subset of the input features is genuinely predictive.
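A quick way to see the sparsity difference is to fit both penalties on synthetic data where only a few features matter. A sketch assuming scikit-learn is available; the dataset and `alpha` values are illustrative:

```python
import numpy as np
from sklearn.linear_model import Lasso, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 20))
# Only the first 3 of 20 features are genuinely predictive
y = 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.5 * X[:, 2] + rng.normal(scale=0.1, size=200)

lasso = Lasso(alpha=0.1).fit(X, y)   # L1 penalty
ridge = Ridge(alpha=0.1).fit(X, y)   # L2 penalty

# L1 zeroes out the irrelevant weights exactly; L2 only shrinks them
n_zero_lasso = int(np.sum(lasso.coef_ == 0))
n_zero_ridge = int(np.sum(ridge.coef_ == 0))
```

Inspecting `lasso.coef_` shows the 17 irrelevant weights driven exactly to zero, while every `ridge.coef_` entry is small but nonzero.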
Dropout
During each training step, a random fraction p of neuron outputs is set to zero. This prevents co-adaptation — neurons cannot rely on specific other neurons always being present.
- Typical rates: 0.1–0.5 for hidden layers, 0.5–0.8 for NLP embedding layers.
- Never apply dropout to the output layer.
- Effectively trains an ensemble of 2^n subnetworks simultaneously.
- Modern implementations (PyTorch, TensorFlow) use inverted dropout: kept activations are scaled up by 1/(1 − p) during training, so no adjustment is needed at inference. Older formulations instead scaled activations down by (1 − p) at test time; the two are equivalent in expectation.
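Inverted dropout is only a few lines. This numpy sketch mirrors what the frameworks do; the function name and defaults are illustrative:

```python
import numpy as np

def dropout(x, p=0.5, training=True, rng=None):
    """Inverted dropout: zero each activation with probability p and
    rescale survivors by 1/(1 - p), so the expected output equals the input."""
    if not training or p == 0.0:
        return x                        # identity at inference time
    rng = np.random.default_rng() if rng is None else rng
    mask = rng.random(x.shape) >= p     # keep with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones((1000, 100))
out = dropout(x, p=0.5, rng=np.random.default_rng(0))
# Survivors are scaled to 2.0, so the mean activation stays close to 1.0
```

Because the rescaling happens at training time, inference is just the identity, which is why frameworks require switching the module into eval mode rather than adjusting weights.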
What does dropout do during training?
Early Stopping
Monitor validation loss during training and stop once it has not improved for a set number of evaluations (the patience), keeping the checkpoint with the best validation loss. The only overhead is periodic validation and checkpointing, which makes it one of the cheapest regularization techniques available.
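The loop logic, stripped of any framework; the patience value and the loss curve below are illustrative:

```python
def early_stopping(val_losses, patience=3):
    """Consume validation losses epoch by epoch; stop after `patience`
    consecutive epochs without improvement. Returns (best_epoch, best_loss)."""
    best_epoch, best_loss, waited = -1, float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_epoch, best_loss, waited = epoch, loss, 0   # save checkpoint here
        else:
            waited += 1
            if waited >= patience:
                break                                        # stop training
    return best_epoch, best_loss

# Validation loss improves, then diverges: the classic overfitting curve
losses = [1.0, 0.7, 0.5, 0.45, 0.46, 0.48, 0.52]
best_epoch, best_loss = early_stopping(losses, patience=3)
```

In a real loop the "save checkpoint here" line would serialise the model weights, and the function would restore the best checkpoint before returning.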
Data Augmentation
Generate additional training examples by applying label-preserving transformations to existing ones. The model sees more varied inputs, making exact memorisation harder.
- Images: random flips, rotations, crops, colour jitter, cutout, mixup.
- Text: synonym replacement, back-translation, random insertion or deletion.
- Audio: time stretching, pitch shifting, adding background noise.
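For images, two of the most common transforms are a random horizontal flip and a pad-then-random-crop. A numpy sketch, with the padding size as an illustrative choice:

```python
import numpy as np

def augment(image, rng, pad=2):
    """Random horizontal flip, then pad by `pad` pixels on each side and
    take a random crop back to the original size. Expects (H, W, C) arrays."""
    h, w, _ = image.shape
    if rng.random() < 0.5:
        image = image[:, ::-1, :]                       # horizontal flip
    padded = np.pad(image, ((pad, pad), (pad, pad), (0, 0)))
    top = rng.integers(0, 2 * pad + 1)                  # random crop offset
    left = rng.integers(0, 2 * pad + 1)
    return padded[top:top + h, left:left + w, :]

rng = np.random.default_rng(0)
img = np.arange(32 * 32 * 3, dtype=np.float32).reshape(32, 32, 3)
out = augment(img, rng)
# Same shape and dtype as the input; contents are randomly shifted/flipped
```

Both transforms preserve the label, so the model sees a slightly different version of each image every epoch, which makes pixel-level memorisation much harder.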
Training, Validation, and Test Sets
The test set must stay unseen until the very end. If you use test results to make any decision — choosing a model, tuning a threshold — you have implicitly trained on it, and your reported performance is optimistic.
- Training set (60–80%): used to fit model parameters.
- Validation set (10–20%): used to tune hyperparameters and compare models.
- Test set (10–20%): used exactly once to report final performance.
- Cross-validation: rotate the validation split to get a better generalisation estimate when data is scarce.
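A shuffle-once index split following the fractions above; the function name and seed are illustrative:

```python
import numpy as np

def three_way_split(n, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle indices once, then carve off test and validation sets.
    Returns (train_idx, val_idx, test_idx) as disjoint index arrays."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(n)
    n_test, n_val = int(n * test_frac), int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test

train, val, test = three_way_split(1000)
```

Fixing the seed makes the split reproducible, which matters: resampling the split between experiments quietly leaks test examples into training.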
When to Apply What
| Symptom | Diagnosis | Fix |
|---|---|---|
| High training AND validation error | Underfitting | Increase capacity, train longer |
| Low training error, high validation error | Overfitting | Add regularization, more data, reduce model size |
| Both errors high and validation ≈ training, even after increasing capacity | Data-limited | Collect more data or try a different architecture |
Always start with the simplest model that could work, then add complexity only when the data supports it.