Machine Learning · Beginner

Overfitting & Regularization

Tags: overfitting, regularization, dropout, L1 regularization, L2 regularization, bias-variance tradeoff

What Is Overfitting?

Memorisation is not learning. A model that memorises training examples can reproduce them perfectly but has learned nothing it can apply to new data. That is overfitting: strong training performance, poor generalisation.

This course covers how to detect overfitting and the techniques that control it. It builds on Introduction to Machine Learning and connects to Gradient Descent & Optimization for training dynamics.

Telltale sign: very low training loss, but validation loss is much higher and diverging.

Underfitting vs Overfitting

  • Underfitting (high bias): high training error, high validation error. Cause: model too simple.
  • Good fit: low training error, validation error low and similar to training. Cause: right complexity.
  • Overfitting (high variance): low training error, validation error much higher than training. Cause: model too complex.
Quick Check

A model achieves 98% accuracy on training data but only 71% on the validation set. What is the most likely problem?

The Bias-Variance Decomposition

Prediction error decomposes into three parts: bias² (from wrong assumptions), variance (from sensitivity to small changes in the training set), and irreducible noise. Reducing bias usually increases variance, and the reverse. This decomposition is exact for mean squared error in regression — for classification tasks, several incompatible formulations exist and there is no single agreed-upon version.
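For squared error, the decomposition can be written out explicitly (standard textbook form, with f the true function, f̂ the learned predictor, and σ² the noise variance):

```latex
\mathbb{E}\left[(y - \hat{f}(x))^2\right]
= \underbrace{\left(\mathbb{E}[\hat{f}(x)] - f(x)\right)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}\left[\left(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\right)^2\right]}_{\text{variance}}
+ \underbrace{\sigma^2}_{\text{irreducible noise}}
```

The expectation is taken over random draws of the training set; the σ² term is the noise floor no model can beat.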

Diagram: bias², variance, and total error plotted against model complexity.
As model complexity grows, bias falls but variance rises. The optimal complexity minimises total error — the sweet spot between underfitting and overfitting.

Regularization Techniques

L2 Regularization (Ridge / Weight Decay)

Adds a penalty proportional to the sum of squared weights to the loss. The optimizer shrinks every weight toward zero on each step, discouraging the model from relying too heavily on any single feature.

loss_total = loss_original + λ * sum(w**2 for w in weights)
# λ controls the regularization strength
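To make the shrinkage effect concrete, here is a minimal sketch using the closed-form ridge solution (an illustrative example, not from the course; the data and λ values are made up):

```python
import numpy as np

# Minimising ||Xw - y||^2 + lam * ||w||^2 has the closed-form solution
# w = (X^T X + lam*I)^(-1) X^T y.
def ridge_fit(X, y, lam):
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = X @ np.array([3.0, -2.0, 0.5, 1.0, -1.5]) + rng.normal(scale=0.1, size=50)

w_small = ridge_fit(X, y, lam=0.01)
w_large = ridge_fit(X, y, lam=100.0)

# A larger lambda shrinks the whole weight vector toward zero.
print(np.linalg.norm(w_small) > np.linalg.norm(w_large))  # True
```

Note that no weight becomes exactly zero; L2 shrinks all weights smoothly, which is the key contrast with L1 below.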

L1 Regularization (Lasso)

Adds a penalty proportional to the sum of absolute weight values. Unlike L2, L1 drives many weights exactly to zero — effectively performing feature selection. For example, it is useful when you suspect only a small subset of input features is genuinely predictive.
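Why L1 produces exact zeros is easiest to see in the soft-thresholding (proximal) update it induces — a sketch with illustrative values, not a full optimizer:

```python
# After each gradient step, L1 shrinks every weight toward zero by a fixed
# amount lr * lam; any weight whose magnitude is below that threshold
# lands exactly at zero.
def soft_threshold(w, threshold):
    if w > threshold:
        return w - threshold
    if w < -threshold:
        return w + threshold
    return 0.0

weights = [0.8, -0.03, 0.002, -1.5]
lr, lam = 0.1, 0.5  # learning rate and L1 strength (illustrative values)
shrunk = [soft_threshold(w, lr * lam) for w in weights]
print(shrunk)  # the two small weights are driven exactly to zero
```

Contrast with L2, which multiplies weights by a factor slightly below one each step and so never reaches zero exactly.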

Dropout

During each training step, a random fraction p of neuron outputs is set to zero. This prevents co-adaptation — neurons cannot rely on specific other neurons always being present.

  • Typical rates: 0.1–0.5 for hidden layers, 0.5–0.8 for NLP embedding layers.
  • Never apply dropout to the output layer.
  • Effectively trains an ensemble of 2^n subnetworks simultaneously.
  • Modern implementations (PyTorch, TensorFlow) use inverted dropout: activations are scaled up by 1/(1 − p) during training, so no adjustment is needed at inference. Older formulations scaled down by (1 − p) at test time — both are mathematically equivalent.
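The inverted-dropout variant described above fits in a few lines — a minimal NumPy sketch (function name and signature are ours, not a library API):

```python
import numpy as np

def dropout(x, p, training=True, rng=None):
    """Zero each element with probability p; scale survivors by 1/(1 - p)."""
    if not training or p == 0.0:
        return x                      # identity at inference time
    rng = rng or np.random.default_rng()
    mask = rng.random(x.shape) >= p   # keep each unit with probability 1 - p
    return x * mask / (1.0 - p)

x = np.ones(1000)
out = dropout(x, p=0.5, rng=np.random.default_rng(0))

# The 1/(1 - p) scaling keeps the expected activation unchanged,
# so the mean stays close to the original mean of 1.0.
print(abs(out.mean() - 1.0) < 0.1)  # True
```

Because the scaling happens during training, calling the same function with training=False is a no-op, matching inference-time behaviour in PyTorch and TensorFlow.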
Quick Check

What does dropout do during training?

Early Stopping

Monitor validation loss during training and stop when it stops improving. Save the best checkpoint. It adds essentially no computational cost, making it one of the cheapest regularization techniques available.
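The usual patience-based variant can be sketched as follows (an illustrative helper, not library code; the loss sequence is made up):

```python
def early_stopping_best_epoch(val_losses, patience=3):
    """Return the epoch of the best checkpoint, stopping the scan after
    `patience` consecutive epochs without improvement."""
    best_loss, best_epoch, bad_epochs = float("inf"), 0, 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch, bad_epochs = loss, epoch, 0
            # in real training: save a model checkpoint here
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                break  # validation loss has diverged; stop training
    return best_epoch

# Validation loss improves, then diverges as overfitting sets in.
losses = [1.0, 0.8, 0.7, 0.75, 0.9, 1.1, 1.3]
print(early_stopping_best_epoch(losses))  # 2, the epoch of the best checkpoint
```

A patience of a few epochs guards against stopping on ordinary epoch-to-epoch noise in the validation loss.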

Data Augmentation

Generate additional training examples by applying label-preserving transformations to existing ones. The model sees more varied inputs, making exact memorisation harder.

  • Images: random flips, rotations, crops, colour jitter, cutout, mixup.
  • Text: synonym replacement, back-translation, random insertion or deletion.
  • Audio: time stretching, pitch shifting, adding background noise.
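Two of the image transformations above can be sketched in a few lines (toy helpers of our own, operating on a NumPy array standing in for an image):

```python
import numpy as np

def random_flip(img, rng):
    """Horizontally flip the image half of the time (label-preserving)."""
    return img[:, ::-1] if rng.random() < 0.5 else img

def add_noise(img, rng, scale=0.05):
    """Add small Gaussian pixel noise."""
    return img + rng.normal(scale=scale, size=img.shape)

rng = np.random.default_rng(0)
img = np.arange(12, dtype=float).reshape(3, 4)  # a toy 3x4 "image"
augmented = add_noise(random_flip(img, rng), rng)

# The label is unchanged; only the input varies between epochs.
print(augmented.shape)  # (3, 4)
```

In practice these transforms are applied on the fly inside the data loader, so every epoch sees a slightly different version of each example.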

Training, Validation, and Test Sets

The test set must stay unseen until the very end. If you use test results to make any decision — choosing a model, tuning a threshold — you have implicitly trained on it, and your reported performance is optimistic.

  • Training set (60–80%): used to fit model parameters.
  • Validation set (10–20%): used to tune hyperparameters and compare models.
  • Test set (10–20%): used exactly once to report final performance.
  • Cross-validation: rotate the validation split to get a better generalisation estimate when data is scarce.
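A shuffled three-way split along the lines above can be sketched like this (a 70/15/15 split as one common choice; the helper name is ours):

```python
import random

def split_dataset(examples, val_frac=0.15, test_frac=0.15, seed=0):
    """Shuffle, then carve off test and validation sets; the rest is training."""
    examples = examples[:]                 # copy so the caller's list is untouched
    random.Random(seed).shuffle(examples)  # shuffle BEFORE splitting
    n = len(examples)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = examples[:n_test]
    val = examples[n_test:n_test + n_val]
    train = examples[n_test + n_val:]
    return train, val, test

train, val, test = split_dataset(list(range(100)))
print(len(train), len(val), len(test))  # 70 15 15
```

Shuffling first matters: if the data is ordered (by time, class, or source), an unshuffled split gives splits with different distributions. For time-series data, however, a chronological split is usually the right choice instead.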

When to Apply What

  • High training AND validation error: underfitting. Fix: increase capacity, train longer.
  • Low training error, high validation error: overfitting. Fix: add regularization, get more data, reduce model size.
  • Both errors high but validation ≈ training: data-limited. Fix: collect more data or try a different architecture.

Always start with the simplest model that could work, then add complexity only when the data supports it.

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Overfitting & Regularization and see your score on the leaderboard.

Take the Quiz