Machine Learning — Beginner

Overfitting & Regularization

Tags: overfitting, regularization, bias-variance tradeoff, L2 regularization, lambda

What Is Overfitting?

Memorisation is not learning. A model that memorises training examples can reproduce them perfectly but has learned nothing it can apply to new data. That is overfitting: strong training performance, poor generalisation.

To measure this, you split your data into two sets before training begins. The training set is what the model learns from. The test set is held back entirely — the model never sees it during training. After training, you measure error on both:

  • Train error — how wrong the model is on the examples it was trained on. A low train error means the model fits the training data well.
  • Test error — how wrong the model is on unseen examples. A low test error means the model generalises — it has learned something real, not just memorised the training data.

The gap between the two is the signal. A model with low train error and high test error has memorised rather than learned.
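The split-and-measure workflow can be sketched in a few lines of numpy. All the data here is synthetic and illustrative — the point is only the mechanics: hold back a test set before fitting, then compare the two errors.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: y depends linearly on x, plus noise.
x = rng.uniform(0, 10, size=40)
y = 2.0 * x + 1.0 + rng.normal(0, 1.0, size=40)

# Hold back a quarter of the examples as a test set before any training.
n_test = len(x) // 4
x_train, y_train = x[n_test:], y[n_test:]
x_test, y_test = x[:n_test], y[:n_test]

# Fit a straight line on the training set only.
w, b = np.polyfit(x_train, y_train, deg=1)

def mse(xs, ys):
    """Mean squared error of the fitted line on (xs, ys)."""
    return float(np.mean((w * xs + b - ys) ** 2))

train_error = mse(x_train, y_train)
test_error = mse(x_test, y_test)

# The gap between the two errors is the overfitting signal.
gap = test_error - train_error
```

Because the model here is a straight line fitting genuinely linear data, both errors stay near the noise level and the gap is small.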

Every model faces a choice between two failure modes:

| Failure | Cause | Symptom | Name |
| --- | --- | --- | --- |
| Too simple | Ignored real patterns | High error on both train and test | Underfitting — high bias |
| Too complex | Memorised training noise | Low train error, high test error | Overfitting — high variance |

The goal is a model that sits between these extremes — one that learns the real pattern without memorising the noise.

Seeing It in Regression — Crop Yield

Say you have measurements from eight farms: rainfall (mm) and crop yield (t/ha). One farm recorded an unusually high yield at moderate rainfall — a fluke caused by an exceptional soil batch that season. You want a model that predicts yield for new farms you have not seen yet.

Three models are possible:

  1. Underfit (high bias) — a nearly flat line. The algorithm barely responds to rainfall. It fits the training farms poorly and fits new farms equally poorly — it is wrong everywhere, not just on unseen data.
  2. Good fit — a straight line with the right slope. It fits the training farms well, and it fits new farms well too — the train error and test error are both low. It correctly ignores the anomaly because that spike was a fluke, not a real pattern.
  3. Overfit (high variance) — a high-degree polynomial that bends to hit every training point, including the anomaly. It fits the training farms perfectly (near-zero train error), but it predicts ~6.8 t/ha at 130mm rainfall for new farms — wrong. Test error is much higher than train error. The model memorised the specific examples it was trained on, not the pattern.

The names come from the type of error each failure produces. Bias is a systematic error — an underfit model carries a built-in bias toward a fixed, oversimplified prediction. A flat line always predicts near the same value no matter how much rainfall there was. It is not responding to the input; it has already decided what to predict. That stubborn pull toward one answer is the "bias." Variance is sensitivity to the training set — an overfit model's predictions vary wildly when the training data changes slightly. Swap a few training examples and you get a completely different curve. The model learned which specific examples were in this training set, not the underlying pattern, so any change to the set changes the model.

[Diagram: Crop Yield vs Rainfall — Three Model Fits. The "just right" straight line captures the real trend and correctly ignores the anomaly; new farm data will mostly fall near this line — good generalisation.]
Three fits to the same crop yield data. The overfit curve spikes to include the anomaly — a fluke data point. On new farms it gives wildly wrong predictions at 130mm rainfall.

The overfit model performs perfectly on the training farms. On any new farm, it performs worse than the simple straight line. Strong training performance is not the goal — generalisation is.
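A minimal numpy sketch of the three fits. The farm numbers below are invented to mimic the scenario above — a roughly linear trend with one fluke at 130 mm — not real measurements:

```python
import numpy as np

# Eight farms: a roughly linear rainfall -> yield trend,
# plus one anomalous high yield at 130 mm (the fluke farm).
rain = np.array([70.0, 90.0, 100.0, 115.0, 130.0, 145.0, 160.0, 200.0])
yld = np.array([2.1, 2.5, 2.8, 3.2, 6.0, 3.9, 4.3, 5.3])

# Normalise rainfall so the high-degree fit is numerically stable.
x = (rain - rain.mean()) / rain.std()

fits = {}
for name, degree in [("underfit", 0), ("good", 1), ("overfit", 7)]:
    coeffs = np.polyfit(x, yld, deg=degree)
    preds = np.polyval(coeffs, x)
    fits[name] = (coeffs, float(np.mean((preds - yld) ** 2)))

# Predictions at the anomaly's rainfall (130 mm):
x130 = (130.0 - rain.mean()) / rain.std()
good_pred = float(np.polyval(fits["good"][0], x130))
over_pred = float(np.polyval(fits["overfit"][0], x130))
# The degree-7 curve threads every point, so it reproduces the fluke;
# the straight line largely ignores it and stays near the real trend.
```

The degree-7 polynomial has eight coefficients for eight points, so it interpolates the training data exactly — near-zero train error — while its prediction at 130 mm tracks the fluke rather than the trend.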

Why Overfitting Is Bad — High Variance

A model that overfits is highly sensitive to small changes in the training set. If you remove one farm from the dataset, re-train, and get a completely different curve — that is high variance. The model learned the specific noise in this particular sample of farms, not the relationship between rainfall and yield that holds across all farms.

| Model | Train error | Test error | Changes a lot if you swap training data? |
| --- | --- | --- | --- |
| Underfit | High | High | No — equally wrong everywhere |
| Good fit | Low | Low | No — learned the real pattern |
| Overfit | Very low | Much higher | Yes — memorised this specific sample |
Quick Check

You train two models on crop yield data. Model A has train error 0.4 and test error 0.42. Model B has train error 0.05 and test error 1.8. Which one is overfitting?

The Bias-Variance Tradeoff

As you increase model complexity, two things happen simultaneously. Bias falls — the model can fit more complex real patterns. Variance rises — the model becomes more sensitive to noise in the specific training sample. Total error is the sum of both. The sweet spot is the complexity level where total error is lowest.
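The tradeoff can be reproduced numerically: fit polynomials of increasing degree to synthetic data whose true relationship is quadratic, and watch train error fall while test error bottoms out and climbs again. All numbers here are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# True relationship is quadratic; the noise plays the role of flukes.
x = rng.uniform(-1, 1, size=60)
y = 1.0 + 2.0 * x - 1.5 * x**2 + rng.normal(0, 0.2, size=60)

x_train, y_train = x[:40], y[:40]
x_test, y_test = x[40:], y[40:]

train_errors, test_errors = [], []
for degree in range(10):
    coeffs = np.polyfit(x_train, y_train, deg=degree)
    train_errors.append(float(np.mean((np.polyval(coeffs, x_train) - y_train) ** 2)))
    test_errors.append(float(np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)))

# Train error only ever falls as complexity grows; test error is
# U-shaped, bottoming out near the true complexity (degree 2 here).
best_degree = int(np.argmin(test_errors))
```

The sweet spot is wherever test error — not train error — is lowest.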

[Diagram: Bias², variance, and total error vs model complexity. Bias² falls and variance rises as complexity grows; total error is U-shaped, with a sweet spot between underfitting (high bias) and overfitting (high variance).]
As complexity increases, bias falls but variance rises. The sweet spot minimises total error — the point where the model is complex enough to capture the real pattern but not so complex that it memorises noise.
Quick Check

What does 'high variance' mean in the context of overfitting?

How Do We Fix Overfitting?

There are three practical approaches to address overfitting.

| Approach | What it does | When to use |
| --- | --- | --- |
| More training data | Gives the model more examples to learn from | When data collection is feasible |
| Feature selection | Remove features that are unlikely to matter | When you have too many, weakly related features |
| Regularization | Penalise large weights in the cost function | Almost always — the most general technique |

Option 1 — Collect More Training Data

More data is the most reliable fix. With more examples, the model cannot memorise every quirk of the training set — the quirks become too numerous and diverse to reproduce, so only the consistent, real pattern remains learnable.

The downside is practical: data collection is expensive, slow, or sometimes impossible. For example, in an agricultural study you may only ever have data from a small number of farms.

More data reduces variance without any changes to the model. If data is available, get it first before changing anything else.

Option 2 — Feature Selection

If you trained on 20 features and only 5 are genuinely informative, the remaining 15 give the model 15 extra dimensions to fit noise into. Dropping those features reduces the model's capacity to overfit.

The cost is information loss. When you remove a feature, you are also removing any real signal it carried, even if that signal is small. For example, removing a weakly correlated feature like humidity from a crop yield model might seem safe, but humidity does carry some true predictive signal.

Feature selection forces a hard binary decision — keep or discard. A feature that is removed cannot contribute to the prediction at all. Regularization, covered next, offers a softer alternative.
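Both sides of that tradeoff show up in a small synthetic experiment: generate data where only 5 of 20 features matter, then compare a fit on all 20 features against a fit on just the informative 5. The numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

n_train, n_test, n_features = 40, 200, 20
X = rng.normal(size=(n_train + n_test, n_features))

# Only the first 5 features carry real signal; the rest are pure noise.
true_w = np.zeros(n_features)
true_w[:5] = [1.5, -2.0, 0.8, 1.0, -0.5]
y = X @ true_w + rng.normal(0, 0.5, size=n_train + n_test)

X_tr, y_tr = X[:n_train], y[:n_train]
X_te, y_te = X[n_train:], y[n_train:]

def fit_and_test(cols):
    """Least-squares fit using only the selected feature columns."""
    w, *_ = np.linalg.lstsq(X_tr[:, cols], y_tr, rcond=None)
    return float(np.mean((X_te[:, cols] @ w - y_te) ** 2))

all_features = fit_and_test(list(range(n_features)))
selected = fit_and_test([0, 1, 2, 3, 4])
# Dropping the 15 noise features removes 15 dimensions the model
# could fit noise into, so the selected model generalises better.
```

If one of the dropped columns had carried even weak signal, that signal would be gone for good — which is exactly the hard keep-or-discard decision the text describes.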

Quick Check

You are training a linear regression model with 30 features. After inspection, you remove 20 features you believe are uninformative. What is the main risk of this approach?

Option 3 — Regularization

Regularization keeps all features but penalises the model for assigning large weights to any of them. Instead of discarding a feature entirely, it nearly cancels out its effect by forcing its weight toward zero. The model retains every feature's small contribution but is prevented from leaning heavily on any one of them.

Consider what happens when a linear regression model overfits using a high-degree polynomial feature. The weight on that feature becomes very large, allowing the curve to swing wildly to fit training noise. Regularization adds a penalty to the cost function that grows with the size of each weight — if a weight is large, the penalty makes it expensive, and gradient descent is pushed to reduce it.

| Method | Effect on features | Throws away information? |
| --- | --- | --- |
| Feature selection | Hard remove — weight becomes 0 | Yes |
| Regularization | Soft shrink — weight approaches 0 but stays | No |

How Regularization Works

Choosing which specific weights to penalise manually is impractical — you rarely know in advance which features are causing the overfitting. Instead, regularization penalises all weights and lets the model decide. Features whose weights are small are barely affected by the penalty. Features whose weights would have been large — the ones driving overfitting — are penalised the most.

The regularization penalty is proportional to the square of each weight, summed across all n features:

Regularization term = (λ / 2m) · Σⱼ₌₁ⁿ wⱼ²

This term is added to the cost function. λ (lambda) controls how strongly you penalise large weights. A larger λ shrinks weights more aggressively. A smaller λ barely changes the original cost.

Squaring the weights means large weights are penalised disproportionately. A weight of 4 contributes 16 to the sum — four times more than a weight of 2 contributing 4. The penalty targets the biggest offenders first.
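That arithmetic is easy to check with a tiny helper function (the function name is ours, just for illustration):

```python
def penalty(weights, lam, m):
    """(lam / 2m) * sum of squared weights -- the L2 regularization term."""
    return lam / (2 * m) * sum(w * w for w in weights)

# With lam = 1 and m = 1, a weight of 4 contributes 16 to the sum
# (penalty 8.0), four times the contribution of a weight of 2
# (sum 4, penalty 2.0).
p_big = penalty([4.0], lam=1.0, m=1)
p_small = penalty([2.0], lam=1.0, m=1)
```

Doubling a weight quadruples its penalty, which is why the squared penalty targets the biggest offenders first.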

The Regularized Cost Function

The full regularized cost function adds the penalty term to the original mean squared error cost:

J(w, b) = (1/2m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/2m) · Σⱼ₌₁ⁿ wⱼ²

The first term is unchanged — it still measures fit to the training data. The second term is the regularization penalty: the sum of squared weights, scaled by λ/2m. Together they balance two competing objectives: fit the training data well AND keep the weights small.

We generally do not regularize the bias term b. The bias is a single scalar offset and has very little capacity to cause overfitting on its own. In practice its contribution is negligible, so regularization is applied only to the weights w₁, w₂, …, wₙ.
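The full cost function, with the bias excluded from the penalty, translates almost line-for-line into numpy. This is a minimal sketch for a linear model; the toy data is chosen so w = [1, 2], b = 0 fits it exactly.

```python
import numpy as np

def regularized_cost(X, y, w, b, lam):
    """J(w, b) = (1/2m) * sum of squared errors + (lam/2m) * sum of w_j^2.
    The bias b appears in the predictions but NOT in the penalty."""
    m = X.shape[0]
    preds = X @ w + b                              # f_w,b(x) for each example
    fit_term = np.sum((preds - y) ** 2) / (2 * m)  # rewards fitting the data
    penalty = lam * np.sum(w ** 2) / (2 * m)       # penalises large weights
    return float(fit_term + penalty)

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])
b = 0.0

# A perfect fit costs 0 without regularization; with lam = 1 the same
# weights still pay a penalty of (1 + 4) / (2 * 2) = 1.25.
cost_no_reg = regularized_cost(X, y, w, b, lam=0.0)
cost_reg = regularized_cost(X, y, w, b, lam=1.0)
```

Note that even a perfect fit pays a nonzero cost once λ > 0 — that pressure toward smaller weights is the entire mechanism.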

| Term | What it does |
| --- | --- |
| (1/2m) · Σ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² | Rewards fitting the training data |
| (λ/2m) · Σ wⱼ² | Penalises large weights |
| λ | Controls the balance — larger λ means stronger shrinkage |

The Lambda Hyperparameter

Lambda (λ) is the regularization hyperparameter: a non-negative value that controls the strength of the penalty.

| λ value | Effect |
| --- | --- |
| λ = 0 | No regularization — original cost function |
| λ very small (0.001–0.01) | Light penalty — most weights barely change |
| λ moderate (0.1–1) | Meaningful shrinkage — reduces overfitting noticeably |
| λ very large (100+) | All weights forced near zero — model underfits |

Choosing λ is a tuning problem. You try several values and evaluate performance on a held-out test set, the same way you tune the learning rate. Typical starting values are 0.01, 0.1, 1, and 10.
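The tuning loop looks like this in practice. The sketch below uses closed-form ridge regression on synthetic polynomial features; for brevity it penalises every column, including the constant one, which is a simplification of the "don't regularize b" rule above.

```python
import numpy as np

rng = np.random.default_rng(5)

# Noisy linear data expanded with polynomial features, so an
# unregularized fit has plenty of room to overfit.
x = rng.uniform(-1, 1, size=30)
y = 2.0 * x + rng.normal(0, 0.3, size=30)
X = np.vander(x, N=10, increasing=True)  # columns: 1, x, x^2, ..., x^9

X_tr, y_tr = X[:20], y[:20]
X_val, y_val = X[20:], y[20:]

def ridge_fit(lam):
    """Closed-form L2-regularized least squares on the training split."""
    n = X_tr.shape[1]
    return np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(n), X_tr.T @ y_tr)

# Try several lambda values and keep the one with the lowest
# held-out error -- the same workflow as tuning a learning rate.
candidates = [0.0, 0.01, 0.1, 1.0, 10.0]
val_errors = {lam: float(np.mean((X_val @ ridge_fit(lam) - y_val) ** 2))
              for lam in candidates}
best_lam = min(val_errors, key=val_errors.get)
```

The held-out split is what makes the comparison honest: every candidate λ fits the same training data, and the winner is whichever generalises best.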

Quick Check

In the regularized cost function, what happens to the model when λ is set to a very large value?

Lambda is a balance dial between fitting the data and keeping the model simple. Too small: overfitting. Too large: underfitting. The right value is found by evaluating on a test set, not the training set.


Up next

Regularized Linear Regression

Next, we apply regularization directly to linear regression — derive the regularized cost, see exactly what changes in the gradient, and implement weight decay in Python.
