What Is Overfitting?
Memorisation is not learning. A model that memorises training examples can reproduce them perfectly but has learned nothing it can apply to new data. That is overfitting: strong training performance, poor generalisation.
To measure this, you split your data into two sets before training begins. The training set is what the model learns from. The test set is held back entirely — the model never sees it during training. After training, you measure error on both:
- Train error — how wrong the model is on the examples it was trained on. A low train error means the model fits the training data well.
- Test error — how wrong the model is on unseen examples. A low test error means the model generalises — it has learned something real, not just memorised the training data.
The gap between the two is the signal. A model with low train error and high test error has memorised rather than learned.
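The split-and-compare procedure can be sketched in a few lines of NumPy. The data here is synthetic, invented purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: a linear trend plus noise (numbers invented for illustration)
x = rng.uniform(0, 10, 40)
y = 3 * x + rng.normal(0, 2, 40)

# Hold back 10 of the 40 examples as a test set the model never trains on
idx = rng.permutation(40)
train, test = idx[:30], idx[30:]

def mse(w, xs, ys):
    """Mean squared error of the polynomial w on (xs, ys)."""
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Fit on the training set only, then measure error on both sets
w = np.polyfit(x[train], y[train], deg=1)
print(f"train error: {mse(w, x[train], y[train]):.2f}")
print(f"test error:  {mse(w, x[test], y[test]):.2f}")
```

A small gap between the two numbers is what a good fit looks like; a large gap is the overfitting signal described above.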
Every model faces a choice between two failure modes:
| Failure | Cause | Symptom | Name |
|---|---|---|---|
| Too simple | Ignored real patterns | High error on both train and test | Underfitting — high bias |
| Too complex | Memorised training noise | Low train error, high test error | Overfitting — high variance |
The goal is a model that sits between these extremes — one that learns the real pattern without memorising the noise.
Seeing It in Regression — Crop Yield
Say you have measurements from eight farms: rainfall (mm) and crop yield (t/ha). One farm recorded an unusually high yield at moderate rainfall — a fluke caused by an exceptional soil batch that season. You want a model that predicts yield for new farms you have not seen yet.
Three models are possible:
- Underfit (high bias) — a nearly flat line. The algorithm barely responds to rainfall. It fits the training farms poorly and fits new farms equally poorly — it is wrong everywhere, not just on unseen data.
- Good fit — a straight line with the right slope. It fits the training farms well, and it fits new farms well too — the train error and test error are both low. It correctly ignores the anomaly because that spike was a fluke, not a real pattern.
- Overfit (high variance) — a high-degree polynomial that bends to hit every training point, including the anomaly. It fits the training farms perfectly (near-zero train error), but it predicts ~6.8 t/ha at 130mm rainfall for new farms — wrong. Test error is much higher than train error. The model memorised the specific examples it was trained on, not the pattern.
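The three fits can be reproduced with polynomial regression in NumPy. The rainfall and yield numbers below are invented to mirror the example, including the one fluke farm; only the qualitative pattern matters:

```python
import numpy as np

rng = np.random.default_rng(1)

# Eight training farms: rainfall (mm) and yield (t/ha), all numbers invented
rain = np.array([60., 80., 100., 120., 130., 150., 170., 190.])
crop = 0.02 * rain + 1.0 + rng.normal(0, 0.1, 8)
crop[4] += 2.0                      # the fluke: one anomalously high yield

# Unseen farms follow the true relationship, with no fluke
new_rain = np.array([65., 90., 110., 160., 185.])
new_crop = 0.02 * new_rain + 1.0

def mse(w, xs, ys):
    return np.mean((np.polyval(w, xs) - ys) ** 2)

# Degree 0 underfits, degree 1 is a good fit, degree 7 hits every point
for deg, label in [(0, "underfit"), (1, "good fit"), (7, "overfit")]:
    w = np.polyfit(rain, crop, deg)
    print(f"{label:9s} train={mse(w, rain, crop):.3f} "
          f"test={mse(w, new_rain, new_crop):.3f}")
```

The degree-7 curve bends through the fluke farm, so its train error is near zero while its predictions for unseen farms swing far off the true line.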
The names come from the type of error each failure produces. Bias is a systematic error — an underfit model carries a built-in bias toward a fixed, oversimplified prediction. A flat line always predicts near the same value no matter how much rainfall there was. It is not responding to the input; it has already decided what to predict. That stubborn pull toward one answer is the "bias."

Variance is sensitivity to the training set — an overfit model's predictions vary wildly when the training data changes slightly. Swap a few training examples and you get a completely different curve. The model learned which specific examples were in this training set, not the underlying pattern, so any change to the set changes the model.
The overfit model performs perfectly on the training farms. On any new farm, it performs worse than the simple straight line. Strong training performance is not the goal — generalisation is.
Why Overfitting Is Bad — High Variance
A model that overfits is highly sensitive to small changes in the training set. If you remove one farm from the dataset, re-train, and get a completely different curve — that is high variance. The model learned the specific noise in this particular sample of farms, not the relationship between rainfall and yield that holds across all farms.
| Model | Train error | Test error | Changes a lot if you swap training data? |
|---|---|---|---|
| Underfit | High | High | No — equally wrong everywhere |
| Good fit | Low | Low | No — learned the real pattern |
| Overfit | Very low | Much higher | Yes — memorised this specific sample |
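The last column of the table can be checked directly: refit the model after deleting one training point at a time and watch how far the prediction at a fixed input moves. The data below is synthetic, invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
x = np.sort(rng.uniform(0, 10, 12))
y = 2 * x + rng.normal(0, 1, 12)

def prediction_spread(deg, at=5.0):
    """Refit with each point left out once; return the range of predictions."""
    preds = [np.polyval(np.polyfit(np.delete(x, i), np.delete(y, i), deg), at)
             for i in range(len(x))]
    return max(preds) - min(preds)

print(f"linear model spread at x=5:   {prediction_spread(1):.3f}")
print(f"degree-9 model spread at x=5: {prediction_spread(9):.3f}")
```

The complex model's prediction swings far more when a single training point is removed. That instability is high variance.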
You train two models on crop yield data. Model A has train error 0.4 and test error 0.42. Model B has train error 0.05 and test error 1.8. Which one is overfitting?
The Bias-Variance Tradeoff
As you increase model complexity, two things happen simultaneously. Bias falls — the model can fit more complex real patterns. Variance rises — the model becomes more sensitive to noise in the specific training sample. Total error is the sum of both. The sweet spot is the complexity level where total error is lowest.
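The tradeoff can be seen by sweeping model complexity on one synthetic dataset (a sine curve with noise, invented for illustration) and printing both errors at each degree:

```python
import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of a smooth curve; the test set is noise-free for clarity
x = np.linspace(0, 1, 30)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 30)
xt = np.linspace(0.02, 0.98, 50)
yt = np.sin(2 * np.pi * xt)

for deg in [1, 5, 15]:
    w = np.polyfit(x, y, deg)
    train = np.mean((np.polyval(w, x) - y) ** 2)
    test = np.mean((np.polyval(w, xt) - yt) ** 2)
    print(f"degree {deg:2d}: train={train:.3f} test={test:.3f}")
```

Train error keeps falling as degree grows; test error typically falls and then climbs back up, and the minimum of the test curve marks the sweet spot.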
What does 'high variance' mean in the context of overfitting?
How Do We Fix Overfitting?
There are three practical ways to address overfitting.
| Approach | What it does | When to use |
|---|---|---|
| More training data | Gives the model more examples to learn from | When data collection is feasible |
| Feature selection | Remove features that are unlikely to matter | When you have too many, weakly related features |
| Regularization | Penalise large weights in the cost function | Almost always — the most general technique |
Option 1 — Collect More Training Data
More data is the most reliable fix. With more examples, the model cannot memorise every quirk of the training set; the quirks become too numerous and diverse to reproduce, so the model is forced to learn the underlying pattern instead.
The downside is practical: data collection is expensive, slow, or sometimes impossible. For example, in an agricultural study you may only ever have data from a small number of farms.
More data reduces variance without any changes to the model. If data is available, get it first before changing anything else.
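The variance-reducing effect of more data can be sketched by training the same overly flexible model on increasing amounts of synthetic data and watching the test error fall. Results are averaged over several random draws to smooth out the noise (all numbers invented):

```python
import numpy as np

rng = np.random.default_rng(7)

def avg_test_error(m, trials=20):
    """Average test error of a degree-6 polynomial trained on m noisy
    samples of a simple linear trend (synthetic data)."""
    errs = []
    for _ in range(trials):
        x = rng.uniform(0, 10, m)
        y = 2 * x + rng.normal(0, 1, m)
        w = np.polyfit(x, y, 6)         # same flexible model every time
        xt = np.linspace(1, 9, 50)
        errs.append(np.mean((np.polyval(w, xt) - 2 * xt) ** 2))
    return np.mean(errs)

for m in [10, 30, 300]:
    print(f"m={m:4d}: average test error {avg_test_error(m):.3f}")
```

The model itself never changes; only the amount of data does, and the test error drops as the training set grows.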
Option 2 — Feature Selection
If you trained on 20 features and only 5 are genuinely informative, the remaining 15 give the model 15 extra dimensions to fit noise into. Dropping those features reduces the model's capacity to overfit.
The cost is information loss. When you remove a feature, you are also removing any real signal it carried, even if that signal is small. For example, removing a weakly correlated feature like humidity from a crop yield model might seem safe, but humidity does carry some true predictive signal.
Feature selection forces a hard binary decision — keep or discard. A feature that is removed cannot contribute to the prediction at all. Regularization, covered next, offers a softer alternative.
You are training a linear regression model with 30 features. After inspection, you remove 20 features you believe are uninformative. What is the main risk of this approach?
Option 3 — Regularization
Regularization keeps all features but penalises the model for assigning large weights to any of them. Instead of discarding a feature entirely, it shrinks the feature's weight toward zero, damping its effect without removing it. The model retains every feature's small contribution but is prevented from leaning heavily on any one of them.
Consider what happens when a linear regression model overfits using a high-degree polynomial feature. The weight on that feature becomes very large, allowing the curve to swing wildly to fit training noise. Regularization adds a penalty to the cost function that grows with the size of each weight — if a weight is large, the penalty makes it expensive, and gradient descent is pushed to reduce it.
| Method | Effect on features | Throws away information? |
|---|---|---|
| Feature selection | Hard remove — weight becomes 0 | Yes |
| Regularization | Soft shrink — weight approaches 0 but stays | No |
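The hard-remove vs soft-shrink contrast shows up directly in the fitted weights. A minimal sketch using the closed-form regularized least-squares solution w = (XᵀX + λI)⁻¹Xᵀy on invented data (no bias term, for brevity):

```python
import numpy as np

rng = np.random.default_rng(4)

# Two informative features, three pure-noise features (all data invented)
X = rng.normal(size=(50, 5))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(0, 0.5, 50)

def ridge(X, y, lam):
    """Closed-form regularized least squares: w = (XᵀX + λI)⁻¹Xᵀy."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 10.0, 1000.0]:
    print(f"λ={lam:<7} weights: {np.round(ridge(X, y, lam), 3)}")
```

As λ grows, every weight shrinks toward zero but none is set exactly to zero: the soft alternative to feature selection's hard keep-or-discard.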
How Regularization Works
Choosing which specific weights to penalise manually is impractical — you rarely know in advance which features are causing the overfitting. Instead, regularization penalises all weights and lets the model decide. Features whose weights are small are barely affected by the penalty. Features whose weights would have been large — the ones driving overfitting — are penalised the most.
The regularization penalty is proportional to the square of each weight, summed across all n features:

(λ/2m) · Σⱼ₌₁ⁿ wⱼ²
This term is added to the cost function. λ (lambda) controls how strongly you penalise large weights. A larger λ shrinks weights more aggressively. A smaller λ barely changes the original cost.
Squaring the weights means large weights are penalised disproportionately. A weight of 4 contributes 16 to the sum — four times more than a weight of 2 contributing 4. The penalty targets the biggest offenders first.
The Regularized Cost Function
The full regularized cost function adds the penalty term to the original mean squared error cost:

J(w, b) = (1/2m) · Σᵢ₌₁ᵐ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² + (λ/2m) · Σⱼ₌₁ⁿ wⱼ²
The first term is unchanged — it still measures fit to the training data. The second term is the regularization penalty: the sum of squared weights, scaled by λ/2m. Together they balance two competing objectives: fit the training data well AND keep the weights small.
We generally do not regularize the bias term b. The bias is a single scalar offset and has very little capacity to cause overfitting on its own. In practice its contribution is negligible, so regularization is applied only to the weights w₁, w₂, …, wₙ.
| Term | What it does |
|---|---|
| (1/2m) · Σ(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² | Rewards fitting the training data |
| (λ/2m) · Σwⱼ² | Penalises large weights |
| λ | Controls the balance — larger λ means stronger shrinkage |
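The table above translates line-for-line into code. A sketch of the regularized cost on a tiny hand-checkable example (all numbers invented):

```python
import numpy as np

def regularized_cost(w, b, X, y, lam):
    """J(w, b) = (1/2m)·Σ(ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² + (λ/2m)·Σwⱼ²; b is not penalised."""
    m = len(y)
    residuals = X @ w + b - y
    fit = np.sum(residuals ** 2) / (2 * m)          # rewards fitting the data
    penalty = lam * np.sum(w ** 2) / (2 * m)        # penalises large weights
    return fit + penalty

X = np.array([[1.0, 2.0], [3.0, 4.0]])
y = np.array([5.0, 11.0])
w = np.array([1.0, 2.0])        # fits exactly: X @ w == y

print(regularized_cost(w, 0.0, X, y, lam=0.0))  # 0.0 — fit term only
print(regularized_cost(w, 0.0, X, y, lam=1.0))  # (1/(2·2))·(1² + 2²) = 1.25
```

With a perfect fit, the entire cost comes from the penalty term, which is why the optimiser is willing to trade a little fit for smaller weights.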
The Lambda Hyperparameter
Lambda (λ) is the regularization hyperparameter. It is non-negative and controls the strength of the penalty.
| λ value | Effect |
|---|---|
| λ = 0 | No regularization — original cost function |
| λ very small (0.001–0.01) | Light penalty — most weights barely change |
| λ moderate (0.1–1) | Meaningful shrinkage — reduces overfitting noticeably |
| λ very large (100+) | All weights forced near zero — model underfits |
Choosing λ is a tuning problem. You try several values and evaluate performance on a held-out test set, the same way you tune the learning rate. Typical starting values are 0.01, 0.1, 1, and 10.
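The tuning loop is a straightforward sweep. Here regularized least squares (closed form, invented data) is evaluated on held-out examples at each candidate λ:

```python
import numpy as np

rng = np.random.default_rng(5)

# Few examples, many features: a setup that invites overfitting (data invented)
X = rng.normal(size=(20, 10))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(0, 0.3, 20)
X_test = rng.normal(size=(100, 10))
y_test = 2 * X_test[:, 0] - X_test[:, 1] + rng.normal(0, 0.3, 100)

def ridge(X, y, lam):
    """Closed-form regularized least squares: w = (XᵀX + λI)⁻¹Xᵀy."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

for lam in [0.0, 0.01, 0.1, 1.0, 10.0, 100.0]:
    w = ridge(X, y, lam)
    err = np.mean((X_test @ w - y_test) ** 2)
    print(f"λ={lam:<6} test error: {err:.3f}")
```

Pick the λ with the lowest held-out error; at the extremes you see the behaviour from the table, with overfitting near λ = 0 and underfitting at very large λ.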
In the regularized cost function, what happens to the model when λ is set to a very large value?
Lambda is a balance dial between fitting the data and keeping the model simple. Too small: overfitting. Too large: underfitting. The right value is found by evaluating on a test set, not the training set.
