How Do We Fix Overfitting?
The previous module introduced overfitting: the model performs well on training data but poorly on new examples. There are three practical approaches to address it.
| Approach | What it does | When to use |
|---|---|---|
| More training data | Gives the model more examples to learn from | When data collection is feasible |
| Feature selection | Remove features that are unlikely to matter | When you have too many, weakly related features |
| Regularization | Penalise large weights in the cost function | Almost always — the most general technique |
Option 1 — Collect More Training Data
More data is the most reliable fix. With more examples, the model can no longer memorise every quirk of the training set; the patterns it must capture become too numerous and diverse to fit with noise.
The downside is practical: data collection is expensive, slow, or sometimes impossible. For example, in a medical study you may only ever have 200 patient records.
More data reduces variance without any changes to the model. If data is available, get it first before changing anything else.
Option 2 — Feature Selection
If you trained on 100 features and only 10 are genuinely informative, the remaining 90 give the model 90 extra dimensions to fit noise into. Dropping those features reduces the model's capacity to overfit.
The cost is information loss. When you remove a feature, you are also removing any real signal it carried, even if that signal is small. For example, removing a weakly correlated feature like resting heart rate from a diabetes model might seem safe, but heart rate does carry some true predictive signal.
Feature selection forces a hard binary decision — keep or discard. A feature that is removed cannot contribute to the prediction at all. Regularization, covered next, offers a softer alternative.
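In code, that hard binary decision is literally a keep-or-discard on columns. A minimal sketch with NumPy, where the data and the kept indices are made up for illustration:

```python
import numpy as np

# Hypothetical dataset: 100 samples, 50 features (random for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Indices of the features we believe are informative (an assumption here).
keep = [0, 3, 7, 12, 21]
X_selected = X[:, keep]  # hard decision: every other column is discarded

print(X.shape, X_selected.shape)  # (100, 50) (100, 5)
```

The discarded 45 columns can no longer contribute anything to a model trained on `X_selected`, which is exactly the information loss described above.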
You are training a logistic regression model with 50 features. After inspection, you remove 30 features you believe are uninformative. What is the main risk of this approach?
Option 3 — Regularization
Regularization keeps all features but penalises the model for assigning large weights to any of them. Instead of discarding a feature entirely, it shrinks the feature's weight toward zero, nearly cancelling its effect. The model retains every feature's small contribution but is prevented from leaning heavily on any one of them.
Consider what happens when a logistic regression model overfits using a high-degree polynomial feature. The weight on that feature becomes very large, allowing the decision boundary to swing wildly to fit training noise. Regularization adds a penalty to the cost function that grows with the size of each weight — if a weight is large, the penalty makes it expensive, and gradient descent is pushed to reduce it.
| Method | Effect on features | Throws away information? |
|---|---|---|
| Feature selection | Hard remove — weight becomes 0 | Yes |
| Regularization | Soft shrink — weight approaches 0 but stays | No |
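The "soft shrink" row can be seen numerically. If the penalty has the common squared-weight form (λ/2m)·Σw² (an assumption for this sketch), its gradient pulls each weight toward zero by (λ/m)·w per gradient-descent step. The sketch below applies only that pull, with illustrative settings:

```python
import numpy as np

w = np.array([0.1, 5.0])      # one small weight, one large weight
alpha, lam, m = 0.1, 1.0, 10  # learning rate and penalty settings (illustrative)

for _ in range(100):
    # Penalty-only update: equivalent to w *= (1 - alpha * lam / m).
    w = w - alpha * (lam / m) * w

print(w)  # both weights shrink toward zero, but neither becomes exactly zero
```

Note the multiplicative shrinkage: the large weight loses far more in absolute terms on every step, yet no weight is ever removed outright, which is the key contrast with feature selection.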
How Regularization Works
Choosing which specific weights to penalise manually is impractical — you rarely know in advance which features are causing the overfitting. Instead, regularization penalises all weights and lets the model decide. Features whose weights are small are barely affected by the penalty. Features whose weights would have been large — the ones driving overfitting — are penalised the most.
The regularization penalty is proportional to the square of each weight, summed across all n features:

$$\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
This term is added to the cost function. λ (lambda) controls how strongly you penalise large weights. A larger λ shrinks weights more aggressively. A smaller λ barely changes the original cost.
Squaring the weights means large weights are penalised disproportionately. A weight of 4 contributes 16 to the sum — four times more than a weight of 2 contributing 4. The penalty targets the biggest offenders first.
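The arithmetic above is easy to check directly. A minimal sketch, assuming the λ/(2m) scaling of the penalty:

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Squared-weight penalty: (lam / (2 * m)) * sum of w_j^2."""
    return (lam / (2 * m)) * np.sum(w ** 2)

w = np.array([2.0, 4.0])
# Squaring penalises the larger weight disproportionately:
# the weight of 4 contributes 16, four times the 4 contributed by the weight of 2.
print(w ** 2)
print(l2_penalty(w, lam=1.0, m=10))  # (1 / 20) * (4 + 16) = 1.0
```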
The Regularized Cost Function
The full regularized cost function adds the penalty term to the original binary cross-entropy cost:

$$J(\vec{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)} \log f_{\vec{w},b}(\vec{x}^{(i)}) + (1 - y^{(i)}) \log\big(1 - f_{\vec{w},b}(\vec{x}^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$
The first term is the original cross-entropy cost. The second term is the regularization penalty. Together they balance two competing objectives: fit the training data well AND keep the weights small.
We generally do not regularize the bias term b. The bias is a single scalar offset and has very little capacity to cause overfitting on its own. If you do want to regularize b as well, a separate term is added:

$$\frac{\lambda}{2m} b^2$$
In practice, the contribution of b to overfitting is negligible compared to the n weight parameters, so the bias regularization term is almost always omitted.
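The regularized cost described above can be sketched in NumPy. This is a sketch for a logistic regression model, assuming the λ/(2m) scaling and leaving the bias unpenalised; the data and parameter values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(w, b, X, y, lam):
    """Binary cross-entropy plus the squared-weight penalty (bias b not penalised)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                      # model predictions
    cross_entropy = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
    penalty = (lam / (2 * m)) * np.sum(w ** 2)  # only the n weights, not b
    return cross_entropy + penalty

# With lam = 0 the function reduces to the original cross-entropy cost;
# for the same nonzero weights, a larger lam gives a strictly larger cost.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
w, b = np.array([0.5, -0.5]), 0.1
print(regularized_cost(w, b, X, y, 0.0) < regularized_cost(w, b, X, y, 1.0))  # True
```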
The Lambda Hyperparameter
Lambda (λ) is the regularization hyperparameter. It must be greater than or equal to zero and controls the strength of the penalty.
| λ value | Effect |
|---|---|
| λ = 0 | No regularization — original cost function |
| λ very small (0.001–0.01) | Light penalty — most weights barely change |
| λ moderate (0.1–1) | Meaningful shrinkage — reduces overfitting noticeably |
| λ very large (100+) | All weights forced near zero — model underfits |
Choosing λ is a tuning problem. You try several values and evaluate performance on a held-out validation set, the same way you tune the learning rate. Typical starting values are 0.01, 0.1, 1, and 10.
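A tuning loop over that starting grid might look like the following sketch. It assumes scikit-learn, whose `LogisticRegression` parameterises regularization strength inversely as C = 1/λ, and uses synthetic data for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends mostly on feature 0, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_lam, best_acc = None, -1.0
for lam in [0.01, 0.1, 1, 10]:                       # typical starting values
    model = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)
    acc = model.score(X_val, y_val)                  # held-out validation set only
    if acc > best_acc:
        best_lam, best_acc = lam, acc

print(best_lam, round(best_acc, 3))
```

The key point is that `model.score` is computed on the validation split, never on the training data the model just fit.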
In the regularized cost function J(w, b), what happens to the model when λ is set to a very large value?
Lambda is a balance dial between fitting the data and keeping the model simple. Too small: overfitting. Too large: underfitting. The right value is found by evaluating on a validation set, not the training set.
