How Do We Fix Overfitting?
The previous module introduced overfitting: the model performs well on training data but poorly on new examples. There are three practical approaches to address it.
| Approach | What it does | When to use |
|---|---|---|
| More training data | Gives the model more examples to learn from | When data collection is feasible |
| Feature selection | Remove features that are unlikely to matter | When you have too many, weakly related features |
| Regularization | Penalise large weights in the cost function | Almost always — the most general technique |
Option 1 — Collect More Training Data
More data is the most reliable fix. With more examples, the model can no longer memorise every quirk of the training set; the patterns it must capture become too numerous and diverse to fit with noise.
The downside is practical: data collection is expensive, slow, or sometimes impossible. For example, in a medical study you may only ever have 200 patient records.
More data reduces variance without any changes to the model. If data is available, get it first before changing anything else.
Option 2 — Feature Selection
If you trained on 100 features and only 10 are genuinely informative, the remaining 90 give the model 90 extra dimensions to fit noise into. Dropping those features reduces the model's capacity to overfit.
The cost is information loss. When you remove a feature, you are also removing any real signal it carried, even if that signal is small. For example, removing a weakly correlated feature like resting heart rate from a diabetes model might seem safe, but heart rate does carry some true predictive signal.
Feature selection forces a hard binary decision — keep or discard. A feature that is removed cannot contribute to the prediction at all. Regularization, covered next, offers a softer alternative.
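In code, that hard binary decision is literally a keep-or-discard on columns. A minimal sketch with NumPy, where the data and the kept indices are made up for illustration:

```python
import numpy as np

# Hypothetical dataset: 100 samples, 50 features (random for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))

# Indices of the features we believe are informative (an assumption here).
keep = [0, 3, 7, 12, 21]
X_selected = X[:, keep]  # hard decision: every other column is discarded

print(X.shape, X_selected.shape)  # (100, 50) (100, 5)
```

The discarded 45 columns can no longer contribute anything to a model trained on `X_selected`, which is exactly the information loss described above.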
You are training a logistic regression model with 50 features. After inspection, you remove 30 features you believe are uninformative. What is the main risk of this approach?
Option 3 — Regularization
Regularization keeps all features but penalises the model for assigning large weights to any of them. Instead of discarding a feature entirely, it shrinks the feature's weight toward zero, nearly cancelling its effect. The model retains every feature's small contribution but is prevented from leaning heavily on any one of them.
Consider what happens when a logistic regression model overfits using a high-degree polynomial feature. The weight on that feature becomes very large, allowing the decision boundary to swing wildly to fit training noise. Regularization adds a penalty to the cost function that grows with the size of each weight — if a weight is large, the penalty makes it expensive, and gradient descent is pushed to reduce it.
| Method | Effect on features | Throws away information? |
|---|---|---|
| Feature selection | Hard remove — weight becomes 0 | Yes |
| Regularization | Soft shrink — weight approaches 0 but stays | No |
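The "soft shrink" row can be seen numerically. If the penalty has the common squared-weight form (λ/2m)·Σw² (an assumption for this sketch), its gradient pulls each weight toward zero by (λ/m)·w per gradient-descent step. The sketch below applies only that pull, with illustrative settings:

```python
import numpy as np

w = np.array([0.1, 5.0])      # one small weight, one large weight
alpha, lam, m = 0.1, 1.0, 10  # learning rate and penalty settings (illustrative)

for _ in range(100):
    # Penalty-only update: equivalent to w *= (1 - alpha * lam / m).
    w = w - alpha * (lam / m) * w

print(w)  # both weights shrink toward zero, but neither becomes exactly zero
```

Note the multiplicative shrinkage: the large weight loses far more in absolute terms on every step, yet no weight is ever removed outright, which is the key contrast with feature selection.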
How Regularization Works
Choosing which specific weights to penalise manually is impractical — you rarely know in advance which features are causing the overfitting. Instead, regularization penalises all weights and lets the model decide. Features whose weights are small are barely affected by the penalty. Features whose weights would have been large — the ones driving overfitting — are penalised the most.
The regularization penalty is proportional to the square of each weight, summed across all n features:

$$\frac{\lambda}{2m} \sum_{j=1}^{n} w_j^2$$
This term is added to the cost function. λ (lambda) controls how strongly you penalise large weights. A larger λ shrinks weights more aggressively. A smaller λ barely changes the original cost.
Squaring the weights means large weights are penalised disproportionately. A weight of 4 contributes 16 to the sum — four times more than a weight of 2 contributing 4. The penalty targets the biggest offenders first.
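The arithmetic above is easy to check directly. A minimal sketch, assuming the λ/(2m) scaling of the penalty:

```python
import numpy as np

def l2_penalty(w, lam, m):
    """Squared-weight penalty: (lam / (2 * m)) * sum of w_j^2."""
    return (lam / (2 * m)) * np.sum(w ** 2)

w = np.array([2.0, 4.0])
# Squaring penalises the larger weight disproportionately:
# the weight of 4 contributes 16, four times the 4 contributed by the weight of 2.
print(w ** 2)
print(l2_penalty(w, lam=1.0, m=10))  # (1 / 20) * (4 + 16) = 1.0
```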
The Regularized Cost Function
The full regularized cost function adds the penalty term to the original binary cross-entropy cost:

$$J(\vec{w}, b) = -\frac{1}{m}\sum_{i=1}^{m}\Big[y^{(i)} \log f_{\vec{w},b}(\vec{x}^{(i)}) + (1 - y^{(i)}) \log\big(1 - f_{\vec{w},b}(\vec{x}^{(i)})\big)\Big] + \frac{\lambda}{2m}\sum_{j=1}^{n} w_j^2$$
The first term is the original cross-entropy cost. The second term is the regularization penalty. Together they balance two competing objectives: fit the training data well AND keep the weights small.
We generally do not regularize the bias term b. The bias is a single scalar offset and has very little capacity to cause overfitting on its own. If you do want to regularize b as well, a separate term is added:

$$\frac{\lambda}{2m} b^2$$
In practice, the contribution of b to overfitting is negligible compared to the n weight parameters, so the bias regularization term is almost always omitted.
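The regularized cost described above can be sketched in NumPy. This is a sketch for a logistic regression model, assuming the λ/(2m) scaling and leaving the bias unpenalised; the data and parameter values are made up for illustration:

```python
import numpy as np

def sigmoid(z):
    """Logistic function mapping any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(w, b, X, y, lam):
    """Binary cross-entropy plus the squared-weight penalty (bias b not penalised)."""
    m = X.shape[0]
    f = sigmoid(X @ w + b)                      # model predictions
    cross_entropy = -np.mean(y * np.log(f) + (1 - y) * np.log(1 - f))
    penalty = (lam / (2 * m)) * np.sum(w ** 2)  # only the n weights, not b
    return cross_entropy + penalty

# With lam = 0 the function reduces to the original cross-entropy cost;
# for the same nonzero weights, a larger lam gives a strictly larger cost.
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 3.0]])
y = np.array([0.0, 0.0, 1.0])
w, b = np.array([0.5, -0.5]), 0.1
print(regularized_cost(w, b, X, y, 0.0) < regularized_cost(w, b, X, y, 1.0))  # True
```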
The Lambda Hyperparameter
Lambda (λ) is the regularization hyperparameter. It must be greater than or equal to zero and controls the strength of the penalty.
| λ value | Effect |
|---|---|
| λ = 0 | No regularization — original cost function |
| λ very small (0.001–0.01) | Light penalty — most weights barely change |
| λ moderate (0.1–1) | Meaningful shrinkage — reduces overfitting noticeably |
| λ very large (100+) | All weights forced near zero — model underfits |
Choosing λ is a tuning problem. You try several values and evaluate performance on a held-out validation set, the same way you tune the learning rate. Typical starting values are 0.01, 0.1, 1, and 10.
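A tuning loop over that starting grid might look like the following sketch. It assumes scikit-learn, whose `LogisticRegression` parameterises regularization strength inversely as C = 1/λ, and uses synthetic data for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: the label depends mostly on feature 0, plus noise.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 20))
y = (X[:, 0] + 0.5 * rng.normal(size=400) > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(X, y, random_state=0)

best_lam, best_acc = None, -1.0
for lam in [0.01, 0.1, 1, 10]:                       # typical starting values
    model = LogisticRegression(C=1.0 / lam, max_iter=1000).fit(X_tr, y_tr)
    acc = model.score(X_val, y_val)                  # held-out validation set only
    if acc > best_acc:
        best_lam, best_acc = lam, acc

print(best_lam, round(best_acc, 3))
```

The key point is that `model.score` is computed on the validation split, never on the training data the model just fit.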
In the regularized cost function J(w, b), what happens to the model when λ is set to a very large value?
Lambda is a balance dial between fitting the data and keeping the model simple. Too small: overfitting. Too large: underfitting. The right value is found by evaluating on a validation set, not the training set.
