Deep Learning · Intermediate

Regularization for Logistic Regression


How Do We Fix Overfitting?

The previous module introduced overfitting: the model performs well on training data but poorly on new examples. There are three practical approaches to address it.

| Approach | What it does | When to use |
| --- | --- | --- |
| More training data | Gives the model more examples to learn from | When data collection is feasible |
| Feature selection | Remove features that are unlikely to matter | When you have too many, weakly related features |
| Regularization | Penalise large weights in the cost function | Almost always; the most general technique |

Option 1 — Collect More Training Data

More data is the most reliable fix. With more examples, the model cannot memorise every quirk of the training set, because the patterns it must capture become too numerous and diverse to store in its parameters.

The downside is practical: data collection is expensive, slow, or sometimes impossible. For example, in a medical study you may only ever have 200 patient records.

More data reduces variance without any changes to the model. If data is available, get it first before changing anything else.

Option 2 — Feature Selection

If you trained on 100 features and only 10 are genuinely informative, the remaining 90 give the model 90 extra dimensions to fit noise into. Dropping those features reduces the model's capacity to overfit.

The cost is information loss. When you remove a feature, you are also removing any real signal it carried, even if that signal is small. For example, removing a weakly correlated feature like resting heart rate from a diabetes model might seem safe, but heart rate does carry some true predictive signal.

Feature selection forces a hard binary decision — keep or discard. A feature that is removed cannot contribute to the prediction at all. Regularization, covered next, offers a softer alternative.
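That hard keep-or-discard decision is easy to see in code. A minimal sketch with a made-up design matrix (the column indices in `keep` are hypothetical, standing in for whatever features you judged informative):

```python
import numpy as np

# Hypothetical design matrix: 6 samples, 5 features.
X = np.arange(30, dtype=float).reshape(6, 5)

# Suppose inspection suggests only features 0, 2, and 4 are informative.
keep = [0, 2, 4]
X_selected = X[:, keep]

# The discarded columns can no longer contribute anything to a model
# trained on X_selected -- their signal, however small, is gone.
print(X_selected.shape)  # (6, 3)
```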

Quick Check

You are training a logistic regression model with 50 features. After inspection, you remove 30 features you believe are uninformative. What is the main risk of this approach?

Option 3 — Regularization

Regularization keeps all features but penalises the model for assigning large weights to any of them. Instead of discarding a feature entirely, it shrinks the feature's weight toward zero, damping its influence. The model retains every feature's small contribution but is prevented from leaning heavily on any one of them.

Consider what happens when a logistic regression model overfits using a high-degree polynomial feature. The weight on that feature becomes very large, allowing the decision boundary to swing wildly to fit training noise. Regularization adds a penalty to the cost function that grows with the size of each weight — if a weight is large, the penalty makes it expensive, and gradient descent is pushed to reduce it.

| Method | Effect on features | Throws away information? |
| --- | --- | --- |
| Feature selection | Hard remove: weight becomes 0 | Yes |
| Regularization | Soft shrink: weight approaches 0 but stays | No |

How Regularization Works

Choosing which specific weights to penalise manually is impractical — you rarely know in advance which features are causing the overfitting. Instead, regularization penalises all weights and lets the model decide. Features whose weights are small are barely affected by the penalty. Features whose weights would have been large — the ones driving overfitting — are penalised the most.

The regularization penalty is proportional to the square of each weight, summed across all n features:

Regularization term = (λ / 2m) · Σⱼ₌₁ⁿ wⱼ²

This term is added to the cost function. λ (lambda) controls how strongly you penalise large weights. A larger λ shrinks weights more aggressively. A smaller λ barely changes the original cost.

Squaring the weights means large weights are penalised disproportionately. A weight of 4 contributes 16 to the sum — four times more than a weight of 2 contributing 4. The penalty targets the biggest offenders first.
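The disproportionate effect of squaring is easy to verify numerically. A small sketch with made-up weights, λ, and training-set size m:

```python
# Hypothetical weight vector and settings (illustrative values only).
weights = [2.0, 4.0, 0.5]
lam, m = 1.0, 10  # example lambda and training-set size

# Each weight's contribution to the penalty is its square.
contributions = [w ** 2 for w in weights]
print(contributions)  # [4.0, 16.0, 0.25] -- the weight of 4 dominates

# Full regularization term: (lambda / 2m) * sum of squared weights.
penalty = (lam / (2 * m)) * sum(contributions)
print(penalty)  # 1.0125
```

The weight of 4 accounts for 16 of the 20.25 total, so gradient descent feels the strongest pressure to shrink exactly that weight.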

The Regularized Cost Function

The full regularized cost function adds the penalty term to the original binary cross-entropy cost:

J(w, b) = −(1/m) · Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾)·log(1−ŷ⁽ⁱ⁾)] + (λ/2m) · Σⱼ₌₁ⁿ wⱼ²

The first term is the original cross-entropy cost. The second term is the regularization penalty. Together they balance two competing objectives: fit the training data well AND keep the weights small.

We generally do not regularize the bias term b. The bias is a single scalar offset and has very little capacity to cause overfitting on its own. If you want to regularize b as well, the term is added separately:

J(w, b) = −(1/m) · Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾)·log(1−ŷ⁽ⁱ⁾)] + (λ/2m) · Σⱼ₌₁ⁿ wⱼ² + (λ/2m) · b²

In practice, the contribution of b to overfitting is negligible compared to the n weight parameters, so the bias regularization term is almost always omitted.
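One way the regularized cost could be written in NumPy (a sketch, not a reference implementation; the toy data below is invented purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def regularized_cost(X, y, w, b, lam):
    """Binary cross-entropy plus the L2 penalty (bias b is not regularized)."""
    m = X.shape[0]
    y_hat = sigmoid(X @ w + b)
    cross_entropy = -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    penalty = (lam / (2 * m)) * np.sum(w ** 2)
    return cross_entropy + penalty

# Toy example with made-up numbers.
X = np.array([[0.5, 1.2], [1.0, -0.7], [-0.3, 0.8]])
y = np.array([1.0, 0.0, 1.0])
w = np.array([0.4, -0.2])
b = 0.1

print(regularized_cost(X, y, w, b, lam=1.0))  # penalty raises the cost
print(regularized_cost(X, y, w, b, lam=0.0))  # lam=0 recovers the original cost
```

Setting `lam=0` reduces the function to plain cross-entropy, matching the λ = 0 row in the table below.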

The Lambda Hyperparameter

Lambda (λ) is the regularization hyperparameter. It is non-negative and controls the strength of the penalty.

| λ value | Effect |
| --- | --- |
| λ = 0 | No regularization; original cost function |
| λ very small (0.001–0.01) | Light penalty; most weights barely change |
| λ moderate (0.1–1) | Meaningful shrinkage; reduces overfitting noticeably |
| λ very large (100+) | All weights forced near zero; model underfits |

Choosing λ is a tuning problem. You try several values and evaluate performance on a held-out validation set, the same way you tune the learning rate. Typical starting values are 0.01, 0.1, 1, and 10.
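The tuning loop can be sketched end to end. Everything below is synthetic: the data is randomly generated, and `fit` is a hypothetical helper running plain gradient descent on the regularized cost, not a library function.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit(X, y, lam, lr=0.1, steps=500):
    """Gradient descent on the regularized cost (bias not regularized)."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        err = sigmoid(X @ w + b) - y
        grad_w = X.T @ err / m + (lam / m) * w  # penalty adds (lam/m)*w
        b -= lr * err.mean()
        w -= lr * grad_w
    return w, b

def cross_entropy(X, y, w, b):
    y_hat = sigmoid(X @ w + b)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Synthetic train/validation split with made-up "true" weights.
rng = np.random.default_rng(0)
X_train, X_val = rng.normal(size=(80, 3)), rng.normal(size=(40, 3))
true_w = np.array([2.0, -1.0, 0.0])
y_train = (X_train @ true_w + rng.normal(scale=0.5, size=80) > 0).astype(float)
y_val = (X_val @ true_w + rng.normal(scale=0.5, size=40) > 0).astype(float)

# Sweep candidate lambdas; pick the one with the lowest validation cost.
for lam in [0.01, 0.1, 1, 10]:
    w, b = fit(X_train, y_train, lam)
    print(lam, round(cross_entropy(X_val, y_val, w, b), 4))
```

Note that the validation metric is the unregularized cross-entropy: the penalty is a training device, not part of how you judge the model.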

Quick Check

In the regularized cost function J(w, b), what happens to the model when λ is set to a very large value?

Lambda is a balance dial between fitting the data and keeping the model simple. Too small: overfitting. Too large: underfitting. The right value is found by evaluating on a validation set, not the training set.

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Regularization for Logistic Regression and see your score on the leaderboard.

Take the Quiz

Up next

Next, we build the complete logistic regression model — forward pass, cost, gradients, and prediction — and put everything together in working Python.

Regularized Logistic Regression