Recalling the Full Equation Chain
Before adding regularization, it helps to have every equation in one place. Logistic regression passes each training example through four steps.
Step 1 — Linear score
In vectorized notation, with W as the (n × 1) weight vector and x⁽ⁱ⁾ as the (n × 1) feature vector:

z⁽ⁱ⁾ = Wᵀ·x⁽ⁱ⁾ + b
Step 2 — Sigmoid activation
ŷ⁽ⁱ⁾ = σ(z⁽ⁱ⁾) = 1 / (1 + e^(−z⁽ⁱ⁾))

The sigmoid maps any real-valued score to a probability in (0, 1). Values of ŷ⁽ⁱ⁾ at or above 0.5 predict class 1; values below 0.5 predict class 0.
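A minimal sketch of this step in NumPy (the `sigmoid` helper and the example scores are illustrative; the 0.5 threshold follows the convention above):

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)                # probabilities in (0, 1)
preds = (probs >= 0.5).astype(int)     # threshold at 0.5 -> class labels
```

Note that σ(0) = 0.5 exactly, so a score of zero sits right on the decision boundary.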
Step 3 — Binary cross-entropy loss per example
L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = −[y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)]

When y⁽ⁱ⁾ = 1 the loss reduces to −log(ŷ⁽ⁱ⁾) — it penalises low predicted probability. When y⁽ⁱ⁾ = 0 it reduces to −log(1 − ŷ⁽ⁱ⁾) — it penalises high predicted probability.
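A quick numeric check of the two cases (the `bce_loss` helper is illustrative, not part of the course code):

```python
import numpy as np

def bce_loss(y, y_hat):
    # Binary cross-entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

confident_right = bce_loss(1, 0.99)   # -log(0.99): tiny loss
confident_wrong = bce_loss(1, 0.01)   # -log(0.01): large loss
```

The two branches are mirror images: predicting 0.3 when y = 0 costs exactly as much as predicting 0.7 when y = 1.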
Step 4 — Cost function (average loss over all m examples)

J(W, b) = (1/m)·Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

Expanding the loss:

J(W, b) = −(1/m)·Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)]
This is the unregularized cost. It measures how well the model fits the training data, but it places no constraint on how large the weights are allowed to grow.
Adding the Regularization Term — Step by Step
Regularization modifies J by appending a penalty that grows with the size of the weights. We build it in three steps.
Step 1 — Choose what to penalise
We want to discourage large weights. The natural choice is the sum of squared weights across all n features:

Σⱼ₌₁ⁿ wⱼ²
Squaring ensures the penalty is always non-negative and penalises large weights disproportionately — a weight of 4 contributes 16, while a weight of 2 contributes only 4.
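A two-line check of that arithmetic (the weight values are made up):

```python
import numpy as np

w = np.array([4.0, 2.0])
penalty = np.sum(np.square(w))   # 16 + 4 = 20: the weight of 4 dominates
```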
Step 2 — Scale by λ/2m
We divide by 2m so the penalty is on the same scale as the cost and so the derivative simplifies cleanly (the 2 in the denominator cancels with the 2 from differentiating wⱼ²). λ (lambda) controls the strength of the penalty:

(λ/2m)·Σⱼ wⱼ²

With λ = 0 the penalty vanishes and the cost is unregularized; as λ grows, large weights become increasingly expensive and the optimum is pushed toward smaller ones.
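A small sketch of how λ and m scale the penalty (the `penalty_term` helper and all numbers are illustrative):

```python
import numpy as np

def penalty_term(W, lambd, m):
    # (lambda / 2m) * sum of squared weights
    return (lambd / (2 * m)) * np.sum(np.square(W))

W = np.array([3.0, -1.0])                  # sum of squares = 10
weak = penalty_term(W, lambd=0.1, m=10)    # 0.1/20 * 10 = 0.05
strong = penalty_term(W, lambd=10.0, m=10) # 10/20 * 10 = 5.0
```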
Step 3 — Add to the original cost
J_reg(W, b) = −(1/m)·Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)] + (λ/2m)·Σⱼ wⱼ²

This is the regularized cost function. The first term rewards good predictions. The second term penalises large weights. Together they force gradient descent to find a solution that both fits the data and keeps the weights small.
The 1/2m scaling is a convention — it keeps the penalty proportional to the training set size and makes the derivative tidy. The factor of 2 in 2m cancels with the 2 that appears when you differentiate wⱼ², leaving a clean (λ/m)·wⱼ term.
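The cancellation can be verified numerically with a central difference (the values of λ, m, and wⱼ are arbitrary):

```python
lambd, m, w = 0.7, 5, 3.0

def penalty(w):
    # the per-weight penalty term (lambda / 2m) * w^2
    return (lambd / (2 * m)) * w ** 2

eps = 1e-6
numeric = (penalty(w + eps) - penalty(w - eps)) / (2 * eps)  # central difference
analytic = (lambd / m) * w                                    # after the 2s cancel
```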
Gradient of the Regularized Cost
To run gradient descent on J_reg we need its partial derivatives with respect to each wⱼ and with respect to b.
Gradient with respect to wⱼ
The regularized cost is the sum of two terms. We differentiate each separately.

The first term was derived in the gradient descent module:

∂J/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾

The second term is the derivative of (λ/2m)·wⱼ² with respect to wⱼ — the 2 cancels:

∂/∂wⱼ [(λ/2m)·wⱼ²] = (λ/m)·wⱼ

Combining:

∂J_reg/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
The gradient is the original prediction-error term plus the new regularization term (λ/m)·wⱼ.
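A finite-difference check on synthetic data can confirm the combined formula (all names, the random data, and λ are illustrative; `cost` implements J_reg from above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(W, b, X, Y, lambd):
    # regularized cost: cross-entropy + (lambda / 2m) * sum of squared weights
    m = X.shape[0]
    A = sigmoid(X @ W + b)
    ce = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return ce + (lambd / (2 * m)) * np.sum(W ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = rng.integers(0, 2, size=8).astype(float)
W = rng.normal(size=3)
b, lambd, m = 0.1, 0.5, X.shape[0]

A = sigmoid(X @ W + b)
dW = (1/m) * X.T @ (A - Y) + (lambd/m) * W   # analytic regularized gradient

# perturb w_0 and compare against a central difference
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0] += eps
Wm[0] -= eps
numeric = (cost(Wp, b, X, Y, lambd) - cost(Wm, b, X, Y, lambd)) / (2 * eps)
```

The numeric estimate should agree with `dW[0]` to several decimal places.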
Gradient with respect to b
The regularization term does not contain b, so its derivative with respect to b is zero. The gradient of b is unchanged:

∂J_reg/∂b = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
The Update Rule — Weight Decay
Substitute the regularized gradient into the standard gradient descent update rule for wⱼ:

wⱼ := wⱼ − α·[(1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ]

Distributing α and collecting the wⱼ terms:

wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
The factor (1 − αλ/m) is slightly less than 1 for any positive α, λ, and m. This means every gradient descent step multiplies each weight by a number slightly below 1 before applying the usual gradient correction. The weight is shrunk a little on every single update — this is called weight decay.
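Isolating the decay factor makes the effect visible (the hyperparameters are arbitrary; the prediction-error term is set to zero so only the shrinkage remains):

```python
alpha, lambd, m = 0.1, 1.0, 100
decay = 1 - alpha * lambd / m   # 0.999: slightly below 1

w = 1.0
for _ in range(1000):
    w *= decay   # pure decay: no gradient pushing back

# after 1000 steps the weight has shrunk to roughly a third of its value
```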
The bias update is identical to the unregularized case:

b := b − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
| Update | Unregularized | Regularized |
|---|---|---|
| wⱼ | wⱼ − α·(∂J/∂wⱼ) | wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ |
| b | b − α·(∂J/∂b) | b − (α/m)·Σᵢ error (unchanged) |
Weight decay is a useful mental model: before each gradient step, every weight is multiplied by (1 − αλ/m). If the gradient does not push back, the weight slowly decays toward zero over many iterations. Only weights that carry genuine predictive signal survive — gradient descent keeps rebuilding them.
The regularized weight update is wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ(ŷ⁽ⁱ⁾−y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾. What does the factor (1 − αλ/m) do on every update step?
How Regularization Controls the Model
To understand why adding a penalty to the cost function reduces overfitting, trace what happens during gradient descent.
Without regularization, gradient descent minimises only the cross-entropy loss — it will push weights as large as needed to fit every training example, including noise. With regularization, every gradient update must also reduce the penalty term. A weight wⱼ that is large in magnitude contributes a large (λ/m)·wⱼ to the gradient — positive if wⱼ is positive, negative if wⱼ is negative — which in both cases pushes that weight toward zero. A weight that is already small contributes very little, so it is barely affected.
The result is that the model is free to keep features that genuinely help prediction (their weights remain non-zero) but cannot rely heavily on any single feature (no weight grows very large). This is the soft-shrinkage effect: features are kept, but their influence is constrained.
| Scenario | Without regularization | With regularization |
|---|---|---|
| High-degree polynomial feature | Weight grows large, decision boundary contorts | Weight stays small, boundary stays smooth |
| Irrelevant noisy feature | Weight picks up training noise | Weight shrinks toward zero |
| Genuinely informative feature | Weight set to correct value | Weight slightly reduced but remains meaningful |
In the regularized cost function, why does adding (λ/2m)·Σwⱼ² reduce overfitting?
Vectorized Form
In practice, the update is applied to all n weights simultaneously using matrix operations. With X of shape (m, n), A of shape (1, m) for predicted probabilities, and Y of shape (1, m) for true labels:

dW = (1/m)·Xᵀ·(A − Y)ᵀ + (λ/m)·W
db = (1/m)·Σ(A − Y)

The dW expression is identical to the unregularized gradient plus the weight decay correction (λ/m)·W. The db expression is unchanged. Written as a single weight update expression:

W := W·(1 − αλ/m) − (α/m)·Xᵀ·(A − Y)ᵀ
Code
```python
import numpy as np

def compute_cost_regularized(A, Y, W, lambd, m):
    """
    A: (1, m) — predicted probabilities
    Y: (1, m) — true labels
    W: (n, 1) — weights
    lambd: float — regularization strength (lambda)
    m: int — number of training examples
    """
    cross_entropy = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    regularization = (lambd / (2 * m)) * np.sum(np.square(W))
    return cross_entropy + regularization

def compute_gradients_regularized(X, Y, A, W, lambd):
    """
    X: (m, n) — feature matrix
    Y: (1, m) — true labels
    A: (1, m) — predicted probabilities
    W: (n, 1) — weights
    Returns dW: (n, 1), db: scalar
    """
    m = X.shape[0]
    dZ = A - Y                                       # (1, m) — prediction error
    dW = (1/m) * np.dot(X.T, dZ.T) + (lambd/m) * W   # (n, 1) — regularized gradient
    db = (1/m) * np.sum(dZ)                          # scalar — unchanged
    return dW, db

def update_parameters(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW   # weight decay is already inside dW
    b = b - learning_rate * db
    return W, b
```

The regularization is entirely inside dW. The update_parameters function does not change — it subtracts the gradient as before, but the gradient now carries the weight decay term automatically.
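A possible end-to-end usage sketch on synthetic, linearly separable data (the loop inlines the same formulas as the functions above; the dimensions, seed, and hyperparameters are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
m, n = 200, 2
X = rng.normal(size=(m, n))                          # (m, n) features
true_W = np.array([[2.0], [-3.0]])
Y = (sigmoid((X @ true_W).T) > 0.5).astype(float)    # (1, m) labels from a known boundary

W, b = np.zeros((n, 1)), 0.0
alpha, lambd = 0.5, 0.1
for _ in range(500):
    A = sigmoid(np.dot(W.T, X.T) + b)                # (1, m) predictions
    dZ = A - Y                                       # (1, m) error
    dW = (1/m) * np.dot(X.T, dZ.T) + (lambd/m) * W   # regularized gradient
    db = (1/m) * np.sum(dZ)
    W, b = W - alpha * dW, b - alpha * db

accuracy = np.mean((sigmoid(np.dot(W.T, X.T) + b) > 0.5) == Y)
```

With regularization on, the learned W stays bounded but still points in the direction of the true boundary, so accuracy remains high.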
