Recalling the Full Equation Chain
Before adding regularization, it helps to have every equation in one place. Logistic regression passes each training example through four steps.
Step 1 — Linear score
In vectorized notation, with W as the (n × 1) weight vector and x⁽ⁱ⁾ as the (n × 1) feature vector:

z⁽ⁱ⁾ = Wᵀ·x⁽ⁱ⁾ + b
Step 2 — Sigmoid activation
ŷ⁽ⁱ⁾ = σ(z⁽ⁱ⁾) = 1 / (1 + e^(−z⁽ⁱ⁾))

The sigmoid maps any real-valued score to a probability in (0, 1). Values of ŷ⁽ⁱ⁾ at or above 0.5 predict class 1; values below 0.5 predict class 0.
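A minimal sketch of this step in NumPy (the `sigmoid` helper and the example scores are illustrative; the 0.5 threshold follows the convention above):

```python
import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, 0.0, 2.0])
probs = sigmoid(scores)                # probabilities in (0, 1)
preds = (probs >= 0.5).astype(int)     # threshold at 0.5 -> class labels
```

Note that σ(0) = 0.5 exactly, so a score of zero sits right on the decision boundary.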
Step 3 — Binary cross-entropy loss per example
L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = −[y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)]

When y⁽ⁱ⁾ = 1 the loss reduces to −log(ŷ⁽ⁱ⁾) — it penalises low predicted probability. When y⁽ⁱ⁾ = 0 it reduces to −log(1 − ŷ⁽ⁱ⁾) — it penalises high predicted probability.
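A quick numeric check of the two cases (the `bce_loss` helper is illustrative, not part of the course code):

```python
import numpy as np

def bce_loss(y, y_hat):
    # Binary cross-entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

confident_right = bce_loss(1, 0.99)   # -log(0.99): tiny loss
confident_wrong = bce_loss(1, 0.01)   # -log(0.01): large loss
```

The two branches are mirror images: predicting 0.3 when y = 0 costs exactly as much as predicting 0.7 when y = 1.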
Step 4 — Cost function (average loss over all m examples)

J(W, b) = (1/m)·Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)

Expanding the loss:

J(W, b) = −(1/m)·Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)]
This is the unregularized cost. It measures how well the model fits the training data, but it places no constraint on how large the weights are allowed to grow.
Adding the Regularization Term — Step by Step
Regularization modifies J by appending a penalty that grows with the size of the weights. We build it in three steps.
Step 1 — Choose what to penalise
We want to discourage large weights. The natural choice is the sum of squared weights across all n features:

Σⱼ₌₁ⁿ wⱼ²
Squaring ensures the penalty is always non-negative and penalises large weights disproportionately — a weight of 4 contributes 16, while a weight of 2 contributes only 4.
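A two-line check of that arithmetic (the weight values are made up):

```python
import numpy as np

w = np.array([4.0, 2.0])
penalty = np.sum(np.square(w))   # 16 + 4 = 20: the weight of 4 dominates
```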
Step 2 — Scale by λ/2m
We divide by 2m so the penalty is on the same scale as the cost and so the derivative simplifies cleanly (the 2 in the denominator cancels with the 2 from differentiating wⱼ²). λ (lambda) controls the strength of the penalty:

(λ/2m)·Σⱼ wⱼ²

With λ = 0 the penalty vanishes and the cost is unregularized; as λ grows, large weights become increasingly expensive and the optimum is pushed toward smaller ones.
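A small sketch of how λ and m scale the penalty (the `penalty_term` helper and all numbers are illustrative):

```python
import numpy as np

def penalty_term(W, lambd, m):
    # (lambda / 2m) * sum of squared weights
    return (lambd / (2 * m)) * np.sum(np.square(W))

W = np.array([3.0, -1.0])                  # sum of squares = 10
weak = penalty_term(W, lambd=0.1, m=10)    # 0.1/20 * 10 = 0.05
strong = penalty_term(W, lambd=10.0, m=10) # 10/20 * 10 = 5.0
```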
Step 3 — Add to the original cost
J_reg(W, b) = −(1/m)·Σᵢ [y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾)·log(1 − ŷ⁽ⁱ⁾)] + (λ/2m)·Σⱼ wⱼ²

This is the regularized cost function. The first term rewards good predictions. The second term penalises large weights. Together they force gradient descent to find a solution that both fits the data and keeps the weights small.
The 1/2m scaling is a convention — it keeps the penalty proportional to the training set size and makes the derivative tidy. The factor of 2 in 2m cancels with the 2 that appears when you differentiate wⱼ², leaving a clean (λ/m)·wⱼ term.
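The cancellation can be verified numerically with a central difference (the values of λ, m, and wⱼ are arbitrary):

```python
lambd, m, w = 0.7, 5, 3.0

def penalty(w):
    # the per-weight penalty term (lambda / 2m) * w^2
    return (lambd / (2 * m)) * w ** 2

eps = 1e-6
numeric = (penalty(w + eps) - penalty(w - eps)) / (2 * eps)  # central difference
analytic = (lambd / m) * w                                    # after the 2s cancel
```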
Gradient of the Regularized Cost
To run gradient descent on J_reg we need its partial derivatives with respect to each wⱼ and with respect to b.
Gradient with respect to wⱼ
The regularized cost is the sum of two terms. We differentiate each separately.

The first term was derived in the gradient descent module:

∂J/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾

The second term is the derivative of (λ/2m)·wⱼ² with respect to wⱼ — the 2 cancels:

∂/∂wⱼ [(λ/2m)·wⱼ²] = (λ/m)·wⱼ

Combining:

∂J_reg/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
The gradient is the original prediction-error term plus the new regularization term (λ/m)·wⱼ.
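A finite-difference check on synthetic data can confirm the combined formula (all names, the random data, and λ are illustrative; `cost` implements J_reg from above):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(W, b, X, Y, lambd):
    # regularized cost: cross-entropy + (lambda / 2m) * sum of squared weights
    m = X.shape[0]
    A = sigmoid(X @ W + b)
    ce = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return ce + (lambd / (2 * m)) * np.sum(W ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))
Y = rng.integers(0, 2, size=8).astype(float)
W = rng.normal(size=3)
b, lambd, m = 0.1, 0.5, X.shape[0]

A = sigmoid(X @ W + b)
dW = (1/m) * X.T @ (A - Y) + (lambd/m) * W   # analytic regularized gradient

# perturb w_0 and compare against a central difference
eps = 1e-6
Wp, Wm = W.copy(), W.copy()
Wp[0] += eps
Wm[0] -= eps
numeric = (cost(Wp, b, X, Y, lambd) - cost(Wm, b, X, Y, lambd)) / (2 * eps)
```

The numeric estimate should agree with `dW[0]` to several decimal places.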
Gradient with respect to b
The regularization term does not contain b, so its derivative with respect to b is zero. The gradient of b is unchanged:

∂J_reg/∂b = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
The Update Rule — Weight Decay
Substitute the regularized gradient into the standard gradient descent update rule for wⱼ:

wⱼ := wⱼ − α·[(1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ]

Distributing α and collecting the wⱼ terms:

wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
The factor (1 − αλ/m) is slightly less than 1 for any positive α, λ, and m. This means every gradient descent step multiplies each weight by a number slightly below 1 before applying the usual gradient correction. The weight is shrunk a little on every single update — this is called weight decay.
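Isolating the decay factor makes the effect visible (the hyperparameters are arbitrary; the prediction-error term is set to zero so only the shrinkage remains):

```python
alpha, lambd, m = 0.1, 1.0, 100
decay = 1 - alpha * lambd / m   # 0.999: slightly below 1

w = 1.0
for _ in range(1000):
    w *= decay   # pure decay: no gradient pushing back

# after 1000 steps the weight has shrunk to roughly a third of its value
```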
The bias update is identical to the unregularized case:

b := b − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
| Update | Unregularized | Regularized |
|---|---|---|
| wⱼ | wⱼ − α·(∂J/∂wⱼ) | wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ |
| b | b − α·(∂J/∂b) | b − (α/m)·Σᵢ error (unchanged) |
Weight decay is a useful mental model: before each gradient step, every weight is multiplied by (1 − αλ/m). If the gradient does not push back, the weight slowly decays toward zero over many iterations. Only weights that carry genuine predictive signal survive — gradient descent keeps rebuilding them.
The regularized weight update is wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ(ŷ⁽ⁱ⁾−y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾. What does the factor (1 − αλ/m) do on every update step?
How Regularization Controls the Model
To understand why adding a penalty to the cost function reduces overfitting, trace what happens during gradient descent.
Without regularization, gradient descent minimises only the cross-entropy loss — it will push weights as large as needed to fit every training example, including noise. With regularization, every gradient update must also reduce the penalty term. A weight wⱼ that is large in magnitude contributes a large (λ/m)·wⱼ to the gradient — positive if wⱼ is positive, negative if wⱼ is negative — which in both cases pushes that weight toward zero. A weight that is already small contributes very little, so it is barely affected.
The result is that the model is free to keep features that genuinely help prediction (their weights remain non-zero) but cannot rely heavily on any single feature (no weight grows very large). This is the soft-shrinkage effect: features are kept, but their influence is constrained.
| Scenario | Without regularization | With regularization |
|---|---|---|
| High-degree polynomial feature | Weight grows large, decision boundary contorts | Weight stays small, boundary stays smooth |
| Irrelevant noisy feature | Weight picks up training noise | Weight shrinks toward zero |
| Genuinely informative feature | Weight set to correct value | Weight slightly reduced but remains meaningful |
In the regularized cost function, why does adding (λ/2m)·Σwⱼ² reduce overfitting?
Vectorized Form
In practice, the update is applied to all n weights simultaneously using matrix operations. With X of shape (m, n), A of shape (1, m) for predicted probabilities, and Y of shape (1, m) for true labels:

dW = (1/m)·Xᵀ·(A − Y)ᵀ + (λ/m)·W
db = (1/m)·Σ(A − Y)

The dW expression is identical to the unregularized gradient plus the weight decay correction (λ/m)·W. The db expression is unchanged. Written as a single weight update expression:

W := W·(1 − αλ/m) − (α/m)·Xᵀ·(A − Y)ᵀ
Code
```python
import numpy as np

def compute_cost_regularized(A, Y, W, lambd, m):
    """
    A: (1, m) — predicted probabilities
    Y: (1, m) — true labels
    W: (n, 1) — weights
    lambd: float — regularization strength (lambda)
    m: int — number of training examples
    """
    cross_entropy = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    regularization = (lambd / (2 * m)) * np.sum(np.square(W))
    return cross_entropy + regularization

def compute_gradients_regularized(X, Y, A, W, lambd):
    """
    X: (m, n) — feature matrix
    Y: (1, m) — true labels
    A: (1, m) — predicted probabilities
    W: (n, 1) — weights
    Returns dW: (n, 1), db: scalar
    """
    m = X.shape[0]
    dZ = A - Y                                       # (1, m) — prediction error
    dW = (1/m) * np.dot(X.T, dZ.T) + (lambd/m) * W   # (n, 1) — regularized gradient
    db = (1/m) * np.sum(dZ)                          # scalar — unchanged
    return dW, db

def update_parameters(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW   # weight decay is already inside dW
    b = b - learning_rate * db
    return W, b
```

The regularization is entirely inside dW. The update_parameters function does not change — it subtracts the gradient as before, but the gradient now carries the weight decay term automatically.
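A possible end-to-end usage sketch on synthetic, linearly separable data (the loop inlines the same formulas as the functions above; the dimensions, seed, and hyperparameters are all illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(1)
m, n = 200, 2
X = rng.normal(size=(m, n))                          # (m, n) features
true_W = np.array([[2.0], [-3.0]])
Y = (sigmoid((X @ true_W).T) > 0.5).astype(float)    # (1, m) labels from a known boundary

W, b = np.zeros((n, 1)), 0.0
alpha, lambd = 0.5, 0.1
for _ in range(500):
    A = sigmoid(np.dot(W.T, X.T) + b)                # (1, m) predictions
    dZ = A - Y                                       # (1, m) error
    dW = (1/m) * np.dot(X.T, dZ.T) + (lambd/m) * W   # regularized gradient
    db = (1/m) * np.sum(dZ)
    W, b = W - alpha * dW, b - alpha * db

accuracy = np.mean((sigmoid(np.dot(W.T, X.T) + b) > 0.5) == Y)
```

With regularization on, the learned W stays bounded but still points in the direction of the true boundary, so accuracy remains high.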
