Recalling the Regularized Cost
The previous module introduced the regularized cost function. It adds a penalty to the original mean squared error cost that grows with the size of the weights:

J(w, b) = (1/2m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² + (λ/2m)·Σⱼ wⱼ²

where ŷ⁽ⁱ⁾ = w·x⁽ⁱ⁾ + b is the model's prediction. The first term rewards fitting the training data. The second term penalises large weights. λ controls the balance — larger λ means stronger shrinkage.
What we have not yet done is work out what this change means for gradient descent. To run gradient descent on the regularized cost, we need its partial derivatives with respect to each parameter.
What Changes and What Stays the Same
Adding the regularization term changes the gradient for wⱼ. The gradient for b is completely unchanged — the regularization term does not contain b, so its derivative with respect to b is zero.
| Parameter | Gradient changes? | Why |
|---|---|---|
| wⱼ | Yes — one extra term added | The penalty (λ/2m)·wⱼ² depends on wⱼ |
| b | No — identical to before | The penalty does not contain b |
This is the key insight: everything about the gradient descent update is the same, except wⱼ gets one additional term.
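The claim about b can be checked numerically: a finite-difference estimate of ∂J/∂b comes out the same with λ = 0 and with a large λ, because the penalty term cancels in the difference. A minimal sketch, assuming the standard regularized MSE cost described above (function names and data are illustrative):

```python
import numpy as np

def reg_cost(X, y, w, b, lam):
    # Regularized MSE cost; the penalty term contains only w, never b
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = rng.normal(size=10)
w = rng.normal(size=2)
b, eps = 0.3, 1e-6

def numeric_db(lam):
    # Central finite-difference estimate of ∂J/∂b for a given λ
    return (reg_cost(X, y, w, b + eps, lam) - reg_cost(X, y, w, b - eps, lam)) / (2 * eps)

print(numeric_db(0.0), numeric_db(100.0))  # same value: λ never touches b
```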
Deriving the New Gradient for wⱼ
The regularized cost is the sum of two terms. We differentiate each separately.
Term 1 — the fit term (unchanged from before):

∂/∂wⱼ [(1/2m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²] = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
Term 2 — the regularization penalty (new):

∂/∂wⱼ [(λ/2m)·Σⱼ wⱼ²] = (λ/m)·wⱼ
The 2 in the denominator cancels with the 2 that appears when differentiating wⱼ².
Combined gradient:

∂J/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
The gradient is the original prediction-error term plus the new regularization term (λ/m)·wⱼ.
The gradient of b is unchanged:

∂J/∂b = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
The 1/2m scaling in the cost is a convention: dividing by m keeps the penalty on the same scale as the averaged fit term regardless of training set size, and the extra factor of 2 exists purely so that it cancels when differentiating wⱼ², leaving the clean (λ/m)·wⱼ term in the gradient.
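The combined gradient can be sanity-checked against a central finite difference: perturb one weight, re-evaluate the cost, and compare. A minimal sketch, assuming the regularized MSE cost from this module (function names and data are illustrative):

```python
import numpy as np

def reg_cost(X, y, w, b, lam):
    # J(w, b): mean squared error plus (λ/2m)·Σ wⱼ²
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)

def analytic_dw(X, y, w, b, lam):
    # ∂J/∂wⱼ = (1/m)·Σ err·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + lam / m * w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b, lam, eps = 0.5, 1.7, 1e-6

# Central finite difference with respect to w₀
w_hi, w_lo = w.copy(), w.copy()
w_hi[0] += eps
w_lo[0] -= eps
numeric = (reg_cost(X, y, w_hi, b, lam) - reg_cost(X, y, w_lo, b, lam)) / (2 * eps)
analytic = analytic_dw(X, y, w, b, lam)[0]
print(numeric, analytic)  # the two agree to many decimal places
```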
The Update Rule — Weight Decay
Substitute the regularized gradient into the standard gradient descent update rule for wⱼ:

wⱼ := wⱼ − α·[(1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ]
Distribute α and collect the wⱼ terms:

wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
The factor (1 − αλ/m) is slightly less than 1 whenever α, λ, and m are positive and αλ/m is small, as it is in practice. Every gradient descent step therefore multiplies each weight by a number slightly below 1 before applying the usual gradient correction. The weight is shrunk a little on every single update — this is called weight decay.
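As a quick arithmetic sketch of the decay factor (the values of α, λ, and m here are arbitrary placeholders, not recommendations):

```python
alpha, lam, m = 0.01, 1.0, 100   # placeholder hyperparameters
factor = 1 - alpha * lam / m     # 0.9999: each step shrinks w by 0.01%
w0 = 5.0
w_after = w0 * factor ** 10000   # pure decay, no gradient signal pushing back
print(factor, w_after)           # ≈ 0.9999, ≈ 1.84 (about e⁻¹ of the start)
```

Even a shrink of 0.01% per step compounds: after 10,000 steps a weight receiving no gradient signal has lost roughly 63% of its magnitude.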
| Update | Unregularized | Regularized |
|---|---|---|
| wⱼ | wⱼ − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ | wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ |
| b | b − (α/m)·Σᵢ error | b − (α/m)·Σᵢ error (unchanged) |
The regularized update is wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾. What does the factor (1 − αλ/m) do on every update step?
How Regularization Controls the Model
To understand why adding a penalty reduces overfitting, trace what happens during gradient descent.
Without regularization, gradient descent minimises only the MSE cost — it will push weights as large as needed to fit every training example, including noise. With regularization, every gradient update must also reduce the penalty term. A weight wⱼ that is large contributes a large (λ/m)·wⱼ to the gradient — which pushes that weight toward zero. A weight that is already small contributes very little, so it is barely affected.
The result is that the model can keep features that genuinely help prediction (their weights remain non-zero) but cannot rely heavily on any single feature (no weight grows very large). This is the soft-shrinkage effect: features are kept, but their influence is constrained.
| Scenario | Without regularization | With regularization |
|---|---|---|
| High-degree polynomial feature | Weight grows large, curve contorts to fit noise | Weight stays small, curve stays smooth |
| Irrelevant noisy feature | Weight picks up training noise | Weight shrinks toward zero |
| Genuinely informative feature | Weight set to correct value | Weight slightly reduced but remains meaningful |
Weight decay is a useful mental model: before each gradient step, every weight is multiplied by (1 − αλ/m). If the gradient does not push back, the weight slowly decays toward zero over many iterations. Only weights that carry genuine predictive signal survive — gradient descent keeps rebuilding them.
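The table's claims can be seen numerically in a small experiment: one informative feature, one pure-noise feature, fit with and without the penalty. This is a hedged sketch with synthetic data and illustrative names, not the lesson's own code:

```python
import numpy as np

def grads(X, y, w, b, lam):
    # Regularized gradients: (λ/m)·w added to dw, db unchanged
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + lam / m * w, err.mean()

def fit(X, y, lam, alpha=0.1, iters=3000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        dw, db = grads(X, y, w, b, lam)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))            # column 0 informative, column 1 pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

w_plain, _ = fit(X, y, lam=0.0)
w_reg, _ = fit(X, y, lam=10.0)
print(w_plain, w_reg)  # regularized weights are smaller, but the informative one survives
```

The regularized weight vector has a smaller norm overall, yet the weight on the informative feature remains clearly non-zero: soft shrinkage, not elimination.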
Simultaneous Update
As with unregularized gradient descent, all parameters must be updated simultaneously using gradients computed from the current values of w and b — not the updated ones.
Compute ∂J/∂w₁, ∂J/∂w₂, …, ∂J/∂wₙ, ∂J/∂b ← all using current w and b
Then update: w₁, w₂, …, wₙ, b ← all at once

Using updated values mid-step would mean later weights are computed using a partially-updated model, introducing inconsistency.
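A tiny runnable sketch of the difference, using toy data and illustrative names: the correct version applies both updates from the same (w, b) snapshot, while the incorrect version lets b's gradient see an already-updated w.

```python
import numpy as np

def grads(X, y, w, b):
    err = X @ w + b - y
    return X.T @ err / len(y), err.mean()

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b, alpha = np.array([0.0]), 0.0, 0.1

# Correct: both gradients from the same (w, b) snapshot, applied together
dw, db = grads(X, y, w, b)
w_sim, b_sim = w - alpha * dw, b - alpha * db

# Incorrect: w updated first, then b's gradient computed from the new w
w_seq = w - alpha * dw
_, db_stale = grads(X, y, w_seq, b)   # sees a partially-updated model
b_seq = b - alpha * db_stale

print(b_sim, b_seq)  # 0.4 vs ≈ 0.213: the mid-step update changes the result
```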
Python Implementation
```python
import numpy as np

def compute_cost_regularized(X, y, w, b, lambda_):
    """
    X: (m, n) — feature matrix
    y: (m,) — true labels
    w: (n,) — weights
    b: float — bias
    lambda_: float — regularization strength
    """
    m = X.shape[0]
    predictions = X @ w + b
    fit_cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # b is not regularized
    return fit_cost + reg_cost
```
```python
def compute_gradients_regularized(X, y, w, b, lambda_):
    """
    Returns dw: (n,), db: float
    """
    m = X.shape[0]
    error = X @ w + b - y                             # (m,) — prediction error
    dw = (1 / m) * (X.T @ error) + (lambda_ / m) * w  # extra term vs unregularized
    db = (1 / m) * np.sum(error)                      # unchanged
    return dw, db
```
```python
def gradient_descent_regularized(X, y, w, b, alpha, lambda_, iterations):
    costs = []
    for i in range(iterations):
        dw, db = compute_gradients_regularized(X, y, w, b, lambda_)
        w = w - alpha * dw  # simultaneous update
        b = b - alpha * db  # simultaneous update
        if i % 100 == 0:
            costs.append(compute_cost_regularized(X, y, w, b, lambda_))
    return w, b, costs
```

The only line that differs from the unregularized version is the dw computation — the + (lambda_ / m) * w term at the end. Everything else is identical.
In the regularized cost function, why does adding (λ/2m)·Σwⱼ² reduce overfitting?
