Machine Learning · Beginner

Regularized Linear Regression

Tags: regularization, linear regression, gradient descent, cost function, weight decay, lambda

Recalling the Regularized Cost

The previous module introduced the regularized cost function. It adds a penalty to the original mean squared error cost that grows with the size of the weights:

J(w, b) = (1/2m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾)² + (λ/2m) · Σⱼ₌₁ⁿ wⱼ²

The first term rewards fitting the training data. The second term penalises large weights. λ controls the balance — larger λ means stronger shrinkage.

What we have not yet done is work out what this change means for gradient descent. To run gradient descent on the regularized cost, we need its partial derivatives with respect to each parameter.
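To make the penalty term concrete, here is a quick numeric sketch; the weights, λ, and m below are made up purely for illustration:

```python
import numpy as np

# Hypothetical values for illustration only
w = np.array([3.0, 0.5])                    # two weights
m, lam = 10, 1.0                            # training set size, regularization strength

penalty = (lam / (2 * m)) * np.sum(w ** 2)  # (1/20) * (9 + 0.25)
print(penalty)                              # → 0.4625
```

Note how the large weight (3.0) dominates the penalty: squaring makes big weights far more expensive than small ones.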

What Changes and What Stays the Same

Adding the regularization term changes the gradient for wⱼ. The gradient for b is completely unchanged — the regularization term does not contain b, so its derivative with respect to b is zero.

| Parameter | Gradient changes? | Why |
|---|---|---|
| wⱼ | Yes — one extra term added | The penalty (λ/2m)·wⱼ² depends on wⱼ |
| b | No — identical to before | The penalty does not contain b |

This is the key insight: everything about the gradient descent update is the same, except wⱼ gets one additional term.

Deriving the New Gradient for wⱼ

The regularized cost is the sum of two terms. We differentiate each separately.

Term 1 — the fit term (unchanged from before):

∂/∂wⱼ [(1/2m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾)²] = (1/m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

Term 2 — the regularization penalty (new):

∂/∂wⱼ [(λ/2m) · wⱼ²] = (λ/m) · wⱼ

The 2 in the denominator cancels with the 2 that appears when differentiating wⱼ².

Combined gradient:

∂J/∂wⱼ = (1/m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾ + (λ/m) · wⱼ

The gradient is the original prediction-error term plus the new regularization term (λ/m)·wⱼ.

The gradient of b is unchanged:

∂J/∂b = (1/m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾)

The 1/2m scaling in the cost is a convention: it keeps the penalty proportional to the training set size and makes the derivative tidy, leaving a clean (λ/m)·wⱼ term in the gradient.
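One way to sanity-check the derived gradient is to compare it against a numerical finite-difference approximation of the cost. The tiny dataset below is made up for illustration:

```python
import numpy as np

def cost(w, b, X, y, lam):
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + (lam / (2 * m)) * (w @ w)

# Made-up toy data
X = np.array([[1.0, 2.0], [2.0, 0.5], [3.0, 1.0]])
y = np.array([2.0, 1.5, 3.0])
w, b, lam = np.array([0.5, -0.3]), 0.1, 1.0
m = X.shape[0]

# Analytic gradient: (1/m)·Σ error·xⱼ + (λ/m)·wⱼ
err = X @ w + b - y
dw_analytic = X.T @ err / m + (lam / m) * w

# Numerical gradient via central differences
eps = 1e-6
dw_numeric = np.zeros_like(w)
for j in range(w.size):
    wp, wm = w.copy(), w.copy()
    wp[j] += eps
    wm[j] -= eps
    dw_numeric[j] = (cost(wp, b, X, y, lam) - cost(wm, b, X, y, lam)) / (2 * eps)

assert np.allclose(dw_analytic, dw_numeric, atol=1e-6)
```

If the two disagree, either the derivation or the implementation has a bug; this check is cheap insurance whenever a cost function changes.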

The Update Rule — Weight Decay

Substitute the regularized gradient into the standard gradient descent update rule for wⱼ:

wⱼ := wⱼ − α · [(1/m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾ + (λ/m) · wⱼ]

Distribute α and collect the wⱼ terms:

wⱼ := wⱼ · (1 − αλ/m) − (α/m) · Σᵢ₌₁ᵐ (f_w,b(x⁽ⁱ⁾) − y⁽ⁱ⁾) · xⱼ⁽ⁱ⁾

The factor (1 − αλ/m) is slightly less than 1 for any positive α, λ, and m. Every gradient descent step multiplies each weight by a number slightly below 1 before applying the usual gradient correction. The weight is shrunk a little on every single update — this is called weight decay.

| Update | Unregularized | Regularized |
|---|---|---|
| wⱼ | wⱼ − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ | wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ |
| b | b − (α/m)·Σᵢ error | b − (α/m)·Σᵢ error (unchanged) |
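A single update step makes the comparison concrete. The one-feature numbers below are made up; the regularized result equals the unregularized one minus the extra (αλ/m)·wⱼ shrinkage:

```python
import numpy as np

# Made-up one-feature example
X = np.array([[1.0], [2.0]])
y = np.array([1.0, 2.0])
w, b = np.array([3.0]), 0.0
alpha, lam = 0.1, 1.0
m = X.shape[0]

err = X @ w + b - y        # prediction errors: [2.0, 4.0]
grad_fit = X.T @ err / m   # fit term of the gradient: [5.0]

w_unreg = w - alpha * grad_fit
w_reg = w * (1 - alpha * lam / m) - alpha * grad_fit

print(w_unreg, w_reg)      # → [2.5] [2.35]
```

The difference, 0.15, is exactly (αλ/m)·wⱼ = 0.1·1.0/2 · 3.0: the weight decay contribution for this step.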
Quick Check

The regularized update is wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾. What does the factor (1 − αλ/m) do on every update step?

How Regularization Controls the Model

To understand why adding a penalty reduces overfitting, trace what happens during gradient descent.

Without regularization, gradient descent minimises only the MSE cost — it will push weights as large as needed to fit every training example, including noise. With regularization, every gradient update must also reduce the penalty term. A weight wⱼ that is large contributes a large (λ/m)·wⱼ to the gradient — which pushes that weight toward zero. A weight that is already small contributes very little, so it is barely affected.

The result is that the model can keep features that genuinely help prediction (their weights remain non-zero) but cannot rely heavily on any single feature (no weight grows very large). This is the soft-shrinkage effect: features are kept, but their influence is constrained.

| Scenario | Without regularization | With regularization |
|---|---|---|
| High-degree polynomial feature | Weight grows large, curve contorts to fit noise | Weight stays small, curve stays smooth |
| Irrelevant noisy feature | Weight picks up training noise | Weight shrinks toward zero |
| Genuinely informative feature | Weight set to correct value | Weight slightly reduced but remains meaningful |

Weight decay is a useful mental model: before each gradient step, every weight is multiplied by (1 − αλ/m). If the gradient does not push back, the weight slowly decays toward zero over many iterations. Only weights that carry genuine predictive signal survive — gradient descent keeps rebuilding them.
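The decay-to-zero behaviour can be computed directly. With made-up hyperparameters α = 0.01, λ = 10, m = 100, the per-step factor is 0.999, and a weight that receives no gradient push shrinks to roughly a third of its value after 1000 steps:

```python
alpha, lam, m = 0.01, 10.0, 100   # hypothetical hyperparameters
decay = 1 - alpha * lam / m       # 0.999 per step

w = 1.0
for _ in range(1000):
    w *= decay                    # no fit gradient pushing back
print(round(w, 3))                # → 0.368
```

A 0.1% shrink per step sounds negligible, but compounded over thousands of iterations it is enough to drive unsupported weights toward zero.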

Simultaneous Update

As with unregularized gradient descent, all parameters must be updated simultaneously using gradients computed from the current values of w and b — not the updated ones.

```pseudocode
Compute ∂J/∂w₁, ∂J/∂w₂, …, ∂J/∂wₙ, ∂J/∂b   ← all using current w and b
Then update: w₁, w₂, …, wₙ, b              ← all at once
```

Using updated values mid-step would mean later weights are computed using a partially-updated model, introducing inconsistency.

Python Implementation

```python
import numpy as np


def compute_cost_regularized(X, y, w, b, lambda_):
    """
    X:        (m, n) — feature matrix
    y:        (m,)   — true labels
    w:        (n,)   — weights
    b:        float  — bias
    lambda_:  float  — regularization strength
    """
    m = X.shape[0]
    predictions = X @ w + b
    fit_cost    = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    reg_cost    = (lambda_ / (2 * m)) * np.sum(w ** 2)  # b is not regularized
    return fit_cost + reg_cost


def compute_gradients_regularized(X, y, w, b, lambda_):
    """
    Returns dw: (n,), db: float
    """
    m     = X.shape[0]
    error = X @ w + b - y                                # (m,) — prediction error
    dw    = (1 / m) * (X.T @ error) + (lambda_ / m) * w  # extra term vs unregularized
    db    = (1 / m) * np.sum(error)                      # unchanged
    return dw, db


def gradient_descent_regularized(X, y, w, b, alpha, lambda_, iterations):
    costs = []
    for i in range(iterations):
        dw, db = compute_gradients_regularized(X, y, w, b, lambda_)
        w = w - alpha * dw   # simultaneous update
        b = b - alpha * db   # simultaneous update
        if i % 100 == 0:
            costs.append(compute_cost_regularized(X, y, w, b, lambda_))
    return w, b, costs
```

The only line that differs from the unregularized version is the dw computation — the + (lambda_ / m) * w term at the end. Everything else is identical.
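As a quick end-to-end check (the synthetic data and hyperparameters below are made up for illustration), training the same loop with and without the penalty shows the weight norm shrinking as λ grows:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 60, 4
X = rng.standard_normal((m, n))
# Only the first two features carry signal; the rest are noise
y = X @ np.array([2.0, -1.0, 0.0, 0.0]) + 0.1 * rng.standard_normal(m)

def fit(lambda_, alpha=0.1, iters=2000):
    w, b = np.zeros(n), 0.0
    for _ in range(iters):
        err = X @ w + b - y
        dw = X.T @ err / m + (lambda_ / m) * w   # the one line that changes
        db = err.sum() / m
        w, b = w - alpha * dw, b - alpha * db    # simultaneous update
    return w

w_plain = fit(0.0)    # unregularized
w_reg = fit(50.0)     # strongly regularized
assert np.linalg.norm(w_reg) < np.linalg.norm(w_plain)
```

With a large λ every weight is pulled toward zero, but the informative weights remain clearly larger than the noise weights: soft shrinkage, not elimination.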

Quick Check

In the regularized cost function, why does adding (λ/2m)·Σwⱼ² reduce overfitting?

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Regularized Linear Regression and see your score on the leaderboard.


Up next

Next, we look at classification metrics — why accuracy alone misleads and how precision, recall, and F1 score give a complete picture of model performance.

Classification Metrics