Recalling the Regularized Cost
The previous module introduced the regularized cost function. It adds a penalty to the original mean squared error cost that grows with the size of the weights:

J(w, b) = (1/2m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)² + (λ/2m)·Σⱼ wⱼ²

where ŷ⁽ⁱ⁾ = w·x⁽ⁱ⁾ + b is the model's prediction. The first term rewards fitting the training data. The second term penalises large weights. λ controls the balance — larger λ means stronger shrinkage.
What we have not yet done is work out what this change means for gradient descent. To run gradient descent on the regularized cost, we need its partial derivatives with respect to each parameter.
What Changes and What Stays the Same
Adding the regularization term changes the gradient for wⱼ. The gradient for b is completely unchanged — the regularization term does not contain b, so its derivative with respect to b is zero.
| Parameter | Gradient changes? | Why |
|---|---|---|
| wⱼ | Yes — one extra term added | The penalty (λ/2m)·wⱼ² depends on wⱼ |
| b | No — identical to before | The penalty does not contain b |
This is the key insight: everything about the gradient descent update is the same, except wⱼ gets one additional term.
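The claim about b can be checked numerically: a finite-difference estimate of ∂J/∂b comes out the same with λ = 0 and with a large λ, because the penalty term cancels in the difference. A minimal sketch, assuming the standard regularized MSE cost described above (function names and data are illustrative):

```python
import numpy as np

def reg_cost(X, y, w, b, lam):
    # Regularized MSE cost; the penalty term contains only w, never b
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 2))
y = rng.normal(size=10)
w = rng.normal(size=2)
b, eps = 0.3, 1e-6

def numeric_db(lam):
    # Central finite-difference estimate of ∂J/∂b for a given λ
    return (reg_cost(X, y, w, b + eps, lam) - reg_cost(X, y, w, b - eps, lam)) / (2 * eps)

print(numeric_db(0.0), numeric_db(100.0))  # same value: λ never touches b
```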
Deriving the New Gradient for wⱼ
The regularized cost is the sum of two terms. We differentiate each separately.
Term 1 — the fit term (unchanged from before):

∂/∂wⱼ [(1/2m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)²] = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
Term 2 — the regularization penalty (new):

∂/∂wⱼ [(λ/2m)·Σⱼ wⱼ²] = (λ/m)·wⱼ
The 2 in the denominator cancels with the 2 that appears when differentiating wⱼ².
Combined gradient:

∂J/∂wⱼ = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
The gradient is the original prediction-error term plus the new regularization term (λ/m)·wⱼ.
The gradient of b is unchanged:

∂J/∂b = (1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
The 1/2m scaling in the cost is a convention: dividing by m keeps the penalty on the same scale as the averaged fit term regardless of training set size, and the extra factor of 2 exists purely so that it cancels when differentiating wⱼ², leaving the clean (λ/m)·wⱼ term in the gradient.
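The combined gradient can be sanity-checked against a central finite difference: perturb one weight, re-evaluate the cost, and compare. A minimal sketch, assuming the regularized MSE cost from this module (function names and data are illustrative):

```python
import numpy as np

def reg_cost(X, y, w, b, lam):
    # J(w, b): mean squared error plus (λ/2m)·Σ wⱼ²
    m = X.shape[0]
    err = X @ w + b - y
    return (err @ err) / (2 * m) + lam / (2 * m) * (w @ w)

def analytic_dw(X, y, w, b, lam):
    # ∂J/∂wⱼ = (1/m)·Σ err·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + lam / m * w

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
w = rng.normal(size=3)
b, lam, eps = 0.5, 1.7, 1e-6

# Central finite difference with respect to w₀
w_hi, w_lo = w.copy(), w.copy()
w_hi[0] += eps
w_lo[0] -= eps
numeric = (reg_cost(X, y, w_hi, b, lam) - reg_cost(X, y, w_lo, b, lam)) / (2 * eps)
analytic = analytic_dw(X, y, w, b, lam)[0]
print(numeric, analytic)  # the two agree to many decimal places
```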
The Update Rule — Weight Decay
Substitute the regularized gradient into the standard gradient descent update rule for wⱼ:

wⱼ := wⱼ − α·[(1/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾ + (λ/m)·wⱼ]
Distribute α and collect the wⱼ terms:

wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)·xⱼ⁽ⁱ⁾
The factor (1 − αλ/m) is slightly less than 1 whenever α, λ, and m are positive and αλ/m is small, as it is in practice. Every gradient descent step therefore multiplies each weight by a number slightly below 1 before applying the usual gradient correction. The weight is shrunk a little on every single update — this is called weight decay.
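As a quick arithmetic sketch of the decay factor (the values of α, λ, and m here are arbitrary placeholders, not recommendations):

```python
alpha, lam, m = 0.01, 1.0, 100   # placeholder hyperparameters
factor = 1 - alpha * lam / m     # 0.9999: each step shrinks w by 0.01%
w0 = 5.0
w_after = w0 * factor ** 10000   # pure decay, no gradient signal pushing back
print(factor, w_after)           # ≈ 0.9999, ≈ 1.84 (about e⁻¹ of the start)
```

Even a shrink of 0.01% per step compounds: after 10,000 steps a weight receiving no gradient signal has lost roughly 63% of its magnitude.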
| Update | Unregularized | Regularized |
|---|---|---|
| wⱼ | wⱼ − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ | wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾ |
| b | b − (α/m)·Σᵢ error | b − (α/m)·Σᵢ error (unchanged) |
The regularized update is wⱼ := wⱼ·(1 − αλ/m) − (α/m)·Σᵢ error·xⱼ⁽ⁱ⁾. What does the factor (1 − αλ/m) do on every update step?
How Regularization Controls the Model
To understand why adding a penalty reduces overfitting, trace what happens during gradient descent.
Without regularization, gradient descent minimises only the MSE cost — it will push weights as large as needed to fit every training example, including noise. With regularization, every gradient update must also reduce the penalty term. A weight wⱼ that is large contributes a large (λ/m)·wⱼ to the gradient — which pushes that weight toward zero. A weight that is already small contributes very little, so it is barely affected.
The result is that the model can keep features that genuinely help prediction (their weights remain non-zero) but cannot rely heavily on any single feature (no weight grows very large). This is the soft-shrinkage effect: features are kept, but their influence is constrained.
| Scenario | Without regularization | With regularization |
|---|---|---|
| High-degree polynomial feature | Weight grows large, curve contorts to fit noise | Weight stays small, curve stays smooth |
| Irrelevant noisy feature | Weight picks up training noise | Weight shrinks toward zero |
| Genuinely informative feature | Weight set to correct value | Weight slightly reduced but remains meaningful |
Weight decay is a useful mental model: before each gradient step, every weight is multiplied by (1 − αλ/m). If the gradient does not push back, the weight slowly decays toward zero over many iterations. Only weights that carry genuine predictive signal survive — gradient descent keeps rebuilding them.
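The table's claims can be seen numerically in a small experiment: one informative feature, one pure-noise feature, fit with and without the penalty. This is a hedged sketch with synthetic data and illustrative names, not the lesson's own code:

```python
import numpy as np

def grads(X, y, w, b, lam):
    # Regularized gradients: (λ/m)·w added to dw, db unchanged
    m = X.shape[0]
    err = X @ w + b - y
    return X.T @ err / m + lam / m * w, err.mean()

def fit(X, y, lam, alpha=0.1, iters=3000):
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        dw, db = grads(X, y, w, b, lam)
        w, b = w - alpha * dw, b - alpha * db
    return w, b

rng = np.random.default_rng(1)
X = rng.normal(size=(30, 2))            # column 0 informative, column 1 pure noise
y = 3.0 * X[:, 0] + rng.normal(scale=0.5, size=30)

w_plain, _ = fit(X, y, lam=0.0)
w_reg, _ = fit(X, y, lam=10.0)
print(w_plain, w_reg)  # regularized weights are smaller, but the informative one survives
```

The regularized weight vector has a smaller norm overall, yet the weight on the informative feature remains clearly non-zero: soft shrinkage, not elimination.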
Simultaneous Update
As with unregularized gradient descent, all parameters must be updated simultaneously using gradients computed from the current values of w and b — not the updated ones.
Compute ∂J/∂w₁, ∂J/∂w₂, …, ∂J/∂wₙ, ∂J/∂b ← all using current w and b
Then update: w₁, w₂, …, wₙ, b ← all at once

Using updated values mid-step would mean later weights are computed using a partially-updated model, introducing inconsistency.
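A tiny runnable sketch of the difference, using toy data and illustrative names: the correct version applies both updates from the same (w, b) snapshot, while the incorrect version lets b's gradient see an already-updated w.

```python
import numpy as np

def grads(X, y, w, b):
    err = X @ w + b - y
    return X.T @ err / len(y), err.mean()

X = np.array([[1.0], [2.0], [3.0]])
y = np.array([2.0, 4.0, 6.0])
w, b, alpha = np.array([0.0]), 0.0, 0.1

# Correct: both gradients from the same (w, b) snapshot, applied together
dw, db = grads(X, y, w, b)
w_sim, b_sim = w - alpha * dw, b - alpha * db

# Incorrect: w updated first, then b's gradient computed from the new w
w_seq = w - alpha * dw
_, db_stale = grads(X, y, w_seq, b)   # sees a partially-updated model
b_seq = b - alpha * db_stale

print(b_sim, b_seq)  # 0.4 vs ≈ 0.213: the mid-step update changes the result
```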
Python Implementation
```python
import numpy as np

def compute_cost_regularized(X, y, w, b, lambda_):
    """
    X: (m, n) — feature matrix
    y: (m,) — true labels
    w: (n,) — weights
    b: float — bias
    lambda_: float — regularization strength
    """
    m = X.shape[0]
    predictions = X @ w + b
    fit_cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    reg_cost = (lambda_ / (2 * m)) * np.sum(w ** 2)  # b is not regularized
    return fit_cost + reg_cost
```
```python
def compute_gradients_regularized(X, y, w, b, lambda_):
    """
    Returns dw: (n,), db: float
    """
    m = X.shape[0]
    error = X @ w + b - y                             # (m,) — prediction error
    dw = (1 / m) * (X.T @ error) + (lambda_ / m) * w  # extra term vs unregularized
    db = (1 / m) * np.sum(error)                      # unchanged
    return dw, db
```
```python
def gradient_descent_regularized(X, y, w, b, alpha, lambda_, iterations):
    costs = []
    for i in range(iterations):
        dw, db = compute_gradients_regularized(X, y, w, b, lambda_)
        w = w - alpha * dw  # simultaneous update
        b = b - alpha * db  # simultaneous update
        if i % 100 == 0:
            costs.append(compute_cost_regularized(X, y, w, b, lambda_))
    return w, b, costs
```

The only line that differs from the unregularized version is the dw computation — the + (lambda_ / m) * w term at the end. Everything else is identical.
In the regularized cost function, why does adding (λ/2m)·Σwⱼ² reduce overfitting?
