What Is Gradient Descent?
Gradient descent is an algorithm used to find the parameter values that minimise a cost function. In our case, it finds the values of w and b that minimise J(w, b) — the squared error cost function we defined in the previous module.
Gradient descent is not limited to linear regression. It is the same core algorithm that trains deep neural networks, transformers, and virtually every other modern machine learning model.
Gradient descent iteratively moves toward a minimum of any differentiable cost function.
• Used in: linear regression, neural networks, deep learning — everywhere.
• Input: a cost function J(w, b) and a starting point.
• Output: the values of w and b where J is smallest.
The Aim
The goal is always the same: minimise J(w, b) — the cost function that measures how badly the model is performing.
We want to find:

w*, b* = argmin J(w, b)

— the values of w and b for which J reaches its lowest point. Once we have w* and b*, we have the best possible straight-line fit to our training data.
Gradient descent minimises J(w, b). In plain terms, what does minimising J actually mean for our model?
How Gradient Descent Works
Gradient descent operates on one key intuition: the gradient tells you which way is uphill. If you go in the opposite direction — downhill — you reduce J.
The algorithm repeats two steps until it converges:
- Start with some w and b. A common initialisation is w = 0, b = 0. The specific starting values usually do not matter much for linear regression (J has only one minimum).
- Keep changing w and b to reduce J. At each step, look at the current slope of J and take a small step in the downhill direction. Repeat until J is no longer decreasing significantly — that is convergence.
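The two-step loop can be sketched on a toy one-parameter cost. The function J(w) = (w − 3)², with gradient dJ/dw = 2(w − 3), is an assumption for illustration only; it stands in for the real squared error cost.

```python
# Toy cost J(w) = (w - 3)^2, whose gradient is dJ/dw = 2 * (w - 3).
# (Illustrative stand-in for the real cost function.)
def grad(w):
    return 2 * (w - 3)

w = 0.0          # step 1: start with some value of w
alpha = 0.1      # small step size (learning rate)
for _ in range(100):
    w = w - alpha * grad(w)   # step 2: step downhill, repeat

print(round(w, 4))   # 3.0 — the minimum of J(w) = (w - 3)^2
```

Each pass shrinks the distance to the minimum by a constant factor, which is why the loop settles at w = 3 rather than overshooting.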
The key question is: how exactly do we take a step downhill? That is what the formal algorithm defines.
The Gradient Descent Algorithm
Here is the pseudocode for the algorithm.
Step 1 — Initialise
Set w = 0 and b = 0 (or any other starting values). Choose a learning rate α (alpha) — a small positive number that controls the step size.
Step 2 — Compute the gradient
For the current w and b, calculate how steeply J is sloping with respect to each parameter:
- The partial derivative of J with respect to w — call it ∂J/∂w.
- The partial derivative of J with respect to b — call it ∂J/∂b.
A positive ∂J/∂w means increasing w makes J go up. A negative ∂J/∂w means increasing w makes J go down.
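A quick way to see this sign rule is a finite-difference check. The tiny dataset below is an assumption for illustration; with w larger than the best-fit slope, the slope of J with respect to w comes out positive.

```python
import numpy as np

# Made-up data with an exact fit at w = 2, b = 0 (illustrative assumption)
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])

def J(w, b):
    m = len(y)
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

# Estimate dJ/dw at w = 3 (too large) with a central finite difference
w, b, eps = 3.0, 0.0, 1e-6
slope_w = (J(w + eps, b) - J(w - eps, b)) / (2 * eps)

print(slope_w > 0)              # True: dJ/dw is positive here
print(J(w + 0.1, b) > J(w, b))  # True: increasing w makes J go up
```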
Step 3 — Update both parameters simultaneously
Move w and b in the direction that decreases J:
w ← w − α · (∂J/∂w)
b ← b − α · (∂J/∂b)

The subtraction moves each parameter in the opposite direction to the gradient — downhill. Both updates must be computed using the current w and b before either is changed.
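A minimal sketch of the simultaneous update: temporaries hold the new values so that both gradients are evaluated at the current (w, b) before either parameter changes. The gradient values here are placeholder numbers, assumed purely for illustration.

```python
w, b, alpha = 0.0, 0.0, 0.01
dJ_dw, dJ_db = 1.5, -0.5   # placeholder gradient values (illustrative)

# Compute both new values from the CURRENT w and b...
tmp_w = w - alpha * dJ_dw
tmp_b = b - alpha * dJ_db
# ...then assign both at once
w, b = tmp_w, tmp_b
```

In Python a tuple assignment like `w, b = w - alpha * dJ_dw, b - alpha * dJ_db` achieves the same thing, since the right-hand side is evaluated in full before either name is rebound.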
Step 4 — Check for convergence
If w and b have barely changed — or if J has stopped decreasing — stop. Otherwise, go back to Step 2 with the updated w and b.
Step 5 — Return w and b
The final values of w and b are the parameters that (approximately) minimise J. These define the best-fit line for the training data.
Repeat until convergence:
w ← w − α · ∂J/∂w
b ← b − α · ∂J/∂b

Key rule: compute both partial derivatives using the current w and b before updating either one.
For the squared error cost function, the partial derivatives work out to:
∂J/∂w = (1/m) · Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
∂J/∂b = (1/m) · Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)

These are the slopes of the cost surface at the current point. Each update moves w and b one small step downhill.
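These expressions follow from differentiating J term by term with the chain rule; a short derivation sketch:

```latex
% Starting from J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}\bigl(f(x^{(i)}) - y^{(i)}\bigr)^2
% with f(x^{(i)}) = w x^{(i)} + b, the chain rule gives
\frac{\partial J}{\partial w}
  = \frac{1}{2m}\sum_{i=1}^{m} 2\bigl(f(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
  = \frac{1}{m}\sum_{i=1}^{m} \bigl(f(x^{(i)}) - y^{(i)}\bigr)\, x^{(i)}
\frac{\partial J}{\partial b}
  = \frac{1}{2m}\sum_{i=1}^{m} 2\bigl(f(x^{(i)}) - y^{(i)}\bigr)\cdot 1
  = \frac{1}{m}\sum_{i=1}^{m} \bigl(f(x^{(i)}) - y^{(i)}\bigr)
```

The factor of 2 from the square cancels the 1/2 in the cost, which is exactly why the cost is defined with 1/(2m) rather than 1/m.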
In the update rule w ← w − α · ∂J/∂w, why do we subtract the gradient rather than add it?
Visualising Gradient Descent
The diagram below shows the actual squared error cost surface J(w, b) computed from sample training data. Because the squared error cost is convex, the surface is a smooth bowl with exactly one minimum. Both starting points descend to the same optimal w* and b*.
For a general cost function the surface can have multiple hills and valleys. The diagram below shows what that looks like — starting points A and B each get trapped in a different local minimum.
For linear regression with the squared error cost function, J(w, b) is always a bowl-shaped surface — convex, with exactly one minimum. Gradient descent will always find it regardless of where it starts. This property does not hold for more complex models like neural networks, where the cost surface has many local minima.
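A small experiment illustrates the convexity claim: running gradient descent from two very different starting points on toy data (the dataset and hyperparameters below are assumptions for illustration) lands at the same minimum.

```python
import numpy as np

# Toy data with an exact linear fit y = 2x + 1 (illustrative assumption)
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

def descend(w, b, alpha=0.05, steps=10_000):
    """Run plain batch gradient descent on the squared error cost."""
    m = len(y)
    for _ in range(steps):
        err = w * X + b - y
        dw = (1 / m) * np.dot(err, X)
        db = (1 / m) * np.sum(err)
        w, b = w - alpha * dw, b - alpha * db  # simultaneous update
    return w, b

w1, b1 = descend(w=0.0, b=0.0)      # start at the origin
w2, b2 = descend(w=-5.0, b=10.0)    # start far away

print(round(w1, 3), round(b1, 3))   # both runs land near w = 2, b = 1
print(round(w2, 3), round(b2, 3))
```

Because the bowl has a single minimum, the starting point only affects how long the descent takes, not where it ends up.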
Why does the starting point matter for gradient descent on a neural network but not on linear regression?
Implementing Gradient Descent in Python
Below is a complete implementation of gradient descent for the crop yield linear regression problem.
import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost function J(w, b) = (1/2m) * sum((wx+b - y)^2)."""
    m = len(y)
    predictions = w * X + b
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost

def compute_gradients(X, y, w, b):
    """
    Partial derivatives of J with respect to w and b.

    dJ/dw = (1/m) * sum((wx+b - y) * x)
    dJ/db = (1/m) * sum(wx+b - y)
    """
    m = len(y)
    predictions = w * X + b
    errors = predictions - y            # shape: (m,)
    dw = (1 / m) * np.dot(errors, X)    # scalar
    db = (1 / m) * np.sum(errors)       # scalar
    return dw, db

def gradient_descent(X, y, w_init, b_init, alpha, num_iterations):
    """
    Run gradient descent to minimise J(w, b).

    Parameters
    ----------
    X : array of input features (rainfall mm)
    y : array of labels (yield t/ha)
    w_init, b_init : starting parameter values
    alpha : learning rate
    num_iterations : number of update steps

    Returns
    -------
    w, b : optimised parameters
    cost_history : J value recorded every 100 iterations
    """
    w, b = w_init, b_init
    cost_history = []
    for i in range(num_iterations):
        dw, db = compute_gradients(X, y, w, b)
        # Simultaneous update: both gradients were computed from the current w, b
        w = w - alpha * dw
        b = b - alpha * db
        if i % 100 == 0:
            cost = compute_cost(X, y, w, b)
            cost_history.append(cost)
            print(f"Iter {i:4d} | J = {cost:.5f} | w = {w:.4f} | b = {b:.4f}")
    return w, b, cost_history

# ── Training data: (rainfall mm, yield t/ha) ──────────────────────────────
X_train = np.array([80.0, 100.0, 120.0, 145.0, 160.0, 185.0])
y_train = np.array([ 2.2,   3.1,   3.9,   5.0,   5.4,   6.1])

# ── Run gradient descent ──────────────────────────────────────────────────
w_final, b_final, history = gradient_descent(
    X_train, y_train,
    w_init=0.0, b_init=0.0,
    alpha=0.0001,
    num_iterations=10_000,
)

print(f"\nOptimal w = {w_final:.4f} (slope)")
print(f"Optimal b = {b_final:.4f} (intercept)")
print(f"Final cost J = {compute_cost(X_train, y_train, w_final, b_final):.5f}")

# ── Predict for a new field ───────────────────────────────────────────────
x_new = 130
y_hat = w_final * x_new + b_final
print(f"\nPrediction for {x_new} mm rainfall: {y_hat:.2f} t/ha")
print(f"Expected revenue at $500/t: ${y_hat * 500:,.0f}")

# Sample output:
# Iter    0 | J = 13.04722 | w = 0.0432 | b = 0.0003
# Iter  100 | J = 0.45231 | w = 0.0291 | b = 0.1847
# ...
# Iter 9900 | J = 0.02341 | w = 0.0319 | b = -0.2781
#
# Optimal w = 0.0319 (slope)
# Optimal b = -0.2781 (intercept)
# Final cost J = 0.02341
#
# Prediction for 130 mm rainfall: 3.87 t/ha
# Expected revenue at $500/t: $1,935

Notice that w and b are updated after computing both gradients with the current values. Updating w first and then using the new w to compute the gradient for b would give wrong results — the simultaneous update is essential.
Verifying Gradient Descent is Working Correctly
The objective of gradient descent is to minimise J(w, b) — to find the values of w and b where the model's predictions are as close to the training labels as possible. Each iteration should bring J closer to that minimum.
Plot J against the number of iterations to get the learning curve. This is the standard way to verify that gradient descent is running correctly.
J must decrease after every single iteration without exception. For example, using the crop yield model:
| Iteration | J(w, b) |
|---|---|
| 0 | 13.0 |
| 200 | 1.18 |
| 600 | 0.11 |
Each row is smaller than the one before — this is correct behaviour. When the curve flattens and J stops changing significantly, gradient descent has converged and the model has found the optimal w and b.
You can also declare convergence automatically using a threshold ε (epsilon). If J decreases by less than ε in a single iteration — for example, ε = 0.001 — stop and treat the current w and b as the final parameters. In practice, choosing the right ε is difficult, so most implementations run for a fixed number of iterations and rely on the learning curve to confirm convergence visually.
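A sketch of this ε-based stopping rule, on a tiny assumed dataset (the data, learning rate, and ε value below are assumptions for illustration):

```python
import numpy as np

# Made-up data with an exact fit at w = 2, b = 0 (illustrative assumption)
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def cost(w, b):
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

w, b, alpha, epsilon = 0.0, 0.0, 0.05, 1e-3
prev_J = cost(w, b)
for i in range(100_000):
    err = w * X + b - y
    # Simultaneous update via tuple assignment
    w, b = w - alpha * (1 / m) * np.dot(err, X), b - alpha * (1 / m) * np.sum(err)
    J = cost(w, b)
    if prev_J - J < epsilon:   # decrease smaller than epsilon: declare convergence
        break
    prev_J = J

print(f"converged after {i + 1} iterations, J = {J:.5f}")
```

Note that the check compares successive J values, not successive parameter values; a run that is diverging would also trip this test, so in practice it is paired with a check that J is actually small.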
If J ever increases after an iteration, gradient descent is diverging — not converging. The cause is almost always a learning rate α that is too large. Oversized steps jump past the minimum and land on a higher point of the cost surface. The fix is to reduce α.
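This effect is easy to reproduce: on a small assumed dataset, the same update loop converges with a small α and blows up with an oversized one (the data and α values here are assumptions for illustration).

```python
import numpy as np

# Made-up data with an exact fit at w = 2, b = 0 (illustrative assumption)
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
m = len(y)

def cost(w, b):
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

def run(alpha, steps=10):
    """Record J after each of a few gradient descent steps."""
    w = b = 0.0
    history = [cost(w, b)]
    for _ in range(steps):
        err = w * X + b - y
        w, b = w - alpha * (1 / m) * np.dot(err, X), b - alpha * (1 / m) * np.sum(err)
        history.append(cost(w, b))
    return history

good = run(alpha=0.05)
bad = run(alpha=1.0)        # far above the stable range for this data

print(good[-1] < good[0])   # True: J decreased, converging
print(bad[-1] > bad[0])     # True: J increased, diverging
```

Printing the two histories side by side makes the symptom obvious: the good run shrinks J steadily, while the bad run's J grows by a large factor every step.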
The learning rate α controls how large each update step is. Choosing α correctly is critical — the next module covers this in full, including what happens when α is too large, too small, or just right.
