Machine Learning · Beginner

Gradient Descent

Tags: gradient descent, learning rate, optimization, w and b, convergence, alpha

What Is Gradient Descent?

Gradient descent is an algorithm used to find the parameter values that minimise a cost function. In our case, it finds the values of w and b that minimise J(w, b) — the squared error cost function we defined in the previous module.

Gradient descent is not limited to linear regression. It is the same core algorithm that trains deep neural networks, transformers, and virtually every other modern machine learning model.

Gradient descent finds the minimum of any differentiable cost function.

  • Used in: linear regression, neural networks, deep learning — everywhere.
  • Input: a cost function J(w, b) and a starting point.
  • Output: the values of w and b where J is smallest.

The Aim

The goal is always the same: minimise J(w, b) — the cost function that measures how badly the model is performing.

We want to find:

w*, b* = argmin J(w, b)

— the values of w and b for which J reaches its lowest point. Once we have w* and b*, we have the best possible straight-line fit to our training data.

Quick Check

Gradient descent minimises J(w, b). In plain terms, what does minimising J actually mean for our model?

How Gradient Descent Works

Gradient descent operates on one key intuition: the gradient tells you which way is uphill. If you go in the opposite direction — downhill — you reduce J.

The algorithm repeats two steps until it converges:

  1. Start with some w and b. A common initialisation is w = 0, b = 0. The specific starting values usually do not matter much for linear regression (J has only one minimum).
  2. Keep changing w and b to reduce J. At each step, look at the current slope of J and take a small step in the downhill direction. Repeat until J is no longer decreasing significantly — that is convergence.
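The two steps above can be sketched on a toy one-parameter cost, J(w) = (w − 3)², whose minimum sits at w = 3. The starting point and learning rate below are illustrative choices, not values from the crop yield problem:

```python
# Toy one-parameter cost J(w) = (w - 3)^2 with gradient dJ/dw = 2 * (w - 3).
# Starting point and learning rate are illustrative choices.

def J(w):
    return (w - 3) ** 2

def dJ_dw(w):
    return 2 * (w - 3)

w = 0.0        # Step 1: start somewhere
alpha = 0.1    # small positive step size

for _ in range(50):              # Step 2: repeatedly step downhill
    w = w - alpha * dJ_dw(w)     # subtract the gradient -> move downhill

print(round(w, 4))  # w has moved very close to the minimum at w = 3
```

Each pass shrinks the distance to the minimum by a constant factor (here 0.8), which is why the loop settles near w = 3 after a few dozen steps.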

The key question is: how exactly do we take a step downhill? That is what the formal algorithm defines.

The Gradient Descent Algorithm

Here is the algorithm, step by step.

Step 1 — Initialise. Set w = 0 and b = 0 (or any other starting values). Choose a learning rate α (alpha) — a small positive number that controls the step size.

Step 2 — Compute the gradient. For the current w and b, calculate how steeply J is sloping with respect to each parameter:

  • The partial derivative of J with respect to w — call it ∂J/∂w.
  • The partial derivative of J with respect to b — call it ∂J/∂b.

A positive ∂J/∂w means increasing w makes J go up. A negative ∂J/∂w means increasing w makes J go down.

Step 3 — Update both parameters simultaneously. Move w and b in the direction that decreases J:

pseudocode
w ← w − α · (∂J/∂w)
b ← b − α · (∂J/∂b)

The subtraction moves each parameter in the opposite direction to the gradient — downhill. Both updates must be computed using the current w and b before either is changed.

Step 4 — Check for convergence. If w and b have barely changed — or if J has stopped decreasing — stop. Otherwise, go back to Step 2 with the updated w and b.

Step 5 — Return w and b. The final values of w and b are the parameters that (approximately) minimise J. These define the best-fit line for the training data.

Repeat until convergence:

pseudocode
w ← w − α · ∂J/∂w
b ← b − α · ∂J/∂b

Key rule: compute both partial derivatives using the current w and b before updating either one.

For the squared error cost function, the partial derivatives work out to:

pseudocode
∂J/∂w = (1/m) · Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾) · x⁽ⁱ⁾
∂J/∂b = (1/m) · Σᵢ (f(x⁽ⁱ⁾) − y⁽ⁱ⁾)

These are the slopes of the cost surface at the current point. Each update moves w and b one small step downhill.
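A quick way to trust these formulas is to compare them against numerical finite differences. Here is a minimal sketch using small made-up data — the dataset and the point (w, b) are illustrative assumptions:

```python
import numpy as np

# Sanity check: the analytic partial derivatives above should agree with
# a central finite-difference approximation. Data below is made up.
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 4.0, 6.0])
w, b = 0.5, 0.1
m = len(y)

def J(w, b):
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

# Analytic gradients from the formulas above
errors = w * X + b - y
dw = (1 / m) * np.dot(errors, X)
db = (1 / m) * np.sum(errors)

# Numerical approximation: (J(p + h) - J(p - h)) / (2h)
h = 1e-6
dw_num = (J(w + h, b) - J(w - h, b)) / (2 * h)
db_num = (J(w, b + h) - J(w, b - h)) / (2 * h)

print(dw, dw_num)  # the two values should agree closely
print(db, db_num)
```

This check is worth running whenever you derive gradients by hand; a mismatch almost always means a sign or indexing mistake in the analytic formula.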

Quick Check

In the update rule w ← w − α · ∂J/∂w, why do we subtract the gradient rather than add it?

Visualising Gradient Descent

The diagram below shows the actual squared error cost surface J(w, b) computed from sample training data. Because the squared error cost is convex, the surface is a smooth bowl with exactly one minimum. Both starting points descend to the same optimal w* and b*.

Diagram
The squared error cost surface J(w, b) is a convex bowl. Gradient descent from any starting point converges to the same global minimum — w* ≈ 1.03 and b* ≈ 0.04 for this training data.

For a general cost function the surface can have multiple hills and valleys. The diagram below shows what that looks like — starting points A and B each get trapped in a different local minimum.

Diagram
A non-convex cost surface with multiple local minima. Starting point A descends to local minimum A; starting point B descends to a different minimum. This is common in neural networks but not in linear regression.

For linear regression with the squared error cost function, J(w, b) is always a bowl-shaped surface — convex, with exactly one minimum. Gradient descent will always find it regardless of where it starts. This property does not hold for more complex models like neural networks, where the cost surface has many local minima.
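That convexity claim is easy to check empirically. The sketch below runs a plain gradient descent loop from two very different starting points; the data (generated from y = 2x + 1) and the `descend` helper are illustrative assumptions:

```python
import numpy as np

# Convexity check: for the squared error cost, gradient descent should
# reach the same (w, b) from very different starts. Data is made up,
# generated exactly from y = 2x + 1.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])
m = len(y)

def descend(w, b, alpha=0.05, steps=20_000):
    for _ in range(steps):
        errors = w * X + b - y
        dw = (1 / m) * np.dot(errors, X)
        db = (1 / m) * np.sum(errors)
        w, b = w - alpha * dw, b - alpha * db   # simultaneous update
    return w, b

w1, b1 = descend(0.0, 0.0)     # start at the origin
w2, b2 = descend(-5.0, 10.0)   # start far away

print(round(w1, 3), round(b1, 3))  # both runs land near w = 2, b = 1
print(round(w2, 3), round(b2, 3))
```

On a non-convex surface the same experiment could return two different parameter pairs — which is exactly why initialisation matters for neural networks.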

Quick Check

Why does the starting point matter for gradient descent on a neural network but not on linear regression?

Implementing Gradient Descent in Python

Below is a complete implementation of gradient descent for the crop yield linear regression problem.

python
import numpy as np

def compute_cost(X, y, w, b):
    """Squared error cost function J(w, b) = (1/2m) * sum((wx+b - y)^2)"""
    m = len(y)
    predictions = w * X + b
    cost = (1 / (2 * m)) * np.sum((predictions - y) ** 2)
    return cost


def compute_gradients(X, y, w, b):
    """
    Partial derivatives of J with respect to w and b.
    dJ/dw = (1/m) * sum((wx+b - y) * x)
    dJ/db = (1/m) * sum(wx+b - y)
    """
    m = len(y)
    predictions = w * X + b
    errors = predictions - y          # shape: (m,)
    dw = (1 / m) * np.dot(errors, X)  # scalar
    db = (1 / m) * np.sum(errors)     # scalar
    return dw, db


def gradient_descent(X, y, w_init, b_init, alpha, num_iterations):
    """
    Run gradient descent to minimise J(w, b).

    Parameters
    ----------
    X             : array of input features (rainfall mm)
    y             : array of labels (yield t/ha)
    w_init, b_init: starting parameter values
    alpha         : learning rate
    num_iterations: number of update steps

    Returns
    -------
    w, b          : optimised parameters
    cost_history  : J value recorded every 100 iterations
    """
    w, b = w_init, b_init
    cost_history = []

    for i in range(num_iterations):
        dw, db = compute_gradients(X, y, w, b)

        # Simultaneous update — use current w, b for both gradients before updating
        w = w - alpha * dw
        b = b - alpha * db

        if i % 100 == 0:
            cost = compute_cost(X, y, w, b)
            cost_history.append(cost)
            print(f"Iter {i:4d} | J = {cost:.5f} | w = {w:.4f} | b = {b:.4f}")

    return w, b, cost_history


# ── Training data: (rainfall mm, yield t/ha) ──────────────────────────────
X_train = np.array([80.0, 100.0, 120.0, 145.0, 160.0, 185.0])
y_train = np.array([ 2.2,   3.1,   3.9,   5.0,   5.4,   6.1])

# ── Run gradient descent ──────────────────────────────────────────────────
w_final, b_final, history = gradient_descent(
    X_train, y_train,
    w_init=0.0, b_init=0.0,
    alpha=0.0001,
    num_iterations=10_000,
)

print(f"\nOptimal  w = {w_final:.4f}  (slope)")
print(f"Optimal  b = {b_final:.4f}  (intercept)")
print(f"Final cost J = {compute_cost(X_train, y_train, w_final, b_final):.5f}")

# ── Predict for a new field ───────────────────────────────────────────────
x_new = 130
y_hat = w_final * x_new + b_final
print(f"\nPrediction for {x_new} mm rainfall: {y_hat:.2f} t/ha")
print(f"Expected revenue at $500/t:         ${y_hat * 500:,.0f}")

# Sample output:
# Iter    0 | J = 13.04722 | w = 0.0432 | b = 0.0003
# Iter  100 | J =  0.45231 | w = 0.0291 | b = 0.1847
# ...
# Iter 9900 | J =  0.02341 | w = 0.0319 | b = -0.2781
#
# Optimal  w = 0.0319  (slope)
# Optimal  b = -0.2781 (intercept)
# Final cost J = 0.02341
#
# Prediction for 130 mm rainfall: 3.87 t/ha
# Expected revenue at $500/t: $1,935

Notice that w and b are updated after computing both gradients with the current values. Updating w first and then using the new w to compute the gradient for b would give wrong results — the simultaneous update is essential.
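The difference is easy to see after a single step. In the sketch below (tiny made-up data and learning rate), the sequential version recomputes b's gradient at the already-updated w, so its b update no longer matches the simultaneous one:

```python
import numpy as np

# One-step comparison of simultaneous vs. sequential updates.
# The data and alpha here are illustrative.
X = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 3.0, 5.0])
m, alpha = len(y), 0.1

def grads(w, b):
    errors = w * X + b - y
    return (1 / m) * np.dot(errors, X), (1 / m) * np.sum(errors)

w0, b0 = 0.0, 0.0

# Correct: both gradients evaluated at the current (w0, b0)
dw, db = grads(w0, b0)
w_sim, b_sim = w0 - alpha * dw, b0 - alpha * db

# Buggy: update w first, then evaluate b's gradient at the new w
dw, _ = grads(w0, b0)
w_seq = w0 - alpha * dw
_, db = grads(w_seq, b0)     # gradient taken at the wrong point
b_seq = b0 - alpha * db

print(b_sim, b_seq)  # the two b updates differ
```

The w updates coincide here (both use the gradient at the start point), but the b updates diverge — and the error compounds over thousands of iterations.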

Verifying Gradient Descent is Working Correctly

The objective of gradient descent is to minimise J(w, b) — to find the values of w and b where the model's predictions are as close to the training labels as possible. Each iteration should bring J closer to that minimum.

Plot J against the number of iterations to get the learning curve. This is the standard way to verify that gradient descent is running correctly.

Diagram
The learning curve plots cost J against iterations. J must decrease after every iteration. When the curve flattens, gradient descent has converged.

J must decrease after every single iteration without exception. For example, using the crop yield model:

Iteration    J(w, b)
0            13.0
200          1.18
600          0.11

Each row is smaller than the one before — this is correct behaviour. When the curve flattens and J stops changing significantly, gradient descent has converged and the model has found the optimal w and b.

You can also declare convergence automatically using a threshold ε (epsilon). If J decreases by less than ε in a single iteration — for example, ε = 0.001 — stop and treat the current w and b as the final parameters. In practice, choosing the right ε is difficult, so most implementations run for a fixed number of iterations and rely on the learning curve to confirm convergence visually.
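An ε-based stopping rule can be sketched like this — the data, α, and ε below are illustrative choices, not values from the crop yield run:

```python
import numpy as np

# Stop when J decreases by less than epsilon in one iteration.
# Data, alpha, and epsilon are illustrative.
X = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])
m = len(y)

def cost(w, b):
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

w, b, alpha, epsilon = 0.0, 0.0, 0.01, 1e-6
prev_J = cost(w, b)

for i in range(100_000):
    errors = w * X + b - y
    # simultaneous update: both gradients use the current w and b
    w, b = (w - alpha * (1 / m) * np.dot(errors, X),
            b - alpha * (1 / m) * np.sum(errors))
    J = cost(w, b)
    if prev_J - J < epsilon:   # decrease too small -> declare convergence
        break
    prev_J = J

print(f"stopped after {i + 1} iterations, J = {J:.5f}")
```

Note the trade-off: a smaller ε gives a more precise answer but more iterations, which is why a fixed iteration budget plus a visual learning-curve check is the more common pattern.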

If J ever increases after an iteration, gradient descent is diverging — not converging. The cause is almost always a learning rate α that is too large. Oversized steps jump past the minimum and land on a higher point of the cost surface. The fix is to reduce α.
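The divergence symptom is easy to reproduce. The sketch below reuses the crop yield training data from the implementation above and takes five steps, once with a small α and once with an oversized one (0.001 is an illustrative "too large" value for this data):

```python
import numpy as np

# Demonstration: an oversized learning rate makes J increase each step.
# Same crop yield data as the implementation; alphas are illustrative.
X = np.array([80.0, 100.0, 120.0, 145.0, 160.0, 185.0])
y = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1])
m = len(y)

def cost(w, b):
    return (1 / (2 * m)) * np.sum((w * X + b - y) ** 2)

def step(w, b, alpha):
    errors = w * X + b - y
    return (w - alpha * (1 / m) * np.dot(errors, X),
            b - alpha * (1 / m) * np.sum(errors))

for alpha in (0.00001, 0.001):
    w, b = 0.0, 0.0
    costs = [cost(w, b)]
    for _ in range(5):
        w, b = step(w, b, alpha)
        costs.append(cost(w, b))
    trend = "decreasing" if costs[-1] < costs[0] else "INCREASING (diverging)"
    print(f"alpha = {alpha}: J goes {costs[0]:.3g} -> {costs[-1]:.3g}  {trend}")
```

With the small α, J falls step by step; with the oversized α, each update overshoots the minimum and J explodes within a handful of iterations.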

The learning rate α controls how large each update step is. Choosing α correctly is critical — the next module covers this in full, including what happens when α is too large, too small, or just right.

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Gradient Descent and see your score on the leaderboard.


Up next

In the next module, we take a deep dive into the learning rate — how α controls every gradient descent step, what happens when it is too large or too small, and how to choose the right value.

The Learning Rate