What Is Gradient Descent?
Gradient descent is the optimisation algorithm that trains logistic regression (and, in one variant or another, nearly every neural network). It repeatedly adjusts W and b in the direction that reduces the cost J. The key insight is that the gradient of J with respect to W and b tells you which direction increases the cost — so you move in the opposite direction.
Deriving Gradient Descent for Logistic Regression
Gradient descent updates W and b by subtracting a fraction of the gradient of J with respect to each parameter. The generic update rule, with learning rate α, is:

W := W − α · ∂J/∂W
b := b − α · ∂J/∂b
To apply this to logistic regression, we need to compute ∂J/∂W and ∂J/∂b explicitly. We do this using the chain rule, tracing back through the three-step chain that connects W and b to J.
The Prediction Chain
Every prediction passes through three steps before producing a loss:
| Step | Formula | Shape |
|---|---|---|
| Linear score | z(i) = W·x(i) + b | scalar per example |
| Sigmoid | ŷ(i) = σ(z(i)) | scalar in (0, 1) |
| Loss | L(i) = −[y(i)·log(ŷ(i)) + (1−y(i))·log(1−ŷ(i))] | scalar ≥ 0 |
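The training code later in this section calls a forward_pass function defined in the previous module. For reference, a minimal sketch consistent with the shapes above (X with examples as rows, A returned as (1, m)) might look like this:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, W, b):
    # X: (m, n), W: (n, 1), b: scalar
    Z = np.dot(X, W) + b      # (m, 1): linear score per example
    return sigmoid(Z).T       # A: (1, m): predicted probabilities
```

With W initialised to zeros, every score z is 0 and every prediction is σ(0) = 0.5, which is a useful sanity check.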
The chain rule connects ∂L/∂W across all three steps:

∂L/∂W = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂W)
Step 1 — ∂L/∂ŷ
Differentiate the binary cross-entropy loss with respect to ŷ:

∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)

Combining over the common denominator ŷ(1−ŷ):

∂L/∂ŷ = (ŷ − y) / (ŷ(1−ŷ))
Step 2 — ∂ŷ/∂z (Sigmoid Derivative)
The sigmoid σ(z) = 1/(1 + e^−z) has a well-known derivative. Differentiating directly:

∂ŷ/∂z = σ(z)·(1 − σ(z)) = ŷ(1 − ŷ)
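A quick numerical sanity check: comparing a central finite difference of σ against ŷ(1 − ŷ) at an arbitrary point confirms the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
# central difference approximates the true derivative to O(eps^2)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))   # sigma(z) * (1 - sigma(z))
assert abs(numeric - analytic) < 1e-8
```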
Step 3 — ∂z/∂W and ∂z/∂b
The linear step z = W·x + b is a simple linear function, so its partial derivatives are:

∂z/∂W = x
∂z/∂b = 1
Combining the Chain Rule — ∂L/∂W
Multiplying all three links together:

∂L/∂W = [(ŷ − y) / (ŷ(1−ŷ))] · ŷ(1−ŷ) · x

The ŷ(1−ŷ) terms cancel exactly:

∂L/∂W = (ŷ − y) · x
Combining the Chain Rule — ∂L/∂b
The same chain, but the last link is ∂z/∂b = 1:

∂L/∂b = (ŷ − y)
After the chain rule cancellation, what does ∂L/∂b simplify to?
The ŷ(1−ŷ) from the sigmoid derivative cancels perfectly with the denominator in ∂L/∂ŷ. This is not a coincidence — binary cross-entropy and sigmoid are designed as a pair. The gradient simplifies to pure prediction error: (ŷ − y).
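This cancellation can be verified numerically: differentiate the composed loss L(σ(z)) with respect to z by finite differences and compare it with (ŷ − y).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat):
    # binary cross-entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.3, 1.0, 1e-6
# finite-difference gradient of the composed loss L(sigma(z)) with respect to z
numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
analytic = sigmoid(z) - y   # the simplified (y_hat - y)
assert abs(numeric - analytic) < 1e-8
```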
Averaging Over All m Examples — From ∂L/∂W to ∂J/∂W
We derived ∂L/∂W for a single training example. But gradient descent minimises J, not L. The bridge is the definition of J itself:

J = (1/m) · Σᵢ L(i)
Differentiating both sides with respect to W — differentiation distributes over a sum, and (1/m) is a constant:

∂J/∂W = (1/m) · Σᵢ ∂L(i)/∂W
We already know ∂L/∂W = (ŷ − y) · x for a single example. Substituting that in for every example i:

∂J/∂W = (1/m) · Σᵢ (ŷ(i) − y(i)) · x(i)
The same logic applies to b. Since ∂L/∂b = (ŷ − y):

∂J/∂b = (1/m) · Σᵢ (ŷ(i) − y(i))
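Before vectorising, the averaged gradients can be written as an explicit loop over examples. This sketch assumes the same shapes used throughout: X is (m, n), Y and A are (1, m).

```python
import numpy as np

def compute_gradients_loop(X, Y, A):
    # X: (m, n), Y: (1, m), A: (1, m); returns dW: (n, 1), db: scalar
    m, n = X.shape
    dW = np.zeros((n, 1))
    db = 0.0
    for i in range(m):
        error = A[0, i] - Y[0, i]            # prediction error for example i
        dW += error * X[i].reshape(n, 1)     # accumulate error * x(i)
        db += error
    return dW / m, db / m                    # average over all m examples
```

The vectorised version in the next section computes exactly the same numbers without the Python-level loop.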
Vectorised Form
Before removing the loop, we stack the per-example values into matrices:
| Symbol | Shape | Contains |
|---|---|---|
| A | (1, m) | All predicted probabilities — ŷ⁽¹⁾, ŷ⁽²⁾, …, ŷ⁽ᵐ⁾ |
| Y | (1, m) | All true labels — y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾ |
| X | (m, n) | All input examples stacked as rows |
The element-wise difference (A − Y) gives the prediction error for every example simultaneously — it is the vectorised form of (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾). The sum over i in the gradient formulas then becomes a matrix multiply:

∂J/∂W = (1/m) · Xᵀ · (A − Y)ᵀ
∂J/∂b = (1/m) · Σ(A − Y)
The transposes are purely about making the shapes compatible. X is (m, n) and (A − Y) is (1, m) — multiplying them directly gives no valid shape. Transposing X gives (n, m) and transposing (A − Y) gives (m, 1). Then (n, m) · (m, 1) = (n, 1), which matches the shape of W exactly — one gradient value per weight.
| Quantity | Shape | Meaning |
|---|---|---|
| A − Y | (1, m) | Prediction error per example |
| Xᵀ · (A−Y)ᵀ | (n, m)·(m, 1) = (n, 1) | Gradient for each weight |
| Σ(A − Y) | scalar | Gradient for the bias |
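These shapes can be checked directly with random stand-in values (the data here is arbitrary, purely to exercise the shapes):

```python
import numpy as np

m, n = 5, 3
X = np.random.randn(m, n)                       # arbitrary inputs
A = np.random.rand(1, m)                        # stand-in predicted probabilities
Y = (np.random.rand(1, m) > 0.5).astype(float)  # stand-in labels

dZ = A - Y                   # (1, m): prediction error per example
dW = np.dot(X.T, dZ.T)       # (n, m) dot (m, 1) = (n, 1)
db = np.sum(dZ)              # scalar
assert dZ.shape == (1, 5) and dW.shape == (3, 1)
```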
The Gradient in Code
```python
def compute_gradients(X, Y, A):
    """
    X: (m, n)
    Y: (1, m) — true labels
    A: (1, m) — predicted probabilities
    Returns dW: (n, 1), db: scalar
    """
    m = X.shape[0]
    dZ = A - Y                        # (1, m) — prediction error
    dW = (1/m) * np.dot(X.T, dZ.T)    # (n, m) dot (m, 1) = (n, 1)
    db = (1/m) * np.sum(dZ)           # scalar
    return dW, db
```

The Update Rule
After computing gradients, you update W and b by subtracting a fraction of the gradient. The fraction is the learning rate α — a hyperparameter you choose. A learning rate that is too large causes the cost to oscillate and diverge. Too small, and training takes too long.
```python
def update_parameters(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b
```

The learning rate α is the most important hyperparameter to tune. Start with 0.01. If the cost decreases smoothly, training is working. If the cost jumps around or increases, reduce the learning rate.
Full Training Loop
With the forward pass, cost, gradients, and update rule all defined, the training loop is just these four steps repeated. The predict function is defined in the previous module — it applies forward_pass and thresholds at 0.5.
```python
def train_logistic_regression(X_train, Y_train, learning_rate=0.01, num_iterations=1000):
    m, n = X_train.shape
    W, b = np.zeros((n, 1)), 0.0
    costs = []
    for i in range(num_iterations):
        # Forward pass
        A = forward_pass(X_train, W, b)
        # Compute cost
        cost = compute_cost(A, Y_train)
        # Backward pass — gradients
        dW, db = compute_gradients(X_train, Y_train, A)
        # Update parameters
        W, b = update_parameters(W, b, dW, db, learning_rate)
        if i % 100 == 0:
            costs.append(cost)
            print(f"Cost at iteration {i}: {cost:.4f}")
    return W, b, costs
```

Evaluating the Trained Model
After training, you evaluate on the test set to check generalisation. Training accuracy tells you how well the model fits the training data. Test accuracy tells you how well it generalises to unseen data. Plotting the cost over iterations confirms whether gradient descent actually converged.
```python
import matplotlib.pyplot as plt

W, b, costs = train_logistic_regression(X_train, Y_train, learning_rate=0.005, num_iterations=2000)

# Accuracy
train_preds = predict(X_train, W, b)
test_preds = predict(X_test, W, b)
train_acc = np.mean(train_preds == Y_train) * 100
test_acc = np.mean(test_preds == Y_test) * 100
print(f"Train accuracy: {train_acc:.1f}%")
print(f"Test accuracy: {test_acc:.1f}%")

# Cost curve — should decrease smoothly; oscillation means α is too large
plt.plot(costs)
plt.xlabel('Iterations (×100)')
plt.ylabel('Cost J')
plt.title('Training Cost Over Time')
plt.show()
```

Gradient Descent Variants
The version above is batch gradient descent — it uses all m training examples to compute each gradient update. For large datasets this is slow. Mini-batch gradient descent splits the data into small batches and updates once per batch — much faster.
| Variant | Batch size | Update frequency | Pros | Cons |
|---|---|---|---|---|
| Batch | All m | Once per epoch | Stable, exact gradients | Slow for large datasets |
| Stochastic (SGD) | 1 | m times per epoch | Fast updates | Noisy, hard to converge |
| Mini-batch | 32–256 | m/batch_size times per epoch | Balance of speed and stability | Extra hyperparameter |
Mini-batch gradient descent with batch size 32 or 64 is the standard in practice. PyTorch and TensorFlow DataLoaders handle the batching automatically.
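A minimal sketch of the batching step, assuming the same (m, n) / (1, m) layout as above; real DataLoaders add per-epoch reshuffling, parallel workers, and more:

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size=32, seed=0):
    # X: (m, n), Y: (1, m); yields shuffled (X_batch, Y_batch) pairs
    m = X.shape[0]
    order = np.random.default_rng(seed).permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[:, idx]
```

One pass over all batches is one epoch; the gradient and update code from earlier runs unchanged on each (X_batch, Y_batch) pair, just with a smaller m.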
