What Is Gradient Descent?
Gradient descent is the optimisation algorithm that trains logistic regression (and, in one variant or another, nearly every neural network). It repeatedly adjusts W and b in the direction that reduces the cost J. The key insight is that the gradient of J with respect to W and b tells you which direction increases the cost — so you move in the opposite direction.
Deriving Gradient Descent for Logistic Regression
Gradient descent updates W and b by subtracting a fraction of the gradient of J with respect to each parameter. The generic update rule, with learning rate α, is:

W := W − α · ∂J/∂W
b := b − α · ∂J/∂b
To apply this to logistic regression, we need to compute ∂J/∂W and ∂J/∂b explicitly. We do this using the chain rule, tracing back through the three-step chain that connects W and b to J.
The Prediction Chain
Every prediction passes through three steps before producing a loss:
| Step | Formula | Shape |
|---|---|---|
| Linear score | z(i) = W·x(i) + b | scalar per example |
| Sigmoid | ŷ(i) = σ(z(i)) | scalar in (0, 1) |
| Loss | L(i) = −[y(i)·log(ŷ(i)) + (1−y(i))·log(1−ŷ(i))] | scalar ≥ 0 |
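The training code later in this section calls a forward_pass function defined in the previous module. For reference, a minimal sketch consistent with the shapes above (X with examples as rows, A returned as (1, m)) might look like this:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^-z)
    return 1.0 / (1.0 + np.exp(-z))

def forward_pass(X, W, b):
    # X: (m, n), W: (n, 1), b: scalar
    Z = np.dot(X, W) + b      # (m, 1): linear score per example
    return sigmoid(Z).T       # A: (1, m): predicted probabilities
```

With W initialised to zeros, every score z is 0 and every prediction is σ(0) = 0.5, which is a useful sanity check.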
The chain rule connects ∂L/∂W across all three steps:

∂L/∂W = (∂L/∂ŷ) · (∂ŷ/∂z) · (∂z/∂W)
Step 1 — ∂L/∂ŷ
Differentiate the binary cross-entropy loss with respect to ŷ:

∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)

Combining over the common denominator ŷ(1−ŷ):

∂L/∂ŷ = (ŷ − y) / (ŷ(1−ŷ))
Step 2 — ∂ŷ/∂z (Sigmoid Derivative)
The sigmoid σ(z) = 1/(1 + e^−z) has a well-known derivative. Differentiating directly:

∂ŷ/∂z = σ(z)·(1 − σ(z)) = ŷ(1 − ŷ)
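A quick numerical sanity check: comparing a central finite difference of σ against ŷ(1 − ŷ) at an arbitrary point confirms the formula.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z, eps = 0.7, 1e-6
# central difference approximates the true derivative to O(eps^2)
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
analytic = sigmoid(z) * (1 - sigmoid(z))   # sigma(z) * (1 - sigma(z))
assert abs(numeric - analytic) < 1e-8
```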
Step 3 — ∂z/∂W and ∂z/∂b
The linear step z = W·x + b is a simple linear function, so its partial derivatives are:

∂z/∂W = x
∂z/∂b = 1
Combining the Chain Rule — ∂L/∂W
Multiplying all three links together:

∂L/∂W = [(ŷ − y) / (ŷ(1−ŷ))] · ŷ(1−ŷ) · x

The ŷ(1−ŷ) terms cancel exactly:

∂L/∂W = (ŷ − y) · x
Combining the Chain Rule — ∂L/∂b
The same chain, but the last link is ∂z/∂b = 1:

∂L/∂b = (ŷ − y)
After the chain rule cancellation, what does ∂L/∂b simplify to?
The ŷ(1−ŷ) from the sigmoid derivative cancels perfectly with the denominator in ∂L/∂ŷ. This is not a coincidence — binary cross-entropy and sigmoid are designed as a pair. The gradient simplifies to pure prediction error: (ŷ − y).
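This cancellation can be verified numerically: differentiate the composed loss L(σ(z)) with respect to z by finite differences and compare it with (ŷ − y).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce(y, y_hat):
    # binary cross-entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

z, y, eps = 0.3, 1.0, 1e-6
# finite-difference gradient of the composed loss L(sigma(z)) with respect to z
numeric = (bce(y, sigmoid(z + eps)) - bce(y, sigmoid(z - eps))) / (2 * eps)
analytic = sigmoid(z) - y   # the simplified (y_hat - y)
assert abs(numeric - analytic) < 1e-8
```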
Averaging Over All m Examples — From ∂L/∂W to ∂J/∂W
We derived ∂L/∂W for a single training example. But gradient descent minimises J, not L. The bridge is the definition of J itself:

J = (1/m) · Σᵢ L(i)
Differentiating both sides with respect to W — differentiation distributes over a sum, and (1/m) is a constant:

∂J/∂W = (1/m) · Σᵢ ∂L(i)/∂W
We already know ∂L/∂W = (ŷ − y) · x for a single example. Substituting that in for every example i:

∂J/∂W = (1/m) · Σᵢ (ŷ(i) − y(i)) · x(i)
The same logic applies to b. Since ∂L/∂b = (ŷ − y):

∂J/∂b = (1/m) · Σᵢ (ŷ(i) − y(i))
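Before vectorising, the averaged gradients can be written as an explicit loop over examples. This sketch assumes the same shapes used throughout: X is (m, n), Y and A are (1, m).

```python
import numpy as np

def compute_gradients_loop(X, Y, A):
    # X: (m, n), Y: (1, m), A: (1, m); returns dW: (n, 1), db: scalar
    m, n = X.shape
    dW = np.zeros((n, 1))
    db = 0.0
    for i in range(m):
        error = A[0, i] - Y[0, i]            # prediction error for example i
        dW += error * X[i].reshape(n, 1)     # accumulate error * x(i)
        db += error
    return dW / m, db / m                    # average over all m examples
```

The vectorised version in the next section computes exactly the same numbers without the Python-level loop.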
Vectorised Form
Before removing the loop, we stack the per-example values into matrices:
| Symbol | Shape | Contains |
|---|---|---|
| A | (1, m) | All predicted probabilities — ŷ⁽¹⁾, ŷ⁽²⁾, …, ŷ⁽ᵐ⁾ |
| Y | (1, m) | All true labels — y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾ |
| X | (m, n) | All input examples stacked as rows |
The element-wise difference (A − Y) gives the prediction error for every example simultaneously — it is the vectorised form of (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾). The sum over i in the gradient formulas then becomes a matrix multiply:

∂J/∂W = (1/m) · Xᵀ · (A − Y)ᵀ
∂J/∂b = (1/m) · Σ(A − Y)
The transposes are purely about making the shapes compatible. X is (m, n) and (A − Y) is (1, m) — multiplying them directly gives no valid shape. Transposing X gives (n, m) and transposing (A − Y) gives (m, 1). Then (n, m) · (m, 1) = (n, 1), which matches the shape of W exactly — one gradient value per weight.
| Quantity | Shape | Meaning |
|---|---|---|
| A − Y | (1, m) | Prediction error per example |
| Xᵀ · (A−Y)ᵀ | (n, m)·(m, 1) = (n, 1) | Gradient for each weight |
| Σ(A − Y) | scalar | Gradient for the bias |
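These shapes can be checked directly with random stand-in values (the data here is arbitrary, purely to exercise the shapes):

```python
import numpy as np

m, n = 5, 3
X = np.random.randn(m, n)                       # arbitrary inputs
A = np.random.rand(1, m)                        # stand-in predicted probabilities
Y = (np.random.rand(1, m) > 0.5).astype(float)  # stand-in labels

dZ = A - Y                   # (1, m): prediction error per example
dW = np.dot(X.T, dZ.T)       # (n, m) dot (m, 1) = (n, 1)
db = np.sum(dZ)              # scalar
assert dZ.shape == (1, 5) and dW.shape == (3, 1)
```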
The Gradient in Code
```python
def compute_gradients(X, Y, A):
    """
    X: (m, n)
    Y: (1, m) — true labels
    A: (1, m) — predicted probabilities
    Returns dW: (n, 1), db: scalar
    """
    m = X.shape[0]
    dZ = A - Y                        # (1, m) — prediction error
    dW = (1/m) * np.dot(X.T, dZ.T)    # (n, m) dot (m, 1) = (n, 1)
    db = (1/m) * np.sum(dZ)           # scalar
    return dW, db
```

The Update Rule
After computing gradients, you update W and b by subtracting a fraction of the gradient. The fraction is the learning rate α — a hyperparameter you choose. A learning rate that is too large causes the cost to oscillate and diverge. Too small, and training takes too long.
```python
def update_parameters(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b
```

The learning rate α is the most important hyperparameter to tune. Start with 0.01. If the cost decreases smoothly, training is working. If the cost jumps around or increases, reduce the learning rate.
Full Training Loop
With the forward pass, cost, gradients, and update rule all defined, the training loop is just these four steps repeated. The predict function is defined in the previous module — it applies forward_pass and thresholds at 0.5.
```python
def train_logistic_regression(X_train, Y_train, learning_rate=0.01, num_iterations=1000):
    m, n = X_train.shape
    W, b = np.zeros((n, 1)), 0.0
    costs = []
    for i in range(num_iterations):
        # Forward pass
        A = forward_pass(X_train, W, b)
        # Compute cost
        cost = compute_cost(A, Y_train)
        # Backward pass — gradients
        dW, db = compute_gradients(X_train, Y_train, A)
        # Update parameters
        W, b = update_parameters(W, b, dW, db, learning_rate)
        if i % 100 == 0:
            costs.append(cost)
            print(f"Cost at iteration {i}: {cost:.4f}")
    return W, b, costs
```

Evaluating the Trained Model
After training, you evaluate on the test set to check generalisation. Training accuracy tells you how well the model fits the training data. Test accuracy tells you how well it generalises to unseen data. Plotting the cost over iterations confirms whether gradient descent actually converged.
```python
import matplotlib.pyplot as plt

W, b, costs = train_logistic_regression(X_train, Y_train, learning_rate=0.005, num_iterations=2000)

# Accuracy
train_preds = predict(X_train, W, b)
test_preds = predict(X_test, W, b)
train_acc = np.mean(train_preds == Y_train) * 100
test_acc = np.mean(test_preds == Y_test) * 100
print(f"Train accuracy: {train_acc:.1f}%")
print(f"Test accuracy: {test_acc:.1f}%")

# Cost curve — should decrease smoothly; oscillation means α is too large
plt.plot(costs)
plt.xlabel('Iterations (×100)')
plt.ylabel('Cost J')
plt.title('Training Cost Over Time')
plt.show()
```

Gradient Descent Variants
The version above is batch gradient descent — it uses all m training examples to compute each gradient update. For large datasets this is slow. Mini-batch gradient descent splits the data into small batches and updates once per batch — much faster.
| Variant | Batch size | Update frequency | Pros | Cons |
|---|---|---|---|---|
| Batch | All m | Once per epoch | Stable, exact gradients | Slow for large datasets |
| Stochastic (SGD) | 1 | m times per epoch | Fast updates | Noisy, hard to converge |
| Mini-batch | 32–256 | m/batch_size times per epoch | Balance of speed and stability | Extra hyperparameter |
Mini-batch gradient descent with batch size 32 or 64 is the standard in practice. PyTorch and TensorFlow DataLoaders handle the batching automatically.
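A minimal sketch of the batching step, assuming the same (m, n) / (1, m) layout as above; real DataLoaders add per-epoch reshuffling, parallel workers, and more:

```python
import numpy as np

def iterate_minibatches(X, Y, batch_size=32, seed=0):
    # X: (m, n), Y: (1, m); yields shuffled (X_batch, Y_batch) pairs
    m = X.shape[0]
    order = np.random.default_rng(seed).permutation(m)
    for start in range(0, m, batch_size):
        idx = order[start:start + batch_size]
        yield X[idx], Y[:, idx]
```

One pass over all batches is one epoch; the gradient and update code from earlier runs unchanged on each (X_batch, Y_batch) pair, just with a smaller m.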
