Deep Learning · Intermediate

Gradient Descent for Logistic Regression

Tags: python · gradient descent · logistic regression · optimization

What Is Gradient Descent?

Gradient descent is the optimisation algorithm that trains logistic regression (and, in one variant or another, virtually every neural network). It repeatedly adjusts W and b in the direction that reduces the cost J. The key insight is that the gradient of J with respect to W and b points in the direction that increases the cost — so you move in the opposite direction.
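The idea can be seen on a toy one-dimensional problem before any logistic regression machinery is involved. This sketch minimises f(w) = (w − 3)², whose gradient is 2(w − 3); the starting point and learning rate are purely illustrative:

```python
import numpy as np

# Minimise f(w) = (w - 3)^2 by repeatedly stepping against the gradient f'(w) = 2(w - 3).
w = 0.0
alpha = 0.1  # learning rate
for _ in range(100):
    grad = 2 * (w - 3)
    w -= alpha * grad  # move opposite the gradient

print(round(w, 4))  # converges towards the minimum at w = 3
```

Each step shrinks the distance to the minimum by a constant factor, which is exactly the behaviour the update rule below generalises to W and b.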

Deriving Gradient Descent for Logistic Regression

Gradient descent updates W and b by subtracting a fraction of the gradient of J with respect to each parameter. The generic update rule is:

W := W − α · ∂J/∂W
b := b − α · ∂J/∂b

To apply this to logistic regression, we need to compute ∂J/∂W and ∂J/∂b explicitly. We do this using the chain rule, tracing back through the three-step chain that connects W and b to J.

The Prediction Chain

Every prediction passes through three steps before producing a loss:

| Step | Formula | Shape |
| --- | --- | --- |
| Linear score | z⁽ⁱ⁾ = W·x⁽ⁱ⁾ + b | scalar per example |
| Sigmoid | ŷ⁽ⁱ⁾ = σ(z⁽ⁱ⁾) | scalar in (0, 1) |
| Loss | L⁽ⁱ⁾ = −[y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾)·log(1−ŷ⁽ⁱ⁾)] | scalar ≥ 0 |
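The three steps can be traced numerically for a single example. The feature values, weights, and label below are illustrative, not from the source:

```python
import numpy as np

# One training example with 3 features and label y = 1 (values are illustrative)
x = np.array([0.5, -1.2, 0.3])
W = np.array([0.4, 0.1, -0.2])
b = 0.1
y = 1

z = np.dot(W, x) + b                     # linear score (scalar)
y_hat = 1 / (1 + np.exp(-z))             # sigmoid -> probability in (0, 1)
loss = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))  # binary cross-entropy

print(z, y_hat, loss)
```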

The chain rule connects ∂L/∂W across all three steps:

∂L/∂W = ∂L/∂ŷ · ∂ŷ/∂z · ∂z/∂W

Step 1 — ∂L/∂ŷ

Differentiate the binary cross-entropy loss with respect to ŷ:

L = −[y·log(ŷ) + (1−y)·log(1−ŷ)]
∂L/∂ŷ = −y/ŷ + (1−y)/(1−ŷ)

Combining over the common denominator ŷ(1−ŷ):

∂L/∂ŷ = (ŷ − y) / (ŷ · (1 − ŷ))
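A quick way to sanity-check this derivative is to compare it against a central finite difference. The values of y and ŷ below are arbitrary:

```python
import numpy as np

def bce(y_hat, y):
    """Binary cross-entropy loss for one example."""
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y, y_hat, eps = 1, 0.7, 1e-6

# Numerical derivative via central differences
numeric = (bce(y_hat + eps, y) - bce(y_hat - eps, y)) / (2 * eps)
# Analytic derivative from the formula above
analytic = (y_hat - y) / (y_hat * (1 - y_hat))

print(numeric, analytic)  # the two should agree closely
```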

Step 2 — ∂ŷ/∂z (Sigmoid Derivative)

The sigmoid σ(z) = 1/(1 + e^−z) has a well-known derivative. Differentiating directly:

∂ŷ/∂z = σ(z) · (1 − σ(z)) = ŷ · (1 − ŷ)
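The identity σ′(z) = σ(z)(1 − σ(z)) can likewise be checked numerically at an arbitrary point:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

z, eps = 0.8, 1e-6

# Central-difference estimate of the derivative at z
numeric = (sigmoid(z + eps) - sigmoid(z - eps)) / (2 * eps)
# Closed-form sigmoid derivative
analytic = sigmoid(z) * (1 - sigmoid(z))

print(numeric, analytic)
```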

Step 3 — ∂z/∂W and ∂z/∂b

The linear step z = W·x + b is a simple linear function, so its partial derivatives are:

∂z/∂W = x
∂z/∂b = 1

Combining the Chain Rule — ∂L/∂W

Multiplying all three links together:

∂L/∂W = (ŷ−y) / (ŷ·(1−ŷ)) · ŷ·(1−ŷ) · x

The ŷ(1−ŷ) terms cancel exactly:

∂L/∂W = (ŷ − y) · x
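As a sketch, the simplified gradient (ŷ − y)·x can be verified against finite differences of the full loss, perturbing one weight at a time. The example values are illustrative:

```python
import numpy as np

def loss_of_W(W, x, b, y):
    """Full single-example loss as a function of the weights."""
    y_hat = 1 / (1 + np.exp(-(np.dot(W, x) + b)))
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

x = np.array([0.5, -1.2, 0.3])
W = np.array([0.4, 0.1, -0.2])
b, y, eps = 0.1, 1, 1e-6

y_hat = 1 / (1 + np.exp(-(np.dot(W, x) + b)))
analytic = (y_hat - y) * x  # the simplified gradient

# Central differences, one weight at a time
numeric = np.zeros_like(W)
for j in range(len(W)):
    Wp, Wm = W.copy(), W.copy()
    Wp[j] += eps
    Wm[j] -= eps
    numeric[j] = (loss_of_W(Wp, x, b, y) - loss_of_W(Wm, x, b, y)) / (2 * eps)

print(np.max(np.abs(numeric - analytic)))  # should be tiny
```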

Combining the Chain Rule — ∂L/∂b

The same chain, but the last link is ∂z/∂b = 1:

∂L/∂b = (ŷ−y) / (ŷ·(1−ŷ)) · ŷ·(1−ŷ) · 1
∂L/∂b = ŷ − y
Quick Check

After the chain rule cancellation, what does ∂L/∂b simplify to?

The ŷ(1−ŷ) from the sigmoid derivative cancels perfectly with the denominator in ∂L/∂ŷ. This is not a coincidence — binary cross-entropy and sigmoid are designed as a pair. The gradient simplifies to pure prediction error: (ŷ − y).

Averaging Over All m Examples — From ∂L/∂W to ∂J/∂W

We derived ∂L/∂W for a single training example. But gradient descent minimises J, not L. The bridge is the definition of J itself:

J = (1/m) · Σᵢ L⁽ⁱ⁾

Differentiating both sides with respect to W — differentiation distributes over a sum, and (1/m) is a constant:

∂J/∂W = (1/m) · Σᵢ ∂L⁽ⁱ⁾/∂W

We already know ∂L/∂W = (ŷ − y) · x for a single example. Substituting that in for every example i:

∂J/∂W = (1/m) · Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾) · x⁽ⁱ⁾

The same logic applies to b. Since ∂L/∂b = (ŷ − y):

∂J/∂b = (1/m) · Σᵢ (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾)
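Before vectorising, the averaged gradients can be computed with an explicit loop over examples. This sketch uses a tiny synthetic dataset (all values illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Synthetic dataset: m = 4 examples, n = 2 features
X = np.array([[1.0, 2.0], [0.5, -1.0], [-1.5, 0.3], [2.0, 1.0]])  # (m, n)
y = np.array([1, 0, 0, 1])
W = np.array([0.1, -0.2])
b = 0.0
m = X.shape[0]

dW = np.zeros_like(W)
db = 0.0
for i in range(m):
    y_hat = sigmoid(np.dot(W, X[i]) + b)
    dW += (y_hat - y[i]) * X[i]   # per-example contribution to ∂J/∂W
    db += (y_hat - y[i])          # per-example contribution to ∂J/∂b
dW /= m
db /= m

print(dW, db)
```

The vectorised form in the next section computes exactly these numbers without the loop.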

Vectorised Form

Before removing the loop, we stack the per-example values into matrices:

| Symbol | Shape | Contains |
| --- | --- | --- |
| A | (1, m) | All predicted probabilities — ŷ⁽¹⁾, ŷ⁽²⁾, …, ŷ⁽ᵐ⁾ |
| Y | (1, m) | All true labels — y⁽¹⁾, y⁽²⁾, …, y⁽ᵐ⁾ |
| X | (m, n) | All input examples stacked as rows |

The element-wise difference (A − Y) gives the prediction error for every example simultaneously — it is the vectorised form of (ŷ⁽ⁱ⁾ − y⁽ⁱ⁾). The sum over i in the gradient formulas then becomes a matrix multiply:

∂J/∂W = (1/m) · Xᵀ · (A − Y)ᵀ
∂J/∂b = (1/m) · Σ(A − Y)

The transposes are purely about making the shapes compatible. X is (m, n) and (A − Y) is (1, m) — multiplying them directly gives no valid shape. Transposing X gives (n, m) and transposing (A − Y) gives (m, 1). Then (n, m) · (m, 1) = (n, 1), which matches the shape of W exactly — one gradient value per weight.

| Quantity | Shape | Meaning |
| --- | --- | --- |
| A − Y | (1, m) | Prediction error per example |
| Xᵀ · (A − Y)ᵀ | (n, m)·(m, 1) = (n, 1) | Gradient for each weight |
| Σ(A − Y) | scalar | Gradient for the bias |

The Gradient in Code

```python
import numpy as np

def compute_gradients(X, Y, A):
    """
    X: (m, n) — input examples stacked as rows
    Y: (1, m) — true labels
    A: (1, m) — predicted probabilities
    Returns dW: (n, 1), db: scalar
    """
    m = X.shape[0]
    dZ = A - Y                       # (1, m) — prediction error
    dW = (1/m) * np.dot(X.T, dZ.T)   # (n, m) dot (m, 1) = (n, 1)
    db = (1/m) * np.sum(dZ)          # scalar
    return dW, db
```
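As a usage sketch, here is a shape check on random data. The function is repeated from above so the snippet runs on its own; the data is synthetic:

```python
import numpy as np

def compute_gradients(X, Y, A):
    m = X.shape[0]
    dZ = A - Y
    dW = (1/m) * np.dot(X.T, dZ.T)
    db = (1/m) * np.sum(dZ)
    return dW, db

# Random data with m = 5 examples and n = 3 features
rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))                     # (m, n)
Y = rng.integers(0, 2, size=(1, 5)).astype(float)  # (1, m)
A = 1 / (1 + np.exp(-rng.normal(size=(1, 5))))  # (1, m), probabilities

dW, db = compute_gradients(X, Y, A)
print(dW.shape)  # (3, 1) — one gradient value per weight
```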

The Update Rule

After computing gradients, you update W and b by subtracting a fraction of the gradient. The fraction is the learning rate α — a hyperparameter you choose. A learning rate that is too large causes the cost to oscillate and diverge. Too small, and training takes too long.

```python
def update_parameters(W, b, dW, db, learning_rate):
    W = W - learning_rate * dW
    b = b - learning_rate * db
    return W, b
```

The learning rate α is the most important hyperparameter to tune. Start with 0.01. If the cost decreases smoothly, training is working. If the cost jumps around or increases, reduce the learning rate.
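The effect of α can be seen on a toy one-dimensional quadratic f(w) = w², whose gradient is 2w. This is purely illustrative, not part of the logistic regression pipeline:

```python
def run(alpha, steps=20, w0=1.0):
    """Run gradient descent on f(w) = w^2 and return the final distance from the minimum."""
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w  # gradient of w^2 is 2w
    return abs(w)

print(run(0.01))   # small alpha: slow but steady shrink towards 0
print(run(0.45))   # well-chosen alpha: rapid convergence
print(run(1.1))    # too large: |w| grows each step — the cost diverges
```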


Full Training Loop

With the forward pass, cost, gradients, and update rule all defined, the training loop is just these four steps repeated. The predict function is defined in the previous module — it applies forward_pass and thresholds at 0.5.

```python
def train_logistic_regression(X_train, Y_train, learning_rate=0.01, num_iterations=1000):
    m, n = X_train.shape
    W, b = np.zeros((n, 1)), 0.0
    costs = []

    for i in range(num_iterations):
        # Forward pass
        A = forward_pass(X_train, W, b)

        # Compute cost
        cost = compute_cost(A, Y_train)

        # Backward pass — gradients
        dW, db = compute_gradients(X_train, Y_train, A)

        # Update parameters
        W, b = update_parameters(W, b, dW, db, learning_rate)

        if i % 100 == 0:
            costs.append(cost)
            print(f"Cost at iteration {i}: {cost:.4f}")

    return W, b, costs
```

Evaluating the Trained Model

After training, you evaluate on the test set to check generalisation. Training accuracy tells you how well the model fits the training data. Test accuracy tells you how well it generalises to unseen data. Plotting the cost over iterations confirms whether gradient descent actually converged.

```python
import matplotlib.pyplot as plt

W, b, costs = train_logistic_regression(X_train, Y_train, learning_rate=0.005, num_iterations=2000)

# Accuracy
train_preds = predict(X_train, W, b)
test_preds  = predict(X_test,  W, b)

train_acc = np.mean(train_preds == Y_train) * 100
test_acc  = np.mean(test_preds  == Y_test)  * 100

print(f"Train accuracy: {train_acc:.1f}%")
print(f"Test accuracy:  {test_acc:.1f}%")

# Cost curve — should decrease smoothly; oscillation means α is too large
plt.plot(costs)
plt.xlabel('Iterations (×100)')
plt.ylabel('Cost J')
plt.title('Training Cost Over Time')
plt.show()
```

Gradient Descent Variants

The version above is batch gradient descent — it uses all m training examples to compute each gradient update. For large datasets this is slow. Mini-batch gradient descent splits the data into small batches and updates once per batch — much faster.

| Variant | Batch size | Update frequency | Pros | Cons |
| --- | --- | --- | --- | --- |
| Batch | All m | Once per epoch | Stable, exact gradients | Slow for large datasets |
| Stochastic (SGD) | 1 | m times per epoch | Fast updates | Noisy, hard to converge |
| Mini-batch | 32–256 | m/batch per epoch | Balance of speed and stability | Extra hyperparameter |

Mini-batch gradient descent with batch size 32 or 64 is the standard in practice. PyTorch and TensorFlow DataLoaders handle the batching automatically.
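A minimal mini-batch loop might look like the sketch below. The sigmoid helper is inlined and the data is synthetic; shuffling the examples each epoch is standard practice:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

rng = np.random.default_rng(0)
m, n, batch_size = 100, 3, 32
X = rng.normal(size=(m, n))                 # synthetic inputs
Y = (rng.random(m) < 0.5).astype(float)     # synthetic labels
W, b, alpha = np.zeros(n), 0.0, 0.1

for epoch in range(5):
    perm = rng.permutation(m)               # shuffle each epoch
    for start in range(0, m, batch_size):
        idx = perm[start:start + batch_size]
        Xb, Yb = X[idx], Y[idx]
        A = sigmoid(Xb @ W + b)             # forward pass on the batch only
        dZ = A - Yb                         # prediction error for the batch
        dW = Xb.T @ dZ / len(idx)           # gradient from this batch
        db = dZ.mean()
        W -= alpha * dW                     # one update per batch
        b -= alpha * db

print(W.shape, b)
```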


Up next

Next, we look at overfitting in logistic regression — what it means when your model fits the training data too well, and how to spot it.
