Machine Learning · Intermediate

Gradient Descent & Optimization

Tags: gradient descent, optimization, Adam, SGD, learning rate, backpropagation

What Is Gradient Descent?

Training a model means finding the parameter values that minimise a loss function — a measure of how wrong the model's predictions are. Gradient descent is the algorithm that does this: it reads the gradient of the loss at the current parameters and moves in the direction that reduces it fastest.

This course covers gradient descent variants, adaptive optimizers, and learning rate strategies. It builds on Introduction to Machine Learning and pairs with Overfitting & Regularization.

# One step of gradient descent (PyTorch-style; gradients come from backprop)
with torch.no_grad():  # don't track the update itself in autograd
    for param in model.parameters():
        param -= learning_rate * param.grad

Three Variants

All three variants follow the same logic: compute a gradient, update parameters. They differ only in how much data they use to estimate that gradient.

Variant       | Data per update | Gradient quality | When to use
Batch GD      | Full dataset    | Most accurate    | Small datasets only
SGD           | 1 example       | Very noisy       | Rarely used alone
Mini-Batch GD | 32–512 examples | Good balance     | Standard practice

Mini-batch is the practical default. GPU hardware is optimised for matrix operations on batches. Whether the moderate gradient noise improves generalisation by helping escape sharp minima is actively debated — the mechanism is not fully understood.
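The mini-batch loop can be sketched end to end on a toy problem. This is an illustrative NumPy example, not code from the course; the data, learning rate, and batch size are choices made here (the batch size sits in the 32–512 range from the table):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy linear-regression data: y = 3x + small noise.
X = rng.normal(size=(1000, 1))
y = 3.0 * X[:, 0] + 0.1 * rng.normal(size=1000)

w = 0.0            # single weight to learn; true value is 3.0
lr = 0.1
batch_size = 64

for epoch in range(20):
    perm = rng.permutation(len(X))               # reshuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = perm[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]
        grad = 2.0 * np.mean((w * xb - yb) * xb)  # d/dw of mean squared error
        w -= lr * grad                            # gradient descent step
```

Each update sees only 64 examples, so the gradient is a noisy but cheap estimate of the full-batch gradient, and the weight still converges to roughly 3.0.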

Quick Check

Why is mini-batch gradient descent preferred over full-batch or single-sample SGD in practice?

The Learning Rate

The learning rate (η) controls how large each parameter update step is. Too high and training diverges or oscillates. Too low and training converges so slowly it is impractical.

Rule of thumb: start with 1e-3 for Adam, 1e-1 for SGD with momentum. Use a learning rate finder to scan for the best value before committing to a schedule.
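The three regimes described above can be reproduced on the simplest possible loss, f(w) = w², whose gradient is 2w. This is a minimal illustration (the specific rates and step counts are chosen here, not taken from the course):

```python
# Gradient descent on f(w) = w^2 with three learning rates.
def descend(lr, steps=50, w0=2.5):
    w = w0
    for _ in range(50 if steps is None else steps):
        w -= lr * 2 * w     # gradient step: f'(w) = 2w
    return w

too_low  = descend(lr=0.001)  # barely moves from the start point
good     = descend(lr=0.1)    # descends smoothly toward 0
too_high = descend(lr=1.1)    # each step overshoots: |w| grows, training diverges
```

With lr = 1.1 the update multiplies w by (1 − 2.2) = −1.2 each step, so the iterate flips sign and grows, which is exactly the oscillating divergence the diagram shows.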

[Diagram: training loss over 50 epochs for three learning rates]
A too-high learning rate oscillates and fails to converge; a too-low rate makes negligible progress. The optimal rate descends smoothly to a low loss.

Momentum

Standard gradient descent oscillates in narrow valleys — it takes large steps across the narrow dimension and small steps in the direction of progress. Momentum fixes this by accumulating a running average of past gradients. It builds speed in consistent directions and dampens oscillation in high-curvature ones.

# SGD with momentum
velocity = momentum * velocity - learning_rate * gradient
param += velocity

Adaptive Optimizers

RMSProp

Maintains a running average of squared gradients per parameter. Parameters with large gradients get a smaller effective learning rate, preventing them from dominating updates. It is well suited, for example, to recurrent networks, where gradient magnitudes vary widely across time steps.
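The per-parameter scaling can be written out in a few lines. This is a sketch of the RMSProp update, not the course's code; the function name and the decay = 0.9, eps = 1e-8 values are common defaults assumed here:

```python
import numpy as np

def rmsprop_step(param, grad, sq_avg, lr=0.01, decay=0.9, eps=1e-8):
    sq_avg = decay * sq_avg + (1 - decay) * grad**2       # running avg of squared grads
    param = param - lr * grad / (np.sqrt(sq_avg) + eps)   # per-parameter scaled step
    return param, sq_avg

# Minimize f(w) = sum(w^2); gradient is 2w.
w = np.array([5.0, -3.0])
sq = np.zeros_like(w)
for _ in range(1000):
    w, sq = rmsprop_step(w, 2 * w, sq)
```

Dividing by the root of the squared-gradient average roughly normalizes each coordinate's step size, which is why both components approach zero at a similar pace despite different starting magnitudes.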

Adam (Adaptive Moment Estimation)

The most widely used optimizer in deep learning. Adam combines momentum with RMSProp — each parameter gets its own adaptive learning rate based on the running mean of gradients and the running mean of squared gradients.

  • m = β₁ · m + (1 − β₁) · g (first moment, default β₁ = 0.9)
  • v = β₂ · v + (1 − β₂) · g² (second moment, default β₂ = 0.999)
  • param -= lr · m̂ / (√v̂ + ε) (bias-corrected update, with m̂ = m / (1 − β₁ᵗ) and v̂ = v / (1 − β₂ᵗ) at step t)
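The three update rules translate directly into code. This is an illustrative NumPy sketch using the default β values listed above; the learning rate and ε = 1e-8 are common choices assumed here, not prescribed by the course:

```python
import numpy as np

def adam_step(param, grad, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad        # first moment: running mean of gradients
    v = b2 * v + (1 - b2) * grad**2     # second moment: running mean of squared grads
    m_hat = m / (1 - b1**t)             # bias correction for the zero initialization
    v_hat = v / (1 - b2**t)
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v

# Minimize f(w) = sum(w^2); gradient is 2w.
w = np.array([5.0, -3.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for t in range(1, 2001):                # t starts at 1 so bias correction is defined
    w, m, v = adam_step(w, 2 * w, m, v, t)
```

Without the bias correction, m and v start at zero and would understate the true moments during the first steps, making early updates too small.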
Quick Check

What two gradient statistics does the Adam optimizer maintain per parameter?

Learning Rate Schedules

A fixed learning rate is rarely optimal across the full training run. Common schedules:

  • Step Decay: reduce by a factor every N epochs. Simple and predictable.
  • Cosine Annealing: decay following a cosine curve, optionally with warm restarts.
  • Linear Warmup: start small, ramp up, then decay. Standard for Transformer training.
  • Cyclical LR: oscillate between a minimum and maximum. Can help escape sharp minima.
  • Reduce on Plateau: decrease when the validation metric stops improving.
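Two of the schedules above have simple closed forms. These are toy implementations with parameter names chosen here for illustration; real frameworks expose schedules as scheduler objects rather than bare functions:

```python
import math

def step_decay(base_lr, epoch, drop=0.5, every=10):
    # Reduce the learning rate by `drop` every `every` epochs.
    return base_lr * drop ** (epoch // every)

def cosine_annealing(base_lr, epoch, total_epochs, min_lr=0.0):
    # Decay from base_lr to min_lr along a cosine curve.
    return min_lr + 0.5 * (base_lr - min_lr) * (
        1 + math.cos(math.pi * epoch / total_epochs)
    )
```

Step decay produces a staircase, while cosine annealing decays smoothly: fast in the middle of training and gently at both ends.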

Loss Landscape Concepts

  • Local minimum: lower than its immediate neighbours, but not the global lowest point.
  • Saddle point: gradient is zero, but the landscape curves up in some directions and down in others — not a true minimum.
  • Plateau: flat region where gradients are near-zero, stalling training progress.
  • Sharp vs flat minima: flat minima are often associated with better generalisation, and optimizers like Sharpness-Aware Minimization (SAM) explicitly seek them. This is, however, actively debated — Dinh et al. (2017) showed that flatness can be made arbitrarily large by reparametrising a network without changing its behaviour, calling into question whether flatness is causal or merely correlated with good generalisation.
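The saddle-point definition is easy to verify numerically. A standard example (chosen here, not from the course) is f(x, y) = x² − y², where the gradient vanishes at the origin even though it is not a minimum:

```python
def f(x, y):
    return x**2 - y**2

def grad(x, y):
    return (2 * x, -2 * y)   # analytic gradient of f

gx, gy = grad(0.0, 0.0)      # both components are zero at the origin
up = f(0.1, 0.0)             # positive: the surface curves up along x
down = f(0.0, 0.1)           # negative: the surface curves down along y
```

Because the gradient is exactly zero at the saddle, plain gradient descent receives no signal there, which is why such points can stall training.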

Practical Tips

  • Gradient clipping: cap gradient norms to prevent exploding gradients. This is essential, for example, in RNNs and Transformers, where long sequences can amplify gradients.
  • Weight initialization: use He initialization for ReLU networks, Glorot/Xavier for tanh. Poor initialization causes vanishing or exploding gradients from step one.
  • Batch size: larger batches give more accurate gradients but require proportional learning rate scaling. Large batches have been observed to converge to sharper minima in some settings, though subsequent research suggests this may be a learning rate effect rather than batch size per se — the relationship is not fully settled.
  • Mixed precision: use float16 or bfloat16 for forward and backward passes to roughly halve memory and speed up training on modern GPUs, typically while keeping a float32 master copy of the weights for numerical stability.
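Gradient clipping by global norm, from the first tip above, is a one-liner to sketch. The function name here is an assumption for illustration; frameworks provide their own versions, such as PyTorch's torch.nn.utils.clip_grad_norm_:

```python
import numpy as np

def clip_by_norm(grad, max_norm=1.0):
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)   # rescale so the norm equals max_norm
    return grad

clipped = clip_by_norm(np.array([3.0, 4.0]))   # norm 5.0: gets rescaled down
small = clip_by_norm(np.array([0.1, 0.2]))     # under the cap: left unchanged
```

Rescaling the whole gradient vector preserves its direction; only the step length is capped, which is what makes clipping safe to apply every iteration.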

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Gradient Descent & Optimization and see your score on the leaderboard.

Take the Quiz