What Is the Learning Rate?
The learning rate α (alpha) is a positive number you choose before running gradient descent. It controls how large each parameter update step is. It appears in every update rule:
w ← w − α · ∂J/∂w
b ← b − α · ∂J/∂b
The gradient ∂J/∂w tells us the direction — which way J slopes. The learning rate α tells us how far to step in that direction. Everything else in gradient descent is automatic; the learning rate is the one value the developer sets.
α is a hyperparameter — it is set before training, not learned from data.
Too large: steps overshoot the minimum and J diverges.
Too small: steps are tiny and convergence takes far too long.
Just right: J falls smoothly and reaches the minimum efficiently.
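The update rule above can be sketched in a few lines. This is an illustrative toy, not the text's own code: it assumes a one-parameter cost J(w) = w², whose gradient is ∂J/∂w = 2w.

```python
def gradient_descent(alpha, w0=10.0, steps=50):
    """Run gradient descent on J(w) = w^2 and return the final w."""
    w = w0
    for _ in range(steps):
        grad = 2 * w          # dJ/dw for J(w) = w^2
        w = w - alpha * grad  # the update rule: w <- w - alpha * dJ/dw
    return w

print(gradient_descent(alpha=0.1))  # moves toward the minimum at w = 0
```

With α = 0.1, each update multiplies w by (1 − 2α) = 0.8, so w shrinks steadily toward the minimum.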
Visualising the Steps
The diagram below shows three runs of gradient descent on the same J(w) curve, each starting from the same point but with a different α.
Top: α too large — J alternately rises and falls, never settling.
Middle: α just right — J falls smoothly to the minimum.
Bottom: α too small — steps are so tiny the curve barely moves.
The arrows below the curve show each horizontal update. Notice how the large-α arrows leap past the minimum and land on the other side, while the optimal-α arrows get progressively shorter as the slope flattens near the minimum.
When α Is Too Large
- Each step is so big that it overshoots the minimum — the update lands on the other side of the cost bowl.
- After overshooting, the next gradient points back the other way, so the algorithm bounces back and forth.
- J oscillates: it may decrease on one iteration and increase on the next, with no consistent progress.
- In the worst case, each overshoot lands on a higher part of the curve than the last — J diverges, growing without bound.
- w and b never settle — they oscillate indefinitely and never reach stable values.
- How to detect it: J increases after an update, or oscillates rather than falling consistently.
- How to fix it: reduce α — try dividing by 3 or 10 and observe whether J starts falling smoothly.
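The divergence failure mode is easy to reproduce on the toy cost J(w) = w² (an illustrative example, not from the text). There the update multiplies w by (1 − 2α), so for α > 1 each step overshoots to a point farther from the minimum than the last.

```python
def run(alpha, w0=1.0, steps=10):
    """Track the cost J = w^2 over a run of gradient descent."""
    w = w0
    costs = []
    for _ in range(steps):
        w = w - alpha * 2 * w  # dJ/dw = 2w
        costs.append(w ** 2)
    return costs

costs = run(alpha=1.1)
print(costs[:3])  # each cost exceeds the last: J is diverging
```

Dropping α back to something like 0.1 makes the same loop converge, which is exactly the divide-by-3-or-10 fix described above.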
When α Is Too Small
- Each step moves w and b by a tiny amount — the update is almost invisible on the cost curve.
- Gradient descent does converge correctly, but it requires an extremely large number of iterations.
- After thousands of steps, J has barely decreased from its starting value.
- Compute time and iteration budget are wasted on steps that contribute almost nothing.
- How to detect it: J decreases but the change per iteration is negligibly small even after many thousands of steps.
- How to fix it: increase α — try multiplying by 3 or 10 while watching that J does not start oscillating.
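The too-small case can be seen on the same toy cost J(w) = w² (again an illustrative example): with a very small α, even a thousand iterations leave J close to its starting value.

```python
def final_cost(alpha, w0=10.0, steps=1000):
    """Return J = w^2 after running gradient descent on J(w) = w^2."""
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w  # dJ/dw = 2w
    return w ** 2

print(final_cost(alpha=1e-5))  # still near the starting cost of 100
```

Multiplying α by 10 or 100 in this sketch makes the same iteration budget go dramatically further.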
When α Is Just Right
- J decreases consistently after every single update — no oscillations, no reversals.
- The step size is large enough to make real progress, but small enough not to overshoot.
- As the algorithm approaches the minimum, the gradient naturally gets smaller — so the steps get shorter automatically, without changing α.
- The algorithm converges to the minimum in a reasonable number of iterations.
- How to identify it: plot J vs iteration number — the curve should fall steeply at first, then level off smoothly as it approaches the minimum.
After 200 iterations of gradient descent, you notice J(w,b) has increased compared to iteration 1. What is the most likely cause and fix?
How α Behaves at the Minimum
Once the algorithm reaches the minimum, the gradient ∂J/∂w equals zero — the cost surface is flat there. The update rule becomes:
w ← w − α · 0 = w
The parameter stops changing. This happens automatically, without reducing α. Even a fixed learning rate brings gradient descent to a complete stop exactly at the minimum.
This is why gradient descent works: the gradient shrinks naturally as the algorithm approaches the minimum, so the effective step size decreases on its own even when α stays constant throughout training.
At the minimum, ∂J/∂w = 0 — the gradient vanishes.
The parameter update w ← w − α·0 = w leaves w unchanged.
Gradient descent stops automatically. No manual step-size reduction needed.
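A one-step check makes the stopping behaviour concrete. This sketch again assumes the illustrative cost J(w) = w², whose minimum is at w = 0.

```python
alpha = 0.5
w = 0.0                    # already at the minimum of J(w) = w^2
grad = 2 * w               # dJ/dw = 0 at the minimum
w_next = w - alpha * grad  # the update w <- w - alpha * 0
print(w_next == w)         # True: w is left exactly unchanged
```

No schedule or manual reduction of α is involved; the zero gradient alone halts the update.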
Why does gradient descent stop updating w and b when it reaches the minimum of J, even though α stays constant throughout training?
How the Loss Curve Looks
The most practical way to set α is to watch how J evolves over iterations.
Red: α too high — J oscillates with growing amplitude and diverges.
Green: α just right — J falls smoothly to a low value.
Grey dashed: α too low — J barely decreases after many iterations.
The ideal loss curve drops quickly in the early iterations, then flattens as it approaches the minimum. If J oscillates or rises, reduce α immediately. If J barely moves after hundreds of iterations, increase α.
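The two diagnostics above can be automated with a small monitor over the recorded J values. The function name and thresholds here are illustrative assumptions, not part of the text.

```python
def monitor(costs):
    """Give tuning advice from a list of J values, one per iteration."""
    if any(b > a for a, b in zip(costs, costs[1:])):
        return "J rose at some iteration: reduce alpha"
    if costs[-1] > 0.99 * costs[0]:
        return "J barely moved: increase alpha"
    return "J falls smoothly: alpha looks reasonable"

print(monitor([100, 60, 40, 30, 25]))
```

A rising value anywhere in the history flags the too-large case; a final cost within 1% of the start flags the too-small case.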
Choosing a Starting Value
There is no single correct α — it depends on the problem, the data scale, and the cost function. A reliable starting approach:
- Start at α = 0.001 and run for a fixed number of iterations.
- If J oscillates or diverges: try α = 0.0001 (divide by 10).
- If J decreases but very slowly: try α = 0.01 (multiply by 10).
- Once J falls smoothly, fine-tune within that order of magnitude.
- Common values used in practice: 0.0001, 0.001, 0.01, 0.1 — always tuned per problem.
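The sweep over candidate values can be sketched as a loop, again on the illustrative cost J(w) = w² with a simple divergence guard (both assumptions of this sketch, not prescriptions from the text).

```python
def final_cost(alpha, w0=10.0, steps=100):
    """Final J = w^2 after gradient descent; inf if the run diverged."""
    w = w0
    for _ in range(steps):
        w -= alpha * 2 * w     # dJ/dw = 2w
        if abs(w) > 1e6:       # treat blow-up as divergence
            return float("inf")
    return w ** 2

for alpha in [0.0001, 0.001, 0.01, 0.1]:
    print(alpha, final_cost(alpha))
```

On this toy problem each factor-of-10 increase in α (up to the stable range) yields a markedly lower final cost for the same iteration budget, which is why the sweep is spaced by orders of magnitude.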
The learning rate is a hyperparameter — a configuration value you set before training, not a value the algorithm learns from the data.
The learning rate α is called a hyperparameter. What makes it different from w and b?
