Machine Learning · Beginner

The Learning Rate

Tags: learning rate, alpha, gradient descent, convergence, hyperparameter, optimization

What Is the Learning Rate?

The learning rate α (alpha) is a positive number you choose before running gradient descent. It controls how large each parameter update step is. It appears in every update rule:

pseudocode
w ← w − α · ∂J/∂w
b ← b − α · ∂J/∂b

The gradient ∂J/∂w tells us the direction — which way J slopes. The learning rate α tells us how far to step in that direction. Everything else in gradient descent is automatic; the learning rate is the one value the developer sets.

α is a hyperparameter — it is set before training, not learned from data. Too large: steps overshoot the minimum and J diverges. Too small: steps are tiny and convergence takes far too long. Just right: J falls smoothly and reaches the minimum efficiently.
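The update rule above can be sketched in a few lines of Python. The one-parameter cost J(w) = (w − 3)² and its gradient 2(w − 3) are illustrative assumptions, not part of the lesson; the minimum sits at w* = 3.

```python
def gradient_descent(alpha, w=0.0, steps=50):
    """Run gradient descent on the toy cost J(w) = (w - 3)**2."""
    for _ in range(steps):
        grad = 2 * (w - 3)    # dJ/dw: the slope of J at the current w
        w = w - alpha * grad  # the update rule: w <- w - alpha * dJ/dw
    return w

print(gradient_descent(alpha=0.1))  # lands very close to the minimum w* = 3
```

Re-running with a much smaller α (say 0.001) leaves w far from 3 after the same 50 steps, previewing the "too small" failure mode below.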

Visualising the Steps

The diagram below shows three runs of gradient descent on the same J(w) curve, each starting from the same point but with a different α.

Diagram
[Animated figure: How the Learning Rate α Affects Each Step. Three panels trace gradient descent on the same J(w) curve from the same start: α too large (overshoots each step, J bounces up and down and never converges), α just right (α = 0.28, J falls smoothly to the minimum), α too small (α = 0.025, J barely decreases; thousands of steps needed). Numbered dots mark each gradient descent step; gold marks the start, green the minimum w*.]
Each dot is one gradient descent step. Gold = starting point, green = minimum.
Top: α too large — J alternately rises and falls, never settling.
Middle: α just right — J falls smoothly to the minimum.
Bottom: α too small — steps are so tiny the curve barely moves.

The arrows below the curve show each horizontal update. Notice how the large-α arrows leap past the minimum and land on the other side, while the optimal-α arrows get progressively shorter as the slope flattens near the minimum.

When α Is Too Large

  • Each step is so big that it overshoots the minimum — the update lands on the other side of the cost bowl.
  • After overshooting, the next gradient points back the other way, so the algorithm bounces back and forth.
  • J oscillates: it may decrease on one iteration and increase on the next, with no consistent progress.
  • In the worst case, each overshoot lands on a higher part of the curve than the last — J diverges, growing without bound.
  • w and b never settle — they oscillate indefinitely and never reach stable values.
  • How to detect it: J increases after an update, or oscillates rather than falling consistently.
  • How to fix it: reduce α — try dividing by 3 or 10 and observe whether J starts falling smoothly.
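These symptoms are easy to reproduce on a toy quadratic cost (J(w) = (w − 3)², an illustrative assumption). For this cost each update multiplies the distance to the minimum by (1 − 2α), so any α above 1 makes every overshoot worse than the last:

```python
def cost(w):
    return (w - 3) ** 2

def cost_history(alpha, w=0.0, steps=6):
    """Record J before and after each gradient descent update."""
    hist = [cost(w)]
    for _ in range(steps):
        w = w - alpha * 2 * (w - 3)  # overshoots the minimum when alpha is too large
        hist.append(cost(w))
    return hist

big = cost_history(alpha=1.5)  # distance to w* doubles and flips sign each step
print(big[:4])                 # [9.0, 36.0, 144.0, 576.0]: J quadruples every step
diverging = big[-1] > big[0]   # the practical detection test: did J go up overall?
```

The last line is exactly the detection rule from the list above: if J is higher now than when training started, α is too large.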

When α Is Too Small

  • Each step moves w and b by a tiny amount — the update is almost invisible on the cost curve.
  • Gradient descent does converge correctly, but it requires an extremely large number of iterations.
  • After thousands of steps, J has barely decreased from its starting value.
  • Compute time and iteration budget are wasted on steps that contribute almost nothing.
  • How to detect it: J decreases but the change per iteration is negligibly small even after many thousands of steps.
  • How to fix it: increase α — try multiplying by 3 or 10 while watching that J does not start oscillating.
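On the same assumed toy quadratic (J(w) = (w − 3)²), counting the iterations needed to get within a tolerance of the minimum makes the cost of a too-small α concrete:

```python
def steps_to_converge(alpha, w=0.0, tol=0.01, max_steps=100_000):
    """Iterations until w is within tol of the minimum w* = 3 of J(w) = (w - 3)**2."""
    for step in range(max_steps):
        if abs(w - 3) < tol:
            return step
        w = w - alpha * 2 * (w - 3)  # one gradient descent update
    return max_steps

print(steps_to_converge(alpha=0.1))    # a few dozen steps
print(steps_to_converge(alpha=0.001))  # thousands of steps for the same accuracy
```

Both runs converge to the same place; the small-α run simply burns roughly a hundred times the iteration budget to get there.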

When α Is Just Right

  • J decreases consistently after every single update — no oscillations, no reversals.
  • The step size is large enough to make real progress, but small enough not to overshoot.
  • As the algorithm approaches the minimum, the gradient naturally gets smaller — so the steps get shorter automatically, without changing α.
  • The algorithm converges to the minimum in a reasonable number of iterations.
  • How to identify it: plot J vs iteration number — the curve should fall steeply at first, then level off smoothly as it approaches the minimum.
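Both properties of a healthy run can be checked numerically. The sketch below records J at every iteration on an assumed toy cost J(w) = (w − 3)² and verifies that a well-chosen α gives a strictly falling curve whose per-step improvement shrinks near the minimum:

```python
def record_costs(alpha, w=0.0, steps=40):
    """Record J at every iteration of gradient descent on J(w) = (w - 3)**2."""
    hist = []
    for _ in range(steps):
        hist.append((w - 3) ** 2)    # J before this update
        w = w - alpha * 2 * (w - 3)  # gradient step
    return hist

hist = record_costs(alpha=0.1)
# J decreases after every single update: no oscillations, no reversals.
assert all(later < earlier for earlier, later in zip(hist, hist[1:]))
# Early steps improve J far more than late ones: the curve levels off.
assert (hist[0] - hist[1]) > 100 * (hist[-2] - hist[-1])
```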

Quick Check

After 200 iterations of gradient descent, you notice J(w,b) has increased compared to iteration 1. What is the most likely cause and fix?

How α Behaves at the Minimum

Once the algorithm reaches the minimum, the gradient ∂J/∂w equals zero — the cost surface is flat there. The update rule becomes:

pseudocode
w ← w − α · 0 = w

The parameter stops changing. This happens automatically, without reducing α. Even a fixed learning rate brings gradient descent to a complete stop exactly at the minimum.

This is why gradient descent works: the gradient shrinks naturally as the algorithm approaches the minimum, so the effective step size decreases on its own even when α stays constant throughout training.

At the minimum, ∂J/∂w = 0 — the gradient vanishes. The parameter update w ← w − α·0 = w leaves w unchanged. Gradient descent stops automatically. No manual step-size reduction needed.
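A two-line check (toy numbers, assuming the same quadratic J(w) = (w − 3)²) confirms this:

```python
w = 3.0                  # exactly at the minimum of J(w) = (w - 3)**2
grad = 2 * (w - 3)       # the cost surface is flat here, so the gradient is 0
w_next = w - 0.5 * grad  # even a large alpha (0.5) changes nothing
print(grad, w_next)      # 0.0 3.0: the update leaves w where it is
```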

Quick Check

Why does gradient descent stop updating w and b when it reaches the minimum of J, even though α stays constant throughout training?

How the Loss Curve Looks

The most practical way to set α is to watch how J evolves over iterations.

Diagram
[Figure: Training loss J vs. iteration (0 to 80) for three learning rates: α too high diverges, α just right converges, α too low stalls.]
Training loss across iterations for three learning rates.
Red: α too high — J oscillates with growing amplitude and diverges.
Green: α just right — J falls smoothly to a low value.
Grey dashed: α too low — J barely decreases after many iterations.

The ideal loss curve drops quickly in the early iterations, then flattens as it approaches the minimum. If J oscillates or rises, reduce α immediately. If J barely moves after hundreds of iterations, increase α.
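That rule of thumb can be written down directly. The heuristic below is a simple sketch of the advice in this section, not a standard library routine; the factor of 10 and the stall threshold are assumptions to tune per problem:

```python
def suggest_alpha(alpha, costs, stall_tol=1e-4):
    """Adjust alpha from a recorded loss curve (list of J values per iteration)."""
    if costs[-1] > costs[0]:              # J rose overall: oscillating or diverging
        return alpha / 10
    if costs[0] - costs[-1] < stall_tol:  # J barely moved: stalled
        return alpha * 10
    return alpha                          # J falling smoothly: keep alpha

print(suggest_alpha(0.1, [9.0, 36.0, 144.0]))  # diverging -> shrink alpha
print(suggest_alpha(0.001, [9.0, 8.99999]))    # stalled   -> grow alpha
print(suggest_alpha(0.01, [9.0, 4.0, 1.8]))    # healthy   -> unchanged
```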

Choosing a Starting Value

There is no single correct α — it depends on the problem, the data scale, and the cost function. A reliable starting approach:

  • Start at α = 0.001 and run for a fixed number of iterations.
  • If J oscillates or diverges: try α = 0.0001 (divide by 10).
  • If J decreases but very slowly: try α = 0.01 (multiply by 10).
  • Once J falls smoothly, fine-tune within that order of magnitude.
  • Common values used in practice: 0.0001, 0.001, 0.01, 0.1 — always tuned per problem.
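This search is easy to automate. The sketch below tries the common values on an assumed toy cost J(w) = (w − 3)² and keeps whichever leaves J lowest after a fixed iteration budget:

```python
def final_cost(alpha, w=0.0, steps=100):
    """J after a fixed budget of gradient descent steps on J(w) = (w - 3)**2."""
    for _ in range(steps):
        w = w - alpha * 2 * (w - 3)
    return (w - 3) ** 2

candidates = [0.0001, 0.001, 0.01, 0.1]
best = min(candidates, key=final_cost)  # pick the alpha with the lowest final J
print(best)
```

On this toy problem the largest stable candidate wins, but the ranking depends entirely on the cost surface and data scale, which is why the values are always re-tuned per problem.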

The learning rate is a hyperparameter — a configuration value you set before training, not a value the algorithm learns from the data.

Quick Check

The learning rate α is called a hyperparameter. What makes it different from w and b?

Test Your Knowledge

Ready to check how much you remember? Take the quiz for The Learning Rate and see your score on the leaderboard.

Take the Quiz

Up next

In the next module, we extend gradient descent to multiple inputs — adding rainfall, temperature, and soil quality to the crop yield model and writing the full multi-parameter update loop in Python.

Regression with Multiple Inputs