What Is a Cost Function?
A cost function measures how well — or how badly — a model is performing. It takes the model's predictions, compares them to the actual correct labels, and returns a single number summarising how far off the predictions are across the entire training set.
The lower the cost, the better the model fits the data. The goal of training is to find the parameters w and b that minimise the cost.
• Why: The cost tells us how well the model performs — one number, not m separate errors.
• Where: Used throughout training to guide every parameter update.
• How: Compare ŷ(i) (predicted) to y(i) (actual), square each difference, average.
• Goal: Minimise J(w, b) by finding the best w and b.
Choosing w and b
Recall the model from the previous module: fw,b(x) = wx + b. Different values of w and b give different lines through the training data. Some fit the data well; others are far off.
Try a few values yourself: some combinations of w and b make the line pass close to most data points; others miss badly.
We need a systematic, numerical way to decide which (w, b) pair is the best. Eyeballing the chart is not reliable and cannot be automated. That is exactly what the cost function provides.
We have two candidate lines for the crop yield data. How do we objectively decide which one fits better?
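Once the cost function is defined in the next section, that decision becomes mechanical: compute one number per line and keep the line with the smaller number. A minimal sketch of the idea, using this module's crop yield data but with two illustrative (w, b) candidates of my own choosing:

```python
# Crop yield data from this module: (rainfall mm, yield t/ha)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]
m = len(y)

def J(w, b):
    # Squared error cost: half the average squared prediction error
    return sum((w * x + b - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)

line_a = (0.020, 0.5)  # illustrative candidate A
line_b = (0.033, 0.0)  # illustrative candidate B
cost_a, cost_b = J(*line_a), J(*line_b)
best = line_a if cost_a < cost_b else line_b
print(f"J(A) = {cost_a:.3f}, J(B) = {cost_b:.3f} → keep {best}")  # line B has the lower cost
```

Whichever pair gives the smaller J fits better — no eyeballing required.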
The Squared Error Cost Function
The most widely used cost function for regression is the squared error cost function. Here is how it is built step by step.
For each training example i, we compute the error — the gap between the model's prediction and the correct label:

error(i) = fw,b(x(i)) − y(i) = ŷ(i) − y(i)
We then square each error, sum the squared errors across all m examples, and divide by 2m:

J(w, b) = (1/2m) · Σᵢ (fw,b(x(i)) − y(i))²
| Symbol | Meaning |
|---|---|
| J(w, b) | The cost — a single number summarising total prediction error |
| m | Number of training examples |
| fw,b(x(i)) | Model's prediction for the i-th input (= ŷ(i)) |
| y(i) | The actual label for the i-th example |
| (ŷ(i) − y(i))² | Squared error for one example |
| Σᵢ | Sum over all m training examples |
| 1 / 2m | Normalises the cost; the ½ is a convention that simplifies calculus later |
J(w, b) = (1/2m) · Σᵢ (ŷ(i) − y(i))²

• Squaring makes all errors positive — a prediction that is too high and one equally too low contribute equally.
• Squaring penalises large errors disproportionately more than small ones.
• Dividing by m averages over all examples; the ½ is convention.
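Both squaring properties are easy to verify numerically (a tiny illustrative check, not part of the module's data):

```python
# Squaring makes over- and under-prediction contribute equally,
# and penalises large errors quadratically.
errors = [0.5, -0.5, 1.0, 2.0]
squared = [e ** 2 for e in errors]
print(squared)  # [0.25, 0.25, 1.0, 4.0]
# +0.5 and -0.5 contribute the same 0.25; doubling the error
# (1.0 → 2.0) quadruples its contribution (1.0 → 4.0).
```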
Why do we square the error (ŷ − y) rather than summing the raw differences?
Visualising the Cost Function
Blue = low J (good fit). Red = high J (poor fit). Gold dot = w*, b* — the optimal parameters.
Every training algorithm's job is to find that gold dot.
J(w, b) is a three-dimensional surface — one cost value for every possible combination of w and b. The surface is a smooth bowl with a single lowest point. That point is w*, b* — the parameter values where the model fits the training data best.
The colour tells you how well the model is performing at that (w, b) pair: blue regions are low-cost (good fit), red regions are high-cost (poor fit). Training is the process of sliding down the bowl to the gold dot at the bottom.
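One crude way to picture that search is to brute-force J over a coarse grid of (w, b) values; the grid ranges below are my own illustrative choices, and real training algorithms find the minimum far more efficiently than exhaustive search:

```python
# Brute-force scan of the J(w, b) surface over a coarse grid.
# Each grid cell corresponds to one colour patch on the bowl.
X = [80, 100, 120, 145, 160, 185]   # rainfall (mm)
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]  # yield (t/ha)
m = len(y)

def J(w, b):
    # Squared error cost for one (w, b) pair
    return sum((w * x + b - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)

ws = [i / 1000 for i in range(0, 61)]   # w from 0.000 to 0.060
bs = [i / 10 for i in range(-20, 21)]   # b from -2.0 to 2.0
w_star, b_star = min(((w, b) for w in ws for b in bs), key=lambda p: J(*p))
print(f"lowest grid point: w = {w_star:.3f}, b = {b_star:.1f}, J = {J(w_star, b_star):.4f}")
```

The printed point is the grid's approximation of the gold dot at the bottom of the bowl.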
For example, the diagram below fixes one candidate line (w = 0.025, b = 1.0) and shows the errors as vertical red bars — one per training example. Each bar is the gap between the model's prediction and the actual label.
When the bars are long, J is large — the line is a poor fit. When the bars are short, J is small — the line is close to the data. Training moves w and b in the direction that makes J smaller after each step.
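In code, those bars are just the per-example gaps for the fixed line w = 0.025, b = 1.0, computed on this module's crop data:

```python
# Length of each "error bar" for the candidate line w = 0.025, b = 1.0
X = [80, 100, 120, 145, 160, 185]   # rainfall (mm)
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]  # yield (t/ha)
w, b = 0.025, 1.0

for x, yi in zip(X, y):
    pred = w * x + b
    # The bar length is the absolute gap between prediction and label
    print(f"x = {x:3d}  prediction = {pred:.3f}  actual = {yi}  bar = {abs(pred - yi):.3f}")
```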
A model produces predictions [3.0, 4.5, 5.8] for training examples with actual labels [3.1, 5.0, 6.1]. Which error bar is the longest?
How Cost Changes with w
To build intuition, consider fixing b at 0 and asking: what happens to J as we vary w alone? Each value of w gives a different slope, a different line, and therefore a different total cost.
The curve is a parabola — it has exactly one minimum. At that minimum, the line fits the training data better than at any other value of w.
Here are four concrete values of w and the cost each one produces, with b fixed at 0:
```python
# Training data: (rainfall mm, yield t/ha)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]
m = len(y)

def cost(w, b=0):
    total = sum((w * x + b - yi) ** 2 for x, yi in zip(X, y))
    return total / (2 * m)

print(f"w = 0.010 → J = {cost(0.010):.3f}")  # too flat — predictions far below actual
print(f"w = 0.020 → J = {cost(0.020):.3f}")  # better, still too low
print(f"w = 0.033 → J = {cost(0.033):.3f}")  # near the minimum — best fit
print(f"w = 0.050 → J = {cost(0.050):.3f}")  # too steep — predictions overshoot

# Output:
# w = 0.010 → J = 4.897 ← high cost, line too flat
# w = 0.020 → J = 1.569 ← cost falling as w rises toward optimal
# w = 0.033 → J = 0.025 ← minimum — closest line to all data points
# w = 0.050 → J = 2.750 ← cost rising again, line now too steep
```

On the J(w) curve, what does the bottom of the parabola — the minimum — represent?
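As a sanity check on where that minimum sits: with b fixed at 0, J(w) is a parabola in w, and setting its derivative dJ/dw to zero gives a closed-form minimiser, w* = Σ x·y / Σ x². A minimal sketch using the same data:

```python
# Closed-form minimiser of J(w, 0): setting dJ/dw = 0 gives
# w* = sum(x * y) / sum(x ** 2)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]

w_star = sum(x * yi for x, yi in zip(X, y)) / sum(x ** 2 for x in X)
print(f"w* = {w_star:.4f}")  # close to the w = 0.033 used above
```

This confirms why w = 0.033 produced the lowest cost of the four candidates tried above.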
The Goal: Minimise J(w, b)
Training a linear regression model reduces to one objective: find w and b that minimise J(w, b).
Once we have those optimal parameters, we have a model f that makes the best possible predictions given the function form we chose (a straight line) and the data we trained on. In the next modules, we will look at gradient descent — the algorithm that systematically moves w and b toward the minimum of J, step by step, until it converges.
