What Is a Cost Function?
A cost function measures how well — or how badly — a model is performing. It takes the model's predictions, compares them to the actual correct labels, and returns a single number summarising how far off the predictions are across the entire training set.
The lower the cost, the better the model fits the data. The goal of training is to find the parameters w and b that minimise the cost.
• Why: The cost tells us how well the model performs — one number, not m separate errors.
• Where: Used throughout training to guide every parameter update.
• How: Compare ŷ(i) (predicted) to y(i) (actual), square each difference, average.
• Goal: Minimise J(w, b) by finding the best w and b.
Choosing w and b
Recall the model from the previous module: fw,b(x) = wx + b. Different values of w and b give different lines through the training data. Some fit the data well; others are far off.
Try a few values yourself: some combinations of w and b make the line pass close to most data points; others miss badly.
We need a systematic, numerical way to decide which (w, b) pair is the best. Eyeballing the chart is not reliable and cannot be automated. That is exactly what the cost function provides.
We have two candidate lines for the crop yield data. How do we objectively decide which one fits better?
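Once the cost function is defined in the next section, that decision becomes mechanical: compute one number per line and keep the line with the smaller number. A minimal sketch of the idea, using this module's crop yield data but with two illustrative (w, b) candidates of my own choosing:

```python
# Crop yield data from this module: (rainfall mm, yield t/ha)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]
m = len(y)

def J(w, b):
    # Squared error cost: half the average squared prediction error
    return sum((w * x + b - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)

line_a = (0.020, 0.5)  # illustrative candidate A
line_b = (0.033, 0.0)  # illustrative candidate B
cost_a, cost_b = J(*line_a), J(*line_b)
best = line_a if cost_a < cost_b else line_b
print(f"J(A) = {cost_a:.3f}, J(B) = {cost_b:.3f} → keep {best}")  # line B has the lower cost
```

Whichever pair gives the smaller J fits better — no eyeballing required.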
The Squared Error Cost Function
The most widely used cost function for regression is the squared error cost function. Here is how it is built step by step.
For each training example i, we compute the error — the gap between the model's prediction and the correct label:

error(i) = fw,b(x(i)) − y(i) = ŷ(i) − y(i)
We then square each error, sum the squared errors across all m examples, and divide by 2m:

J(w, b) = (1/2m) · Σᵢ (fw,b(x(i)) − y(i))²
| Symbol | Meaning |
|---|---|
| J(w, b) | The cost — a single number summarising total prediction error |
| m | Number of training examples |
| fw,b(x(i)) | Model's prediction for the i-th input (= ŷ(i)) |
| y(i) | The actual label for the i-th example |
| (ŷ(i) − y(i))² | Squared error for one example |
| Σᵢ | Sum over all m training examples |
| 1 / 2m | Normalises the cost; the ½ is a convention that simplifies calculus later |
J(w, b) = (1/2m) · Σᵢ (ŷ(i) − y(i))²

• Squaring makes all errors positive — a prediction that is too high and one equally too low contribute equally.
• Squaring penalises large errors disproportionately more than small ones.
• Dividing by m averages over all examples; the ½ is convention.
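Both squaring properties are easy to verify numerically (a tiny illustrative check, not part of the module's data):

```python
# Squaring makes over- and under-prediction contribute equally,
# and penalises large errors quadratically.
errors = [0.5, -0.5, 1.0, 2.0]
squared = [e ** 2 for e in errors]
print(squared)  # [0.25, 0.25, 1.0, 4.0]
# +0.5 and -0.5 contribute the same 0.25; doubling the error
# (1.0 → 2.0) quadruples its contribution (1.0 → 4.0).
```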
Why do we square the error (ŷ − y) rather than summing the raw differences?
Visualising the Cost Function
Blue = low J (good fit). Red = high J (poor fit). Gold dot = w*, b* — the optimal parameters.
Every training algorithm's job is to find that gold dot.
J(w, b) is a three-dimensional surface — one cost value for every possible combination of w and b. The surface is a smooth bowl with a single lowest point. That point is w*, b* — the parameter values where the model fits the training data best.
The colour tells you how well the model is performing at that (w, b) pair: blue regions are low-cost (good fit), red regions are high-cost (poor fit). Training is the process of sliding down the bowl to the gold dot at the bottom.
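One crude way to picture that search is to brute-force J over a coarse grid of (w, b) values; the grid ranges below are my own illustrative choices, and real training algorithms find the minimum far more efficiently than exhaustive search:

```python
# Brute-force scan of the J(w, b) surface over a coarse grid.
# Each grid cell corresponds to one colour patch on the bowl.
X = [80, 100, 120, 145, 160, 185]   # rainfall (mm)
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]  # yield (t/ha)
m = len(y)

def J(w, b):
    # Squared error cost for one (w, b) pair
    return sum((w * x + b - yi) ** 2 for x, yi in zip(X, y)) / (2 * m)

ws = [i / 1000 for i in range(0, 61)]   # w from 0.000 to 0.060
bs = [i / 10 for i in range(-20, 21)]   # b from -2.0 to 2.0
w_star, b_star = min(((w, b) for w in ws for b in bs), key=lambda p: J(*p))
print(f"lowest grid point: w = {w_star:.3f}, b = {b_star:.1f}, J = {J(w_star, b_star):.4f}")
```

The printed point is the grid's approximation of the gold dot at the bottom of the bowl.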
For example, the diagram below fixes one candidate line (w = 0.025, b = 1.0) and shows the errors as vertical red bars — one per training example. Each bar is the gap between the model's prediction and the actual label.
When the bars are long, J is large — the line is a poor fit. When the bars are short, J is small — the line is close to the data. Training moves w and b in the direction that makes J smaller after each step.
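In code, those bars are just the per-example gaps for the fixed line w = 0.025, b = 1.0, computed on this module's crop data:

```python
# Length of each "error bar" for the candidate line w = 0.025, b = 1.0
X = [80, 100, 120, 145, 160, 185]   # rainfall (mm)
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]  # yield (t/ha)
w, b = 0.025, 1.0

for x, yi in zip(X, y):
    pred = w * x + b
    # The bar length is the absolute gap between prediction and label
    print(f"x = {x:3d}  prediction = {pred:.3f}  actual = {yi}  bar = {abs(pred - yi):.3f}")
```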
A model produces predictions [3.0, 4.5, 5.8] for training examples with actual labels [3.1, 5.0, 6.1]. Which error bar is the longest?
How Cost Changes with w
To build intuition, consider fixing b at 0 and asking: what happens to J as we vary w alone? Each value of w gives a different slope, a different line, and therefore a different total cost.
The curve is a parabola — it has exactly one minimum. At that minimum, the line fits the training data better than at any other value of w.
Here are four concrete values of w and the cost each one produces, with b fixed at 0:
```python
# Training data: (rainfall mm, yield t/ha)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]
m = len(y)

def cost(w, b=0):
    total = sum((w * x + b - yi) ** 2 for x, yi in zip(X, y))
    return total / (2 * m)

print(f"w = 0.010 → J = {cost(0.010):.3f}")  # too flat — predictions far below actual
print(f"w = 0.020 → J = {cost(0.020):.3f}")  # better, still too low
print(f"w = 0.033 → J = {cost(0.033):.3f}")  # near the minimum — best fit
print(f"w = 0.050 → J = {cost(0.050):.3f}")  # too steep — predictions overshoot

# Output:
# w = 0.010 → J = 4.897 ← high cost, line too flat
# w = 0.020 → J = 1.569 ← cost falling as w rises toward optimal
# w = 0.033 → J = 0.025 ← minimum — closest line to all data points
# w = 0.050 → J = 2.750 ← cost rising again, line now too steep
```

On the J(w) curve, what does the bottom of the parabola — the minimum — represent?
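As a sanity check on where that minimum sits: with b fixed at 0, J(w) is a parabola in w, and setting its derivative dJ/dw to zero gives a closed-form minimiser, w* = Σ x·y / Σ x². A minimal sketch using the same data:

```python
# Closed-form minimiser of J(w, 0): setting dJ/dw = 0 gives
# w* = sum(x * y) / sum(x ** 2)
X = [80, 100, 120, 145, 160, 185]
y = [2.2, 3.1, 3.9, 5.0, 5.4, 6.1]

w_star = sum(x * yi for x, yi in zip(X, y)) / sum(x ** 2 for x in X)
print(f"w* = {w_star:.4f}")  # close to the w = 0.033 used above
```

This confirms why w = 0.033 produced the lowest cost of the four candidates tried above.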
The Goal: Minimise J(w, b)
Training a linear regression model reduces to one objective: find w and b that minimise J(w, b).
Once we have those optimal parameters, we have a model f that makes the best possible predictions given the function form we chose (a straight line) and the data we trained on. In the next modules, we will look at gradient descent — the algorithm that systematically moves w and b toward the minimum of J, step by step, until it converges.
