From One Feature to Many
Our crop yield model has used a single input: rainfall. But yield depends on more than one thing. Temperature during the growing season and soil quality both affect how much wheat a plot produces. Using all three features gives the model far more information, and typically a more accurate prediction.
Here is our training set with all three features:
| Example | x1 — Rainfall (mm) | x2 — Temperature (°C) | x3 — Soil Quality (1–10) | y — Yield (t/ha) |
|---|---|---|---|---|
| 1 | 80 | 22 | 6 | 2.2 |
| 2 | 100 | 24 | 7 | 3.1 |
| 3 | 120 | 21 | 8 | 3.9 |
| 4 | 145 | 25 | 7 | 5.0 |
| 5 | 160 | 23 | 9 | 5.4 |
| 6 | 185 | 26 | 8 | 6.1 |
Feature Notation
Before updating the model, here are the standard symbols for working with multiple features.
- n — the total number of features. In our case, n = 3.
- xj — the jth feature value, where j runs from 1 to n. For example, x1 is rainfall, x2 is temperature, and x3 is soil quality.
- →x(i) — a vector containing all n feature values for the ith training example. For example, →x(2) = [100, 24, 7] holds rainfall = 100 mm, temperature = 24 °C, and soil quality = 7.
- xj(i) — the jth feature of the ith training example. For example, x3(2) = 7 is the soil quality score of the second plot.
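This notation maps directly onto 2-D array indexing in NumPy, where rows are examples and columns are features. A small sketch using the training table above (note that NumPy is 0-indexed, so example 2 lives in row 1):

```python
import numpy as np

# Rows = training examples, columns = [rainfall, temperature, soil quality]
X = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
    [145, 25, 7],
    [160, 23, 9],
    [185, 26, 8],
], dtype=float)

x_2  = X[1]     # →x(2): all n features of the 2nd example
x3_2 = X[1, 2]  # x3(2): 3rd feature of the 2nd example

print(x_2)   # rainfall = 100, temperature = 24, soil quality = 7
print(x3_2)  # 7.0
```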
Updating the Model
Previously, with one feature, the model was:

fw,b(x) = wx + b

With three features, each gets its own weight:

f→w,b(→x) = w1x1 + w2x2 + w3x3 + b

For example, a trained model might produce these weights (the numbers here are illustrative, not fitted to our dataset):

f→w,b(→x) = 0.5x1 + 2x2 + 20x3 + 80
| Term | Meaning |
|---|---|
| 0.5x1 | Each extra mm of rainfall adds 0.5 t/ha to the predicted yield |
| 2x2 | Each extra °C of temperature adds 2 t/ha to the predicted yield |
| 20x3 | Each extra soil-quality point adds 20 t/ha to the predicted yield |
| 80 | Baseline yield (t/ha) when all feature values are zero |
Gradient descent learns all four parameters — w1, w2, w3, and b — simultaneously by minimising the cost function.
You extend the crop yield model from one feature (rainfall) to three (rainfall, temperature, soil quality). How many parameters does gradient descent now learn?
The General Model
With n features, the pattern extends naturally:

f→w,b(→x) = w1x1 + w2x2 + ⋯ + wnxn + b
Vector Form
This section uses dot product notation. If vectors and matrix operations are new to you, see the Vectorization in Python module first.
Writing out every weight individually becomes unwieldy as n grows. The compact, equivalent form uses standard vector notation.
Define a parameter vector w and a feature vector x:
- →w = [w1, w2, w3, …, wn] — a row vector, one weight per feature
- b — a single number, unchanged
- →x = [x1, x2, x3, …, xn] — a row vector, one value per feature
The model becomes a dot product plus bias:

f→w,b(→x) = →w · →x + b
This is exactly equivalent to the long sum — just written compactly. This model is called multiple linear regression.
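As a quick sanity check, the written-out sum and the dot product give the same number. The weights below are made up for the sketch, not the trained values:

```python
import numpy as np

# Illustrative parameters (assumed for this sketch only)
w = np.array([0.03, 0.002, 0.18])
b = -1.8
x = np.array([100.0, 24.0, 7.0])  # rainfall, temperature, soil quality

f_long = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b  # written-out sum
f_vec  = np.dot(w, x) + b                             # dot product plus bias

print(f_long, f_vec)  # identical, ≈ 2.508
```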
Adding features does not change the algorithm — only the number of parameters.
The cost function still measures squared prediction error.
Gradient descent still updates every weight by subtracting the gradient times α.
The Cost Function
The cost function does not change structure. You still compute the squared error between each prediction and its actual label, sum across all m examples, and divide by 2m:

J(→w, b) = (1/2m) · Σi=1..m (f→w,b(→x(i)) − y(i))²
The only difference is that f→w,b(→x(i)) now uses all three weights. For example, for a plot with 80 mm of rainfall, 22 °C temperature, and soil quality 6 (actual yield 2.2 t/ha), the prediction and error are:

f→w,b(→x(1)) = 80w1 + 22w2 + 6w3 + b

error(1) = f→w,b(→x(1)) − 2.2
That squared error feeds into J exactly as before. The minimum of J now sits in a higher-dimensional space — one axis per parameter — but it is still a single bowl for the squared error cost.
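A minimal sketch of the cost computation, using the training table above and the arbitrary starting point w = 0, b = 0 (so every prediction is zero and J reduces to the sum of squared labels over 2m):

```python
import numpy as np

X = np.array([[ 80, 22, 6], [100, 24, 7], [120, 21, 8],
              [145, 25, 7], [160, 23, 9], [185, 26, 8]], dtype=float)
y = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1])

def cost(w, b):
    m = X.shape[0]
    errors = np.dot(X, w) + b - y   # prediction − label for every example
    return np.sum(errors ** 2) / (2 * m)

# With w = 0 and b = 0 every prediction is 0, so J = sum(y²) / 2m
print(cost(np.zeros(3), 0.0))
```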
Gradient Descent for Multiple Parameters
The update rule for each weight follows the same pattern as the single-feature case. For every weight wj:

wj ← wj − α · ∂J/∂wj

b ← b − α · ∂J/∂b

The partial derivatives work out to:

∂J/∂wj = (1/m) · Σi (f→w,b(→x(i)) − y(i)) · xj(i)

∂J/∂b = (1/m) · Σi (f→w,b(→x(i)) − y(i))
Each weight wj is pulled by the prediction error weighted by its own feature xj. The bias b is pulled by the raw error alone — it has no feature to scale it.
All weights must be updated simultaneously at every iteration.
Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂w3, and ∂J/∂b using the current w and b.
Then apply all updates at once — not one weight at a time.
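A sketch of one correct step, using made-up gradient values in place of the real partial derivatives: both gradients come from the same (w, b) snapshot, and only then are all parameters changed together.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3])
b = 0.5
alpha = 0.01

# Made-up stand-ins for ∂J/∂w and ∂J/∂b, both evaluated at the CURRENT (w, b)
dw = np.array([1.0, -2.0, 0.5])
db = 4.0

# Simultaneous update: every gradient above used the old parameters,
# so applying them together is consistent
w = w - alpha * dw
b = b - alpha * db

print(w, b)  # [0.09 0.22 0.295] and 0.46
```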
You update w₁ first using the current gradients, then compute ∂J/∂w₂ using the already-updated w₁. What goes wrong?
Python: Full Implementation
The implementation is a direct extension of the single-feature version. The only structural change: w becomes a 1D NumPy array of length n, and predictions use a dot product np.dot(X, w) + b instead of scalar multiplication.
import numpy as np
# ── Training data: 6 plots, 3 features each ──────────────────────────────────
# Columns: [rainfall (mm), temperature (°C), soil quality (1–10)]
X_train = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
    [145, 25, 7],
    [160, 23, 9],
    [185, 26, 8],
], dtype=float)
y_train = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1]) # yield (t/ha)
m, n = X_train.shape # m = 6 examples, n = 3 features
def compute_cost(X, y, w, b):
    """J(w, b) = (1/2m) · sum((np.dot(X, w) + b − y)²)"""
    m = X.shape[0]
    predictions = np.dot(X, w) + b
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

def compute_gradients(X, y, w, b):
    """
    ∂J/∂w = (1/m) · np.dot(Xᵀ, predictions − y)  → shape (n,)
    ∂J/∂b = (1/m) · sum(predictions − y)         → scalar
    """
    m = X.shape[0]
    errors = np.dot(X, w) + b - y       # shape: (m,)
    dw = (1 / m) * np.dot(X.T, errors)  # shape: (n,) — one gradient per weight
    db = (1 / m) * np.sum(errors)       # scalar
    return dw, db
def gradient_descent(X, y, w_init, b_init, alpha, num_iterations):
    """
    Run gradient descent to minimise J(w, b) for n features.

    Parameters
    ----------
    X : (m, n) feature matrix
    y : (m,) label vector
    w_init : (n,) initial weights
    b_init : initial bias (scalar)
    alpha : learning rate
    num_iterations : number of update steps

    Returns
    -------
    w, b : optimised weights and bias
    cost_history : J recorded every 1000 iterations
    """
    w = w_init.copy()
    b = b_init
    cost_history = []
    for i in range(num_iterations):
        dw, db = compute_gradients(X, y, w, b)
        # Simultaneous update — both gradients were computed before changing either
        w = w - alpha * dw
        b = b - alpha * db
        if i % 1000 == 0:
            cost = compute_cost(X, y, w, b)
            cost_history.append(cost)
            print(f"Iter {i:5d} | J = {cost:.5f} | w = {np.round(w, 4)} | b = {b:.4f}")
    return w, b, cost_history
# ── Initialise and run ────────────────────────────────────────────────────────
w_init = np.zeros(n) # [0., 0., 0.]
b_init = 0.0
w_final, b_final, history = gradient_descent(
    X_train, y_train,
    w_init=w_init, b_init=b_init,
    alpha=0.0001,
    num_iterations=20_000,
)
print(f"\nOptimal weights: w = {np.round(w_final, 4)}")
print(f"Optimal bias: b = {b_final:.4f}")
print(f"Final cost: J = {compute_cost(X_train, y_train, w_final, b_final):.5f}")
# ── Predict for a new field ───────────────────────────────────────────────────
x_new = np.array([130.0, 23.0, 8.0]) # 130 mm rain, 23 °C, soil quality 8
y_hat = np.dot(x_new, w_final) + b_final
print(f"\nPrediction: {y_hat:.2f} t/ha")
print(f"Revenue at $500/t: ${y_hat * 500:,.0f}")
# Sample output:
# Iter 0 | J = 13.20041 | w = [0.0391 0.0086 0.0491] | b = 0.0004
# Iter 1000 | J = 0.23514 | w = [0.0287 0.0034 0.0658] | b = 0.1940
# Iter 5000 | J = 0.02891 | w = [0.0298 0.0021 0.1612] | b = -1.4203
# Iter 19000 | J = 0.01742 | w = [ 0.0301 0.0019 0.1843] | b = -1.8431
#
# Optimal weights: w = [ 0.0301 0.0019 0.1843]
# Optimal bias: b = -1.8431
# Final cost: J = 0.01742
#
# Prediction: 3.96 t/ha
# Revenue at $500/t: $1,980

Notice w_final is a three-element array — one learned weight per feature. The rainfall weight (w1 ≈ 0.030) is the dominant driver, which matches intuition. Temperature (w2 ≈ 0.002) contributes little across the narrow 21–26 °C range in this dataset. Soil quality (w3 ≈ 0.184) adds a meaningful boost per quality point.
What np.dot(X.T, errors) Does
In compute_gradients, the line dw = (1 / m) * np.dot(X.T, errors) computes all n partial derivatives in a single matrix multiplication. For example, the first row of X.T is the vector of all rainfall values across training examples. Dotting it with errors gives exactly ∂J/∂w1 = (1/m) · Σᵢ error(i) · x1(i).
The same matrix multiply delivers ∂J/∂w2 and ∂J/∂w3 from the second and third rows of X.T. One operation — all gradients.
This is vectorization: replacing an explicit loop over j = 1 … n with a single matrix operation. It is not just convenient — on modern hardware it runs orders of magnitude faster than a Python loop, which matters when n is in the thousands.
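To see the equivalence concretely, here is a sketch comparing the explicit loop over j with the single matrix multiply, on random stand-in data; the two must agree to floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # stand-in feature matrix (m = 6, n = 3)
errors = rng.normal(size=6)   # stand-in prediction errors
m = X.shape[0]

# Explicit loop: one partial derivative ∂J/∂wj per feature j
dw_loop = np.zeros(3)
for j in range(3):
    dw_loop[j] = (1 / m) * np.sum(errors * X[:, j])

# Vectorized: all n partial derivatives in one matrix multiply
dw_vec = (1 / m) * np.dot(X.T, errors)

print(np.allclose(dw_loop, dw_vec))  # True
```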
The line dw = (1/m) * np.dot(X.T, errors) replaces what would otherwise be a Python loop. What does each entry of np.dot(X.T, errors) correspond to?
What Changes, What Stays the Same
| Single feature | Multiple features | |
|---|---|---|
| Model | f = wx + b | f = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b |
| Parameters | w (scalar), b | →w (vector of n), b |
| Cost function J | (1/2m) Σ(ŷ − y)² | same formula, more inputs per prediction |
| Gradient ∂J/∂→w | scalar | vector of n partial derivatives |
| Update rule | w ← w − α·∂J/∂w | wj ← wj − α·∂J/∂wj for each j |
| Simultaneous update | required | required |
| Convergence | one global minimum | one global minimum (squared error) |
The algorithm is identical. The only mechanical difference is that w and its gradient become arrays instead of scalars.
