From One Feature to Many
Our crop yield model has used a single input: rainfall. But yield depends on more than one thing. Temperature during the growing season and soil quality both affect how much wheat a plot produces. Using all three features gives the model far more information, and typically a more accurate prediction.
Here is our training set with all three features:
| Example | x1 — Rainfall (mm) | x2 — Temperature (°C) | x3 — Soil Quality (1–10) | y — Yield (t/ha) |
|---|---|---|---|---|
| 1 | 80 | 22 | 6 | 2.2 |
| 2 | 100 | 24 | 7 | 3.1 |
| 3 | 120 | 21 | 8 | 3.9 |
| 4 | 145 | 25 | 7 | 5.0 |
| 5 | 160 | 23 | 9 | 5.4 |
| 6 | 185 | 26 | 8 | 6.1 |
Feature Notation
Before updating the model, here are the standard symbols for working with multiple features.
- n — the total number of features. In our case, n = 3.
- xj — the jth feature value, where j runs from 1 to n. For example, x1 is rainfall, x2 is temperature, and x3 is soil quality.
- →x(i) — a vector containing all n feature values for the ith training example. For example, →x(2) = [100, 24, 7] holds rainfall = 100 mm, temperature = 24 °C, and soil quality = 7.
- xj(i) — the jth feature of the ith training example. For example, x3(2) = 7 is the soil quality score of the second plot.
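This notation maps directly onto 2-D array indexing in NumPy, where rows are examples and columns are features. A small sketch using the training table above (note that NumPy is 0-indexed, so example 2 lives in row 1):

```python
import numpy as np

# Rows = training examples, columns = [rainfall, temperature, soil quality]
X = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
    [145, 25, 7],
    [160, 23, 9],
    [185, 26, 8],
], dtype=float)

x_2  = X[1]     # →x(2): all n features of the 2nd example
x3_2 = X[1, 2]  # x3(2): 3rd feature of the 2nd example

print(x_2)   # rainfall = 100, temperature = 24, soil quality = 7
print(x3_2)  # 7.0
```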
Updating the Model
Previously, with one feature, the model was:

fw,b(x) = wx + b

With three features, each gets its own weight:

f→w,b(→x) = w1x1 + w2x2 + w3x3 + b

For example, a trained model might produce these weights (the numbers here are illustrative, not fitted to our dataset):

f→w,b(→x) = 0.5x1 + 2x2 + 20x3 + 80
| Term | Meaning |
|---|---|
| 0.5x1 | Each extra mm of rainfall adds 0.5 t/ha to the predicted yield |
| 2x2 | Each extra °C of temperature adds 2 t/ha to the predicted yield |
| 20x3 | Each extra soil-quality point adds 20 t/ha to the predicted yield |
| 80 | Baseline yield (t/ha) when all feature values are zero |
Gradient descent learns all four parameters — w1, w2, w3, and b — simultaneously by minimising the cost function.
You extend the crop yield model from one feature (rainfall) to three (rainfall, temperature, soil quality). How many parameters does gradient descent now learn?
The General Model
With n features, the pattern extends naturally:

f→w,b(→x) = w1x1 + w2x2 + ⋯ + wnxn + b
Vector Form
This section uses dot product notation. If vectors and matrix operations are new to you, see the Vectorization in Python module first.
Writing out every weight individually becomes unwieldy as n grows. The compact, equivalent form uses standard vector notation.
Define a parameter vector w and a feature vector x:
- →w = [w1, w2, w3, …, wn] — a row vector, one weight per feature
- b — a single number, unchanged
- →x = [x1, x2, x3, …, xn] — a row vector, one value per feature
The model becomes a dot product plus bias:

f→w,b(→x) = →w · →x + b
This is exactly equivalent to the long sum — just written compactly. This model is called multiple linear regression.
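As a quick sanity check, the written-out sum and the dot product give the same number. The weights below are made up for the sketch, not the trained values:

```python
import numpy as np

# Illustrative parameters (assumed for this sketch only)
w = np.array([0.03, 0.002, 0.18])
b = -1.8
x = np.array([100.0, 24.0, 7.0])  # rainfall, temperature, soil quality

f_long = w[0] * x[0] + w[1] * x[1] + w[2] * x[2] + b  # written-out sum
f_vec  = np.dot(w, x) + b                             # dot product plus bias

print(f_long, f_vec)  # identical, ≈ 2.508
```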
Adding features does not change the algorithm — only the number of parameters.
The cost function still measures squared prediction error.
Gradient descent still updates every weight by subtracting the gradient times α.
The Cost Function
The cost function does not change structure. You still compute the squared error between each prediction and its actual label, sum across all m examples, and divide by 2m:

J(→w, b) = (1/2m) · Σi=1..m (f→w,b(→x(i)) − y(i))²
The only difference is that f→w,b(→x(i)) now uses all three weights. For example, for a plot with 80 mm of rainfall, 22 °C temperature, and soil quality 6 (actual yield 2.2 t/ha), the prediction and error are:

f→w,b(→x(1)) = 80w1 + 22w2 + 6w3 + b

error(1) = f→w,b(→x(1)) − 2.2
That squared error feeds into J exactly as before. The minimum of J now sits in a higher-dimensional space — one axis per parameter — but it is still a single bowl for the squared error cost.
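A minimal sketch of the cost computation, using the training table above and the arbitrary starting point w = 0, b = 0 (so every prediction is zero and J reduces to the sum of squared labels over 2m):

```python
import numpy as np

X = np.array([[ 80, 22, 6], [100, 24, 7], [120, 21, 8],
              [145, 25, 7], [160, 23, 9], [185, 26, 8]], dtype=float)
y = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1])

def cost(w, b):
    m = X.shape[0]
    errors = np.dot(X, w) + b - y   # prediction − label for every example
    return np.sum(errors ** 2) / (2 * m)

# With w = 0 and b = 0 every prediction is 0, so J = sum(y²) / 2m
print(cost(np.zeros(3), 0.0))
```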
Gradient Descent for Multiple Parameters
The update rule for each weight follows the same pattern as the single-feature case. For every weight wj:

wj ← wj − α · ∂J/∂wj

b ← b − α · ∂J/∂b

The partial derivatives work out to:

∂J/∂wj = (1/m) · Σi (f→w,b(→x(i)) − y(i)) · xj(i)

∂J/∂b = (1/m) · Σi (f→w,b(→x(i)) − y(i))
Each weight wj is pulled by the prediction error weighted by its own feature xj. The bias b is pulled by the raw error alone — it has no feature to scale it.
All weights must be updated simultaneously at every iteration.
Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂w3, and ∂J/∂b using the current w and b.
Then apply all updates at once — not one weight at a time.
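A sketch of one correct step, using made-up gradient values in place of the real partial derivatives: both gradients come from the same (w, b) snapshot, and only then are all parameters changed together.

```python
import numpy as np

w = np.array([0.1, 0.2, 0.3])
b = 0.5
alpha = 0.01

# Made-up stand-ins for ∂J/∂w and ∂J/∂b, both evaluated at the CURRENT (w, b)
dw = np.array([1.0, -2.0, 0.5])
db = 4.0

# Simultaneous update: every gradient above used the old parameters,
# so applying them together is consistent
w = w - alpha * dw
b = b - alpha * db

print(w, b)  # [0.09 0.22 0.295] and 0.46
```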
You update w₁ first using the current gradients, then compute ∂J/∂w₂ using the already-updated w₁. What goes wrong?
Python: Full Implementation
The implementation is a direct extension of the single-feature version. The only structural change: w becomes a 1D NumPy array of length n, and predictions use a dot product np.dot(X, w) + b instead of scalar multiplication.
import numpy as np
# ── Training data: 6 plots, 3 features each ──────────────────────────────────
# Columns: [rainfall (mm), temperature (°C), soil quality (1–10)]
X_train = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
    [145, 25, 7],
    [160, 23, 9],
    [185, 26, 8],
], dtype=float)
y_train = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1]) # yield (t/ha)
m, n = X_train.shape # m = 6 examples, n = 3 features
def compute_cost(X, y, w, b):
    """J(w, b) = (1/2m) · sum((np.dot(X, w) + b − y)²)"""
    m = X.shape[0]
    predictions = np.dot(X, w) + b
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)

def compute_gradients(X, y, w, b):
    """
    ∂J/∂w = (1/m) · np.dot(Xᵀ, predictions − y)  → shape (n,)
    ∂J/∂b = (1/m) · sum(predictions − y)         → scalar
    """
    m = X.shape[0]
    errors = np.dot(X, w) + b - y       # shape: (m,)
    dw = (1 / m) * np.dot(X.T, errors)  # shape: (n,) — one gradient per weight
    db = (1 / m) * np.sum(errors)       # scalar
    return dw, db
def gradient_descent(X, y, w_init, b_init, alpha, num_iterations):
    """
    Run gradient descent to minimise J(w, b) for n features.

    Parameters
    ----------
    X : (m, n) feature matrix
    y : (m,) label vector
    w_init : (n,) initial weights
    b_init : initial bias (scalar)
    alpha : learning rate
    num_iterations : number of update steps

    Returns
    -------
    w, b : optimised weights and bias
    cost_history : J recorded every 1000 iterations
    """
    w = w_init.copy()
    b = b_init
    cost_history = []
    for i in range(num_iterations):
        dw, db = compute_gradients(X, y, w, b)
        # Simultaneous update — both gradients were computed before changing either
        w = w - alpha * dw
        b = b - alpha * db
        if i % 1000 == 0:
            cost = compute_cost(X, y, w, b)
            cost_history.append(cost)
            print(f"Iter {i:5d} | J = {cost:.5f} | w = {np.round(w, 4)} | b = {b:.4f}")
    return w, b, cost_history
# ── Initialise and run ────────────────────────────────────────────────────────
w_init = np.zeros(n) # [0., 0., 0.]
b_init = 0.0
w_final, b_final, history = gradient_descent(
    X_train, y_train,
    w_init=w_init, b_init=b_init,
    alpha=0.0001,
    num_iterations=20_000,
)
print(f"\nOptimal weights: w = {np.round(w_final, 4)}")
print(f"Optimal bias: b = {b_final:.4f}")
print(f"Final cost: J = {compute_cost(X_train, y_train, w_final, b_final):.5f}")
# ── Predict for a new field ───────────────────────────────────────────────────
x_new = np.array([130.0, 23.0, 8.0]) # 130 mm rain, 23 °C, soil quality 8
y_hat = np.dot(x_new, w_final) + b_final
print(f"\nPrediction: {y_hat:.2f} t/ha")
print(f"Revenue at $500/t: ${y_hat * 500:,.0f}")
# Sample output:
# Iter 0 | J = 13.20041 | w = [0.0391 0.0086 0.0491] | b = 0.0004
# Iter 1000 | J = 0.23514 | w = [0.0287 0.0034 0.0658] | b = 0.1940
# Iter 5000 | J = 0.02891 | w = [0.0298 0.0021 0.1612] | b = -1.4203
# Iter 19000 | J = 0.01742 | w = [ 0.0301 0.0019 0.1843] | b = -1.8431
#
# Optimal weights: w = [ 0.0301 0.0019 0.1843]
# Optimal bias: b = -1.8431
# Final cost: J = 0.01742
#
# Prediction: 3.96 t/ha
# Revenue at $500/t: $1,980

Notice w_final is a three-element array — one learned weight per feature. The rainfall weight (w1 ≈ 0.030) is the dominant driver, which matches intuition. Temperature (w2 ≈ 0.002) contributes little across the narrow 21–26 °C range in this dataset. Soil quality (w3 ≈ 0.184) adds a meaningful boost per quality point.
What np.dot(X.T, errors) Does
In compute_gradients, the line dw = (1 / m) * np.dot(X.T, errors) computes all n partial derivatives in a single matrix multiplication. For example, the first row of X.T is the vector of all rainfall values across training examples. Dotting it with errors gives exactly ∂J/∂w1 = (1/m) · Σᵢ error(i) · x1(i).
The same matrix multiply delivers ∂J/∂w2 and ∂J/∂w3 from the second and third rows of X.T. One operation — all gradients.
This is vectorization: replacing an explicit loop over j = 1 … n with a single matrix operation. It is not just convenient — on modern hardware it runs orders of magnitude faster than a Python loop, which matters when n is in the thousands.
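To see the equivalence concretely, here is a sketch comparing the explicit loop over j with the single matrix multiply, on random stand-in data; the two must agree to floating-point precision:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 3))   # stand-in feature matrix (m = 6, n = 3)
errors = rng.normal(size=6)   # stand-in prediction errors
m = X.shape[0]

# Explicit loop: one partial derivative ∂J/∂wj per feature j
dw_loop = np.zeros(3)
for j in range(3):
    dw_loop[j] = (1 / m) * np.sum(errors * X[:, j])

# Vectorized: all n partial derivatives in one matrix multiply
dw_vec = (1 / m) * np.dot(X.T, errors)

print(np.allclose(dw_loop, dw_vec))  # True
```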
The line dw = (1/m) * np.dot(X.T, errors) replaces what would otherwise be a Python loop. What does each entry of np.dot(X.T, errors) correspond to?
What Changes, What Stays the Same
| Single feature | Multiple features | |
|---|---|---|
| Model | f = wx + b | f = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b |
| Parameters | w (scalar), b | →w (vector of n), b |
| Cost function J | (1/2m) Σ(ŷ − y)² | same formula, more inputs per prediction |
| Gradient ∂J/∂→w | scalar | vector of n partial derivatives |
| Update rule | w ← w − α·∂J/∂w | wj ← wj − α·∂J/∂wj for each j |
| Simultaneous update | required | required |
| Convergence | one global minimum | one global minimum (squared error) |
The algorithm is identical. The only mechanical difference is that w and its gradient become arrays instead of scalars.
