Machine Learning · Beginner

Regression with Multiple Inputs

Tags: multiple features · multivariate regression · gradient descent · vectorization · linear regression · parameters

From One Feature to Many

Our crop yield model has used a single input — rainfall. But yield depends on more than one thing. Temperature during the growing season and soil quality both affect how much wheat a plot produces. Using all three features gives the model far more information, and a much more accurate prediction.

Here is our training set with all three features:

| Example | x₁ — Rainfall (mm) | x₂ — Temperature (°C) | x₃ — Soil Quality (1–10) | y — Yield (t/ha) |
|---------|--------------------|------------------------|---------------------------|-------------------|
| 1       | 80                 | 22                     | 6                         | 2.2               |
| 2       | 100                | 24                     | 7                         | 3.1               |
| 3       | 120                | 21                     | 8                         | 3.9               |
| 4       | 145                | 25                     | 7                         | 5.0               |
| 5       | 160                | 23                     | 9                         | 5.4               |
| 6       | 185                | 26                     | 8                         | 6.1               |
[Diagram: three input features — x₁ (rainfall), x₂ (temperature), x₃ (soil quality) — each scaled by its weight (w₁, w₂, w₃) and combined with the bias b in a summation node Σ, producing the prediction ŷ (crop yield).]
Each input feature is multiplied by its own weight. The weighted sum plus bias b produces the prediction ŷ.

Feature Notation

Before updating the model, here are the standard symbols for working with multiple features.

  • n — the total number of features. In our case, n = 3.
  • xj — the jth feature value, where j runs from 1 to n. For example, x1 is rainfall, x2 is temperature, x3 is soil quality.
  • x(i) — a vector containing all n feature values for the ith training example. For example, x(2) = [100, 24, 7] holds rainfall = 100 mm, temperature = 24 °C, and soil quality = 7.
  • xj(i) — the jth feature of the ith training example. For example, x3(2) = 7 is the soil quality score of the second plot.
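As a quick sketch, this notation maps directly onto NumPy indexing — with the caveat that NumPy is 0-based, so every subscript and superscript shifts down by one:

```python
import numpy as np

# First three training examples from the table above:
# one row per example, one column per feature
X = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
], dtype=float)

n = X.shape[1]     # number of features: n = 3
x_2 = X[1]         # x(2): all features of the 2nd example -> [100., 24., 7.]
x3_2 = X[1, 2]     # x3(2): 3rd feature of the 2nd example -> 7.0
```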

Updating the Model

Previously, with one feature, the model was:

fw,b(x) = wx + b

With three features, each gets its own weight:

fw,b(x) = w1x1 + w2x2 + w3x3 + b

For example, a trained model might produce these weights:

fw,b(x) = 0.5x1 + 2x2 + 20x3 + 80
| Term  | Meaning                                                          |
|-------|------------------------------------------------------------------|
| 0.5x₁ | Each extra mm of rainfall adds 0.5 to the predicted yield        |
| 2x₂   | Each extra °C of temperature adds 2 to the predicted yield       |
| 20x₃  | Each extra soil quality point adds 20 to the predicted yield     |
| 80    | Baseline yield when all feature values are zero                  |

Gradient descent learns all four parameters — w1, w2, w3, and b — simultaneously by minimising the cost function.
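As a sketch, evaluating that illustrative model is a weighted sum plus the bias. (The weights 0.5, 2, 20 and bias 80 are the made-up values from the example above, not weights learned from our dataset.)

```python
import numpy as np

w = np.array([0.5, 2.0, 20.0])    # illustrative weights from the text
b = 80.0

x = np.array([100.0, 24.0, 7.0])  # rainfall (mm), temperature (°C), soil quality
y_hat = np.dot(w, x) + b          # 0.5*100 + 2*24 + 20*7 + 80
print(y_hat)                      # 318.0
```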

Quick Check

You extend the crop yield model from one feature (rainfall) to three (rainfall, temperature, soil quality). How many parameters does gradient descent now learn?

The General Model

With n features, the pattern extends naturally:

fw,b(x) = w1x1 + w2x2 + ⋯ + wnxn + b

Vector Form

This section uses dot product notation. If vectors and matrix operations are new to you, see the Vectorization in Python module first.

Go to Vectorization in Python

Writing out every weight individually becomes unwieldy as n grows. The compact, equivalent form uses standard vector notation.

Define a parameter vector w and a feature vector x:

  • w = [w1, w2, w3, …, wn] — a row vector, one weight per feature
  • b — a single number, unchanged
  • x = [x1, x2, x3, …, xn] — a row vector, one value per feature

The model becomes a dot product plus bias:

fw,b(x) = w · x + b

This is exactly equivalent to the long sum — just written compactly. This model is called multiple linear regression.
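A minimal sketch confirming that the dot product is the written-out sum, nothing more:

```python
import numpy as np

w = np.array([0.5, 2.0, 20.0])
b = 80.0
x = np.array([100.0, 24.0, 7.0])

# Long form: every weight times its feature, summed explicitly
long_form = w[0]*x[0] + w[1]*x[1] + w[2]*x[2] + b

# Compact form: one dot product
dot_form = np.dot(w, x) + b

print(np.isclose(long_form, dot_form))  # True
```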

Adding features does not change the algorithm — only the number of parameters. The cost function still measures squared prediction error. Gradient descent still updates every weight by subtracting the gradient times α.

The Cost Function

The cost function does not change structure. You still compute the squared error between each prediction and its actual label, sum across all m examples, and divide by 2m:

J(w, b) = (1/2m) · Σᵢ (fw,b(x(i)) − y(i))²

The only difference is that fw,b(x(i)) now uses all three weights. For example, for a plot with 80 mm of rainfall, 22 °C temperature, and soil quality 6 — with actual yield 2.2 t/ha — the prediction and error are:

ŷ⁽¹⁾ = w₁·80 + w₂·22 + w₃·6 + b
error⁽¹⁾ = ŷ⁽¹⁾ − 2.2

That squared error feeds into J exactly as before. The minimum of J now sits in a higher-dimensional space — one axis per parameter — but it is still a single bowl for the squared error cost.
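As a concrete sketch of that arithmetic, here is the first plot's contribution to J under some hypothetical current parameter values (the w and b below are chosen only to make the numbers traceable):

```python
import numpy as np

w = np.array([0.02, 0.01, 0.1])   # hypothetical in-progress parameters
b = 0.5

x_1 = np.array([80.0, 22.0, 6.0]) # first plot from the training table
y_1 = 2.2                          # its actual yield (t/ha)

y_hat_1 = np.dot(w, x_1) + b       # 0.02*80 + 0.01*22 + 0.1*6 + 0.5 = 2.92
error_1 = y_hat_1 - y_1            # 2.92 - 2.2 = 0.72
# This example contributes error_1**2 to the sum inside J,
# before the sum is divided by 2m.
```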

Gradient Descent for Multiple Parameters

The update rule for each weight follows the same pattern as the single-feature case. For every weight wj:

wj ← wj − α · ∂J/∂wj    for j = 1, 2, …, n
b ← b − α · ∂J/∂b

The partial derivatives work out to:

∂J/∂wj = (1/m) · Σᵢ ( f(x(i)) − y(i) ) · xj(i)
∂J/∂b = (1/m) · Σᵢ ( f(x(i)) − y(i) )

Each weight wj is pulled by the prediction error weighted by its own feature xj. The bias b is pulled by the raw error alone — it has no feature to scale it.

All weights must be updated simultaneously at every iteration. Compute ∂J/∂w1, ∂J/∂w2, ∂J/∂w3, and ∂J/∂b using the current w and b. Then apply all updates at once — not one weight at a time.
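A minimal sketch of the correct ordering — every gradient is computed from the current w and b before any parameter moves (the tiny two-example dataset here is just for illustration):

```python
import numpy as np

def compute_gradients(X, y, w, b):
    m = X.shape[0]
    errors = np.dot(X, w) + b - y
    return (1 / m) * np.dot(X.T, errors), (1 / m) * np.sum(errors)

X = np.array([[80.0, 22.0, 6.0], [100.0, 24.0, 7.0]])
y = np.array([2.2, 3.1])
w, b, alpha = np.zeros(3), 0.0, 1e-4

# Correct: all gradients come from the same snapshot of w and b...
dw, db = compute_gradients(X, y, w, b)
# ...then every parameter is updated at once.
w_new = w - alpha * dw
b_new = b - alpha * db
# Updating w[0] first and then recomputing gradients would mix
# old and new parameters — that is the mistake to avoid.
```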

Quick Check

You update w₁ first using the current gradients, then compute ∂J/∂w₂ using the already-updated w₁. What goes wrong?

Python: Full Implementation

The implementation is a direct extension of the single-feature version. The only structural change: w becomes a 1D NumPy array of length n, and predictions use a dot product np.dot(X, w) + b instead of scalar multiplication.

python
import numpy as np

# ── Training data: 6 plots, 3 features each ──────────────────────────────────
# Columns: [rainfall (mm), temperature (°C), soil quality (1–10)]
X_train = np.array([
    [ 80, 22, 6],
    [100, 24, 7],
    [120, 21, 8],
    [145, 25, 7],
    [160, 23, 9],
    [185, 26, 8],
], dtype=float)

y_train = np.array([2.2, 3.1, 3.9, 5.0, 5.4, 6.1])   # yield (t/ha)

m, n = X_train.shape   # m = 6 examples, n = 3 features


def compute_cost(X, y, w, b):
    """J(w, b) = (1/2m) · sum((np.dot(X, w) + b − y)²)"""
    m = X.shape[0]                 # don't rely on the module-level m
    predictions = np.dot(X, w) + b
    return (1 / (2 * m)) * np.sum((predictions - y) ** 2)


def compute_gradients(X, y, w, b):
    """
    ∂J/∂w = (1/m) · np.dot(Xᵀ, predictions − y)   → shape (n,)
    ∂J/∂b = (1/m) · sum(predictions − y)          → scalar
    """
    m = X.shape[0]
    errors = np.dot(X, w) + b - y       # shape: (m,)
    dw = (1 / m) * np.dot(X.T, errors)  # shape: (n,) — one gradient per weight
    db = (1 / m) * np.sum(errors)       # scalar
    return dw, db


def gradient_descent(X, y, w_init, b_init, alpha, num_iterations):
    """
    Run gradient descent to minimise J(w, b) for n features.

    Parameters
    ----------
    X             : (m, n) feature matrix
    y             : (m,) label vector
    w_init        : (n,) initial weights
    b_init        : initial bias (scalar)
    alpha         : learning rate
    num_iterations: number of update steps

    Returns
    -------
    w, b          : optimised weights and bias
    cost_history  : J recorded every 1000 iterations
    """
    w = w_init.copy()
    b = b_init
    cost_history = []

    for i in range(num_iterations):
        dw, db = compute_gradients(X, y, w, b)

        # Simultaneous update — compute both before changing either
        w = w - alpha * dw
        b = b - alpha * db

        if i % 1000 == 0:
            cost = compute_cost(X, y, w, b)
            cost_history.append(cost)
            print(f"Iter {i:5d} | J = {cost:.5f} | w = {np.round(w, 4)} | b = {b:.4f}")

    return w, b, cost_history


# ── Initialise and run ────────────────────────────────────────────────────────
w_init = np.zeros(n)   # [0., 0., 0.]
b_init = 0.0

w_final, b_final, history = gradient_descent(
    X_train, y_train,
    w_init=w_init, b_init=b_init,
    alpha=0.0001,
    num_iterations=20_000,
)

print(f"\nOptimal weights:  w = {np.round(w_final, 4)}")
print(f"Optimal bias:     b = {b_final:.4f}")
print(f"Final cost:       J = {compute_cost(X_train, y_train, w_final, b_final):.5f}")

# ── Predict for a new field ───────────────────────────────────────────────────
x_new = np.array([130.0, 23.0, 8.0])   # 130 mm rain, 23 °C, soil quality 8
y_hat = np.dot(x_new, w_final) + b_final
print(f"\nPrediction: {y_hat:.2f} t/ha")
print(f"Revenue at $500/t: ${y_hat * 500:,.0f}")

# Sample output:
# Iter     0 | J = 13.20041 | w = [0.0391 0.0086 0.0491] | b = 0.0004
# Iter  1000 | J =  0.23514 | w = [0.0287 0.0034 0.0658] | b = 0.1940
# Iter  5000 | J =  0.02891 | w = [0.0298 0.0021 0.1612] | b = -1.4203
# Iter 19000 | J =  0.01742 | w = [ 0.0301  0.0019  0.1843] | b = -1.8431
#
# Optimal weights:  w = [ 0.0301  0.0019  0.1843]
# Optimal bias:     b = -1.8431
# Final cost:       J = 0.01742
#
# Prediction: 3.96 t/ha
# Revenue at $500/t: $1,980

Notice w_final is a three-element array — one learned weight per feature. The rainfall weight (w1 ≈ 0.030) is the dominant driver, which matches intuition. Temperature (w2 ≈ 0.002) contributes little across the narrow 21–26 °C range in this dataset. Soil quality (w3 ≈ 0.184) adds a meaningful boost per quality point.

What np.dot(X.T, errors) Does

In compute_gradients, the line dw = (1 / m) * np.dot(X.T, errors) computes all n partial derivatives in a single matrix multiplication. For example, the first row of X.T is the vector of all rainfall values across training examples. Dotting it with errors gives exactly ∂J/∂w1 = (1/m) · Σᵢ error(i) · x1(i).

The same matrix multiply delivers ∂J/∂w2 and ∂J/∂w3 from the second and third rows of X.T. One operation — all gradients.

This is vectorization: replacing an explicit loop over j = 1 … n with a single matrix operation. It is not just convenient — on modern hardware it runs orders of magnitude faster than a Python loop, which matters when n is in the thousands.
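A short sketch of the replacement the text describes — the explicit per-feature loop and the single matrix multiply produce the same gradient vector (the errors here are arbitrary placeholder values):

```python
import numpy as np

X = np.array([[ 80.0, 22.0, 6.0],
              [100.0, 24.0, 7.0],
              [120.0, 21.0, 8.0]])
errors = np.array([0.5, -0.2, 0.1])   # placeholder prediction errors
m, n = X.shape

# Explicit loop: one pass per feature j
dw_loop = np.zeros(n)
for j in range(n):
    dw_loop[j] = (1 / m) * np.sum(errors * X[:, j])

# Vectorized: one matrix multiply yields all n entries at once
dw_vec = (1 / m) * np.dot(X.T, errors)

print(np.allclose(dw_loop, dw_vec))   # True
```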

Quick Check

The line dw = (1/m) * np.dot(X.T, errors) replaces what would otherwise be a Python loop. What does each entry of np.dot(X.T, errors) correspond to?

What Changes, What Stays the Same

|                      | Single feature        | Multiple features                          |
|----------------------|-----------------------|--------------------------------------------|
| Model                | f = wx + b            | f = w₁x₁ + w₂x₂ + ⋯ + wₙxₙ + b             |
| Parameters           | w (scalar), b         | w (vector of n), b                         |
| Cost function J      | (1/2m) Σ(ŷ − y)²      | same formula, more inputs per prediction   |
| Gradient ∂J/∂w       | scalar                | vector of n partial derivatives            |
| Update rule          | w ← w − α·∂J/∂w       | wⱼ ← wⱼ − α·∂J/∂wⱼ for each j              |
| Simultaneous update  | required              | required                                   |
| Convergence          | one global minimum    | one global minimum (squared error)         |

The algorithm is identical. The only mechanical difference is that w and its gradient become arrays instead of scalars.

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Regression with Multiple Inputs and see your score on the leaderboard.

Take the Quiz

Up next

Next, we look at feature engineering — how to scale your inputs so gradient descent runs much faster.

Feature Engineering