Feature engineering is the process of transforming raw inputs into better representations for your model. In this module we cover two key techniques:
- Feature scaling — rescale inputs so they have comparable ranges, which makes gradient descent run much faster.
- Feature creation — combine or transform existing features to create new ones that capture relationships the raw inputs cannot express on their own.
Why Feature Ranges Matter
Take the crop yield problem with two features: rainfall in millimetres and soil quality on a scale of 1 to 10.
| Feature | Min | Max | Range |
|---|---|---|---|
| x1 — Rainfall (mm) | 50 | 200 | 150 |
| x2 — Soil Quality (1–10) | 1 | 10 | 9 |
Rainfall spans 150 units. Soil quality spans only 9 units. Their ranges differ by a factor of about 17.
Now suppose a trained model learns these weights:
| Parameter | Value | Interpretation |
|---|---|---|
| w1 (rainfall) | 0.03 | Each extra mm of rain adds 0.03 t/ha |
| w2 (soil quality) | 0.90 | Each extra quality point adds 0.90 t/ha |
| b | −3.0 | Baseline offset |
Here is the prediction broken down side by side for two plots:
| Plot A — 100 mm rain, soil 5 | Plot B — 160 mm rain, soil 7 | |
|---|---|---|
| w1 · x1 | 0.03 × 100 = 3.0 | 0.03 × 160 = 4.8 |
| w2 · x2 | 0.90 × 5 = 4.5 | 0.90 × 7 = 6.3 |
| + b | −3.0 | −3.0 |
| ŷ (yield t/ha) | 4.5 | 8.1 |
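The two predictions above can be reproduced in a few lines. This sketch uses the weights and plot values from the tables; `predict_yield` is an illustrative name, not part of any library:

```python
# Weights learned by the model (taken from the table above)
w1, w2, b = 0.03, 0.90, -3.0

def predict_yield(rain_mm, soil_quality):
    """Linear model: yield (t/ha) = w1 * rain + w2 * soil + b."""
    return w1 * rain_mm + w2 * soil_quality + b

print(round(predict_yield(100, 5), 2))  # Plot A -> 4.5
print(round(predict_yield(160, 7), 2))  # Plot B -> 8.1
```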
Notice that w1 = 0.03 is small and w2 = 0.90 is large — even though both features matter equally to yield. The model compensates for the large range of rainfall by learning a tiny weight, and compensates for the small range of soil quality by learning a large weight.
Wide-range feature → small optimal weight. Narrow-range feature → large optimal weight. Feature range and its optimal weight are inversely related — the model must balance out the scale difference through the weight.
How This Slows Gradient Descent
Recall that the gradient for each weight is:

∂J/∂wⱼ = (1/m) · Σᵢ ( ŷ⁽ⁱ⁾ − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾

The gradient is scaled by the feature values xⱼ⁽ⁱ⁾. For rainfall, x₁ values are 80, 100, 120, 160 — large numbers. For soil quality, x₂ values are 3, 5, 7, 9 — small numbers.
This means:
| Gradient magnitude | Cost sensitivity | |
|---|---|---|
| ∂J/∂w1 (rainfall) | Large — scaled by ~100s | A tiny step in w1 causes a big change in cost |
| ∂J/∂w2 (soil quality) | Small — scaled by ~1–10 | A large step in w2 barely changes cost |
The cost surface J(w1, w2) becomes elongated — very steep in the w1 direction and very shallow in the w2 direction. To prevent gradient descent from overshooting in the steep direction, you must use a very small learning rate. That same small learning rate then makes progress in the shallow direction painfully slow. Gradient descent ends up taking thousands of tiny zigzag steps before reaching the minimum.
Unequal feature ranges → unequal gradients → the learning rate must be set small enough for the steepest direction → slow convergence in all directions. Scaling features to a similar range fixes this at the source.
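A small numeric sketch of the lopsided gradients, using the rainfall and soil-quality values above paired with made-up yields (the dataset is hypothetical, chosen only to illustrate the scale difference):

```python
# Hypothetical training set: (rainfall mm, soil quality, yield t/ha)
data = [(80, 3, 2.0), (100, 5, 4.5), (120, 7, 6.5), (160, 9, 9.0)]

def gradients(w1, w2, b):
    """Gradients of the mean squared error J with respect to w1 and w2."""
    m = len(data)
    g1 = g2 = 0.0
    for x1, x2, y in data:
        err = (w1 * x1 + w2 * x2 + b) - y   # prediction error for one example
        g1 += err * x1 / m                  # scaled by rainfall values (~100s)
        g2 += err * x2 / m                  # scaled by soil values (~1-10)
    return g1, g2

g1, g2 = gradients(0.0, 0.0, 0.0)
print(g1, g2)  # |g1| comes out more than ten times larger than |g2|
```

Because ∂J/∂w₁ dwarfs ∂J/∂w₂ at every step, any learning rate safe for w₁ is far too timid for w₂.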
Rainfall ranges from 50–200 mm (range 150) and soil quality ranges from 1–10 (range 9). After training, which weight do you expect to be larger?
How to Achieve Feature Scaling
The goal is to rescale every feature so that all features have a comparable range of values, ideally between −1 and 1.
| Range | Status |
|---|---|
| −1 to 1 | Ideal |
| −3 to 3 | Acceptable |
| 0 to 3 | Acceptable |
| −100 to 100 | Too large — rescale |
| −0.001 to 0.001 | Too small — rescale |
| 100 to 105 | Too large in absolute value — rescale |
The last case is easy to miss: the range is only 5, but absolute values around 100 produce large gradients. Any feature with large absolute values needs scaling, even if its spread is small.
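The table's rule of thumb can be sketched as a small helper. The thresholds (3.0 and 0.001) are rough values taken from the table, not a standard:

```python
def needs_rescaling(values, big=3.0, small=0.001):
    """Flag a feature whose absolute values are much larger than ~3
    or much smaller than ~0.001 (rough thresholds from the table above)."""
    biggest = max(abs(v) for v in values)
    return biggest > big or biggest <= small

print(needs_rescaling([50, 200]))    # rainfall: large values -> True
print(needs_rescaling([-2, 0, 2]))   # already comfortable -> False
print(needs_rescaling([100, 105]))   # small spread, large magnitudes -> True
```

Note that the last call returns True even though the spread is only 5 — the check looks at absolute magnitude, not spread.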
There are three standard techniques. Each produces a scaled version x̃ of the original feature x.
1. Max Normalisation
Divide every value by the maximum value of the feature. The result lies between 0 and 1 (for positive features).
- Find the maximum: max(x)
- Divide every value: x̃ = x / max(x)
For example, rainfall with max = 200 mm:
| Original x₁ | x̃ = x₁ / 200 |
|---|---|
| 50 mm | 0.25 |
| 100 mm | 0.50 |
| 160 mm | 0.80 |
| 200 mm | 1.00 |
Scaled range: 0.25 to 1.0 ✓
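The two steps above translate directly into code. A minimal sketch, using the rainfall values from the table:

```python
def max_normalise(values):
    """x_tilde = x / max(x): maps a positive feature into (0, 1]."""
    m = max(values)
    return [v / m for v in values]

rainfall = [50, 100, 160, 200]
print(max_normalise(rainfall))  # [0.25, 0.5, 0.8, 1.0]
```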
2. Mean Normalisation
Centre the feature at zero by subtracting the mean, then divide by the range. The result typically lies between −1 and 1.
- Compute the mean: μ = (1/m) · Σᵢ xᵢ
- Compute the range: range = max(x) − min(x)
- Scale every value: x̃ = (x − μ) / range
For example, rainfall with μ = 125 mm and range = 200 − 50 = 150:
| Original x₁ | x̃ = (x₁ − 125) / 150 |
|---|---|
| 50 mm | −0.50 |
| 100 mm | −0.17 |
| 160 mm | +0.23 |
| 200 mm | +0.50 |
Scaled range: −0.50 to 0.50 ✓
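The three steps above as a sketch. The optional `mu` and `spread` arguments let you pass in statistics computed over the full dataset (as the worked example does, with μ = 125); left unset, they are computed from the values given:

```python
def mean_normalise(values, mu=None, spread=None):
    """x_tilde = (x - mu) / (max - min): centres at 0, range roughly [-1, 1]."""
    if mu is None:
        mu = sum(values) / len(values)
    if spread is None:
        spread = max(values) - min(values)
    return [(v - mu) / spread for v in values]

# Using the module's stated statistics (mu = 125 mm, range = 150 mm):
scaled = mean_normalise([50, 100, 160, 200], mu=125, spread=150)
print([round(v, 2) for v in scaled])  # [-0.5, -0.17, 0.23, 0.5]
```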
3. Z-Score Normalisation
Standardise the feature to have mean 0 and standard deviation 1. This is the most commonly used method in practice.
- Compute the mean: μ = (1/m) · Σᵢ xᵢ
- Compute the standard deviation: σ = √( (1/m) · Σᵢ (xᵢ − μ)² )
- Scale every value: x̃ = (x − μ) / σ
For example, rainfall with μ = 125 mm and σ = 45 mm:
| Original x₁ | x̃ = (x₁ − 125) / 45 |
|---|---|
| 50 mm | −1.67 |
| 100 mm | −0.56 |
| 160 mm | +0.78 |
| 200 mm | +1.67 |
Scaled range: roughly −1.67 to 1.67 ✓
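The same pattern, sketched in code. As before, `mu` and `sigma` can be supplied from the full dataset (the worked example uses μ = 125, σ = 45) or computed from the values at hand:

```python
def z_score_normalise(values, mu=None, sigma=None):
    """x_tilde = (x - mu) / sigma: result has mean 0 and std deviation 1."""
    m = len(values)
    if mu is None:
        mu = sum(values) / m
    if sigma is None:
        sigma = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    return [(v - mu) / sigma for v in values]

# Using the module's stated statistics (mu = 125 mm, sigma = 45 mm):
scaled = z_score_normalise([50, 100, 160, 200], mu=125, sigma=45)
print([round(v, 2) for v in scaled])  # [-1.67, -0.56, 0.78, 1.67]
```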
Z-score normalisation always produces values centred at 0. Values below the mean become negative — this is expected and correct, not an error. Any of the three methods works; z-score is preferred when you do not know the theoretical max or the data has outliers.
After z-score normalisation, a feature value becomes −1.8. What does this mean?
Creating New Features from Existing Ones
Feature scaling adjusts the range of existing inputs. Feature creation goes further — it defines entirely new inputs by combining or transforming the ones you already have. The goal is to give the model a variable that directly captures a relationship that the raw features cannot express linearly.
Going back to the crop yield problem: suppose the model uses land width x1 and land depth x2 as separate features. But what actually drives yield is the total area of land being farmed — not width or depth alone. A plot that is 20 m wide and 30 m deep is fundamentally different from one that is 20 m wide and 10 m deep, even though x1 is the same in both cases.
You can engineer a new feature:

x₃ = x₁ × x₂  (area = width × depth)

The model now becomes:

ŷ = w₁x₁ + w₂x₂ + w₃x₃ + b
Gradient descent learns whether area (x3) is a better predictor than width (x1) and depth (x2) separately — if w3 ends up large and w1, w2 end up near zero, the model is telling you that area is what matters, not the individual dimensions.
You are not limited to products. Common feature creation patterns:
- Product — x3 = x1 × x2 captures interaction between two features.
- Ratio — x3 = x1 / x2 captures relative scale (e.g. yield per mm of rain).
- Power — x3 = x1² captures non-linear relationships with a single feature.
- Log — x3 = log(x1) compresses large-range features and linearises exponential relationships.
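The four patterns above can be sketched in a few lines. `create_features` and the dictionary keys are illustrative names, and the 20 m × 30 m plot comes from the example earlier in this section:

```python
import math

def create_features(x1, x2):
    """The four feature-creation patterns listed above."""
    return {
        "product": x1 * x2,    # interaction, e.g. area = width * depth
        "ratio": x1 / x2,      # relative scale between two features
        "power": x1 ** 2,      # non-linear term from a single feature
        "log": math.log(x1),   # compresses wide-range features
    }

feats = create_features(20.0, 30.0)
print(feats["product"])  # area of a 20 m x 30 m plot -> 600.0
```

Remember that engineered features inherit their parents' scales — an area feature built from two ~100-unit features spans ~10,000 units, so it needs scaling too.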
The choice of what to engineer is guided by domain knowledge — understanding what physically or logically drives the output.
A model predicts house price from frontage (x₁, metres) and depth (x₂, metres). You engineer x₃ = x₁ × x₂. After training, w₃ is large and w₁, w₂ are near zero. What does this tell you?
