Feature engineering is the process of transforming raw inputs into better representations for your model. In this module we cover two key techniques:
- Feature scaling — rescale inputs so they have comparable ranges, which makes gradient descent run much faster.
- Feature creation — combine or transform existing features to create new ones that capture relationships the raw inputs cannot express on their own.
Why Feature Ranges Matter
Take the crop yield problem with two features: rainfall in millimetres and soil quality on a scale of 1 to 10.
| Feature | Min | Max | Range |
|---|---|---|---|
| x1 — Rainfall (mm) | 50 | 200 | 150 |
| x2 — Soil Quality (1–10) | 1 | 10 | 9 |
Rainfall spans 150 units. Soil quality spans only 9 units. Their ranges differ by a factor of about 17.
Now suppose a trained model learns these weights:
| Parameter | Value | Interpretation |
|---|---|---|
| w1 (rainfall) | 0.03 | Each extra mm of rain adds 0.03 t/ha |
| w2 (soil quality) | 0.90 | Each extra quality point adds 0.90 t/ha |
| b | −3.0 | Baseline offset |
Here is the prediction broken down side by side for two plots:
| Plot A — 100 mm rain, soil 5 | Plot B — 160 mm rain, soil 7 | |
|---|---|---|
| w1 · x1 | 0.03 × 100 = 3.0 | 0.03 × 160 = 4.8 |
| w2 · x2 | 0.90 × 5 = 4.5 | 0.90 × 7 = 6.3 |
| + b | −3.0 | −3.0 |
| ŷ (yield t/ha) | 4.5 | 8.1 |
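The two predictions above can be reproduced in a few lines. This sketch uses the weights and plot values from the tables; `predict_yield` is an illustrative name, not part of any library:

```python
# Weights learned by the model (taken from the table above)
w1, w2, b = 0.03, 0.90, -3.0

def predict_yield(rain_mm, soil_quality):
    """Linear model: yield (t/ha) = w1 * rain + w2 * soil + b."""
    return w1 * rain_mm + w2 * soil_quality + b

print(round(predict_yield(100, 5), 2))  # Plot A -> 4.5
print(round(predict_yield(160, 7), 2))  # Plot B -> 8.1
```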
Notice that w1 = 0.03 is small and w2 = 0.90 is large — even though both features matter equally to yield. The model compensates for the large range of rainfall by learning a tiny weight, and compensates for the small range of soil quality by learning a large weight.
Wide-range feature → small optimal weight. Narrow-range feature → large optimal weight. Feature range and its optimal weight are inversely related — the model must balance out the scale difference through the weight.
How This Slows Gradient Descent
Recall that the gradient for each weight is:

∂J/∂wⱼ = (1/m) · Σᵢ ( ŷ⁽ⁱ⁾ − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾

The gradient is scaled by the feature values xⱼ⁽ⁱ⁾. For rainfall, x₁ values are 80, 100, 120, 160 — large numbers. For soil quality, x₂ values are 3, 5, 7, 9 — small numbers.
This means:
| Gradient magnitude | Cost sensitivity | |
|---|---|---|
| ∂J/∂w1 (rainfall) | Large — scaled by ~100s | A tiny step in w1 causes a big change in cost |
| ∂J/∂w2 (soil quality) | Small — scaled by ~1–10 | A large step in w2 barely changes cost |
The cost surface J(w1, w2) becomes elongated — very steep in the w1 direction and very shallow in the w2 direction. To prevent gradient descent from overshooting in the steep direction, you must use a very small learning rate. That same small learning rate then makes progress in the shallow direction painfully slow. Gradient descent ends up taking thousands of tiny zigzag steps before reaching the minimum.
Unequal feature ranges → unequal gradients → the learning rate must be set small enough for the steepest direction → slow convergence in all directions. Scaling features to a similar range fixes this at the source.
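A small numeric sketch of the lopsided gradients, using the rainfall and soil-quality values above paired with made-up yields (the dataset is hypothetical, chosen only to illustrate the scale difference):

```python
# Hypothetical training set: (rainfall mm, soil quality, yield t/ha)
data = [(80, 3, 2.0), (100, 5, 4.5), (120, 7, 6.5), (160, 9, 9.0)]

def gradients(w1, w2, b):
    """Gradients of the mean squared error J with respect to w1 and w2."""
    m = len(data)
    g1 = g2 = 0.0
    for x1, x2, y in data:
        err = (w1 * x1 + w2 * x2 + b) - y   # prediction error for one example
        g1 += err * x1 / m                  # scaled by rainfall values (~100s)
        g2 += err * x2 / m                  # scaled by soil values (~1-10)
    return g1, g2

g1, g2 = gradients(0.0, 0.0, 0.0)
print(g1, g2)  # |g1| comes out more than ten times larger than |g2|
```

Because ∂J/∂w₁ dwarfs ∂J/∂w₂ at every step, any learning rate safe for w₁ is far too timid for w₂.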
Rainfall ranges from 50–200 mm (range 150) and soil quality ranges from 1–10 (range 9). After training, which weight do you expect to be larger?
How to Achieve Feature Scaling
The goal is to rescale every feature so that all features have a comparable range of values, ideally between −1 and 1.
| Range | Status |
|---|---|
| −1 to 1 | Ideal |
| −3 to 3 | Acceptable |
| 0 to 3 | Acceptable |
| −100 to 100 | Too large — rescale |
| −0.001 to 0.001 | Too small — rescale |
| 100 to 105 | Too large in absolute value — rescale |
The last case is easy to miss: the range is only 5, but absolute values around 100 produce large gradients. Any feature with large absolute values needs scaling, even if its spread is small.
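The table's rule of thumb can be sketched as a small helper. The thresholds (3.0 and 0.001) are rough values taken from the table, not a standard:

```python
def needs_rescaling(values, big=3.0, small=0.001):
    """Flag a feature whose absolute values are much larger than ~3
    or much smaller than ~0.001 (rough thresholds from the table above)."""
    biggest = max(abs(v) for v in values)
    return biggest > big or biggest <= small

print(needs_rescaling([50, 200]))    # rainfall: large values -> True
print(needs_rescaling([-2, 0, 2]))   # already comfortable -> False
print(needs_rescaling([100, 105]))   # small spread, large magnitudes -> True
```

Note that the last call returns True even though the spread is only 5 — the check looks at absolute magnitude, not spread.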
There are three standard techniques. Each produces a scaled version x̃ of the original feature x.
1. Max Normalisation
Divide every value by the maximum value of the feature. The result lies between 0 and 1 (for positive features).
- Find the maximum: max(x)
- Divide every value: x̃ = x / max(x)
For example, rainfall with max = 200 mm:
| Original x₁ | x̃ = x₁ / 200 |
|---|---|
| 50 mm | 0.25 |
| 100 mm | 0.50 |
| 160 mm | 0.80 |
| 200 mm | 1.00 |
Scaled range: 0.25 to 1.0 ✓
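The two steps above translate directly into code. A minimal sketch, using the rainfall values from the table:

```python
def max_normalise(values):
    """x_tilde = x / max(x): maps a positive feature into (0, 1]."""
    m = max(values)
    return [v / m for v in values]

rainfall = [50, 100, 160, 200]
print(max_normalise(rainfall))  # [0.25, 0.5, 0.8, 1.0]
```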
2. Mean Normalisation
Centre the feature at zero by subtracting the mean, then divide by the range. The result typically lies between −1 and 1.
- Compute the mean: μ = (1/m) · Σᵢ xᵢ
- Compute the range: range = max(x) − min(x)
- Scale every value: x̃ = (x − μ) / range
For example, rainfall with μ = 125 mm and range = 200 − 50 = 150:
| Original x₁ | x̃ = (x₁ − 125) / 150 |
|---|---|
| 50 mm | −0.50 |
| 100 mm | −0.17 |
| 160 mm | +0.23 |
| 200 mm | +0.50 |
Scaled range: −0.50 to 0.50 ✓
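The three steps above as a sketch. The optional `mu` and `spread` arguments let you pass in statistics computed over the full dataset (as the worked example does, with μ = 125); left unset, they are computed from the values given:

```python
def mean_normalise(values, mu=None, spread=None):
    """x_tilde = (x - mu) / (max - min): centres at 0, range roughly [-1, 1]."""
    if mu is None:
        mu = sum(values) / len(values)
    if spread is None:
        spread = max(values) - min(values)
    return [(v - mu) / spread for v in values]

# Using the module's stated statistics (mu = 125 mm, range = 150 mm):
scaled = mean_normalise([50, 100, 160, 200], mu=125, spread=150)
print([round(v, 2) for v in scaled])  # [-0.5, -0.17, 0.23, 0.5]
```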
3. Z-Score Normalisation
Standardise the feature to have mean 0 and standard deviation 1. This is the most commonly used method in practice.
- Compute the mean: μ = (1/m) · Σᵢ xᵢ
- Compute the standard deviation: σ = √( (1/m) · Σᵢ (xᵢ − μ)² )
- Scale every value: x̃ = (x − μ) / σ
For example, rainfall with μ = 125 mm and σ = 45 mm:
| Original x₁ | x̃ = (x₁ − 125) / 45 |
|---|---|
| 50 mm | −1.67 |
| 100 mm | −0.56 |
| 160 mm | +0.78 |
| 200 mm | +1.67 |
Scaled range: roughly −1.67 to 1.67 ✓
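The same pattern, sketched in code. As before, `mu` and `sigma` can be supplied from the full dataset (the worked example uses μ = 125, σ = 45) or computed from the values at hand:

```python
def z_score_normalise(values, mu=None, sigma=None):
    """x_tilde = (x - mu) / sigma: result has mean 0 and std deviation 1."""
    m = len(values)
    if mu is None:
        mu = sum(values) / m
    if sigma is None:
        sigma = (sum((v - mu) ** 2 for v in values) / m) ** 0.5
    return [(v - mu) / sigma for v in values]

# Using the module's stated statistics (mu = 125 mm, sigma = 45 mm):
scaled = z_score_normalise([50, 100, 160, 200], mu=125, sigma=45)
print([round(v, 2) for v in scaled])  # [-1.67, -0.56, 0.78, 1.67]
```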
Z-score normalisation always produces values centred at 0. Values below the mean become negative — this is expected and correct, not an error. Any of the three methods works; z-score is preferred when you do not know the theoretical max or the data has outliers.
After z-score normalisation, a feature value becomes −1.8. What does this mean?
Creating New Features from Existing Ones
Feature scaling adjusts the range of existing inputs. Feature creation goes further — it defines entirely new inputs by combining or transforming the ones you already have. The goal is to give the model a variable that directly captures a relationship that the raw features cannot express linearly.
Going back to the crop yield problem: suppose the model uses land width x1 and land depth x2 as separate features. But what actually drives yield is the total area of land being farmed — not width or depth alone. A plot that is 20 m wide and 30 m deep is fundamentally different from one that is 20 m wide and 10 m deep, even though x1 is the same in both cases.
You can engineer a new feature:

x₃ = x₁ × x₂  (area = width × depth)

The model now becomes:

ŷ = w₁x₁ + w₂x₂ + w₃x₃ + b
Gradient descent learns whether area (x3) is a better predictor than width (x1) and depth (x2) separately — if w3 ends up large and w1, w2 end up near zero, the model is telling you that area is what matters, not the individual dimensions.
You are not limited to products. Common feature creation patterns:
- Product — x3 = x1 × x2 captures interaction between two features.
- Ratio — x3 = x1 / x2 captures relative scale (e.g. yield per mm of rain).
- Power — x3 = x1² captures non-linear relationships with a single feature.
- Log — x3 = log(x1) compresses large-range features and linearises exponential relationships.
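The four patterns above can be sketched in a few lines. `create_features` and the dictionary keys are illustrative names, and the 20 m × 30 m plot comes from the example earlier in this section:

```python
import math

def create_features(x1, x2):
    """The four feature-creation patterns listed above."""
    return {
        "product": x1 * x2,    # interaction, e.g. area = width * depth
        "ratio": x1 / x2,      # relative scale between two features
        "power": x1 ** 2,      # non-linear term from a single feature
        "log": math.log(x1),   # compresses wide-range features
    }

feats = create_features(20.0, 30.0)
print(feats["product"])  # area of a 20 m x 30 m plot -> 600.0
```

Remember that engineered features inherit their parents' scales — an area feature built from two ~100-unit features spans ~10,000 units, so it needs scaling too.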
The choice of what to engineer is guided by domain knowledge — understanding what physically or logically drives the output.
A model predicts house price from frontage (x₁, metres) and depth (x₂, metres). You engineer x₃ = x₁ × x₂. After training, w₃ is large and w₁, w₂ are near zero. What does this tell you?
