
Feature Engineering


Feature engineering is the process of transforming raw inputs into better representations for your model. In this module we cover two key techniques:

  • Feature scaling — rescale inputs so they have comparable ranges, which makes gradient descent run much faster.
  • Feature creation — combine or transform existing features to create new ones that capture relationships the raw inputs cannot express on their own.

Why Feature Ranges Matter

Take the crop yield problem with two features: rainfall in millimetres and soil quality on a scale of 1 to 10.

  Feature                     Min   Max   Range
  x1 — Rainfall (mm)           50   200     150
  x2 — Soil Quality (1–10)      1    10       9

Rainfall spans 150 units. Soil quality spans only 9 units. Their ranges differ by a factor of about 17.

Now suppose a trained model learns these weights:

  Parameter           Value   Interpretation
  w1 (rainfall)        0.03   Each extra mm of rain adds 0.03 t/ha
  w2 (soil quality)    0.90   Each extra quality point adds 0.90 t/ha
  b                   −3.0    Baseline offset

Here is the prediction broken down side by side for two plots:

  Term              Plot A — 100 mm rain, soil 5   Plot B — 160 mm rain, soil 7
  w1 · x1           0.03 × 100 = 3.0               0.03 × 160 = 4.8
  w2 · x2           0.90 × 5 = 4.5                 0.90 × 7 = 6.3
  + b               −3.0                           −3.0
  ŷ (yield t/ha)    4.5                            8.1

Notice that w1 = 0.03 is small and w2 = 0.90 is large — even though both features matter equally to yield. The model compensates for the large range of rainfall by learning a tiny weight, and compensates for the small range of soil quality by learning a large weight.

Wide-range feature → small optimal weight. Narrow-range feature → large optimal weight. Feature range and its optimal weight are inversely related — the model must balance out the scale difference through the weight.
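You can check this inverse relationship numerically. The sketch below uses synthetic data and numpy's least-squares solver as a stand-in for a fully trained model: it fits the same targets twice, once on raw rainfall and once with rainfall's range shrunk by a factor of 10. The optimal rainfall weight grows by exactly the factor the range shrank.

```python
import numpy as np

# Synthetic crop data consistent with the weights in the table above.
rng = np.random.default_rng(0)
rain = rng.uniform(50, 200, 100)        # wide-range feature
soil = rng.uniform(1, 10, 100)          # narrow-range feature
y = 0.03 * rain + 0.90 * soil - 3.0     # exactly linear targets, no noise

# Fit on the raw features...
A = np.column_stack([rain, soil, np.ones_like(rain)])
w_raw, *_ = np.linalg.lstsq(A, y, rcond=None)

# ...then refit with rainfall's range shrunk by a factor of 10.
A10 = np.column_stack([rain / 10, soil, np.ones_like(rain)])
w_div, *_ = np.linalg.lstsq(A10, y, rcond=None)

print(round(w_raw[0], 4), round(w_div[0], 4))   # 0.03 0.3
```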

How This Slows Gradient Descent

Recall that the gradient for each weight is:

∂J/∂wⱼ = (1/m) · Σᵢ ( f(x⁽ⁱ⁾) − y⁽ⁱ⁾ ) · xⱼ⁽ⁱ⁾

The gradient is scaled by the feature values xⱼ⁽ⁱ⁾. For rainfall, x1 values are 80, 100, 120, 160 — large numbers. For soil quality, x2 values are 3, 5, 7, 9 — small numbers.

This means:

  Term                    Gradient magnitude        Cost sensitivity
  ∂J/∂w1 (rainfall)       Large — scaled by ~100s   A tiny step in w1 causes a big change in cost
  ∂J/∂w2 (soil quality)   Small — scaled by ~1–10   A large step in w2 barely changes cost

The cost surface J(w1, w2) becomes elongated — very steep in the w1 direction and very shallow in the w2 direction. To prevent gradient descent from overshooting in the steep direction, you must use a very small learning rate. That same small learning rate then makes progress in the shallow direction painfully slow. Gradient descent ends up taking thousands of tiny zigzag steps before reaching the minimum.

Unequal feature ranges → unequal gradients → the learning rate must be set small enough for the steepest direction → slow convergence in all directions. Scaling features to a similar range fixes this at the source.
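A minimal gradient-descent sketch makes the slowdown concrete (the data is synthetic and the helper is ours, not from the lesson): with raw features the learning rate must stay tiny to avoid overshooting in the steep rainfall direction, so after the same number of steps the scaled run ends up with a far lower cost.

```python
import numpy as np

# Synthetic data: one wide-range feature, one narrow-range feature.
rng = np.random.default_rng(1)
X = np.column_stack([rng.uniform(50, 200, 100),   # rainfall (mm), range ~150
                     rng.uniform(1, 10, 100)])    # soil quality, range ~9
y = 0.03 * X[:, 0] + 0.90 * X[:, 1] - 3.0         # exactly linear targets

def run_gd(X, y, alpha, steps):
    """Plain batch gradient descent; returns the final cost J."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(steps):
        err = X @ w + b - y              # f(x) - y for every example
        w -= alpha * (X.T @ err) / m     # dJ/dw_j = (1/m) * sum err * x_j
        b -= alpha * err.mean()          # dJ/db   = (1/m) * sum err
    return ((X @ w + b - y) ** 2).mean() / 2

# Raw features: alpha must be tiny or the rainfall direction diverges.
cost_raw = run_gd(X, y, alpha=1e-5, steps=1000)

# Z-score scaled features: a far larger alpha is stable and converges fast.
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0)
cost_scaled = run_gd(X_scaled, y, alpha=0.1, steps=1000)

print(cost_raw > cost_scaled)   # True: same step budget, much lower cost when scaled
```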

Quick Check

Rainfall ranges from 50–200 mm (range 150) and soil quality ranges from 1–10 (range 9). After training, which weight do you expect to be larger?

How to Achieve Feature Scaling

The goal is to rescale every feature so that all features have a comparable range of values, ideally between −1 and 1.

  Range              Status
  −1 to 1            Ideal
  −3 to 3            Acceptable
  0 to 3             Acceptable
  −100 to 100        Too large — rescale
  −0.001 to 0.001    Too small — rescale
  100 to 105         Too large in absolute value — rescale

The last case is easy to miss: the range is only 5, but absolute values around 100 produce large gradients. Any feature with large absolute values needs scaling, even if its spread is small.

There are three standard techniques. Each produces a scaled version x̃ of the original feature x.

1. Max Normalisation

Divide every value by the maximum value of the feature. The result lies between 0 and 1 (for positive features).

  1. Find the maximum: max(x)
  2. Divide every value: x̃ = x / max(x)

For example, rainfall with max = 200 mm:

  Original x₁   x̃ = x₁ / 200
  50 mm         0.25
  100 mm        0.50
  160 mm        0.80
  200 mm        1.00

Scaled range: 0.25 to 1.0 ✓
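As a quick sketch in numpy (the array holds the rainfall values from the table):

```python
import numpy as np

x = np.array([50.0, 100.0, 160.0, 200.0])   # rainfall in mm
x_scaled = x / x.max()                       # divide every value by the maximum
print(x_scaled.tolist())                     # [0.25, 0.5, 0.8, 1.0]
```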

2. Mean Normalisation

Centre the feature at zero by subtracting the mean, then divide by the range. The result typically lies between −1 and 1.

  1. Compute the mean: μ = (1/m) · Σᵢ xᵢ
  2. Compute the range: range = max(x) − min(x)
  3. Scale every value: x̃ = (x − μ) / range

For example, rainfall with μ = 125 mm and range = 200 − 50 = 150:

  Original x₁   x̃ = (x₁ − 125) / 150
  50 mm         −0.50
  100 mm        −0.17
  160 mm        +0.23
  200 mm        +0.50

Scaled range: −0.50 to 0.50 ✓
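The same steps in numpy. One caveat: the lesson's μ = 125 is the mean over the full dataset, while the mean of just these four sample values is 127.5, so the scaled numbers shift slightly from the table.

```python
import numpy as np

x = np.array([50.0, 100.0, 160.0, 200.0])     # rainfall in mm
mu = x.mean()                                  # 127.5 over these four values
x_scaled = (x - mu) / (x.max() - x.min())      # subtract the mean, divide by the range
print(x_scaled.tolist())
```

After scaling, the values are centred on zero and span exactly one unit.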

3. Z-Score Normalisation

Standardise the feature to have mean 0 and standard deviation 1. This is the most commonly used method in practice.

  1. Compute the mean: μ = (1/m) · Σᵢ xᵢ
  2. Compute the standard deviation: σ = √( (1/m) · Σᵢ (xᵢ − μ)² )
  3. Scale every value: x̃ = (x − μ) / σ

For example, rainfall with μ = 125 mm and σ = 45 mm:

  Original x₁   x̃ = (x₁ − 125) / 45
  50 mm         −1.67
  100 mm        −0.56
  160 mm        +0.78
  200 mm        +1.67

Scaled range: roughly −1.67 to 1.67 ✓
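In numpy (note that np.std computes the population standard deviation by default, matching σ in step 2):

```python
import numpy as np

x = np.array([50.0, 100.0, 160.0, 200.0])   # rainfall in mm
mu, sigma = x.mean(), x.std()               # mean and population std
x_scaled = (x - mu) / sigma                 # z-score normalisation
# By construction the result has mean 0 and standard deviation 1.
print(x_scaled.mean(), x_scaled.std())
```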

Z-score normalisation always produces values centred at 0. Values below the mean become negative — this is expected and correct, not an error. Any of the three methods works; z-score is preferred when you do not know the theoretical max or the data has outliers.

Quick Check

After z-score normalisation, a feature value becomes −1.8. What does this mean?

Creating New Features from Existing Ones

Feature scaling adjusts the range of existing inputs. Feature creation goes further — it defines entirely new inputs by combining or transforming the ones you already have. The goal is to give the model a variable that directly captures a relationship that the raw features cannot express linearly.

Going back to the land measurements in the crop yield problem: suppose the model uses plot frontage x1 and plot depth x2 (both in metres) as separate features. But what actually drives yield is the total area of land being farmed — not frontage or depth alone. A plot that is 20 m wide and 30 m deep is fundamentally different from one that is 20 m wide and 10 m deep, even though x1 is the same in both cases.

You can engineer a new feature:

x3 = x1 × x2    (area = frontage × depth)

The model now becomes:

fw,b(x) = w1x1 + w2x2 + w3x3 + b

Gradient descent learns whether area (x3) is a better predictor than frontage (x1) and depth (x2) separately — if w3 ends up large and w1, w2 end up near zero, the model is telling you that area is what matters, not the individual dimensions.

You are not limited to products. Common feature creation patterns:

  • Product — x3 = x1 × x2 captures interaction between two features.
  • Ratio — x3 = x1 / x2 captures relative scale (e.g. yield per mm of rain).
  • Power — x3 = x1² captures non-linear relationships with a single feature.
  • Log — x3 = log(x1) compresses large-range features and linearises exponential relationships.

The choice of what to engineer is guided by domain knowledge — understanding what physically or logically drives the output.
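Each pattern above is a single numpy expression. A sketch with made-up plot measurements (the variable names are ours, for illustration only):

```python
import numpy as np

frontage = np.array([20.0, 20.0, 35.0])     # x1, metres
depth    = np.array([30.0, 10.0, 40.0])     # x2, metres
rain     = np.array([100.0, 160.0, 80.0])   # rainfall, mm

area        = frontage * depth   # product: x3 = x1 * x2, an interaction feature
rain_per_m2 = rain / area        # ratio: relative scale
depth_sq    = depth ** 2         # power: non-linear in a single feature
log_area    = np.log(area)       # log: compresses large-range features

# Engineered features are stacked alongside the originals as model inputs.
X = np.column_stack([frontage, depth, area])
print(area.tolist())   # [600.0, 200.0, 1400.0]
```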

Quick Check

A model predicts house price from frontage (x₁, metres) and depth (x₂, metres). You engineer x₃ = x₁ × x₂. After training, w₃ is large and w₁, w₂ are near zero. What does this tell you?

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Feature Engineering and see your score on the leaderboard.

Take the Quiz

Up next

Next, we cover the bias-variance tradeoff — the root causes of model error and how to diagnose whether your model is underfitting or overfitting.

The Bias-Variance Tradeoff