What Is Logistic Regression?
Logistic regression is a binary classification algorithm. It takes a set of input features and outputs a probability between 0 and 1 — the probability that the example belongs to class 1.
Consider predicting diabetes from a single feature: fasting blood glucose level (mg/dL). Before building any model, look at what the data tells you:
| Glucose (mg/dL) | Diabetic? | What the data says |
|---|---|---|
| 68 | No (0) | Clearly healthy — very low risk |
| 82 | No (0) | Normal range — low risk |
| 105 | No (0) | Slightly elevated — risk starting to rise |
| 130 | Yes (1) | Above normal — meaningfully elevated risk |
| 158 | Yes (1) | High — strong signal for diabetes |
| 210 | Yes (1) | Very high — near-certain |
The pattern is not a sharp cliff at some threshold. Risk rises slowly at first, accelerates through the middle range, then flattens as it approaches certainty. That shape — flat, steep, flat — is exactly what the sigmoid function produces.
Despite the name, logistic regression is a classification algorithm, not a regression algorithm. The "regression" refers to the underlying linear function — the sigmoid layer converts it into a probability output.
Why Not Linear Regression?
A linear model computes ŷ = w·glucose + b. The problem: for glucose = 280 mg/dL, this gives ŷ = 1.4 — a probability above 1 is meaningless. For glucose = 50 mg/dL, it gives ŷ = −0.1 — a negative probability is equally meaningless.
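To see the failure concretely, here is a minimal sketch; the slope and intercept are back-solved from the two data points quoted above (an illustrative line, not a trained model):

```python
# Back-solve w and b so that glucose 280 → ŷ = 1.4 and glucose 50 → ŷ = −0.1,
# matching the numbers in the text.
w = 1.5 / 230          # ≈ 0.0065
b = -0.1 - w * 50      # ≈ −0.426

for glucose in (50, 105, 280):
    y_hat = w * glucose + b
    print(glucose, round(y_hat, 2))

# glucose 280 → ŷ = 1.4 (above 1) and glucose 50 → ŷ = −0.1 (below 0):
# neither is a valid probability.
```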
You need a function that:
- Accepts any real number as input (the linear output can be anything)
- Always outputs a value in (0, 1)
- Rises slowly at the extremes and steeply in the middle
That function is the sigmoid.
The Sigmoid Function
The sigmoid function takes any real number and squashes it into the range (0, 1):

σ(z) = 1 / (1 + e^(−z))

Here e is Euler's number (≈ 2.718) — the base of the natural logarithm. It appears here because exponential growth and decay naturally model how evidence accumulates: doubling confidence does not double the probability; it shifts it along the S-curve.
- z ≫ 0 (large positive): e^−z → 0, so σ(z) → 1. Strong positive evidence → near-certain class 1.
- z = 0: σ(0) = 0.5. No evidence either way — the decision boundary.
- z ≪ 0 (large negative): e^−z → ∞, so σ(z) → 0. Strong negative evidence → near-certain class 0.
- The output is always strictly between 0 and 1 — never exactly 0 or 1.
| z | σ(z) | Interpretation |
|---|---|---|
| −4 | 0.018 | Near-certain class 0 |
| −2 | 0.119 | Low probability of class 1 |
| 0 | 0.500 | Decision boundary — 50/50 |
| +2 | 0.881 | High probability of class 1 |
| +4 | 0.982 | Near-certain class 1 |
The steepest part of the sigmoid is at z = 0. This is where the model is most uncertain. As |z| grows, the model becomes increasingly confident — and the gradient of the sigmoid shrinks toward zero.
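Both claims — the table values and the shrinking gradient — can be checked in a few lines, using the standard identity σ′(z) = σ(z)(1 − σ(z)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Reproduce the table above
for z in (-4, -2, 0, 2, 4):
    print(z, round(float(sigmoid(z)), 3))

# The gradient σ'(z) = σ(z)(1 − σ(z)) peaks at z = 0 and
# shrinks toward zero as |z| grows
grad = lambda z: sigmoid(z) * (1 - sigmoid(z))
print(round(float(grad(0)), 3), round(float(grad(4)), 3))
```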
Problem Setup and Variables
Before writing any code, you define the variables that describe the problem. These conventions are used throughout machine learning literature.
| Variable | Meaning |
|---|---|
| X | Input matrix — shape (m, n) |
| Y | Output labels — shape (1, m), values 0 or 1 |
| m | Number of examples in the current set (m_train or m_test) |
| m_train | Number of training examples |
| m_test | Number of test examples |
| n | Number of input features |
| W | Weight vector — shape (n, 1) |
| b | Bias — scalar |
| ŷ (y_hat) | Predicted probability for one example |
| A | Predicted probabilities for all examples — shape (1, m) |
| L | Loss for one example |
| J | Cost — average loss over all m examples |
```python
import numpy as np

# Diabetes prediction problem
# Input: fasting glucose, BMI, age → 3 features
# Output: 1 (diabetic) or 0 (not diabetic)
m_train = 700   # training examples
m_test = 100    # test examples
n = 3           # features (glucose, BMI, age)

X_train = np.random.randn(m_train, n)            # (700, 3)
Y_train = np.random.randint(0, 2, (1, m_train))  # (1, 700)
X_test = np.random.randn(m_test, n)              # (100, 3)
Y_test = np.random.randint(0, 2, (1, m_test))    # (1, 100)
```

The Sigmoid Activation
The model computes a linear score z = W·x + b for each example — a real number that can be anything. Sigmoid converts that score into a probability: a linear function underneath, passed through the logistic (sigmoid) function — hence the name logistic regression. Applying a trained model (W = 0.042, b = −5.66, the same values used in the worked examples below) to the glucose data:

| Glucose (mg/dL) | Linear score z | σ(z) = P(diabetic) |
|---|---|---|
| 68 | −2.80 | 0.06 — 6% risk |
| 105 | −1.25 | 0.22 — 22% risk |
| 130 | −0.20 | 0.45 — 45% risk |
| 210 | +3.16 | 0.96 — 96% risk |
Low glucose → large negative z → sigmoid near 0. High glucose → large positive z → sigmoid near 1.
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass for all m examples
def forward_pass(X, W, b):
    """
    X: (m, n)
    W: (n, 1)
    b: scalar
    Returns A: (1, m) — probability per example
    """
    Z = np.dot(X, W) + b   # (m, 1) linear combination
    A = sigmoid(Z).T       # (1, m) — transpose to row vector
    return A
```

The Decision Boundary
The sigmoid outputs a probability. To make a yes/no prediction, you apply a threshold — typically 0.5. If σ(z) ≥ 0.5, predict class 1; otherwise predict class 0. The decision boundary is the input value where the model is exactly 50/50.
Since σ(z) = 0.5 when z = 0, the boundary is the glucose value where W·x + b = 0. For a trained model with W = 0.042 and b = −5.66, that is: 0.042 × glucose − 5.66 = 0 → glucose = 134.8 mg/dL.
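The boundary calculation can be sketched directly, using the trained values quoted above:

```python
import math

W, b = 0.042, -5.66   # trained values from the text

# Solve W·glucose + b = 0 for the boundary
boundary = -b / W
print(round(boundary, 1))   # ≈ 134.8 mg/dL

# At the boundary, the sigmoid is exactly 0.5
sigmoid = lambda z: 1 / (1 + math.exp(-z))
print(sigmoid(W * boundary + b))
```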
Below, three patients are walked through the full prediction pipeline:
Example 1 — glucose = 105 mg/dL (well below boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 105 − 5.66 | z = −1.25 |
| 2. Sigmoid | σ(−1.25) = 1 / (1 + e^1.25) | σ = 0.22 |
| 3. Threshold | 0.22 < 0.5 | Predict: not diabetic |
Example 2 — glucose = 135 mg/dL (at the boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 135 − 5.66 | z ≈ 0 |
| 2. Sigmoid | σ(0) = 1 / (1 + e^0) | σ = 0.50 |
| 3. Threshold | 0.50 ≥ 0.5 | Predict: diabetic (boundary case — under the ≥ rule, ties go to class 1) |
Example 3 — glucose = 170 mg/dL (well above boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 170 − 5.66 | z = +1.48 |
| 2. Sigmoid | σ(1.48) = 1 / (1 + e^−1.48) | σ = 0.81 |
| 3. Threshold | 0.81 ≥ 0.5 | Predict: diabetic |
The decision boundary is not a property of the sigmoid — it is a property of the weights. Changing W and b shifts or rotates the boundary. Training is the process of finding the W and b that places the boundary in the right location for your data.
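The three worked examples above can be reproduced in a few lines (same W and b as in the tables):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

W, b = 0.042, -5.66   # trained values from the worked examples

for glucose in (105, 135, 170):
    z = W * glucose + b                                   # step 1: linear score
    p = sigmoid(z)                                        # step 2: probability
    label = "diabetic" if p >= 0.5 else "not diabetic"    # step 3: threshold
    print(f"{glucose} mg/dL: z = {z:+.2f}, p = {p:.2f} -> {label}")
```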
Quick check: a patient has glucose = 120 mg/dL. Using W = 0.042 and b = −5.66, what does the model predict?
The Loss and Cost Functions
The model makes m predictions — one per training example. To train, we need a single number that summarises how wrong all those predictions are. We build that number bottom-up: first define the loss (error for one example), then stack losses into the cost (average error across all examples).
Step 1 — The Loss Function L
The loss L measures the error for a single prediction. For binary classification, we use binary cross-entropy:
- When y = 1: L = −log(ŷ) — loss is 0 if ŷ = 1, grows to ∞ as ŷ → 0
- When y = 0: L = −log(1 − ŷ) — loss is 0 if ŷ = 0, grows to ∞ as ŷ → 1
Both cases combine into one formula:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
Or, expanding ŷ = σ(W·x + b) to show the model parameters explicitly:

L = −[y log(σ(W·x + b)) + (1 − y) log(1 − σ(W·x + b))]
The same predicted probability ŷ produces a very different loss depending on what the true label y actually was. Compare the two cases side by side:
| Prediction ŷ | True label y | Loss L | Interpretation |
|---|---|---|---|
| 0.95 | 1 | 0.05 | Confident and correct — tiny penalty |
| 0.50 | 1 | 0.69 | Uncertain — moderate penalty |
| 0.05 | 1 | 3.00 | Confident and wrong — severe penalty |
| 0.05 | 0 | 0.05 | Confident and correct — tiny penalty |
| 0.95 | 0 | 3.00 | Confident and wrong — severe penalty |
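The table values follow directly from the loss formula; this short sketch reproduces them:

```python
import math

def bce_loss(y_hat, y):
    # L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Reproduce the table rows
for y_hat, y in [(0.95, 1), (0.50, 1), (0.05, 1), (0.05, 0), (0.95, 0)]:
    print(y_hat, y, round(bce_loss(y_hat, y), 2))
```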
Step 2 — The Cost Function J
The cost J is built by averaging L over all m training examples. It is the single number gradient descent minimises:

J = −(1/m) Σᵢ [y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]
If J is large, the model is wrong on average. Gradient descent adjusts W and b to push J down — we will take a deeper look at how this works in the next module.
In code, this means computing the loss on each individual training example and averaging the results into the total cost J.
Step 3 — Putting it into code
```python
def compute_cost(A, Y):
    """
    A: predicted probabilities — shape (1, m)
    Y: true labels — shape (1, m)
    Returns scalar cost J
    """
    m = Y.shape[1]
    J = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(J)
```

The cost J is the average of all individual losses. Minimising J by adjusting W and b is the entire training process. The algorithm that does this minimisation is gradient descent — we will take a deeper look in the next module.
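Putting the pieces together: the sketch below runs the forward pass and cost on synthetic data with untrained (zero) parameters. With W = 0 every prediction is 0.5, so the cost is −log(0.5) ≈ 0.693 — the baseline that gradient descent will push down from.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(X, W, b):
    Z = np.dot(X, W) + b     # (m, 1) linear combination
    return sigmoid(Z).T      # (1, m) probabilities

def compute_cost(A, Y):
    m = Y.shape[1]
    return float(-(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Synthetic data matching the shapes from the setup section
rng = np.random.default_rng(0)
m, n = 700, 3
X = rng.standard_normal((m, n))   # (700, 3)
Y = rng.integers(0, 2, (1, m))    # (1, 700)

# Untrained parameters: zero weights, zero bias
W = np.zeros((n, 1))
b = 0.0

A = forward_pass(X, W, b)   # every prediction is exactly 0.5
J = compute_cost(A, Y)
print(round(J, 4))          # ≈ 0.6931 = −log(0.5), the cost of pure guessing
```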
