What Is Logistic Regression?
Logistic regression is a binary classification algorithm. It takes a set of input features and outputs a probability between 0 and 1 — the probability that the example belongs to class 1.
Consider predicting diabetes from a single feature: fasting blood glucose level (mg/dL). Before building any model, look at what the data tells you:
| Glucose (mg/dL) | Diabetic? | What the data says |
|---|---|---|
| 68 | No (0) | Clearly healthy — very low risk |
| 82 | No (0) | Normal range — low risk |
| 105 | No (0) | Slightly elevated — risk starting to rise |
| 130 | Yes (1) | Above normal — meaningfully elevated risk |
| 158 | Yes (1) | High — strong signal for diabetes |
| 210 | Yes (1) | Very high — near-certain |
The pattern is not a sharp cliff at some threshold. Risk rises slowly at first, accelerates through the middle range, then flattens as it approaches certainty. That shape — flat, steep, flat — is exactly what the sigmoid function produces.
Despite the name, logistic regression is a classification algorithm, not a regression algorithm. The "regression" refers to the underlying linear function — the sigmoid layer converts it into a probability output.
Why Not Linear Regression?
A linear model computes ŷ = w·glucose + b. The problem: for glucose = 280 mg/dL, this gives ŷ = 1.4 — a probability above 1 is meaningless. For glucose = 50 mg/dL, it gives ŷ = −0.1 — a negative probability is equally meaningless.
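To see the failure concretely, here is a minimal sketch; the slope and intercept are back-solved from the two data points quoted above (an illustrative line, not a trained model):

```python
# Back-solve w and b so that glucose 280 → ŷ = 1.4 and glucose 50 → ŷ = −0.1,
# matching the numbers in the text.
w = 1.5 / 230          # ≈ 0.0065
b = -0.1 - w * 50      # ≈ −0.426

for glucose in (50, 105, 280):
    y_hat = w * glucose + b
    print(glucose, round(y_hat, 2))

# glucose 280 → ŷ = 1.4 (above 1) and glucose 50 → ŷ = −0.1 (below 0):
# neither is a valid probability.
```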
You need a function that:
- Accepts any real number as input (the linear output can be anything)
- Always outputs a value in (0, 1)
- Rises slowly at the extremes and steeply in the middle
That function is the sigmoid.
The Sigmoid Function
The sigmoid function takes any real number and squashes it into the range (0, 1):

σ(z) = 1 / (1 + e^(−z))

Here e is Euler's number (≈ 2.718) — the base of the natural logarithm. It appears here because exponential growth and decay naturally model how evidence accumulates: doubling confidence does not double the probability; it shifts it along the S-curve.
- z ≫ 0 (large positive): e^−z → 0, so σ(z) → 1. Strong positive evidence → near-certain class 1.
- z = 0: σ(0) = 0.5. No evidence either way — the decision boundary.
- z ≪ 0 (large negative): e^−z → ∞, so σ(z) → 0. Strong negative evidence → near-certain class 0.
- The output is always strictly between 0 and 1 — never exactly 0 or 1.
| z | σ(z) | Interpretation |
|---|---|---|
| −4 | 0.018 | Near-certain class 0 |
| −2 | 0.119 | Low probability of class 1 |
| 0 | 0.500 | Decision boundary — 50/50 |
| +2 | 0.881 | High probability of class 1 |
| +4 | 0.982 | Near-certain class 1 |
The steepest part of the sigmoid is at z = 0. This is where the model is most uncertain. As |z| grows, the model becomes increasingly confident — and the gradient of the sigmoid shrinks toward zero.
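Both claims — the table values and the shrinking gradient — can be checked in a few lines, using the standard identity σ′(z) = σ(z)(1 − σ(z)):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Reproduce the table above
for z in (-4, -2, 0, 2, 4):
    print(z, round(float(sigmoid(z)), 3))

# The gradient σ'(z) = σ(z)(1 − σ(z)) peaks at z = 0 and
# shrinks toward zero as |z| grows
grad = lambda z: sigmoid(z) * (1 - sigmoid(z))
print(round(float(grad(0)), 3), round(float(grad(4)), 3))
```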
Problem Setup and Variables
Before writing any code, you define the variables that describe the problem. These conventions are used throughout machine learning literature.
| Variable | Meaning |
|---|---|
| X | Input matrix — shape (m, n) |
| Y | Output labels — shape (1, m), values 0 or 1 |
| m | Number of examples in the current set (m_train or m_test) |
| m_train | Number of training examples |
| m_test | Number of test examples |
| n | Number of input features |
| W | Weight vector — shape (n, 1) |
| b | Bias — scalar |
| ŷ (y_hat) | Predicted probability for one example |
| A | Predicted probabilities for all examples — shape (1, m) |
| L | Loss for one example |
| J | Cost — average loss over all m examples |
```python
import numpy as np

# Diabetes prediction problem
# Input: fasting glucose, BMI, age → 3 features
# Output: 1 (diabetic) or 0 (not diabetic)
m_train = 700   # training examples
m_test = 100    # test examples
n = 3           # features (glucose, BMI, age)

X_train = np.random.randn(m_train, n)            # (700, 3)
Y_train = np.random.randint(0, 2, (1, m_train))  # (1, 700)
X_test = np.random.randn(m_test, n)              # (100, 3)
Y_test = np.random.randint(0, 2, (1, m_test))    # (1, 100)
```

The Sigmoid Activation
The model computes a linear score z = W·x + b for each example — a real number that can be anything. Sigmoid converts that score into a probability: a linear function underneath, passed through the logistic (sigmoid) function — hence the name logistic regression. Applying a trained model (W = 0.042, b = −5.66, the same values used in the worked examples below) to the glucose data:

| Glucose (mg/dL) | Linear score z | σ(z) = P(diabetic) |
|---|---|---|
| 68 | −2.80 | 0.06 — 6% risk |
| 105 | −1.25 | 0.22 — 22% risk |
| 130 | −0.20 | 0.45 — 45% risk |
| 210 | +3.16 | 0.96 — 96% risk |
Low glucose → large negative z → sigmoid near 0. High glucose → large positive z → sigmoid near 1.
```python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass for all m examples
def forward_pass(X, W, b):
    """
    X: (m, n)
    W: (n, 1)
    b: scalar
    Returns A: (1, m) — probability per example
    """
    Z = np.dot(X, W) + b   # (m, 1) linear combination
    A = sigmoid(Z).T       # (1, m) — transpose to row vector
    return A
```

The Decision Boundary
The sigmoid outputs a probability. To make a yes/no prediction, you apply a threshold — typically 0.5. If σ(z) ≥ 0.5, predict class 1; otherwise predict class 0. The decision boundary is the input value where the model is exactly 50/50.
Since σ(z) = 0.5 when z = 0, the boundary is the glucose value where W·x + b = 0. For a trained model with W = 0.042 and b = −5.66, that is: 0.042 × glucose − 5.66 = 0 → glucose = 134.8 mg/dL.
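The boundary calculation can be sketched directly, using the trained values quoted above:

```python
import math

W, b = 0.042, -5.66   # trained values from the text

# Solve W·glucose + b = 0 for the boundary
boundary = -b / W
print(round(boundary, 1))   # ≈ 134.8 mg/dL

# At the boundary, the sigmoid is exactly 0.5
sigmoid = lambda z: 1 / (1 + math.exp(-z))
print(sigmoid(W * boundary + b))
```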
Below, three patients are walked through the full prediction pipeline:
Example 1 — glucose = 105 mg/dL (well below boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 105 − 5.66 | z = −1.25 |
| 2. Sigmoid | σ(−1.25) = 1 / (1 + e^1.25) | σ = 0.22 |
| 3. Threshold | 0.22 < 0.5 | Predict: not diabetic |
Example 2 — glucose = 135 mg/dL (at the boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 135 − 5.66 | z ≈ 0 |
| 2. Sigmoid | σ(0) = 1 / (1 + e^0) | σ = 0.50 |
| 3. Threshold | 0.50 ≥ 0.5 | Predict: diabetic (boundary case — under the ≥ rule, ties go to class 1) |
Example 3 — glucose = 170 mg/dL (well above boundary)
| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 170 − 5.66 | z = +1.48 |
| 2. Sigmoid | σ(1.48) = 1 / (1 + e^−1.48) | σ = 0.81 |
| 3. Threshold | 0.81 ≥ 0.5 | Predict: diabetic |
The decision boundary is not a property of the sigmoid — it is a property of the weights. Changing W and b shifts or rotates the boundary. Training is the process of finding the W and b that places the boundary in the right location for your data.
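The three worked examples above can be reproduced in a few lines (same W and b as in the tables):

```python
import math

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

W, b = 0.042, -5.66   # trained values from the worked examples

for glucose in (105, 135, 170):
    z = W * glucose + b                                   # step 1: linear score
    p = sigmoid(z)                                        # step 2: probability
    label = "diabetic" if p >= 0.5 else "not diabetic"    # step 3: threshold
    print(f"{glucose} mg/dL: z = {z:+.2f}, p = {p:.2f} -> {label}")
```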
Quick check: a patient has glucose = 120 mg/dL. Using W = 0.042 and b = −5.66, what does the model predict?
The Loss and Cost Functions
The model makes m predictions — one per training example. To train, we need a single number that summarises how wrong all those predictions are. We build that number bottom-up: first define the loss (error for one example), then stack losses into the cost (average error across all examples).
Step 1 — The Loss Function L
The loss L measures the error for a single prediction. For binary classification, we use binary cross-entropy:
- When y = 1: L = −log(ŷ) — loss is 0 if ŷ = 1, grows to ∞ as ŷ → 0
- When y = 0: L = −log(1 − ŷ) — loss is 0 if ŷ = 0, grows to ∞ as ŷ → 1
Both cases combine into one formula:

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
Or, expanding ŷ = σ(W·x + b) to show the model parameters explicitly:

L = −[y log(σ(W·x + b)) + (1 − y) log(1 − σ(W·x + b))]
The same predicted probability ŷ produces a very different loss depending on what the true label y actually was. Compare the two cases side by side:
| Prediction ŷ | True label y | Loss L | Interpretation |
|---|---|---|---|
| 0.95 | 1 | 0.05 | Confident and correct — tiny penalty |
| 0.50 | 1 | 0.69 | Uncertain — moderate penalty |
| 0.05 | 1 | 3.00 | Confident and wrong — severe penalty |
| 0.05 | 0 | 0.05 | Confident and correct — tiny penalty |
| 0.95 | 0 | 3.00 | Confident and wrong — severe penalty |
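The table values follow directly from the loss formula; this short sketch reproduces them:

```python
import math

def bce_loss(y_hat, y):
    # L = −[y·log(ŷ) + (1 − y)·log(1 − ŷ)]
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Reproduce the table rows
for y_hat, y in [(0.95, 1), (0.50, 1), (0.05, 1), (0.05, 0), (0.95, 0)]:
    print(y_hat, y, round(bce_loss(y_hat, y), 2))
```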
Step 2 — The Cost Function J
The cost J is built by averaging L over all m training examples. It is the single number gradient descent minimises:

J = −(1/m) Σᵢ [y⁽ⁱ⁾ log(ŷ⁽ⁱ⁾) + (1 − y⁽ⁱ⁾) log(1 − ŷ⁽ⁱ⁾)]
If J is large, the model is wrong on average. Gradient descent adjusts W and b to push J down — we will take a deeper look at how this works in the next module.
In code, this means computing the loss on each individual training example and averaging the results into the total cost J.
Step 3 — Putting it into code
```python
def compute_cost(A, Y):
    """
    A: predicted probabilities — shape (1, m)
    Y: true labels — shape (1, m)
    Returns scalar cost J
    """
    m = Y.shape[1]
    J = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(J)
```

The cost J is the average of all individual losses. Minimising J by adjusting W and b is the entire training process. The algorithm that does this minimisation is gradient descent — we will take a deeper look in the next module.
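Putting the pieces together: the sketch below runs the forward pass and cost on synthetic data with untrained (zero) parameters. With W = 0 every prediction is 0.5, so the cost is −log(0.5) ≈ 0.693 — the baseline that gradient descent will push down from.

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(X, W, b):
    Z = np.dot(X, W) + b     # (m, 1) linear combination
    return sigmoid(Z).T      # (1, m) probabilities

def compute_cost(A, Y):
    m = Y.shape[1]
    return float(-(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

# Synthetic data matching the shapes from the setup section
rng = np.random.default_rng(0)
m, n = 700, 3
X = rng.standard_normal((m, n))   # (700, 3)
Y = rng.integers(0, 2, (1, m))    # (1, 700)

# Untrained parameters: zero weights, zero bias
W = np.zeros((n, 1))
b = 0.0

A = forward_pass(X, W, b)   # every prediction is exactly 0.5
J = compute_cost(A, Y)
print(round(J, 4))          # ≈ 0.6931 = −log(0.5), the cost of pure guessing
```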
