Deep Learning (Intermediate)

Logistic Regression

Tags: python, logistic regression, classification, numpy

What Is Logistic Regression?

Logistic regression is a binary classification algorithm. It takes a set of input features and outputs a probability between 0 and 1 — the probability that the example belongs to class 1.

Consider predicting diabetes from a single feature: fasting blood glucose level (mg/dL). Before building any model, look at what the data tells you:

| Glucose (mg/dL) | Diabetic? | What the data says |
|---|---|---|
| 68 | No (0) | Clearly healthy — very low risk |
| 82 | No (0) | Normal range — low risk |
| 105 | No (0) | Slightly elevated — risk starting to rise |
| 130 | Yes (1) | Above normal — meaningfully elevated risk |
| 158 | Yes (1) | High — strong signal for diabetes |
| 210 | Yes (1) | Very high — near-certain |

The pattern is not a sharp cliff at some threshold. Risk rises slowly at first, accelerates through the middle range, then flattens as it approaches certainty. That shape — flat, steep, flat — is exactly what the sigmoid function produces.

[Diagram: Patient Data — Blood Glucose vs Diabetes Outcome]
Each point is a patient. Blue dots are non-diabetic (y=0), pink dots are diabetic (y=1). Low glucose clusters near 0, high glucose clusters near 1 — but there is an overlap zone in the middle where the outcome is uncertain.

Despite the name, logistic regression is a classification algorithm, not a regression algorithm. The "regression" refers to the underlying linear function — the sigmoid layer converts it into a probability output.

Why Not Linear Regression?

A linear model computes ŷ = w·glucose + b. The problem: for glucose = 280 mg/dL, this gives ŷ = 1.4 — a probability above 1 is meaningless. For glucose = 50 mg/dL, it gives ŷ = −0.1 — a negative probability is equally meaningless.

[Diagram: Fitting a Straight Line to Binary Labels]
The same patient data with a straight line fitted to the labels. The line predicts −0.15 at low glucose and 1.21 at high glucose — both outside the valid probability range. A probability cannot be negative or greater than 1.

You need a function that:

  • Accepts any real number as input (the linear output can be anything)
  • Always outputs a value in (0, 1)
  • Rises slowly at the extremes and steeply in the middle

That function is the sigmoid.
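To make the failure concrete, here is a small sketch, assuming the six patients from the table above, that fits an ordinary least-squares line to the 0/1 labels and evaluates it at extreme glucose values (the fitted coefficients are illustrative, not the article's trained model):

```python
import numpy as np

# Six patients from the table: glucose (mg/dL) and diabetic label
glucose = np.array([68, 82, 105, 130, 158, 210])
labels  = np.array([0, 0, 0, 1, 1, 1])

# Least-squares straight line fitted directly to the binary labels
w, b = np.polyfit(glucose, labels, deg=1)

# Evaluate outside the observed range: the line leaves (0, 1)
print(w * 280 + b)   # greater than 1, not a valid probability
print(w * 50 + b)    # less than 0, not a valid probability
```

However the exact slope and intercept come out, a straight line is unbounded, so it must eventually exit the valid probability range in both directions.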

[Diagram: Linear Regression vs Sigmoid — Diabetes Data]
Blue dots: non-diabetic patients (y=0). Pink dots: diabetic patients (y=1). The dashed orange line is a linear fit — it predicts −0.15 and 1.21 at the extremes, both impossible probabilities. The green sigmoid curve stays within (0, 1) and matches the true S-shaped relationship.

The Sigmoid Function

The sigmoid function takes any real number and squashes it into the range (0, 1):

σ(z) = 1 / (1 + e^(−z))

Here e is Euler's number (≈ 2.718) — the base of the natural logarithm. It appears here because exponential growth and decay naturally model how evidence accumulates: doubling confidence does not double the probability, it shifts it along the S-curve.

  • z ≫ 0 (large positive): e^−z → 0, so σ(z) → 1. Strong positive evidence → near-certain class 1.
  • z = 0: σ(0) = 0.5. No evidence either way — the decision boundary.
  • z ≪ 0 (large negative): e^−z → ∞, so σ(z) → 0. Strong negative evidence → near-certain class 0.
  • The output is always strictly between 0 and 1 — never exactly 0 or 1.

| z | σ(z) | Interpretation |
|---|---|---|
| −4 | 0.018 | Near-certain class 0 |
| −2 | 0.119 | Low probability of class 1 |
| 0 | 0.500 | Decision boundary — 50/50 |
| +2 | 0.881 | High probability of class 1 |
| +4 | 0.982 | Near-certain class 1 |
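These values can be reproduced with a few lines of NumPy:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1)
    return 1 / (1 + np.exp(-z))

for z in [-4, -2, 0, 2, 4]:
    print(f"z = {z:+d}  sigmoid(z) = {sigmoid(z):.3f}")
# sigmoid(-4) = 0.018, sigmoid(0) = 0.500, sigmoid(+4) = 0.982
```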
[Diagram: Sigmoid Function σ(z) = 1 / (1 + e⁻ᶻ)]
Sigmoid maps any real number z to a probability in (0, 1). The curve is steepest at z=0 (the decision boundary) and flattens toward 0 and 1 at the extremes.

The steepest part of the sigmoid is at z = 0. This is where the model is most uncertain. As |z| grows, the model becomes increasingly confident — and the gradient of the sigmoid shrinks toward zero.
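That shrinking gradient can be checked numerically. The derivative of the sigmoid has the well-known closed form σ'(z) = σ(z)(1 − σ(z)), which peaks at 0.25 at z = 0 and decays toward zero as |z| grows:

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def sigmoid_grad(z):
    # Derivative of the sigmoid: sigma(z) * (1 - sigma(z))
    s = sigmoid(z)
    return s * (1 - s)

print(sigmoid_grad(0))   # 0.25, the steepest point: maximum uncertainty
print(sigmoid_grad(4))   # roughly 0.018: the gradient shrinks as |z| grows
```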

Problem Setup and Variables

Before writing any code, you define the variables that describe the problem. These conventions are used throughout machine learning literature.

| Variable | Meaning |
|---|---|
| X | Input matrix — shape (m, n) |
| Y | Output labels — shape (1, m), values 0 or 1 |
| m | Total number of training examples |
| m_train | Number of training examples |
| m_test | Number of test examples |
| n | Number of input features |
| W | Weight vector — shape (n, 1) |
| b | Bias — scalar |
| ŷ (y_hat) | Predicted probability for one example |
| A | Predicted probabilities for all examples — shape (1, m) |
| L | Loss for one example |
| J | Cost — average loss over all m examples |

python
import numpy as np

# Diabetes prediction problem
# Input: fasting glucose, BMI, age → 3 features
# Output: 1 (diabetic) or 0 (not diabetic)

m_train = 700    # training examples
m_test  = 100    # test examples
n       = 3      # features (glucose, BMI, age)

X_train = np.random.randn(m_train, n)   # (700, 3)
Y_train = np.random.randint(0, 2, (1, m_train))  # (1, 700)
X_test  = np.random.randn(m_test, n)    # (100, 3)
Y_test  = np.random.randint(0, 2, (1, m_test))   # (1, 100)

The Sigmoid Activation

The model computes a linear score z = W·x + b for each example — a real number that can be anything. The sigmoid converts that score into a probability. A linear function underneath, passed through the logistic (sigmoid) function: that is where the name logistic regression comes from. Applying this to the glucose example after training:

| Glucose (mg/dL) | Linear score z | σ(z) = P(diabetic) |
|---|---|---|
| 68 | −2.8 | 0.06 — 6% risk |
| 105 | −0.4 | 0.40 — 40% risk |
| 130 | +0.9 | 0.71 — 71% risk |
| 210 | +3.1 | 0.96 — 96% risk |

Low glucose → large negative z → sigmoid near 0. High glucose → large positive z → sigmoid near 1.

python
def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Forward pass for all m examples
def forward_pass(X, W, b):
    """
    X: (m, n)
    W: (n, 1)
    b: scalar
    Returns A: (1, m) — probability per example
    """
    Z = np.dot(X, W) + b   # (m, 1) linear combination
    A = sigmoid(Z).T        # (1, m) — transpose to row vector
    return A
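A usage sketch with random data (shapes as in the variable table above) confirms the output shape and that every prediction stays strictly inside (0, 1):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def forward_pass(X, W, b):
    Z = np.dot(X, W) + b   # (m, 1) linear combination
    return sigmoid(Z).T    # (1, m) row vector of probabilities

rng = np.random.default_rng(0)
m, n = 5, 3
X = rng.standard_normal((m, n))   # 5 examples, 3 features
W = rng.standard_normal((n, 1))
b = 0.0

A = forward_pass(X, W, b)
print(A.shape)                           # (1, 5): one probability per example
print(bool((A > 0).all() and (A < 1).all()))  # True: sigmoid never leaves (0, 1)
```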

The Decision Boundary

The sigmoid outputs a probability. To make a yes/no prediction, you apply a threshold — typically 0.5. If σ(z) ≥ 0.5, predict class 1; otherwise predict class 0. The decision boundary is the input value where the model is exactly 50/50.

Since σ(z) = 0.5 when z = 0, the boundary is the glucose value where W·x + b = 0. For a trained model with W = 0.042 and b = −5.66, that is 0.042 × glucose − 5.66 = 0, which gives glucose = 5.66 / 0.042 ≈ 134.8 mg/dL.

Below are three patients walked through the full prediction pipeline:

Example 1 — glucose = 105 mg/dL (well below boundary)

| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 105 − 5.66 | z = −1.25 |
| 2. Sigmoid | σ(−1.25) = 1 / (1 + e^1.25) | σ = 0.22 |
| 3. Threshold | 0.22 < 0.5 | Predict: not diabetic |

Example 2 — glucose = 135 mg/dL (at the boundary)

| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 135 − 5.66 | z ≈ 0 |
| 2. Sigmoid | σ(0) = 1 / (1 + e^0) | σ = 0.50 |
| 3. Threshold | 0.50 = 0.5 | Predict: boundary — coin flip |

Example 3 — glucose = 170 mg/dL (well above boundary)

| Step | Calculation | Result |
|---|---|---|
| 1. Linear score | z = 0.042 × 170 − 5.66 | z = +1.48 |
| 2. Sigmoid | σ(1.48) = 1 / (1 + e^−1.48) | σ = 0.81 |
| 3. Threshold | 0.81 ≥ 0.5 | Predict: diabetic |
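The boundary and all three worked examples can be reproduced directly from W = 0.042 and b = −5.66:

```python
import numpy as np

W, b = 0.042, -5.66

boundary = -b / W            # glucose where W * glucose + b = 0
print(round(boundary, 1))    # 134.8 mg/dL

for g in [105, 135, 170]:
    z = W * g + b
    p = 1 / (1 + np.exp(-z))
    label = "diabetic" if p >= 0.5 else "not diabetic"
    # 135 mg/dL lands essentially on the boundary: p rounds to 0.50
    print(g, round(z, 2), round(p, 2), label)
```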
[Diagram: Decision Boundary — Where the Model Draws the Line]
The sigmoid curve mapped to glucose values. The dashed yellow vertical line is the decision boundary at ~135 mg/dL. The three annotated dots correspond to the examples above — blue (105) sits left of the boundary, yellow (135) sits on it, pink (170) sits right of it.

The decision boundary is not a property of the sigmoid — it is a property of the weights. Changing W and b shifts or rotates the boundary. Training is the process of finding the W and b that places the boundary in the right location for your data.

Quick Check

A patient has glucose = 120 mg/dL. Using W = 0.042 and b = −5.66, what does the model predict?

The Loss and Cost Functions

The model makes m predictions — one per training example. To train, we need a single number that summarises how wrong all those predictions are. We build that number bottom-up: first define the loss (error for one example), then stack losses into the cost (average error across all examples).

Step 1 — The Loss Function L

The loss L measures the error for a single prediction. For binary classification, we use binary cross-entropy:

  • When y = 1: L = −log(ŷ) — loss is 0 if ŷ = 1, grows to ∞ as ŷ → 0
  • When y = 0: L = −log(1 − ŷ) — loss is 0 if ŷ = 0, grows to ∞ as ŷ → 1

Both cases combine into one formula:

L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾) = −[y⁽ⁱ⁾·log(ŷ⁽ⁱ⁾) + (1−y⁽ⁱ⁾)·log(1−ŷ⁽ⁱ⁾)]

Or, expanding ŷ to show the model parameters explicitly:

L(f_{w,b}(x⁽ⁱ⁾), y⁽ⁱ⁾) = −[y⁽ⁱ⁾·log(f_{w,b}(x⁽ⁱ⁾)) + (1−y⁽ⁱ⁾)·log(1−f_{w,b}(x⁽ⁱ⁾))]

Let's look at each case in detail — when y = 1 and when y = 0 — to see exactly how the penalty scales with the prediction.

[Diagram: Case 1 — y = 1 (Diabetic): L = −log(ŷ)]
When y=1 (diabetic), the second term vanishes and L = −log(ŷ). The open circle at the right end shows the curve approaches L=0 as ŷ→1 but never reaches it — the sigmoid can never output exactly 1. The curve climbs to ∞ as ŷ→0.
[Diagram: Case 2 — y = 0 (Not Diabetic): L = −log(1 − ŷ)]
When y=0 (not diabetic), the first term vanishes and L = −log(1−ŷ). The open circle at the left end shows the curve approaches L=0 as ŷ→0 but never reaches it. The curve climbs to ∞ as ŷ→1.

Now let's compare the two cases side by side — the same predicted probability ŷ produces very different losses depending on what the true label y actually was.

[Diagram: Binary Cross-Entropy Loss — Two Cases]
Left panel: y=1 case — bar length is L = −log(ŷ). Right panel: y=0 case — bar length is L = −log(1−ŷ). Green bars are confident correct predictions (tiny loss), red bars are confident wrong predictions (maximum penalty).

| Prediction ŷ | True label y | Loss L | Interpretation |
|---|---|---|---|
| 0.95 | 1 | 0.05 | Confident and correct — tiny penalty |
| 0.50 | 1 | 0.69 | Uncertain — moderate penalty |
| 0.05 | 1 | 3.00 | Confident and wrong — severe penalty |
| 0.05 | 0 | 0.05 | Confident and correct — tiny penalty |
| 0.95 | 0 | 3.00 | Confident and wrong — severe penalty |
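These loss values follow directly from the combined formula:

```python
import numpy as np

def bce_loss(y_hat, y):
    # Binary cross-entropy for a single example
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

print(round(float(bce_loss(0.95, 1)), 2))   # 0.05, confident and correct
print(round(float(bce_loss(0.50, 1)), 2))   # 0.69, uncertain
print(round(float(bce_loss(0.05, 1)), 2))   # 3.0, confident and wrong
print(round(float(bce_loss(0.95, 0)), 2))   # 3.0, confident and wrong
```

Note how the same prediction ŷ = 0.95 earns either the smallest or the largest penalty depending on the true label.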
[Diagram: Binary Cross-Entropy Loss]
Binary cross-entropy loss vs predicted probability. Blue: y=1 case — loss explodes as ŷ → 0. Pink: y=0 case — loss explodes as ŷ → 1. Confident wrong predictions are penalised most severely.

Step 2 — The Cost Function J

The cost J is built by averaging L over all m training examples. It is the single number gradient descent minimises:

J = (1/m) × Σᵢ L(ŷ⁽ⁱ⁾, y⁽ⁱ⁾)
J = (1/m) × Σᵢ L(f_{w,b}(x⁽ⁱ⁾), y⁽ⁱ⁾)

If J is large, the model is wrong on average. Gradient descent adjusts W and b to push J down.

Now let's compute the loss on each individual training example, then average them into the total cost J.

| Ex. | Glucose | ŷ (predicted) | y | Loss formula | L |
|---|---|---|---|---|---|
| #1 | 68 mg/dL | 0.06 | 0 | −log(1 − 0.06) | 0.06 |
| #2 | 105 mg/dL | 0.40 | 0 | −log(1 − 0.40) | 0.51 |
| #3 | 135 mg/dL | 0.50 | 1 | −log(0.50) | 0.69 |
| #4 | 158 mg/dL | 0.76 | 1 | −log(0.76) | 0.27 |
| #5 | 210 mg/dL | 0.96 | 1 | −log(0.96) | 0.04 |

J = (0.06 + 0.51 + 0.69 + 0.27 + 0.04) ÷ 5 = 0.31

Training minimises J by adjusting W and b until it approaches 0.
Five training examples, each producing a loss L; J is their average. Low-glucose examples (y=0) are scored with −log(1−ŷ), high-glucose examples (y=1) with −log(ŷ). A model predicting confidently and correctly keeps J near 0.

Step 3 — Putting it into code

python
def compute_cost(A, Y):
    """
    A: predicted probabilities — shape (1, m)
    Y: true labels — shape (1, m)
    Returns scalar cost J
    """
    m = Y.shape[1]
    J = -(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))
    return float(J)
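Applying compute_cost to the five worked examples above gives roughly the same answer; the small gap from the 0.31 shown earlier comes from rounding each loss before averaging:

```python
import numpy as np

def compute_cost(A, Y):
    # Average binary cross-entropy over all m examples
    m = Y.shape[1]
    return float(-(1/m) * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)))

A = np.array([[0.06, 0.40, 0.50, 0.76, 0.96]])   # predictions for the 5 examples
Y = np.array([[0,    0,    1,    1,    1   ]])   # true labels

print(round(compute_cost(A, Y), 2))   # 0.32 (0.31 if each loss is rounded first)
```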

The cost J is the average of all individual losses. Minimising J by adjusting W and b is the entire training process. The algorithm that does this minimisation is gradient descent — we will take a deeper look in the next module.

Test Your Knowledge

Ready to check how much you remember? Take the quiz for Logistic Regression and see your score on the leaderboard.


Up next

Next, we implement gradient descent — the optimisation algorithm that adjusts W and b to minimise the cost and train the model.

Gradient Descent for Logistic Regression