What Is Scikit-learn?
Scikit-learn is the standard Python library for classical machine learning. It provides implementations of dozens of algorithms — linear regression, logistic regression, decision trees, SVMs, k-means clustering — all behind a consistent API. Every major ML workflow follows the same three steps:
- fit — train the model on your training data
- predict — generate predictions on new, unseen data
- evaluate — measure how accurate those predictions are
The Fit / Predict API
Every scikit-learn model follows the same three-step pattern: fit → predict → evaluate. You create a model object, call .fit() to train it, .predict() to generate predictions, then evaluate how accurate those predictions are. This consistency means switching between algorithms requires changing only one line.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, y_train) # 1. train on training data
predictions = model.predict(X_test) # 2. predict on unseen test data
score = accuracy_score(y_test, predictions) # 3. evaluate — compare predictions to true labels
print(score) # e.g. 0.87 — 87% correct

Train / Test Split
Before training any model, you split your data into a training set and a test set. The model learns on the training set. You evaluate it on the test set — data it has never seen — to measure how well it generalises.
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)
# X — all 1000 examples, shape (1000, 5)
# y — all 1000 labels, shape (1000,)
# test_size=0.2 — reserve 20% for testing, 80% for training
# random_state — fixed seed so the split is the same every run
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# returns four arrays: training inputs, test inputs, training labels, test labels
print(X_train.shape) # (800, 5) — 80% of 1000
print(X_test.shape) # (200, 5) — 20% of 1000

Always set random_state to a fixed integer when splitting data. This makes your results reproducible — the same split every time you run the code. 42 is conventional but any integer works.
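The reproducibility claim is easy to check: two calls with the same random_state return identical splits. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)

# Same seed on both calls -> identical split
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_tr1, X_tr2))  # True: same rows in the same order

# A different seed almost certainly produces a different split
X_tr3, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
print(np.array_equal(X_tr1, X_tr3))
```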
Feature Scaling
Many ML algorithms are sensitive to the scale of input features. For example, if one feature ranges from 0 to 1 and another from 0 to 100,000, the large-scale feature will dominate the model. StandardScaler normalises each feature to zero mean and unit variance.
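A quick numeric check of what "zero mean and unit variance" means in practice, using synthetic data with two features on wildly different scales (a sketch, not part of the workflow below):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 1 ranges roughly over [0, 1]; feature 2 over [0, 100000]
X_train = np.column_stack([rng.random(500), rng.random(500) * 100_000])

X_scaled = StandardScaler().fit_transform(X_train)

print(X_scaled.mean(axis=0))  # both features now have mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```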
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # use training stats on test set
# fit and transform in one step
X_train_scaled = scaler.fit_transform(X_train)

Model Evaluation
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
# 0.87 — 87% of test examples classified correctly
print(classification_report(y_test, y_pred))
# Shows precision, recall, F1-score per class

Common Algorithms in Scikit-learn
| Algorithm | Import | Use case |
|---|---|---|
| Logistic Regression | sklearn.linear_model.LogisticRegression | Binary classification |
| Linear Regression | sklearn.linear_model.LinearRegression | Continuous output |
| Decision Tree | sklearn.tree.DecisionTreeClassifier | Interpretable classification |
| Random Forest | sklearn.ensemble.RandomForestClassifier | High-accuracy classification |
| K-Means | sklearn.cluster.KMeans | Unsupervised clustering |
| SVM | sklearn.svm.SVC | Classification with margins |
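Because all of these share the fit/predict API, swapping algorithms really is a one-line change. A sketch on synthetic data, comparing two classifiers from the table with identical training and evaluation code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a learnable synthetic rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)          # identical API for both models
    results[type(model).__name__] = accuracy_score(
        y_test, model.predict(X_test)
    )
print(results)
```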
Scikit-learn expects y to be a 1D array of shape (m,), not a column vector of shape (m, 1). If you get a DataConversionWarning, call y.ravel() to flatten your label array before passing it to fit().
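The fix in action: a column-vector y has the wrong shape for fit(), and ravel() flattens it to the expected 1D shape. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_col = rng.integers(0, 2, size=(100, 1))  # column vector: triggers the warning
print(y_col.shape)        # (100, 1)

y_flat = y_col.ravel()    # flatten to the 1D shape scikit-learn expects
print(y_flat.shape)       # (100,)

X = rng.normal(size=(100, 3))
LogisticRegression().fit(X, y_flat)  # no DataConversionWarning
```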
