Scikit-learn Essentials


What Is Scikit-learn?

Scikit-learn is the standard Python library for classical machine learning. It provides implementations of dozens of algorithms, including linear regression, logistic regression, decision trees, SVMs, and k-means clustering, all behind a consistent API. A typical supervised workflow follows the same three steps:

  • fit — train the model on your training data
  • predict — generate predictions on new, unseen data
  • evaluate — measure how accurate those predictions are

The Fit / Predict API

Every scikit-learn model follows the same pattern: create a model object, call .fit() to train it, call .predict() to generate predictions, then score those predictions against the true labels. Because the interface is consistent, switching between algorithms usually means changing only the line that constructs the model.

python
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Assumes X_train, X_test, y_train, y_test already exist (see Train / Test Split)
model = LogisticRegression()

model.fit(X_train, y_train)              # 1. train on training data
predictions = model.predict(X_test)      # 2. predict on unseen test data
score = accuracy_score(y_test, predictions)  # 3. evaluate — compare predictions to true labels
print(score)  # e.g. 0.87 — 87% correct
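To illustrate the "change one line" claim, here is a minimal sketch that runs the same fit, predict, evaluate loop over three classifiers. The synthetic dataset from make_classification, the `candidates` dictionary, and the hyperparameters are assumptions chosen for illustration, not part of the lesson's data.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data (an assumption; any feature matrix and labels work)
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Only the constructor differs between candidates
candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=50, random_state=0),
}

scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)            # 1. train
    predictions = model.predict(X_test)    # 2. predict
    scores[name] = accuracy_score(y_test, predictions)  # 3. evaluate

for name, score in scores.items():
    print(f"{name}: {score:.2f}")
```

The loop body never changes; the exact scores depend on the data and the library version, so none are shown here.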

Train / Test Split

Before training any model, you split your data into a training set and a test set. The model learns on the training set. You evaluate it on the test set — data it has never seen — to measure how well it generalises.

python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)

# X             — all 1000 examples, shape (1000, 5)
# y             — all 1000 labels,   shape (1000,)
# test_size=0.2 — reserve 20% for testing, 80% for training
# random_state  — fixed seed so the split is the same every run
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
# returns four arrays: training inputs, test inputs, training labels, test labels
print(X_train.shape)  # (800, 5)  — 80% of 1000
print(X_test.shape)   # (200, 5)  — 20% of 1000

Always set random_state to a fixed integer when splitting data. This makes your results reproducible — the same split every time you run the code. 42 is conventional but any integer works.
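A quick way to convince yourself of this is to split the same data twice with the same seed and compare the results. The tiny arrays below are made up for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Small made-up dataset: 10 examples, 2 features
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Two independent calls with the same seed
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Compare X_train, X_test, y_train, y_test pairwise
same_split = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same_split)  # True
```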

Feature Scaling

Many ML algorithms are sensitive to the scale of input features. For example, if one feature ranges from 0 to 1 and another from 0 to 100,000, the large-scale feature can dominate distance calculations and gradient updates. StandardScaler standardises each feature to zero mean and unit variance, using statistics learned from the training set.

python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(X_train)                   # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled  = scaler.transform(X_test)   # use training stats on test set

# Equivalent shortcut for the training set: fit and transform in one step
X_train_scaled = scaler.fit_transform(X_train)
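The sketch below checks what the scaler actually does, using two made-up features on very different scales (the ranges mirror the example in the text; the random data itself is an assumption). The training set ends up with mean 0 and standard deviation 1 per feature, while the test set is transformed with the training statistics and so lands only close to those values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Two features: one in [0, 1], one in [0, 100000]
X_train = np.column_stack(
    [rng.uniform(0, 1, 800), rng.uniform(0, 100_000, 800)]
)
X_test = np.column_stack(
    [rng.uniform(0, 1, 200), rng.uniform(0, 100_000, 200)]
)

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # learn stats, then scale
X_test_scaled = scaler.transform(X_test)        # reuse training stats

print(scaler.mean_)                   # per-feature means learned from X_train
print(X_train_scaled.mean(axis=0))    # approximately [0, 0]
print(X_train_scaled.std(axis=0))     # approximately [1, 1]
print(X_test_scaled.mean(axis=0))     # near 0, but not exactly

# transform() is just (x - mean) / std with the stored training values
manual = (X_test - scaler.mean_) / scaler.scale_
print(np.allclose(manual, X_test_scaled))  # True
```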

Model Evaluation

python
from sklearn.metrics import accuracy_score, classification_report

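# The model must be trained on the same scaled features it predicts on
model.fit(X_train_scaled, y_train)
y_pred = model.predict(X_test_scaled)

print(accuracy_score(y_test, y_pred))
# e.g. 0.87, meaning 87% of test examples classified correctly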

print(classification_report(y_test, y_pred))
# Shows precision, recall, F1-score per class
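To make the metrics concrete, here is a small worked example on hand-written labels (the eight labels are made up for illustration). Accuracy is the fraction of matching predictions; precision and recall for class 1 come from counting true positives, false positives, and false negatives.

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 1, 0, 0, 0])

# Accuracy by hand: 5 of the 8 predictions match the true labels
manual_accuracy = np.mean(y_true == y_pred)
accuracy = accuracy_score(y_true, y_pred)

# For class 1: TP = 2, FP = 1, FN = 2
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 2 / 3
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 2 / 4

print(f"{accuracy:.3f}")   # 0.625
print(f"{precision:.3f}")  # 0.667
print(f"{recall:.3f}")     # 0.500
```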

Common Algorithms in Scikit-learn

Algorithm           | Import                                  | Use case
--------------------|-----------------------------------------|------------------------------
Logistic Regression | sklearn.linear_model.LogisticRegression | Binary classification
Linear Regression   | sklearn.linear_model.LinearRegression   | Continuous output
Decision Tree       | sklearn.tree.DecisionTreeClassifier     | Interpretable classification
Random Forest       | sklearn.ensemble.RandomForestClassifier | High-accuracy classification
K-Means             | sklearn.cluster.KMeans                  | Unsupervised clustering
SVM                 | sklearn.svm.SVC                         | Classification with margins
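K-Means is the one unsupervised entry in the table, so its fit() call takes no labels. A minimal sketch, assuming two well-separated made-up blobs of points and a choice of n_clusters=2:

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Two made-up blobs: one around (0, 0), one around (10, 10)
blob_a = rng.normal(loc=0.0, scale=0.5, size=(50, 2))
blob_b = rng.normal(loc=10.0, scale=0.5, size=(50, 2))
X = np.vstack([blob_a, blob_b])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans.fit(X)  # unsupervised: no y argument

labels = kmeans.labels_  # cluster index assigned to each training point

# predict() assigns new points to the nearest learned cluster centre
new_points = np.array([[0.2, -0.1], [9.8, 10.3]])
new_labels = kmeans.predict(new_points)
print(new_labels)
```

The cluster indices (0 or 1) are arbitrary, so only the grouping is meaningful, not which blob gets which number.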

Scikit-learn expects y to be a 1D array of shape (m,), not a column vector of shape (m, 1). If you get a DataConversionWarning, call y.ravel() to flatten your label array before passing it to fit().
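As a short illustration of the shape fix, the sketch below builds a column-vector label array on made-up data and flattens it before training. The variable names `y_column` and `y_flat` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))

# Labels stored as a column vector, as often happens after slicing a 2D array
y_column = (X[:, 0] > 0).astype(int).reshape(-1, 1)
print(y_column.shape)  # (100, 1)

# Flatten to the 1D shape scikit-learn expects
y_flat = y_column.ravel()
print(y_flat.shape)    # (100,)

model = LogisticRegression()
model.fit(X, y_flat)
```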
