What Is Scikit-learn?
Scikit-learn is the standard Python library for classical machine learning. It provides implementations of dozens of algorithms — linear regression, logistic regression, decision trees, SVMs, k-means clustering — all behind a consistent API. Every major ML workflow follows the same three steps:
- fit — train the model on your training data
- predict — generate predictions on new, unseen data
- evaluate — measure how accurate those predictions are
The Fit / Predict API
Every scikit-learn model follows the same three-step pattern: fit → predict → evaluate. You create a model object, call .fit() to train it, .predict() to generate predictions, then evaluate how accurate those predictions are. This consistency means switching between algorithms requires changing only one line.
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
model = LogisticRegression()
model.fit(X_train, y_train) # 1. train on training data
predictions = model.predict(X_test) # 2. predict on unseen test data
score = accuracy_score(y_test, predictions) # 3. evaluate — compare predictions to true labels
print(score) # e.g. 0.87 — 87% correct

Train / Test Split
Before training any model, you split your data into a training set and a test set. The model learns on the training set. You evaluate it on the test set — data it has never seen — to measure how well it generalises.
import numpy as np
from sklearn.model_selection import train_test_split
X = np.random.randn(1000, 5)
y = np.random.randint(0, 2, 1000)
# X — all 1000 examples, shape (1000, 5)
# y — all 1000 labels, shape (1000,)
# test_size=0.2 — reserve 20% for testing, 80% for training
# random_state — fixed seed so the split is the same every run
X_train, X_test, y_train, y_test = train_test_split(
X, y, test_size=0.2, random_state=42
)
# returns four arrays: training inputs, test inputs, training labels, test labels
print(X_train.shape) # (800, 5) — 80% of 1000
print(X_test.shape) # (200, 5) — 20% of 1000

Always set random_state to a fixed integer when splitting data. This makes your results reproducible — the same split every time you run the code. 42 is conventional but any integer works.
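The reproducibility claim is easy to check: two calls with the same random_state return identical splits. A minimal sketch on synthetic data:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.random.randn(100, 3)
y = np.random.randint(0, 2, 100)

# Same seed on both calls -> identical split
X_tr1, X_te1, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
X_tr2, X_te2, _, _ = train_test_split(X, y, test_size=0.2, random_state=42)
print(np.array_equal(X_tr1, X_tr2))  # True: same rows in the same order

# A different seed almost certainly produces a different split
X_tr3, _, _, _ = train_test_split(X, y, test_size=0.2, random_state=7)
print(np.array_equal(X_tr1, X_tr3))
```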
Feature Scaling
Many ML algorithms are sensitive to the scale of input features. For example, if one feature ranges from 0 to 1 and another from 0 to 100,000, the large-scale feature will dominate the model. StandardScaler normalises each feature to zero mean and unit variance.
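A quick numeric check of what "zero mean and unit variance" means in practice, using synthetic data with two features on wildly different scales (a sketch, not part of the workflow below):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Feature 1 ranges roughly over [0, 1]; feature 2 over [0, 100000]
X_train = np.column_stack([rng.random(500), rng.random(500) * 100_000])

X_scaled = StandardScaler().fit_transform(X_train)

print(X_scaled.mean(axis=0))  # both features now have mean ~0
print(X_scaled.std(axis=0))   # and standard deviation ~1
```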
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X_train) # learn mean and std from training data only
X_train_scaled = scaler.transform(X_train)
X_test_scaled = scaler.transform(X_test) # use training stats on test set
# fit and transform in one step
X_train_scaled = scaler.fit_transform(X_train)

Model Evaluation
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test_scaled)
print(accuracy_score(y_test, y_pred))
# 0.87 — 87% of test examples classified correctly
print(classification_report(y_test, y_pred))
# Shows precision, recall, F1-score per class

Common Algorithms in Scikit-learn
| Algorithm | Import | Use case |
|---|---|---|
| Logistic Regression | sklearn.linear_model.LogisticRegression | Binary classification |
| Linear Regression | sklearn.linear_model.LinearRegression | Continuous output |
| Decision Tree | sklearn.tree.DecisionTreeClassifier | Interpretable classification |
| Random Forest | sklearn.ensemble.RandomForestClassifier | High-accuracy classification |
| K-Means | sklearn.cluster.KMeans | Unsupervised clustering |
| SVM | sklearn.svm.SVC | Classification with margins |
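Because all of these share the fit/predict API, swapping algorithms really is a one-line change. A sketch on synthetic data, comparing two classifiers from the table with identical training and evaluation code:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(42)
X = rng.normal(size=(500, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # a learnable synthetic rule

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

results = {}
for model in (LogisticRegression(), DecisionTreeClassifier(random_state=0)):
    model.fit(X_train, y_train)          # identical API for both models
    results[type(model).__name__] = accuracy_score(
        y_test, model.predict(X_test)
    )
print(results)
```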
Scikit-learn expects y to be a 1D array of shape (m,), not a column vector of shape (m, 1). If you get a DataConversionWarning, call y.ravel() to flatten your label array before passing it to fit().
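The fix in action: a column-vector y has the wrong shape for fit(), and ravel() flattens it to the expected 1D shape. A minimal sketch on synthetic labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
y_col = rng.integers(0, 2, size=(100, 1))  # column vector: triggers the warning
print(y_col.shape)        # (100, 1)

y_flat = y_col.ravel()    # flatten to the 1D shape scikit-learn expects
print(y_flat.shape)       # (100,)

X = rng.normal(size=(100, 3))
LogisticRegression().fit(X, y_flat)  # no DataConversionWarning
```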
