The Misconception
Films portray AI as a scientist typing commands into a terminal and an all-knowing system emerging minutes later. The reality is closer to construction engineering than magic: long before any model is trained, teams spend weeks or months acquiring and cleaning data, defining what success actually means, and setting up infrastructure to measure it.
Understanding this pipeline matters — whether you are building AI systems, evaluating vendors, or deciding whether AI is the right tool for a problem at all. Each step in this pipeline is a deep subject in its own right — in the courses ahead we cover model training (Gradient Descent), why models fail to generalise (Overfitting and Regularisation), how neural networks are structured (Neural Networks Explained and Introduction to Deep Learning), the architecture behind modern language models (Transformer Architecture), and the ethical consequences of deploying AI at scale (AI Ethics and Limitations).
Step 1: Define the Problem Precisely
Every AI project starts with a question that must be made concrete. "Detect fraud" is not specific enough to build anything. For example, a working definition might be: "Given the features of a credit card transaction — amount, location, merchant category, time since last transaction — classify it as fraudulent or legitimate, with a target recall of 95% on confirmed fraud cases."
This step forces three decisions:
- What is the input? — the raw data the model will see
- What is the output? — what the model must predict
- What does good look like? — the metric you will optimise
Choosing the wrong metric is one of the most common ways AI projects fail. For example, optimising for accuracy on an imbalanced dataset (1% fraud, 99% legitimate) means a model that always predicts "not fraud" scores 99% accuracy — and catches zero actual fraud cases.
Quick check: a medical AI is built to detect a rare disease affecting 1% of the population. It predicts "no disease" for every patient and achieves 99% accuracy. What is the problem?
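The accuracy paradox described above takes only a few lines of plain Python to demonstrate (the dataset here is synthetic, invented purely for illustration):

```python
# Demonstration of the accuracy paradox on an imbalanced dataset.
# Labels: 1 = fraud (1% of cases), 0 = legitimate (99%).
labels = [1] * 10 + [0] * 990          # 1,000 transactions, 10 fraudulent

# A "model" that always predicts "not fraud":
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall: of the actual fraud cases, how many did we catch?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.0%}")     # 99% -- looks great
print(f"recall:   {recall:.0%}")       # 0% -- catches no fraud at all
```

The model scores 99% on the metric being optimised while being useless for the task, which is why the problem definition must pin down recall (or a similar fraud-sensitive metric) up front.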
Step 2: Collect and Label Data
Machine learning systems learn from examples. The data must be:
- Representative — it must reflect the real conditions the model will encounter in production, not just the easy cases
- Labelled (for supervised learning) — a human or reliable process must provide the correct answer for each example
- Large enough — complex tasks require more examples; simple tasks can work with fewer
Data collection is expensive and slow. For example, a company building a document classifier might need human reviewers to label tens of thousands of documents; medical AI requires clinical experts; self-driving AI needs millions of hours of recorded road data with annotated objects.
"Garbage in, garbage out" applies doubly to AI. A model trained on biased, mislabelled, or unrepresentative data will encode those flaws and reproduce them at scale.
Poor data causes more AI project failures than poor algorithms. A mediocre algorithm on excellent data typically outperforms an excellent algorithm on poor data.
Step 3: Prepare the Data
Raw data is almost never ready to train on. Data preparation typically involves:
- Cleaning — removing duplicates, fixing errors, handling missing values
- Splitting — dividing data into training set (to learn from), validation set (to tune during development), and test set (held out until the very end)
- Normalisation — scaling numerical features so no single feature dominates due to its units
- Encoding — converting categorical values into numerical representations
This step consumes 60–80% of total project time. It is unglamorous but determines everything that follows.
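A minimal sketch of the splitting and normalisation steps on a synthetic dataset (the 70/15/15 split and the use of a standard-score normalisation are common conventions, assumed here rather than taken from this text):

```python
import random

# A toy dataset of (amount, label) pairs. In practice these would be real records.
random.seed(0)
data = [(random.uniform(1, 5000), random.randint(0, 1)) for _ in range(1000)]

# Splitting: shuffle, then carve out train / validation / test (70% / 15% / 15%).
random.shuffle(data)
train = data[:700]
val = data[700:850]
test = data[850:]

# Normalisation: compute statistics on the TRAINING set only, then apply them
# to every split -- using test-set statistics here would be data leakage.
amounts = [x for x, _ in train]
mean = sum(amounts) / len(amounts)
std = (sum((x - mean) ** 2 for x in amounts) / len(amounts)) ** 0.5

def normalise(split):
    return [((x - mean) / std, y) for x, y in split]

train_n, val_n, test_n = normalise(train), normalise(val), normalise(test)
```

Note the ordering: the split happens before the statistics are computed, which is exactly the discipline that prevents the "data leakage from the test set" failure listed in the summary table.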
Step 4: Choose and Train a Model
With clean, split data, training begins. The team selects an architecture appropriate for the problem:
- Tabular data (spreadsheets, databases) → gradient boosted trees (XGBoost, LightGBM) or logistic regression
- Images → convolutional neural networks (CNNs)
- Text → transformer-based models (BERT, GPT variants)
- Sequential data → recurrent networks or transformers
Training means running an optimisation algorithm — almost always a variant of gradient descent — that adjusts the model's parameters to minimise prediction error on the training data. For example, training a large language model can run for weeks across thousands of GPUs.
Quick check: during training, a model's error on the training set keeps decreasing, but its error on the validation set starts increasing after a certain point. What should the team do?
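As a toy illustration of what "adjusting parameters to minimise error" means, here is gradient descent fitting a one-parameter linear model to synthetic data (the learning rate and step count are arbitrary choices, not recommendations):

```python
# Minimal gradient descent: fit y = w * x to data generated with w = 3.
# The loss is mean squared error; its gradient is derived by hand.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]    # true relationship: y = 3x

w = 0.0                        # initial parameter guess
lr = 0.01                      # learning rate (an arbitrary choice)

for step in range(1000):
    # d(MSE)/dw = (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad             # move against the gradient

print(round(w, 3))             # converges close to 3.0
```

Real models have millions or billions of parameters rather than one, but each training step follows this same pattern: compute the gradient of the loss, then nudge every parameter in the direction that reduces it.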
Step 5: Evaluate
Before any model goes near real users, it is evaluated against the held-out test set — data it has never seen. Metrics depend on the task:
- Classification — accuracy, precision, recall, F1, ROC-AUC
- Regression — mean absolute error, root mean square error
- Generation — human evaluation, BLEU score (for translation), perplexity
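For classification, the precision, recall, and F1 metrics listed above fall directly out of the confusion-matrix counts. A small worked example (the counts are invented for illustration):

```python
# Hand-computed classification metrics for a small confusion matrix.
# Suppose out of 100 test transactions, 8 are true fraud and the model flags 10.
tp = 6   # true positives:  fraud correctly flagged
fp = 4   # false positives: legitimate transactions flagged as fraud
fn = 2   # false negatives: fraud the model missed

precision = tp / (tp + fp)     # of flagged cases, how many were fraud?  6/10
recall = tp / (tp + fn)        # of actual fraud, how many were caught?  6/8
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, round(f1, 3))   # 0.6 0.75 0.667
```

Precision and recall pull in opposite directions (flagging more transactions raises recall but lowers precision), which is why F1 — their harmonic mean — is often reported as a single balancing number.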
The test set must be used only once. If a team evaluates repeatedly and adjusts the model each time, the test set effectively becomes a validation set and the final reported performance will be misleadingly optimistic.
Evaluation also includes error analysis — examining the cases the model gets wrong, not just the aggregate score. For example, a model that achieves 90% accuracy but systematically fails on a particular demographic is not a 90% model. It is a broken model for that group.
Step 6: Deploy and Monitor
A trained model sitting on a laptop helps no one. Deployment means packaging the model and serving its predictions via an API to users or downstream systems.
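Stripped to its essentials, "serving predictions via an API" reduces to a function from input features to a prediction, wrapped in request parsing. The sketch below is hypothetical: the scoring rule and threshold are invented stand-ins for a real trained model, and a real deployment would sit behind a proper web framework:

```python
import json

def predict(features: dict) -> dict:
    # Invented toy scoring rule -- a real system would call the trained model here.
    score = min(features["amount"] / 10_000, 1.0)
    return {"fraud_score": score, "flag": score > 0.5}

def handle_request(body: str) -> str:
    """What a web framework would invoke for each POST /predict request."""
    features = json.loads(body)
    return json.dumps(predict(features))

response = handle_request('{"amount": 7500, "merchant": "electronics"}')
print(response)   # {"fraud_score": 0.75, "flag": true}
```

Everything around this function — load balancing, request queues, model versioning, logging — is what turns a laptop artefact into a production system.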
Production AI introduces challenges that research does not:
- Latency — predictions may need to arrive in milliseconds. For example, fraud detection and search ranking cannot afford a one-second response
- Scale — the system may need to handle millions of requests per day
- Data drift — the real world changes; a model trained on last year's data may perform poorly today
- Feedback loops — model predictions can influence the data used to retrain future models
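One simple, hypothetical way to watch for data drift is to compare a feature's distribution in recent traffic against statistics recorded at training time. The three-standard-deviation threshold below is an arbitrary illustrative choice, and real monitoring tracks many features and distribution shapes, not just one mean:

```python
# Toy drift check: alert if the mean of a feature in recent production
# traffic has moved too far from its training-time mean.
train_mean, train_std = 120.0, 40.0     # statistics saved at training time

def drift_alert(recent_values, threshold_sds=3.0):
    recent_mean = sum(recent_values) / len(recent_values)
    shift_in_sds = abs(recent_mean - train_mean) / train_std
    return shift_in_sds > threshold_sds

print(drift_alert([118, 125, 110, 130]))   # False: close to training data
print(drift_alert([400, 390, 420, 410]))   # True: distribution has moved
```

An alert like this does not say the model is wrong, only that it is now seeing data unlike what it was trained on — which is the trigger to investigate and, often, retrain.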
Monitoring is not optional. Model performance degrades silently in production unless systems are in place to detect it.
Deploying a model is the beginning of the work, not the end. Production AI requires ongoing maintenance, retraining, and monitoring — much like any other software system.
The Full Pipeline at a Glance
| Step | What Happens | Common Failure |
|---|---|---|
| Define | Set precise inputs, outputs, and success metric | Wrong or gameable metric |
| Collect | Gather representative, labelled examples | Biased or unrepresentative data |
| Prepare | Clean, split, normalise, encode | Data leakage from the test set |
| Train | Fit model parameters to minimise error | Overfitting to training data |
| Evaluate | Measure on held-out test data, analyse errors | Re-using the test set for tuning |
| Deploy | Serve predictions in production via API | Latency or infrastructure failure |
| Monitor | Track performance over time, retrain as needed | Silent degradation from data drift |
Most AI projects cycle through steps 2–5 many times before deployment. The pipeline is iterative, not linear.