The Misconception
Films portray AI as a scientist typing commands into a terminal and an all-knowing system emerging minutes later. The reality is closer to construction engineering than magic: long before any model is trained, teams spend weeks or months acquiring and cleaning data, defining what success actually means, and setting up infrastructure to measure it.
Understanding this pipeline matters — whether you are building AI systems, evaluating vendors, or deciding whether AI is the right tool for a problem at all. Each step in this pipeline is a deep subject in its own right — in the courses ahead we cover model training (Gradient Descent), why models fail to generalise (Overfitting and Regularisation), how neural networks are structured (Neural Networks Explained and Introduction to Deep Learning), the architecture behind modern language models (Transformer Architecture), and the ethical consequences of deploying AI at scale (AI Ethics and Limitations).
Step 1: Define the Problem Precisely
Every AI project starts with a question that must be made concrete. "Detect fraud" is not specific enough to build anything. For example, a working definition might be: "Given the features of a credit card transaction — amount, location, merchant category, time since last transaction — classify it as fraudulent or legitimate, with a target recall of 95% on confirmed fraud cases."
This step forces three decisions:
- What is the input? — the raw data the model will see
- What is the output? — what the model must predict
- What does good look like? — the metric you will optimise
Choosing the wrong metric is one of the most common ways AI projects fail. For example, optimising for accuracy on an imbalanced dataset (1% fraud, 99% legitimate) means a model that always predicts "not fraud" scores 99% accuracy — and catches zero actual fraud cases.
Quick check: a medical AI is built to detect a rare disease affecting 1% of the population. It predicts "no disease" for every patient and achieves 99% accuracy. What is the problem?
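The accuracy paradox described above takes only a few lines of plain Python to demonstrate (the dataset here is synthetic, invented purely for illustration):

```python
# Demonstration of the accuracy paradox on an imbalanced dataset.
# Labels: 1 = fraud (1% of cases), 0 = legitimate (99%).
labels = [1] * 10 + [0] * 990          # 1,000 transactions, 10 fraudulent

# A "model" that always predicts "not fraud":
predictions = [0] * len(labels)

accuracy = sum(p == y for p, y in zip(predictions, labels)) / len(labels)

# Recall: of the actual fraud cases, how many did we catch?
true_positives = sum(p == 1 and y == 1 for p, y in zip(predictions, labels))
recall = true_positives / sum(labels)

print(f"accuracy: {accuracy:.0%}")     # 99% -- looks great
print(f"recall:   {recall:.0%}")       # 0% -- catches no fraud at all
```

The model scores 99% on the metric being optimised while being useless for the task, which is why the problem definition must pin down recall (or a similar fraud-sensitive metric) up front.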
Step 2: Collect and Label Data
Machine learning systems learn from examples. The data must be:
- Representative — it must reflect the real conditions the model will encounter in production, not just the easy cases
- Labelled (for supervised learning) — a human or reliable process must provide the correct answer for each example
- Large enough — complex tasks require more examples; simple tasks can work with fewer
Data collection is expensive and slow. For example, a company building a document classifier might need human reviewers to label tens of thousands of documents; medical AI requires clinical experts; self-driving AI needs millions of hours of recorded road data with annotated objects.
"Garbage in, garbage out" applies doubly to AI. A model trained on biased, mislabelled, or unrepresentative data will encode those flaws and reproduce them at scale.
Poor data causes more AI project failures than poor algorithms. A mediocre algorithm on excellent data typically outperforms an excellent algorithm on poor data.
Step 3: Prepare the Data
Raw data is almost never ready to train on. Data preparation typically involves:
- Cleaning — removing duplicates, fixing errors, handling missing values
- Splitting — dividing data into training set (to learn from), validation set (to tune during development), and test set (held out until the very end)
- Normalisation — scaling numerical features so no single feature dominates due to its units
- Encoding — converting categorical values into numerical representations
This step consumes 60–80% of total project time. It is unglamorous but determines everything that follows.
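A minimal sketch of the splitting and normalisation steps on a synthetic dataset (the 70/15/15 split and the use of a standard-score normalisation are common conventions, assumed here rather than taken from this text):

```python
import random

# A toy dataset of (amount, label) pairs. In practice these would be real records.
random.seed(0)
data = [(random.uniform(1, 5000), random.randint(0, 1)) for _ in range(1000)]

# Splitting: shuffle, then carve out train / validation / test (70% / 15% / 15%).
random.shuffle(data)
train = data[:700]
val = data[700:850]
test = data[850:]

# Normalisation: compute statistics on the TRAINING set only, then apply them
# to every split -- using test-set statistics here would be data leakage.
amounts = [x for x, _ in train]
mean = sum(amounts) / len(amounts)
std = (sum((x - mean) ** 2 for x in amounts) / len(amounts)) ** 0.5

def normalise(split):
    return [((x - mean) / std, y) for x, y in split]

train_n, val_n, test_n = normalise(train), normalise(val), normalise(test)
```

Note the ordering: the split happens before the statistics are computed, which is exactly the discipline that prevents the "data leakage from the test set" failure listed in the summary table.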
Step 4: Choose and Train a Model
With clean, split data, training begins. The team selects an architecture appropriate for the problem:
- Tabular data (spreadsheets, databases) → gradient boosted trees (XGBoost, LightGBM) or logistic regression
- Images → convolutional neural networks (CNNs)
- Text → transformer-based models (BERT, GPT variants)
- Sequential data → recurrent networks or transformers
Training means running an optimisation algorithm — almost always a variant of gradient descent — that adjusts the model's parameters to minimise prediction error on the training data. For example, training a large language model can run for weeks across thousands of GPUs.
Quick check: during training, a model's error on the training set keeps decreasing, but its error on the validation set starts increasing after a certain point. What should the team do?
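As a toy illustration of what "adjusting parameters to minimise error" means, here is gradient descent fitting a one-parameter linear model to synthetic data (the learning rate and step count are arbitrary choices, not recommendations):

```python
# Minimal gradient descent: fit y = w * x to data generated with w = 3.
# The loss is mean squared error; its gradient is derived by hand.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]    # true relationship: y = 3x

w = 0.0                        # initial parameter guess
lr = 0.01                      # learning rate (an arbitrary choice)

for step in range(1000):
    # d(MSE)/dw = (2/n) * sum((w*x - y) * x)
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad             # move against the gradient

print(round(w, 3))             # converges close to 3.0
```

Real models have millions or billions of parameters rather than one, but each training step follows this same pattern: compute the gradient of the loss, then nudge every parameter in the direction that reduces it.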
Step 5: Evaluate
Before any model goes near real users, it is evaluated against the held-out test set — data it has never seen. Metrics depend on the task:
- Classification — accuracy, precision, recall, F1, ROC-AUC
- Regression — mean absolute error, root mean square error
- Generation — human evaluation, BLEU score (for translation), perplexity
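For classification, the precision, recall, and F1 metrics listed above fall directly out of the confusion-matrix counts. A small worked example (the counts are invented for illustration):

```python
# Hand-computed classification metrics for a small confusion matrix.
# Suppose out of 100 test transactions, 8 are true fraud and the model flags 10.
tp = 6   # true positives:  fraud correctly flagged
fp = 4   # false positives: legitimate transactions flagged as fraud
fn = 2   # false negatives: fraud the model missed

precision = tp / (tp + fp)     # of flagged cases, how many were fraud?  6/10
recall = tp / (tp + fn)        # of actual fraud, how many were caught?  6/8
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean

print(precision, recall, round(f1, 3))   # 0.6 0.75 0.667
```

Precision and recall pull in opposite directions (flagging more transactions raises recall but lowers precision), which is why F1 — their harmonic mean — is often reported as a single balancing number.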
The test set must be used only once. If a team evaluates repeatedly and adjusts the model each time, the test set effectively becomes a validation set and the final reported performance will be misleadingly optimistic.
Evaluation also includes error analysis — examining the cases the model gets wrong, not just the aggregate score. For example, a model that achieves 90% accuracy but systematically fails on a particular demographic is not a 90% model. It is a broken model for that group.
Step 6: Deploy and Monitor
A trained model sitting on a laptop helps no one. Deployment means packaging the model and serving its predictions via an API to users or downstream systems.
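Stripped to its essentials, "serving predictions via an API" reduces to a function from input features to a prediction, wrapped in request parsing. The sketch below is hypothetical: the scoring rule and threshold are invented stand-ins for a real trained model, and a real deployment would sit behind a proper web framework:

```python
import json

def predict(features: dict) -> dict:
    # Invented toy scoring rule -- a real system would call the trained model here.
    score = min(features["amount"] / 10_000, 1.0)
    return {"fraud_score": score, "flag": score > 0.5}

def handle_request(body: str) -> str:
    """What a web framework would invoke for each POST /predict request."""
    features = json.loads(body)
    return json.dumps(predict(features))

response = handle_request('{"amount": 7500, "merchant": "electronics"}')
print(response)   # {"fraud_score": 0.75, "flag": true}
```

Everything around this function — load balancing, request queues, model versioning, logging — is what turns a laptop artefact into a production system.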
Production AI introduces challenges that research does not:
- Latency — predictions may need to arrive in milliseconds. For example, fraud detection and search ranking cannot afford a one-second response
- Scale — the system may need to handle millions of requests per day
- Data drift — the real world changes; a model trained on last year's data may perform poorly today
- Feedback loops — model predictions can influence the data used to retrain future models
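One simple, hypothetical way to watch for data drift is to compare a feature's distribution in recent traffic against statistics recorded at training time. The three-standard-deviation threshold below is an arbitrary illustrative choice, and real monitoring tracks many features and distribution shapes, not just one mean:

```python
# Toy drift check: alert if the mean of a feature in recent production
# traffic has moved too far from its training-time mean.
train_mean, train_std = 120.0, 40.0     # statistics saved at training time

def drift_alert(recent_values, threshold_sds=3.0):
    recent_mean = sum(recent_values) / len(recent_values)
    shift_in_sds = abs(recent_mean - train_mean) / train_std
    return shift_in_sds > threshold_sds

print(drift_alert([118, 125, 110, 130]))   # False: close to training data
print(drift_alert([400, 390, 420, 410]))   # True: distribution has moved
```

An alert like this does not say the model is wrong, only that it is now seeing data unlike what it was trained on — which is the trigger to investigate and, often, retrain.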
Monitoring is not optional. Model performance degrades silently in production unless systems are in place to detect it.
Deploying a model is the beginning of the work, not the end. Production AI requires ongoing maintenance, retraining, and monitoring — much like any other software system.
The Full Pipeline at a Glance
| Step | What Happens | Common Failure |
|---|---|---|
| Define | Set precise inputs, outputs, and success metric | Wrong or gameable metric |
| Collect | Gather representative, labelled examples | Biased or unrepresentative data |
| Prepare | Clean, split, normalise, encode | Data leakage from the test set |
| Train | Fit model parameters to minimise error | Overfitting to training data |
| Evaluate | Measure on held-out test data, analyse errors | Re-using the test set for tuning |
| Deploy | Serve predictions in production via API | Latency or infrastructure failure |
| Monitor | Track performance over time, retrain as needed | Silent degradation from data drift |
Most AI projects cycle through steps 2–5 many times before deployment. The pipeline is iterative, not linear.