What Is Deep Learning?
Deep learning is a subfield of machine learning that uses multi-layered neural networks to learn representations of data. The "deep" in deep learning simply refers to the number of layers — the more layers, the deeper the network.
The distinction matters because classical ML algorithms (decision trees, logistic regression, SVMs) require humans to hand-craft features before training. Deep learning sidesteps this: given enough data, the network learns its own features at every layer, building from simple patterns at the bottom to rich abstractions at the top.
Classical ML: Raw data → Human-crafted features → Algorithm → Prediction
Deep Learning: Raw data → Algorithm learns features automatically → Prediction
This ability to learn features directly from raw inputs is why deep learning powers image recognition, speech transcription, language translation, and protein structure prediction — all domains where defining features by hand would be impractical.
The Neuron: The Smallest Building Block
Every deep learning model is assembled from copies of the same basic unit: the artificial neuron. Understanding what a single neuron does makes the rest of deep learning much clearer.
A neuron takes a set of inputs, multiplies each by a learned weight, sums the results, then passes the sum through an activation function to produce an output.
Suppose you are building a system to predict whether a user will finish watching a movie given three signals: how long the trailer was watched (in seconds), the average review score (out of 10), and whether the genre matches the user's history (1 or 0).
A single neuron might learn weights like:
- Trailer watch time: 0.4
- Review score: 0.5
- Genre match: 0.8
It computes: (watch_time × 0.4) + (score × 0.5) + (genre_match × 0.8), then applies an activation function. If the result clears a threshold, the neuron fires and outputs a high value — signalling "this user is likely to watch".
The neuron does not decide the weights. Gradient descent finds them by minimising prediction error across thousands of examples.
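The computation above can be sketched in a few lines of Python. The weights are the illustrative values from the text; the user's input values and the choice of sigmoid as the activation are assumptions for the sake of a runnable example (real systems would also normalise inputs so no single feature dominates the sum).

```python
import math

def sigmoid(z):
    """Squash any real number into (0, 1), read here as a probability."""
    return 1 / (1 + math.exp(-z))

def neuron(inputs, weights, activation):
    """Weighted sum of inputs, passed through an activation function."""
    total = sum(x * w for x, w in zip(inputs, weights))
    return activation(total)

# Illustrative weights from the text: trailer watch time, review score, genre match
weights = [0.4, 0.5, 0.8]

# Hypothetical user: watched 90 s of the trailer, 7.5 average score, genre matches.
# Note the unnormalised watch time dominates the sum — a real pipeline would scale it.
inputs = [90, 7.5, 1]

print(neuron(inputs, weights, sigmoid))  # close to 1.0: "likely to watch"
```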
Activation Functions: Adding Non-Linearity
Without an activation function, stacking neurons achieves nothing — any number of linear transformations composed together is still just one linear transformation. A network without non-linearity cannot learn curves, boundaries, or complex patterns.
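This collapse can be demonstrated directly: compose two weight matrices without an activation between them, and the result equals a single matrix applied once. A NumPy sketch (layer sizes and random weights are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 3))   # "layer 1": 3 inputs -> 4 units, no activation
W2 = rng.standard_normal((2, 4))   # "layer 2": 4 units -> 2 outputs, no activation
x = rng.standard_normal(3)

two_layers = W2 @ (W1 @ x)         # stacking two linear layers
one_layer = (W2 @ W1) @ x          # the single equivalent linear layer

# Identical outputs: the extra depth bought nothing without a non-linearity
assert np.allclose(two_layers, one_layer)
```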
The most widely used activation in modern networks is ReLU (Rectified Linear Unit):
ReLU(x) = max(0, x)
ReLU outputs the input directly when it is positive, and zero when it is negative. That kink at zero is the non-linearity that lets networks model almost any function.
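ReLU is a one-liner. A sketch with a few sample values, showing the kink at zero:

```python
def relu(x):
    """Rectified Linear Unit: pass positives through, zero out negatives."""
    return max(0.0, x)

print([relu(v) for v in [-2.0, -0.5, 0.0, 0.5, 2.0]])  # [0.0, 0.0, 0.0, 0.5, 2.0]
```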
Why must activation functions be non-linear?
From One Neuron to a Network
A single neuron is limited to a single weighted combination. The power of deep learning comes from connecting many neurons in layers.
- Input layer — receives the raw features (pixel values, word embeddings, sensor readings)
- Hidden layers — each layer learns a transformed representation of the previous layer's output
- Output layer — produces the final prediction (a class probability, a number, a sequence)
Consider training a network to classify whether a product review is positive or negative. The first hidden layer might detect whether individual words appear ("great", "terrible"). The second layer picks up phrase-level patterns ("not great", "really terrible"). Deeper layers capture overall sentiment tone. No human defined these steps — the network discovered them by adjusting weights to reduce prediction error.
This hierarchical feature learning is the core capability that separates deep networks from shallow models.
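The layer structure above can be sketched as a forward pass through a tiny fully connected network. The layer sizes are arbitrary and the weights are random placeholders; a trained network would have learned them from data:

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

rng = np.random.default_rng(42)

# 5 input features -> hidden layer of 8 -> hidden layer of 4 -> 1 output
layer_sizes = [5, 8, 4, 1]
weights = [rng.standard_normal((n_out, n_in)) * 0.1
           for n_in, n_out in zip(layer_sizes, layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]

def forward(x):
    """Each layer transforms the previous layer's representation."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(W @ x + b)          # hidden layers: linear map + non-linearity
    W, b = weights[-1], biases[-1]
    z = W @ x + b                    # output layer: raw score
    return 1 / (1 + np.exp(-z))      # squash to a probability, e.g. "review is positive"

x = rng.standard_normal(5)
print(forward(x))  # a value in (0, 1)
```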
A network is trained to detect objects in photos. Which layer is most likely to detect simple edges and colour gradients?
What Deep Learning Is Good At
Deep learning performs best when three things are true:
1. The input is high-dimensional raw data. Images, audio, text, and video have far too many raw dimensions for humans to engineer features by hand. Neural networks learn which features matter.
2. There is a lot of data. In supervised deep learning, labelled data is required, and deep networks only reliably outperform classical methods when the dataset is large enough to constrain all those parameters. Self-supervised approaches (BERT, GPT) sidestep labelling by generating supervision from the data itself — but they still require large volumes of raw data.
3. Compute is available. Training large models requires matrix operations on millions of examples. GPUs and TPUs make this tractable.
Strong current applications:
- Computer vision — object detection, medical imaging, autonomous driving perception
- Natural language — translation, summarisation, question answering, code generation
- Speech — real-time transcription, voice cloning, speaker identification
- Generative AI — image synthesis, video generation, protein structure prediction (AlphaFold)
Why Has Deep Learning Taken Off Now?
The mathematical ideas behind neural networks are decades old. What changed is the intersection of three forces:
Data. The digitisation of everyday life — social media, e-commerce, streaming, sensors — created datasets at a scale that was simply unavailable before. A network's performance keeps improving as you add more data, well past the point where traditional algorithms plateau.
Compute. GPUs were designed to render graphics by executing thousands of parallel matrix operations. That happens to be exactly what training a neural network requires. Cloud access to GPU clusters means a researcher can train models in hours that would have taken months on CPU hardware a decade ago.
Algorithmic improvements. Switching from sigmoid to ReLU activations significantly reduced the vanishing gradient problem for deep networks — ReLU does not saturate for positive inputs, keeping gradients alive. Residual connections (ResNets), batch normalisation, He initialisation, and better optimisers (Adam, AdamW) were equally important in making very deep networks stable and practical to train.
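The saturation point can be illustrated numerically. The sigmoid's derivative peaks at 0.25, so chaining gradients through many sigmoid layers shrinks them toward zero, while an active ReLU path has derivative exactly 1. This sketch ignores weight matrices and shows only the activation derivatives' contribution:

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid; its maximum value is 0.25, at z = 0."""
    s = 1 / (1 + math.exp(-z))
    return s * (1 - s)

def relu_grad(z):
    """Derivative of ReLU: 1 for positive inputs, 0 otherwise."""
    return 1.0 if z > 0 else 0.0

depth = 20
sig_chain = sigmoid_grad(0.0) ** depth   # best case for sigmoid: 0.25 ** 20
relu_chain = relu_grad(1.0) ** depth     # active ReLU path: 1.0 ** 20

print(sig_chain)   # ~9e-13: the gradient has effectively vanished
print(relu_chain)  # 1.0
```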
Scaling law finding: performance improves predictably with more data, larger models, and more compute — but model size and data must scale together. The Chinchilla paper (2022) showed that many large models were undertrained: too large for the amount of data used. Bigger is not reliably better without proportionally more data.
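The Chinchilla result reduces to a simple rule of thumb: compute-optimal training uses roughly 20 training tokens per model parameter. A quick arithmetic sketch:

```python
# Chinchilla rule of thumb: ~20 training tokens per parameter
# for compute-optimal training (Hoffmann et al., 2022).
params = 70e9                      # a 70-billion-parameter model
optimal_tokens = 20 * params

print(f"{optimal_tokens / 1e12:.1f} trillion tokens")  # 1.4 trillion tokens
```

A model trained on far fewer tokens than this is "undertrained" in the paper's sense: its parameter budget would have been better spent on a smaller model seeing more data.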
A company has 500 labelled training examples and wants to choose between logistic regression and a large deep neural network. Which is likely to perform better, and why?
Summary
Deep learning is machine learning with multi-layered neural networks that learn their own feature representations from raw data. A single neuron computes a weighted sum and passes it through a non-linear activation; stacking many neurons in layers creates hierarchical representations powerful enough to solve problems that were unsolvable with classical approaches. The field has exploded because data, compute, and algorithmic improvements all converged at the same time — and all three continue to improve.
The next course covers neural network architecture in depth — backpropagation, weight initialisation, and the mechanics of how networks actually learn from data.