The Biological Inspiration
Artificial neural networks are loosely inspired by biological brains. A biological neuron receives signals from other neurons through dendrites, processes them in the cell body, and fires an output signal through its axon if the cumulative input exceeds a threshold. Artificial neurons replicate this: they compute a weighted sum of inputs, add a bias, and pass the result through an activation function.
The Perceptron
The perceptron is the simplest neural network unit. It computes a linear combination of its inputs and applies a step function. A single perceptron can learn any linearly separable problem, but it fails on problems that are not linearly separable, such as XOR, a limitation that motivated multi-layer networks.
output = activation(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
# w = weights, x = inputs, b = bias
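The classic perceptron learning rule can be sketched in a few lines of plain Python. This is an illustrative sketch (the function names, learning rate, and epoch count are ours, not from the text): the loop below learns AND, which is linearly separable.

```python
def step(z):
    # Step activation: fire (1) if the weighted sum clears the threshold.
    return 1 if z >= 0 else 0

def train_perceptron(samples, lr=1.0, epochs=20):
    # Perceptron learning rule: nudge weights toward the target on each mistake.
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for x, target in samples:
            pred = step(w[0] * x[0] + w[1] * x[1] + b)
            err = target - pred
            w = [w[0] + lr * err * x[0], w[1] + lr * err * x[1]]
            b += lr * err
    return w, b

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
w, b = train_perceptron(AND)   # converges: AND is linearly separable
```

Running the same loop on XOR never converges, no matter how long it trains; that failure is exactly what motivated multi-layer networks.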
Activation Functions
Without activation functions, a deep network is just a linear transformation, no more expressive than a single layer. Activation functions introduce the non-linearity that lets networks approximate any function.
- Sigmoid: squashes output to (0,1). Used in output layers for binary classification. Suffers from vanishing gradients in deep networks.
- Tanh: squashes output to (-1,1). Zero-centered, better than sigmoid for hidden layers but still has vanishing gradient issues.
- ReLU (Rectified Linear Unit): f(x) = max(0, x). The default for hidden layers — computationally cheap and avoids vanishing gradients for positive inputs.
- Leaky ReLU: allows a small slope for negative inputs to fix the "dying ReLU" problem.
- Softmax: converts a vector of raw scores to a probability distribution. Used in the output layer for multi-class classification.
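For reference, the five activations above can be written in a few lines of numpy (a sketch; the function names are ours):

```python
import numpy as np

def sigmoid(z):
    # Squashes to (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # Squashes to (-1, 1), zero-centered.
    return np.tanh(z)

def relu(z):
    # max(0, x), elementwise.
    return np.maximum(0.0, z)

def leaky_relu(z, alpha=0.01):
    # Small slope alpha for negative inputs.
    return np.where(z > 0, z, alpha * z)

def softmax(z):
    # Subtracting the max avoids overflow; result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()
```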
What is the primary reason activation functions are used in neural networks?
Forward and Backward Passes
Forward Pass
Data flows from the input layer through hidden layers to the output layer. Each layer applies a linear transformation followed by an activation function. The final output is compared to the true label using a loss function (e.g., cross-entropy for classification, MSE for regression).
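One forward pass through a small two-layer network looks like this in numpy. The layer sizes, random weights, and class label below are purely illustrative, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=4)                    # input vector
W1, b1 = rng.normal(size=(8, 4)), np.zeros(8)
W2, b2 = rng.normal(size=(3, 8)), np.zeros(3)

h = np.maximum(0.0, W1 @ x + b1)          # hidden layer: linear + ReLU
logits = W2 @ h + b2                      # output layer: linear
e = np.exp(logits - logits.max())
probs = e / e.sum()                       # softmax over 3 classes

true_class = 1
loss = -np.log(probs[true_class])         # cross-entropy vs. the true label
```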
Backpropagation
Backpropagation computes the gradient of the loss with respect to every weight in the network by applying the chain rule repeatedly from the output layer back to the input layer. These gradients are then used by an optimizer (e.g., SGD, Adam) to update the weights.
Backpropagation = chain rule applied layer by layer from output → input
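On a toy network with one neuron per layer, the chain rule can be traced by hand. A sketch with made-up numbers, a sigmoid at each layer, and an MSE loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

x, target = 0.5, 1.0
w1, w2 = 0.8, -0.4

# Forward pass: y = sigmoid(w2 * sigmoid(w1 * x))
a1 = sigmoid(w1 * x)
y = sigmoid(w2 * a1)
loss = 0.5 * (y - target) ** 2

# Backward pass: chain rule, output toward input
dL_dy = y - target
dy_dz2 = y * (1 - y)                     # sigmoid derivative at layer 2
dL_dw2 = dL_dy * dy_dz2 * a1
dL_da1 = dL_dy * dy_dz2 * w2
dL_dw1 = dL_da1 * a1 * (1 - a1) * x      # sigmoid derivative at layer 1

# An optimizer (here plain SGD) then uses the gradients:
lr = 0.1
w1 -= lr * dL_dw1
w2 -= lr * dL_dw2
```

Each `dL_d*` line is one application of the chain rule; deeper networks just repeat the same pattern layer by layer.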
The Vanishing Gradient Problem
In very deep networks using sigmoid or tanh activations, gradients can shrink exponentially as they propagate backward. By the time they reach early layers, updates are negligibly small — those layers barely learn. This was a major barrier to training deep networks before ReLU activations and residual connections became standard.
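The shrinkage is easy to quantify for sigmoid: its derivative s(z)(1 - s(z)) peaks at 0.25 (at z = 0), so backpropagating through n sigmoid layers multiplies the gradient by at most 0.25 per layer:

```python
# Upper bound on the gradient factor after 20 sigmoid layers.
n_layers = 20
upper_bound = 0.25 ** n_layers
print(upper_bound)   # ~9.1e-13: early layers receive essentially no gradient
```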
Why does the vanishing gradient problem occur with sigmoid and tanh activations in deep networks?
Regularization Techniques
- Dropout: randomly zero out a fraction of neurons during each training step. Forces the network to learn redundant representations and prevents co-adaptation.
- Batch Normalization: normalize each layer's inputs to zero mean and unit variance over each mini-batch. Stabilizes training, allows higher learning rates, and acts as a mild regularizer.
- L2 Regularization (Weight Decay): adds a penalty proportional to the sum of squared weights to the loss, discouraging large weight values.
- Early Stopping: monitor validation loss; stop training when it stops improving.
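Two of these techniques fit in a few lines of numpy. A sketch of inverted dropout and an L2 penalty term (the function names and hyperparameter values are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(h, p_drop=0.5, training=True):
    if not training:
        return h                           # no-op at inference time
    mask = rng.random(h.shape) >= p_drop   # keep each unit with prob 1 - p_drop
    return h * mask / (1.0 - p_drop)       # rescale so the expected value is unchanged

def l2_penalty(weights, lam=1e-4):
    # Added to the loss; gradient descent on it shrinks weights toward zero.
    return lam * sum((W ** 2).sum() for W in weights)
```

The rescaling by 1 / (1 - p_drop) is what makes this the "inverted" variant: no extra scaling is needed at inference.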
Convolutional and Recurrent Variants
- CNNs (Convolutional Neural Networks): use learned filters to detect local patterns in grid-like data. Excellent for images; 1-D variants also apply to sequences such as audio.
- RNNs / LSTMs / GRUs: process sequential data with hidden state that carries information across time steps. Largely replaced by Transformers for NLP.
- Residual Networks (ResNets): add skip connections that allow gradients to bypass layers, enabling training of very deep networks (100+ layers).
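The residual idea is small enough to show directly. A sketch of one block computing output = x + F(x), where F here is a toy linear-ReLU-linear transformation (the shapes and names are illustrative):

```python
import numpy as np

def residual_block(x, W1, W2):
    # F(x): linear + ReLU + linear
    h = np.maximum(0.0, W1 @ x)
    # Skip connection: add the input back, so gradients have a direct path past F.
    return x + W2 @ h
```

If F's weights are all zero, the block is exactly the identity, which is one reason deep stacks of residual blocks remain trainable.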