Why Transformers?
Before Transformers, sequence models like LSTMs and GRUs dominated NLP. They process tokens one by one, passing a hidden state forward. This sequential nature makes training slow (can't be parallelized), and long-range dependencies are hard to maintain — the hidden state is a fixed-size bottleneck.
Transformers, introduced in "Attention Is All You Need" (Vaswani et al., 2017), replace recurrence with self-attention. Every position attends to every other position in one parallel operation. Training is dramatically faster, and long-range relationships are captured directly.
Self-Attention
Self-attention computes a representation for each token by looking at all other tokens in the sequence. For each token, three vectors are computed from its embedding: Query (Q), Key (K), and Value (V).
- Attention score between token i and token j = dot product of Q_i and K_j, divided by √d_k (the dimensionality of the key vectors)
- Scores are passed through softmax to get attention weights (which sum to 1)
- Output for token i = weighted sum of all Value vectors, using the attention weights
```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V, mask=None):
    d_k = Q.shape[-1]
    # Similarity of every query with every key, scaled by sqrt(d_k)
    # to keep the softmax inputs in a stable range.
    scores = torch.matmul(Q, K.transpose(-2, -1)) / (d_k ** 0.5)
    if mask is not None:
        # Masked positions get -inf so softmax assigns them zero weight.
        scores = scores.masked_fill(mask == 0, float('-inf'))
    weights = F.softmax(scores, dim=-1)
    # Each output is a weighted average of the Value vectors.
    return torch.matmul(weights, V)
```
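A quick sanity check of the computation above (the three core lines are reproduced inline so the snippet is self-contained; the tensor shapes are illustrative):

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
Q = torch.randn(2, 5, 8)   # (batch, seq_len, d_k)
K = torch.randn(2, 5, 8)
V = torch.randn(2, 5, 8)

scores = torch.matmul(Q, K.transpose(-2, -1)) / (8 ** 0.5)
weights = F.softmax(scores, dim=-1)   # each row sums to 1
out = torch.matmul(weights, V)        # same shape as V: (2, 5, 8)
```

Note that the output has the same shape as the input values: attention mixes information across positions without changing dimensionality.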
In self-attention, what is the role of the Query (Q) vector for a given token?
Multi-Head Attention
Instead of computing attention once, multi-head attention runs h parallel attention operations with different learned projections. Each "head" can focus on different aspects of the input (syntax, semantics, coreference, etc.). The outputs are concatenated and projected back to the model dimension.
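The split-attend-concatenate-project pipeline can be sketched in a few lines (a minimal illustration; the class name and layer layout are my own, not taken from any reference implementation):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_k = d_model // num_heads   # per-head dimension
        # One learned projection each for Q, K, V, plus the output projection.
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)

    def forward(self, x):
        B, T, _ = x.shape
        # Project, then reshape so each head attends independently:
        # (B, T, d_model) -> (B, num_heads, T, d_k)
        def split(t):
            return t.view(B, T, self.num_heads, self.d_k).transpose(1, 2)
        Q, K, V = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        scores = Q @ K.transpose(-2, -1) / (self.d_k ** 0.5)
        weights = scores.softmax(dim=-1)
        out = weights @ V                            # (B, num_heads, T, d_k)
        out = out.transpose(1, 2).reshape(B, T, -1)  # concatenate heads
        return self.W_o(out)                         # project back to d_model
```

Splitting d_model across heads (rather than running h full-size attentions) keeps the total cost roughly the same as single-head attention.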
Positional Encoding
Self-attention is permutation-invariant — reordering the input tokens merely reorders the outputs, so the model has no built-in notion of position. To fix this, Transformers add positional encoding to the token embeddings before the first layer. The original paper used sinusoidal functions; modern models often use learned positional embeddings or rotary position embedding (RoPE).
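The sinusoidal scheme from the original paper can be sketched as follows (a minimal version assuming an even d_model; the function name is my own):

```python
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    # PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    pos = torch.arange(max_len).unsqueeze(1).float()      # (max_len, 1)
    i = torch.arange(0, d_model, 2).float()               # even indices
    angles = pos / (10000 ** (i / d_model))               # (max_len, d_model/2)
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # sine on even dimensions
    pe[:, 1::2] = torch.cos(angles)   # cosine on odd dimensions
    return pe
```

Each position gets a unique pattern of frequencies, and relative offsets correspond to fixed linear transformations of these vectors, which is one motivation the paper gives for this choice.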
The Full Transformer Block
- Multi-Head Self-Attention sublayer
- Add & LayerNorm (residual connection + layer normalization)
- Position-wise Feed-Forward Network (two linear layers with a non-linear activation — ReLU in the original paper; modern models typically use GELU, SwiGLU, or similar variants)
- Add & LayerNorm again
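The four sublayers above can be assembled into a compact sketch (post-LayerNorm ordering as in the original paper; many modern models move the LayerNorm before each sublayer instead — this class is illustrative, not a reference implementation):

```python
import torch
import torch.nn as nn

class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),                  # original activation; modern models use GELU etc.
            nn.Linear(d_ff, d_model),
        )
        self.ln1 = nn.LayerNorm(d_model)
        self.ln2 = nn.LayerNorm(d_model)

    def forward(self, x):
        attn_out, _ = self.attn(x, x, x)   # self-attention: Q = K = V = x
        x = self.ln1(x + attn_out)         # Add & LayerNorm
        x = self.ln2(x + self.ffn(x))      # Add & LayerNorm
        return x
```

Because every sublayer maps (batch, seq_len, d_model) to the same shape, blocks stack cleanly to any depth.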
Encoder vs Decoder
Encoder (BERT-style)
The encoder processes the entire input sequence bidirectionally. BERT pre-trains by masking random tokens and predicting them (Masked Language Modeling). The encoder is ideal for understanding tasks: classification, NER, question answering.
Decoder (GPT-style)
The decoder generates tokens autoregressively — one at a time, left to right. Masked self-attention ensures each token can only attend to previous positions (no "future leakage"). GPT pre-trains by predicting the next token. The decoder is ideal for generation tasks.
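The "no future leakage" constraint is implemented with a causal (lower-triangular) mask, as in the following sketch:

```python
import torch
import torch.nn.functional as F

T = 4
# 1 = allowed, 0 = future position that must be hidden.
causal_mask = torch.tril(torch.ones(T, T))

scores = torch.randn(T, T)
scores = scores.masked_fill(causal_mask == 0, float('-inf'))
weights = F.softmax(scores, dim=-1)
# weights[i, j] == 0 for every j > i: token i never attends ahead of itself.
```

The first row has only one allowed position, so its entire attention weight lands on token 0; each later row spreads its weight over the tokens up to and including itself.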
Encoder-Decoder (T5, BART-style)
Combines both: the encoder processes the full source sequence; the decoder generates the target sequence with cross-attention to encoder outputs. Used for translation, summarization, and general seq2seq tasks.
Which Transformer variant uses masked self-attention to prevent tokens from attending to future positions?
Complexity and Scaling
Standard self-attention has O(n²) time and memory complexity with respect to sequence length n, which becomes a bottleneck for very long sequences. Research has produced efficient alternatives (Longformer, FlashAttention, sparse attention). Transformers also scale remarkably well — more data + more compute + more parameters = better performance (scaling laws).
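Some back-of-the-envelope arithmetic makes the quadratic cost concrete (per-head, fp16, ignoring activations other than the score matrix):

```python
# The attention score matrix has n * n entries per head.
for n in [1_000, 10_000, 100_000]:
    entries = n * n
    gb = entries * 2 / 1e9   # 2 bytes per fp16 entry
    print(f"n={n:>7}: {entries:,} scores ≈ {gb:g} GB per head")
```

Growing the sequence length 10× grows the score matrix 100×, which is why long-context work leans on efficient attention variants.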
Fine-Tuning Pre-trained Models
Pre-trained models like BERT and GPT are trained on massive corpora and capture rich language representations. Fine-tuning continues training on a smaller task-specific dataset, adapting the model weights. Modern approaches include instruction tuning (supervised fine-tuning on instruction-following examples) and RLHF (a separate reinforcement learning step using human preference feedback — the two are distinct techniques that often appear together). Parameter-efficient methods like LoRA freeze most weights and train only small adapter matrices.
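The LoRA idea can be sketched as a wrapper around a frozen linear layer (a hypothetical minimal version: the class name, rank r, and scaling alpha/r are illustrative; real implementations live in libraries such as PEFT):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    # y = W x + (alpha / r) * B A x, where W is frozen and only A, B train.
    def __init__(self, base: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False   # freeze the pretrained weights
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so the adapter initially contributes nothing
        # and fine-tuning begins from the pretrained behavior.
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

With rank r much smaller than the layer width, the trainable parameter count drops from in_features × out_features to r × (in_features + out_features).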