Transformers & Attention
Self-attention, multi-head attention, positional encodings, and the encoder-decoder architecture
What are Transformers?
🤖 Transformers - The Foundation of Modern LLMs
What: Neural network architecture based entirely on attention mechanisms (no recurrence, no convolution)
Why revolutionary: Processes entire sequences in parallel, captures long-range dependencies efficiently
Introduced: "Attention is All You Need" (Vaswani et al., 2017)
Powers: GPT, BERT, T5, Claude, ChatGPT, LLaMA, and virtually all modern LLMs
Parallelization:
• RNN: Must process sequentially (token 1 → token 2 → token 3...)
• Transformer: Processes all tokens simultaneously
• Result: 10-100× faster training on GPUs
Long-Range Dependencies:
• RNN: Information degrades over long sequences (vanishing gradients)
• Transformer: Direct connections between any two tokens via attention
• Result: Can relate words 1000+ tokens apart
Interpretability:
• RNN: Hidden states are opaque
• Transformer: Attention weights show which tokens are important
• Result: Can visualize what the model "pays attention" to
Self-Attention Mechanism
Example: "The animal didn't cross the street because it was too tired"
→ Self-attention helps the model understand that "it" refers to "animal" (not "street")
• $Q$ (Query): "What am I looking for?" - matrix of shape $(n, d_k)$
• $K$ (Key): "What do I contain?" - matrix of shape $(n, d_k)$
• $V$ (Value): "What information do I provide?" - matrix of shape $(n, d_v)$
• $n$ = sequence length (number of tokens)
• $d_k$ = dimension of keys/queries (typically 64)
• $d_v$ = dimension of values (typically 64)
• $\sqrt{d_k}$ = scaling factor to prevent softmax saturation
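These quantities combine into the scaled dot-product attention formula from "Attention is All You Need":
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$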
Process:
1. Compute attention scores: $QK^T$ (how much each token attends to others)
2. Scale: divide by $\sqrt{d_k}$ (stabilizes gradients)
3. Normalize: softmax converts scores to probabilities (sum to 1)
4. Aggregate: weighted sum of values $V$ using attention weights
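As a concrete illustration, here is a minimal NumPy sketch of these four steps. The function name and the random toy inputs are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 1-4 above: scores -> scale -> softmax -> weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n, n) scaled attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                                # (n, d_v) output, (n, n) weights

# Toy check: n = 3 tokens, d_k = d_v = 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 64)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (3, 64), rows of weights sum to 1
```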
• $X$ = input embeddings (shape: $n \times d_{model}$, e.g., $512 \times 512$)
• $W^Q, W^K, W^V$ = learned weight matrices
• Each token's embedding is linearly projected to create Q, K, V
Intuition: The model learns three different "views" of each token:
• Query: What to search for
• Key: How to be searched
• Value: What information to pass forward
Example dimensions:
• Input: "The cat sat" → 3 tokens × 512 dimensions
• $W^Q$: $512 \times 64$ (projects to smaller query space)
• Result: $Q, K, V$ all have shape $3 \times 64$
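The projections are $Q = XW^Q$, $K = XW^K$, $V = XW^V$. A quick shape check with random matrices standing in for the learned weights:

```python
import numpy as np

n, d_model, d_k = 3, 512, 64          # "The cat sat" -> 3 tokens
X = np.random.randn(n, d_model)       # input embeddings
W_Q = np.random.randn(d_model, d_k)   # learned in practice; random here
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)      # (3, 64) each
```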
Step-by-Step Example
Sentence: "The cat sat"
Each word gets three vectors (simplified to 3D for illustration):
• "cat": $Q_2 = [0.8, 0.3, 0.4]$, $K_2 = [0.7, 0.5, 0.3]$, $V_2 = [0.5, 1.0, 0.3]$
• "sat": $Q_3 = [0.4, 0.7, 0.2]$, $K_3 = [0.5, 0.6, 0.4]$, $V_3 = [0.2, 0.5, 1.0]$
$Q_2 \cdot K^T$ = dot products with all keys:
• "cat" → "cat": $Q_2 \cdot K_2 = 0.8×0.7 + 0.3×0.5 + 0.4×0.3 = 0.83$
• "cat" → "sat": $Q_2 \cdot K_3 = 0.8×0.5 + 0.3×0.6 + 0.4×0.4 = 0.74$
Scale by $\sqrt{d_k} = \sqrt{3} \approx 1.73$, then apply softmax over all three scores (including the score for "The", which is not shown above):
After softmax: $[0.24, 0.41, 0.35]$ ← attention weights over "The", "cat", "sat" (sum to 1)
Output for "cat":
$= 0.24 \times [1.0, 0.0, 0.5] + 0.41 \times [0.5, 1.0, 0.3] + 0.35 \times [0.2, 0.5, 1.0]$
$= [0.515, 0.585, 0.593]$
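Plugging the illustrative numbers above into code confirms the weighted sum:

```python
import numpy as np

weights = np.array([0.24, 0.41, 0.35])          # attention weights for "cat"
values = np.array([[1.0, 0.0, 0.5],             # V_1 ("The")
                   [0.5, 1.0, 0.3],             # V_2 ("cat")
                   [0.2, 0.5, 1.0]])            # V_3 ("sat")
print(weights @ values)                         # ≈ [0.515, 0.585, 0.593]
```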
Multi-Head Attention
🎯 Multi-Head Attention - Multiple Perspectives
Why multiple heads? Different heads can learn different types of relationships
Example heads might learn:
- Head 1: Syntactic relationships (subject-verb, noun-adjective)
- Head 2: Semantic relationships (synonyms, antonyms)
- Head 3: Positional relationships (adjacent words, distant dependencies)
- Head 4: Coreference (pronouns to their referents)
• $h$ = number of attention heads (typically 8 or 12)
• $W_i^Q, W_i^K, W_i^V$ = learned projection matrices for head $i$
• $W^O$ = output projection matrix
• Each head has its own Q, K, V projections
• Heads run in parallel (very efficient on GPUs)
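For reference, the multi-head computation from the original paper combines the per-head attentions as:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$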
Dimension management:
• Model dimension $d_{model} = 512$ (typical)
• Number of heads $h = 8$
• Each head dimension: $d_k = d_v = d_{model} / h = 64$
• Total parameters same as single-head, but more expressive!
Process:
1. Split embedding into $h$ heads
2. Each head performs self-attention independently
3. Concatenate all head outputs
4. Final linear projection $W^O$
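A minimal PyTorch sketch of this split / attend / concatenate / project pipeline. The class name, layout, and batch-first tensor convention are illustrative, not taken from any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)    # output projection W^O

    def forward(self, x):                         # x: (batch, n, d_model)
        B, n, _ = x.shape
        # 1. project, then split into h heads of size d_k
        def split(t):
            return t.view(B, n, self.h, self.d_k).transpose(1, 2)   # (B, h, n, d_k)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # 2. each head attends independently (in parallel)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5          # (B, h, n, n)
        out = F.softmax(scores, dim=-1) @ v                         # (B, h, n, d_k)
        # 3. concatenate heads, then 4. final projection
        out = out.transpose(1, 2).contiguous().view(B, n, self.h * self.d_k)
        return self.W_o(out)

x = torch.randn(1, 3, 512)                 # "The cat sat"
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([1, 3, 512])
```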
Visualization: What Different Heads Learn
Sentence: "The quick brown fox jumps over the lazy dog"
• "brown" → "fox" (high attention)
• "lazy" → "dog" (high attention)
Learns: Adjectives modify nearby nouns
• "The" → "fox" (moderate attention)
Learns: Subject-verb agreement and relationships
• "over" → "dog" (high attention)
Learns: Prepositional phrases and verb objects
• Strong local attention pattern
Learns: Word order and sequential information
Positional Encoding
📍 Positional Encoding - Injecting Word Order
Problem: Self-attention is permutation-invariant - it treats "dog bites man" the same as "man bites dog"!
Solution: Add positional information to embeddings so model knows word order
Two approaches: Learned embeddings (GPT) or sinusoidal functions (original Transformer)
• $pos$ = position in sequence (0, 1, 2, ...)
• $i$ = dimension index (0 to $d_{model}/2$)
• $d_{model}$ = embedding dimension (e.g., 512)
• Even dimensions use sine, odd dimensions use cosine
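In symbols, the sinusoidal encoding from the original paper is:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$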
Why this formula?
• Different frequencies for different dimensions
• Low dimensions: rapid oscillation (capture local position)
• High dimensions: slow oscillation (capture global position)
• Can extrapolate to longer sequences than seen in training
• Relative positions can be computed with linear transformations
How it's used:
$$\text{Input} = \text{Token Embedding} + \text{Positional Encoding}$$
The positional encoding is added element-wise to the token embeddings before they enter the transformer.
Alternative approach - learned positional embeddings (a code sketch follows the lists below):
• Create an embedding matrix $E_{pos}$ of shape $(\text{max\_length}, d_{model})$
• Position 0 gets $E_{pos}[0]$, position 1 gets $E_{pos}[1]$, etc.
• These are learned during training like word embeddings
Advantages:
• Simpler to implement
• Can learn position-specific patterns
• Used by GPT-2, GPT-3, BERT
Disadvantages:
• Fixed maximum sequence length (can't extrapolate)
• More parameters to learn
• GPT-3: max 2048 positions × 12288 dims = 25M parameters just for positions!
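A minimal sketch of the learned-embedding approach. The sizes are roughly GPT-2-like and the token ids are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768    # GPT-2-like sizes (illustrative)
tok_emb = nn.Embedding(vocab_size, d_model)        # word embeddings
pos_emb = nn.Embedding(max_len, d_model)           # learned positional embeddings E_pos

token_ids = torch.tensor([[464, 3797, 3332]])      # made-up ids for "The cat sat"
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2]]
x = tok_emb(token_ids) + pos_emb(positions)        # input = token emb + positional emb
print(x.shape)                                     # torch.Size([1, 3, 768])
```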
Positional Encoding Visualization
Example: First few dimensions of sinusoidal encoding
Position | Dim 0 (sin) | Dim 1 (cos) | Dim 2 (sin) | Dim 3 (cos) |
---|---|---|---|---|
0 | 0.000 | 1.000 | 0.000 | 1.000 |
1 | 0.841 | 0.540 | 0.010 | 1.000 |
2 | 0.909 | -0.416 | 0.020 | 1.000 |
3 | 0.141 | -0.990 | 0.030 | 1.000 |
4 | -0.757 | -0.654 | 0.040 | 0.999 |
• Dim 0-1 (low dimensions): Values change rapidly with position → capture fine-grained local order
• Dim 2-3 (higher dimensions): Values change slowly → capture coarse global position
• This multi-scale representation helps model understand both local and global position
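The table above can be reproduced with a tiny $d_{model}$ (here 4, so dims 0-1 use frequency 1 and dims 2-3 use frequency 1/100):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return pe

print(np.round(sinusoidal_encoding(5, 4), 3))       # matches the table (positions 0-4, dims 0-3)
```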
Encoder-Decoder Architecture
🔄 Encoder-Decoder - Two-Stage Processing
Encoder: Processes input sequence, builds contextual representations
Decoder: Generates output sequence, attending to encoder outputs
Use cases: Translation, summarization, question answering
Note: GPT = decoder-only, BERT = encoder-only, T5/BART = full encoder-decoder
Architecture Overview
📥 ENCODER
Example: "Hello, how are you?" (English)
Each layer has:
1. Multi-Head Self-Attention
• Each token attends to all input tokens
• Bidirectional (can see entire input)
2. Feed-Forward Network
• 2-layer MLP applied to each position
• Same across positions, different across layers
3. Residual Connection + Layer Norm
• After each sub-layer
• Stabilizes training
Output:
• One vector per input token
• Passed to decoder
📤 DECODER
Example: Start with "<START>", generate "Bonjour"
Each layer has:
1. Masked Self-Attention
• Each position attends only to earlier positions
• Prevents "peeking" at future tokens
2. Cross-Attention
• Query from decoder, Keys & Values from encoder
• Allows decoder to "look at" source sentence
3. Feed-Forward Network
• Same as encoder FFN
4. Residual Connection + Layer Norm
• After each sub-layer
Output:
• Softmax over vocabulary
Example final output: "Bonjour, comment allez-vous?"
Self-Attention (Encoder & Decoder):
$$Q, K, V \text{ all from the same sequence}$$
• Query, Key, Value all derived from the same input
• Each token attends to tokens in the same sequence
• Used in both encoder and decoder (with masking in decoder)
Cross-Attention (Decoder only):
$$Q \text{ from decoder}, \quad K, V \text{ from encoder}$$
• Query from decoder (what I'm generating)
• Keys & Values from encoder (source sentence information)
• Allows decoder to "attend" to relevant parts of input
• This is how translation works: French decoder looks at English encoder outputs
Example (Translation):
Generating French "chat" (cat):
• Decoder query for "chat" attends to encoder keys
• High attention to English "cat" in encoder outputs
• Retrieves corresponding encoder values to inform generation
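A sketch of cross-attention using PyTorch's built-in nn.MultiheadAttention, where the query comes from decoder states and the keys/values from encoder outputs; the shapes and sequence lengths are illustrative:

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)

encoder_out = torch.randn(1, 6, d_model)    # 6 English source tokens
decoder_state = torch.randn(1, 3, d_model)  # 3 French tokens generated so far

# Query from decoder, Key & Value from encoder
out, attn_weights = cross_attn(query=decoder_state, key=encoder_out, value=encoder_out)
print(out.shape, attn_weights.shape)        # (1, 3, 512), (1, 3, 6)
```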
Causal Masking (Decoder Self-Attention):
For autoregressive generation (GPT-style), the mask added to the attention scores is lower-triangular:
$$M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
• Position $i$ can only attend to positions $\leq i$
• $-\infty$ makes softmax output 0 for masked positions
• Prevents model from "cheating" by looking ahead
• Essential for language modeling and generation tasks
Why needed: During training, we have the full target sequence. Without masking, the model could simply copy future tokens instead of learning to generate them!
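A minimal NumPy sketch of building and applying such a mask (4 tokens, as in the matrix above):

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)                     # raw QK^T / sqrt(d_k) scores
mask = np.triu(np.full((n, n), -np.inf), k=1)      # -inf strictly above the diagonal
masked = scores + mask                             # future positions get -inf

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(weights, 2))                        # upper triangle is exactly 0
```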
Complete Transformer Layer
Single Encoder Layer (detailed)
Step 1 - Multi-head self-attention:
$$Z = \text{MultiHead}(X, X, X)$$
All tokens attend to all tokens.
Step 2 - Residual connection + layer norm:
$$X' = \text{LayerNorm}(X + Z)$$
Helps with gradient flow, stabilizes training.
Step 3 - Position-wise feed-forward network:
$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$
Applied independently to each position.
Typical dimensions: $d_{model} = 512 \rightarrow d_{ff} = 2048 \rightarrow d_{model} = 512$
Step 4 - Residual connection + layer norm:
$$X'' = \text{LayerNorm}(X' + \text{FFN}(X'))$$
Output: passed to the next encoder layer (or to the decoder if this is the final layer)
• Residual connections: $X + F(X)$ instead of just $F(X)$ - enables training deep networks (100+ layers)
• Layer normalization: Normalizes across feature dimension - stabilizes training
• Feed-forward expansion: $d_{ff} = 4 \times d_{model}$ - provides capacity for complex transformations
• Position-wise FFN: Same network applied to each token independently - enables parallelization
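Putting the pieces together, a minimal post-norm encoder layer in PyTorch. This is a sketch of the equations above using the built-in nn.MultiheadAttention; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm encoder layer: self-attention + FFN, each with residual + LayerNorm."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(                  # FFN(x) = ReLU(x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, n, d_model)
        z, _ = self.self_attn(x, x, x)             # Z = MultiHead(X, X, X)
        x = self.norm1(x + z)                      # X' = LayerNorm(X + Z)
        x = self.norm2(x + self.ffn(x))            # X'' = LayerNorm(X' + FFN(X'))
        return x

x = torch.randn(1, 3, 512)
print(EncoderLayer()(x).shape)                     # torch.Size([1, 3, 512])
```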
Transformer Variants
Model | Architecture | Attention Type | Use Case |
---|---|---|---|
BERT | Encoder-only | Bidirectional | Classification, NER, Q&A |
GPT-2/3/4 | Decoder-only | Causal (masked) | Text generation, completion |
T5 | Encoder-Decoder | Both + Cross | Translation, summarization |
BART | Encoder-Decoder | Both + Cross | Denoising, summarization |
LLaMA | Decoder-only | Causal | Efficient LLM, open-source |
Claude | Decoder-only (likely) | Causal | Conversational AI, safety |
Key Insights
Attention Is Enough:
The 2017 paper showed that recurrence and convolution are not necessary - attention (plus simple feed-forward layers) is enough and more efficient.
Parallelization is Key:
RNNs process sequentially (slow), Transformers process all tokens simultaneously (fast) → enabled training on massive datasets.
Scaling Laws:
Transformers scale incredibly well - performance improves predictably with more parameters, data, and compute.
Multi-Head = Multiple Perspectives:
Different heads learn different relationships (syntax, semantics, position) without being explicitly told.
Position Matters:
Without positional encoding, "dog bites man" = "man bites dog" to the model. Position info is critical.
Architecture Matters for Task:
• Encoder-only (BERT): Best for understanding tasks (classification, NER)
• Decoder-only (GPT): Best for generation tasks (completion, chat)
• Encoder-Decoder (T5): Best for seq2seq tasks (translation, summarization)
Cross-Attention is the Bridge:
In encoder-decoder models, cross-attention is how the decoder "looks at" the input - crucial for translation.
Masking Prevents Cheating:
Causal mask in decoder ensures model learns to generate, not just copy future tokens during training.