Transformers & Attention
Self-attention, multi-head attention, positional encodings, and the encoder-decoder architecture
What are Transformers?
🤖 Transformers - The Foundation of Modern LLMs
What: Neural network architecture based entirely on attention mechanisms (no recurrence, no convolution)
Why revolutionary: Processes entire sequences in parallel, captures long-range dependencies efficiently
Introduced: "Attention is All You Need" (Vaswani et al., 2017)
Powers: GPT, BERT, T5, Claude, ChatGPT, LLaMA, and virtually all modern LLMs
Parallelization:
• RNN: Must process sequentially (token 1 → token 2 → token 3...)
• Transformer: Processes all tokens simultaneously
• Result: 10-100× faster training on GPUs
Long-Range Dependencies:
• RNN: Information degrades over long sequences (vanishing gradients)
• Transformer: Direct connections between any two tokens via attention
• Result: Can relate words 1000+ tokens apart
Interpretability:
• RNN: Hidden states are opaque
• Transformer: Attention weights show which tokens are important
• Result: Can visualize what the model "pays attention" to
Self-Attention Mechanism
Example: "The animal didn't cross the street because it was too tired"
→ Self-attention helps the model understand that "it" refers to "animal" (not "street")
• $Q$ (Query): "What am I looking for?" - matrix of shape $(n, d_k)$
• $K$ (Key): "What do I contain?" - matrix of shape $(n, d_k)$
• $V$ (Value): "What information do I provide?" - matrix of shape $(n, d_v)$
• $n$ = sequence length (number of tokens)
• $d_k$ = dimension of keys/queries (typically 64)
• $d_v$ = dimension of values (typically 64)
• $\sqrt{d_k}$ = scaling factor to prevent softmax saturation
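These quantities combine into the scaled dot-product attention formula from "Attention is All You Need":
$$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$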
Process:
1. Compute attention scores: $QK^T$ (how much each token attends to others)
2. Scale: divide by $\sqrt{d_k}$ (stabilizes gradients)
3. Normalize: softmax converts scores to probabilities (sum to 1)
4. Aggregate: weighted sum of values $V$ using attention weights
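As a concrete illustration, here is a minimal NumPy sketch of these four steps. The function name and the random toy inputs are purely illustrative:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Steps 1-4 above: scores -> scale -> softmax -> weighted sum of values."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                            # (n, n) scaled attention scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)    # row-wise softmax
    return weights @ V, weights                                # (n, d_v) output, (n, n) weights

# Toy check: n = 3 tokens, d_k = d_v = 64
rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(3, 64)) for _ in range(3))
out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape, w.sum(axis=-1))   # (3, 64), rows of weights sum to 1
```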
• $X$ = input embeddings (shape: $n \times d_{model}$, e.g., $512 \times 512$)
• $W^Q, W^K, W^V$ = learned weight matrices
• Each token's embedding is linearly projected to create Q, K, V
Intuition: The model learns three different "views" of each token:
• Query: What to search for
• Key: How to be searched
• Value: What information to pass forward
Example dimensions:
• Input: "The cat sat" → 3 tokens × 512 dimensions
• $W^Q$: $512 \times 64$ (projects to smaller query space)
• Result: $Q, K, V$ all have shape $3 \times 64$
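The projections are $Q = XW^Q$, $K = XW^K$, $V = XW^V$. A quick shape check with random matrices standing in for the learned weights:

```python
import numpy as np

n, d_model, d_k = 3, 512, 64          # "The cat sat" -> 3 tokens
X = np.random.randn(n, d_model)       # input embeddings
W_Q = np.random.randn(d_model, d_k)   # learned in practice; random here
W_K = np.random.randn(d_model, d_k)
W_V = np.random.randn(d_model, d_k)

Q, K, V = X @ W_Q, X @ W_K, X @ W_V
print(Q.shape, K.shape, V.shape)      # (3, 64) each
```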
Step-by-Step Example
Sentence: "The cat sat"
Each word gets three vectors (simplified to 3D for illustration):
• "cat": $Q_2 = [0.8, 0.3, 0.4]$, $K_2 = [0.7, 0.5, 0.3]$, $V_2 = [0.5, 1.0, 0.3]$
• "sat": $Q_3 = [0.4, 0.7, 0.2]$, $K_3 = [0.5, 0.6, 0.4]$, $V_3 = [0.2, 0.5, 1.0]$
$Q_2 \cdot K^T$ = dot products with all keys:
• "cat" → "cat": $Q_2 \cdot K_2 = 0.8×0.7 + 0.3×0.5 + 0.4×0.3 = 0.83$
• "cat" → "sat": $Q_2 \cdot K_3 = 0.8×0.5 + 0.3×0.6 + 0.4×0.4 = 0.74$
Scale by $\sqrt{d_k} = \sqrt{3} \approx 1.73$, then apply softmax over all three scores (including the score for "The", which is not shown above):
After softmax: $[0.24, 0.41, 0.35]$ ← attention weights over "The", "cat", "sat" (sum to 1)
Output for "cat":
$= 0.24 \times [1.0, 0.0, 0.5] + 0.41 \times [0.5, 1.0, 0.3] + 0.35 \times [0.2, 0.5, 1.0]$
$= [0.515, 0.585, 0.593]$
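Plugging the illustrative numbers above into code confirms the weighted sum:

```python
import numpy as np

weights = np.array([0.24, 0.41, 0.35])          # attention weights for "cat"
values = np.array([[1.0, 0.0, 0.5],             # V_1 ("The")
                   [0.5, 1.0, 0.3],             # V_2 ("cat")
                   [0.2, 0.5, 1.0]])            # V_3 ("sat")
print(weights @ values)                         # ≈ [0.515, 0.585, 0.593]
```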
Multi-Head Attention
🎯 Multi-Head Attention - Multiple Perspectives
Why multiple heads? Different heads can learn different types of relationships
Example heads might learn:
- Head 1: Syntactic relationships (subject-verb, noun-adjective)
- Head 2: Semantic relationships (synonyms, antonyms)
- Head 3: Positional relationships (adjacent words, distant dependencies)
- Head 4: Coreference (pronouns to their referents)
• $h$ = number of attention heads (typically 8 or 12)
• $W_i^Q, W_i^K, W_i^V$ = learned projection matrices for head $i$
• $W^O$ = output projection matrix
• Each head has its own Q, K, V projections
• Heads run in parallel (very efficient on GPUs)
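For reference, the multi-head computation from the original paper combines the per-head attentions as:
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V), \qquad \text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \ldots, \text{head}_h)\,W^O$$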
Dimension management:
• Model dimension $d_{model} = 512$ (typical)
• Number of heads $h = 8$
• Each head dimension: $d_k = d_v = d_{model} / h = 64$
• Total parameters same as single-head, but more expressive!
Process:
1. Split embedding into $h$ heads
2. Each head performs self-attention independently
3. Concatenate all head outputs
4. Final linear projection $W^O$
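A minimal PyTorch sketch of this split / attend / concatenate / project pipeline. The class name, layout, and batch-first tensor convention are illustrative, not taken from any particular library:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model=512, h=8):
        super().__init__()
        assert d_model % h == 0
        self.h, self.d_k = h, d_model // h
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)    # output projection W^O

    def forward(self, x):                         # x: (batch, n, d_model)
        B, n, _ = x.shape
        # 1. project, then split into h heads of size d_k
        def split(t):
            return t.view(B, n, self.h, self.d_k).transpose(1, 2)   # (B, h, n, d_k)
        q, k, v = split(self.W_q(x)), split(self.W_k(x)), split(self.W_v(x))
        # 2. each head attends independently (in parallel)
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5          # (B, h, n, n)
        out = F.softmax(scores, dim=-1) @ v                         # (B, h, n, d_k)
        # 3. concatenate heads, then 4. final projection
        out = out.transpose(1, 2).contiguous().view(B, n, self.h * self.d_k)
        return self.W_o(out)

x = torch.randn(1, 3, 512)                 # "The cat sat"
print(MultiHeadSelfAttention()(x).shape)   # torch.Size([1, 3, 512])
```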
Visualization: What Different Heads Learn
Sentence: "The quick brown fox jumps over the lazy dog"
• "brown" → "fox" (high attention)
• "lazy" → "dog" (high attention)
Learns: Adjectives modify nearby nouns
• "The" → "fox" (moderate attention)
Learns: Subject-verb agreement and relationships
• "over" → "dog" (high attention)
Learns: Prepositional phrases and verb objects
• Strong local attention pattern
Learns: Word order and sequential information
Positional Encoding
📍 Positional Encoding - Injecting Word Order
Problem: Self-attention is permutation-invariant - it treats "dog bites man" the same as "man bites dog"!
Solution: Add positional information to embeddings so model knows word order
Two approaches: Learned embeddings (GPT) or sinusoidal functions (original Transformer)
• $pos$ = position in sequence (0, 1, 2, ...)
• $i$ = dimension index (0 to $d_{model}/2$)
• $d_{model}$ = embedding dimension (e.g., 512)
• Even dimensions use sine, odd dimensions use cosine
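In symbols, the sinusoidal encoding from the original paper is:
$$PE_{(pos,\,2i)} = \sin\!\left(\frac{pos}{10000^{2i/d_{model}}}\right), \qquad PE_{(pos,\,2i+1)} = \cos\!\left(\frac{pos}{10000^{2i/d_{model}}}\right)$$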
Why this formula?
• Different frequencies for different dimensions
• Low dimensions: rapid oscillation (capture local position)
• High dimensions: slow oscillation (capture global position)
• Can extrapolate to longer sequences than seen in training
• Relative positions can be computed with linear transformations
How it's used:
$$\text{Input} = \text{Token Embedding} + \text{Positional Encoding}$$
The positional encoding is added element-wise to the token embeddings before they enter the transformer.
Alternative approach - learned positional embeddings (a code sketch follows the lists below):
• Create an embedding matrix $E_{pos}$ of shape $(\text{max\_length}, d_{model})$
• Position 0 gets $E_{pos}[0]$, position 1 gets $E_{pos}[1]$, etc.
• These are learned during training like word embeddings
Advantages:
• Simpler to implement
• Can learn position-specific patterns
• Used by GPT-2, GPT-3, BERT
Disadvantages:
• Fixed maximum sequence length (can't extrapolate)
• More parameters to learn
• GPT-3: max 2048 positions × 12288 dims = 25M parameters just for positions!
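A minimal sketch of the learned-embedding approach. The sizes are roughly GPT-2-like and the token ids are made up for illustration:

```python
import torch
import torch.nn as nn

vocab_size, max_len, d_model = 50257, 1024, 768    # GPT-2-like sizes (illustrative)
tok_emb = nn.Embedding(vocab_size, d_model)        # word embeddings
pos_emb = nn.Embedding(max_len, d_model)           # learned positional embeddings E_pos

token_ids = torch.tensor([[464, 3797, 3332]])      # made-up ids for "The cat sat"
positions = torch.arange(token_ids.size(1)).unsqueeze(0)   # [[0, 1, 2]]
x = tok_emb(token_ids) + pos_emb(positions)        # input = token emb + positional emb
print(x.shape)                                     # torch.Size([1, 3, 768])
```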
Positional Encoding Visualization
Example: First few dimensions of sinusoidal encoding
Position | Dim 0 (sin) | Dim 1 (cos) | Dim 2 (sin) | Dim 3 (cos) |
---|---|---|---|---|
0 | 0.000 | 1.000 | 0.000 | 1.000 |
1 | 0.841 | 0.540 | 0.010 | 1.000 |
2 | 0.909 | -0.416 | 0.020 | 1.000 |
3 | 0.141 | -0.990 | 0.030 | 1.000 |
4 | -0.757 | -0.654 | 0.040 | 0.999 |
• Dim 0-1 (low dimensions): Values change rapidly with position → capture fine-grained local order
• Dim 2-3 (higher dimensions): Values change slowly → capture coarse global position
• This multi-scale representation helps model understand both local and global position
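The table above can be reproduced with a tiny $d_{model}$ (here 4, so dims 0-1 use frequency 1 and dims 2-3 use frequency 1/100):

```python
import numpy as np

def sinusoidal_encoding(max_len, d_model):
    pos = np.arange(max_len)[:, None]               # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]            # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))     # (max_len, d_model/2)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)                    # even dims: sine
    pe[:, 1::2] = np.cos(angles)                    # odd dims: cosine
    return pe

print(np.round(sinusoidal_encoding(5, 4), 3))       # matches the table (positions 0-4, dims 0-3)
```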
Encoder-Decoder Architecture
🔄 Encoder-Decoder - Two-Stage Processing
Encoder: Processes input sequence, builds contextual representations
Decoder: Generates output sequence, attending to encoder outputs
Use cases: Translation, summarization, question answering
Note: GPT = decoder-only, BERT = encoder-only, T5/BART = full encoder-decoder
Architecture Overview
📥 ENCODER
Example: "Hello, how are you?" (English)
Each layer has:
1. Multi-Head Self-Attention
• Each token attends to all input tokens
• Bidirectional (can see entire input)
2. Feed-Forward Network
• 2-layer MLP applied to each position
• Same across positions, different across layers
3. Residual Connection + Layer Norm
• After each sub-layer
• Stabilizes training
Output:
• One vector per input token
• Passed to decoder
📤 DECODER
Example: Start with "<START>", generate "Bonjour"
Each layer has:
1. Masked Self-Attention
• Each position attends only to earlier positions
• Prevents "peeking" at future tokens
2. Cross-Attention
• Query from decoder, Keys & Values from encoder
• Allows decoder to "look at" source sentence
3. Feed-Forward Network
• Same as encoder FFN
4. Residual Connection + Layer Norm
• After each sub-layer
Output:
• Softmax over vocabulary
Example final output: "Bonjour, comment allez-vous?"
Self-Attention (Encoder & Decoder):
$$Q, K, V \text{ all from the same sequence}$$
• Query, Key, Value all derived from the same input
• Each token attends to tokens in the same sequence
• Used in both encoder and decoder (with masking in decoder)
Cross-Attention (Decoder only):
$$Q \text{ from decoder}, \quad K, V \text{ from encoder}$$
• Query from decoder (what I'm generating)
• Keys & Values from encoder (source sentence information)
• Allows decoder to "attend" to relevant parts of input
• This is how translation works: French decoder looks at English encoder outputs
Example (Translation):
Generating French "chat" (cat):
• Decoder query for "chat" attends to encoder keys
• High attention to English "cat" in encoder outputs
• Retrieves corresponding encoder values to inform generation
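A sketch of cross-attention using PyTorch's built-in nn.MultiheadAttention, where the query comes from decoder states and the keys/values from encoder outputs; the shapes and sequence lengths are illustrative:

```python
import torch
import torch.nn as nn

d_model, h = 512, 8
cross_attn = nn.MultiheadAttention(d_model, h, batch_first=True)

encoder_out = torch.randn(1, 6, d_model)    # 6 English source tokens
decoder_state = torch.randn(1, 3, d_model)  # 3 French tokens generated so far

# Query from decoder, Key & Value from encoder
out, attn_weights = cross_attn(query=decoder_state, key=encoder_out, value=encoder_out)
print(out.shape, attn_weights.shape)        # (1, 3, 512), (1, 3, 6)
```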
Causal Masking (Decoder Self-Attention):
For autoregressive generation (GPT-style), the mask added to the attention scores is lower-triangular:
$$M = \begin{bmatrix} 0 & -\infty & -\infty & -\infty \\ 0 & 0 & -\infty & -\infty \\ 0 & 0 & 0 & -\infty \\ 0 & 0 & 0 & 0 \end{bmatrix}$$
• Position $i$ can only attend to positions $\leq i$
• $-\infty$ makes softmax output 0 for masked positions
• Prevents model from "cheating" by looking ahead
• Essential for language modeling and generation tasks
Why needed: During training, we have the full target sequence. Without masking, the model could simply copy future tokens instead of learning to generate them!
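A minimal NumPy sketch of building and applying such a mask (4 tokens, as in the matrix above):

```python
import numpy as np

n = 4
scores = np.random.randn(n, n)                     # raw QK^T / sqrt(d_k) scores
mask = np.triu(np.full((n, n), -np.inf), k=1)      # -inf strictly above the diagonal
masked = scores + mask                             # future positions get -inf

weights = np.exp(masked - masked.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
print(np.round(weights, 2))                        # upper triangle is exactly 0
```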
Complete Transformer Layer
Single Encoder Layer (detailed)
Step 1 - Multi-head self-attention:
$$Z = \text{MultiHead}(X, X, X)$$
All tokens attend to all tokens.
Step 2 - Residual connection + layer norm:
$$X' = \text{LayerNorm}(X + Z)$$
Helps with gradient flow, stabilizes training.
Step 3 - Position-wise feed-forward network:
$$\text{FFN}(x) = \text{ReLU}(xW_1 + b_1)W_2 + b_2$$
Applied independently to each position.
Typical dimensions: $d_{model} = 512 \rightarrow d_{ff} = 2048 \rightarrow d_{model} = 512$
Step 4 - Residual connection + layer norm:
$$X'' = \text{LayerNorm}(X' + \text{FFN}(X'))$$
Output: passed to the next encoder layer (or to the decoder if this is the final layer)
• Residual connections: $X + F(X)$ instead of just $F(X)$ - enables training deep networks (100+ layers)
• Layer normalization: Normalizes across feature dimension - stabilizes training
• Feed-forward expansion: $d_{ff} = 4 \times d_{model}$ - provides capacity for complex transformations
• Position-wise FFN: Same network applied to each token independently - enables parallelization
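Putting the pieces together, a minimal post-norm encoder layer in PyTorch. This is a sketch of the equations above using the built-in nn.MultiheadAttention; the class name and sizes are illustrative:

```python
import torch
import torch.nn as nn

class EncoderLayer(nn.Module):
    """One post-norm encoder layer: self-attention + FFN, each with residual + LayerNorm."""
    def __init__(self, d_model=512, h=8, d_ff=2048):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, h, batch_first=True)
        self.ffn = nn.Sequential(                  # FFN(x) = ReLU(x W1 + b1) W2 + b2
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                          # x: (batch, n, d_model)
        z, _ = self.self_attn(x, x, x)             # Z = MultiHead(X, X, X)
        x = self.norm1(x + z)                      # X' = LayerNorm(X + Z)
        x = self.norm2(x + self.ffn(x))            # X'' = LayerNorm(X' + FFN(X'))
        return x

x = torch.randn(1, 3, 512)
print(EncoderLayer()(x).shape)                     # torch.Size([1, 3, 512])
```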
Transformer Variants
Model | Architecture | Attention Type | Use Case |
---|---|---|---|
BERT | Encoder-only | Bidirectional | Classification, NER, Q&A |
GPT-2/3/4 | Decoder-only | Causal (masked) | Text generation, completion |
T5 | Encoder-Decoder | Both + Cross | Translation, summarization |
BART | Encoder-Decoder | Both + Cross | Denoising, summarization |
LLaMA | Decoder-only | Causal | Efficient LLM, open-source |
Claude | Decoder-only (likely) | Causal | Conversational AI, safety |
Key Insights
Attention Is Enough:
The 2017 paper showed that recurrence and convolution are not necessary - attention (plus simple feed-forward layers) is enough and more efficient.
Parallelization is Key:
RNNs process sequentially (slow), Transformers process all tokens simultaneously (fast) → enabled training on massive datasets.
Scaling Laws:
Transformers scale incredibly well - performance improves predictably with more parameters, data, and compute.
Multi-Head = Multiple Perspectives:
Different heads learn different relationships (syntax, semantics, position) without being explicitly told.
Position Matters:
Without positional encoding, "dog bites man" = "man bites dog" to the model. Position info is critical.
Architecture Matters for Task:
• Encoder-only (BERT): Best for understanding tasks (classification, NER)
• Decoder-only (GPT): Best for generation tasks (completion, chat)
• Encoder-Decoder (T5): Best for seq2seq tasks (translation, summarization)
Cross-Attention is the Bridge:
In encoder-decoder models, cross-attention is how the decoder "looks at" the input - crucial for translation.
Masking Prevents Cheating:
Causal mask in decoder ensures model learns to generate, not just copy future tokens during training.