ℹ️ Definition Transformer Architectures are sequence-to-sequence models based on self-attention mechanisms that can generate text, code, and other sequential data by learning long-range dependencies without recurrence, forming the foundation of modern large language models like GPT-4 and Claude.
By the end of this lesson, you will:
- Understand self-attention and multi-head attention
- Implement Transformer blocks, positional encoding, and causal masking
- Build a small GPT-style model and sample from it
- Compare decoding strategies (greedy, temperature, top-k, top-p)
In Lessons 9-13, we explored generative models for images (VAEs, GANs, Diffusion). Now we'll learn Transformers: the architecture behind GPT-4, Claude, Gemini, and other modern LLMs!
Why Transformers revolutionized AI:
- Self-attention replaces recurrence, so training parallelizes across the whole sequence
- Long-range dependencies are captured directly instead of through a chain of hidden states
- The architecture scales to the very large models behind GPT-4, Claude, and Gemini
Real-world applications:
- Chat assistants and text generation
- Code completion (e.g., GitHub Copilot)
- Translation, summarization, and question answering
Idea: Process sequences step-by-step with hidden state
Architecture:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t)
y_t = W_hy * h_t
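To make the recurrence concrete, here is a minimal sketch using PyTorch's built-in nn.RNN (sizes are illustrative; it only computes the hidden states, not the W_hy output projection):
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # tanh recurrence, as in the equations above
x = torch.randn(2, 5, 8)     # (batch=2, seq_len=5, input_size=8)
h0 = torch.zeros(1, 2, 16)   # initial hidden state: (num_layers, batch, hidden_size)
out, h_n = rnn(x, h0)        # out: (2, 5, 16) - the hidden state at every step, computed one step at a time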
Problems:
- Vanishing/exploding gradients over long sequences
- Hard to capture long-range dependencies
- Strictly sequential computation (no parallelism)
Improvements (LSTM): gates (forget, input, output) to control information flow
Still sequential: training can't be parallelized across time steps
Key insight: Use self-attention to process entire sequence in parallel!
Benefits:
- Training parallelizes across the entire sequence
- Every token can attend directly to every other token (long-range dependencies)
- Scales well to very large models and datasets
Goal: For each word, attend to all other words based on relevance
Example:
Sentence: "The cat sat on the mat because it was tired"
When processing "it":
- Attend strongly to "cat" (high relevance)
- Attend weakly to "mat" (low relevance)
- Attend very weakly to "the", "on" (filler words)
Analogy: a soft database lookup. Each query is compared against every key, and the result is a weighted mixture of the values rather than a single exact match (see the toy example below).
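To make the analogy concrete, here is a toy "soft lookup" in PyTorch (the keys, values, and query vectors are made up):
import torch

keys   = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one key per stored item
values = torch.tensor([[10.0], [20.0], [30.0]])              # one value per stored item
query  = torch.tensor([[1.0, 0.2]])                          # what we are looking for
weights = torch.softmax(query @ keys.T, dim=-1)   # similarity of the query to every key
result  = weights @ values                        # weighted mixture of values, not a single exact match
print(weights, result)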
Computation:
1. Create Q, K, V from input X:
Q = X * W_Q (Query matrix)
K = X * W_K (Key matrix)
V = X * W_V (Value matrix)
2. Compute attention scores:
Attention(Q, K, V) = softmax(Q * K^T / √d_k) * V
Where: Q, K, V are the query, key, and value matrices from step 1, d_k is the key dimension, and dividing by √d_k keeps the dot products from growing too large before the softmax.
Input sequence:
["The", "cat", "sat"]
Step 1: Embed words
X = torch.tensor([[0.1, 0.2, 0.3],   # "The"
                  [0.4, 0.5, 0.6],   # "cat"
                  [0.7, 0.8, 0.9]])  # "sat"
Step 2: Compute Q, K, V
d_model = 3
d_k = 3
W_Q = nn.Linear(d_model, d_k)
W_K = nn.Linear(d_model, d_k)
W_V = nn.Linear(d_model, d_k)
Q = W_Q(X) # (3, 3)
K = W_K(X) # (3, 3)
V = W_V(X) # (3, 3)
Step 3: Compute attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
# scores[i, j] = relevance of word j to word i
attention_weights = torch.softmax(scores, dim=-1)
# Example attention_weights[1] (for "cat"):
# [0.1, 0.6, 0.3] → Attends most to itself, some to "sat"
Step 4: Apply attention to values
output = torch.matmul(attention_weights, V)
# output[i] = weighted sum of all words' values

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.d_k = d_k
        # Linear projections for Q, K, V
        self.W_Q = nn.Linear(d_model, d_k)
        self.W_K = nn.Linear(d_model, d_k)
        self.W_V = nn.Linear(d_model, d_k)

    def forward(self, X, mask=None):
        # X: (batch_size, seq_len, d_model)
        # Project to Q, K, V
        Q = self.W_Q(X)  # (batch, seq_len, d_k)
        K = self.W_K(X)  # (batch, seq_len, d_k)
        V = self.W_V(X)  # (batch, seq_len, d_k)
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # (batch, seq_len, seq_len)
        # Apply mask (for decoder)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # Softmax
        attention_weights = torch.softmax(scores, dim=-1)
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        # (batch, seq_len, d_k)
        return output, attention_weights
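A quick shape check for the module above (batch size, sequence length, and dimensions are just illustrative):
attn = SelfAttention(d_model=16, d_k=16)
X = torch.randn(2, 5, 16)       # (batch=2, seq_len=5, d_model=16)
out, weights = attn(X)
print(out.shape)                # torch.Size([2, 5, 16])
print(weights.shape)            # torch.Size([2, 5, 5]) - one attention row per query token
print(weights[0].sum(dim=-1))   # each row sums to 1 after the softmax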
Problem: Single attention captures one type of relationship
Solution: Use multiple attention heads in parallel!
Multi-head attention:
head_1 = Attention(Q_1, K_1, V_1) # Captures syntax
head_2 = Attention(Q_2, K_2, V_2) # Captures semantics
...
head_h = Attention(Q_h, K_h, V_h) # Captures context
output = Concat(head_1, ..., head_h) * W_O
Implementation:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear layers for all heads
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, X, mask=None):
        batch_size, seq_len, d_model = X.size()
        # Linear projections and split into heads
        Q = self.W_Q(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # (batch, num_heads, seq_len, d_k)
        # Compute attention for all heads in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)
        # (batch, num_heads, seq_len, d_k)
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, seq_len, d_model)
        # Final linear projection
        output = self.W_O(attention_output)
        return output, attention_weights
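And the corresponding shape check (dimensions are illustrative):
mha = MultiHeadAttention(d_model=64, num_heads=8)
X = torch.randn(2, 10, 64)      # (batch, seq_len, d_model)
out, weights = mha(X)
print(out.shape)                # torch.Size([2, 10, 64]) - same shape as the input
print(weights.shape)            # torch.Size([2, 8, 10, 10]) - one attention map per head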
Components: multi-head self-attention, a position-wise feed-forward network, residual connections around both sub-layers, layer normalization, and dropout.
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
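A quick check that the block preserves the input shape, which is what lets us stack many blocks:
block = TransformerBlock(d_model=64, num_heads=8, d_ff=256)
X = torch.randn(2, 10, 64)
print(block(X).shape)           # torch.Size([2, 10, 64])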
Problem: Self-attention has no notion of position!
Example:
"The cat sat on the mat" = "mat the on sat cat The" (same attention!)
Solution: Add positional information to embeddings
Sinusoidal positional encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Implementation:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
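A small sanity check on the encoding buffer (sizes are illustrative):
pos_enc = PositionalEncoding(d_model=64)
X = torch.randn(2, 10, 64)
print(pos_enc(X).shape)         # torch.Size([2, 10, 64]) - encodings are simply added to the embeddings
print(pos_enc.pe[0, 0, :4])     # position 0: tensor([0., 1., 0., 1.]) since sin(0)=0 and cos(0)=1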
For autoregressive generation: Token i can only attend to tokens <= i
Mask matrix:
Mask = [[1, 0, 0, 0],  # Token 1 can only see itself
        [1, 1, 0, 0],  # Token 2 can see tokens 1-2
        [1, 1, 1, 0],  # Token 3 can see tokens 1-3
        [1, 1, 1, 1]]  # Token 4 can see all
Implementation:
def create_causal_mask(seq_len):
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)
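Printing a small mask reproduces the matrix above:
print(create_causal_mask(4)[0, 0])
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])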

Architecture:
Input tokens
↓
Token Embedding + Positional Encoding
↓
Transformer Block 1 (Masked Self-Attention)
↓
Transformer Block 2
↓
...
↓
Transformer Block N
↓
Linear + Softmax (predict next token)
Key features: a decoder-only stack of Transformer blocks, causal masking so each position attends only to earlier positions, and a final linear layer over the vocabulary that predicts the next token.
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6,
                 d_ff=2048, max_len=512, dropout=0.1):
        super().__init__()
        self.max_len = max_len  # stored so generate() can crop the context
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Output layer
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        # idx: (batch, seq_len)
        batch_size, seq_len = idx.size()
        # Token + positional embeddings
        token_emb = self.token_embedding(idx)  # (batch, seq_len, d_model)
        x = self.positional_encoding(token_emb)
        # Create causal mask
        mask = create_causal_mask(seq_len).to(idx.device)
        # Apply transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final layer norm and projection
        x = self.ln_f(x)
        logits = self.head(x)  # (batch, seq_len, vocab_size)
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        # idx: (batch, seq_len) - context
        for _ in range(max_new_tokens):
            # Crop context if too long
            idx_cond = idx if idx.size(1) <= self.max_len else idx[:, -self.max_len:]
            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Last token, scale by temperature
            # Optional: top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # Sample from distribution
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append to sequence
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
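A minimal end-to-end smoke test of the model above (vocabulary size and shapes are illustrative; in real training the targets are the inputs shifted one position left):
model = GPT(vocab_size=1000, d_model=128, num_heads=4, num_layers=2, d_ff=512, max_len=64)
idx = torch.randint(0, 1000, (2, 8))     # (batch=2, seq_len=8) random token ids
logits, loss = model(idx, targets=idx)   # loss is next-token cross-entropy
print(logits.shape, loss.item())         # torch.Size([2, 8, 1000]) and a loss near ln(1000) ≈ 6.9 at init
generated = model.generate(idx, max_new_tokens=5, temperature=0.8, top_k=50)
print(generated.shape)                   # torch.Size([2, 13])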
Method: Always pick most probable token
logits, _ = model(context)
next_token = torch.argmax(logits[:, -1, :], dim=-1)
Pros: deterministic, simple. Cons: repetitive, low-diversity output.
Method: Scale logits before softmax
temperature = 0.8 # Lower = more confident, Higher = more random
scaled_logits = logits / temperature
probs = F.softmax(scaled_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
Temperature effects: temperature < 1 sharpens the distribution (more confident, less diverse), temperature = 1 uses the raw probabilities, temperature > 1 flattens it (more random).
Method: Sample from top k most likely tokens
k = 50
top_k_logits, top_k_indices = torch.topk(logits, k)
probs = F.softmax(top_k_logits, dim=-1)
next_token_idx = torch.multinomial(probs, num_samples=1)
next_token = top_k_indices.gather(-1, next_token_idx)
Benefit: Prevents sampling very unlikely tokens
Method: Sample from smallest set of tokens with cumulative probability >= p
p = 0.9 # Nucleus probability
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability > p
sorted_indices_to_remove = cumulative_probs > p
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
# Sample
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = -float('Inf')
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
Benefit: Adaptive - adjusts to distribution
| Method | Quality | Diversity | Consistency |
|---|---|---|---|
| Greedy | ⚠️ Repetitive | ❌ Low | ✅ High |
| Temperature | ✅ Good | ⚠️ Medium | ⚠️ Medium |
| Top-k | ✅ Good | ✅ High | ⚠️ Medium |
| Top-p (Nucleus) | ✅ Excellent | ✅ High | ✅ Good |
Recommendation: Use top-p (nucleus) with p=0.9 and temperature=0.8
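The snippets above can be combined into a single helper; here is a minimal sketch (the name sample_next_token and the defaults are just illustrative):
def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # logits: (batch, vocab_size) for the last position
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Drop tokens outside the nucleus, always keeping the most likely token
    sorted_indices_to_remove = cumulative_probs > top_p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
    logits = logits.masked_fill(indices_to_remove, -float('Inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)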
Objective: Predict next token given context
# Training data: "The cat sat on the mat"
# Create input-target pairs:
# Input: "The cat sat on the" → Target: "mat"
# Input: "The cat sat on" → Target: "the"
# ...
for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch['input_ids']  # (batch, seq_len)
        targets = batch['labels']       # (batch, seq_len)
        # Forward pass
        logits, loss = model(input_ids, targets)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
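The loop assumes an optimizer has already been created; one common choice is AdamW, optionally with a learning-rate schedule (the values below are illustrative, not prescribed by the lesson):
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)  # call scheduler.step() once per epoch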
Use case: Adapt pre-trained model to specific task
Example: Fine-tune GPT-2 on custom corpus
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
# Load pre-trained GPT-2
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Prepare dataset
train_dataset = ...  # Your custom data
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    learning_rate=5e-5,
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
# Train
trainer.train()
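One way to prepare train_dataset is with the datasets library and a causal-LM data collator; this is a sketch assuming your corpus is a plain Python list of strings, not the only option:
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
texts = ["first training document ...", "second training document ..."]  # hypothetical corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
# mlm=False makes the collator copy input_ids into labels for next-token prediction;
# pass data_collator=data_collator to the Trainer above
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)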
Use case: Autocomplete, code generation
prompt = "Once upon a time, in a land far away,"
generated = model.generate(prompt, max_length=100)
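If you are using the Hugging Face GPT-2 model from the fine-tuning example, the concrete calls look roughly like this (sampling parameters are illustrative):
inputs = tokenizer(prompt, return_tensors='pt')
output_ids = model.generate(
    inputs['input_ids'],
    max_length=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))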
Format: Context + Question -> Answer
prompt = """Context: Paris is the capital of France.
Question: What is the capital of France?
Answer:"""
answer = model.generate(prompt)
Example: GitHub Copilot
prompt = """# Function to compute fibonacci numbers
def fibonacci(n):
\"\"\"Compute the nth fibonacci number\"\"\"
"""
code = model.generate(prompt)
Format: English: ... French:
prompt = "English: Hello, how are you? French:"
translation = model.generate(prompt)
Format: Article: ... Summary:
prompt = f"""Article: {long_article}
Summary:"""
summary = model.generate(prompt, max_length=100)
Architecture: Encoder blocks with bidirectional attention
Use case: Classification, NER, sentiment analysis
Pre-training: Masked language modeling (predict masked tokens)
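A quick way to see masked language modeling in action (a sketch using the transformers fill-mask pipeline; the model choice is illustrative):
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask("Paris is the capital of [MASK]."))  # top predictions for the masked token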
Architecture: Full encoder-decoder Transformer
Use case: Translation, summarization, QA
Pre-training: Span corruption (predict masked spans)
Architecture: Decoder blocks with causal masking
Use case: Text generation, completion
Pre-training: Next token prediction
| Model | Architecture | Attention | Use Case |
|---|---|---|---|
| BERT | Encoder | Bidirectional | Understanding |
| T5 | Encoder-Decoder | Bi + Causal | Seq2Seq |
| GPT | Decoder | Causal | Generation |