ℹ️ Definition Transformer Architectures are sequence-to-sequence models based on self-attention mechanisms that can generate text, code, and other sequential data by learning long-range dependencies without recurrence, forming the foundation of modern large language models like GPT-4 and Claude.
By the end of this lesson, you will:
- Understand self-attention and multi-head attention
- Implement Transformer blocks, positional encoding, and causal masking
- Build a small GPT-style model and sample from it
- Compare decoding strategies (greedy, temperature, top-k, top-p)
In Lessons 9-13, we explored generative models for images (VAEs, GANs, Diffusion). Now we'll learn Transformers: the architecture behind GPT-4, Claude, Gemini, and other modern LLMs!
Why Transformers revolutionized AI:
- Self-attention replaces recurrence, so training parallelizes across the whole sequence
- Long-range dependencies are captured directly instead of through a chain of hidden states
- The architecture scales to the very large models behind GPT-4, Claude, and Gemini
Real-world applications:
- Chat assistants and text generation
- Code completion (e.g., GitHub Copilot)
- Translation, summarization, and question answering
Idea: Process sequences step-by-step with hidden state
Architecture:
h_t = tanh(W_hh * h_{t-1} + W_xh * x_t)
y_t = W_hy * h_t
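To make the recurrence concrete, here is a minimal sketch using PyTorch's built-in nn.RNN (sizes are illustrative; it only computes the hidden states, not the W_hy output projection):
import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # tanh recurrence, as in the equations above
x = torch.randn(2, 5, 8)     # (batch=2, seq_len=5, input_size=8)
h0 = torch.zeros(1, 2, 16)   # initial hidden state: (num_layers, batch, hidden_size)
out, h_n = rnn(x, h0)        # out: (2, 5, 16) - the hidden state at every step, computed one step at a time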
Problems:
- Vanishing/exploding gradients over long sequences
- Hard to capture long-range dependencies
- Strictly sequential computation (no parallelism)
Improvements (LSTM): gates (forget, input, output) to control information flow
Still sequential: training can't be parallelized across time steps
Key insight: Use self-attention to process entire sequence in parallel!
Benefits:
- Training parallelizes across the entire sequence
- Every token can attend directly to every other token (long-range dependencies)
- Scales well to very large models and datasets
Goal: For each word, attend to all other words based on relevance
Example:
Sentence: "The cat sat on the mat because it was tired"
When processing "it":
- Attend strongly to "cat" (high relevance)
- Attend weakly to "mat" (low relevance)
- Attend very weakly to "the", "on" (filler words)
Analogy: a soft database lookup. Each query is compared against every key, and the result is a weighted mixture of the values rather than a single exact match (see the toy example below).
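To make the analogy concrete, here is a toy "soft lookup" in PyTorch (the keys, values, and query vectors are made up):
import torch

keys   = torch.tensor([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])  # one key per stored item
values = torch.tensor([[10.0], [20.0], [30.0]])              # one value per stored item
query  = torch.tensor([[1.0, 0.2]])                          # what we are looking for
weights = torch.softmax(query @ keys.T, dim=-1)   # similarity of the query to every key
result  = weights @ values                        # weighted mixture of values, not a single exact match
print(weights, result)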
Computation:
1. Create Q, K, V from input X:
Q = X * W_Q (Query matrix)
K = X * W_K (Key matrix)
V = X * W_V (Value matrix)
2. Compute attention scores:
Attention(Q, K, V) = softmax(Q * K^T / √d_k) * V
Where: Q, K, V are the query, key, and value matrices from step 1, d_k is the key dimension, and dividing by √d_k keeps the dot products from growing too large before the softmax.
Input sequence:
["The", "cat", "sat"]
Step 1: Embed words
X = torch.tensor([[0.1, 0.2, 0.3],   # "The"
                  [0.4, 0.5, 0.6],   # "cat"
                  [0.7, 0.8, 0.9]])  # "sat"
Step 2: Compute Q, K, V
d_model = 3
d_k = 3
W_Q = nn.Linear(d_model, d_k)
W_K = nn.Linear(d_model, d_k)
W_V = nn.Linear(d_model, d_k)
Q = W_Q(X) # (3, 3)
K = W_K(X) # (3, 3)
V = W_V(X) # (3, 3)
Step 3: Compute attention scores
scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
# scores[i, j] = relevance of word j to word i
attention_weights = torch.softmax(scores, dim=-1)
# Example attention_weights[1] (for "cat"):
# [0.1, 0.6, 0.3] → Attends most to itself, some to "sat"
Step 4: Apply attention to values
output = torch.matmul(attention_weights, V)
# output[i] = weighted sum of all words' values

import torch
import torch.nn as nn
import torch.nn.functional as F
import math
class SelfAttention(nn.Module):
    def __init__(self, d_model, d_k):
        super().__init__()
        self.d_k = d_k
        # Linear projections for Q, K, V
        self.W_Q = nn.Linear(d_model, d_k)
        self.W_K = nn.Linear(d_model, d_k)
        self.W_V = nn.Linear(d_model, d_k)

    def forward(self, X, mask=None):
        # X: (batch_size, seq_len, d_model)
        # Project to Q, K, V
        Q = self.W_Q(X)  # (batch, seq_len, d_k)
        K = self.W_K(X)  # (batch, seq_len, d_k)
        V = self.W_V(X)  # (batch, seq_len, d_k)
        # Compute attention scores
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        # (batch, seq_len, seq_len)
        # Apply mask (for decoder)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        # Softmax
        attention_weights = torch.softmax(scores, dim=-1)
        # Apply attention to values
        output = torch.matmul(attention_weights, V)
        # (batch, seq_len, d_k)
        return output, attention_weights
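A quick shape check for the module above (batch size, sequence length, and dimensions are just illustrative):
attn = SelfAttention(d_model=16, d_k=16)
X = torch.randn(2, 5, 16)       # (batch=2, seq_len=5, d_model=16)
out, weights = attn(X)
print(out.shape)                # torch.Size([2, 5, 16])
print(weights.shape)            # torch.Size([2, 5, 5]) - one attention row per query token
print(weights[0].sum(dim=-1))   # each row sums to 1 after the softmax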
Problem: Single attention captures one type of relationship
Solution: Use multiple attention heads in parallel!
Multi-head attention:
head_1 = Attention(Q_1, K_1, V_1) # Captures syntax
head_2 = Attention(Q_2, K_2, V_2) # Captures semantics
...
head_h = Attention(Q_h, K_h, V_h) # Captures context
output = Concat(head_1, ..., head_h) * W_O
Implementation:
class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.d_model = d_model
        self.num_heads = num_heads
        self.d_k = d_model // num_heads
        # Linear layers for all heads
        self.W_Q = nn.Linear(d_model, d_model)
        self.W_K = nn.Linear(d_model, d_model)
        self.W_V = nn.Linear(d_model, d_model)
        self.W_O = nn.Linear(d_model, d_model)

    def forward(self, X, mask=None):
        batch_size, seq_len, d_model = X.size()
        # Linear projections and split into heads
        Q = self.W_Q(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_K(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_V(X).view(batch_size, seq_len, self.num_heads, self.d_k).transpose(1, 2)
        # (batch, num_heads, seq_len, d_k)
        # Compute attention for all heads in parallel
        scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            scores = scores.masked_fill(mask == 0, -1e9)
        attention_weights = torch.softmax(scores, dim=-1)
        attention_output = torch.matmul(attention_weights, V)
        # (batch, num_heads, seq_len, d_k)
        # Concatenate heads
        attention_output = attention_output.transpose(1, 2).contiguous()
        attention_output = attention_output.view(batch_size, seq_len, d_model)
        # Final linear projection
        output = self.W_O(attention_output)
        return output, attention_weights
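And the corresponding shape check (dimensions are illustrative):
mha = MultiHeadAttention(d_model=64, num_heads=8)
X = torch.randn(2, 10, 64)      # (batch, seq_len, d_model)
out, weights = mha(X)
print(out.shape)                # torch.Size([2, 10, 64]) - same shape as the input
print(weights.shape)            # torch.Size([2, 8, 10, 10]) - one attention map per head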
Components: multi-head self-attention, a position-wise feed-forward network, residual connections around both sub-layers, layer normalization, and dropout.
class TransformerBlock(nn.Module):
    def __init__(self, d_model, num_heads, d_ff, dropout=0.1):
        super().__init__()
        # Multi-head attention
        self.attention = MultiHeadAttention(d_model, num_heads)
        # Feed-forward network
        self.ffn = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        # Layer normalization
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Dropout
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, mask=None):
        # Self-attention with residual connection
        attn_output, _ = self.attention(x, mask)
        x = self.norm1(x + self.dropout(attn_output))
        # Feed-forward with residual connection
        ffn_output = self.ffn(x)
        x = self.norm2(x + self.dropout(ffn_output))
        return x
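A quick check that the block preserves the input shape, which is what lets us stack many blocks:
block = TransformerBlock(d_model=64, num_heads=8, d_ff=256)
X = torch.randn(2, 10, 64)
print(block(X).shape)           # torch.Size([2, 10, 64])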
Problem: Self-attention has no notion of position!
Example:
"The cat sat on the mat" = "mat the on sat cat The" (same attention!)
Solution: Add positional information to embeddings
Sinusoidal positional encoding:
PE(pos, 2i) = sin(pos / 10000^(2i/d_model))
PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
Implementation:
class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super().__init__()
        # Create positional encoding matrix
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len).unsqueeze(1).float()
        div_term = torch.exp(torch.arange(0, d_model, 2).float() *
                             -(math.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)  # (1, max_len, d_model)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        return x + self.pe[:, :x.size(1)]
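A small sanity check on the encoding buffer (sizes are illustrative):
pos_enc = PositionalEncoding(d_model=64)
X = torch.randn(2, 10, 64)
print(pos_enc(X).shape)         # torch.Size([2, 10, 64]) - encodings are simply added to the embeddings
print(pos_enc.pe[0, 0, :4])     # position 0: tensor([0., 1., 0., 1.]) since sin(0)=0 and cos(0)=1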
For autoregressive generation: Token i can only attend to tokens <= i
Mask matrix:
Mask = [[1, 0, 0, 0],  # Token 1 can only see itself
        [1, 1, 0, 0],  # Token 2 can see tokens 1-2
        [1, 1, 1, 0],  # Token 3 can see tokens 1-3
        [1, 1, 1, 1]]  # Token 4 can see all
Implementation:
def create_causal_mask(seq_len):
    mask = torch.tril(torch.ones(seq_len, seq_len))
    return mask.unsqueeze(0).unsqueeze(0)  # (1, 1, seq_len, seq_len)
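Printing a small mask reproduces the matrix above:
print(create_causal_mask(4)[0, 0])
# tensor([[1., 0., 0., 0.],
#         [1., 1., 0., 0.],
#         [1., 1., 1., 0.],
#         [1., 1., 1., 1.]])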

Architecture:
Input tokens
↓
Token Embedding + Positional Encoding
↓
Transformer Block 1 (Masked Self-Attention)
↓
Transformer Block 2
↓
...
↓
Transformer Block N
↓
Linear + Softmax (predict next token)
Key features: a decoder-only stack of Transformer blocks, causal masking so each position attends only to earlier positions, and a final linear layer over the vocabulary that predicts the next token.
class GPT(nn.Module):
    def __init__(self, vocab_size, d_model=512, num_heads=8, num_layers=6,
                 d_ff=2048, max_len=512, dropout=0.1):
        super().__init__()
        self.max_len = max_len  # stored so generate() can crop the context
        # Token embedding
        self.token_embedding = nn.Embedding(vocab_size, d_model)
        self.positional_encoding = PositionalEncoding(d_model, max_len)
        # Transformer blocks
        self.blocks = nn.ModuleList([
            TransformerBlock(d_model, num_heads, d_ff, dropout)
            for _ in range(num_layers)
        ])
        # Output layer
        self.ln_f = nn.LayerNorm(d_model)
        self.head = nn.Linear(d_model, vocab_size, bias=False)
        # Initialize weights
        self.apply(self._init_weights)

    def _init_weights(self, module):
        if isinstance(module, nn.Linear):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    def forward(self, idx, targets=None):
        # idx: (batch, seq_len)
        batch_size, seq_len = idx.size()
        # Token + positional embeddings
        token_emb = self.token_embedding(idx)  # (batch, seq_len, d_model)
        x = self.positional_encoding(token_emb)
        # Create causal mask
        mask = create_causal_mask(seq_len).to(idx.device)
        # Apply transformer blocks
        for block in self.blocks:
            x = block(x, mask)
        # Final layer norm and projection
        x = self.ln_f(x)
        logits = self.head(x)  # (batch, seq_len, vocab_size)
        # Compute loss if targets provided
        loss = None
        if targets is not None:
            loss = F.cross_entropy(
                logits.view(-1, logits.size(-1)),
                targets.view(-1)
            )
        return logits, loss

    def generate(self, idx, max_new_tokens, temperature=1.0, top_k=None):
        # idx: (batch, seq_len) - context
        for _ in range(max_new_tokens):
            # Crop context if too long
            idx_cond = idx if idx.size(1) <= self.max_len else idx[:, -self.max_len:]
            # Forward pass
            logits, _ = self(idx_cond)
            logits = logits[:, -1, :] / temperature  # Last token, scale by temperature
            # Optional: top-k sampling
            if top_k is not None:
                v, _ = torch.topk(logits, min(top_k, logits.size(-1)))
                logits[logits < v[:, [-1]]] = -float('Inf')
            # Sample from distribution
            probs = F.softmax(logits, dim=-1)
            idx_next = torch.multinomial(probs, num_samples=1)
            # Append to sequence
            idx = torch.cat([idx, idx_next], dim=1)
        return idx
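A minimal end-to-end smoke test of the model above (vocabulary size and shapes are illustrative; in real training the targets are the inputs shifted one position left):
model = GPT(vocab_size=1000, d_model=128, num_heads=4, num_layers=2, d_ff=512, max_len=64)
idx = torch.randint(0, 1000, (2, 8))     # (batch=2, seq_len=8) random token ids
logits, loss = model(idx, targets=idx)   # loss is next-token cross-entropy
print(logits.shape, loss.item())         # torch.Size([2, 8, 1000]) and a loss near ln(1000) ≈ 6.9 at init
generated = model.generate(idx, max_new_tokens=5, temperature=0.8, top_k=50)
print(generated.shape)                   # torch.Size([2, 13])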
Method: Always pick most probable token
logits, _ = model(context)
next_token = torch.argmax(logits[:, -1, :], dim=-1)
Pros: deterministic, simple. Cons: repetitive, low-diversity output.
Method: Scale logits before softmax
temperature = 0.8 # Lower = more confident, Higher = more random
scaled_logits = logits / temperature
probs = F.softmax(scaled_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
Temperature effects: temperature < 1 sharpens the distribution (more confident, less diverse), temperature = 1 uses the raw probabilities, temperature > 1 flattens it (more random).
Method: Sample from top k most likely tokens
k = 50
top_k_logits, top_k_indices = torch.topk(logits, k)
probs = F.softmax(top_k_logits, dim=-1)
next_token_idx = torch.multinomial(probs, num_samples=1)
next_token = top_k_indices.gather(-1, next_token_idx)
Benefit: Prevents sampling very unlikely tokens
Method: Sample from smallest set of tokens with cumulative probability >= p
p = 0.9 # Nucleus probability
sorted_logits, sorted_indices = torch.sort(logits, descending=True)
cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
# Remove tokens with cumulative probability > p
sorted_indices_to_remove = cumulative_probs > p
sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
sorted_indices_to_remove[..., 0] = 0
# Sample
indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
logits[indices_to_remove] = -float('Inf')
probs = F.softmax(logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
Benefit: Adaptive - adjusts to distribution
| Method | Quality | Diversity | Consistency |
|---|---|---|---|
| Greedy | ⚠️ Repetitive | ❌ Low | ✅ High |
| Temperature | ✅ Good | ⚠️ Medium | ⚠️ Medium |
| Top-k | ✅ Good | ✅ High | ⚠️ Medium |
| Top-p (Nucleus) | ✅ Excellent | ✅ High | ✅ Good |
Recommendation: Use top-p (nucleus) with p=0.9 and temperature=0.8
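The snippets above can be combined into a single helper; here is a minimal sketch (the name sample_next_token and the defaults are just illustrative):
def sample_next_token(logits, temperature=0.8, top_p=0.9):
    # logits: (batch, vocab_size) for the last position
    logits = logits / temperature
    sorted_logits, sorted_indices = torch.sort(logits, descending=True)
    cumulative_probs = torch.cumsum(F.softmax(sorted_logits, dim=-1), dim=-1)
    # Drop tokens outside the nucleus, always keeping the most likely token
    sorted_indices_to_remove = cumulative_probs > top_p
    sorted_indices_to_remove[..., 1:] = sorted_indices_to_remove[..., :-1].clone()
    sorted_indices_to_remove[..., 0] = False
    indices_to_remove = sorted_indices_to_remove.scatter(1, sorted_indices, sorted_indices_to_remove)
    logits = logits.masked_fill(indices_to_remove, -float('Inf'))
    probs = F.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)  # (batch, 1)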
Objective: Predict next token given context
# Training data: "The cat sat on the mat"
# Create input-target pairs:
# Input: "The cat sat on the" → Target: "mat"
# Input: "The cat sat on" → Target: "the"
# ...
for epoch in range(num_epochs):
    for batch in train_loader:
        input_ids = batch['input_ids']  # (batch, seq_len)
        targets = batch['labels']       # (batch, seq_len)
        # Forward pass
        logits, loss = model(input_ids, targets)
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        # Gradient clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
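The loop assumes an optimizer has already been created; one common choice is AdamW, optionally with a learning-rate schedule (the values below are illustrative, not prescribed by the lesson):
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=0.1)
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=num_epochs)  # call scheduler.step() once per epoch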
Use case: Adapt pre-trained model to specific task
Example: Fine-tune GPT-2 on custom corpus
from transformers import GPT2LMHeadModel, GPT2Tokenizer, Trainer, TrainingArguments
# Load pre-trained GPT-2
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# Prepare dataset
train_dataset = ...  # Your custom data
# Training arguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=4,
    save_steps=1000,
    save_total_limit=2,
    learning_rate=5e-5,
)
# Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=train_dataset,
)
# Train
trainer.train()
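One way to prepare train_dataset is with the datasets library and a causal-LM data collator; this is a sketch assuming your corpus is a plain Python list of strings, not the only option:
from datasets import Dataset
from transformers import DataCollatorForLanguageModeling

tokenizer.pad_token = tokenizer.eos_token   # GPT-2 has no pad token by default
texts = ["first training document ...", "second training document ..."]  # hypothetical corpus

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

train_dataset = Dataset.from_dict({"text": texts}).map(tokenize, batched=True, remove_columns=["text"])
# mlm=False makes the collator copy input_ids into labels for next-token prediction;
# pass data_collator=data_collator to the Trainer above
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)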
Use case: Autocomplete, code generation
prompt = "Once upon a time, in a land far away,"
generated = model.generate(prompt, max_length=100)
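If you are using the Hugging Face GPT-2 model from the fine-tuning example, the concrete calls look roughly like this (sampling parameters are illustrative):
inputs = tokenizer(prompt, return_tensors='pt')
output_ids = model.generate(
    inputs['input_ids'],
    max_length=100,
    do_sample=True,
    top_p=0.9,
    temperature=0.8,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))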
Format: Context + Question -> Answer
prompt = """Context: Paris is the capital of France.
Question: What is the capital of France?
Answer:"""
answer = model.generate(prompt)
Example: GitHub Copilot
prompt = """# Function to compute fibonacci numbers
def fibonacci(n):
\"\"\"Compute the nth fibonacci number\"\"\"
"""
code = model.generate(prompt)
Format: English: ... French:
prompt = "English: Hello, how are you? French:"
translation = model.generate(prompt)
Format: Article: ... Summary:
prompt = f"""Article: {long_article}
Summary:"""
summary = model.generate(prompt, max_length=100)
Architecture: Encoder blocks with bidirectional attention
Use case: Classification, NER, sentiment analysis
Pre-training: Masked language modeling (predict masked tokens)
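A quick way to see masked language modeling in action (a sketch using the transformers fill-mask pipeline; the model choice is illustrative):
from transformers import pipeline

fill_mask = pipeline('fill-mask', model='bert-base-uncased')
print(fill_mask("Paris is the capital of [MASK]."))  # top predictions for the masked token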
Architecture: Full encoder-decoder Transformer
Use case: Translation, summarization, QA
Pre-training: Span corruption (predict masked spans)
Architecture: Decoder blocks with causal masking
Use case: Text generation, completion
Pre-training: Next token prediction
| Model | Architecture | Attention | Use Case |
|---|---|---|---|
| BERT | Encoder | Bidirectional | Understanding |
| T5 | Encoder-Decoder | Bi + Causal | Seq2Seq |
| GPT | Decoder | Causal | Generation |