ℹ️ Definition: RLHF is a machine learning technique that combines reinforcement learning with human preferences to align AI models with human values and intentions. It's the key innovation that transformed large language models from text predictors into helpful, harmless, and honest assistants.
By the end of this lesson, you will be able to:
This is the moment everything comes together.
For the past 15 lessons, we've been building two parallel foundations:
The Big Question: What if we could use RL to teach generative models to follow human preferences?
The Answer: Reinforcement Learning from Human Feedback (RLHF).
RLHF is the breakthrough that enabled:
Before RLHF: Large language models trained with standard next-token prediction could:
Example (Pre-RLHF GPT-3):
User: "Write a Python function to sort a list"
Model: "The history of Python dates back to the late 1980s when Guido van Rossum..."
After RLHF (ChatGPT):
User: "Write a Python function to sort a list"
Model: "Here's a Python function to sort a list:
def sort_list(items):
    return sorted(items)

Usage example:

numbers = [3, 1, 4, 1, 5]
sorted_numbers = sort_list(numbers)
print(sorted_numbers)  # [1, 1, 3, 4, 5]
From Lesson 6 (PPO): We learned Proximal Policy Optimization - a stable RL algorithm that optimizes policies while preventing catastrophic updates. In RLHF, PPO is the engine that fine-tunes the language model.
From Lesson 15 (LLMs): We studied large language models - how transformers generate text token by token. In RLHF, the LLM becomes the policy that RL optimizes.
The Magic Formula:
RLHF = LLM (as policy) + Human Preferences (as reward) + PPO (as optimizer)
RLHF transforms a base language model into an aligned assistant through three stages:
Stage 1: Supervised Fine-Tuning (SFT)
Base Model → [Instruction Dataset] → SFT Model
"Learn to follow instructions"
Stage 2: Reward Model Training
SFT Model → [Comparison Dataset] → Reward Model
"Learn what humans prefer"
Stage 3: PPO Fine-Tuning
SFT Model → [Reward Model + PPO] → RLHF Model
"Optimize for human preferences"

Goal: Teach the base model to follow instructions and respond helpfully.
Each example contains:
Example:
{
"prompt": "Explain photosynthesis to a 10-year-old.",
"response": "Photosynthesis is how plants make their own food! They use sunlight, water from the soil, and carbon dioxide from the air. The sunlight gives them energy to turn these ingredients into sugar (food for the plant) and oxygen (which we breathe). The green color in leaves, called chlorophyll, helps capture the sunlight. It's like a tiny solar panel!"
}
import torch
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import load_dataset

# Load base model (e.g., GPT-2 or Llama 2)
model_name = "gpt2"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token

# Load instruction dataset (e.g., Alpaca, Dolly, OpenAssistant)
dataset = load_dataset("tatsu-lab/alpaca", split="train[:5000]")

# Format dataset: concatenate instruction + response
def format_instruction(example):
    text = f"### Instruction: {example['instruction']}\n### Response: {example['output']}"
    return {"text": text}

dataset = dataset.map(format_instruction)

# Tokenize
def tokenize(example):
    return tokenizer(example["text"], truncation=True, padding="max_length", max_length=512)

dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# Causal LM collator copies input_ids into labels so the Trainer can compute the loss
data_collator = DataCollatorForLanguageModeling(tokenizer, mlm=False)

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=100,
    save_steps=1000,
    fp16=True,  # Mixed precision training
)

# Train
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=data_collator,
)
trainer.train()
trainer.save_model("./sft_model_final")
tokenizer.save_pretrained("./sft_model_final")  # Save tokenizer so later stages can load it from the same path
After SFT, the model can:
Why SFT Alone Isn't Enough: Human demonstrations are expensive to collect at scale. We need a way to learn preferences from cheaper comparison data.
Goal: Train a model that predicts which responses humans prefer.
In traditional RL (Lessons 1-8), rewards were simple:
For language: How do we score the "quality" of a text response?
Solution: Learn a reward model from human preference comparisons.
Instead of rating responses (hard and subjective), humans compare pairs:
Example:
Prompt: "Write a haiku about machine learning"
Response A (Chosen ✓):
"Algorithms learn,
Patterns emerge from the noise,
Intelligence grows."
Response B (Rejected ✗):
"Machine learning is a subset of artificial intelligence that uses statistical techniques to enable computers to learn from data."
Humans simply choose which response is better. This is easier and more consistent than rating.
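Concretely, each comparison can be stored as a small record like the sketch below. The field names here are illustrative; real datasets such as HH-RLHF define their own schema.

```python
# One preference comparison as a plain Python dict (illustrative field names)
comparison = {
    "prompt": "Write a haiku about machine learning",
    "chosen": "Algorithms learn,\nPatterns emerge from the noise,\nIntelligence grows.",
    "rejected": "Machine learning is a subset of artificial intelligence that uses "
                "statistical techniques to enable computers to learn from data.",
}

# A preference dataset is simply a list of such records
preference_dataset = [comparison]
```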
The reward model learns to predict preference probabilities:
Formula:
P(A > B) = σ(r(A) - r(B))
Where:
- r(A), r(B) = Scalar reward scores from the reward model
- σ = Sigmoid function
- P(A > B) = Probability that A is preferred over B

Training Objective (Binary Cross-Entropy):
Loss = -log(σ(r_chosen - r_rejected))
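To make the formula concrete, here is a tiny worked example in plain PyTorch, with arbitrary reward values chosen for illustration:

```python
import torch

# Illustrative reward scores (arbitrary values, not from a trained model)
r_chosen = torch.tensor(1.8)    # r(A): score of the preferred response
r_rejected = torch.tensor(0.3)  # r(B): score of the rejected response

# P(A > B) = sigmoid(r(A) - r(B))
p_a_beats_b = torch.sigmoid(r_chosen - r_rejected)

# Bradley-Terry / binary cross-entropy loss: -log(sigmoid(r_chosen - r_rejected))
loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected)

print(f"P(A > B) = {p_a_beats_b.item():.3f}")  # ≈ 0.818
print(f"Loss     = {loss.item():.3f}")         # ≈ 0.201
```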

import torch
import torch.nn as nn
from transformers import AutoModel

class RewardModel(nn.Module):
    def __init__(self, base_model_name):
        super().__init__()
        # Use SFT model as backbone
        self.backbone = AutoModel.from_pretrained(base_model_name)
        hidden_size = self.backbone.config.hidden_size
        # Reward head: map hidden states to scalar reward
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        # Get hidden states from backbone
        outputs = self.backbone(input_ids=input_ids, attention_mask=attention_mask)
        hidden = outputs.last_hidden_state
        # Take the hidden state of the last non-padding token (end of response),
        # not simply position -1, which would be a pad token for padded sequences
        last_idx = attention_mask.sum(dim=1) - 1
        last_hidden = hidden[torch.arange(hidden.size(0), device=hidden.device), last_idx]
        # Predict scalar reward (one score per sequence)
        reward = self.reward_head(last_hidden).squeeze(-1)
        return reward

# Initialize reward model from SFT checkpoint
reward_model = RewardModel("./sft_model_final")
import torch
from transformers import AutoTokenizer, Trainer, TrainingArguments
from datasets import load_dataset

# Tokenizer saved alongside the SFT model in Stage 1
tokenizer = AutoTokenizer.from_pretrained("./sft_model_final")

# Load preference dataset (e.g., Anthropic HH-RLHF, OpenAssistant)
dataset = load_dataset("Anthropic/hh-rlhf", split="train[:10000]")

# Format: each example has 'chosen' and 'rejected' responses
def tokenize_pair(example):
    chosen = tokenizer(example["chosen"], truncation=True, max_length=512, padding="max_length")
    rejected = tokenizer(example["rejected"], truncation=True, max_length=512, padding="max_length")
    return {
        "chosen_input_ids": chosen["input_ids"],
        "chosen_attention_mask": chosen["attention_mask"],
        "rejected_input_ids": rejected["input_ids"],
        "rejected_attention_mask": rejected["attention_mask"],
    }

dataset = dataset.map(tokenize_pair, remove_columns=dataset.column_names)
dataset.set_format(type="torch")

# Subclass Trainer so the pairwise Bradley-Terry loss is used
class RewardTrainer(Trainer):
    def compute_loss(self, model, inputs, return_outputs=False, **kwargs):
        # Rewards for chosen responses
        r_chosen = model(
            input_ids=inputs["chosen_input_ids"],
            attention_mask=inputs["chosen_attention_mask"],
        )
        # Rewards for rejected responses
        r_rejected = model(
            input_ids=inputs["rejected_input_ids"],
            attention_mask=inputs["rejected_attention_mask"],
        )
        # Bradley-Terry loss: maximize log(sigmoid(r_chosen - r_rejected))
        loss = -torch.nn.functional.logsigmoid(r_chosen - r_rejected).mean()
        return (loss, (r_chosen, r_rejected)) if return_outputs else loss

# Training arguments
training_args = TrainingArguments(
    output_dir="./reward_model",
    num_train_epochs=1,
    per_device_train_batch_size=4,
    learning_rate=1e-5,
    logging_steps=50,
    remove_unused_columns=False,  # keep the custom chosen/rejected columns
)

# Train reward model
trainer = RewardTrainer(
    model=reward_model,
    args=training_args,
    train_dataset=dataset,
)
trainer.train()
trainer.save_model("./reward_model_final")
After training, the reward model can:
Example:
prompt = "Explain quantum computing"
response_good = "Quantum computing uses quantum bits (qubits) that can be in superposition..."
response_bad = "Quantum computing is very complicated and hard to understand."
r_good = reward_model(tokenizer(response_good, return_tensors="pt"))
r_bad = reward_model(tokenizer(response_bad, return_tensors="pt"))
print(f"Good response reward: {r_good.item():.2f}") # e.g., 2.34
print(f"Bad response reward: {r_bad.item():.2f}") # e.g., -0.87
Goal: Use the reward model to optimize the SFT model with reinforcement learning.
Recall from Lesson 6: PPO optimizes a policy to maximize expected reward.
In RLHF:
Without constraints, the model could hack the reward by exploiting reward model weaknesses:
Example Reward Hacking:
Prompt: "Write a short poem"
Optimized (Hacked) Response: "AMAZING WONDERFUL FANTASTIC PERFECT EXCELLENT BRILLIANT..."
The model learned that the reward model likes positive words, so it outputs only superlatives - technically high reward but useless!
Solution: Add a penalty that prevents the policy from deviating too far from the SFT model.
PPO Objective with KL Penalty:
maximize: E[r(s, a)] - β * KL(π || π_SFT)
Where:
- r(s, a) = Reward from the reward model
- KL(π || π_SFT) = KL divergence between current policy and SFT policy
- β = Penalty coefficient (typically 0.01 - 0.1)

Intuition: "Maximize reward, but don't change the model too much."
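In practice, the KL term is usually applied as a per-token penalty subtracted from the reward, with the reward model's score added at the final token; libraries like TRL handle this internally. Here is a minimal standalone sketch of that shaping step, with made-up log-probability values for illustration:

```python
import torch

beta = 0.05  # KL penalty coefficient β

# Per-token log-probs of the generated response under the current policy
# and under the frozen SFT (reference) policy. Values here are illustrative.
logprobs_policy = torch.tensor([-1.2, -0.8, -2.1, -0.5])
logprobs_ref    = torch.tensor([-1.4, -1.0, -1.9, -0.6])

# Scalar score from the reward model for the full response
reward_score = torch.tensor(2.3)

# Per-token KL estimate: log π(a|s) - log π_SFT(a|s)
kl_per_token = logprobs_policy - logprobs_ref

# Shaped reward: -β * KL at every token, plus the reward model score at the last token
shaped_rewards = -beta * kl_per_token
shaped_rewards[-1] += reward_score

print(shaped_rewards)  # the per-token rewards that PPO actually optimizes
```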

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import PPOTrainer, PPOConfig, AutoModelForCausalLMWithValueHead
from datasets import load_dataset

# Load SFT model as policy (with a value head for PPO)
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model_final")
ref_model = AutoModelForCausalLM.from_pretrained("./sft_model_final")  # Reference (frozen)
tokenizer = AutoTokenizer.from_pretrained("./sft_model_final")

# Rebuild the Stage 2 reward model and load its trained weights
# (the exact filename depends on how/which transformers version saved it)
reward_model = RewardModel("./sft_model_final")
reward_model.load_state_dict(torch.load("./reward_model_final/pytorch_model.bin"))
reward_model.eval()

# Load prompts for RL training; HH-RLHF stores full transcripts,
# so keep everything up to (and including) the final "Assistant:" turn as the prompt
prompts_dataset = load_dataset("Anthropic/hh-rlhf", split="train[:5000]")
prompts = [ex["chosen"].rsplit("Assistant:", 1)[0] + "Assistant:" for ex in prompts_dataset]

# PPO configuration
ppo_config = PPOConfig(
    model_name="./sft_model_final",
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    gradient_accumulation_steps=4,
    ppo_epochs=4,
    init_kl_coef=0.05,  # β coefficient for KL penalty
)

# Initialize PPO trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
for epoch in range(3):
    for batch_start in range(0, len(prompts), ppo_config.batch_size):
        prompt_batch = prompts[batch_start : batch_start + ppo_config.batch_size]
        if len(prompt_batch) < ppo_config.batch_size:
            continue  # trl expects full batches

        # Tokenize prompts
        prompt_tensors = [tokenizer(p, return_tensors="pt").input_ids[0] for p in prompt_batch]

        # Generate responses from policy (keep only the newly generated tokens)
        response_tensors = []
        for prompt in prompt_tensors:
            output = policy_model.generate(
                prompt.unsqueeze(0),
                max_new_tokens=128,
                do_sample=True,
                top_p=0.9,
                temperature=0.7,
                pad_token_id=tokenizer.eos_token_id,
            )
            response_tensors.append(output[0][prompt.shape[0]:])

        # Get rewards from reward model
        rewards = []
        for prompt, response in zip(prompt_tensors, response_tensors):
            # Concatenate prompt + response and score the full text
            full_text = tokenizer.decode(torch.cat([prompt, response]))
            reward_input = tokenizer(full_text, return_tensors="pt")
            with torch.no_grad():
                reward = reward_model(**reward_input).item()
            rewards.append(torch.tensor(reward))

        # PPO update step (with KL penalty)
        stats = ppo_trainer.step(prompt_tensors, response_tensors, rewards)

        if (batch_start // ppo_config.batch_size) % 10 == 0:
            print(f"Epoch {epoch}, Batch {batch_start // ppo_config.batch_size}")
            print(f"  Mean reward: {stats['ppo/mean_scores']:.2f}")
            print(f"  KL penalty:  {stats['ppo/mean_non_score_reward']:.4f}")

# Save final RLHF model
policy_model.save_pretrained("./rlhf_model_final")
| Hyperparameter | Typical Value | Purpose |
|---|---|---|
| Learning rate | 1e-5 to 1e-6 | Policy update step size |
| β (KL coefficient) | 0.01 to 0.1 | KL penalty strength |
| PPO epochs | 4 | Updates per batch |
| Batch size | 16-64 | Prompts per update |
| Max new tokens | 128-512 | Response length |
| Temperature | 0.7-1.0 | Sampling randomness |
OpenAI's RLHF Pipeline (Ouyang et al., 2022):
Results:
Dataset Sizes:
Anthropic's Extension of RLHF:
RLHF Stage 1-3 (as above) + Constitutional AI:
Stage 4: AI Self-Critique
Example Constitution Principle:
"Choose the response that is most helpful, harmless, and honest. Avoid responses that are toxic, dangerous, or promote illegal activities."
Benefits:
Meta's Open-Source RLHF:
Public Datasets Used:
Performance: Comparable to ChatGPT on instruction following benchmarks.
Even with KL penalty, models can exploit reward model weaknesses:
Common Hacking Strategies:
Example:
User: "What's 2+2?"
Hacked Model: "Great question! I'm so glad you asked this important and thoughtful question.
Let me provide a comprehensive, helpful, and accurate response to your excellent inquiry about
this fascinating mathematical topic. The answer, which I will explain clearly and safely, is..."
[continues for 500 words]
Better Reward Models:
Red Teaming:
Adversarial Training:
Constitutional AI (as above)
Problem with RLHF: Requires training a separate reward model (Stage 2), which is complex.
DPO Alternative (Rafailov et al., 2023):
DPO Loss Function:
Loss = -log(σ(β * log(π(y_chosen | x) / π_ref(y_chosen | x))
- β * log(π(y_rejected | x) / π_ref(y_rejected | x))))
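Here is a minimal sketch of this loss in PyTorch, assuming you already have the summed log-probabilities of each full response under the policy and the frozen reference model:

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss computed from per-response (summed) log-probabilities."""
    # log π(y|x) - log π_ref(y|x) for chosen and rejected responses
    chosen_ratio = policy_chosen_logps - ref_chosen_logps
    rejected_ratio = policy_rejected_logps - ref_rejected_logps
    # -log σ(β * (chosen_ratio - rejected_ratio))
    return -F.logsigmoid(beta * (chosen_ratio - rejected_ratio)).mean()

# Illustrative values (these would normally come from model forward passes)
loss = dpo_loss(
    policy_chosen_logps=torch.tensor([-12.3]),
    policy_rejected_logps=torch.tensor([-15.1]),
    ref_chosen_logps=torch.tensor([-13.0]),
    ref_rejected_logps=torch.tensor([-14.2]),
)
print(loss)  # ≈ 0.62
```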
Advantages:
Disadvantages:
Problem: Single-round RLHF may not fully align model.
Solution: Repeat RLHF multiple times:
Round 1: SFT → Reward Model 1 → PPO → RLHF Model 1
Round 2: RLHF Model 1 → Reward Model 2 → PPO → RLHF Model 2
Round 3: RLHF Model 2 → Reward Model 3 → PPO → RLHF Model 3
...
Each round uses:
Benefits: Gradually improves alignment over multiple iterations.
| Application | How RLHF is Used | Example |
|---|---|---|
| Chatbots | Align responses with helpfulness, safety, tone | ChatGPT, Claude, Bard |
| Code Generation | Prefer working, readable, secure code | GitHub Copilot, Codex |
| Summarization | Generate concise, accurate summaries | Claude, GPT-4 |
| Content Moderation | Filter toxic/harmful content | OpenAI Moderation API |
| Customer Support | Provide helpful, empathetic responses | Intercom, Zendesk bots |
| Creative Writing | Match user's style and intent | NovelAI, Sudowrite |
| Search | Rank results by user preference | Bing Chat, Google Bard |
| Advantage | Description |
|---|---|
| ✅ Scalable Alignment | Learn from comparisons (cheaper than demonstrations) |
| ✅ Flexible Objectives | Reward model captures nuanced preferences |
| ✅ Iterative Improvement | Can continuously update with new data |
| ✅ Handles Subjectivity | Works even when "correct answer" is undefined |
| ✅ Reduces Harmful Outputs | Learns to avoid toxic/dangerous content |
| Limitation | Description | Mitigation |
|---|---|---|
| ⚠️ Reward Hacking | Models exploit reward model weaknesses | KL penalty, adversarial training |
| ⚠️ Human Bias | Inherits biases from human labelers | Diverse labeler pool, bias audits |
| ⚠️ Expensive Labeling | Requires thousands of human comparisons | Active learning, AI-assisted labeling |
| ⚠️ Alignment Tax | RLHF models may be more restrictive/cautious | Balance safety with capability |
| ⚠️ Reward Model Errors | Imperfect reward model causes suboptimal policy | Ensemble models, uncertainty estimates |
| Stage | Dataset Type | Typical Size | Cost Estimate |
|---|---|---|---|
| Stage 1: SFT | Instruction demonstrations | 10K - 100K | $50K - $500K |
| Stage 2: Reward Model | Preference comparisons | 50K - 1M pairs | $20K - $200K |
| Stage 3: PPO | Prompts only (no labels) | 100K - 1M | No labeling cost (compute only) |
Note: Open-source datasets (Alpaca, HH-RLHF, OpenAssistant) significantly reduce costs.
RLHF = RL + LLM: Combines PPO (Lesson 6) with language models (Lesson 15) to align AI with human preferences.
3-Stage Pipeline: SFT (learn instructions) → Reward Model (learn preferences) → PPO (optimize for preferences).
Preference Learning: Easier to collect comparisons (A vs B) than absolute ratings or demonstrations.
Bradley-Terry Model: Reward model predicts preference probabilities from scalar rewards.
KL Penalty: Prevents reward hacking by keeping policy close to SFT model.
Production Systems: ChatGPT, Claude, and Llama 2 Chat all use RLHF for alignment.
Reward Hacking: Models can exploit reward model weaknesses; mitigate with adversarial training.
Constitutional AI: Extends RLHF with AI self-critique for more scalable alignment.
DPO Alternative: Simpler 2-stage approach that skips reward model training.
Future of AI: RLHF is the foundation for aligning increasingly powerful AI systems with human values.
Congratulations! You've reached the course climax - the convergence of RL and GenAI.
In the next lessons, we'll explore:
Integration Complete: You now understand how modern AI assistants are built. Every interaction with ChatGPT, Claude, or GitHub Copilot is powered by the RLHF techniques you've just learned.
RLHF transforms language models from text predictors into aligned assistants through a 3-stage pipeline:
Real-World Impact: RLHF enabled ChatGPT, Claude, and Llama 2 Chat - the AI assistants transforming industries today.
Challenges: Reward hacking, human bias, labeling costs, alignment tax.
Advanced Techniques: Constitutional AI (self-critique), DPO (skip reward model), iterative RLHF.
The Big Picture: This lesson unites 15 previous lessons - RL provides the optimization framework, GenAI provides the models, and RLHF aligns them with human values. This is the foundation of modern AI alignment.
Next: Multi-Modal AI - Vision and Language →