Practice and reinforce the concepts from Lesson 16
In this activity, you'll implement the complete 3-stage RLHF pipeline to transform a base language model into an instruction-following assistant. You'll fine-tune the base model on instruction demonstrations (SFT), train a reward model on human preference comparisons, and then use PPO to optimize the model against that reward. This hands-on implementation will show you how ChatGPT, Claude, and other AI assistants are created.
What makes this special: You're building the same kind of pipeline OpenAI used to create ChatGPT (described in the InstructGPT paper), scaled down to run in Colab. By the end, you'll have a working RLHF model that generates more helpful, harmless, and honest responses than the base model.
Time Required: 90-120 minutes
By completing this activity, you will be able to:
Before starting this activity, you should:
Download the template from the course Templates folder:
AI25-Template-activity-16-rlhf.zip
Upload the .ipynb file to Google Colab
Run the first cell to verify your environment and see a working demo
This activity provides roughly 65-70% of a working RLHF implementation. You'll complete the missing pieces:
Example Output:
Prompt: "Explain machine learning in one sentence"
Base Model: "Machine learning is a subset of artificial intelligence that focuses on..."
SFT Model: "Machine learning is when computers learn patterns from data to make predictions without being explicitly programmed."
Example Output:
Prompt: "Write a haiku about coding"
Response A (Chosen): "Lines of logic flow, / Algorithms come alive, / Bugs fear the debugger."
Reward A: 2.34
Response B (Rejected): "Coding is fun and interesting. I like to code."
Reward B: -0.87
Example Output:
Prompt: "How do I learn Python?"
SFT Model: "You can learn Python by reading tutorials and practicing coding exercises."
Reward: 0.23
RLHF Model: "Here's a structured approach to learn Python:
1. Start with Python.org's beginner tutorial
2. Practice on LeetCode or HackerRank
3. Build small projects (calculator, to-do list)
4. Join Python communities (r/learnpython)
5. Read 'Automate the Boring Stuff with Python'
What's your current programming experience? I can tailor the roadmap for you."
Reward: 2.71
Your implementation is successful when:
Start with the Working Code: Run all pre-built cells first to understand the full pipeline before implementing TODOs.
Use Small Models: The template uses GPT-2 (124M parameters) for fast iteration. Upgrade to Llama-2-7B for better quality if you have Colab Pro.
Monitor Training: Watch the loss curves and sample generations during training to catch issues early.
GPU Memory: If you run out of memory, reduce batch_size or use gradient_checkpointing=True.
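Before you start on the TODOs, it can help to confirm that the base model loads and generates. The sketch below is a minimal, assumed setup (variable names are illustrative; the template's own loading cell may differ) that also applies the pad-token and gradient-checkpointing tips above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small base model used by the template (GPT-2, 124M parameters)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse EOS so padded batches work
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Trade compute for memory if Colab runs out of GPU RAM
model.gradient_checkpointing_enable()

# Quick sanity check: the base model should produce fluent (if unhelpful) text
inputs = tokenizer("Explain machine learning in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```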
Hint: Use the HuggingFace Trainer API. The dataset is already tokenized. (A minimal training sketch follows the debug checklist below.)
Common Mistakes:
Forgetting to set tokenizer.pad_token (causes errors)
Debug Checklist:
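The sketch below shows one way to wire up the SFT step with the Trainer API, assuming the template exposes an already-tokenized dataset named `sft_dataset` (that name, and the output directory, are assumptions; match them to your notebook):

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # common mistake: forgetting this

training_args = TrainingArguments(
    output_dir="sft-model",
    learning_rate=2e-5,              # recommended SFT learning rate
    num_train_epochs=2,              # 1-3 epochs; more tends to overfit small datasets
    per_device_train_batch_size=4,
    logging_steps=10,                # watch the loss curve during training
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sft_dataset["train"],  # assumed name for the pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()
trainer.save_model("sft-model")
```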
Hint: The Bradley-Terry loss is: -log(sigmoid(r_chosen - r_rejected)). (A loss-function sketch follows the debug checklist below.)
Common Mistakes:
Debug Checklist:
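For the reward model, the sketch below scores each response with a one-output classification head and applies the Bradley-Terry loss from the hint; the batch keys (`chosen_input_ids`, etc.) are assumptions about how the preference pairs are tokenized in your notebook:

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A single scalar output per sequence serves as the reward score
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.eos_token_id

def bradley_terry_loss(batch):
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    r_chosen = reward_model(input_ids=batch["chosen_input_ids"],
                            attention_mask=batch["chosen_attention_mask"]).logits.squeeze(-1)
    r_rejected = reward_model(input_ids=batch["rejected_input_ids"],
                              attention_mask=batch["rejected_attention_mask"]).logits.squeeze(-1)
    # If the loss stays near log(2) ≈ 0.69, check that chosen/rejected labels aren't swapped
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Validation accuracy is simply the fraction of pairs where r_chosen > r_rejected; the grading rubric expects this to exceed 65%.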
Hint: Use the PPOTrainer from the trl library. Set init_kl_coef=0.05 for the KL penalty. (A PPO-loop sketch follows the debug checklist below.)
Common Mistakes:
Debug Checklist:
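The sketch below shows one PPO iteration with trl. It assumes the classic PPOTrainer API from the 0.x releases (the API changed in later versions), the tokenizer loaded earlier, and a hypothetical `score_responses` helper that wraps TODO 2's reward model:

```python
import torch
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The policy gets a value head for PPO; a frozen copy of the SFT model anchors the KL penalty
policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")

config = PPOConfig(
    learning_rate=1e-6,   # very low to avoid catastrophic updates
    init_kl_coef=0.05,    # KL penalty strength (β)
    batch_size=16,
    mini_batch_size=4,
)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "top_p": 0.9,
              "pad_token_id": tokenizer.eos_token_id}

for batch_prompts in prompt_batches:  # each batch holds config.batch_size prompts
    query_tensors = [tokenizer(p, return_tensors="pt").input_ids[0] for p in batch_prompts]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **gen_kwargs)
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # score_responses is a hypothetical helper around TODO 2's reward model;
    # step() expects one scalar tensor per sample
    rewards = [torch.tensor(s) for s in score_responses(batch_prompts, responses)]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)  # watch mean reward and KL here
```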
If your RLHF model generates weird outputs:
Symptom: Very long responses (500+ tokens)
Fix: Add length penalty to reward: reward = reward - 0.01 * length
Symptom: Repetitive keywords ("helpful safe accurate helpful safe...")
Fix: Increase KL penalty (β = 0.1 instead of 0.05)
Symptom: Always refuses to answer
Fix: Reward model may be overly cautious; add more diverse training data
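If you apply the length-penalty fix from above inside the PPO loop, it is a one-line adjustment before calling `ppo_trainer.step` (this reuses the `rewards` and `response_tensors` names assumed in the earlier PPO sketch):

```python
# Subtract a small penalty per generated token to discourage 500+ token rambles
rewards = [r - 0.01 * len(resp) for r, resp in zip(rewards, response_tensors)]
```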
| Hyperparameter | Recommended Value | Why |
|---|---|---|
| SFT learning rate | 2e-5 | Too high -> unstable, too low -> slow |
| SFT epochs | 1-3 | More epochs -> overfitting on small datasets |
| RM learning rate | 1e-5 | Lower than SFT to avoid overfitting |
| RM batch size | 4-8 | Higher is better (more pairs per update) |
| PPO learning rate | 1e-6 | Very low to prevent catastrophic updates |
| PPO β (KL coef) | 0.05 | Balance reward vs. drift |
| PPO batch size | 16-32 | Trade-off: larger = stable, smaller = diverse |
Task: Try different KL penalty values (β = 0.01, 0.05, 0.1, 0.2) and compare results.
Questions:
Expected Outcome: Plot showing reward-KL trade-off curve.
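One way to run the sweep is sketched below; it assumes you wrap your PPO training in a hypothetical `run_ppo(kl_coef)` function that returns the final mean reward and mean KL:

```python
import matplotlib.pyplot as plt

betas = [0.01, 0.05, 0.1, 0.2]
results = []
for beta in betas:
    mean_reward, mean_kl = run_ppo(kl_coef=beta)  # hypothetical wrapper around the PPO loop
    results.append((beta, mean_reward, mean_kl))

# Reward-KL trade-off curve: each point is one β setting
plt.plot([kl for _, _, kl in results], [r for _, r, _ in results], marker="o")
for beta, r, kl in results:
    plt.annotate(f"β={beta}", (kl, r))
plt.xlabel("Mean KL divergence from SFT model")
plt.ylabel("Mean reward")
plt.title("Reward vs. KL trade-off")
plt.show()
```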
Task: Train a reward model that optimizes for both helpfulness AND conciseness.
Approach:
r_total = α * r_helpful + (1-α) * r_concise
Expected Outcome: RLHF model that generates helpful but concise responses.
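A minimal sketch of the combined reward is shown below; the helpfulness score comes from TODO 2's reward model via a hypothetical `score_helpfulness` helper, and the conciseness term is a crude length-based proxy (both are assumptions, not the template's API):

```python
def combined_reward(prompt, response, alpha=0.7):
    """r_total = alpha * r_helpful + (1 - alpha) * r_concise."""
    r_helpful = score_helpfulness(prompt, response)            # hypothetical reward-model wrapper
    r_concise = 1.0 - min(len(response.split()), 200) / 200.0  # shorter responses score higher
    return alpha * r_helpful + (1 - alpha) * r_concise
```

Sweep α to see how much helpfulness you trade away for brevity.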
Task: Add a 4th stage where the RLHF model critiques and revises its own responses.
Steps:
Expected Outcome: Iterative improvement loop that reduces harmful outputs.
Task: Replace the 3-stage RLHF pipeline with a 2-stage DPO pipeline.
Approach:
DPO Loss (Rafailov et al., 2023):
loss = -log(sigmoid(β * log(π(y_chosen | x) / π_ref(y_chosen | x))
- β * log(π(y_rejected | x) / π_ref(y_rejected | x))))
Expected Outcome: Compare DPO vs. RLHF on quality, training time, and stability.
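The DPO loss above translates almost directly into PyTorch. The sketch below assumes you have already computed the summed log-probabilities log π(y | x) of each chosen and rejected response under the policy and the frozen reference model; in practice, trl's DPOTrainer implements this loss for you.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023) over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities log π(y | x).
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(π/π_ref) for chosen
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(π/π_ref) for rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```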
Completed Jupyter Notebook (.ipynb file)
Comparison Report (Markdown cell in notebook)
Hyperparameter Log
| Criterion | Points | Description |
|---|---|---|
| SFT Implementation | 20 | Correct training loop, loss convergence |
| Reward Model | 25 | Bradley-Terry loss, validation accuracy > 65% |
| PPO Training | 30 | Reward increases, KL stays in range |
| Evaluation | 15 | Clear comparison showing RLHF > SFT |
| Code Quality | 10 | Clean, commented, well-organized |
| Total | 100 | |
Bonus Points (+10 each):
Explore Open-Source RLHF Models:
meta-llama/Llama-2-7b-chat-hf
Read the InstructGPT Paper:
Prepare for Activity 17:
Skills from this activity are directly applicable to:
Solution: Reduce batch size or use gradient accumulation:
training_args = TrainingArguments(
per_device_train_batch_size=2, # Reduce from 4
gradient_accumulation_steps=2, # Effective batch size still 4
)
Solution: Lower learning rate or enable mixed precision:
training_args = TrainingArguments(
learning_rate=1e-6, # Reduce from 1e-5
fp16=True, # Mixed precision
)
Solution: Check that chosen/rejected pairs are labeled correctly. Print a few examples to verify.
Solution: Increase KL penalty (β) or reduce PPO learning rate:
ppo_config = PPOConfig(
learning_rate=5e-7, # Reduce from 1e-6
init_kl_coef=0.1, # Increase from 0.05
)
Solution: This is reward hacking. Increase KL penalty or add diversity penalty:
# During generation (prompt must already be tokenized input_ids, not raw text)
response = model.generate(
    prompt,
    do_sample=True,
    top_p=0.9,               # Nucleus sampling
    temperature=0.8,         # Increase randomness
    repetition_penalty=1.2,  # Penalize repetition
)
Your submission will be evaluated on:
Passing Criteria: score >= 70/100 and all success criteria met.
After completing the activity, reflect on:
How did SFT improve the base model? What types of errors did it fix?
What preferences did the reward model learn? Examine high-reward vs. low-reward responses.
How did PPO change the model's behavior? Compare SFT vs. RLHF generations.
Did you observe reward hacking? If so, what form did it take?
What are the ethical implications of RLHF? Whose preferences should AI systems align with?
How would you improve the RLHF pipeline? Consider data quality, training efficiency, and safety.
Congratulations! By completing this activity, you've implemented the core technology behind ChatGPT, Claude, and modern AI assistants. You now understand how reinforcement learning from human feedback aligns AI systems with human values - one of the most important techniques in modern AI.
Next Activity: Activity 17 - Multi-Modal AI ->