Practice and reinforce the concepts from Lesson 16
In this activity, you'll implement the complete 3-stage RLHF pipeline to transform a base language model into an instruction-following assistant. You'll fine-tune the base model on instruction demonstrations (SFT), train a reward model on human preference comparisons, and then use PPO to optimize the model against that reward. This hands-on implementation will show you how ChatGPT, Claude, and other AI assistants are created.
What makes this special: You're building the same kind of pipeline OpenAI used to create ChatGPT (described in the InstructGPT paper), scaled down to run in Colab. By the end, you'll have a working RLHF model that generates more helpful, harmless, and honest responses than the base model.
Time Required: 90-120 minutes
By completing this activity, you will be able to:
Before starting this activity, you should:
Download the template from the course Templates folder:
AI25-Template-activity-16-rlhf.zip
Upload the .ipynb file to Google Colab
Run the first cell to verify your environment and see a working demo
This activity provides roughly 65-70% of a working RLHF implementation. You'll complete the missing pieces:
Example Output:
Prompt: "Explain machine learning in one sentence"
Base Model: "Machine learning is a subset of artificial intelligence that focuses on..."
SFT Model: "Machine learning is when computers learn patterns from data to make predictions without being explicitly programmed."
Example Output:
Prompt: "Write a haiku about coding"
Response A (Chosen): "Lines of logic flow, / Algorithms come alive, / Bugs fear the debugger."
Reward A: 2.34
Response B (Rejected): "Coding is fun and interesting. I like to code."
Reward B: -0.87
Example Output:
Prompt: "How do I learn Python?"
SFT Model: "You can learn Python by reading tutorials and practicing coding exercises."
Reward: 0.23
RLHF Model: "Here's a structured approach to learn Python:
1. Start with Python.org's beginner tutorial
2. Practice on LeetCode or HackerRank
3. Build small projects (calculator, to-do list)
4. Join Python communities (r/learnpython)
5. Read 'Automate the Boring Stuff with Python'
What's your current programming experience? I can tailor the roadmap for you."
Reward: 2.71
Your implementation is successful when:
Start with the Working Code: Run all pre-built cells first to understand the full pipeline before implementing TODOs.
Use Small Models: The template uses GPT-2 (124M parameters) for fast iteration. Upgrade to Llama-2-7B for better quality if you have Colab Pro.
Monitor Training: Watch the loss curves and sample generations during training to catch issues early.
GPU Memory: If you run out of memory, reduce batch_size or use gradient_checkpointing=True.
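Before you start on the TODOs, it can help to confirm that the base model loads and generates. The sketch below is a minimal, assumed setup (variable names are illustrative; the template's own loading cell may differ) that also applies the pad-token and gradient-checkpointing tips above:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

# Small base model used by the template (GPT-2, 124M parameters)
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# GPT-2 ships without a pad token; reuse EOS so padded batches work
tokenizer.pad_token = tokenizer.eos_token
model.config.pad_token_id = tokenizer.eos_token_id

# Trade compute for memory if Colab runs out of GPU RAM
model.gradient_checkpointing_enable()

# Quick sanity check: the base model should produce fluent (if unhelpful) text
inputs = tokenizer("Explain machine learning in one sentence.", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=40, do_sample=True, top_p=0.9)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```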
Hint: Use the HuggingFace Trainer API. The dataset is already tokenized. (A minimal training sketch follows the debug checklist below.)
Common Mistakes:
Forgetting to set tokenizer.pad_token (causes errors)
Debug Checklist:
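The sketch below shows one way to wire up the SFT step with the Trainer API, assuming the template exposes an already-tokenized dataset named `sft_dataset` (that name, and the output directory, are assumptions; match them to your notebook):

```python
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # common mistake: forgetting this

training_args = TrainingArguments(
    output_dir="sft-model",
    learning_rate=2e-5,              # recommended SFT learning rate
    num_train_epochs=2,              # 1-3 epochs; more tends to overfit small datasets
    per_device_train_batch_size=4,
    logging_steps=10,                # watch the loss curve during training
    fp16=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=sft_dataset["train"],  # assumed name for the pre-tokenized dataset
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),  # causal-LM labels
)
trainer.train()
trainer.save_model("sft-model")
```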
Hint: The Bradley-Terry loss is: -log(sigmoid(r_chosen - r_rejected)). (A loss-function sketch follows the debug checklist below.)
Common Mistakes:
Debug Checklist:
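For the reward model, the sketch below scores each response with a one-output classification head and applies the Bradley-Terry loss from the hint; the batch keys (`chosen_input_ids`, etc.) are assumptions about how the preference pairs are tokenized in your notebook:

```python
import torch.nn.functional as F
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# A single scalar output per sequence serves as the reward score
reward_model = AutoModelForSequenceClassification.from_pretrained("gpt2", num_labels=1)
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
reward_model.config.pad_token_id = tokenizer.eos_token_id

def bradley_terry_loss(batch):
    """-log(sigmoid(r_chosen - r_rejected)), averaged over the batch."""
    r_chosen = reward_model(input_ids=batch["chosen_input_ids"],
                            attention_mask=batch["chosen_attention_mask"]).logits.squeeze(-1)
    r_rejected = reward_model(input_ids=batch["rejected_input_ids"],
                              attention_mask=batch["rejected_attention_mask"]).logits.squeeze(-1)
    # If the loss stays near log(2) ≈ 0.69, check that chosen/rejected labels aren't swapped
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```

Validation accuracy is simply the fraction of pairs where r_chosen > r_rejected; the grading rubric expects this to exceed 65%.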
Hint: Use the PPOTrainer from the trl library. Set init_kl_coef=0.05 for the KL penalty. (A PPO-loop sketch follows the debug checklist below.)
Common Mistakes:
Debug Checklist:
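The sketch below shows one PPO iteration with trl. It assumes the classic PPOTrainer API from the 0.x releases (the API changed in later versions), the tokenizer loaded earlier, and a hypothetical `score_responses` helper that wraps TODO 2's reward model:

```python
import torch
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer

# The policy gets a value head for PPO; a frozen copy of the SFT model anchors the KL penalty
policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")
ref_policy = AutoModelForCausalLMWithValueHead.from_pretrained("sft-model")

config = PPOConfig(
    learning_rate=1e-6,   # very low to avoid catastrophic updates
    init_kl_coef=0.05,    # KL penalty strength (β)
    batch_size=16,
    mini_batch_size=4,
)
ppo_trainer = PPOTrainer(config, policy, ref_policy, tokenizer)

gen_kwargs = {"max_new_tokens": 64, "do_sample": True, "top_p": 0.9,
              "pad_token_id": tokenizer.eos_token_id}

for batch_prompts in prompt_batches:  # each batch holds config.batch_size prompts
    query_tensors = [tokenizer(p, return_tensors="pt").input_ids[0] for p in batch_prompts]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, **gen_kwargs)
    responses = [tokenizer.decode(r, skip_special_tokens=True) for r in response_tensors]

    # score_responses is a hypothetical helper around TODO 2's reward model;
    # step() expects one scalar tensor per sample
    rewards = [torch.tensor(s) for s in score_responses(batch_prompts, responses)]
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)  # watch mean reward and KL here
```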
If your RLHF model generates weird outputs:
Symptom: Very long responses (500+ tokens)
Fix: Add length penalty to reward: reward = reward - 0.01 * length
Symptom: Repetitive keywords ("helpful safe accurate helpful safe...")
Fix: Increase KL penalty (β = 0.1 instead of 0.05)
Symptom: Always refuses to answer
Fix: Reward model may be overly cautious; add more diverse training data
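If you apply the length-penalty fix from above inside the PPO loop, it is a one-line adjustment before calling `ppo_trainer.step` (this reuses the `rewards` and `response_tensors` names assumed in the earlier PPO sketch):

```python
# Subtract a small penalty per generated token to discourage 500+ token rambles
rewards = [r - 0.01 * len(resp) for r, resp in zip(rewards, response_tensors)]
```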
| Hyperparameter | Recommended Value | Why |
|---|---|---|
| SFT learning rate | 2e-5 | Too high -> unstable, too low -> slow |
| SFT epochs | 1-3 | More epochs -> overfitting on small datasets |
| RM learning rate | 1e-5 | Lower than SFT to avoid overfitting |
| RM batch size | 4-8 | Higher is better (more pairs per update) |
| PPO learning rate | 1e-6 | Very low to prevent catastrophic updates |
| PPO β (KL coef) | 0.05 | Balance reward vs. drift |
| PPO batch size | 16-32 | Trade-off: larger = stable, smaller = diverse |
Task: Try different KL penalty values (β = 0.01, 0.05, 0.1, 0.2) and compare results.
Questions:
Expected Outcome: Plot showing reward-KL trade-off curve.
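One way to run the sweep is sketched below; it assumes you wrap your PPO training in a hypothetical `run_ppo(kl_coef)` function that returns the final mean reward and mean KL:

```python
import matplotlib.pyplot as plt

betas = [0.01, 0.05, 0.1, 0.2]
results = []
for beta in betas:
    mean_reward, mean_kl = run_ppo(kl_coef=beta)  # hypothetical wrapper around the PPO loop
    results.append((beta, mean_reward, mean_kl))

# Reward-KL trade-off curve: each point is one β setting
plt.plot([kl for _, _, kl in results], [r for _, r, _ in results], marker="o")
for beta, r, kl in results:
    plt.annotate(f"β={beta}", (kl, r))
plt.xlabel("Mean KL divergence from SFT model")
plt.ylabel("Mean reward")
plt.title("Reward vs. KL trade-off")
plt.show()
```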
Task: Train a reward model that optimizes for both helpfulness AND conciseness.
Approach:
r_total = α * r_helpful + (1-α) * r_concise
Expected Outcome: RLHF model that generates helpful but concise responses.
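A minimal sketch of the combined reward is shown below; the helpfulness score comes from TODO 2's reward model via a hypothetical `score_helpfulness` helper, and the conciseness term is a crude length-based proxy (both are assumptions, not the template's API):

```python
def combined_reward(prompt, response, alpha=0.7):
    """r_total = alpha * r_helpful + (1 - alpha) * r_concise."""
    r_helpful = score_helpfulness(prompt, response)            # hypothetical reward-model wrapper
    r_concise = 1.0 - min(len(response.split()), 200) / 200.0  # shorter responses score higher
    return alpha * r_helpful + (1 - alpha) * r_concise
```

Sweep α to see how much helpfulness you trade away for brevity.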
Task: Add a 4th stage where the RLHF model critiques and revises its own responses.
Steps:
Expected Outcome: Iterative improvement loop that reduces harmful outputs.
Task: Replace the 3-stage RLHF pipeline with a 2-stage DPO pipeline.
Approach:
DPO Loss (Rafailov et al., 2023):
loss = -log(sigmoid(β * log(π(y_chosen | x) / π_ref(y_chosen | x))
- β * log(π(y_rejected | x) / π_ref(y_rejected | x))))
Expected Outcome: Compare DPO vs. RLHF on quality, training time, and stability.
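The DPO loss above translates almost directly into PyTorch. The sketch below assumes you have already computed the summed log-probabilities log π(y | x) of each chosen and rejected response under the policy and the frozen reference model; in practice, trl's DPOTrainer implements this loss for you.

```python
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO loss (Rafailov et al., 2023) over a batch of preference pairs.

    Each argument is a 1-D tensor of summed log-probabilities log π(y | x).
    """
    chosen_logratio = policy_chosen_logps - ref_chosen_logps        # log(π/π_ref) for chosen
    rejected_logratio = policy_rejected_logps - ref_rejected_logps  # log(π/π_ref) for rejected
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()
```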
Completed Jupyter Notebook (.ipynb file)
Comparison Report (Markdown cell in notebook)
Hyperparameter Log
| Criterion | Points | Description |
|---|---|---|
| SFT Implementation | 20 | Correct training loop, loss convergence |
| Reward Model | 25 | Bradley-Terry loss, validation accuracy > 65% |
| PPO Training | 30 | Reward increases, KL stays in range |
| Evaluation | 15 | Clear comparison showing RLHF > SFT |
| Code Quality | 10 | Clean, commented, well-organized |
| Total | 100 | |
Bonus Points (+10 each):
Explore Open-Source RLHF Models:
meta-llama/Llama-2-7b-chat-hf
Read the InstructGPT Paper:
Prepare for Activity 17:
Skills from this activity are directly applicable to:
Solution: Reduce batch size or use gradient accumulation:
training_args = TrainingArguments(
per_device_train_batch_size=2, # Reduce from 4
gradient_accumulation_steps=2, # Effective batch size still 4
)
Solution: Lower learning rate or enable mixed precision:
training_args = TrainingArguments(
learning_rate=1e-6, # Reduce from 1e-5
fp16=True, # Mixed precision
)
Solution: Check that chosen/rejected pairs are labeled correctly. Print a few examples to verify.
Solution: Increase KL penalty (β) or reduce PPO learning rate:
ppo_config = PPOConfig(
learning_rate=5e-7, # Reduce from 1e-6
init_kl_coef=0.1, # Increase from 0.05
)
Solution: This is reward hacking. Increase KL penalty or add diversity penalty:
# During generation (prompt must already be tokenized input_ids, not raw text)
response = model.generate(
    prompt,
    do_sample=True,
    top_p=0.9,               # Nucleus sampling
    temperature=0.8,         # Increase randomness
    repetition_penalty=1.2,  # Penalize repetition
)
Your submission will be evaluated on:
Passing Criteria: score >= 70/100 and all success criteria met.
After completing the activity, reflect on:
How did SFT improve the base model? What types of errors did it fix?
What preferences did the reward model learn? Examine high-reward vs. low-reward responses.
How did PPO change the model's behavior? Compare SFT vs. RLHF generations.
Did you observe reward hacking? If so, what form did it take?
What are the ethical implications of RLHF? Whose preferences should AI systems align with?
How would you improve the RLHF pipeline? Consider data quality, training efficiency, and safety.
Congratulations! By completing this activity, you've implemented the core technology behind ChatGPT, Claude, and modern AI assistants. You now understand how reinforcement learning from human feedback aligns AI systems with human values - one of the most important techniques in modern AI.
Next Activity: Activity 17 - Multi-Modal AI ->