Apply your knowledge to build something amazing!
Duration: 3 weeks
Points: 100
Prerequisites: Complete Lessons 6 (PPO), 15 (LLMs), and 16 (RLHF)
Difficulty: Advanced
In this project, you'll implement the complete 3-stage RLHF pipeline to align a language model with human preferences. You'll perform supervised fine-tuning (SFT), train a reward model from preference comparisons, and use PPO to optimize the model for helpfulness, harmlessness, and honesty. This is the core technique behind ChatGPT, Claude, and other aligned AI assistants.
Why This Matters: RLHF is THE breakthrough that transformed language models from text predictors into helpful assistants. This project gives you production-ready experience with the most important technique in modern AI.
What You'll Build:
By completing this project, you will:
Your RLHF system must:
Keep KL divergence from the SFT reference model <0.5 during training
Your implementation must include:
project-05-text-generation-rlhf/
├── README.md # Project documentation
├── requirements.txt # Python dependencies
├── stage1_sft.py # Supervised fine-tuning
├── stage2_reward_model.py # Reward model training
├── stage3_ppo.py # PPO training loop
├── evaluation.py # Model comparison and metrics
├── chatbot_demo.py # Interactive Gradio interface
├── data/ # Datasets
│ ├── instructions/ # SFT data
│ └── preferences/ # Reward model data
├── models/ # Saved checkpoints
│ ├── sft_model/
│ ├── reward_model/
│ └── rlhf_model/
└── logs/ # Training logs
| Criterion | Points | Description |
|---|---|---|
| SFT Implementation | 20 | Correct instruction fine-tuning |
| Reward Model | 25 | Bradley-Terry model with >65% accuracy |
| PPO Training | 30 | Correct PPO with KL penalty, rewards increase |
| Evaluation | 15 | Clear improvement: RLHF > SFT > Base |
| Demo | 10 | Interactive chatbot comparing models |
| Total | 100 | |
Bonus Points (+10 each):
Day 1-2: Dataset Preparation
Day 3-4: SFT Training
Day 5: SFT Evaluation
Deliverable: SFT model that follows basic instructions
Day 6-7: Preference Dataset
Day 8-9: Reward Model Implementation
Day 10: Reward Model Evaluation
Deliverable: Reward model with >65% validation accuracy
Day 11-13: PPO Implementation
Day 14-15: Training Monitoring (keep KL divergence <0.5)
Day 16-17: Evaluation
Day 18-19: Interactive Demo
Day 20-21: Documentation and Portfolio
Deliverable: Complete RLHF system with evaluation and demo
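For the interactive demo (Days 18-19, chatbot_demo.py), a minimal Gradio comparison app can look like the sketch below. The checkpoint paths and the generate_response helper are assumptions; adapt them to how you saved your own models.

import gradio as gr
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed checkpoint locations from the project layout above
tokenizer = AutoTokenizer.from_pretrained("./models/sft_model")
sft_model = AutoModelForCausalLM.from_pretrained("./models/sft_model")
rlhf_model = AutoModelForCausalLM.from_pretrained("./models/rlhf_model")

def generate_response(model, prompt):
    inputs = tokenizer(f"### Instruction:\n{prompt}\n\n### Response:\n", return_tensors="pt")
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=128, do_sample=True, top_p=0.9)
    return tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)

def compare(prompt):
    # Show SFT and RLHF answers side by side for the same prompt
    return generate_response(sft_model, prompt), generate_response(rlhf_model, prompt)

demo = gr.Interface(
    fn=compare,
    inputs=gr.Textbox(label="Prompt"),
    outputs=[gr.Textbox(label="SFT model"), gr.Textbox(label="RLHF model")],
    title="SFT vs. RLHF comparison",
)
demo.launch()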
# Stage 1: Supervised fine-tuning (stage1_sft.py)
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, DataCollatorForLanguageModeling, Trainer, TrainingArguments

# Load base model and tokenizer
model = AutoModelForCausalLM.from_pretrained("gpt2")
tokenizer = AutoTokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token

# Format each example as an instruction/response prompt
def format_instruction(example):
    return {
        "text": f"### Instruction:\n{example['instruction']}\n\n### Response:\n{example['output']}"
    }

def tokenize(example):
    return tokenizer(example["text"], truncation=True, max_length=512)

dataset = load_dataset("tatsu-lab/alpaca")["train"]
dataset = dataset.map(format_instruction)
dataset = dataset.map(tokenize, remove_columns=dataset.column_names)

# Training arguments
training_args = TrainingArguments(
    output_dir="./sft_model",
    num_train_epochs=3,
    per_device_train_batch_size=4,
    learning_rate=2e-5,
    logging_steps=100,
    save_steps=1000,
)

# Train with a causal-LM collator (labels are the shifted input_ids)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
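After training, it is worth sanity-checking the SFT model with a quick generation in the same prompt format. This is a small sketch that reuses the model and tokenizer from the code above; the instruction text is just an example.

prompt = "### Instruction:\nExplain what a reward model is in one sentence.\n\n### Response:\n"
inputs = tokenizer(prompt, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=64, do_sample=True, top_p=0.9)
# Decode only the newly generated tokens, not the prompt
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))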
# Stage 2: Reward model (stage2_reward_model.py)
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, base_model):
        super().__init__()
        self.backbone = base_model  # e.g. AutoModel.from_pretrained("gpt2"), no LM head
        self.reward_head = nn.Linear(base_model.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask):
        outputs = self.backbone(input_ids, attention_mask=attention_mask)
        # Use the last token's hidden state as the sequence representation
        last_hidden = outputs.last_hidden_state[:, -1, :]
        reward = self.reward_head(last_hidden)
        return reward  # shape: (batch_size, 1)

# Training loop: push chosen responses above rejected ones
for chosen, rejected in preference_dataloader:
    # Score both responses with the same reward model
    r_chosen = reward_model(chosen["input_ids"], chosen["attention_mask"])
    r_rejected = reward_model(rejected["input_ids"], rejected["attention_mask"])

    # Bradley-Terry loss: -log sigmoid(r_chosen - r_rejected)
    loss = -torch.log(torch.sigmoid(r_chosen - r_rejected)).mean()

    # Backward pass and update
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
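The rubric asks for >65% pairwise accuracy on held-out preferences. A minimal check, assuming a val_dataloader that yields chosen/rejected batches in the same format as the training loop above, might look like this:

# Pairwise accuracy: fraction of validation pairs where the chosen response scores higher
correct, total = 0, 0
reward_model.eval()
with torch.no_grad():
    for chosen, rejected in val_dataloader:
        r_chosen = reward_model(chosen["input_ids"], chosen["attention_mask"])
        r_rejected = reward_model(rejected["input_ids"], rejected["attention_mask"])
        correct += (r_chosen > r_rejected).sum().item()
        total += r_chosen.shape[0]
print(f"Validation accuracy: {correct / total:.1%}")  # target: >65%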
# Stage 3: PPO training (stage3_ppo.py)
import torch
from transformers import AutoModel, AutoTokenizer
from trl import AutoModelForCausalLMWithValueHead, PPOConfig, PPOTrainer
from stage2_reward_model import RewardModel

# Load models
policy_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")
ref_model = AutoModelForCausalLMWithValueHead.from_pretrained("./sft_model")  # Frozen reference
tokenizer = AutoTokenizer.from_pretrained("./sft_model")

# Reward model from Stage 2 (a plain nn.Module, so load its weights manually;
# the checkpoint filename depends on how you saved it in stage2_reward_model.py)
reward_model = RewardModel(AutoModel.from_pretrained("gpt2"))
reward_model.load_state_dict(torch.load("./reward_model/reward_model.pt"))
reward_model.eval()

# PPO config
ppo_config = PPOConfig(
    learning_rate=1e-5,
    batch_size=16,
    mini_batch_size=4,
    init_kl_coef=0.05,  # KL penalty coefficient
    ppo_epochs=4,
)

# Initialize trainer
ppo_trainer = PPOTrainer(
    config=ppo_config,
    model=policy_model,
    ref_model=ref_model,
    tokenizer=tokenizer,
)

# Training loop
for prompts in prompt_dataloader:
    # Generate responses from the current policy
    query_tensors = [tokenizer.encode(p, return_tensors="pt")[0] for p in prompts]
    response_tensors = ppo_trainer.generate(query_tensors, return_prompt=False, max_new_tokens=128)

    # Score the full prompt + response with the reward model
    rewards = []
    for query, response in zip(query_tensors, response_tensors):
        full_text = tokenizer.decode(torch.cat([query, response]))
        with torch.no_grad():
            inputs = tokenizer(full_text, return_tensors="pt")
            rewards.append(reward_model(inputs["input_ids"], inputs["attention_mask"]).squeeze())

    # PPO update (lists of query tensors, response tensors, and scalar reward tensors)
    stats = ppo_trainer.step(query_tensors, response_tensors, rewards)

    # Log mean reward and the (negative) KL penalty term
    print(f"Mean reward: {stats['ppo/mean_scores']:.2f}, KL penalty: {stats['ppo/mean_non_score_reward']:.4f}")
| Hyperparameter | Recommended Value | Notes |
|---|---|---|
| **SFT** | | |
| Learning rate | 2e-5 | Too high -> catastrophic forgetting |
| Epochs | 1-3 | More -> overfitting on small datasets |
| Batch size | 4-16 | Depends on GPU memory |
| **Reward Model** | | |
| Learning rate | 1e-5 | Lower than SFT |
| Batch size | 8-32 | Larger is better |
| **PPO** | | |
| Learning rate | 1e-6 | Very low for stability |
| β (KL coef) | 0.01-0.1 | Higher -> stay closer to SFT |
| Batch size | 16-64 | Trade-off: stable vs diverse |
| PPO epochs | 4 | Standard |
# Evaluation (evaluation.py): head-to-head win rate
def evaluate_win_rate(model_a, model_b, test_prompts, num_samples=100):
    """
    Present responses from both models to an evaluator and return
    the fraction of prompts where model_a is preferred.
    """
    wins = 0
    for prompt in test_prompts[:num_samples]:
        response_a = generate(model_a, prompt)  # your generation helper
        response_b = generate(model_b, prompt)
        # Human evaluation (or automated with GPT-4, see gpt4_judge below)
        preference = get_preference(prompt, response_a, response_b)
        if preference == "A":
            wins += 1
    return wins / num_samples

# Expected: RLHF > SFT with ≥75% win rate
import openai

def gpt4_judge(prompt, response_a, response_b):
    """Use GPT-4 to evaluate which response is better (legacy openai<1.0 API)."""
    judgment_prompt = f"""
Evaluate which response is more helpful, harmless, and honest.

Prompt: {prompt}

Response A: {response_a}

Response B: {response_b}

Which response is better? (A/B/Tie)
"""
    judgment = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": judgment_prompt}],
    )
    return judgment.choices[0].message.content.strip()
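To plug the GPT-4 judge into evaluate_win_rate above as its get_preference helper, a thin adapter is enough. The parsing below assumes the judge replies with a string starting with "A", "B", or "Tie"; rlhf_model, sft_model, and test_prompts are placeholders for your own models and held-out prompts.

def get_preference(prompt, response_a, response_b):
    # Map the judge's free-text verdict onto "A" / "B" / "Tie"
    verdict = gpt4_judge(prompt, response_a, response_b)
    if verdict.startswith("A"):
        return "A"
    if verdict.startswith("B"):
        return "B"
    return "Tie"

# Example: RLHF policy vs. SFT baseline on a held-out prompt set
win_rate = evaluate_win_rate(rlhf_model, sft_model, test_prompts, num_samples=100)
print(f"RLHF win rate vs. SFT: {win_rate:.0%}")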
Tip: penalize response length when computing rewards (e.g. reward = reward - 0.01 * length) to discourage the policy from gaming the reward model with ever-longer outputs.
Required Deliverables:
Deadline: 3 weeks from project start
Demo Website:
LinkedIn/Resume:
"Implemented complete RLHF pipeline to align language model with human preferences, achieving 78% win rate improvement over supervised fine-tuning baseline. Deployed interactive chatbot demonstrating helpfulness, harmlessness, and honesty alignment."
Good luck! RLHF is one of the most important techniques in modern AI, and this project will give you the skills to build aligned AI systems.
Related Projects: