By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 4 - "Policy Network"
Current State: Network structure is defined but forward pass is incomplete
Your Task: Complete the forward pass to convert states to action probabilities
Requirements:
Starter Code Provided:
```python
class PolicyNetwork(nn.Module):
    def forward(self, state):
        # TODO: Implement forward pass
        # Hint: x = F.relu(self.fc1(state))
        # Hint: logits = self.fc2(x)
        # Hint: action_probs = F.softmax(logits, dim=-1)
        pass
```
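For reference, the hints above assemble into something like the sketch below. The layer sizes are assumptions for CartPole (4-dimensional state, 2 actions) with a hidden width of 128, not values taken from the template:

```python
import torch.nn as nn
import torch.nn.functional as F

class PolicyNetworkSketch(nn.Module):
    """Hypothetical completed network -- layer sizes are assumed, not from the template."""
    def __init__(self, state_dim=4, n_actions=2, hidden=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden)  # state -> hidden features
        self.fc2 = nn.Linear(hidden, n_actions)  # hidden features -> action logits

    def forward(self, state):
        x = F.relu(self.fc1(state))              # nonlinear hidden layer
        logits = self.fc2(x)                     # unnormalized action scores
        return F.softmax(logits, dim=-1)         # action probabilities (sum to 1)
```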
Success Criteria:
Location: Section 6 - "Training Algorithm"
Current State: Episode collection works but policy updates are missing
Your Task: Implement the REINFORCE update rule using policy gradient theorem
The Policy Gradient Theorem:
∇J(θ) = E[∇log π(a|s) × G_t]
Where:
- π(a|s) = policy probability of action a in state s
- G_t = return (discounted sum of future rewards)
- ∇log π(a|s) = score function (gradient of log probability)

PyTorch Implementation:
```python
# For each (state, action, return) in episode:
# 1. Get action probabilities: probs = policy_network(state)
# 2. Calculate log probability: log_prob = torch.log(probs[action])
# 3. Calculate loss: -log_prob × return
# 4. Accumulate loss over episode
# 5. Backpropagate and update
```
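A tiny runnable illustration of steps 1-3 with made-up numbers; the `probs` tensor stands in for the network output and is not the actual `policy_network`:

```python
import torch

probs = torch.tensor([0.7, 0.3], requires_grad=True)  # pretend output of policy_network(state)
action, G = 0, 5.0                                     # sampled action and its return
log_prob = torch.log(probs[action])                    # step 2: log π(a|s)
loss = -log_prob * G                                   # step 3: negative, because optimizers minimize
loss.backward()                                        # steps 4-5 happen once per episode in practice
print(probs.grad)                                      # negative grad for action 0 -> gradient descent raises its probability
```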
Starter Code Provided:
```python
def reinforce_update(policy_net, optimizer, episode_data, gamma=0.99):
    # TODO: Implement REINFORCE algorithm
    # 1. Calculate returns from rewards
    # 2. For each step, compute log probability
    # 3. Compute policy gradient loss
    # 4. Backpropagate and update policy
    pass
```
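For orientation, a minimal sketch of one way the pieces fit together, assuming `episode_data` is a list of `(state_tensor, action_index, reward)` tuples; that format (and the function name) is an assumption, not taken from the template:

```python
import torch

def reinforce_update_sketch(policy_net, optimizer, episode_data, gamma=0.99):
    """Hypothetical REINFORCE update; episode_data = [(state, action, reward), ...]."""
    # 1. Discounted returns, computed backwards: G_t = r_t + gamma * G_{t+1}
    returns, G = [], 0.0
    for _, _, reward in reversed(episode_data):
        G = reward + gamma * G
        returns.insert(0, G)

    # 2-3. Accumulate the per-step losses: -log π(a_t|s_t) × G_t
    loss = 0.0
    for (state, action, _), G_t in zip(episode_data, returns):
        probs = policy_net(state)
        log_prob = torch.log(probs[action] + 1e-8)  # epsilon guards against log(0)
        loss = loss - log_prob * G_t

    # 4. Backpropagate and update the policy parameters
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```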
Success Criteria:
Key Insight:
Location: Section 7 - "Variance Reduction"
Current State: REINFORCE works but has high variance
Your Task: Reduce gradient variance using a baseline
Why Baselines Help:

Subtracting a baseline that does not depend on the chosen action leaves the expected gradient unchanged but shrinks its variance, so updates are less noisy and learning is more stable.
Modified Update Rule:
∇J(θ) = E[∇log π(a|s) × (G_t - b)]
Where b = average return over episode
Requirements:
- b = mean(all_returns_in_episode)
- A_t = G_t - b for each timestep

Starter Code Provided:
```python
def reinforce_with_baseline(policy_net, optimizer, episode_data, gamma=0.99):
    # TODO: Implement baseline subtraction
    # 1. Calculate returns (same as TODO 2)
    # 2. Calculate baseline: baseline = mean(returns)
    # 3. Calculate advantages: advantages = returns - baseline
    # 4. Use advantages in loss instead of returns
    pass
```
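The only new piece relative to TODO 2 is the advantage computation; a small sketch with made-up return values:

```python
import torch

returns = torch.tensor([10.0, 8.0, 5.0, 1.0])  # illustrative G_t values for one episode
baseline = returns.mean()                       # b = average return over the episode
advantages = returns - baseline                 # A_t = G_t - b

# In the loss, weight each -log π(a_t|s_t) by advantages[t] instead of returns[t].
```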
Success Criteria:
Once you've completed all TODOs, try these advanced challenges:
Implement Actor-Critic instead of pure REINFORCE:
Implement GAE(λ) for better bias-variance tradeoff:
A^GAE_t = Σ_k (γλ)^k δ_{t+k}, where δ_t = R + γV(s') - V(s)

Add entropy bonus to encourage exploration:
H(π) = -Σ_a π(a|s) log π(a|s)

Per-step loss with entropy bonus: -log_prob × advantage - β × entropy (see the sketch after this list)

Test your REINFORCE implementation on other environments:
Implement simplified Proximal Policy Optimization:
r = π_new(a|s) / π_old(a|s)

Clipped objective: min(r × A, clip(r, 1-ε, 1+ε) × A)
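As referenced in the entropy-bonus challenge above, a small sketch of how the entropy term enters the loss; the tensor values and β = 0.01 are arbitrary choices for illustration:

```python
import torch

probs = torch.tensor([0.6, 0.4])                        # illustrative π(a|s) for one state
log_probs = torch.log(probs + 1e-8)
entropy = -(probs * log_probs).sum()                    # H(π) = -Σ π(a|s) log π(a|s)

beta, advantage, action = 0.01, 2.0, 0                  # assumed coefficient and sampled step
loss = -log_probs[action] * advantage - beta * entropy  # higher entropy -> lower loss -> more exploration
```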
Minimum Requirements (for passing):

Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution:
- Use torch.log(), not torch.log10()
- Use torch.log(probs + 1e-8) to avoid log(0)

Solution:

- Use -log_prob × return (note the negative sign!)

Solution:

- Normalize returns: returns = (returns - mean) / (std + 1e-8)

Solution: CartPole is lightweight, so this shouldn't happen. If it does:

- device = torch.device('cpu')
- torch.cuda.empty_cache()
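Two of these fixes in one runnable snippet (the tensor values are made up):

```python
import torch

# Safe log: add a small epsilon so log(0) never appears
probs = torch.tensor([1.0, 0.0])
safe_log = torch.log(probs + 1e-8)

# Normalize returns to reduce variance across episodes
returns = torch.tensor([9.0, 7.5, 5.0, 2.0])
normalized = (returns - returns.mean()) / (returns.std() + 1e-8)
```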
| Aspect | Value-Based (Q-Learning) | Policy-Based (REINFORCE) |
|---|---|---|
| What it learns | Q(s,a) values | π(a|s) probabilities |
| Action selection | argmax Q(s,a) | Sample from π(a|s) |
| Storage | Q-table or Q-network | Policy network |
| Exploration | Epsilon-greedy | Stochastic policy |
| Continuous actions | Hard (must discretize) | Natural (can output continuous distributions) |
| Convergence | Can diverge with function approximation | More stable with neural networks |
| Sample efficiency | More efficient (TD learning) | Less efficient (Monte Carlo) |
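To make the "Action selection" row concrete, a small sketch contrasting the two approaches (the tensors are illustrative, not real network outputs):

```python
import torch

q_values = torch.tensor([1.2, 0.7])           # value-based: act greedily on Q(s, a)
greedy_action = torch.argmax(q_values).item()

probs = torch.tensor([0.8, 0.2])              # policy-based: sample from π(a|s)
sampled_action = torch.multinomial(probs, num_samples=1).item()
```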
Goal: Maximize expected return J(θ) = E[G_t]
Problem: Can't differentiate expectation over stochastic policy
Solution: Policy gradient theorem transforms it to:
∇J(θ) = E[∇log π(a|s) × G_t]
Why it works:
- ∇log π(a|s) tells us which direction in parameter space increases the probability of action a
- Weighting by G_t pushes up the probability of high-return actions more strongly
- PyTorch automatically computes ∇log π(a|s) via backpropagation!
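The step that makes this work is the log-derivative trick. For a one-step (bandit-style) objective, where the return G(a) depends only on the sampled action, the gradient moves inside the expectation:

∇_θ E[G] = ∇_θ Σ_a π_θ(a|s) × G(a) = Σ_a ∇_θ π_θ(a|s) × G(a)

Using ∇_θ π_θ(a|s) = π_θ(a|s) × ∇_θ log π_θ(a|s), this becomes

Σ_a π_θ(a|s) × ∇_θ log π_θ(a|s) × G(a) = E[∇_θ log π_θ(a|s) × G(a)]

The same identity applied to whole trajectories gives the theorem above, and sampling episodes gives an unbiased estimate of this expectation.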
Submission Checklist:
- activity-04-[YourName].ipynb

After completing this activity:
Key Insight:
You've now learned BOTH major families of RL algorithms:
In Activity 05, you'll combine them into Actor-Critic, getting the best of both worlds!
Comparison to Activity 02:
Good luck! Policy gradients are the foundation of modern deep RL (PPO, TRPO, SAC). Master this, and you'll understand how robots learn to walk and game AIs reach championship level! 🚀