Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 5 - "Actor-Critic Architecture"
Current State: Actor network works, but critic network outputs random values
Your Task: Complete the value network that estimates V(s):
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # (shared feature layers that produce the 128-dim input to both heads go here)
        # Actor head (policy) - already implemented ✅
        self.actor = nn.Linear(128, n_actions)
        # Critic head (value function) - TODO
        self.critic = nn.Linear(128, 1)  # Output: V(s)

    def forward(self, state):
        # TODO: Implement critic forward pass
        # Return: action_probs, state_value
        pass  # <- replace with your implementation
Architecture Details:
Success Criteria:
Hints:
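If you get stuck, here is one possible shape of the completed network (a sketch only: the shared layer name `self.shared`, the input size `n_states`, and the 128-unit width are assumptions — adapt them to the template's actual names):

```python
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        # Assumed shared feature extractor producing the 128-dim representation
        self.shared = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU())
        self.actor = nn.Linear(128, n_actions)  # policy head
        self.critic = nn.Linear(128, 1)         # value head V(s)

    def forward(self, state):
        features = self.shared(state)
        action_probs = F.softmax(self.actor(features), dim=-1)  # π(a|s)
        state_value = self.critic(features)                     # V(s)
        return action_probs, state_value
```

Both outputs matter downstream: the probabilities feed the actor loss, the value feeds the TD target in Tasks 2 and 3.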
Location: Section 6 - "Training Loop"
Current State: Rewards are collected but advantage is not calculated
Your Task: Compute advantage function using TD error:
TD Error: δ = r + γ * V(s') - V(s)
Advantage (one-step TD approximation): A(s,a) ≈ δ = r + γ * V(s') - V(s)
Why Advantage?
Requirements:
- TD error: reward + gamma * V(next_state) - V(state)
- Normalize advantages: (advantages - mean) / (std + 1e-8)

Success Criteria:
Debugging Tips:
# Print advantage statistics
print(f"Advantage: mean={advantages.mean():.3f}, std={advantages.std():.3f}")
print(f"TD Error range: [{advantages.min():.3f}, {advantages.max():.3f}]")
Location: Section 6 - "Loss Calculation"
Current State: Policy updates without baseline (like REINFORCE)
Your Task: Implement policy gradient loss with advantage baseline:
Actor Loss: -log π(a|s) * A(s,a)
Critic Loss: MSE(V(s), r + γ * V(s'))
Total Loss: actor_loss + critic_loss
Formulas:
# Actor loss (policy gradient with baseline)
log_probs = torch.log(action_probs[range(batch_size), actions])
actor_loss = -(log_probs * advantages).mean()
# Critic loss (TD error squared)
td_targets = rewards + gamma * next_values * (1 - dones)
critic_loss = F.mse_loss(values, td_targets.detach())
# Combined loss
total_loss = actor_loss + critic_loss
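In the training loop, this combined loss typically drives a single optimizer step over all ActorCritic parameters (a fragment sketch; `optimizer` is assumed to be e.g. Adam over `model.parameters()`):

```python
optimizer.zero_grad()   # clear gradients from the previous update
total_loss.backward()   # backprop through both heads and the shared trunk
optimizer.step()        # one update for actor + critic together
```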
Key Concepts:
Success Criteria:
Expected Loss Curves:
Once you've completed all TODOs, try these advanced challenges:
Add entropy bonus to encourage exploration:
entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=1).mean()
actor_loss = -(log_probs * advantages).mean() - 0.01 * entropy
Does this improve final performance?
Replace TD error with GAE for better variance-bias tradeoff:
A^GAE(t) = Σ_{k=0}^{∞} (γλ)^k * δ_{t+k}
where δ_t = r_t + γV(s_{t+1}) - V(s_t)
Implement with λ = 0.95. How does it compare?
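A possible implementation over a finished rollout (a sketch; the helper name and tensor arguments are illustrative):

```python
import torch

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (sketch).

    rewards, values, next_values, dones: 1-D float tensors of equal length.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae  # discounted sum of future deltas
        advantages[t] = gae
    return advantages
```

With λ = 0 this reduces to the one-step TD error from Task 2; with λ = 1 it recovers the full Monte Carlo advantage.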
Implement mini-batch updates:
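One common way to structure this (a sketch under these assumptions: `model` returns `(action_probs, state_values)` like the ActorCritic above, and N transitions are collected and stacked into tensors before calling this once; names are illustrative):

```python
import torch
import torch.nn.functional as F

def minibatch_update(model, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    """One mini-batch Actor-Critic update (sketch).

    states/next_states: (N, obs_dim) float, actions: (N,) long, rewards/dones: (N,) float.
    """
    action_probs, values = model(states)
    values = values.squeeze(-1)

    with torch.no_grad():  # targets must not carry gradients
        _, next_values = model(next_states)
        td_targets = rewards + gamma * next_values.squeeze(-1) * (1 - dones)

    advantages = td_targets - values.detach()  # TD-error advantage, also gradient-free

    log_probs = torch.log(action_probs[torch.arange(len(actions)), actions] + 1e-8)
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = F.mse_loss(values, td_targets)

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```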
Test Actor-Critic on other Gymnasium environments:
- MountainCar-v0 (sparse rewards)
- LunarLander-v2 (continuous rewards, requires Box2D)
- Acrobot-v1 (swing-up task)

Which environments does Actor-Critic excel at?
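Swapping environments only requires changing the Gymnasium ID (LunarLander-v2 additionally needs `pip install gymnasium[box2d]`):

```python
import gymnasium as gym

# Same training code, different task: only the environment ID changes
env = gym.make("Acrobot-v1")  # or "MountainCar-v0", "LunarLander-v2"
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n
```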
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution: Check your ActorCritic forward pass returns (action_probs, state_value). Both are needed!
Solution:
- reward + gamma * next_value * (1 - done)
- .detach() prevents gradient flow

Solution:
- -log_prob * advantage

Solution:
Solution:
- torch.log(prob + 1e-8)
- torch.clamp(advantages, -10, 10)

Solution: This is normal! Actor-Critic updates every step, so each episode takes longer in wall-clock time than with episodic methods; it converges faster in terms of episodes, though. If training takes >5 minutes for 1000 episodes, consider reducing logging frequency.
REINFORCE (Activity 04):
Actor-Critic (This Activity):
State → [Shared Features] → Actor  → π(a|s) (action probabilities)
                          → Critic → V(s)   (state value)
Actor (Policy Network):
Critic (Value Network):
Q(s,a) = expected return starting from state s, taking action a
V(s) = expected return starting from state s (average over actions)
A(s,a) = Q(s,a) - V(s) = how much better is action a than average?
Approximation: We approximate A(s,a) with TD error δ:
A(s,a) ≈ δ = r + γV(s') - V(s)
With the true value function V^π, δ is an unbiased estimate of the advantage; with a learned V it introduces a small bias but has much lower variance than using full Monte Carlo returns.
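For intuition: with r = 1, γ = 0.99, V(s') = 20 and V(s) = 22, the TD error is δ = 1 + 0.99 · 20 − 22 = −1.2, so the transition went worse than the critic expected and the chosen action's probability is pushed down.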
| Algorithm | Variance | Convergence Speed | Final Performance | Sample Efficiency |
|---|---|---|---|---|
| Random | N/A | Never | ~22 | Very Poor |
| REINFORCE | High | Slow (1000+ episodes) | 300-400 | Poor |
| Actor-Critic | Low | Fast (500-700 episodes) | 450-495 | Good |
| A2C (batch) | Medium | Medium (400-600 episodes) | 475-500 | Very Good |
| PPO | Low | Fast (300-500 episodes) | 490-500 | Excellent |
Actor-Critic hits the sweet spot between simplicity and performance!
Submission Checklist:
- activity-05-[YourName].ipynb

After completing this activity:
Key Insight: Actor-Critic introduced you to dual-network architectures and advantage functions. PPO builds on these concepts to create one of the most robust RL algorithms used in production today (OpenAI, DeepMind)!
Good luck! Actor-Critic is a pivotal algorithm in modern RL. Understanding how actor and critic work together is key to mastering advanced algorithms like PPO and SAC! 🚀