Student starter code (30% baseline)
- index.html - Main HTML page
- script.js - JavaScript logic
- styles.css - Styling and layout
- package.json - Dependencies
- setup.sh - Setup script
- README.md - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
Runtime -> Run all (or press Ctrl+F9)

Expected First Run Time: ~90 seconds
The template comes with 65% working code:
Location: Section 5 - "Actor-Critic Architecture"
Current State: Actor network works, but critic network outputs random values
Your Task: Complete the value network that estimates V(s):
class ActorCritic(nn.Module):
    def __init__(self):
        super().__init__()
        # (shared feature layers that produce the 128-dim input to both heads go here)
        # Actor head (policy) - already implemented ✅
        self.actor = nn.Linear(128, n_actions)
        # Critic head (value function) - TODO
        self.critic = nn.Linear(128, 1)  # Output: V(s)

    def forward(self, state):
        # TODO: Implement critic forward pass
        # Return: action_probs, state_value
        pass  # <- replace with your implementation
Architecture Details:
Success Criteria:
Hints:
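If you get stuck, here is one possible shape of the completed network (a sketch only: the shared layer name `self.shared`, the input size `n_states`, and the 128-unit width are assumptions — adapt them to the template's actual names):

```python
import torch.nn as nn
import torch.nn.functional as F

class ActorCritic(nn.Module):
    def __init__(self, n_states, n_actions):
        super().__init__()
        # Assumed shared feature extractor producing the 128-dim representation
        self.shared = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU())
        self.actor = nn.Linear(128, n_actions)  # policy head
        self.critic = nn.Linear(128, 1)         # value head V(s)

    def forward(self, state):
        features = self.shared(state)
        action_probs = F.softmax(self.actor(features), dim=-1)  # π(a|s)
        state_value = self.critic(features)                     # V(s)
        return action_probs, state_value
```

Both outputs matter downstream: the probabilities feed the actor loss, the value feeds the TD target in Tasks 2 and 3.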
Location: Section 6 - "Training Loop"
Current State: Rewards are collected but advantage is not calculated
Your Task: Compute advantage function using TD error:
TD Error: δ = r + γ * V(s') - V(s)
Advantage (one-step TD approximation): A(s,a) ≈ δ = r + γ * V(s') - V(s)
Why Advantage?
Requirements:
- TD error: reward + gamma * V(next_state) - V(state)
- Normalize advantages: (advantages - mean) / (std + 1e-8)

Success Criteria:
Debugging Tips:
# Print advantage statistics
print(f"Advantage: mean={advantages.mean():.3f}, std={advantages.std():.3f}")
print(f"TD Error range: [{advantages.min():.3f}, {advantages.max():.3f}]")
Location: Section 6 - "Loss Calculation"
Current State: Policy updates without baseline (like REINFORCE)
Your Task: Implement policy gradient loss with advantage baseline:
Actor Loss: -log π(a|s) * A(s,a)
Critic Loss: MSE(V(s), r + γ * V(s'))
Total Loss: actor_loss + critic_loss
Formulas:
# Actor loss (policy gradient with baseline)
log_probs = torch.log(action_probs[range(batch_size), actions])
actor_loss = -(log_probs * advantages).mean()
# Critic loss (TD error squared)
td_targets = rewards + gamma * next_values * (1 - dones)
critic_loss = F.mse_loss(values, td_targets.detach())
# Combined loss
total_loss = actor_loss + critic_loss
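In the training loop, this combined loss typically drives a single optimizer step over all ActorCritic parameters (a fragment sketch; `optimizer` is assumed to be e.g. Adam over `model.parameters()`):

```python
optimizer.zero_grad()   # clear gradients from the previous update
total_loss.backward()   # backprop through both heads and the shared trunk
optimizer.step()        # one update for actor + critic together
```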
Key Concepts:
Success Criteria:
Expected Loss Curves:
Once you've completed all TODOs, try these advanced challenges:
Add entropy bonus to encourage exploration:
entropy = -(action_probs * torch.log(action_probs + 1e-8)).sum(dim=1).mean()
actor_loss = -(log_probs * advantages).mean() - 0.01 * entropy
Does this improve final performance?
Replace TD error with GAE for better variance-bias tradeoff:
A^GAE(t) = Σ_{k=0}^{∞} (γλ)^k * δ_{t+k}
where δ_t = r_t + γV(s_{t+1}) - V(s_t)
Implement with λ = 0.95. How does it compare?
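A possible implementation over a finished rollout (a sketch; the helper name and tensor arguments are illustrative):

```python
import torch

def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Generalized Advantage Estimation for one trajectory (sketch).

    rewards, values, next_values, dones: 1-D float tensors of equal length.
    """
    advantages = torch.zeros_like(rewards)
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * next_values[t] * (1 - dones[t]) - values[t]
        gae = delta + gamma * lam * (1 - dones[t]) * gae  # discounted sum of future deltas
        advantages[t] = gae
    return advantages
```

With λ = 0 this reduces to the one-step TD error from Task 2; with λ = 1 it recovers the full Monte Carlo advantage.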
Implement mini-batch updates:
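One common way to structure this (a sketch under these assumptions: `model` returns `(action_probs, state_values)` like the ActorCritic above, and N transitions are collected and stacked into tensors before calling this once; names are illustrative):

```python
import torch
import torch.nn.functional as F

def minibatch_update(model, optimizer, states, actions, rewards, next_states, dones, gamma=0.99):
    """One mini-batch Actor-Critic update (sketch).

    states/next_states: (N, obs_dim) float, actions: (N,) long, rewards/dones: (N,) float.
    """
    action_probs, values = model(states)
    values = values.squeeze(-1)

    with torch.no_grad():  # targets must not carry gradients
        _, next_values = model(next_states)
        td_targets = rewards + gamma * next_values.squeeze(-1) * (1 - dones)

    advantages = td_targets - values.detach()  # TD-error advantage, also gradient-free

    log_probs = torch.log(action_probs[torch.arange(len(actions)), actions] + 1e-8)
    actor_loss = -(log_probs * advantages).mean()
    critic_loss = F.mse_loss(values, td_targets)

    loss = actor_loss + critic_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```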
Test Actor-Critic on other Gymnasium environments:
- MountainCar-v0 (sparse rewards)
- LunarLander-v2 (continuous rewards, requires Box2D)
- Acrobot-v1 (swing-up task)

Which environments does Actor-Critic excel at?
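Swapping environments only requires changing the Gymnasium ID (LunarLander-v2 additionally needs `pip install gymnasium[box2d]`):

```python
import gymnasium as gym

# Same training code, different task: only the environment ID changes
env = gym.make("Acrobot-v1")  # or "MountainCar-v0", "LunarLander-v2"
n_states = env.observation_space.shape[0]
n_actions = env.action_space.n
```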
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution: Check your ActorCritic forward pass returns (action_probs, state_value). Both are needed!
Solution:
- reward + gamma * next_value * (1 - done)
- .detach() prevents gradient flow

Solution:
- -log_prob * advantage

Solution:
Solution:
- torch.log(prob + 1e-8)
- torch.clamp(advantages, -10, 10)

Solution: This is normal! Actor-Critic updates every step, so each episode takes longer in wall-clock time than with episodic methods; it converges faster in terms of episodes, though. If training takes >5 minutes for 1000 episodes, consider reducing logging frequency.
REINFORCE (Activity 04):
Actor-Critic (This Activity):
State → [Shared Features] → Actor  → π(a|s) (action probabilities)
                          → Critic → V(s)   (state value)
Actor (Policy Network):
Critic (Value Network):
Q(s,a) = expected return starting from state s, taking action a
V(s) = expected return starting from state s (average over actions)
A(s,a) = Q(s,a) - V(s) = how much better is action a than average?
Approximation: We approximate A(s,a) with TD error δ:
A(s,a) ≈ δ = r + γV(s') - V(s)
With the true value function V^π, δ is an unbiased estimate of the advantage; with a learned V it introduces a small bias but has much lower variance than using full Monte Carlo returns.
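For intuition: with r = 1, γ = 0.99, V(s') = 20 and V(s) = 22, the TD error is δ = 1 + 0.99 · 20 − 22 = −1.2, so the transition went worse than the critic expected and the chosen action's probability is pushed down.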
| Algorithm | Variance | Convergence Speed | Final Performance | Sample Efficiency |
|---|---|---|---|---|
| Random | N/A | Never | ~22 | Very Poor |
| REINFORCE | High | Slow (1000+ episodes) | 300-400 | Poor |
| Actor-Critic | Low | Fast (500-700 episodes) | 450-495 | Good |
| A2C (batch) | Medium | Medium (400-600 episodes) | 475-500 | Very Good |
| PPO | Low | Fast (300-500 episodes) | 490-500 | Excellent |
Actor-Critic hits the sweet spot between simplicity and performance!
Submission Checklist:
- activity-05-[YourName].ipynb

After completing this activity:
Key Insight: Actor-Critic introduced you to dual-network architectures and advantage functions. PPO builds on these concepts to create one of the most robust RL algorithms used in production today (OpenAI, DeepMind)!
Good luck! Actor-Critic is a pivotal algorithm in modern RL. Understanding how actor and critic work together is key to mastering advanced algorithms like PPO and SAC! 🚀