Practice and reinforce the concepts from Lesson 5
In this activity, you'll implement an Advantage Actor-Critic (A2C) agent that combines policy gradients with value function learning. You'll train an agent on a continuous control task and see how the critic reduces variance for more stable learning!
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-05-actor-critic-methods.zip (Templates/AI25-Template-activity-05-actor-critic-methods.zip)
Upload activity-05-actor-critic-methods.ipynb to Google Colab
Execute the first few cells to:
Pendulum-v1:
Challenge: Continuous actions require a Gaussian policy!
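For orientation, the environment can be created like this (assuming the template uses the Gymnasium API); the observation is [cos θ, sin θ, θ̇] and the single action is a torque in [-2, 2]:

import gymnasium as gym

env = gym.make("Pendulum-v1")
print(env.observation_space)  # Box(3,): cos(theta), sin(theta), angular velocity
print(env.action_space)       # Box(-2.0, 2.0, (1,)): a single torque value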
TODO 1: Complete the actor network for continuous actions
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.mean = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        # TODO 1: Complete forward pass
        # Return mean and std for Gaussian policy
        pass

    def select_action(self, state):
        # TODO 1: Sample action from Gaussian distribution
        # 1. Get mean and std from forward()
        # 2. Create Normal distribution
        # 3. Sample action
        # 4. Compute log probability
        # 5. Return action, log_prob
        pass
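For reference, one way TODO 1 could be completed. Treat this as a sketch, not the official solution; it leaves clipping to Pendulum's [-2, 2] torque range to the caller.

import torch
from torch.distributions import Normal

# Inside the Actor class -- one possible sketch:
def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    mean = self.mean(x)
    std = torch.exp(self.log_std)        # state-independent, learnable std
    return mean, std

def select_action(self, state):
    mean, std = self.forward(state)
    dist = Normal(mean, std)             # Gaussian policy over continuous actions
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dimensions
    return action, log_prob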
TODO 2: Complete the critic (value function) network
class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.value = nn.Linear(128, 1)

    def forward(self, state):
        # TODO 2: Complete forward pass
        # Return scalar value estimate
        pass
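A matching sketch for TODO 2: the critic simply maps a state to a single value estimate.

# Inside the Critic class -- one possible sketch:
def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    return self.value(x)   # shape (batch, 1): estimate of V(s)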
TODO 3: Compute temporal difference error as advantage
def compute_advantage(state, action, reward, next_state, done, critic, gamma=0.99):
    # TODO 3: Compute TD error (advantage estimate)
    # δ = r + γ*V(s') - V(s)
    # If done: δ = r - V(s)
    pass
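One possible way to fill in TODO 3, assuming state and next_state are already torch tensors and the returned advantage should carry no gradient:

import torch

def compute_advantage(state, action, reward, next_state, done, critic, gamma=0.99):
    # δ = r + γ*V(s') - V(s); if the episode ended, there is no bootstrap term
    with torch.no_grad():
        value = critic(state)
        next_value = critic(next_state)
        target = reward + gamma * next_value * (1.0 - float(done))
        return target - value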
TODO 4: Implement actor and critic updates
def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # TODO 4: Implement actor-critic update
    # 1. Compute advantage (from TODO 3)
    # 2. Compute actor loss (policy gradient with advantage)
    # 3. Compute critic loss (TD error squared)
    # 4. Update both networks
    pass
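A sketch of how TODO 4 could tie everything together. It recomputes the log-probability from the stored action; your template may instead pass log_prob in directly, so adapt as needed.

import torch
from torch.distributions import Normal

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # 1. Advantage = TD error (computed without gradients in the TODO 3 sketch)
    advantage = compute_advantage(state, action, reward, next_state, done, critic, gamma)

    # 2. Actor loss: advantage-weighted negative log-probability
    mean, std = actor(state)
    dist = Normal(mean, std)
    log_prob = dist.log_prob(action).sum(dim=-1)
    actor_loss = -(advantage * log_prob).mean()

    # 3. Critic loss: squared TD error against a bootstrapped (no-grad) target
    with torch.no_grad():
        target = reward + gamma * critic(next_state) * (1.0 - float(done))
    critic_loss = (critic(state) - target).pow(2).mean()

    # 4. Update both networks
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()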
TODO 5 (Optional): Implement Generalized Advantage Estimation
def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    # TODO 5: Implement GAE(λ)
    # Exponentially-weighted average of n-step advantages
    pass
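One way the optional TODO 5 could be sketched, assuming rewards and values are plain Python lists from a single rollout, next_value bootstraps the final step, and terminal-state handling is omitted:

def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    values = values + [next_value]   # append bootstrap value for the last step
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # 1-step TD error
        gae = delta + gamma * lam * gae                          # exponentially weighted sum
        advantages.insert(0, gae)
    return advantages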
The main training loop is provided; you'll complete:
Episodes 0-50: Random swinging, reward ~-1500 to -1000
Episodes 50-150: Learning to reduce swings, reward improves to -800
Episodes 150-300: Approaching upright position, reward reaches -400
Episodes 300+: Consistent upright balance, reward > -200 (SOLVED!)
Pendulum-v1 is considered "solved" when:
Average reward > -200 over 100 consecutive episodes
Actor-critic typically solves this in 300-600 episodes.
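In the training loop, this criterion can be tracked with a sliding window of the last 100 episode returns; episode_reward below is a hypothetical name for the return of the episode just finished:

from collections import deque

recent_rewards = deque(maxlen=100)     # returns of the last 100 episodes

# after each episode:
recent_rewards.append(episode_reward)  # episode_reward: hypothetical variable
if len(recent_rewards) == 100 and sum(recent_rewards) / 100 > -200:
    print("Solved: average reward over the last 100 episodes exceeds -200")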
Your implementation is complete when:
Key Insight:
Actor: Policy π(a|s; θ) → Choose actions
Critic: Value V(s; w) → Reduce variance
TD error δ = r + γ*V(s') - V(s) ≈ Advantage A(s,a)
Actor update: θ ← θ + α_θ * δ * ∇ log π(a|s)
Critic update: w ← w + α_w * δ * ∇ V(s)
# Network
actor_lr = 0.0001 # Lower than critic
critic_lr = 0.001 # Higher (learn faster)
hidden_size = 128
# Training
gamma = 0.99
max_steps = 200 # Pendulum episode length
# Variance reduction
use_gae = False # Start simple, then try GAE
gae_lambda = 0.95 # If using GAE
# Exploration
action_noise_std = 0.1 # Add exploration noise
Problem: Critic values explode or become NaN
Problem: Actor doesn't learn
Problem: Training unstable (check learning rates: actor << critic)
Problem: Slow convergence
Critical: The actor's learning rate should be lower than the critic's!
actor_lr = 0.0001 # Slow, careful policy updates
critic_lr = 0.001 # Fast value learning
# Ratio typically 1:10 or 1:5
Why? Large policy changes can cause catastrophic performance drops, while the critic can recover from bad value estimates much more easily.
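In code, this just means giving each network its own optimizer (Adam is assumed here; actor and critic are instances of the classes above):

import torch

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)     # slow, careful policy updates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)   # faster value learning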
Share early layers between actor and critic:
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU()
        )
        # Actor head
        self.actor_mean = nn.Linear(128, action_dim)
        self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        # Critic head
        self.critic = nn.Linear(128, 1)
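A forward pass for this shared architecture could return the policy parameters and the value together (a sketch; other splits are possible):

# Inside the ActorCritic class:
def forward(self, state):
    features = self.shared(state)
    mean = self.actor_mean(features)
    std = torch.exp(self.actor_log_std)
    value = self.critic(features)
    return mean, std, value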
Use n-step returns instead of 1-step TD:
def compute_n_step_returns(rewards, values, n=5, gamma=0.99):
    returns = []
    for t in range(len(rewards)):
        # Sum up to n discounted rewards starting at step t
        G = sum(gamma**k * rewards[t + k] for k in range(min(n, len(rewards) - t)))
        # Bootstrap with the critic's estimate if the window ends before the episode does
        if t + n < len(rewards):
            G += gamma**n * values[t + n]
        returns.append(G)
    return returns
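Note that this helper only bootstraps when a full n-step window fits inside the rollout; for a time-limited environment like Pendulum (truncated at 200 steps) you may also want to bootstrap the tail with the value of the final next state. A hypothetical usage over one collected episode (episode_rewards and episode_values are assumed names):

returns = compute_n_step_returns(episode_rewards, episode_values, n=5)
advantages = [G - v for G, v in zip(returns, episode_values)]   # n-step advantages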
Implement asynchronous advantage actor-critic (A3C) with multiple parallel workers.
Try other continuous control environments:
Completed Notebook: activity-05-actor-critic-methods.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll learn PPO (Proximal Policy Optimization), the gold standard for policy optimization!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy building your actor-critic agent! 🎭🎬