Practice and reinforce the concepts from Lesson 5
In this activity, you'll implement an Advantage Actor-Critic (A2C) agent that combines policy gradients with value function learning. You'll train an agent on a continuous control task and see how the critic reduces variance for more stable learning!
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-05-actor-critic-methods.zip (Templates/AI25-Template-activity-05-actor-critic-methods.zip)
Upload activity-05-actor-critic-methods.ipynb to Google Colab
Execute the first few cells to:
Pendulum-v1:
Challenge: Continuous actions require a Gaussian policy!
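For orientation, the environment can be created like this (assuming the template uses the Gymnasium API); the observation is [cos θ, sin θ, θ̇] and the single action is a torque in [-2, 2]:

import gymnasium as gym

env = gym.make("Pendulum-v1")
print(env.observation_space)  # Box(3,): cos(theta), sin(theta), angular velocity
print(env.action_space)       # Box(-2.0, 2.0, (1,)): a single torque value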
TODO 1: Complete the actor network for continuous actions
class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.mean = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        # TODO 1: Complete forward pass
        # Return mean and std for Gaussian policy
        pass

    def select_action(self, state):
        # TODO 1: Sample action from Gaussian distribution
        # 1. Get mean and std from forward()
        # 2. Create Normal distribution
        # 3. Sample action
        # 4. Compute log probability
        # 5. Return action, log_prob
        pass
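For reference, one way TODO 1 could be completed. Treat this as a sketch, not the official solution; it leaves clipping to Pendulum's [-2, 2] torque range to the caller.

import torch
from torch.distributions import Normal

# Inside the Actor class -- one possible sketch:
def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    mean = self.mean(x)
    std = torch.exp(self.log_std)        # state-independent, learnable std
    return mean, std

def select_action(self, state):
    mean, std = self.forward(state)
    dist = Normal(mean, std)             # Gaussian policy over continuous actions
    action = dist.sample()
    log_prob = dist.log_prob(action).sum(dim=-1)  # sum over action dimensions
    return action, log_prob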
TODO 2: Complete the critic (value function) network
class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.value = nn.Linear(128, 1)

    def forward(self, state):
        # TODO 2: Complete forward pass
        # Return scalar value estimate
        pass
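A matching sketch for TODO 2: the critic simply maps a state to a single value estimate.

# Inside the Critic class -- one possible sketch:
def forward(self, state):
    x = torch.relu(self.fc1(state))
    x = torch.relu(self.fc2(x))
    return self.value(x)   # shape (batch, 1): estimate of V(s)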
TODO 3: Compute temporal difference error as advantage
def compute_advantage(state, action, reward, next_state, done, critic, gamma=0.99):
    # TODO 3: Compute TD error (advantage estimate)
    # δ = r + γ*V(s') - V(s)
    # If done: δ = r - V(s)
    pass
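One possible way to fill in TODO 3, assuming state and next_state are already torch tensors and the returned advantage should carry no gradient:

import torch

def compute_advantage(state, action, reward, next_state, done, critic, gamma=0.99):
    # δ = r + γ*V(s') - V(s); if the episode ended, there is no bootstrap term
    with torch.no_grad():
        value = critic(state)
        next_value = critic(next_state)
        target = reward + gamma * next_value * (1.0 - float(done))
        return target - value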
TODO 4: Implement actor and critic updates
def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # TODO 4: Implement actor-critic update
    # 1. Compute advantage (from TODO 3)
    # 2. Compute actor loss (policy gradient with advantage)
    # 3. Compute critic loss (TD error squared)
    # 4. Update both networks
    pass
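A sketch of how TODO 4 could tie everything together. It recomputes the log-probability from the stored action; your template may instead pass log_prob in directly, so adapt as needed.

import torch
from torch.distributions import Normal

def actor_critic_update(actor, critic, actor_opt, critic_opt,
                        state, action, reward, next_state, done, gamma=0.99):
    # 1. Advantage = TD error (computed without gradients in the TODO 3 sketch)
    advantage = compute_advantage(state, action, reward, next_state, done, critic, gamma)

    # 2. Actor loss: advantage-weighted negative log-probability
    mean, std = actor(state)
    dist = Normal(mean, std)
    log_prob = dist.log_prob(action).sum(dim=-1)
    actor_loss = -(advantage * log_prob).mean()

    # 3. Critic loss: squared TD error against a bootstrapped (no-grad) target
    with torch.no_grad():
        target = reward + gamma * critic(next_state) * (1.0 - float(done))
    critic_loss = (critic(state) - target).pow(2).mean()

    # 4. Update both networks
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()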
TODO 5 (Optional): Implement Generalized Advantage Estimation
def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    # TODO 5: Implement GAE(λ)
    # Exponentially-weighted average of n-step advantages
    pass
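One way the optional TODO 5 could be sketched, assuming rewards and values are plain Python lists from a single rollout, next_value bootstraps the final step, and terminal-state handling is omitted:

def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    values = values + [next_value]   # append bootstrap value for the last step
    advantages = []
    gae = 0.0
    for t in reversed(range(len(rewards))):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # 1-step TD error
        gae = delta + gamma * lam * gae                          # exponentially weighted sum
        advantages.insert(0, gae)
    return advantages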
The main training loop is provided; you'll complete:
Episodes 0-50: Random swinging, reward ~-1500 to -1000
Episodes 50-150: Learning to reduce swings, reward improves to -800
Episodes 150-300: Approaching upright position, reward reaches -400
Episodes 300+: Consistent upright balance, reward > -200 (SOLVED!)
Pendulum-v1 is considered "solved" when:
Average reward > -200 over 100 consecutive episodes
Actor-critic typically solves this in 300-600 episodes.
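In the training loop, this criterion can be tracked with a sliding window of the last 100 episode returns; episode_reward below is a hypothetical name for the return of the episode just finished:

from collections import deque

recent_rewards = deque(maxlen=100)     # returns of the last 100 episodes

# after each episode:
recent_rewards.append(episode_reward)  # episode_reward: hypothetical variable
if len(recent_rewards) == 100 and sum(recent_rewards) / 100 > -200:
    print("Solved: average reward over the last 100 episodes exceeds -200")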
Your implementation is complete when:
Key Insight:
Actor: Policy π(a|s; θ) → Choose actions
Critic: Value V(s; w) → Reduce variance
TD error δ = r + γ*V(s') - V(s) ≈ Advantage A(s,a)
Actor update: θ ← θ + α_θ * δ * ∇ log π(a|s)
Critic update: w ← w + α_w * δ * ∇ V(s)
# Network
actor_lr = 0.0001 # Lower than critic
critic_lr = 0.001 # Higher (learn faster)
hidden_size = 128
# Training
gamma = 0.99
max_steps = 200 # Pendulum episode length
# Variance reduction
use_gae = False # Start simple, then try GAE
gae_lambda = 0.95 # If using GAE
# Exploration
action_noise_std = 0.1 # Add exploration noise
Problem: Critic values explode or become NaN
Problem: Actor doesn't learn
Problem: Training unstable (check learning rates: actor << critic)
Problem: Slow convergence
Critical: The actor's learning rate should be lower than the critic's!
actor_lr = 0.0001 # Slow, careful policy updates
critic_lr = 0.001 # Fast value learning
# Ratio typically 1:10 or 1:5
Why? Large policy changes can cause catastrophic performance drops, while the critic can recover from bad value estimates much more easily.
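In code, this just means giving each network its own optimizer (Adam is assumed here; actor and critic are instances of the classes above):

import torch

actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)     # slow, careful policy updates
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)   # faster value learning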
Share early layers between actor and critic:
class ActorCritic(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        # Shared feature extractor
        self.shared = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 128),
            nn.ReLU()
        )
        # Actor head
        self.actor_mean = nn.Linear(128, action_dim)
        self.actor_log_std = nn.Parameter(torch.zeros(action_dim))
        # Critic head
        self.critic = nn.Linear(128, 1)
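A forward pass for this shared architecture could return the policy parameters and the value together (a sketch; other splits are possible):

# Inside the ActorCritic class:
def forward(self, state):
    features = self.shared(state)
    mean = self.actor_mean(features)
    std = torch.exp(self.actor_log_std)
    value = self.critic(features)
    return mean, std, value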
Use n-step returns instead of 1-step TD:
def compute_n_step_returns(rewards, values, n=5, gamma=0.99):
    returns = []
    for t in range(len(rewards)):
        # Sum up to n discounted rewards starting at step t
        G = sum(gamma**k * rewards[t + k] for k in range(min(n, len(rewards) - t)))
        # Bootstrap with the critic's estimate if the window ends before the episode does
        if t + n < len(rewards):
            G += gamma**n * values[t + n]
        returns.append(G)
    return returns
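Note that this helper only bootstraps when a full n-step window fits inside the rollout; for a time-limited environment like Pendulum (truncated at 200 steps) you may also want to bootstrap the tail with the value of the final next state. A hypothetical usage over one collected episode (episode_rewards and episode_values are assumed names):

returns = compute_n_step_returns(episode_rewards, episode_values, n=5)
advantages = [G - v for G, v in zip(returns, episode_values)]   # n-step advantages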
Implement asynchronous advantage actor-critic (A3C) with multiple parallel workers.
Try other continuous control environments:
Completed Notebook: activity-05-actor-critic-methods.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll learn PPO (Proximal Policy Optimization), the gold standard for policy optimization!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy building your actor-critic agent! 🎭🎬