ℹ️ Definition: Actor-Critic methods combine value-based and policy-based reinforcement learning by using a critic to evaluate actions (reducing variance) while an actor learns the policy (enabling continuous actions and stochastic policies).
By the end of this lesson, you will understand how actor-critic methods combine policy gradients with learned value functions, and how A2C, A3C, and GAE build on the basic algorithm.
In Lesson 4, we learned policy gradients (REINFORCE), which directly optimize policies but suffer from high variance:
∇_θ J(θ) = E[ Σ_t ∇_θ log π(a_t|s_t; θ) * G_t ]

The Monte Carlo return G_t is very noisy, which makes this gradient estimate high-variance.
In Lessons 2-3, we learned value-based methods (Q-Learning, DQN), which are sample-efficient but can't handle continuous actions.
Actor-Critic methods get the best of both worlds by combining them.
- Actor: policy network π(a|s; θ)
- Critic: value network V(s; w)
```
                 State s
                    ↓
           ┌────────┴────────┐
           ↓                 ↓
         Actor            Critic
       π(a|s;θ)           V(s;w)
           ↓                 ↓
        Action a         Value V(s)
           ↓
      Environment
           ↓
  Reward r, Next State s'
           ↓
  TD Error: δ = r + γV(s') - V(s)
           ↓
  Update Actor  (policy gradient weighted by δ)
  Update Critic (minimize TD error)
```

The advantage function measures how much better an action is compared to average:
A(s,a) = Q(s,a) - V(s)
Where:
- Q(s,a): expected return after taking action a in state s and then following the policy
- V(s): expected return from state s under the policy (the average over its actions)
Problem with raw returns:

```python
# Episode 1: G = 100 (good!)
# Episode 2: G = 95  (also good, but lower)
# With raw returns, the gradient scale tracks the absolute size of G, so
# Episode 2's actions are reinforced much less just because 95 < 100,
# even though both episodes were good ❌
```

Solution with advantages:

```python
# Baseline V(s) = 90
# Episode 1: A = 100 - 90 = +10 (increase probability ✓)
# Episode 2: A = 95  - 90 = +5  (increase probability ✓)
# Both episodes are better than average, so both are reinforced!
```
We don't have Q(s,a) directly, but we can estimate it:
TD Error as Advantage:
δ_t = r_t + γV(s_{t+1}) - V(s_t)
If V is the true value function, this TD error is an unbiased estimate of the advantage:

E[δ_t | s_t, a_t] = Q(s_t, a_t) - V(s_t) = A(s_t, a_t)

Intuition:
- δ_t > 0: the action led to a better outcome than the critic expected → increase its probability
- δ_t < 0: the action did worse than the critic expected → decrease its probability
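A quick numeric check with made-up numbers:

```python
# Made-up numbers for illustration: the critic predicts V(s) = 2.0 and
# V(s') = 2.5, the environment returns reward r = 1.0, gamma = 0.99.
gamma = 0.99
r, v_s, v_s_next = 1.0, 2.0, 2.5

delta = r + gamma * v_s_next - v_s   # 1.0 + 2.475 - 2.0 = 1.475
print(delta)

# delta > 0: the action produced more reward (plus future value) than the
# critic expected, so the actor should raise this action's probability.
```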

```
Initialize actor π(a|s; θ) and critic V(s; w)
Set learning rates α_θ (actor), α_w (critic)

for episode in range(num_episodes):
    s = env.reset()
    done = False

    while not done:
        # Actor: select action
        a ~ π(·|s; θ)

        # Environment: take action
        s', r, done = env.step(a)

        # Critic: compute TD error
        if done:
            δ = r - V(s; w)
        else:
            δ = r + γ * V(s'; w) - V(s; w)

        # Update critic (minimize TD error)
        w ← w + α_w * δ * ∇_w V(s; w)

        # Update actor (policy gradient with advantage)
        θ ← θ + α_θ * δ * ∇_θ log π(a|s; θ)

        s = s'
```
```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as distributions


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        logits = self.fc3(x)
        return logits

    def select_action(self, state):
        logits = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        dist = distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        value = self.fc3(x)
        return value


# Environment and hyperparameters (CartPole: 4-dim state, 2 discrete actions)
# (classic Gym API: reset() returns the observation, step() returns 4 values)
env = gym.make("CartPole-v1")
gamma = 0.99

# Initialize networks
actor = Actor(state_dim=4, action_dim=2)
critic = Critic(state_dim=4)
actor_optimizer = optim.Adam(actor.parameters(), lr=0.001)
critic_optimizer = optim.Adam(critic.parameters(), lr=0.005)

# Training loop
for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        # Actor selects action
        state_tensor = torch.FloatTensor(state)
        action, log_prob = actor.select_action(state_tensor)

        # Environment step
        next_state, reward, done, _ = env.step(action)
        next_state_tensor = torch.FloatTensor(next_state)

        # Critic evaluates current and next state
        value = critic(state_tensor)
        next_value = critic(next_state_tensor) if not done else torch.tensor(0.0)

        # Compute TD error (advantage); detach the target so gradients
        # only flow through V(s), not V(s')
        td_target = reward + gamma * next_value.detach()
        td_error = td_target - value

        # Update critic (minimize squared TD error)
        critic_loss = td_error.pow(2)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Update actor; detach td_error so the actor loss does not
        # backpropagate through the critic
        actor_loss = -log_prob * td_error.detach()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        state = next_state
```
The basic actor-critic above updates after every single step (fully online learning).
A2C (Advantage Actor-Critic) instead collects experience over multiple steps and multiple parallel environments before each update, which yields larger, less noisy gradient estimates.
Instead of 1-step TD error, use n-step returns for better credit assignment:
1-step return:
G_t^(1) = r_t + γV(s_{t+1})
N-step return:
G_t^(n) = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n})
Advantage:
A_t^(n) = G_t^(n) - V(s_t)
Bias-Variance Trade-off:
- Small n (e.g. n = 1): leans heavily on the critic's estimate → more bias, less variance
- Large n (toward the full Monte Carlo return): leans on actual rewards → less bias, more variance
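For concreteness, here is a minimal sketch of computing G_t^(n) from a reward sequence plus a bootstrap value (the function name and toy numbers are illustrative, not from a specific library):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n})."""
    g = bootstrap_value
    # Accumulate backwards so each reward is discounted the right number of times.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 3-step return with rewards [1, 0, 1] and critic estimate V(s_{t+3}) = 2.0
print(n_step_return([1.0, 0.0, 1.0], bootstrap_value=2.0))  # ≈ 3.92
```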
```
Initialize actor π(a|s; θ) and critic V(s; w)

for iteration in range(num_iterations):
    # Collect trajectories from parallel environments
    trajectories = []
    for _ in range(num_actors):
        trajectory = collect_trajectory(env, actor, n_steps=20)
        trajectories.append(trajectory)

    # Compute n-step returns and advantages
    for trajectory in trajectories:
        for t, (s_t, a_t, r_t) in enumerate(trajectory):
            # N-step return
            G_t = Σ_{k=0}^{n-1} γ^k * r_{t+k} + γ^n * V(s_{t+n})
            # Advantage
            A_t = G_t - V(s_t)

    # Batch update actor and critic
    actor_loss  = -Σ_t log π(a_t|s_t) * A_t
    critic_loss =  Σ_t (G_t - V(s_t))²
    update(actor, critic, actor_loss, critic_loss)
```
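The batch update line translates into PyTorch roughly as follows, assuming the actor/critic and optimizers from the earlier implementation, with `log_probs`, `returns`, and `values` stacked into 1-D tensors over all collected timesteps (these names are illustrative):

```python
# log_probs: log π(a_t|s_t) from the actor (with gradients)
# values:    V(s_t) from the critic (with gradients)
# returns:   n-step returns G_t (no gradients needed)
advantages = returns - values.detach()           # A_t = G_t - V(s_t)

actor_loss = -(log_probs * advantages).mean()    # policy gradient with advantage
critic_loss = (returns - values).pow(2).mean()   # regress V(s_t) toward G_t

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```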
A2C requires synchronization (all workers finish before update).
A3C allows asynchronous updates: each worker updates global parameters independently.
```
        Global Actor & Critic
         θ_global, w_global
            ↑           ↓
    ┌───────┴───────┬───────────┐
    ↓               ↓           ↓
 Worker 1        Worker 2 ... Worker N
 θ₁, w₁          θ₂, w₂       θₙ, wₙ
    ↓               ↓           ↓
  Env 1           Env 2       Env N
```
```python
from threading import Thread, Lock

lock = Lock()

# Global networks (shared by all workers)
global_actor = Actor()
global_critic = Critic()


# Worker thread (schematic: apply_gradients stands for copying the locally
# computed gradients onto the global parameters and stepping the optimizer)
def worker(worker_id):
    # Local networks (copies of the global networks)
    local_actor = Actor()
    local_critic = Critic()

    while not done:
        # 1. Sync with global parameters
        local_actor.load_state_dict(global_actor.state_dict())
        local_critic.load_state_dict(global_critic.state_dict())

        # 2. Collect a short trajectory
        trajectory = collect_trajectory(env, local_actor, n_steps=20)

        # 3. Compute gradients locally
        actor_gradients, critic_gradients = compute_gradients(trajectory)

        # 4. Update global networks asynchronously
        with lock:
            global_actor.apply_gradients(actor_gradients)
            global_critic.apply_gradients(critic_gradients)


# Launch multiple workers
threads = [Thread(target=worker, args=(i,)) for i in range(num_workers)]
for t in threads:
    t.start()
```
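In PyTorch, A3C-style training is more often done with processes than Python threads (to sidestep the GIL). A minimal sketch of Hogwild-style parameter sharing, assuming the Actor and Critic classes defined earlier (the worker body is omitted):

```python
import torch.multiprocessing as mp

def worker(rank, global_actor, global_critic):
    # Each worker would build its own environment and local networks here,
    # then repeatedly sync, collect a rollout, and update the shared networks.
    pass

if __name__ == "__main__":
    global_actor = Actor(state_dim=4, action_dim=2)
    global_critic = Critic(state_dim=4)

    # share_memory() moves the parameters into shared memory so every
    # worker process reads and updates the same tensors.
    global_actor.share_memory()
    global_critic.share_memory()

    processes = []
    for rank in range(4):  # 4 workers, chosen arbitrarily for the sketch
        p = mp.Process(target=worker, args=(rank, global_actor, global_critic))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```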
| Aspect | A2C | A3C |
|---|---|---|
| Updates | Synchronous (wait for all) | Asynchronous (independent) |
| Reproducibility | Deterministic | Stochastic (thread timing) |
| GPU Utilization | Better (batched updates) | Worse (many small updates) |
| Implementation | Simpler | More complex (threading) |
| Modern Practice | Preferred | Historical interest |
Consensus: A2C is now preferred due to better GPU utilization and reproducibility.
GAE (Generalized Advantage Estimation) addresses this bias-variance trade-off directly: instead of committing to a single n, it uses an exponentially-weighted average of all n-step advantages.
GAE(λ) = (1-λ) * [A^(1) + λ*A^(2) + λ²*A^(3) + ...]
Where:
- A^(n) is the n-step advantage estimate A_t^(n) = G_t^(n) - V(s_t)
- λ ∈ [0, 1] controls the weighting: λ = 0 recovers the 1-step TD error (low variance, more bias), λ = 1 recovers the full Monte Carlo advantage (high variance, low bias)
```python
def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    advantages = []
    gae = 0

    # Iterate backwards through trajectory
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        # TD error
        delta = rewards[t] + gamma * next_val - values[t]

        # GAE recursion
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)

    return advantages
```
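A quick usage example with made-up numbers for a 3-step rollout:

```python
rewards = [1.0, 0.0, 1.0]     # rewards at t = 0, 1, 2
values = [0.5, 0.6, 0.7]      # critic estimates V(s_0), V(s_1), V(s_2)
next_value = 0.8              # bootstrap estimate V(s_3)

advantages = compute_gae(rewards, values, next_value)
print(advantages)             # one GAE advantage per timestep

# Value targets for the critic can be recovered as G_t ≈ A_t + V(s_t):
returns = [a + v for a, v in zip(advantages, values)]
```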
For continuous control, the actor outputs a Gaussian distribution over actions instead of a categorical one:
```python
class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.mean = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        mean = self.mean(x)
        std = torch.exp(self.log_std)
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()
        return action, log_prob
```
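Usage sketch for a continuous-control task such as Pendulum-v1 (3-dimensional observation, 1-dimensional action in [-2, 2]); it assumes the class above and the earlier torch imports:

```python
actor = GaussianActor(state_dim=3, action_dim=1)

state = torch.randn(3)                      # stand-in for an observation
action, log_prob = actor.select_action(state)

# Most continuous environments have bounded actions; a simple option is to
# clamp (a more careful option is tanh squashing with a log-prob correction).
action = action.clamp(-2.0, 2.0)
env_action = action.detach().numpy()        # what you would pass to env.step()
```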
The actor can become (nearly) deterministic too early, which stops exploration.

Add an entropy bonus to encourage exploration:

```python
entropy = -(probs * torch.log(probs)).sum()
actor_loss = -log_probs * advantages - entropy_coef * entropy
```

Where:
- entropy measures how spread out the policy's action distribution is (higher = more random)
- entropy_coef controls how strongly exploration is rewarded (typically 0.01 - 0.1, see the table below)
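In PyTorch this is usually written with the distribution's built-in entropy; a sketch assuming the Categorical actor, `state_tensor`, and the TD-error advantage from the earlier training loop:

```python
logits = actor(state_tensor)
dist = torch.distributions.Categorical(logits=logits)

action = dist.sample()
log_prob = dist.log_prob(action)
entropy = dist.entropy()                 # equivalent to -(p * log p).sum()

entropy_coef = 0.01
actor_loss = -log_prob * td_error.detach() - entropy_coef * entropy
```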
| Parameter | Typical Value | Effect |
|---|---|---|
| Actor learning rate | 0.0001 - 0.001 | Lower than critic (more stable) |
| Critic learning rate | 0.001 - 0.01 | Higher than actor (learn faster) |
| Discount factor γ | 0.95 - 0.99 | Future reward valuation |
| N-steps | 5 - 20 | Longer: less bias, more variance |
| GAE λ | 0.9 - 0.99 | Advantage estimation smoothness |
| Entropy coefficient | 0.01 - 0.1 | Exploration encouragement |
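These can be gathered into a single config; the values below are illustrative picks from the ranges in the table, not tuned settings:

```python
config = {
    "actor_lr": 3e-4,
    "critic_lr": 1e-3,
    "gamma": 0.99,
    "n_steps": 10,
    "gae_lambda": 0.95,
    "entropy_coef": 0.01,
}
```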
Symptom: Advantages don't shrink, actor updates are noisy.
Fix: Increase critic learning rate (2-10x actor learning rate).
Symptom: Policy becomes deterministic and performs poorly.
Fix: Increase entropy coefficient (0.01 -> 0.05).
Symptom: Performance oscillates wildly.
Fixes:
- Lower both learning rates
- Clip gradients (see the sketch below)
- Normalize advantages within each batch
- Use more steps or environments per update (larger batches)
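For example, gradient clipping is a one-line addition before each optimizer step (shown with the actor and optimizer from the earlier implementation):

```python
actor_optimizer.zero_grad()
actor_loss.backward()
# Limit the total gradient norm so one bad batch cannot blow up the policy;
# max_norm=0.5 is a common starting point, not a tuned value.
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=0.5)
actor_optimizer.step()
```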
Actor-critic methods are a significant improvement over vanilla policy gradients, but challenges remain: updates are sensitive to learning rates and step size, one overly large policy update can collapse performance, and training can still be unstable.
Next lesson: Learn how PPO became the gold standard for policy optimization!