ℹ️ Definition: Actor-Critic methods combine value-based and policy-based reinforcement learning by using a critic to evaluate actions (reducing variance) while an actor learns the policy (enabling continuous actions and stochastic policies).
By the end of this lesson, you will understand how actor-critic methods combine policy gradients with learned value functions, and how A2C, A3C, and GAE build on the basic algorithm.
In Lesson 4, we learned policy gradients (REINFORCE), which directly optimize policies but suffer from high variance:
∇_θ J(θ) = E[ Σ_t ∇_θ log π(a_t|s_t; θ) * G_t ]

The Monte Carlo return G_t is very noisy, which makes this gradient estimate high-variance.
In Lessons 2-3, we learned value-based methods (Q-Learning, DQN), which are sample-efficient but can't handle continuous actions.
Actor-Critic methods get the best of both worlds by combining them.
- Actor: policy network π(a|s; θ)
- Critic: value network V(s; w)
```
                 State s
                    ↓
           ┌────────┴────────┐
           ↓                 ↓
         Actor            Critic
       π(a|s;θ)           V(s;w)
           ↓                 ↓
        Action a         Value V(s)
           ↓
      Environment
           ↓
  Reward r, Next State s'
           ↓
  TD Error: δ = r + γV(s') - V(s)
           ↓
  Update Actor  (policy gradient weighted by δ)
  Update Critic (minimize TD error)
```

The advantage function measures how much better an action is compared to average:
A(s,a) = Q(s,a) - V(s)
Where:
- Q(s,a): expected return after taking action a in state s and then following the policy
- V(s): expected return from state s under the policy (the average over its actions)
Problem with raw returns:

```python
# Episode 1: G = 100 (good!)
# Episode 2: G = 95  (also good, but lower)
# With raw returns, the gradient scale tracks the absolute size of G, so
# Episode 2's actions are reinforced much less just because 95 < 100,
# even though both episodes were good ❌
```

Solution with advantages:

```python
# Baseline V(s) = 90
# Episode 1: A = 100 - 90 = +10 (increase probability ✓)
# Episode 2: A = 95  - 90 = +5  (increase probability ✓)
# Both episodes are better than average, so both are reinforced!
```
We don't have Q(s,a) directly, but we can estimate it:
TD Error as Advantage:
δ_t = r_t + γV(s_{t+1}) - V(s_t)
If V is the true value function, this TD error is an unbiased estimate of the advantage:

E[δ_t | s_t, a_t] = Q(s_t, a_t) - V(s_t) = A(s_t, a_t)

Intuition:
- δ_t > 0: the action led to a better outcome than the critic expected → increase its probability
- δ_t < 0: the action did worse than the critic expected → decrease its probability
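A quick numeric check with made-up numbers:

```python
# Made-up numbers for illustration: the critic predicts V(s) = 2.0 and
# V(s') = 2.5, the environment returns reward r = 1.0, gamma = 0.99.
gamma = 0.99
r, v_s, v_s_next = 1.0, 2.0, 2.5

delta = r + gamma * v_s_next - v_s   # 1.0 + 2.475 - 2.0 = 1.475
print(delta)

# delta > 0: the action produced more reward (plus future value) than the
# critic expected, so the actor should raise this action's probability.
```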

```
Initialize actor π(a|s; θ) and critic V(s; w)
Set learning rates α_θ (actor), α_w (critic)

for episode in range(num_episodes):
    s = env.reset()
    done = False

    while not done:
        # Actor: select action
        a ~ π(·|s; θ)

        # Environment: take action
        s', r, done = env.step(a)

        # Critic: compute TD error
        if done:
            δ = r - V(s; w)
        else:
            δ = r + γ * V(s'; w) - V(s; w)

        # Update critic (minimize TD error)
        w ← w + α_w * δ * ∇_w V(s; w)

        # Update actor (policy gradient with advantage)
        θ ← θ + α_θ * δ * ∇_θ log π(a|s; θ)

        s = s'
```
```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as distributions


class Actor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        logits = self.fc3(x)
        return logits

    def select_action(self, state):
        logits = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        dist = distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob


class Critic(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, 1)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        value = self.fc3(x)
        return value


# Environment and hyperparameters (CartPole: 4-dim state, 2 discrete actions)
# (classic Gym API: reset() returns the observation, step() returns 4 values)
env = gym.make("CartPole-v1")
gamma = 0.99

# Initialize networks
actor = Actor(state_dim=4, action_dim=2)
critic = Critic(state_dim=4)
actor_optimizer = optim.Adam(actor.parameters(), lr=0.001)
critic_optimizer = optim.Adam(critic.parameters(), lr=0.005)

# Training loop
for episode in range(1000):
    state = env.reset()
    done = False

    while not done:
        # Actor selects action
        state_tensor = torch.FloatTensor(state)
        action, log_prob = actor.select_action(state_tensor)

        # Environment step
        next_state, reward, done, _ = env.step(action)
        next_state_tensor = torch.FloatTensor(next_state)

        # Critic evaluates current and next state
        value = critic(state_tensor)
        next_value = critic(next_state_tensor) if not done else torch.tensor(0.0)

        # Compute TD error (advantage); detach the target so gradients
        # only flow through V(s), not V(s')
        td_target = reward + gamma * next_value.detach()
        td_error = td_target - value

        # Update critic (minimize squared TD error)
        critic_loss = td_error.pow(2)
        critic_optimizer.zero_grad()
        critic_loss.backward()
        critic_optimizer.step()

        # Update actor; detach td_error so the actor loss does not
        # backpropagate through the critic
        actor_loss = -log_prob * td_error.detach()
        actor_optimizer.zero_grad()
        actor_loss.backward()
        actor_optimizer.step()

        state = next_state
```
The basic actor-critic above updates after every single step (fully online learning).
A2C (Advantage Actor-Critic) instead collects experience over multiple steps and multiple parallel environments before each update, which yields larger, less noisy gradient estimates.
Instead of 1-step TD error, use n-step returns for better credit assignment:
1-step return:
G_t^(1) = r_t + γV(s_{t+1})
N-step return:
G_t^(n) = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n})
Advantage:
A_t^(n) = G_t^(n) - V(s_t)
Bias-Variance Trade-off:
- Small n (e.g. n = 1): leans heavily on the critic's estimate → more bias, less variance
- Large n (toward the full Monte Carlo return): leans on actual rewards → less bias, more variance
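For concreteness, here is a minimal sketch of computing G_t^(n) from a reward sequence plus a bootstrap value (the function name and toy numbers are illustrative, not from a specific library):

```python
def n_step_return(rewards, bootstrap_value, gamma=0.99):
    """G = r_t + γ r_{t+1} + ... + γ^{n-1} r_{t+n-1} + γ^n V(s_{t+n})."""
    g = bootstrap_value
    # Accumulate backwards so each reward is discounted the right number of times.
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# 3-step return with rewards [1, 0, 1] and critic estimate V(s_{t+3}) = 2.0
print(n_step_return([1.0, 0.0, 1.0], bootstrap_value=2.0))  # ≈ 3.92
```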
```
Initialize actor π(a|s; θ) and critic V(s; w)

for iteration in range(num_iterations):
    # Collect trajectories from parallel environments
    trajectories = []
    for _ in range(num_actors):
        trajectory = collect_trajectory(env, actor, n_steps=20)
        trajectories.append(trajectory)

    # Compute n-step returns and advantages
    for trajectory in trajectories:
        for t, (s_t, a_t, r_t) in enumerate(trajectory):
            # N-step return
            G_t = Σ_{k=0}^{n-1} γ^k * r_{t+k} + γ^n * V(s_{t+n})
            # Advantage
            A_t = G_t - V(s_t)

    # Batch update actor and critic
    actor_loss  = -Σ_t log π(a_t|s_t) * A_t
    critic_loss =  Σ_t (G_t - V(s_t))²
    update(actor, critic, actor_loss, critic_loss)
```
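The batch update line translates into PyTorch roughly as follows, assuming the actor/critic and optimizers from the earlier implementation, with `log_probs`, `returns`, and `values` stacked into 1-D tensors over all collected timesteps (these names are illustrative):

```python
# log_probs: log π(a_t|s_t) from the actor (with gradients)
# values:    V(s_t) from the critic (with gradients)
# returns:   n-step returns G_t (no gradients needed)
advantages = returns - values.detach()           # A_t = G_t - V(s_t)

actor_loss = -(log_probs * advantages).mean()    # policy gradient with advantage
critic_loss = (returns - values).pow(2).mean()   # regress V(s_t) toward G_t

actor_optimizer.zero_grad()
actor_loss.backward()
actor_optimizer.step()

critic_optimizer.zero_grad()
critic_loss.backward()
critic_optimizer.step()
```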
A2C requires synchronization (all workers finish before update).
A3C allows asynchronous updates: each worker updates global parameters independently.
```
        Global Actor & Critic
         θ_global, w_global
            ↑           ↓
    ┌───────┴───────┬───────────┐
    ↓               ↓           ↓
 Worker 1        Worker 2 ... Worker N
 θ₁, w₁          θ₂, w₂       θₙ, wₙ
    ↓               ↓           ↓
  Env 1           Env 2       Env N
```
```python
from threading import Thread, Lock

lock = Lock()

# Global networks (shared by all workers)
global_actor = Actor()
global_critic = Critic()


# Worker thread (schematic: apply_gradients stands for copying the locally
# computed gradients onto the global parameters and stepping the optimizer)
def worker(worker_id):
    # Local networks (copies of the global networks)
    local_actor = Actor()
    local_critic = Critic()

    while not done:
        # 1. Sync with global parameters
        local_actor.load_state_dict(global_actor.state_dict())
        local_critic.load_state_dict(global_critic.state_dict())

        # 2. Collect a short trajectory
        trajectory = collect_trajectory(env, local_actor, n_steps=20)

        # 3. Compute gradients locally
        actor_gradients, critic_gradients = compute_gradients(trajectory)

        # 4. Update global networks asynchronously
        with lock:
            global_actor.apply_gradients(actor_gradients)
            global_critic.apply_gradients(critic_gradients)


# Launch multiple workers
threads = [Thread(target=worker, args=(i,)) for i in range(num_workers)]
for t in threads:
    t.start()
```
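In PyTorch, A3C-style training is more often done with processes than Python threads (to sidestep the GIL). A minimal sketch of Hogwild-style parameter sharing, assuming the Actor and Critic classes defined earlier (the worker body is omitted):

```python
import torch.multiprocessing as mp

def worker(rank, global_actor, global_critic):
    # Each worker would build its own environment and local networks here,
    # then repeatedly sync, collect a rollout, and update the shared networks.
    pass

if __name__ == "__main__":
    global_actor = Actor(state_dim=4, action_dim=2)
    global_critic = Critic(state_dim=4)

    # share_memory() moves the parameters into shared memory so every
    # worker process reads and updates the same tensors.
    global_actor.share_memory()
    global_critic.share_memory()

    processes = []
    for rank in range(4):  # 4 workers, chosen arbitrarily for the sketch
        p = mp.Process(target=worker, args=(rank, global_actor, global_critic))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
```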
| Aspect | A2C | A3C |
|---|---|---|
| Updates | Synchronous (wait for all) | Asynchronous (independent) |
| Reproducibility | Deterministic | Stochastic (thread timing) |
| GPU Utilization | Better (batched updates) | Worse (many small updates) |
| Implementation | Simpler | More complex (threading) |
| Modern Practice | Preferred | Historical interest |
Consensus: A2C is now preferred due to better GPU utilization and reproducibility.
GAE (Generalized Advantage Estimation) addresses this bias-variance trade-off directly: instead of committing to a single n, it uses an exponentially-weighted average of all n-step advantages.
GAE(λ) = (1-λ) * [A^(1) + λ*A^(2) + λ²*A^(3) + ...]
Where:
- A^(n) is the n-step advantage estimate A_t^(n) = G_t^(n) - V(s_t)
- λ ∈ [0, 1] controls the weighting: λ = 0 recovers the 1-step TD error (low variance, more bias), λ = 1 recovers the full Monte Carlo advantage (high variance, low bias)
```python
def compute_gae(rewards, values, next_value, gamma=0.99, lam=0.95):
    advantages = []
    gae = 0

    # Iterate backwards through trajectory
    for t in reversed(range(len(rewards))):
        if t == len(rewards) - 1:
            next_val = next_value
        else:
            next_val = values[t + 1]

        # TD error
        delta = rewards[t] + gamma * next_val - values[t]

        # GAE recursion
        gae = delta + gamma * lam * gae
        advantages.insert(0, gae)

    return advantages
```
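A quick usage example with made-up numbers for a 3-step rollout:

```python
rewards = [1.0, 0.0, 1.0]     # rewards at t = 0, 1, 2
values = [0.5, 0.6, 0.7]      # critic estimates V(s_0), V(s_1), V(s_2)
next_value = 0.8              # bootstrap estimate V(s_3)

advantages = compute_gae(rewards, values, next_value)
print(advantages)             # one GAE advantage per timestep

# Value targets for the critic can be recovered as G_t ≈ A_t + V(s_t):
returns = [a + v for a, v in zip(advantages, values)]
```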
For continuous control, the actor outputs a Gaussian distribution over actions instead of a categorical one:
```python
class GaussianActor(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.mean = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        mean = self.mean(x)
        std = torch.exp(self.log_std)
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()
        return action, log_prob
```
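Usage sketch for a continuous-control task such as Pendulum-v1 (3-dimensional observation, 1-dimensional action in [-2, 2]); it assumes the class above and the earlier torch imports:

```python
actor = GaussianActor(state_dim=3, action_dim=1)

state = torch.randn(3)                      # stand-in for an observation
action, log_prob = actor.select_action(state)

# Most continuous environments have bounded actions; a simple option is to
# clamp (a more careful option is tanh squashing with a log-prob correction).
action = action.clamp(-2.0, 2.0)
env_action = action.detach().numpy()        # what you would pass to env.step()
```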
The actor can become (nearly) deterministic too early, which stops exploration.

Add an entropy bonus to encourage exploration:

```python
entropy = -(probs * torch.log(probs)).sum()
actor_loss = -log_probs * advantages - entropy_coef * entropy
```

Where:
- entropy measures how spread out the policy's action distribution is (higher = more random)
- entropy_coef controls how strongly exploration is rewarded (typically 0.01 - 0.1, see the table below)
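In PyTorch this is usually written with the distribution's built-in entropy; a sketch assuming the Categorical actor, `state_tensor`, and the TD-error advantage from the earlier training loop:

```python
logits = actor(state_tensor)
dist = torch.distributions.Categorical(logits=logits)

action = dist.sample()
log_prob = dist.log_prob(action)
entropy = dist.entropy()                 # equivalent to -(p * log p).sum()

entropy_coef = 0.01
actor_loss = -log_prob * td_error.detach() - entropy_coef * entropy
```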
| Parameter | Typical Value | Effect |
|---|---|---|
| Actor learning rate | 0.0001 - 0.001 | Lower than critic (more stable) |
| Critic learning rate | 0.001 - 0.01 | Higher than actor (learn faster) |
| Discount factor γ | 0.95 - 0.99 | Future reward valuation |
| N-steps | 5 - 20 | Longer: less bias, more variance |
| GAE λ | 0.9 - 0.99 | Advantage estimation smoothness |
| Entropy coefficient | 0.01 - 0.1 | Exploration encouragement |
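These can be gathered into a single config; the values below are illustrative picks from the ranges in the table, not tuned settings:

```python
config = {
    "actor_lr": 3e-4,
    "critic_lr": 1e-3,
    "gamma": 0.99,
    "n_steps": 10,
    "gae_lambda": 0.95,
    "entropy_coef": 0.01,
}
```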
Symptom: Advantages don't shrink, actor updates are noisy.
Fix: Increase critic learning rate (2-10x actor learning rate).
Symptom: Policy becomes deterministic and performs poorly.
Fix: Increase entropy coefficient (0.01 -> 0.05).
Symptom: Performance oscillates wildly.
Fixes:
- Lower both learning rates
- Clip gradients (see the sketch below)
- Normalize advantages within each batch
- Use more steps or environments per update (larger batches)
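For example, gradient clipping is a one-line addition before each optimizer step (shown with the actor and optimizer from the earlier implementation):

```python
actor_optimizer.zero_grad()
actor_loss.backward()
# Limit the total gradient norm so one bad batch cannot blow up the policy;
# max_norm=0.5 is a common starting point, not a tuned value.
torch.nn.utils.clip_grad_norm_(actor.parameters(), max_norm=0.5)
actor_optimizer.step()
```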
Actor-critic methods are a significant improvement over vanilla policy gradients, but challenges remain: updates are sensitive to learning rates and step size, one overly large policy update can collapse performance, and training can still be unstable.
Next lesson: Learn how PPO became the gold standard for policy optimization!