ℹ️ Definition Policy Gradient Methods are a class of reinforcement learning algorithms that directly optimize the policy function using gradient ascent on expected rewards, enabling learning of stochastic policies and handling continuous action spaces.
By the end of this lesson, you will:
- Understand why directly parameterizing the policy helps with continuous actions and stochastic behavior
- Derive and interpret the policy gradient theorem
- Implement REINFORCE in PyTorch for discrete and continuous action spaces
- Apply variance-reduction techniques: baselines, return normalization, and entropy bonuses
In Lessons 2-3, we learned value-based methods (Q-Learning, DQN) that learn Q-values and derive policies implicitly:
Learn Q(s,a) → Policy: π(s) = argmax_a Q(s,a)
Policy gradient methods take a different approach: directly learn the policy.
Learn π(a|s; θ) directly using gradient ascent
This seemingly simple change unlocks powerful capabilities.
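As a minimal sketch of the contrast (illustrative only; it assumes a 4-dimensional state, 2 discrete actions, and single-layer stand-in networks):

import torch

state = torch.randn(4)                        # example state

# Value-based: the network outputs Q-values; the policy is an argmax over them
q_net = torch.nn.Linear(4, 2)                 # stand-in for a Q-network
action = q_net(state).argmax().item()         # deterministic greedy action

# Policy-based: the network outputs action probabilities; the policy samples from them
policy_net = torch.nn.Linear(4, 2)            # stand-in for a policy network
probs = torch.softmax(policy_net(state), dim=-1)
action = torch.distributions.Categorical(probs).sample().item()  # stochastic action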
Value-Based (Q-Learning, DQN)

Approach: learn Q(s, a) and act greedily, i.e. π(s) = argmax_a Q(s, a), with ε-greedy exploration.

Strengths: sample-efficient, since experience can be reused off-policy; works well for small, discrete action spaces.

Weaknesses: the argmax is impractical for continuous or very large action spaces, and the learned policy is essentially deterministic.

Policy-Based (Policy Gradients)

Approach: parameterize π(a|s; θ) directly and follow the gradient of the expected return.

Strengths: handles continuous and high-dimensional action spaces; can represent stochastic policies; the policy changes smoothly with θ.

Weaknesses: gradient estimates have high variance, updates are on-policy and therefore sample-inefficient, and training can get stuck in local optima.

Use policy gradient methods when:
- The action space is continuous or very large
- The optimal behavior is inherently stochastic
- You want the policy to change smoothly during learning
Example - Robot Control:
# Value-based: Need to discretize (bad approximation)
actions = [0.0, 0.1, 0.2, ..., 1.0] # 11 discrete actions
Q(state, action) for each discrete action
# Policy-based: Output continuous action directly
action = π(state; θ) # Can be any value in [0, 1]
Example - Rock-Paper-Scissors: the optimal strategy is stochastic (play each move with probability 1/3). A greedy, value-based policy is deterministic and therefore exploitable, whereas a policy network can output the mixed strategy directly.
Example - Multi-Joint Robot: each joint takes a continuous torque, so the joint action space is continuous and high-dimensional. Discretizing it explodes combinatorially, but a Gaussian policy can output one torque per joint.
For discrete action space with n actions:
π(a|s; θ) = softmax(f(s; θ))_a = exp(f_a(s; θ)) / Σ_a' exp(f_a'(s; θ))
Where:
- f(s; θ) is the vector of logits produced by the network
- f_a(s; θ) is the logit for action a
Example:
logits = network(state)                                    # e.g., tensor([2.0, 1.0, 0.5])
probs = torch.softmax(logits, dim=-1)                      # tensor([0.63, 0.23, 0.14])
action = torch.distributions.Categorical(probs).sample()  # sample an action index

For continuous action space:
π(a|s; θ) = N(μ(s; θ), σ²)
Where:
- μ(s; θ) is the mean action output by the network
- σ² is the variance (kept fixed or learned as a parameter)
Example:
mu = network(state)                                        # e.g., μ = 0.7
sigma = 0.5                                                # fixed σ
action = torch.distributions.Normal(mu, sigma).sample()   # a ~ N(0.7, 0.5²)
Maximize expected return:
J(θ) = E_τ~π_θ [R(τ)]
Where:
- τ = (s_0, a_0, r_1, s_1, a_1, ...) is a trajectory generated by following π_θ
- R(τ) is the total (discounted) reward collected along τ
Policy Gradient Theorem:
∇_θ J(θ) = E_τ~π_θ [Σ_t ∇_θ log π_θ(a_t|s_t) * R(τ)]
Intuition: we cannot differentiate through the environment, but we can differentiate the log-probability of the actions the policy chose. Weighting each ∇_θ log π_θ(a_t|s_t) by the trajectory's return increases the probability of actions that appeared in high-return trajectories and decreases it for actions from low-return ones.
For episodic tasks:
∇_θ J(θ) = E_τ [Σ_t ∇_θ log π_θ(a_t|s_t) * G_t]
Where:
- G_t = r_t + γ·r_{t+1} + γ²·r_{t+2} + ... is the return from time step t onward (the reward-to-go), with r_t the reward received after taking a_t
Interpretation: Push up probability of actions that led to high returns.
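To make this concrete, here is a tiny, self-contained illustration (not part of the lesson's environment): a single gradient-ascent step on log π(a|s) · G with G > 0 makes the sampled action more likely.

import torch

logits = torch.zeros(2, requires_grad=True)   # start from a uniform policy over 2 actions
action, G = 0, 10.0                           # suppose action 0 led to a return of 10

log_prob = torch.log_softmax(logits, dim=-1)[action]
loss = -log_prob * G                          # negate so minimizing the loss ascends J(θ)
loss.backward()

with torch.no_grad():
    logits -= 0.1 * logits.grad               # one manual gradient step

print(torch.softmax(logits, dim=-1))          # P(action 0) rises from 0.50 to about 0.73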

REINFORCE (Monte Carlo Policy Gradient) is the simplest policy gradient algorithm.
Initialize policy network π(a|s; θ) with random θ
Set learning rate α

for episode in range(num_episodes):
    # Generate trajectory
    τ = []
    s = env.reset()
    done = False
    while not done:
        # Sample action from policy
        a ~ π(·|s; θ)
        s', r, done = env.step(a)
        τ.append((s, a, r))
        s = s'

    # Compute returns
    G = 0
    returns = []
    for (s, a, r) in reversed(τ):
        G = r + γ * G
        returns.insert(0, G)

    # Policy gradient update
    for t, (s_t, a_t, r_t) in enumerate(τ):
        G_t = returns[t]
        # Compute gradient
        ∇J = ∇_θ log π_θ(a_t|s_t) * G_t
        # Gradient ascent (maximize reward)
        θ ← θ + α * ∇J
import torch
import torch.nn as nn
import torch.optim as optim
import torch.distributions as distributions

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        logits = self.fc3(x)
        return logits

    def select_action(self, state):
        logits = self.forward(state)
        probs = torch.softmax(logits, dim=-1)
        dist = distributions.Categorical(probs)
        action = dist.sample()
        log_prob = dist.log_prob(action)
        return action.item(), log_prob

# Training loop (assumes `env` is a classic Gym-style environment, e.g. CartPole-v1: 4-dim state, 2 actions)
policy = PolicyNetwork(state_dim=4, action_dim=2)
optimizer = optim.Adam(policy.parameters(), lr=0.01)
gamma = 0.99  # discount factor

for episode in range(1000):
    states, actions, rewards, log_probs = [], [], [], []

    # Collect trajectory
    state = env.reset()
    done = False
    while not done:
        state_tensor = torch.FloatTensor(state)
        action, log_prob = policy.select_action(state_tensor)
        next_state, reward, done, _ = env.step(action)

        states.append(state)
        actions.append(action)
        rewards.append(reward)
        log_probs.append(log_prob)
        state = next_state

    # Compute returns (discounted cumulative rewards)
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.FloatTensor(returns)
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)  # Normalize

    # Compute policy gradient loss
    log_probs = torch.stack(log_probs)
    loss = -(log_probs * returns).sum()  # Negative for gradient ascent

    # Update policy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
Why take the gradient of the log probability? Mathematical convenience (the log-derivative trick):
∇_θ π_θ(a|s) / π_θ(a|s) = ∇_θ log π_θ(a|s)
This transforms a difficult derivative into a simple gradient of log probability.
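A sketch of how this identity yields the policy gradient theorem, written in terms of the trajectory distribution p_θ(τ):

∇_θ J(θ) = ∇_θ E_τ~p_θ [R(τ)] = E_τ~p_θ [R(τ) · ∇_θ log p_θ(τ)]

Since log p_θ(τ) = log p(s_0) + Σ_t [log π_θ(a_t|s_t) + log p(s_{t+1}|s_t, a_t)], the initial-state and transition terms do not depend on θ, so the environment dynamics drop out and only Σ_t ∇_θ log π_θ(a_t|s_t) remains.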
Discrete actions (Categorical):
logits = policy_network(state)
log_probs = torch.log_softmax(logits, dim=-1)
log_prob_action = log_probs[action]
Continuous actions (Gaussian):
mu = policy_network(state)
dist = torch.distributions.Normal(mu, sigma)
log_prob = dist.log_prob(action)
REINFORCE has very high variance in its gradient estimates: each update uses the return of a single sampled trajectory, which is a noisy estimate of the expected return, so successive updates can point in very different directions.
Subtract a baseline b(s) from returns to reduce variance:
∇_θ J(θ) ≈ Σ_t ∇_θ log π_θ(a_t|s_t) * [G_t - b(s_t)]
Common baselines:
- The average return over the current batch of episodes (a constant baseline)
- A moving average of returns from recent episodes
- A learned state-value estimate V(s)
Why it works: for any baseline that does not depend on the action, E[∇_θ log π_θ(a|s) · b(s)] = 0, so subtracting it leaves the gradient estimate unbiased while (for a well-chosen b) substantially reducing its variance.
Implementation:
returns = torch.FloatTensor(returns)
baseline = returns.mean()
advantages = returns - baseline
loss = -(log_probs * advantages).sum()
Normalize returns to have mean 0 and std 1:
returns = (returns - returns.mean()) / (returns.std() + 1e-8)
Benefits: keeps gradient magnitudes in a consistent range regardless of the reward scale, which stabilizes training and makes the learning rate easier to tune.
Use advantage A(s,a) = Q(s,a) - V(s):
∇_θ J(θ) = E [Σ_t ∇_θ log π_θ(a_t|s_t) * A(s_t, a_t)]
This leads to Actor-Critic methods (next lesson).
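As a preview, a learned state-value baseline could be plugged into the REINFORCE update like this (a minimal sketch; ValueNetwork and value_net are illustrative names, not part of this lesson's code):

import torch
import torch.nn as nn

class ValueNetwork(nn.Module):
    def __init__(self, state_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(state_dim, 128), nn.ReLU(), nn.Linear(128, 1))

    def forward(self, state):
        return self.net(state).squeeze(-1)

value_net = ValueNetwork(state_dim=4)

# Inside the update step, replace the constant baseline with V(s_t):
# values = value_net(torch.FloatTensor(states))   # predicted V(s_t) for each visited state
# advantages = returns - values.detach()          # A_t ≈ G_t - V(s_t)
# policy_loss = -(log_probs * advantages).sum()
# value_loss = ((returns - values) ** 2).mean()   # regress V toward the observed returns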
For continuous actions, use a Gaussian policy:
class GaussianPolicy(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.mean = nn.Linear(128, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))  # Learnable std

    def forward(self, state):
        x = torch.relu(self.fc1(state))
        x = torch.relu(self.fc2(x))
        mean = self.mean(x)
        std = torch.exp(self.log_std)
        return mean, std

    def select_action(self, state):
        mean, std = self.forward(state)
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum()  # Sum over action dimensions
        return action, log_prob
Sampled actions can fall outside the environment's valid range, so clip them (the Box bounds are numpy arrays, so convert them to tensors first):
low, high = torch.as_tensor(env.action_space.low), torch.as_tensor(env.action_space.high)
action = action.clamp(low, high)
| Parameter | Typical Value | Effect |
|---|---|---|
| Learning rate | 0.001 - 0.01 | Higher: faster but less stable |
| Discount factor γ | 0.95 - 0.99 | How much to value future rewards |
| Entropy bonus | 0.01 - 0.1 | Encourages exploration (higher = more random) |
| Gradient clipping | 0.5 - 1.0 | Prevents exploding gradients |
Symptom: Policy becomes deterministic too quickly, stops exploring.
Cause: Exploitation too early.
Fix: Add entropy bonus to encourage exploration:
entropy = -(probs * torch.log(probs)).sum()
loss = -(log_probs * returns).sum() - entropy_weight * entropy
Symptom: Policy improves very slowly.

Causes:
- Gradient estimates are too noisy (high variance)
- Learning rate is too low
- Rewards are sparse, so most returns carry little learning signal

Fixes:
- Normalize returns and subtract a baseline
- Collect several episodes per update to average out the noise
- Cautiously increase the learning rate
Symptom: Performance oscillates wildly.

Causes:
- Learning rate is too high
- Exploding gradients from large, unnormalized returns

Fixes:
- Lower the learning rate and normalize returns
- Clip gradients before the optimizer step:
torch.nn.utils.clip_grad_norm_(policy.parameters(), max_norm=1.0)

Policy gradients opened new possibilities, but their high variance limits performance.
Next lesson: Learn how to reduce variance by combining policy gradients with value functions!