Practice and reinforce the concepts from Lesson 3
In this activity, you'll implement a Deep Q-Network (DQN) agent to solve CartPole-v1, transitioning from tabular Q-Learning to function approximation with neural networks. You'll build experience replay and target networks, and train the agent until it reliably solves the environment.
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-03-deep-q-networks.zip (Templates/AI25-Template-activity-03-deep-q-networks.zip)
Upload activity-03-deep-q-networks.ipynb to Google Colab.
Execute the first few cells to:
CartPole-v1:
Challenge: the state space is continuous, so you can't use a Q-table; you need function approximation.
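To see why, inspect the environment. A quick sanity check, assuming the Gymnasium API (adjust the import if your notebook uses classic gym):

import gymnasium as gym

env = gym.make("CartPole-v1")
print(env.observation_space)  # Box of 4 continuous values: cart position, cart velocity, pole angle, pole angular velocity
print(env.action_space)       # Discrete(2): push left or push right

With four real-valued state variables there is no finite set of states to index a Q-table with, which is why this activity replaces the table with a neural network.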
TODO 1: Complete the DQN network forward pass
class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        # TODO 1: Complete the forward pass
        # Hint: Use F.relu() for activations
        pass
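For reference after you've made your own attempt, one way the completed forward pass can look (same layer names as the template; this is a sketch, not the official solution, and the class name is made up so it doesn't clash with yours):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DQNSolutionSketch(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))   # hidden layer 1 + ReLU
        x = F.relu(self.fc2(x))       # hidden layer 2 + ReLU
        return self.fc3(x)            # raw Q-values, one per action

Note there is no activation on the output layer: Q-values are unbounded estimates of return, not probabilities.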
TODO 2: Implement the replay buffer
You'll implement the add() method and the sample() method:

class ReplayBuffer:
    def __init__(self, capacity=10000):
        self.buffer = []
        self.capacity = capacity

    def add(self, state, action, reward, next_state, done):
        # TODO 2: Add experience to buffer
        # Hint: Use deque or list with max capacity
        pass

    def sample(self, batch_size):
        # TODO 2: Sample random batch from buffer
        # Return: states, actions, rewards, next_states, dones
        pass
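If you get stuck, here is a minimal reference sketch using collections.deque, whose maxlen evicts old experiences automatically (the class name is made up to avoid clashing with the template's ReplayBuffer):

from collections import deque
import random
import numpy as np

class ReplayBufferSketch:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # old experiences are dropped automatically

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions),
                np.array(rewards, dtype=np.float32),
                np.array(next_states), np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)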
TODO 3: Implement the DQN update step
def update_dqn(q_network, target_network, replay_buffer, optimizer, batch_size, gamma):
    # TODO 3: Implement DQN update
    # 1. Sample batch from replay buffer
    # 2. Compute current Q-values for taken actions
    # 3. Compute target Q-values using target network
    # 4. Compute loss and update q_network
    pass
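A sketch of the full update, assuming the buffer's sample() returns NumPy arrays as in the replay-buffer sketch above; treat it as one workable pattern rather than the expected solution:

import torch
import torch.nn.functional as F

def update_dqn_sketch(q_network, target_network, replay_buffer, optimizer, batch_size, gamma):
    if len(replay_buffer) < batch_size:
        return None  # wait until there is enough experience to sample

    # 1. Sample a batch and convert to tensors
    states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)
    states = torch.FloatTensor(states)
    actions = torch.LongTensor(actions).unsqueeze(1)   # shape [batch, 1] for gather
    rewards = torch.FloatTensor(rewards)
    next_states = torch.FloatTensor(next_states)
    dones = torch.FloatTensor(dones)

    # 2. Current Q-values for the actions that were actually taken
    q_values = q_network(states).gather(1, actions).squeeze(1)

    # 3. Bootstrapped targets from the frozen target network (no gradients flow here)
    with torch.no_grad():
        next_q = target_network(next_states).max(dim=1)[0]
        targets = rewards + gamma * next_q * (1 - dones)   # no bootstrap past terminal states

    # 4. Regression loss and gradient step on the online network only
    loss = F.mse_loss(q_values, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()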
TODO 4: Implement target network updates
if episode % target_update_freq == 0:
    # TODO 4: Update target network
    pass
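The simplest completion is a hard copy of the online network's weights every target_update_freq episodes; a soft (Polyak) update is a common alternative (the tau value below is just an illustrative assumption):

# Hard update: copy all weights from the online network
if episode % target_update_freq == 0:
    target_network.load_state_dict(q_network.state_dict())

# Alternative: soft update, typically done every step instead of every few episodes
# tau = 0.005
# for t_param, param in zip(target_network.parameters(), q_network.parameters()):
#     t_param.data.copy_(tau * param.data + (1 - tau) * t_param.data)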
The main training loop is mostly complete; you'll add:
Dashboards showing:
Episodes 0-50: Random exploration, reward ~20-30
Episodes 50-150: Learning begins, reward increases to ~100
Episodes 150-300: Rapid improvement, reward approaches 200
Episodes 300+: Solved! Consistent 200+ reward
CartPole-v1 is considered "solved" when:
Average reward ≥ 195 over 100 consecutive episodes
Your DQN agent should achieve this in 200-400 episodes.
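One way to check this criterion during training; it assumes you append each episode's total reward to a list named episode_rewards (that name, and the episode counter, are assumptions, not part of the template):

import numpy as np

# episode_rewards: list of per-episode returns you maintain in the training loop (assumed name)
if len(episode_rewards) >= 100:
    avg_100 = np.mean(episode_rewards[-100:])
    if avg_100 >= 195:
        print(f"Solved at episode {episode}: mean reward {avg_100:.1f} over the last 100 episodes")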
Your implementation is complete when:
Why this architecture works:
# Network
learning_rate = 0.001
hidden_size = 128
# Training
batch_size = 64
gamma = 0.99 # Discount factor
replay_buffer_size = 10000
# Exploration
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
# Target network
target_update_freq = 10 # episodes
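The exploration settings above are typically paired with epsilon-greedy action selection and a multiplicative decay per episode; a sketch using the names from the config block (not the template's exact code):

import random
import torch

def select_action(q_network, state, epsilon, action_dim):
    if random.random() < epsilon:
        return random.randrange(action_dim)              # explore: random action
    with torch.no_grad():
        q_values = q_network(torch.FloatTensor(state))
        return int(q_values.argmax().item())             # exploit: greedy action

epsilon = epsilon_start
# ... then at the end of each episode:
epsilon = max(epsilon_min, epsilon * epsilon_decay)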
Problem: Agent doesn't learn (reward stays ~20-30)
Check that the replay buffer is actually filling up (print len(replay_buffer)) and that gradients are flowing (print param.grad.norm()).
Problem: Training diverges (reward drops after improving)
Problem: Slow learning
# 1. Check network output shape
state = env.reset()  # if your notebook uses Gymnasium, unpack instead: state, _ = env.reset()
q_values = q_network(torch.FloatTensor(state))
print(f"Q-values shape: {q_values.shape}")  # Should be [2]: one Q-value per CartPole action
# 2. Check replay buffer
replay_buffer.add(state, action, reward, next_state, done)
print(f"Buffer size: {len(replay_buffer)}")
# 3. Check loss is decreasing
print(f"Episode {episode}, Loss: {loss.item():.4f}")
# 4. Check Q-values are updating
print(f"Q-values: {q_values.detach().numpy()}")
Implement Double DQN to reduce overestimation:
# Standard DQN target:
target = reward + gamma * target_network(next_state).max()
# Double DQN target:
best_action = q_network(next_state).argmax()
target = reward + gamma * target_network(next_state)[best_action]
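In the batched update this looks roughly as follows; a sketch where rewards, next_states, and dones are tensors as in the earlier update sketch, and the (1 - dones) factor stops bootstrapping at terminal states:

import torch

def double_dqn_targets(q_network, target_network, rewards, next_states, dones, gamma):
    # rewards, dones: float tensors of shape [batch]; next_states: float tensor [batch, state_dim]
    with torch.no_grad():
        best_actions = q_network(next_states).argmax(dim=1, keepdim=True)        # select actions with the online net
        next_q = target_network(next_states).gather(1, best_actions).squeeze(1)  # evaluate them with the target net
        return rewards + gamma * next_q * (1 - dones)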
Implement Dueling architecture:
# Split into value and advantage streams
value = self.value_stream(features)
advantage = self.advantage_stream(features)
q_values = value + (advantage - advantage.mean(dim=-1, keepdim=True))  # subtract the per-sample mean, not a scalar over the whole batch
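Put together as a module, a dueling head might look like this (a sketch; layer sizes mirror the template's DQN, and the mean advantage is subtracted per sample so that V and A remain identifiable):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingDQNSketch(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.feature = nn.Linear(state_dim, 128)
        self.value_stream = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, 1))
        self.advantage_stream = nn.Sequential(nn.Linear(128, 128), nn.ReLU(), nn.Linear(128, action_dim))

    def forward(self, state):
        features = F.relu(self.feature(state))
        value = self.value_stream(features)           # shape [..., 1]
        advantage = self.advantage_stream(features)   # shape [..., action_dim]
        # subtract the per-sample mean advantage so the decomposition is well defined
        return value + (advantage - advantage.mean(dim=-1, keepdim=True))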
Sample important experiences more frequently based on TD error magnitude.
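A minimal proportional-prioritization sketch, without the sum-tree or importance-sampling weights of the full algorithm (so it is simpler, slower, and slightly biased; all names here are made up for illustration):

import numpy as np
from collections import deque

class SimplePrioritizedBuffer:
    def __init__(self, capacity=10000, alpha=0.6):
        self.buffer = deque(maxlen=capacity)
        self.priorities = deque(maxlen=capacity)
        self.alpha = alpha          # how strongly TD error influences sampling (0 = uniform)

    def add(self, transition, td_error=1.0):
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-5) ** self.alpha)

    def sample(self, batch_size):
        probs = np.array(self.priorities)
        probs /= probs.sum()
        indices = np.random.choice(len(self.buffer), batch_size, p=probs)
        return [self.buffer[i] for i in indices], indices

    def update_priorities(self, indices, td_errors):
        # call after the update step with the new TD errors of the sampled transitions
        for i, err in zip(indices, td_errors):
            self.priorities[i] = (abs(err) + 1e-5) ** self.alpha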
Extend your DQN to play Pong from pixels:
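For pixel input you would swap the fully connected trunk for a convolutional one. A sketch in the spirit of the original Atari DQN, assuming preprocessed 84x84 grayscale frames stacked 4 deep (the preprocessing itself is not shown):

import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvDQNSketch(nn.Module):
    def __init__(self, action_dim, in_channels=4):
        super().__init__()
        self.conv1 = nn.Conv2d(in_channels, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        self.fc1 = nn.Linear(64 * 7 * 7, 512)   # 84x84 input -> 7x7x64 feature map
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = F.relu(self.fc1(x.flatten(start_dim=1)))
        return self.fc2(x)   # one Q-value per action, as before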
Completed Notebook: activity-03-deep-q-networks.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
In the next lesson, you'll learn a completely different approach: directly optimizing policies instead of value functions!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy building your first deep RL agent! 🚀