ℹ️ Definition Deep Q-Networks (DQN) extend Q-Learning to high-dimensional state spaces by using neural networks as function approximators, enabling RL agents to learn directly from raw sensory input like images.
By the end of this lesson, you will:
In Lesson 2, we implemented tabular Q-Learning for FrozenLake, where the Q-table had 16 states × 4 actions = 64 entries. But what if your environment has millions of states?
Example - Atari Pong: even after preprocessing the screen to 84×84 grayscale pixels, the observation can take on the order of 256^(84×84) different configurations. No table could ever hold a Q-value for each one.
Deep Q-Networks solve this by using neural networks to approximate Q-values instead of storing them in a table.
Tabular Q-Learning fails for:
1. Large discrete state spaces
2. Continuous state spaces
3. High-dimensional observations
CartPole: 4 continuous state variables → Discretize to 10 bins each
States: 10^4 = 10,000 (manageable)
Atari Pong: 84×84 grayscale pixels
States: 256^(84×84) = impossible to enumerate
Need: Function approximation instead of tables
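To get a feel for these numbers, here is a quick back-of-the-envelope check in Python (using the bin counts and image size assumed above):

import math

# CartPole with 10 bins per state variable: a table is still feasible
cartpole_states = 10 ** 4
print(cartpole_states)                      # 10,000 entries

# Atari: 256 gray levels for each of 84x84 pixels.
# 256^(84*84) is too large to even print, so count its digits instead.
atari_digits = int(84 * 84 * math.log10(256)) + 1
print(atari_digits)                         # roughly 17,000 digits long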
Instead of storing Q(s,a) for every state-action pair, we approximate it with a function:
Q(s,a) ≈ Q(s,a; θ)
Where θ denotes the neural network's learnable parameters (its weights and biases).
Architecture:
Input: State s (e.g., image pixels)
↓
Hidden Layers: Extract features
↓
Output: Q-values for all actions [Q(s,a₁), Q(s,a₂), ..., Q(s,aₙ)]
Advantage: Network learns to generalize - similar states produce similar Q-values without explicitly visiting every state.

A naive combination of neural networks and Q-Learning is unstable, for two main reasons:
Problem 1: Correlated samples. Consecutive transitions from the same episode are highly correlated, which violates the i.i.d. assumption behind stochastic gradient descent and produces noisy, unstable updates.
Problem 2: Moving targets. The target r + γ * max_a' Q(s', a'; θ) is computed with the very network being trained, so every gradient step also shifts the target the network is chasing.
Store experiences in a replay buffer and sample randomly for training.
Replay Buffer:
buffer = [] # List of (s, a, r, s', done) tuples
# During interaction
buffer.append((state, action, reward, next_state, done))
# During training
batch = random_sample(buffer, batch_size=32)
train_on(batch)
Benefits:
- Breaks the correlation between consecutive samples: each batch mixes transitions from many different moments and episodes
- Reuses every transition several times, improving sample efficiency
- Smooths learning over a wide range of past behaviors instead of only the most recent policy
Replay Buffer Visualized:
Episode 1: s₁ → s₂ → s₃ → s₄
Episode 2: s₅ → s₆ → s₇ → s₈
↓
[s₁,s₂,s₃,s₄,s₅,s₆,s₇,s₈] ← Buffer
↓
Random sample: [s₆, s₂, s₈, s₄] ← Training batch
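The training loop later in this lesson assumes a ReplayBuffer class with add, sample, and len support. Here is one minimal sketch of what such a class might look like (a plain deque with uniform sampling; the exact implementation is an assumption, not the only valid one):

import random
from collections import deque

import numpy as np

class ReplayBuffer:
    """Minimal uniform-sampling replay buffer (sketch)."""

    def __init__(self, capacity):
        # deque drops the oldest transition once capacity is reached
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlations
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones = zip(*batch)
        return (np.array(states), np.array(actions), np.array(rewards),
                np.array(next_states), np.array(dones, dtype=np.float32))

    def __len__(self):
        return len(self.buffer)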

Use a separate target network with frozen parameters for computing targets.
Architecture: two copies of the same network. The online network Q(s, a; θ) is trained at every step, while the target network Q(s, a; θ⁻) holds a frozen copy of the parameters that is only refreshed periodically.
Update Rule:
# Standard Q-Learning (unstable)
target = r + γ * max_a Q(s', a; θ)
# DQN with target network (stable)
target = r + γ * max_a Q(s', a; θ⁻)
Why This Works: targets computed with θ⁻ stay fixed between refreshes, so the online network regresses toward a stationary target instead of chasing values that move after every gradient update.
Target Network Update:
every C steps:
    θ⁻ ← θ                      # Hard update (copy parameters)

# Alternative: Soft update (more stable)
every step:
    θ⁻ ← τ*θ + (1-τ)*θ⁻         # τ = 0.001 (Polyak averaging)
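In PyTorch, both update styles might look roughly like this (a sketch; hard_update and soft_update are illustrative helper names, and τ = 0.001 is just the typical value quoted above):

import torch

# Hard update: copy the online network's weights into the target network
def hard_update(target_net, online_net):
    target_net.load_state_dict(online_net.state_dict())

# Soft (Polyak) update: blend a small fraction tau of the online weights in
def soft_update(target_net, online_net, tau=0.001):
    with torch.no_grad():
        for target_param, online_param in zip(target_net.parameters(),
                                              online_net.parameters()):
            target_param.mul_(1.0 - tau).add_(tau * online_param)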

Initialize Q-network Q(s,a; θ) with random weights
Initialize target network Q(s,a; θ⁻) with θ⁻ = θ
Initialize replay buffer D with capacity N
Set hyperparameters: batch_size, learning_rate, γ, ε, C
for episode in range(num_episodes):
    s = env.reset()
    for t in range(max_steps):
        # Select action (ε-greedy)
        if random() < ε:
            a = random_action()
        else:
            a = argmax_a Q(s, a; θ)

        # Take action, observe reward and next state
        s', r, done = env.step(a)

        # Store transition in replay buffer
        D.append((s, a, r, s', done))

        # Sample random minibatch from D
        batch = random_sample(D, batch_size)

        # Compute targets using target network
        for (sⱼ, aⱼ, rⱼ, s'ⱼ, doneⱼ) in batch:
            if doneⱼ:
                yⱼ = rⱼ
            else:
                yⱼ = rⱼ + γ * max_a' Q(s'ⱼ, a'; θ⁻)

        # Perform gradient descent step
        loss = mean((yⱼ - Q(sⱼ, aⱼ; θ))²)
        θ ← θ - learning_rate * ∇_θ loss

        # Update target network every C steps
        if t % C == 0:
            θ⁻ ← θ

        # Move to next state
        s = s'
        if done:
            break

    # Decay epsilon
    ε = max(ε_min, ε * decay_rate)
import torch.nn as nn
import torch.nn.functional as F

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, 128)
        self.fc2 = nn.Linear(128, 128)
        self.fc3 = nn.Linear(128, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        q_values = self.fc3(x)  # No activation on output
        return q_values
Input: State vector (e.g., [pos, vel, angle, angular_vel]) Output: Q-values for each action [Q(s,left), Q(s,right)]
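As a quick illustration of how this network would be used for greedy action selection (the observation values below are made up):

import torch

net = DQN(state_dim=4, action_dim=2)
obs = torch.tensor([[0.02, -0.15, 0.03, 0.20]])    # shape (1, 4): illustrative values
with torch.no_grad():
    q_values = net(obs)                            # shape (1, 2)
greedy_action = q_values.argmax(dim=1).item()      # 0 = left, 1 = right
print(q_values, greedy_action)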
class DQN_Atari(nn.Module):
    def __init__(self, action_dim):
        super().__init__()
        # Convolutional layers for image processing
        self.conv1 = nn.Conv2d(4, 32, kernel_size=8, stride=4)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=4, stride=2)
        self.conv3 = nn.Conv2d(64, 64, kernel_size=3, stride=1)
        # Fully connected layers
        self.fc1 = nn.Linear(64 * 7 * 7, 512)
        self.fc2 = nn.Linear(512, action_dim)

    def forward(self, state):
        # state: (batch, 4, 84, 84) - 4 stacked grayscale frames
        x = F.relu(self.conv1(state))
        x = F.relu(self.conv2(x))
        x = F.relu(self.conv3(x))
        x = x.view(x.size(0), -1)  # Flatten
        x = F.relu(self.fc1(x))
        q_values = self.fc2(x)
        return q_values
Input: 84x84x4 stacked grayscale frames Output: Q-values for each action
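A short shape check confirms the convolutional arithmetic (the choice of 6 actions here is just an example):

import torch

net = DQN_Atari(action_dim=6)          # e.g., 6 actions for Pong's full action set
dummy = torch.zeros(1, 4, 84, 84)      # a batch of one stack of 4 frames
print(net(dummy).shape)                # torch.Size([1, 6])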

import random

import torch
import torch.nn.functional as F
import torch.optim as optim

# Assumes env (e.g. CartPole) and the hyperparameters epsilon, batch_size,
# gamma, and target_update_freq are defined as in the next section.

# Initialize networks
q_network = DQN(state_dim=4, action_dim=2)
target_network = DQN(state_dim=4, action_dim=2)
target_network.load_state_dict(q_network.state_dict())

optimizer = optim.Adam(q_network.parameters(), lr=0.001)
replay_buffer = ReplayBuffer(capacity=10000)

for episode in range(1000):
    state = env.reset()
    episode_reward = 0

    for t in range(500):
        # Epsilon-greedy action selection
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            with torch.no_grad():
                state_tensor = torch.FloatTensor(state).unsqueeze(0)
                q_values = q_network(state_tensor)
                action = q_values.argmax().item()

        # Environment step
        next_state, reward, done, _ = env.step(action)
        replay_buffer.add(state, action, reward, next_state, done)
        episode_reward += reward

        # Train if buffer has enough samples
        if len(replay_buffer) >= batch_size:
            # Sample batch
            states, actions, rewards, next_states, dones = replay_buffer.sample(batch_size)

            # Convert to tensors
            states = torch.FloatTensor(states)
            actions = torch.LongTensor(actions)
            rewards = torch.FloatTensor(rewards)
            next_states = torch.FloatTensor(next_states)
            dones = torch.FloatTensor(dones)

            # Compute current Q-values
            current_q_values = q_network(states).gather(1, actions.unsqueeze(1))

            # Compute target Q-values
            with torch.no_grad():
                next_q_values = target_network(next_states).max(1)[0]
                target_q_values = rewards + (1 - dones) * gamma * next_q_values

            # Compute loss and update
            loss = F.mse_loss(current_q_values.squeeze(), target_q_values)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

        state = next_state
        if done:
            break

    # Update target network every C episodes
    if episode % target_update_freq == 0:
        target_network.load_state_dict(q_network.state_dict())

    print(f"Episode {episode}: Reward = {episode_reward}")
| Parameter | Typical Value | Effect |
|---|---|---|
| Learning rate | 0.0001 - 0.001 | Too high: unstable; too low: slow learning |
| Batch size | 32 - 128 | Larger: more stable but slower |
| Replay buffer size | 10,000 - 1,000,000 | Larger: better diversity but more memory |
| Target update freq | 1,000 - 10,000 steps | Too frequent: unstable; too rare: slow convergence |
| Discount factor γ | 0.95 - 0.99 | Higher: value long-term rewards more |
| Epsilon decay | 0.995 - 0.999 | Faster: exploit sooner; slower: explore longer |
# CartPole (low-dimensional state vector)
learning_rate = 0.001
batch_size = 64
replay_buffer_size = 10000
target_update_freq = 100 # episodes
gamma = 0.99
epsilon_start = 1.0
epsilon_min = 0.01
epsilon_decay = 0.995
# Atari (raw pixel input)
learning_rate = 0.00025
batch_size = 32
replay_buffer_size = 1000000
target_update_freq = 10000 # steps
gamma = 0.99
epsilon_start = 1.0
epsilon_min = 0.1
epsilon_decay = 0.9999  # per-step multiplicative decay; Atari DQN more often anneals ε linearly to 0.1 over ~1M steps
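For Atari-scale training, ε is usually annealed linearly rather than multiplied by a decay factor each step. A small sketch of such a schedule (the function name and defaults are illustrative):

def linear_epsilon(step, eps_start=1.0, eps_min=0.1, anneal_steps=1_000_000):
    # Fall linearly from eps_start to eps_min, then stay at eps_min
    fraction = min(step / anneal_steps, 1.0)
    return eps_start + fraction * (eps_min - eps_start)

print(linear_epsilon(0))          # 1.0
print(linear_epsilon(500_000))    # 0.55
print(linear_epsilon(2_000_000))  # 0.1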
To make learning feasible, Atari frames are preprocessed:
import cv2

def preprocess_frame(frame):
    # 1. Convert to grayscale
    gray = cv2.cvtColor(frame, cv2.COLOR_RGB2GRAY)
    # 2. Resize to 84×84
    resized = cv2.resize(gray, (84, 84))
    # 3. Normalize pixel values to [0, 1]
    normalized = resized / 255.0
    return normalized
Stack 4 consecutive frames to capture motion:
Frame 1: Position at t-3
Frame 2: Position at t-2
Frame 3: Position at t-1
Frame 4: Position at t
↓
State: 84×84×4 tensor (captures velocity implicitly)
Why stack frames? A single frame shows only positions: from one image the network cannot tell which way the ball or paddle is moving. Stacking the last four frames lets it infer velocity and direction of motion implicitly.
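A minimal frame-stacking sketch, assuming preprocess_frame from above and a deque that holds the last four frames (the helper names are illustrative):

from collections import deque

import numpy as np

frame_stack = deque(maxlen=4)

def reset_stack(first_frame):
    # Repeat the first frame 4 times so the stack is full from the start
    for _ in range(4):
        frame_stack.append(first_frame)
    return np.stack(frame_stack, axis=0)   # shape (4, 84, 84)

def step_stack(new_frame):
    frame_stack.append(new_frame)          # the oldest frame drops out
    return np.stack(frame_stack, axis=0)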
Problem: Standard DQN overestimates Q-values because the max operator both selects and evaluates actions using the same noisy estimates.
Standard DQN target:
y = r + γ * max_a' Q(s', a'; θ⁻)
Double DQN target:
a* = argmax_a' Q(s', a'; θ) # Select action with online network
y = r + γ * Q(s', a*; θ⁻) # Evaluate with target network
Benefit: Reduces overestimation bias, more stable learning.
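A sketch of the Double DQN target in PyTorch, reusing the tensor names from the training loop above (q_network, target_network, next_states, rewards, dones, gamma):

import torch

with torch.no_grad():
    # 1. Select the best next action with the online network
    next_actions = q_network(next_states).argmax(dim=1, keepdim=True)
    # 2. Evaluate that action with the target network
    next_q = target_network(next_states).gather(1, next_actions).squeeze(1)
    # 3. Standard bootstrapped target, zeroed at terminal states
    double_dqn_targets = rewards + (1 - dones) * gamma * next_q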
Architecture: Split Q-network into value and advantage streams.
Q(s,a) = V(s) + [A(s,a) - mean_a' A(s,a')]
Where V(s) estimates how good it is to be in state s regardless of the action taken, and A(s, a) estimates how much better action a is than the average action in that state.
Benefit: Learns state values independently of actions, better for states where action choice doesn't matter.
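One way to sketch a dueling head for vector states (this mirrors the formula above; the layer sizes are arbitrary choices, and the original paper uses a convolutional trunk for Atari):

import torch
import torch.nn as nn
import torch.nn.functional as F

class DuelingDQN(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.feature = nn.Linear(state_dim, 128)
        self.value_stream = nn.Linear(128, 1)                # V(s)
        self.advantage_stream = nn.Linear(128, action_dim)   # A(s, a)

    def forward(self, state):
        x = F.relu(self.feature(state))
        value = self.value_stream(x)
        advantage = self.advantage_stream(x)
        # Q(s,a) = V(s) + (A(s,a) - mean_a' A(s,a'))
        return value + advantage - advantage.mean(dim=1, keepdim=True)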
Idea: Sample important transitions more frequently.
Priority: TD error magnitude |δ| = |r + γ max Q(s',a') - Q(s,a)|
Benefit: Learn faster from surprising transitions.
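A toy sketch of proportional prioritized sampling (real implementations add importance-sampling weights and a sum-tree for efficiency; the numbers below are made up):

import numpy as np

td_errors = np.array([0.1, 2.0, 0.5, 0.05])         # illustrative |δ| values
alpha = 0.6                                          # how strongly to prioritize
priorities = (np.abs(td_errors) + 1e-6) ** alpha     # avoid zero probability
probs = priorities / priorities.sum()
batch_indices = np.random.choice(len(td_errors), size=2, p=probs)
print(probs, batch_indices)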
Symptom: Agent learns well, then suddenly forgets and performance drops.
Causes:
- Replay buffer too small, so useful old experience is overwritten before it is consolidated
- Learning rate too high
- Target network updated too frequently
Fixes:
- Increase the replay buffer size
- Lower the learning rate
- Update the target network less often
Symptom: Q-values grow unrealistically large, unstable learning.
Cause: Max operator in Q-learning inherently overestimates.
Fixes:
- Use Double DQN: select the greedy action with the online network but evaluate it with the target network (see the Double DQN section above)
Symptom: Q-values stay near zero, no improvement.
Causes:
- Learning rate too low
- Rewards too sparse, or a bug that keeps the reward signal at zero
- ε never decaying, so the agent keeps acting almost entirely at random
Fixes:
- Raise the learning rate
- Verify the environment actually returns non-zero rewards
- Check the ε-decay schedule and exploration settings
Symptom: Q-values explode, agent performance collapses.
Causes:
- Learning rate too high
- Targets computed without a target network, or the target network updated too often
- Unnormalized or very large inputs and rewards
Fixes:
- Lower the learning rate
- Use the Huber (smooth L1) loss instead of MSE
- Clip gradient norms (see the sketch below)
- Normalize observations and clip rewards (e.g., to [-1, 1], as in Atari DQN)
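A sketch of those two stabilizers dropped into the training step from earlier, reusing the same variable names (max_norm=10.0 is just a common choice, not a requirement):

import torch
import torch.nn.functional as F

# Huber (smooth L1) loss grows linearly for large errors, limiting their impact
loss = F.smooth_l1_loss(current_q_values.squeeze(), target_q_values)
optimizer.zero_grad()
loss.backward()
# Clip the global gradient norm before the parameter update
torch.nn.utils.clip_grad_norm_(q_network.parameters(), max_norm=10.0)
optimizer.step()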
Performance:
Training:
DQN launched the deep RL revolution, but it has limitations: it only handles discrete action spaces, it needs millions of environment steps and careful hyperparameter tuning, and it learns action values rather than a policy directly.
Next lesson, we'll explore a fundamentally different approach: directly learning policies instead of value functions.