ℹ️ Definition: Q-Learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state, enabling an agent to discover optimal policies without needing a model of the environment.
By the end of this lesson, you will:
In Lesson 1, we saw how a random agent performs poorly on RL tasks like CartPole. The agent had no way to learn which actions were good or bad in different situations. Q-Learning solves this by learning a value function that estimates how good each action is in each state.
Think of Q-Learning as building a "cheat sheet" for the environment:
By building this cheat sheet through experience, the agent can make informed decisions.
The state-value function V(s) tells us "how good is it to be in state s?"
Mathematically:
V(s) = Expected total reward starting from state s and following policy π
Example - Chess:
The action-value function Q(s,a) tells us "how good is it to take action a in state s?"
Mathematically:
Q(s,a) = Expected total reward starting from state s, taking action a, then following policy π
Key difference: V(s) evaluates states, Q(s,a) evaluates state-action pairs.
Example - Chess:
The agent chooses the action with highest Q-value (castle kingside).
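For a greedy agent, the two value functions are linked: the value of a state is the value of its best action, V(s) = max_a Q(s, a). A tiny illustration of this relationship, using made-up Q-values and move names for the chess example:

```python
# Hypothetical Q-values for one chess position and three candidate moves
q_values = {"castle_kingside": 0.62, "push_pawn": 0.41, "trade_queens": 0.35}

best_move = max(q_values, key=q_values.get)   # action with the highest Q-value
state_value = q_values[best_move]             # V(s) = max_a Q(s, a) for a greedy agent
print(best_move, state_value)                 # castle_kingside 0.62
```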

The Bellman equation is the fundamental relationship in RL. It expresses value functions recursively.
Q(s, a) = R(s,a) + γ * max_a' Q(s', a')
Where:
- Q(s, a): the value of taking action a in state s
- R(s, a): the immediate reward for taking action a in state s
- γ: the discount factor (between 0 and 1)
- s': the next state
- max_a' Q(s', a'): the value of the best action available in the next state

In words: the value of an action = immediate reward + discounted value of the best future action.
Example - Video Game:
```
Q(here, go_right) = +5 (collect coin) + 0.9 * max(Q(next_room, all_actions))
                  = +5 + 0.9 * 10
                  = +5 + 9
                  = 14
```
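The same backup is one line of arithmetic in Python; the reward of +5, the discount of 0.9, and the best next-room value of 10 are just the illustrative numbers assumed above:

```python
reward = 5.0          # coin collected by going right
gamma = 0.9           # discount factor
best_next_q = 10.0    # assumed max Q-value over all actions in the next room

q_value = reward + gamma * best_next_q
print(q_value)        # 14.0
```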

The discount factor γ determines how much we value future rewards:
- γ close to 0: the agent is myopic and cares mostly about immediate rewards
- γ close to 1: the agent is far-sighted and values long-term rewards almost as much as immediate ones
Example:
```
Immediate reward:  $100
Reward in 1 step:  $100
Reward in 2 steps: $100

With γ = 0.9:
Total value = 100 + 0.9*100 + 0.9²*100 = 100 + 90 + 81 = 271

With γ = 0.5:
Total value = 100 + 0.5*100 + 0.5²*100 = 100 + 50 + 25 = 175
```
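A quick way to check these totals is to sum the discounted rewards directly; a minimal sketch:

```python
def discounted_total(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [100, 100, 100]
print(discounted_total(rewards, 0.9))  # 100 + 90 + 81 = 271
print(discounted_total(rewards, 0.5))  # 100 + 50 + 25 = 175
```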
Q-Learning learns Q-values through experience using temporal difference (TD) learning.
Q(s,a) ← Q(s,a) + α * [R + γ * max_a' Q(s',a') - Q(s,a)]
Where:
- α (learning rate): how much each update changes the current estimate
- R: the reward received after taking action a in state s
- γ: the discount factor
- s': the next state
- The bracketed term [R + γ * max_a' Q(s',a') - Q(s,a)] is the TD error: the gap between the new estimate and the old one
Analogy - Learning Restaurant Quality:

For small state/action spaces, we use a Q-table:
| | Action 0 | Action 1 | Action 2 | Action 3 |
|---|---|---|---|---|
| State 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| State 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| State 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
After learning:
| | Action 0 | Action 1 | Action 2 | Action 3 |
|---|---|---|---|---|
| State 0 | 3.2 | 5.7 | 2.1 | 4.8 |
| State 1 | 7.3 | 2.4 | 8.9 | 1.2 |
| State 2 | 1.5 | 9.2 | 3.7 | 6.4 |
After learning, the agent selects the action with the highest Q-value in each state.
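In code, a Q-table is just a 2-D array indexed by state and action. A minimal numpy sketch of initialization, greedy action selection, and one Q-Learning update (the shape matches FrozenLake, used later in this lesson; the transition values are made up):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))        # every estimate starts at 0.0

state = 1
greedy_action = int(np.argmax(Q[state]))   # column with the highest value in this row

# One Q-Learning update for an observed (s, a, r, s') transition
alpha, gamma = 0.1, 0.99
s, a, r, s_next = 1, 2, 0.0, 5
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])
```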
Q-Learning must balance:
- Exploration: trying new actions to discover how good they are
- Exploitation: choosing the best-known action to maximize reward
The most common exploration strategy is ε-greedy:
```
if random() < epsilon:
    action = random_action()    # Explore
else:
    action = argmax(Q[state])   # Exploit
```
Parameters:
Often, we start with high ε (exploration) and gradually reduce it (exploitation):
```
epsilon = epsilon_start * (epsilon_decay ** episode)
```

Example:

```
Episode 0:    ε = 1.0   (explore)
Episode 100:  ε = 0.5   (balanced)
Episode 500:  ε = 0.1   (mostly exploit)
Episode 1000: ε = 0.01  (minimal exploration)
```
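A sketch of computing the schedule in Python; the decay rate 0.995 and the floor of 0.01 are assumptions, so the printed values will not exactly match the illustrative numbers above:

```python
epsilon_start, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01

for episode in (0, 100, 500, 1000):
    epsilon = max(epsilon_min, epsilon_start * epsilon_decay ** episode)
    print(episode, round(epsilon, 3))   # 1.0, then roughly 0.61, 0.08, 0.01
```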

```
Initialize Q(s,a) arbitrarily for all s, a
Set hyperparameters: α, γ, ε

For each episode:
    Initialize state s

    While episode not done:
        # Choose action using ε-greedy
        if random() < ε:
            a = random_action()
        else:
            a = argmax_a Q(s,a)

        # Take action, observe reward and next state
        s', r = env.step(a)

        # Q-Learning update
        Q(s,a) = Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]

        # Move to next state
        s = s'

    # Decay epsilon
    ε = ε * decay_rate
```
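The pseudocode translates almost line for line into Python. Below is a sketch for any small discrete-state, discrete-action environment with a gymnasium-style API; the function name and hyperparameter defaults are my own choices, not a fixed recipe:

```python
import numpy as np

def q_learning(env, episodes=2000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_decay=0.999, epsilon_min=0.01):
    """Tabular Q-Learning for small discrete gymnasium-style environments."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Choose action using ε-greedy
            if rng.random() < epsilon:
                action = env.action_space.sample()     # Explore
            else:
                action = int(np.argmax(Q[state]))      # Exploit

            # Take action, observe reward and next state
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-Learning update
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])

            # Move to next state
            state = next_state

        # Decay epsilon after each episode
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    return Q
```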
FrozenLake is a 4x4 grid world perfect for learning Q-Learning:
```
S F F F
F H F H
F F F H
H F F G
```

(S = start, F = frozen surface, H = hole, G = goal)
Challenge: The ice is slippery! When you try to move in a direction, you might slip to adjacent cells.
- States: 16 grid positions (0-15)
- Actions: 4 (Left=0, Down=1, Right=2, Up=3)
- Rewards: +1 for reaching the goal, 0 otherwise
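A minimal way to try this with the gymnasium package, reusing the q_learning helper sketched above (the hyperparameters are illustrative, and the exact success rate you measure will vary from run to run):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)
Q = q_learning(env, episodes=5000)      # helper sketched in the previous section

# Evaluate the learned greedy policy
episodes, successes = 1000, 0
for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        state, reward, terminated, truncated, _ = env.step(int(np.argmax(Q[state])))
        done = terminated or truncated
    successes += reward                 # reward is 1 only when the goal is reached
print(f"Success rate: {successes / episodes:.0%}")
```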
- Episode 1: Random walk; the agent falls in a hole and the Q-table stays mostly zeros.
- Episode 10: Some positive Q-values start appearing near the goal.
- Episode 100: Q-values propagate backward from the goal.
- Episode 1000: The optimal path is learned; the agent reaches the goal >70% of the time.
The learning rate α controls how much each update changes the Q-values:
- Too low: learning is slow and may not converge in a reasonable time.
- Too high: learning is unstable and Q-values oscillate wildly.
The discount factor γ controls how future rewards are valued:
- Too low: myopic behavior; the agent only optimizes immediate rewards.
- Too high: values propagate slowly, requiring more episodes.
The exploration rate ε controls how often the agent tries non-greedy actions:
- Decay too fast: the agent commits to a suboptimal policy.
- Decay too slow: the agent wastes time exploring when the policy is already good.
Q-Learning is guaranteed to converge to the optimal Q-values if:
- Every state-action pair is visited (and updated) infinitely often
- The learning rate α decays appropriately over time (its sum diverges while the sum of its squares stays finite)
In practice, we use constant α and sufficient exploration (ε-greedy), which works well empirically.
Track these metrics:
| Aspect | Q-Learning | SARSA | Monte Carlo |
|---|---|---|---|
| Learning | Off-policy | On-policy | On-policy |
| Update | TD (temporal difference) | TD | Full episode |
| Target | max_a Q(s',a') | Q(s',a') from policy | Actual return G |
| Optimism | Optimistic (learns optimal policy) | Conservative (learns followed policy) | Exact (uses real returns) |
| Speed | Fast | Fast | Slow (needs full episode) |
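The practical difference between the three methods is the update target each one moves its estimate toward. A sketch with placeholder values (the Q-table, gamma, and the transition below are invented just to make the three targets concrete):

```python
import numpy as np

# Illustrative placeholders: a small Q-table and one observed transition
Q = np.zeros((16, 4))
reward, gamma = 0.0, 0.99
next_state, next_action = 5, 2       # next state, and the policy's actual next action
future_rewards = [0.0, 0.0, 1.0]     # rewards from this step to episode end (first entry = reward)

# Q-Learning (off-policy): bootstrap from the best next action
q_learning_target = reward + gamma * np.max(Q[next_state])

# SARSA (on-policy): bootstrap from the action the policy actually takes next
sarsa_target = reward + gamma * Q[next_state, next_action]

# Monte Carlo: no bootstrapping, use the actual discounted return G
monte_carlo_target = sum(gamma ** t * r for t, r in enumerate(future_rewards))
```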
Solution for scaling beyond small Q-tables: Deep Q-Networks (DQN) - next lesson!
In the next lesson, we'll extend Q-Learning to handle:
Get ready to train agents that play Atari games from pixels!