ℹ️ Definition: Q-Learning is a model-free reinforcement learning algorithm that learns the value of taking each action in each state, enabling an agent to discover optimal policies without needing a model of the environment.
By the end of this lesson, you will:
In Lesson 1, we saw how a random agent performs poorly on RL tasks like CartPole. The agent had no way to learn which actions were good or bad in different situations. Q-Learning solves this by learning a value function that estimates how good each action is in each state.
Think of Q-Learning as building a "cheat sheet" for the environment:
By building this cheat sheet through experience, the agent can make informed decisions.
The state-value function V(s) tells us "how good is it to be in state s?"
Mathematically:
V(s) = Expected total reward starting from state s and following policy π
Example - Chess:
The action-value function Q(s,a) tells us "how good is it to take action a in state s?"
Mathematically:
Q(s,a) = Expected total reward starting from state s, taking action a, then following policy π
Key difference: V(s) evaluates states, Q(s,a) evaluates state-action pairs.
Example - Chess:
The agent chooses the action with highest Q-value (castle kingside).
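For a greedy agent, the two value functions are linked: the value of a state is the value of its best action, V(s) = max_a Q(s, a). A tiny illustration of this relationship, using made-up Q-values and move names for the chess example:

```python
# Hypothetical Q-values for one chess position and three candidate moves
q_values = {"castle_kingside": 0.62, "push_pawn": 0.41, "trade_queens": 0.35}

best_move = max(q_values, key=q_values.get)   # action with the highest Q-value
state_value = q_values[best_move]             # V(s) = max_a Q(s, a) for a greedy agent
print(best_move, state_value)                 # castle_kingside 0.62
```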

The Bellman equation is the fundamental relationship in RL. It expresses value functions recursively.
Q(s, a) = R(s,a) + γ * max_a' Q(s', a')
Where:
- Q(s, a): the value of taking action a in state s
- R(s, a): the immediate reward for taking action a in state s
- γ: the discount factor (between 0 and 1)
- s': the next state
- max_a' Q(s', a'): the value of the best action available in the next state

In words: the value of an action = immediate reward + discounted value of the best future action.
Example - Video Game:
```
Q(here, go_right) = +5 (collect coin) + 0.9 * max(Q(next_room, all_actions))
                  = +5 + 0.9 * 10
                  = +5 + 9
                  = 14
```
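The same backup is one line of arithmetic in Python; the reward of +5, the discount of 0.9, and the best next-room value of 10 are just the illustrative numbers assumed above:

```python
reward = 5.0          # coin collected by going right
gamma = 0.9           # discount factor
best_next_q = 10.0    # assumed max Q-value over all actions in the next room

q_value = reward + gamma * best_next_q
print(q_value)        # 14.0
```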

The discount factor γ determines how much we value future rewards:
- γ close to 0: the agent is myopic and cares mostly about immediate rewards
- γ close to 1: the agent is far-sighted and values long-term rewards almost as much as immediate ones
Example:
```
Immediate reward:  $100
Reward in 1 step:  $100
Reward in 2 steps: $100

With γ = 0.9:
Total value = 100 + 0.9*100 + 0.9²*100 = 100 + 90 + 81 = 271

With γ = 0.5:
Total value = 100 + 0.5*100 + 0.5²*100 = 100 + 50 + 25 = 175
```
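A quick way to check these totals is to sum the discounted rewards directly; a minimal sketch:

```python
def discounted_total(rewards, gamma):
    """Sum of gamma**t * r_t over a sequence of rewards."""
    return sum((gamma ** t) * r for t, r in enumerate(rewards))

rewards = [100, 100, 100]
print(discounted_total(rewards, 0.9))  # 100 + 90 + 81 = 271
print(discounted_total(rewards, 0.5))  # 100 + 50 + 25 = 175
```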
Q-Learning learns Q-values through experience using temporal difference (TD) learning.
Q(s,a) ← Q(s,a) + α * [R + γ * max_a' Q(s',a') - Q(s,a)]
Where:
- α (learning rate): how much each update changes the current estimate
- R: the reward received after taking action a in state s
- γ: the discount factor
- s': the next state
- The bracketed term [R + γ * max_a' Q(s',a') - Q(s,a)] is the TD error: the gap between the new estimate and the old one
Analogy - Learning Restaurant Quality:

For small state/action spaces, we use a Q-table:
| | Action 0 | Action 1 | Action 2 | Action 3 |
|---|---|---|---|---|
| State 0 | 0.0 | 0.0 | 0.0 | 0.0 |
| State 1 | 0.0 | 0.0 | 0.0 | 0.0 |
| State 2 | 0.0 | 0.0 | 0.0 | 0.0 |
| ... | ... | ... | ... | ... |
After learning:
| | Action 0 | Action 1 | Action 2 | Action 3 |
|---|---|---|---|---|
| State 0 | 3.2 | 5.7 | 2.1 | 4.8 |
| State 1 | 7.3 | 2.4 | 8.9 | 1.2 |
| State 2 | 1.5 | 9.2 | 3.7 | 6.4 |
After learning, the agent selects the action with the highest Q-value in each state.
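In code, a Q-table is just a 2-D array indexed by state and action. A minimal numpy sketch of initialization, greedy action selection, and one Q-Learning update (the shape matches FrozenLake, used later in this lesson; the transition values are made up):

```python
import numpy as np

n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))        # every estimate starts at 0.0

state = 1
greedy_action = int(np.argmax(Q[state]))   # column with the highest value in this row

# One Q-Learning update for an observed (s, a, r, s') transition
alpha, gamma = 0.1, 0.99
s, a, r, s_next = 1, 2, 0.0, 5
td_target = r + gamma * np.max(Q[s_next])
Q[s, a] += alpha * (td_target - Q[s, a])
```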
Q-Learning must balance:
- Exploration: trying new actions to discover how good they are
- Exploitation: choosing the best-known action to maximize reward
The most common exploration strategy is ε-greedy:
```
if random() < epsilon:
    action = random_action()    # Explore
else:
    action = argmax(Q[state])   # Exploit
```
Parameters:
Often, we start with high ε (exploration) and gradually reduce it (exploitation):
```
epsilon = epsilon_start * (epsilon_decay ** episode)
```

Example:

```
Episode 0:    ε = 1.0   (explore)
Episode 100:  ε = 0.5   (balanced)
Episode 500:  ε = 0.1   (mostly exploit)
Episode 1000: ε = 0.01  (minimal exploration)
```
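A sketch of computing the schedule in Python; the decay rate 0.995 and the floor of 0.01 are assumptions, so the printed values will not exactly match the illustrative numbers above:

```python
epsilon_start, epsilon_decay, epsilon_min = 1.0, 0.995, 0.01

for episode in (0, 100, 500, 1000):
    epsilon = max(epsilon_min, epsilon_start * epsilon_decay ** episode)
    print(episode, round(epsilon, 3))   # 1.0, then roughly 0.61, 0.08, 0.01
```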

```
Initialize Q(s,a) arbitrarily for all s, a
Set hyperparameters: α, γ, ε

For each episode:
    Initialize state s

    While episode not done:
        # Choose action using ε-greedy
        if random() < ε:
            a = random_action()
        else:
            a = argmax_a Q(s,a)

        # Take action, observe reward and next state
        s', r = env.step(a)

        # Q-Learning update
        Q(s,a) = Q(s,a) + α * [r + γ * max_a' Q(s',a') - Q(s,a)]

        # Move to next state
        s = s'

    # Decay epsilon
    ε = ε * decay_rate
```
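The pseudocode translates almost line for line into Python. Below is a sketch for any small discrete-state, discrete-action environment with a gymnasium-style API; the function name and hyperparameter defaults are my own choices, not a fixed recipe:

```python
import numpy as np

def q_learning(env, episodes=2000, alpha=0.1, gamma=0.99,
               epsilon=1.0, epsilon_decay=0.999, epsilon_min=0.01):
    """Tabular Q-Learning for small discrete gymnasium-style environments."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    rng = np.random.default_rng(0)

    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Choose action using ε-greedy
            if rng.random() < epsilon:
                action = env.action_space.sample()     # Explore
            else:
                action = int(np.argmax(Q[state]))      # Exploit

            # Take action, observe reward and next state
            next_state, reward, terminated, truncated, _ = env.step(action)
            done = terminated or truncated

            # Q-Learning update
            td_target = reward + gamma * np.max(Q[next_state])
            Q[state, action] += alpha * (td_target - Q[state, action])

            # Move to next state
            state = next_state

        # Decay epsilon after each episode
        epsilon = max(epsilon_min, epsilon * epsilon_decay)

    return Q
```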
FrozenLake is a 4x4 grid world perfect for learning Q-Learning:
```
S F F F
F H F H
F F F H
H F F G
```

(S = start, F = frozen surface, H = hole, G = goal)
Challenge: The ice is slippery! When you try to move in a direction, you might slip to adjacent cells.
- States: 16 grid positions (0-15)
- Actions: 4 (Left=0, Down=1, Right=2, Up=3)
- Rewards: +1 for reaching the goal, 0 otherwise
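A minimal way to try this with the gymnasium package, reusing the q_learning helper sketched above (the hyperparameters are illustrative, and the exact success rate you measure will vary from run to run):

```python
import gymnasium as gym
import numpy as np

env = gym.make("FrozenLake-v1", is_slippery=True)
Q = q_learning(env, episodes=5000)      # helper sketched in the previous section

# Evaluate the learned greedy policy
episodes, successes = 1000, 0
for _ in range(episodes):
    state, _ = env.reset()
    done = False
    while not done:
        state, reward, terminated, truncated, _ = env.step(int(np.argmax(Q[state])))
        done = terminated or truncated
    successes += reward                 # reward is 1 only when the goal is reached
print(f"Success rate: {successes / episodes:.0%}")
```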
- Episode 1: Random walk; the agent falls in a hole and the Q-table stays mostly zeros.
- Episode 10: Some positive Q-values start appearing near the goal.
- Episode 100: Q-values propagate backward from the goal.
- Episode 1000: The optimal path is learned; the agent reaches the goal >70% of the time.
The learning rate α controls how much each update changes the Q-values:
- Too low: learning is slow and may not converge in a reasonable time.
- Too high: learning is unstable and Q-values oscillate wildly.
The discount factor γ controls how future rewards are valued:
- Too low: myopic behavior; the agent only optimizes immediate rewards.
- Too high: values propagate slowly, requiring more episodes.
The exploration rate ε controls how often the agent tries non-greedy actions:
- Decay too fast: the agent commits to a suboptimal policy.
- Decay too slow: the agent wastes time exploring when the policy is already good.
Q-Learning is guaranteed to converge to the optimal Q-values if:
- Every state-action pair is visited (and updated) infinitely often
- The learning rate α decays appropriately over time (its sum diverges while the sum of its squares stays finite)
In practice, we use constant α and sufficient exploration (ε-greedy), which works well empirically.
Track these metrics:
| Aspect | Q-Learning | SARSA | Monte Carlo |
|---|---|---|---|
| Learning | Off-policy | On-policy | On-policy |
| Update | TD (temporal difference) | TD | Full episode |
| Target | max_a Q(s',a') | Q(s',a') from policy | Actual return G |
| Optimism | Optimistic (learns optimal policy) | Conservative (learns followed policy) | Exact (uses real returns) |
| Speed | Fast | Fast | Slow (needs full episode) |
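The practical difference between the three methods is the update target each one moves its estimate toward. A sketch with placeholder values (the Q-table, gamma, and the transition below are invented just to make the three targets concrete):

```python
import numpy as np

# Illustrative placeholders: a small Q-table and one observed transition
Q = np.zeros((16, 4))
reward, gamma = 0.0, 0.99
next_state, next_action = 5, 2       # next state, and the policy's actual next action
future_rewards = [0.0, 0.0, 1.0]     # rewards from this step to episode end (first entry = reward)

# Q-Learning (off-policy): bootstrap from the best next action
q_learning_target = reward + gamma * np.max(Q[next_state])

# SARSA (on-policy): bootstrap from the action the policy actually takes next
sarsa_target = reward + gamma * Q[next_state, next_action]

# Monte Carlo: no bootstrapping, use the actual discounted return G
monte_carlo_target = sum(gamma ** t * r for t, r in enumerate(future_rewards))
```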
Solution for scaling beyond small Q-tables: Deep Q-Networks (DQN) - next lesson!
In the next lesson, we'll extend Q-Learning to handle:
Get ready to train agents that play Atari games from pixels!