Practice and reinforce the concepts from Lesson 2
In this activity, you'll implement tabular Q-Learning from scratch and train an agent to solve the FrozenLake environment: a slippery grid world where the agent must navigate from start to goal while avoiding holes.
By completing this activity, you will:
- Implement the core pieces of tabular Q-Learning: Q-table initialization, epsilon-greedy action selection, the Q-Learning update rule, and an epsilon decay schedule
- Train an agent on the slippery FrozenLake-v1 environment and track its success rate
- Interpret a learned Q-table and the greedy policy it implies
Download the activity template from the Templates folder:
1. Download AI25-Template-activity-02-q-learning-and-value-functions.zip from Templates/AI25-Template-activity-02-q-learning-and-value-functions.zip
2. Upload activity-02-q-learning-and-value-functions.ipynb to Google Colab
3. Execute the first few cells to set up the environment
FrozenLake-v1 Grid World:
```
S F F F
F H F H
F F F H
H F F G
```

(S = start, F = frozen surface, H = hole, G = goal)

- States: 16 (4x4 grid)
- Actions: 4 (Left=0, Down=1, Right=2, Up=3)
- Rewards: +1 for reaching the goal, 0 otherwise
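To confirm these numbers in the notebook, a quick check might look like this (a sketch assuming the `gym.make` API used later in this activity; with `gymnasium` the same calls work):

```python
import gym  # or: import gymnasium as gym

# Default 4x4 slippery FrozenLake used in this activity
env = gym.make('FrozenLake-v1', map_name='4x4', is_slippery=True)

print(env.observation_space.n)  # 16 discrete states
print(env.action_space.n)       # 4 discrete actions (Left, Down, Right, Up)
```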
TODO 1: Initialize Q-table
TODO 2: Implement epsilon-greedy action selection
TODO 3: Implement Q-Learning update rule
TODO 4: Implement epsilon decay schedule
You'll complete the missing pieces of the notebook's training loop; a rough sketch of what these pieces typically look like is shown below.
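The template fixes the exact function signatures, so treat the following only as a hedged sketch of the four TODOs; names like `n_states`, `n_actions`, and the decay constants are illustrative, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng()

# TODO 1: a Q-table of zeros, one row per state, one column per action
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))

# TODO 2: epsilon-greedy action selection
def choose_action(Q, state, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: greedy action

# TODO 3: the Q-Learning update rule (off-policy, uses the max over next actions)
def q_update(Q, state, action, reward, next_state, done, alpha, gamma):
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# TODO 4: a simple exponential epsilon decay with a floor
def decay_epsilon(epsilon, decay=0.999, min_epsilon=0.01):
    return max(min_epsilon, epsilon * decay)
```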
Dashboards showing training progress over episodes. Expect a progression roughly like this:
Episodes 0-100: Random exploration, low success rate (<5%)
Episodes 100-500: Q-values start propagating backward from goal
Episodes 500-1000: Success rate increases to 40-60%
Episodes 1000+: Convergence, success rate 70-80%
| State | Left | Down | Right | Up |
|---|---|---|---|---|
| 0 (Start) | 0.32 | 0.41 | 0.28 | 0.35 |
| 4 (Near hole) | 0.01 | 0.58 | 0.03 | 0.12 |
| 15 (Goal) | 0.00 | 0.00 | 0.00 | 0.00 |
Interpretation: At state 0, the best action is Down (Q=0.41). At state 4, the best action is Down (Q=0.58), steering away from the hole at state 5 to its right. Rows for terminal states (holes and the goal, such as state 15) stay at 0 because no update is ever made from a terminal state.
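To read a full Q-table at a glance, it helps to print the greedy policy it implies. A small helper sketch (the helper name and arrow symbols are just for illustration):

```python
import numpy as np

def print_greedy_policy(Q, n_rows=4, n_cols=4):
    """Print the highest-valued action for each state as an arrow on the grid."""
    arrows = {0: '<', 1: 'v', 2: '>', 3: '^'}   # Left, Down, Right, Up
    greedy = np.argmax(Q, axis=1)               # best action per state
    for r in range(n_rows):
        row = greedy[r * n_cols:(r + 1) * n_cols]
        print(' '.join(arrows[int(a)] for a in row))
```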
Your implementation is complete when the notebook runs end to end without errors, the success rate converges to roughly 70-80% on the slippery 4x4 map, and the greedy policy read from your Q-table looks sensible (compare with the example above).
FrozenLake is stochastic: the action you choose is executed as intended only about a third of the time!

```
# You press RIGHT. Actual outcome:
# - 1/3: move right (intended)
# - 1/3: slip up (perpendicular)
# - 1/3: slip down (perpendicular)
```
This is why success rates cap around 70-80% (not 100%).
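You can see these probabilities directly in the environment's transition table. The sketch below assumes the `P` attribute exposed by the toy-text environments in gym/gymnasium:

```python
import gym  # or: import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=True)

# P[state][action] is a list of (probability, next_state, reward, terminated)
state, action = 0, 2  # state 0, pressing RIGHT
for prob, next_state, reward, terminated in env.unwrapped.P[state][action]:
    print(f"p={prob:.2f} -> state {next_state}, reward {reward}, done {terminated}")
```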
Learning Rate (α): controls how strongly each update overwrites the old Q-value. Too high and training oscillates; too low and learning is very slow.
Discount Factor (γ): weights future reward. Because the only reward sits at the goal, keep γ close to 1 so that reward can propagate all the way back to the start.
Epsilon Decay: start near 1.0 (mostly exploring) and decay toward a small floor so the agent eventually exploits what it has learned without ever stopping exploration entirely.
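As a concrete starting point (these specific numbers are an assumption, not values taken from the template), a common configuration for the slippery 4x4 map is:

```python
# Typical starting hyperparameters for slippery 4x4 FrozenLake; tune from here
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor, kept close to 1 for the delayed goal reward
epsilon = 1.0          # initial exploration rate
epsilon_decay = 0.999  # applied once per episode
min_epsilon = 0.01     # exploration floor
n_episodes = 10_000    # a stochastic map needs many episodes to converge
```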
Q-values stay at 0: check that the update actually writes back into the table, and remember that nothing can propagate until the agent reaches the goal at least once; with too little exploration it may never get there.
Success rate not improving: verify that epsilon really decays (otherwise the agent keeps acting randomly) and that the update target uses next_state rather than the current state.
Training unstable: lower the learning rate and evaluate with a moving average over many episodes; single-episode results on a slippery map are very noisy.
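One way to compute that moving average, assuming you collected per-episode returns in a list (episode_rewards is an illustrative name, not one from the template):

```python
import numpy as np

def moving_average(values, window=100):
    """Average each value together with its previous window-1 neighbours."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

# success_rate = moving_average(episode_rewards, window=100)
# On FrozenLake the per-episode return is 0 or 1, so this is exactly the success rate.
```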
Once you complete the basic activity, try these extensions:
```python
env = gym.make('FrozenLake-v1', is_slippery=False)
```
Train on the non-slippery version. You should achieve a 100% success rate!
```python
env = gym.make('FrozenLake-v1', map_name='8x8')
```
Scale to the 8x8 grid (64 states). Does your agent still learn?
Implement SARSA (the on-policy counterpart of Q-Learning):
```
# Q-Learning: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]
# SARSA:      Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)]
#             where a' in the SARSA target is the action actually taken (not the max)
```
Compare convergence speed and final performance.
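A hedged sketch of a SARSA episode, reusing the illustrative choose_action helper from the earlier sketch (the step API shown returns five values, as in gym ≥ 0.26 / gymnasium; older gym versions return four):

```python
def sarsa_episode(env, Q, epsilon, alpha, gamma, choose_action):
    """Run one SARSA episode, updating Q in place with the on-policy target."""
    state, _ = env.reset()
    action = choose_action(Q, state, epsilon)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # On-policy: the next action comes from the same epsilon-greedy policy ...
        next_action = choose_action(Q, next_state, epsilon)
        # ... and that action (not the max) appears in the target.
        target = reward if done else reward + gamma * Q[next_state, next_action]
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```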
Implement Double Q-Learning to reduce overestimation bias:
```
# Maintain two Q-tables: Q1 and Q2
# Update Q1 using Q2 for the target, and vice versa
```
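One way the update step might look, as a sketch under the same illustrative assumptions as the earlier snippets (a fair coin decides which table gets updated):

```python
import numpy as np

rng = np.random.default_rng()

def double_q_update(Q1, Q2, state, action, reward, next_state, done, alpha, gamma):
    """Double Q-Learning: one table selects the next action, the other evaluates it."""
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2   # update Q1, evaluate with Q2
    else:
        Q_sel, Q_eval = Q2, Q1   # update Q2, evaluate with Q1
    best_next = int(np.argmax(Q_sel[next_state]))
    target = reward if done else reward + gamma * Q_eval[next_state, best_next]
    Q_sel[state, action] += alpha * (target - Q_sel[state, action])

# For action selection during training, act greedily on the sum Q1 + Q2 (plus exploration).
```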
Completed Notebook: activity-02-q-learning-and-value-functions.ipynb
Performance Report: Brief summary including your final success rate, the hyperparameters you used, and a short interpretation of the learned Q-table.
After completing this activity, you'll have built and tuned a complete tabular RL agent end to end.
In the next lesson, you'll extend Q-Learning to handle complex environments with millions of states using neural networks!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy learning tabular Q-Learning! 🎯