Practice and reinforce the concepts from Lesson 2
In this activity, you'll implement tabular Q-Learning from scratch and train an agent to solve the FrozenLake environment: a slippery grid world where the agent must navigate from start to goal while avoiding holes.
By completing this activity, you will:
- Implement the core pieces of tabular Q-Learning: Q-table initialization, epsilon-greedy action selection, the Q-Learning update rule, and an epsilon decay schedule
- Train an agent on the slippery FrozenLake-v1 environment and track its success rate
- Interpret a learned Q-table and the greedy policy it implies
Download the activity template from the Templates folder:
1. Download AI25-Template-activity-02-q-learning-and-value-functions.zip from Templates/AI25-Template-activity-02-q-learning-and-value-functions.zip
2. Upload activity-02-q-learning-and-value-functions.ipynb to Google Colab
3. Execute the first few cells to set up the environment
FrozenLake-v1 Grid World:
```
S F F F
F H F H
F F F H
H F F G
```

(S = start, F = frozen surface, H = hole, G = goal)

- States: 16 (4x4 grid)
- Actions: 4 (Left=0, Down=1, Right=2, Up=3)
- Rewards: +1 for reaching the goal, 0 otherwise
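To confirm these numbers in the notebook, a quick check might look like this (a sketch assuming the `gym.make` API used later in this activity; with `gymnasium` the same calls work):

```python
import gym  # or: import gymnasium as gym

# Default 4x4 slippery FrozenLake used in this activity
env = gym.make('FrozenLake-v1', map_name='4x4', is_slippery=True)

print(env.observation_space.n)  # 16 discrete states
print(env.action_space.n)       # 4 discrete actions (Left, Down, Right, Up)
```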
TODO 1: Initialize Q-table
TODO 2: Implement epsilon-greedy action selection
TODO 3: Implement Q-Learning update rule
TODO 4: Implement epsilon decay schedule
You'll complete the missing pieces of the notebook's training loop; a rough sketch of what these pieces typically look like is shown below.
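The template fixes the exact function signatures, so treat the following only as a hedged sketch of the four TODOs; names like `n_states`, `n_actions`, and the decay constants are illustrative, not taken from the notebook.

```python
import numpy as np

rng = np.random.default_rng()

# TODO 1: a Q-table of zeros, one row per state, one column per action
n_states, n_actions = 16, 4
Q = np.zeros((n_states, n_actions))

# TODO 2: epsilon-greedy action selection
def choose_action(Q, state, epsilon):
    if rng.random() < epsilon:
        return int(rng.integers(Q.shape[1]))   # explore: random action
    return int(np.argmax(Q[state]))            # exploit: greedy action

# TODO 3: the Q-Learning update rule (off-policy, uses the max over next actions)
def q_update(Q, state, action, reward, next_state, done, alpha, gamma):
    target = reward if done else reward + gamma * np.max(Q[next_state])
    Q[state, action] += alpha * (target - Q[state, action])

# TODO 4: a simple exponential epsilon decay with a floor
def decay_epsilon(epsilon, decay=0.999, min_epsilon=0.01):
    return max(min_epsilon, epsilon * decay)
```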
Dashboards showing training progress over episodes. Expect a progression roughly like this:
Episodes 0-100: Random exploration, low success rate (<5%)
Episodes 100-500: Q-values start propagating backward from goal
Episodes 500-1000: Success rate increases to 40-60%
Episodes 1000+: Convergence, success rate 70-80%
| State | Left | Down | Right | Up |
|---|---|---|---|---|
| 0 (Start) | 0.32 | 0.41 | 0.28 | 0.35 |
| 4 (Near hole) | 0.01 | 0.58 | 0.03 | 0.12 |
| 15 (Goal) | 0.00 | 0.00 | 0.00 | 0.00 |
Interpretation: At state 0, the best action is Down (Q=0.41). At state 4, the best action is Down (Q=0.58), steering away from the hole at state 5 to its right. Rows for terminal states (holes and the goal, such as state 15) stay at 0 because no update is ever made from a terminal state.
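To read a full Q-table at a glance, it helps to print the greedy policy it implies. A small helper sketch (the helper name and arrow symbols are just for illustration):

```python
import numpy as np

def print_greedy_policy(Q, n_rows=4, n_cols=4):
    """Print the highest-valued action for each state as an arrow on the grid."""
    arrows = {0: '<', 1: 'v', 2: '>', 3: '^'}   # Left, Down, Right, Up
    greedy = np.argmax(Q, axis=1)               # best action per state
    for r in range(n_rows):
        row = greedy[r * n_cols:(r + 1) * n_cols]
        print(' '.join(arrows[int(a)] for a in row))
```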
Your implementation is complete when the notebook runs end to end without errors, the success rate converges to roughly 70-80% on the slippery 4x4 map, and the greedy policy read from your Q-table looks sensible (compare with the example above).
FrozenLake is stochastic: the action you choose is executed as intended only about a third of the time!

```
# You press RIGHT. Actual outcome:
# - 1/3: move right (intended)
# - 1/3: slip up (perpendicular)
# - 1/3: slip down (perpendicular)
```
This is why success rates cap around 70-80% (not 100%).
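You can see these probabilities directly in the environment's transition table. The sketch below assumes the `P` attribute exposed by the toy-text environments in gym/gymnasium:

```python
import gym  # or: import gymnasium as gym

env = gym.make('FrozenLake-v1', is_slippery=True)

# P[state][action] is a list of (probability, next_state, reward, terminated)
state, action = 0, 2  # state 0, pressing RIGHT
for prob, next_state, reward, terminated in env.unwrapped.P[state][action]:
    print(f"p={prob:.2f} -> state {next_state}, reward {reward}, done {terminated}")
```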
Learning Rate (α): controls how strongly each update overwrites the old Q-value. Too high and training oscillates; too low and learning is very slow.
Discount Factor (γ): weights future reward. Because the only reward sits at the goal, keep γ close to 1 so that reward can propagate all the way back to the start.
Epsilon Decay: start near 1.0 (mostly exploring) and decay toward a small floor so the agent eventually exploits what it has learned without ever stopping exploration entirely.
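As a concrete starting point (these specific numbers are an assumption, not values taken from the template), a common configuration for the slippery 4x4 map is:

```python
# Typical starting hyperparameters for slippery 4x4 FrozenLake; tune from here
alpha = 0.1            # learning rate
gamma = 0.99           # discount factor, kept close to 1 for the delayed goal reward
epsilon = 1.0          # initial exploration rate
epsilon_decay = 0.999  # applied once per episode
min_epsilon = 0.01     # exploration floor
n_episodes = 10_000    # a stochastic map needs many episodes to converge
```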
Q-values stay at 0: check that the update actually writes back into the table, and remember that nothing can propagate until the agent reaches the goal at least once; with too little exploration it may never get there.
Success rate not improving: verify that epsilon really decays (otherwise the agent keeps acting randomly) and that the update target uses next_state rather than the current state.
Training unstable: lower the learning rate and evaluate with a moving average over many episodes; single-episode results on a slippery map are very noisy.
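One way to compute that moving average, assuming you collected per-episode returns in a list (episode_rewards is an illustrative name, not one from the template):

```python
import numpy as np

def moving_average(values, window=100):
    """Average each value together with its previous window-1 neighbours."""
    values = np.asarray(values, dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(values, kernel, mode='valid')

# success_rate = moving_average(episode_rewards, window=100)
# On FrozenLake the per-episode return is 0 or 1, so this is exactly the success rate.
```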
Once you complete the basic activity, try these extensions:
```python
env = gym.make('FrozenLake-v1', is_slippery=False)
```
Train on the non-slippery version. You should achieve a 100% success rate!
```python
env = gym.make('FrozenLake-v1', map_name='8x8')
```
Scale to the 8x8 grid (64 states). Does your agent still learn?
Implement SARSA (the on-policy counterpart of Q-Learning):
```
# Q-Learning: Q(s,a) ← Q(s,a) + α[r + γ·max_a' Q(s',a') − Q(s,a)]
# SARSA:      Q(s,a) ← Q(s,a) + α[r + γ·Q(s',a') − Q(s,a)]
#             where a' in the SARSA target is the action actually taken (not the max)
```
Compare convergence speed and final performance.
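A hedged sketch of a SARSA episode, reusing the illustrative choose_action helper from the earlier sketch (the step API shown returns five values, as in gym ≥ 0.26 / gymnasium; older gym versions return four):

```python
def sarsa_episode(env, Q, epsilon, alpha, gamma, choose_action):
    """Run one SARSA episode, updating Q in place with the on-policy target."""
    state, _ = env.reset()
    action = choose_action(Q, state, epsilon)
    done = False
    while not done:
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        # On-policy: the next action comes from the same epsilon-greedy policy ...
        next_action = choose_action(Q, next_state, epsilon)
        # ... and that action (not the max) appears in the target.
        target = reward if done else reward + gamma * Q[next_state, next_action]
        Q[state, action] += alpha * (target - Q[state, action])
        state, action = next_state, next_action
    return Q
```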
Implement Double Q-Learning to reduce overestimation bias:
```
# Maintain two Q-tables: Q1 and Q2
# Update Q1 using Q2 for the target, and vice versa
```
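One way the update step might look, as a sketch under the same illustrative assumptions as the earlier snippets (a fair coin decides which table gets updated):

```python
import numpy as np

rng = np.random.default_rng()

def double_q_update(Q1, Q2, state, action, reward, next_state, done, alpha, gamma):
    """Double Q-Learning: one table selects the next action, the other evaluates it."""
    if rng.random() < 0.5:
        Q_sel, Q_eval = Q1, Q2   # update Q1, evaluate with Q2
    else:
        Q_sel, Q_eval = Q2, Q1   # update Q2, evaluate with Q1
    best_next = int(np.argmax(Q_sel[next_state]))
    target = reward if done else reward + gamma * Q_eval[next_state, best_next]
    Q_sel[state, action] += alpha * (target - Q_sel[state, action])

# For action selection during training, act greedily on the sum Q1 + Q2 (plus exploration).
```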
Completed Notebook: activity-02-q-learning-and-value-functions.ipynb
Performance Report: Brief summary including your final success rate, the hyperparameters you used, and a short interpretation of the learned Q-table.
After completing this activity, you'll have built and tuned a complete tabular RL agent end to end.
In the next lesson, you'll extend Q-Learning to handle complex environments with millions of states using neural networks!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy learning tabular Q-Learning! 🎯