By completing this activity, you will:
Understand Q-values (action-value functions) and value functions
Implement tabular Q-Learning from scratch
Apply the Bellman equation for value updates
Master epsilon-greedy exploration in learning contexts
Analyze Q-table convergence and optimal policies
Solve the FrozenLake navigation problem
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ FrozenLake environment setup
✅ Random agent baseline (success rate ~1%)
✅ Q-table visualization
✅ Training progress plots
Expected First Run Time : ~60 seconds
The template comes with 65% working code :
✅ FrozenLake Environment : 4x4 grid world with slippery ice
✅ Random Baseline : Agent with ~1% success rate
✅ Q-Table Structure : Initialized and ready for learning
✅ Visualization Tools : Heatmaps, learning curves, policy arrows
✅ Hyperparameter Tracking : Learning rate, discount factor, epsilon decay
✅ Success Metrics : Win rate, average reward, convergence analysis
⚠️ TODO 1 : Implement Q-value update (Bellman equation)
⚠️ TODO 2 : Implement action selection (epsilon-greedy)
⚠️ TODO 3 : Add Q-table convergence tracking
⚠️ TODO 4 : Optimize hyperparameters
Location : Section 5 - "Q-Learning Algorithm"
Current State : Q-table exists but values never update
Your Task : Implement the Bellman equation update:
```
Q(s,a) ← Q(s,a) + α * [R + γ * max_a' Q(s',a') - Q(s,a)]
```
Where:
α (alpha) = learning rate (0.1)
R = immediate reward
γ (gamma) = discount factor (0.99)
s' = next state
max_a' Q(s',a') = maximum Q-value in next state
Starter Code Provided :
```python
def update_q_value(Q, state, action, reward, next_state, alpha, gamma):
    pass
```
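For reference, here is a minimal sketch of one way to fill this in, assuming `Q` is a NumPy array of shape `(n_states, n_actions)` (adapt the indexing to however your notebook stores the Q-table):

```python
import numpy as np

def update_q_value(Q, state, action, reward, next_state, alpha, gamma):
    # TD target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(Q[next_state])
    # Move Q(s, a) a fraction alpha of the way toward the TD target, storing it back
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```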
Success Criteria :
Location : Section 5 - "Q-Learning Agent"
Your Task : Select actions using epsilon-greedy strategy:
With probability ε: random action (exploration)
With probability 1-ε: action with highest Q-value (exploitation)
Requirements :
Implement epsilon decay: Start at ε=1.0, decay to ε_min=0.01
Decay formula: ε = max(ε_min, ε * decay_rate) after each episode
Handle Q-value ties: If multiple actions have the same Q-value, choose among them at random (see the sketch below)
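A minimal sketch of this selection rule, assuming `Q` is a NumPy array where `Q[state]` returns one value per action:

```python
import numpy as np

rng = np.random.default_rng()

def select_action(Q, state, epsilon, n_actions):
    # Explore: with probability epsilon, take a uniformly random action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    # Exploit: take the highest-valued action, breaking ties at random
    q_values = Q[state]
    best_actions = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best_actions))

# After each episode, decay epsilon toward its floor:
# epsilon = max(epsilon_min, epsilon * decay_rate)
```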
Success Criteria :
Location : Section 7 - "Analysis"
Your Task : Measure how much Q-values change between episodes to detect convergence
Requirements :
Calculate Q-table change: Δ = Σ|Q_new - Q_old| for all state-action pairs
Track this metric every 100 episodes
Plot convergence curve (Δ over time)
Identify a convergence threshold: treat Δ < 0.01 as converged (see the sketch below)
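A sketch of how this could be wired into the training loop, assuming `Q` is the NumPy Q-table and `n_episodes` comes from your training setup:

```python
import numpy as np

Q_old = Q.copy()
deltas = []

for episode in range(n_episodes):
    # ... run one episode of Q-Learning here, updating Q in place ...
    if (episode + 1) % 100 == 0:
        delta = np.sum(np.abs(Q - Q_old))  # total change over all state-action pairs
        deltas.append(delta)
        Q_old = Q.copy()
        if delta < 0.01:
            print(f"Converged at episode {episode + 1} (delta = {delta:.4f})")
```

Plotting `deltas` against the checkpoint episodes gives the convergence curve.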
Success Criteria :
Location : New section you'll create
Your Task : Find optimal hyperparameters for fastest learning
Test Grid :
Learning rate α: [0.05, 0.1, 0.2, 0.5]
Discount γ: [0.9, 0.95, 0.99]
Epsilon decay: [0.995, 0.999, 0.9999]
Requirements :
Run Q-Learning with each combination (36 total experiments)
Track: final success rate, episodes to 70% success, convergence speed
Create heatmap showing α vs γ performance
Identify the best configuration (see the sweep sketch below)
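A sketch of the sweep loop. `train_q_learning` is a placeholder for whatever training function your notebook exposes; here it is assumed to return the final success rate:

```python
from itertools import product

alphas = [0.05, 0.1, 0.2, 0.5]
gammas = [0.9, 0.95, 0.99]
decays = [0.995, 0.999, 0.9999]

results = {}
for alpha, gamma, decay in product(alphas, gammas, decays):
    # Hypothetical helper: runs one full training session with these settings
    success_rate = train_q_learning(alpha=alpha, gamma=gamma, epsilon_decay=decay)
    results[(alpha, gamma, decay)] = success_rate

best_config = max(results, key=results.get)
print("Best (alpha, gamma, decay):", best_config, "->", results[best_config])
```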
Success Criteria :
Test Q-Learning on 8x8 FrozenLake (64 states vs 16):
Does convergence take longer? How much?
Do you need different hyperparameters?
Visualize the learned policy on the larger map
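Creating the larger map is a one-line change if the notebook uses Gymnasium (the classic `gym` package accepts the same environment id):

```python
import gymnasium as gym

# 8x8 map: 64 states, same 4 actions, still slippery
env = gym.make("FrozenLake8x8-v1", is_slippery=True)
```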
Implement Double Q-Learning to reduce overestimation bias:
Maintain two Q-tables: Q1 and Q2
Select action using Q1, update Q2 (and vice versa)
Compare performance vs standard Q-Learning
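A sketch of the standard double update with two NumPy Q-tables, where a coin flip decides which table is updated and the other supplies the value estimate:

```python
import numpy as np

rng = np.random.default_rng()

def double_q_update(Q1, Q2, state, action, reward, next_state, alpha, gamma):
    if rng.random() < 0.5:
        best_next = np.argmax(Q1[next_state])                 # select with Q1
        target = reward + gamma * Q2[next_state, best_next]   # evaluate with Q2
        Q1[state, action] += alpha * (target - Q1[state, action])
    else:
        best_next = np.argmax(Q2[next_state])                 # select with Q2
        target = reward + gamma * Q1[next_state, best_next]   # evaluate with Q1
        Q2[state, action] += alpha * (target - Q2[state, action])
```

For behavior during training, a common choice is to act epsilon-greedily with respect to Q1 + Q2.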
Implement SARSA (on-policy alternative to Q-Learning):
Update: Q(s,a) += α[R + γQ(s',a') - Q(s,a)], where a' is the action actually taken in s'
Compare SARSA vs Q-Learning on FrozenLake
Which converges faster? Why?
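A sketch of the SARSA update; unlike Q-Learning it needs `next_action`, the action the epsilon-greedy policy will actually take in the next state:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    # On-policy TD target: uses the Q-value of the action actually taken next,
    # not the max over actions as in Q-Learning
    td_target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (td_target - Q[state, action])
```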
Replace Q-table with neural network (DQN preview):
Use PyTorch to create Q-network: state → [Q(s,a0), Q(s,a1), ...]
Train with MSE loss between predicted and target Q-values
Compare vs tabular Q-Learning
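A minimal PyTorch sketch of such a network and one training step, assuming states are one-hot vectors of length `n_states` and target Q-values are precomputed (the layer sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

n_states, n_actions = 16, 4

# Maps a one-hot state vector to one Q-value per action
q_net = nn.Sequential(
    nn.Linear(n_states, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(states, actions, targets):
    # states: [batch, n_states] float, actions: [batch] long, targets: [batch] float
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```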
Expected Results :
Random Baseline (starter code) :
Success Rate: ~1% (almost never reaches goal)
Average Reward: ~0.01
Episodes to Goal: N/A (rarely succeeds)
Q-Learning (TODOs 1-2) :
Success Rate: 50-70% after 10K episodes
Average Reward: 0.5-0.7
Convergence: ~5K-8K episodes
Tuned Hyperparameters (TODO 4) :
Success Rate: 70-85% (best hyperparameters)
Average Reward: 0.7-0.85
Convergence: ~3K-5K episodes (faster)
Double Q-Learning (challenge) :
Success Rate: 75-90% (reduced overestimation)
More stable learning curve
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your Bellman update implementation. Ensure you're:
Using the correct formula: Q[s][a] += alpha * (reward + gamma * max_next_Q - Q[s][a])
Actually writing the updated value back into the Q-table (not just computing it)
Indexing the Q-table correctly: Q[state][action]
Solution :
Verify epsilon is decaying (should reach ~0.01 by end)
Check you're taking argmax of Q-values during exploitation
Ensure discount factor γ is high (0.95-0.99) for long-term planning
Try increasing learning rate α to 0.2 for faster learning
Solution :
Reduce learning rate α (try 0.05 instead of 0.1)
Verify the update subtracts the old Q-value (the TD error term) rather than just adding rewards
Check that gamma < 1.0 (gamma = 1.0 can cause instability)
Solution :
Too fast (agent stops exploring early): Increase decay rate to 0.999
Too slow (agent still random at end): Decrease decay rate to 0.995
Rule of thumb: Should reach ε=0.01 at ~70% of total episodes
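One way to turn that rule of thumb into a number, assuming multiplicative decay from ε = 1.0:

```python
# Solve 1.0 * decay_rate ** (0.7 * n_episodes) = epsilon_min for decay_rate
n_episodes, epsilon_min = 10_000, 0.01
decay_rate = epsilon_min ** (1 / (0.7 * n_episodes))
print(decay_rate)  # ≈ 0.99934 for 10K episodes
```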
Concept 02 : Q-Learning and Value Functions (Bellman equation theory)
Concept 03 : Deep Q-Networks (DQN) - neural network approximation
Watkins & Dayan (1992): "Q-Learning" - original Q-Learning paper
Van Hasselt (2010): "Double Q-Learning" - reduces overestimation
Complete required TODOs (minimum: TODO 1-2)
Run entire notebook to generate all outputs
Export policy visualization : Save final Q-table heatmap as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and policy.png
Submission Checklist :
After mastering Q-Learning:
Move to Activity 03: Deep Q-Networks (DQN)
Learn how to handle large state spaces with neural networks
Play Atari games with DQN (Project 1)
Key Insight : Q-Learning works great for small state spaces (FrozenLake = 16 states), but real-world problems have millions of states. That's where DQN comes in!
Good luck! Q-Learning is the foundation of value-based RL. Master this, and you'll understand the core of modern deep RL algorithms! 🚀