By completing this activity, you will:
Understand Q-values (action-value functions) and value functions
Implement tabular Q-Learning from scratch
Apply the Bellman equation for value updates
Master epsilon-greedy exploration in learning contexts
Analyze Q-table convergence and optimal policies
Solve the FrozenLake navigation problem
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ FrozenLake environment setup
✅ Random agent baseline (success rate ~1%)
✅ Q-table visualization
✅ Training progress plots
Expected First Run Time : ~60 seconds
The template comes with 65% working code :
✅ FrozenLake Environment : 4x4 grid world with slippery ice
✅ Random Baseline : Agent with ~1% success rate
✅ Q-Table Structure : Initialized and ready for learning
✅ Visualization Tools : Heatmaps, learning curves, policy arrows
✅ Hyperparameter Tracking : Learning rate, discount factor, epsilon decay
✅ Success Metrics : Win rate, average reward, convergence analysis
⚠️ TODO 1 : Implement Q-value update (Bellman equation)
⚠️ TODO 2 : Implement action selection (epsilon-greedy)
⚠️ TODO 3 : Add Q-table convergence tracking
⚠️ TODO 4 : Optimize hyperparameters
Location : Section 5 - "Q-Learning Algorithm"
Current State : Q-table exists but values never update
Your Task : Implement the Bellman equation update:
```
Q(s,a) ← Q(s,a) + α * [R + γ * max_a' Q(s',a') - Q(s,a)]
```
Where:
α (alpha) = learning rate (0.1)
R = immediate reward
γ (gamma) = discount factor (0.99)
s' = next state
max_a' Q(s',a') = maximum Q-value in next state
Starter Code Provided :
```python
def update_q_value(Q, state, action, reward, next_state, alpha, gamma):
    pass
```
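For reference, here is a minimal sketch of one way to fill this in, assuming `Q` is a NumPy array of shape `(n_states, n_actions)` (adapt the indexing to however your notebook stores the Q-table):

```python
import numpy as np

def update_q_value(Q, state, action, reward, next_state, alpha, gamma):
    # TD target: immediate reward plus the discounted best value of the next state
    td_target = reward + gamma * np.max(Q[next_state])
    # Move Q(s, a) a fraction alpha of the way toward the TD target, storing it back
    Q[state, action] += alpha * (td_target - Q[state, action])
    return Q
```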
Success Criteria :
Location : Section 5 - "Q-Learning Agent"
Your Task : Select actions using epsilon-greedy strategy:
With probability ε: random action (exploration)
With probability 1-ε: action with highest Q-value (exploitation)
Requirements :
Implement epsilon decay: Start at ε=1.0, decay to ε_min=0.01
Decay formula: ε = max(ε_min, ε * decay_rate) after each episode
Handle Q-value ties: If multiple actions have the same Q-value, choose among them at random (see the sketch below)
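A minimal sketch of this selection rule, assuming `Q` is a NumPy array where `Q[state]` returns one value per action:

```python
import numpy as np

rng = np.random.default_rng()

def select_action(Q, state, epsilon, n_actions):
    # Explore: with probability epsilon, take a uniformly random action
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    # Exploit: take the highest-valued action, breaking ties at random
    q_values = Q[state]
    best_actions = np.flatnonzero(q_values == q_values.max())
    return int(rng.choice(best_actions))

# After each episode, decay epsilon toward its floor:
# epsilon = max(epsilon_min, epsilon * decay_rate)
```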
Success Criteria :
Location : Section 7 - "Analysis"
Your Task : Measure how much Q-values change between episodes to detect convergence
Requirements :
Calculate Q-table change: Δ = Σ|Q_new - Q_old| for all state-action pairs
Track this metric every 100 episodes
Plot convergence curve (Δ over time)
Identify a convergence threshold: treat Δ < 0.01 as converged (see the sketch below)
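A sketch of how this could be wired into the training loop, assuming `Q` is the NumPy Q-table and `n_episodes` comes from your training setup:

```python
import numpy as np

Q_old = Q.copy()
deltas = []

for episode in range(n_episodes):
    # ... run one episode of Q-Learning here, updating Q in place ...
    if (episode + 1) % 100 == 0:
        delta = np.sum(np.abs(Q - Q_old))  # total change over all state-action pairs
        deltas.append(delta)
        Q_old = Q.copy()
        if delta < 0.01:
            print(f"Converged at episode {episode + 1} (delta = {delta:.4f})")
```

Plotting `deltas` against the checkpoint episodes gives the convergence curve.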
Success Criteria :
Location : New section you'll create
Your Task : Find optimal hyperparameters for fastest learning
Test Grid :
Learning rate α: [0.05, 0.1, 0.2, 0.5]
Discount γ: [0.9, 0.95, 0.99]
Epsilon decay: [0.995, 0.999, 0.9999]
Requirements :
Run Q-Learning with each combination (36 total experiments)
Track: final success rate, episodes to 70% success, convergence speed
Create heatmap showing α vs γ performance
Identify the best configuration (see the sweep sketch below)
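A sketch of the sweep loop. `train_q_learning` is a placeholder for whatever training function your notebook exposes; here it is assumed to return the final success rate:

```python
from itertools import product

alphas = [0.05, 0.1, 0.2, 0.5]
gammas = [0.9, 0.95, 0.99]
decays = [0.995, 0.999, 0.9999]

results = {}
for alpha, gamma, decay in product(alphas, gammas, decays):
    # Hypothetical helper: runs one full training session with these settings
    success_rate = train_q_learning(alpha=alpha, gamma=gamma, epsilon_decay=decay)
    results[(alpha, gamma, decay)] = success_rate

best_config = max(results, key=results.get)
print("Best (alpha, gamma, decay):", best_config, "->", results[best_config])
```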
Success Criteria :
Test Q-Learning on 8x8 FrozenLake (64 states vs 16):
Does convergence take longer? How much?
Do you need different hyperparameters?
Visualize the learned policy on the larger map
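Creating the larger map is a one-line change if the notebook uses Gymnasium (the classic `gym` package accepts the same environment id):

```python
import gymnasium as gym

# 8x8 map: 64 states, same 4 actions, still slippery
env = gym.make("FrozenLake8x8-v1", is_slippery=True)
```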
Implement Double Q-Learning to reduce overestimation bias:
Maintain two Q-tables: Q1 and Q2
Select action using Q1, update Q2 (and vice versa)
Compare performance vs standard Q-Learning
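A sketch of the standard double update with two NumPy Q-tables, where a coin flip decides which table is updated and the other supplies the value estimate:

```python
import numpy as np

rng = np.random.default_rng()

def double_q_update(Q1, Q2, state, action, reward, next_state, alpha, gamma):
    if rng.random() < 0.5:
        best_next = np.argmax(Q1[next_state])                 # select with Q1
        target = reward + gamma * Q2[next_state, best_next]   # evaluate with Q2
        Q1[state, action] += alpha * (target - Q1[state, action])
    else:
        best_next = np.argmax(Q2[next_state])                 # select with Q2
        target = reward + gamma * Q1[next_state, best_next]   # evaluate with Q1
        Q2[state, action] += alpha * (target - Q2[state, action])
```

For behavior during training, a common choice is to act epsilon-greedily with respect to Q1 + Q2.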
Implement SARSA (on-policy alternative to Q-Learning):
Update: Q(s,a) += α[R + γQ(s',a') - Q(s,a)], where a' is the action actually taken in s'
Compare SARSA vs Q-Learning on FrozenLake
Which converges faster? Why?
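A sketch of the SARSA update; unlike Q-Learning it needs `next_action`, the action the epsilon-greedy policy will actually take in the next state:

```python
def sarsa_update(Q, state, action, reward, next_state, next_action, alpha, gamma):
    # On-policy TD target: uses the Q-value of the action actually taken next,
    # not the max over actions as in Q-Learning
    td_target = reward + gamma * Q[next_state, next_action]
    Q[state, action] += alpha * (td_target - Q[state, action])
```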
Replace Q-table with neural network (DQN preview):
Use PyTorch to create Q-network: state → [Q(s,a0), Q(s,a1), ...]
Train with MSE loss between predicted and target Q-values
Compare vs tabular Q-Learning
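A minimal PyTorch sketch of such a network and one training step, assuming states are one-hot vectors of length `n_states` and target Q-values are precomputed (the layer sizes are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

n_states, n_actions = 16, 4

# Maps a one-hot state vector to one Q-value per action
q_net = nn.Sequential(
    nn.Linear(n_states, 64),
    nn.ReLU(),
    nn.Linear(64, n_actions),
)
optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

def train_step(states, actions, targets):
    # states: [batch, n_states] float, actions: [batch] long, targets: [batch] float
    q_pred = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    loss = loss_fn(q_pred, targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```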
Expected Results :
Random Baseline (starter code) :
Success Rate: ~1% (almost never reaches goal)
Average Reward: ~0.01
Episodes to Goal: N/A (rarely succeeds)
Q-Learning (TODOs 1-2) :
Success Rate: 50-70% after 10K episodes
Average Reward: 0.5-0.7
Convergence: ~5K-8K episodes
Tuned Hyperparameters (TODO 4) :
Success Rate: 70-85% (best hyperparameters)
Average Reward: 0.7-0.85
Convergence: ~3K-5K episodes (faster)
Double Q-Learning (challenge) :
Success Rate: 75-90% (reduced overestimation)
More stable learning curve
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your Bellman update implementation. Ensure you're:
Using the correct formula: Q[s][a] += alpha * (reward + gamma * max_next_Q - Q[s][a])
Actually writing the updated value back into the Q-table (not just computing it)
Indexing the Q-table correctly: Q[state][action]
Solution :
Verify epsilon is decaying (should reach ~0.01 by end)
Check you're taking argmax of Q-values during exploitation
Ensure discount factor γ is high (0.95-0.99) for long-term planning
Try increasing learning rate α to 0.2 for faster learning
Solution :
Reduce learning rate α (try 0.05 instead of 0.1)
Verify the update subtracts the old Q-value (the TD error term) rather than just adding rewards
Check that gamma < 1.0 (gamma = 1.0 can cause instability)
Solution :
Too fast (agent stops exploring early): Increase decay rate to 0.999
Too slow (agent still random at end): Decrease decay rate to 0.995
Rule of thumb: Should reach ε=0.01 at ~70% of total episodes
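One way to turn that rule of thumb into a number, assuming multiplicative decay from ε = 1.0:

```python
# Solve 1.0 * decay_rate ** (0.7 * n_episodes) = epsilon_min for decay_rate
n_episodes, epsilon_min = 10_000, 0.01
decay_rate = epsilon_min ** (1 / (0.7 * n_episodes))
print(decay_rate)  # ≈ 0.99934 for 10K episodes
```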
Concept 02 : Q-Learning and Value Functions (Bellman equation theory)
Concept 03 : Deep Q-Networks (DQN) - neural network approximation
Watkins & Dayan (1992): "Q-Learning" - original Q-Learning paper
Van Hasselt (2010): "Double Q-Learning" - reduces overestimation
Complete required TODOs (minimum: TODO 1-2)
Run entire notebook to generate all outputs
Export policy visualization : Save final Q-table heatmap as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and policy.png
Submission Checklist :
After mastering Q-Learning:
Move to Activity 03: Deep Q-Networks (DQN)
Learn how to handle large state spaces with neural networks
Play Atari games with DQN (Project 1)
Key Insight : Q-Learning works great for small state spaces (FrozenLake = 16 states), but real-world problems have millions of states. That's where DQN comes in!
Good luck! Q-Learning is the foundation of value-based RL. Master this, and you'll understand the core of modern deep RL algorithms! 🚀