Student starter code (30% baseline)
- `index.html` - Main HTML page
- `script.js` - JavaScript logic
- `styles.css` - Styling and layout
- `package.json` - Dependencies
- `setup.sh` - Setup script
- `README.md` - Instructions (below)

💡 Download the ZIP, extract it, and follow the instructions below to get started!
By completing this activity, you will:
To run the notebook, use Runtime -> Run all (or press Ctrl+F9). Expected first run time: ~45 seconds.
The template comes with 65% working code:
Location: Section 7 - "Agent Implementation"
Current State: The agent takes completely random actions (50/50 left/right)
Your Task: Implement an epsilon-greedy policy that:
Starter Code Provided:
```python
class EpsilonGreedyAgent:
    def __init__(self, epsilon=0.3):
        self.epsilon = epsilon

    def select_action(self, observation):
        # TODO: Implement epsilon-greedy logic here
        # Hint: observation[2] is the pole angle
        # Hint: action 0 = left, action 1 = right
        pass
```
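If you get stuck, the general shape of an epsilon-greedy `select_action` is sketched below. This is one possible solution using the pole-angle hint, not the only correct implementation:

```python
import random

class EpsilonGreedyAgent:
    def __init__(self, epsilon=0.3):
        self.epsilon = epsilon

    def select_action(self, observation):
        # Explore: with probability epsilon, act randomly.
        if random.random() < self.epsilon:
            return random.choice([0, 1])
        # Exploit: follow the pole-angle heuristic from the hints.
        # Pole leaning right (positive angle) -> push right (action 1).
        return 1 if observation[2] > 0 else 0
```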
Success Criteria:
Location: Section 9 - "Analysis and Visualization"
Your Task: Track and visualize how often the agent takes each action (left vs right)
Requirements:
- Compute the balance ratio: min(left, right) / max(left, right)

Success Criteria:
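A minimal sketch of one way to track action counts and compute the balance ratio. It assumes a standard Gymnasium CartPole environment and your completed agent from TODO 1:

```python
import gymnasium as gym
import matplotlib.pyplot as plt

env = gym.make("CartPole-v1")
agent = EpsilonGreedyAgent(epsilon=0.3)  # your completed agent from TODO 1

# Count how often each action is taken over one episode.
action_counts = {0: 0, 1: 0}  # 0 = left, 1 = right
observation, info = env.reset()
done = False
while not done:
    action = agent.select_action(observation)
    action_counts[action] += 1
    observation, reward, terminated, truncated, info = env.step(action)
    done = terminated or truncated

# Balance ratio: 1.0 = perfectly even usage, near 0.0 = one-sided.
left, right = action_counts[0], action_counts[1]
ratio = min(left, right) / max(left, right)
print(f"left={left}, right={right}, balance ratio={ratio:.2f}")

plt.bar(["left", "right"], [left, right])
plt.ylabel("times taken")
plt.title("Action frequency")
plt.show()
```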
Location: Section 10 - "Policy Comparison"
Your Task: Run experiments comparing three policies:
Requirements:
Success Criteria:
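One way to structure the comparison is sketched below. The three-policy lineup shown here (pure random, pure heuristic, and the mixed epsilon-greedy agent) is an assumption; match it to whatever Section 10 specifies:

```python
import gymnasium as gym
import numpy as np

env = gym.make("CartPole-v1")

def run_episodes(agent, env, n_episodes=50):
    """Return the total reward earned in each episode."""
    returns = []
    for _ in range(n_episodes):
        observation, info = env.reset()
        total, done = 0.0, False
        while not done:
            action = agent.select_action(observation)
            observation, reward, terminated, truncated, info = env.step(action)
            total += reward
            done = terminated or truncated
        returns.append(total)
    return returns

# Assumed lineup: pure random (epsilon=1.0), pure heuristic (epsilon=0.0),
# and the mixed agent from TODO 1 -- adjust to match Section 10.
policies = {
    "random (eps=1.0)": EpsilonGreedyAgent(epsilon=1.0),
    "heuristic (eps=0.0)": EpsilonGreedyAgent(epsilon=0.0),
    "epsilon-greedy (eps=0.3)": EpsilonGreedyAgent(epsilon=0.3),
}
for name, agent in policies.items():
    r = run_episodes(agent, env)
    print(f"{name}: mean return {np.mean(r):.1f} +/- {np.std(r):.1f}")
```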
Location: New section you'll create
Your Task: Find the optimal epsilon value through systematic testing
Requirements:
Success Criteria:
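A possible shape for the sweep, reusing `run_episodes()` from the TODO 3 sketch. The epsilon grid and episode count here are arbitrary choices; widen or refine them as your results suggest:

```python
import numpy as np
import matplotlib.pyplot as plt

# Reuses env and run_episodes() from the TODO 3 sketch.
epsilons = [0.0, 0.1, 0.2, 0.3, 0.5, 0.7, 1.0]
mean_returns = [np.mean(run_episodes(EpsilonGreedyAgent(epsilon=e), env, n_episodes=30))
                for e in epsilons]

plt.plot(epsilons, mean_returns, marker="o")
plt.xlabel("epsilon")
plt.ylabel("mean episode return")
plt.title("Epsilon sweep")
plt.show()

print("Best epsilon in this sweep:", epsilons[int(np.argmax(mean_returns))])
```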
Once you've completed all TODOs, try these advanced challenges:
Implement epsilon decay: Start with high exploration (ε=0.9) and gradually reduce to low exploration (ε=0.1) over episodes. Does this improve performance?
Improve the heuristic by considering both pole angle AND cart velocity. Use this improved heuristic in your epsilon-greedy agent.
Test your epsilon-greedy agent on other Gymnasium environments:
- MountainCar-v0
- Acrobot-v1
- LunarLander-v2 (requires Box2D: pip install box2d-py)

Does the same epsilon work well across environments?
Implement a simple Q-table to learn optimal actions:
Q(s,a) += α * (reward + γ * max_Q(s') - Q(s,a))
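A minimal tabular Q-learning sketch for CartPole is shown below. The state discretization, bin bounds, and hyperparameters are illustrative assumptions, and for simplicity the sketch bootstraps even at terminal states:

```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

env = gym.make("CartPole-v1")

def discretize(observation, bins=6):
    """Map CartPole's continuous observation to a coarse discrete state."""
    # Hand-picked clipping bounds for this sketch; tune as needed.
    bounds = [(-2.4, 2.4), (-3.0, 3.0), (-0.21, 0.21), (-3.0, 3.0)]
    state = []
    for value, (lo, hi) in zip(observation, bounds):
        clipped = min(max(value, lo), hi)
        state.append(int((clipped - lo) / (hi - lo) * (bins - 1)))
    return tuple(state)

Q = defaultdict(lambda: np.zeros(2))  # Q[state] -> value of each action
alpha, gamma, epsilon = 0.1, 0.99, 0.1

for episode in range(500):
    observation, info = env.reset()
    state, done = discretize(observation), False
    while not done:
        # Epsilon-greedy over the current Q estimates.
        if np.random.rand() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(Q[state]))
        observation, reward, terminated, truncated, info = env.step(action)
        next_state = discretize(observation)
        # The tabular update from the formula above.
        Q[state][action] += alpha * (reward + gamma * np.max(Q[next_state]) - Q[state][action])
        state, done = next_state, terminated or truncated
```

Before submitting, ensure you've completed: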
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution: Run the installation cell at the top of the notebook:
```
!pip install gymnasium[classic_control] matplotlib imageio imageio-ffmpeg
```
Solution: This is normal in Colab's headless mode. The notebook uses `rgb_array` render mode and saves videos instead; check the `/content/videos/` directory for recordings.
Solution:

- Make sure you are reading observation[2] (the pole angle) correctly

Solution: Download the MP4 file and play locally, or use Colab's built-in video player:
```python
from IPython.display import Video
Video('/content/videos/episode_0.mp4', width=400)
```
Submission Checklist:
- activity-01-[YourName].ipynb

After completing this activity:
Key Insight: This activity introduced you to the RL loop with a simple heuristic. In Activity 02, you'll learn how agents can discover optimal policies without hand-coded heuristics!
Good luck! Remember: the goal is to understand how RL agents interact with environments, not to achieve perfect performance. Focus on learning and experimentation! 🚀