Practice and reinforce the concepts from Lesson 7
In this activity, you'll implement and compare three classic bandit algorithms (ε-Greedy, UCB, Thompson Sampling) on a simulated news recommendation system. You'll see exploration strategies in action and understand the exploration-exploitation trade-off!
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-07-multi-armed-bandits.zip (Templates/AI25-Template-activity-07-multi-armed-bandits.zip)
Upload activity-07-multi-armed-bandits.ipynb to Google Colab.
Execute the first few cells to:
News Recommendation Scenario:
Example True Click Rates:
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]
#                        ↑
#              Best article (0.8 click rate)
Regret: the difference between the total reward you would collect by always showing the best article and the total reward your algorithm actually collects.
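In simulation you know the true click rates, so you can compute expected cumulative regret directly from the log of chosen articles. A minimal sketch (assuming arm_selections is the list of chosen article indices produced by the simulation loop later in this notebook):
import numpy as np

def cumulative_regret(arm_selections, true_rates):
    # Expected per-step regret: best click rate minus the rate of the chosen article.
    best_rate = max(true_rates)
    per_step = [best_rate - true_rates[a] for a in arm_selections]
    return np.cumsum(per_step)  # running total over time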
TODO 1: Implement ε-greedy exploration
import numpy as np

class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.Q = np.zeros(n_arms) # Estimated click rates
self.N = np.zeros(n_arms) # Number of times each article shown
def select_arm(self):
# TODO 1: Implement ε-greedy selection
# With probability ε: return random arm
# With probability 1-ε: return arm with max Q
pass
def update(self, arm, reward):
# TODO 1: Update estimate for chosen arm
# Use incremental mean: Q_new = Q_old + (1/N) * (reward - Q_old)
pass
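As a sanity check for the TODO 1 update rule, the incremental mean should reproduce the plain average of all rewards seen so far. A tiny standalone illustration (the reward list here is made up):
import numpy as np

rewards_seen = [1, 0, 0, 1, 1]  # hypothetical click outcomes for one article
Q, N = 0.0, 0
for r in rewards_seen:
    N += 1
    Q = Q + (1.0 / N) * (r - Q)  # incremental mean update from TODO 1
print(Q, np.mean(rewards_seen))  # both are 0.6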
TODO 2: Implement Upper Confidence Bound
class UCB:
def __init__(self, n_arms, c=2.0):
self.n_arms = n_arms
self.c = c
self.Q = np.zeros(n_arms)
self.N = np.zeros(n_arms)
self.t = 0 # Total timesteps
def select_arm(self):
# TODO 2: Implement UCB selection
self.t += 1
# Pull each arm once first
if np.any(self.N == 0):
return np.argmin(self.N)
# Compute UCB for each arm
# UCB(a) = Q(a) + c * sqrt(log(t) / N(a))
# Return arm with highest UCB
pass
def update(self, arm, reward):
# TODO 2: Update estimate (same as ε-greedy)
pass
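To build intuition for TODO 2, note that the exploration bonus c * sqrt(log(t) / N(a)) shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited. A quick illustration with made-up values:
import numpy as np

c, t = 2.0, 1000  # illustrative values, not taken from the template
for n in [1, 10, 100, 1000]:
    bonus = c * np.sqrt(np.log(t) / n)
    print(f"N(a) = {n:4d} -> exploration bonus = {bonus:.3f}")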
TODO 3: Implement Thompson Sampling with Beta distributions
class ThompsonSampling:
def __init__(self, n_arms):
self.n_arms = n_arms
self.alpha = np.ones(n_arms) # Successes (Beta prior)
self.beta = np.ones(n_arms) # Failures (Beta prior)
def select_arm(self):
# TODO 3: Implement Thompson Sampling
# 1. Sample from Beta(α, β) for each arm
# 2. Return arm with highest sample
pass
def update(self, arm, reward):
# TODO 3: Update Beta distribution
# If reward == 1 (click): α ← α + 1
# If reward == 0 (no click): β ← β + 1
pass
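The key primitive for TODO 3 is drawing one posterior sample per arm. The snippet below only demonstrates np.random.beta on made-up counts; it is not a solution to the TODO:
import numpy as np

# An arm with 8 clicks / 2 misses usually yields higher samples than one with 2 clicks / 8 misses.
print(np.random.beta(1 + 8, 1 + 2, size=5))
print(np.random.beta(1 + 2, 1 + 8, size=5))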
TODO 4: Run simulations for all three algorithms
def simulate_bandit(algorithm, true_rates, T=1000):
# TODO 4: Simulate bandit algorithm
rewards = []
arm_selections = []
for t in range(T):
# 1. Select arm using algorithm
arm = algorithm.select_arm()
# 2. Simulate user click (Bernoulli trial)
reward = 1 if np.random.random() < true_rates[arm] else 0
# 3. Update algorithm
algorithm.update(arm, reward)
# 4. Track performance
rewards.append(reward)
arm_selections.append(arm)
return rewards, arm_selections
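Once the three classes are filled in, a comparison run could look like the sketch below (it assumes your implementations keep the constructor signatures shown above):
import numpy as np

np.random.seed(0)
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]

algorithms = {
    "epsilon-greedy": EpsilonGreedy(n_arms=5, epsilon=0.1),
    "UCB": UCB(n_arms=5, c=2.0),
    "Thompson Sampling": ThompsonSampling(n_arms=5),
}

for name, algo in algorithms.items():
    rewards, arm_selections = simulate_bandit(algo, true_rates, T=1000)
    print(f"{name}: total clicks = {sum(rewards)}, "
          f"best-arm pulls = {arm_selections.count(2)}")  # index 2 is the best article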
Visualization of:
ε-Greedy (ε=0.1):
UCB (c=2):
Thompson Sampling:
Optimal (Oracle with true rates):
Regret
↑
150 | ε-Greedy (linear growth)
| ╱
100 | ╱ UCB (sublinear)
| ╱ ╱
50 | ╱ ╱ Thompson Sampling (best)
|╱___╱___
0 └────────────────> Time
0 200 600 1000
Your implementation is complete when:
ε-Greedy:
UCB:
Thompson Sampling:
ε-Greedy:
# Too low: insufficient exploration
epsilon = 0.01 # Might commit to suboptimal arm early
# Good: balanced
epsilon = 0.1 # Standard choice
# Too high: excessive exploration
epsilon = 0.5 # Wastes time on bad arms
UCB:
# Conservative: c = 1.0
# Standard: c = 2.0
# Aggressive: c = 5.0
Thompson Sampling:
# No hyperparameters to tune! (Prior α=1, β=1 works well)
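If you want to see these settings' effect empirically rather than take them on faith, a simple sweep works; this sketch assumes EpsilonGreedy and simulate_bandit are already implemented (the same pattern applies to UCB's c):
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]
for eps in [0.01, 0.1, 0.5]:
    totals = []
    for _ in range(20):  # average over several runs to reduce noise
        rewards, _ = simulate_bandit(EpsilonGreedy(5, epsilon=eps), true_rates, T=1000)
        totals.append(sum(rewards))
    print(f"epsilon = {eps}: mean total clicks = {sum(totals) / len(totals):.1f}")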
Problem: All algorithms perform equally poorly
Problem: Thompson Sampling worse than others
Check that select_arm draws one sample per arm with np.random.beta(alpha, beta) and picks the argmax of those samples.
Problem: UCB not exploring enough
Implement decaying epsilon for ε-greedy:
epsilon_t = max(epsilon_min, epsilon_start * (decay_rate ** t))
Compare with fixed epsilon.
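One way to wire the decay in is a small subclass that recomputes ε each step; the names epsilon_start, epsilon_min, and decay_rate below mirror the formula above and are otherwise arbitrary (this assumes your EpsilonGreedy from TODO 1 is complete):
class DecayingEpsilonGreedy(EpsilonGreedy):
    def __init__(self, n_arms, epsilon_start=1.0, epsilon_min=0.01, decay_rate=0.995):
        super().__init__(n_arms, epsilon=epsilon_start)
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.decay_rate = decay_rate
        self.step = 0

    def select_arm(self):
        # Shrink epsilon before delegating to the parent's selection rule.
        self.epsilon = max(self.epsilon_min,
                           self.epsilon_start * (self.decay_rate ** self.step))
        self.step += 1
        return super().select_arm()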
Extend to contextual setting with user features:
class LinearUCB:
def __init__(self, n_arms, context_dim):
# Maintain Aₐ and bₐ for each arm
pass
def select_arm(self, context):
# Compute θₐ = Aₐ⁻¹ bₐ for each arm
# UCB = context' θₐ + α * sqrt(context' Aₐ⁻¹ context)
pass
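For reference, the standard (disjoint) LinUCB recipe keeps a d×d matrix Aₐ and a d-vector bₐ per arm, exactly as the comments suggest. The rough sketch below shows one possible shape, with the exploration weight alpha chosen arbitrarily and the class named LinearUCBSketch so it does not clash with your own LinearUCB:
import numpy as np

class LinearUCBSketch:
    def __init__(self, n_arms, context_dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(context_dim) for _ in range(n_arms)]    # A_a starts as the identity
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]  # b_a starts at zero

    def select_arm(self, context):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                       # ridge-regression estimate
            scores.append(context @ theta
                          + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context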
Simulate click rates that change over time:
true_rates = [0.5, 0.3, 0.8, 0.4, 0.6]
# After t=500: rates change
if t > 500:
true_rates = [0.8, 0.4, 0.3, 0.6, 0.5]
Which algorithm adapts fastest?
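A common trick for non-stationary rates is to replace the 1/N step size with a constant step size, so old observations are gradually forgotten. A sketch of that variant (step_size is an arbitrary choice, and it assumes EpsilonGreedy from TODO 1):
class NonStationaryEpsilonGreedy(EpsilonGreedy):
    def __init__(self, n_arms, epsilon=0.1, step_size=0.1):
        super().__init__(n_arms, epsilon=epsilon)
        self.step_size = step_size

    def update(self, arm, reward):
        self.N[arm] += 1
        # Exponential recency-weighted average instead of the sample mean.
        self.Q[arm] += self.step_size * (reward - self.Q[arm])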
Simulate A/B test for website designs:
# K=3 website designs with unknown conversion rates
true_conversion_rates = [0.05, 0.08, 0.06]
# Run bandit algorithms to find best design
# Track traffic allocation and total conversions
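One simple way to report the traffic allocation, assuming the classes above are implemented, is to count how often each design was shown:
import numpy as np

true_conversion_rates = [0.05, 0.08, 0.06]
rewards, arm_selections = simulate_bandit(
    ThompsonSampling(n_arms=3), true_conversion_rates, T=5000)
traffic = np.bincount(arm_selections, minlength=3) / len(arm_selections)
print("Traffic share per design:", traffic)
print("Total conversions:", sum(rewards))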
Completed Notebook: activity-07-multi-armed-bandits.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
Congratulations! You've completed the Reinforcement Learning module. Next, we move to Generative AI!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy exploring the world of bandits! 🎰🎲