Practice and reinforce the concepts from Lesson 7
In this activity, you'll implement and compare three classic bandit algorithms (ε-Greedy, UCB, Thompson Sampling) on a simulated news recommendation system. You'll see exploration strategies in action and understand the exploration-exploitation trade-off!
By completing this activity, you will:
Download the activity template from the Templates folder:
AI25-Template-activity-07-multi-armed-bandits.zip (Templates/AI25-Template-activity-07-multi-armed-bandits.zip)
Upload activity-07-multi-armed-bandits.ipynb to Google Colab.
Execute the first few cells to:
News Recommendation Scenario:
Example True Click Rates:
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]
#                        ↑
#              Best article (0.8 click rate)
Regret: the difference between the total reward you would collect by always showing the best article and the total reward your algorithm actually collects.
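In simulation you know the true click rates, so you can compute expected cumulative regret directly from the log of chosen articles. A minimal sketch (assuming arm_selections is the list of chosen article indices produced by the simulation loop later in this notebook):
import numpy as np

def cumulative_regret(arm_selections, true_rates):
    # Expected per-step regret: best click rate minus the rate of the chosen article.
    best_rate = max(true_rates)
    per_step = [best_rate - true_rates[a] for a in arm_selections]
    return np.cumsum(per_step)  # running total over time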
TODO 1: Implement ε-greedy exploration
import numpy as np

class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.Q = np.zeros(n_arms) # Estimated click rates
self.N = np.zeros(n_arms) # Number of times each article shown
def select_arm(self):
# TODO 1: Implement ε-greedy selection
# With probability ε: return random arm
# With probability 1-ε: return arm with max Q
pass
def update(self, arm, reward):
# TODO 1: Update estimate for chosen arm
# Use incremental mean: Q_new = Q_old + (1/N) * (reward - Q_old)
pass
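As a sanity check for the TODO 1 update rule, the incremental mean should reproduce the plain average of all rewards seen so far. A tiny standalone illustration (the reward list here is made up):
import numpy as np

rewards_seen = [1, 0, 0, 1, 1]  # hypothetical click outcomes for one article
Q, N = 0.0, 0
for r in rewards_seen:
    N += 1
    Q = Q + (1.0 / N) * (r - Q)  # incremental mean update from TODO 1
print(Q, np.mean(rewards_seen))  # both are 0.6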
TODO 2: Implement Upper Confidence Bound
class UCB:
def __init__(self, n_arms, c=2.0):
self.n_arms = n_arms
self.c = c
self.Q = np.zeros(n_arms)
self.N = np.zeros(n_arms)
self.t = 0 # Total timesteps
def select_arm(self):
# TODO 2: Implement UCB selection
self.t += 1
# Pull each arm once first
if np.any(self.N == 0):
return np.argmin(self.N)
# Compute UCB for each arm
# UCB(a) = Q(a) + c * sqrt(log(t) / N(a))
# Return arm with highest UCB
pass
def update(self, arm, reward):
# TODO 2: Update estimate (same as ε-greedy)
pass
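To build intuition for TODO 2, note that the exploration bonus c * sqrt(log(t) / N(a)) shrinks as an arm is pulled more often, so rarely tried arms keep getting revisited. A quick illustration with made-up values:
import numpy as np

c, t = 2.0, 1000  # illustrative values, not taken from the template
for n in [1, 10, 100, 1000]:
    bonus = c * np.sqrt(np.log(t) / n)
    print(f"N(a) = {n:4d} -> exploration bonus = {bonus:.3f}")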
TODO 3: Implement Thompson Sampling with Beta distributions
class ThompsonSampling:
def __init__(self, n_arms):
self.n_arms = n_arms
self.alpha = np.ones(n_arms) # Successes (Beta prior)
self.beta = np.ones(n_arms) # Failures (Beta prior)
def select_arm(self):
# TODO 3: Implement Thompson Sampling
# 1. Sample from Beta(α, β) for each arm
# 2. Return arm with highest sample
pass
def update(self, arm, reward):
# TODO 3: Update Beta distribution
# If reward == 1 (click): α ← α + 1
# If reward == 0 (no click): β ← β + 1
pass
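The key primitive for TODO 3 is drawing one posterior sample per arm. The snippet below only demonstrates np.random.beta on made-up counts; it is not a solution to the TODO:
import numpy as np

# An arm with 8 clicks / 2 misses usually yields higher samples than one with 2 clicks / 8 misses.
print(np.random.beta(1 + 8, 1 + 2, size=5))
print(np.random.beta(1 + 2, 1 + 8, size=5))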
TODO 4: Run simulations for all three algorithms
def simulate_bandit(algorithm, true_rates, T=1000):
# TODO 4: Simulate bandit algorithm
rewards = []
arm_selections = []
for t in range(T):
# 1. Select arm using algorithm
arm = algorithm.select_arm()
# 2. Simulate user click (Bernoulli trial)
reward = 1 if np.random.random() < true_rates[arm] else 0
# 3. Update algorithm
algorithm.update(arm, reward)
# 4. Track performance
rewards.append(reward)
arm_selections.append(arm)
return rewards, arm_selections
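Once the three classes are filled in, a comparison run could look like the sketch below (it assumes your implementations keep the constructor signatures shown above):
import numpy as np

np.random.seed(0)
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]

algorithms = {
    "epsilon-greedy": EpsilonGreedy(n_arms=5, epsilon=0.1),
    "UCB": UCB(n_arms=5, c=2.0),
    "Thompson Sampling": ThompsonSampling(n_arms=5),
}

for name, algo in algorithms.items():
    rewards, arm_selections = simulate_bandit(algo, true_rates, T=1000)
    print(f"{name}: total clicks = {sum(rewards)}, "
          f"best-arm pulls = {arm_selections.count(2)}")  # index 2 is the best article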
Visualization of:
ε-Greedy (ε=0.1):
UCB (c=2):
Thompson Sampling:
Optimal (Oracle with true rates):
Regret
↑
150 | ε-Greedy (linear growth)
| ╱
100 | ╱ UCB (sublinear)
| ╱ ╱
50 | ╱ ╱ Thompson Sampling (best)
|╱___╱___
0 └────────────────> Time
0 200 600 1000
Your implementation is complete when:
ε-Greedy:
UCB:
Thompson Sampling:
ε-Greedy:
# Too low: insufficient exploration
epsilon = 0.01 # Might commit to suboptimal arm early
# Good: balanced
epsilon = 0.1 # Standard choice
# Too high: excessive exploration
epsilon = 0.5 # Wastes time on bad arms
UCB:
# Conservative: c = 1.0
# Standard: c = 2.0
# Aggressive: c = 5.0
Thompson Sampling:
# No hyperparameters to tune! (Prior α=1, β=1 works well)
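If you want to see these settings' effect empirically rather than take them on faith, a simple sweep works; this sketch assumes EpsilonGreedy and simulate_bandit are already implemented (the same pattern applies to UCB's c):
true_rates = [0.3, 0.5, 0.8, 0.4, 0.6]
for eps in [0.01, 0.1, 0.5]:
    totals = []
    for _ in range(20):  # average over several runs to reduce noise
        rewards, _ = simulate_bandit(EpsilonGreedy(5, epsilon=eps), true_rates, T=1000)
        totals.append(sum(rewards))
    print(f"epsilon = {eps}: mean total clicks = {sum(totals) / len(totals):.1f}")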
Problem: All algorithms perform equally poorly
Problem: Thompson Sampling worse than others
Check that select_arm draws one sample per arm with np.random.beta(alpha, beta) and picks the argmax of those samples.
Problem: UCB not exploring enough
Implement decaying epsilon for ε-greedy:
epsilon_t = max(epsilon_min, epsilon_start * (decay_rate ** t))
Compare with fixed epsilon.
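One way to wire the decay in is a small subclass that recomputes ε each step; the names epsilon_start, epsilon_min, and decay_rate below mirror the formula above and are otherwise arbitrary (this assumes your EpsilonGreedy from TODO 1 is complete):
class DecayingEpsilonGreedy(EpsilonGreedy):
    def __init__(self, n_arms, epsilon_start=1.0, epsilon_min=0.01, decay_rate=0.995):
        super().__init__(n_arms, epsilon=epsilon_start)
        self.epsilon_start = epsilon_start
        self.epsilon_min = epsilon_min
        self.decay_rate = decay_rate
        self.step = 0

    def select_arm(self):
        # Shrink epsilon before delegating to the parent's selection rule.
        self.epsilon = max(self.epsilon_min,
                           self.epsilon_start * (self.decay_rate ** self.step))
        self.step += 1
        return super().select_arm()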
Extend to contextual setting with user features:
class LinearUCB:
def __init__(self, n_arms, context_dim):
# Maintain Aₐ and bₐ for each arm
pass
def select_arm(self, context):
# Compute θₐ = Aₐ⁻¹ bₐ for each arm
# UCB = context' θₐ + α * sqrt(context' Aₐ⁻¹ context)
pass
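For reference, the standard (disjoint) LinUCB recipe keeps a d×d matrix Aₐ and a d-vector bₐ per arm, exactly as the comments suggest. The rough sketch below shows one possible shape, with the exploration weight alpha chosen arbitrarily and the class named LinearUCBSketch so it does not clash with your own LinearUCB:
import numpy as np

class LinearUCBSketch:
    def __init__(self, n_arms, context_dim, alpha=1.0):
        self.alpha = alpha
        self.A = [np.eye(context_dim) for _ in range(n_arms)]    # A_a starts as the identity
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]  # b_a starts at zero

    def select_arm(self, context):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                       # ridge-regression estimate
            scores.append(context @ theta
                          + self.alpha * np.sqrt(context @ A_inv @ context))
        return int(np.argmax(scores))

    def update(self, arm, context, reward):
        self.A[arm] += np.outer(context, context)
        self.b[arm] += reward * context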
Simulate click rates that change over time:
true_rates = [0.5, 0.3, 0.8, 0.4, 0.6]
# After t=500: rates change
if t > 500:
true_rates = [0.8, 0.4, 0.3, 0.6, 0.5]
Which algorithm adapts fastest?
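A common trick for non-stationary rates is to replace the 1/N step size with a constant step size, so old observations are gradually forgotten. A sketch of that variant (step_size is an arbitrary choice, and it assumes EpsilonGreedy from TODO 1):
class NonStationaryEpsilonGreedy(EpsilonGreedy):
    def __init__(self, n_arms, epsilon=0.1, step_size=0.1):
        super().__init__(n_arms, epsilon=epsilon)
        self.step_size = step_size

    def update(self, arm, reward):
        self.N[arm] += 1
        # Exponential recency-weighted average instead of the sample mean.
        self.Q[arm] += self.step_size * (reward - self.Q[arm])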
Simulate A/B test for website designs:
# K=3 website designs with unknown conversion rates
true_conversion_rates = [0.05, 0.08, 0.06]
# Run bandit algorithms to find best design
# Track traffic allocation and total conversions
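One simple way to report the traffic allocation, assuming the classes above are implemented, is to count how often each design was shown:
import numpy as np

true_conversion_rates = [0.05, 0.08, 0.06]
rewards, arm_selections = simulate_bandit(
    ThompsonSampling(n_arms=3), true_conversion_rates, T=5000)
traffic = np.bincount(arm_selections, minlength=3) / len(arm_selections)
print("Traffic share per design:", traffic)
print("Total conversions:", sum(rewards))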
Completed Notebook: activity-07-multi-armed-bandits.ipynb
Performance Report: Brief summary including:
Visualizations:
After completing this activity:
Congratulations! You've completed the Reinforcement Learning module. Next, we move to Generative AI!
This activity is graded on:
Passing Grade: 70% or higher
Good luck, and enjoy exploring the world of bandits! 🎰🎲