By completing this activity, you will:
Understand the exploration-exploitation dilemma in sequential decision-making
Implement the epsilon-greedy baseline algorithm
Master Upper Confidence Bound (UCB) for principled exploration
Apply Thompson Sampling using Bayesian inference
Extend to contextual bandits with state-dependent actions
Compare algorithms using regret curves and performance metrics
Apply bandit algorithms to real-world scenarios (A/B testing, recommender systems)
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ 10-armed bandit testbed created
✅ Epsilon-greedy baseline running
✅ Performance visualization (regret, action distribution)
✅ Comparison framework ready for your algorithms
Expected First Run Time : ~30 seconds
The template comes with 65% working code :
✅ 10-Armed Bandit Environment : Custom testbed with reward distributions
✅ Epsilon-Greedy Baseline : Working implementation (from Activity 01)
✅ Performance Metrics : Regret tracking, optimal action selection
✅ Visualization Tools : Regret curves, action histograms, confidence bounds
✅ Comparison Framework : Side-by-side algorithm evaluation
✅ Random Baseline : Pure random agent for reference
⚠️ TODO 1 : Implement UCB (Upper Confidence Bound) algorithm
⚠️ TODO 2 : Implement Thompson Sampling with Beta distributions
⚠️ TODO 3 : Extend to contextual bandits (state-dependent actions)
Location : Section 5 - "UCB Algorithm"
Current State : UCB skeleton exists but selection logic is missing
Your Task : Implement UCB action selection using the formula:
text
UCB(a) = Q(a) + c * sqrt(ln(t) / N(a))
Where:
Q(a) = average reward for action a
c = exploration parameter (typically 2.0)
t = total timesteps so far
N(a) = number of times action a was taken
Key Concept : UCB balances:
Exploitation : Q(a) (choose actions with high average reward)
Exploration : c * sqrt(ln(t) / N(a)) (bonus for rarely-tried actions)
Starter Code Provided :
python
import numpy as np

class UCBAgent:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c                   # exploration parameter
        self.Q = np.zeros(n_arms)    # running average reward per arm
        self.N = np.zeros(n_arms)    # number of pulls per arm
        self.t = 0                   # total timesteps so far

    def select_action(self):
        # TODO 1: implement UCB action selection
        pass
Success Criteria :
Implementation Hints :
First, check if any action has N(a) == 0 -> return that action
Otherwise, calculate UCB for all actions: Q + c * sqrt(ln(t) / N)
Avoid division by zero: once every arm has been pulled at least once, N(a) > 0 for all arms, so the formula is safe to evaluate
Return argmax(UCB_values)
Expected Behavior :
Early timesteps: UCB aggressively explores (high uncertainty bonus)
Middle timesteps: Balances exploration/exploitation
Late timesteps: Mostly exploits (low uncertainty bonus, occasionally checks sub-optimal arms)
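Putting the formula and the hints above together, one possible completion is sketched below (the `update` method shown is an assumption about how the template tracks Q, N, and t; adapt it to the template's actual interface):
python
import numpy as np

class UCBAgent:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c
        self.Q = np.zeros(n_arms)
        self.N = np.zeros(n_arms)
        self.t = 0

    def select_action(self):
        # Pull every arm once before trusting the formula (avoids N(a) == 0)
        untried = np.where(self.N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        # UCB(a) = Q(a) + c * sqrt(ln(t) / N(a))
        ucb = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return int(np.argmax(ucb))

    def update(self, action, reward):
        # Incremental average: Q(a) <- Q(a) + (r - Q(a)) / N(a)
        self.t += 1
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]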
Location : Section 6 - "Thompson Sampling"
Your Task : Implement Thompson Sampling using Beta distributions for Bayesian inference
Concept : Maintain a belief distribution over each arm's reward probability:
Each arm has a Beta(α, β) distribution
α = successes + 1, β = failures + 1
Sample from each Beta distribution -> select action with highest sample
Why Thompson Sampling?
Naturally balances exploration/exploitation via probability matching
Near-optimal in the Bayesian sense (strong Bayesian regret guarantees)
Often outperforms UCB in practice
Algorithm :
text
For each timestep t:
1. For each arm a, sample θ_a ~ Beta(α_a, β_a)
2. Select action a* = argmax(θ_a)
3. Observe reward r
4. Update: if r == 1, α_a* += 1, else β_a* += 1
Starter Code Provided :
python
import numpy as np

class ThompsonSamplingAgent:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)   # successes + 1 (uniform Beta(1, 1) prior)
        self.beta = np.ones(n_arms)    # failures + 1

    def select_action(self):
        # TODO 2: sample from each arm's Beta posterior, return the argmax
        pass

    def update(self, action, reward):
        # TODO 2: increment alpha or beta for the chosen arm based on the reward
        pass
Success Criteria :
Implementation Hints :
Use np.random.beta(alpha[a], beta[a]) to sample for each arm
For binary rewards: if reward == 1, update alpha[action] += 1, else beta[action] += 1
For continuous rewards in [0,1]: treat the reward as a success probability (e.g., flip a Bernoulli coin with probability r to decide whether to count a success or a failure)
Initialize with α=1, β=1 (uniform prior over [0,1])
Expected Behavior :
Early: Wide Beta distributions -> high variance in samples -> exploration
Late: Narrow Beta distributions -> low variance -> exploitation
Posterior mean α/(α+β) converges to true reward mean
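Following the algorithm above, a minimal sketch of the completed agent for binary 0/1 rewards (for rewards in [0,1], apply the Bernoulli trick noted in the hints):
python
import numpy as np

class ThompsonSamplingAgent:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)   # successes + 1
        self.beta = np.ones(n_arms)    # failures + 1

    def select_action(self):
        # One draw from each arm's Beta posterior; pick the largest sample
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, action, reward):
        # Bernoulli likelihood: reward 1 counts as a success, 0 as a failure
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1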
Location : Section 7 - "Contextual Bandits"
Your Task : Extend bandits to handle state-dependent actions
Concept :
Standard bandit: Same action values regardless of context
Contextual bandit: Action values depend on current state/context
Example: Recommend movies based on user demographics (state)
Scenario : Medical treatment selection
Context : Patient features (age group: young/adult/senior)
Actions : 3 treatments
Rewards : Treatment effectiveness varies by patient age
Goal : Learn which treatment works best for each age group
Algorithm : LinUCB (Linear UCB for Contextual Bandits)
text
For each timestep t:
1. Observe context x_t (feature vector)
2. For each arm a:
   - Estimate reward: θ_a · x_t
   - Compute UCB: θ_a · x_t + α * sqrt(x_t^T * A_a^-1 * x_t)
3. Select action a* = argmax(UCB)
4. Update parameters A_a and b_a with linear regression
Starter Code Provided :
python
import numpy as np

class ContextualBanditAgent:
    def __init__(self, n_arms, context_dim):
        self.n_arms = n_arms
        self.context_dim = context_dim
        # TODO 3: initialize per-arm statistics (e.g., A_a and b_a for LinUCB)

    def select_action(self, context):
        # TODO 3: score each arm for this context, return the argmax
        pass

    def update(self, action, context, reward):
        # TODO 3: update the chosen arm's model with (context, reward)
        pass
Success Criteria :
Implementation Hints :
Represent context as one-hot vector: [1,0,0] for young, [0,1,0] for adult, etc.
Maintain separate linear model for each arm: θ_a = A_a^-1 * b_a
Update: A_a += context * context^T, b_a += reward * context
Use ridge regression (add λI to A) for numerical stability
Expected Behavior :
Agent discovers treatment 1 best for young, treatment 2 for adults, treatment 3 for seniors
Regret much lower than treating all patients the same way
Confidence bounds tighten faster for frequently-seen contexts
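One way to fill in the LinUCB agent, as a minimal sketch (the exploration strength `alpha` and ridge term `lam` shown here are assumed defaults, not values dictated by the template):
python
import numpy as np

class ContextualBanditAgent:
    def __init__(self, n_arms, context_dim, alpha=1.0, lam=1.0):
        self.n_arms = n_arms
        self.context_dim = context_dim
        self.alpha = alpha
        # One ridge-regression model per arm: A_a = lam * I, b_a = 0
        self.A = [lam * np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_action(self, context):
        x = np.asarray(context, dtype=float)
        scores = np.zeros(self.n_arms)
        for a in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                    # theta_a = A_a^-1 * b_a
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores[a] = theta @ x + bonus
        return int(np.argmax(scores))

    def update(self, action, context, reward):
        x = np.asarray(context, dtype=float)
        self.A[action] += np.outer(x, x)                 # A_a += x x^T
        self.b[action] += reward * x                     # b_a += r * x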
Implement sliding-window UCB where reward distributions change over time:
Use weighted average: Recent rewards count more than old rewards
Detect distribution shifts and re-explore
Compare: Fixed UCB vs Adaptive UCB
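A possible sketch of the sliding-window variant (the window size and the deque-based storage are illustrative choices, not requirements):
python
import numpy as np
from collections import deque

class SlidingWindowUCBAgent:
    def __init__(self, n_arms, c=2.0, window=200):
        self.n_arms = n_arms
        self.c = c
        # Only the most recent `window` observations influence the estimates
        self.history = deque(maxlen=window)

    def select_action(self):
        counts = np.zeros(self.n_arms)
        sums = np.zeros(self.n_arms)
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Re-explore any arm that has dropped out of the window
        if np.any(counts == 0):
            return int(np.argmax(counts == 0))
        q = sums / counts
        ucb = q + self.c * np.sqrt(np.log(len(self.history)) / counts)
        return int(np.argmax(ucb))

    def update(self, action, reward):
        self.history.append((action, reward))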
Combine UCB and Thompson Sampling:
Maintain posterior distributions (like Thompson Sampling)
Select action using UCB on posterior means + uncertainty
Compare: Pure Thompson vs Bayesian UCB
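A minimal sketch of this hybrid for Bernoulli rewards, scoring each arm by an upper quantile of its Beta posterior (the 95% quantile and the SciPy dependency are assumptions):
python
import numpy as np
from scipy import stats

class BayesianUCBAgent:
    def __init__(self, n_arms, quantile=0.95):
        self.alpha = np.ones(n_arms)   # successes + 1
        self.beta = np.ones(n_arms)    # failures + 1
        self.quantile = quantile

    def select_action(self):
        # Score = upper quantile of each arm's Beta posterior
        # (posterior mean plus uncertainty, instead of a random sample)
        scores = stats.beta.ppf(self.quantile, self.alpha, self.beta)
        return int(np.argmax(scores))

    def update(self, action, reward):
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1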
Replace linear model with neural network:
Use PyTorch to model reward: NN(context, action) → reward
Apply UCB to neural network predictions
Train with replay buffer (store past (context, action, reward) tuples)
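A rough sketch of how these pieces could fit together (the network size, learning rate, and the count-based bonus used as a stand-in for true network uncertainty are all simplifying assumptions):
python
import numpy as np
import torch
import torch.nn as nn

class NeuralBanditAgent:
    def __init__(self, n_arms, context_dim, c=1.0, lr=1e-2):
        self.n_arms = n_arms
        self.c = c
        # Small MLP mapping a context to one predicted reward per arm
        self.net = nn.Sequential(nn.Linear(context_dim, 32), nn.ReLU(), nn.Linear(32, n_arms))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.buffer = []              # replay buffer of (context, action, reward)
        self.N = np.zeros(n_arms)     # per-arm pull counts for the exploration bonus
        self.t = 0

    def select_action(self, context):
        self.t += 1
        with torch.no_grad():
            pred = self.net(torch.as_tensor(context, dtype=torch.float32)).numpy()
        # Count-based UCB bonus as a stand-in for network uncertainty
        bonus = self.c * np.sqrt(np.log(self.t + 1) / (self.N + 1))
        return int(np.argmax(pred + bonus))

    def update(self, action, context, reward, batch_size=32):
        self.N[action] += 1
        self.buffer.append((np.asarray(context, dtype=np.float32), action, reward))
        # Train on a random minibatch from the replay buffer
        idx = np.random.choice(len(self.buffer), size=min(batch_size, len(self.buffer)))
        contexts = torch.as_tensor(np.stack([self.buffer[i][0] for i in idx]))
        actions = torch.as_tensor([self.buffer[i][1] for i in idx]).view(-1, 1)
        rewards = torch.as_tensor([self.buffer[i][2] for i in idx], dtype=torch.float32)
        pred = self.net(contexts).gather(1, actions).squeeze(1)
        loss = nn.functional.mse_loss(pred, rewards)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()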
Apply bandits to simulated A/B testing:
Scenario: 5 website designs, measure click-through rates
Compare: Traditional A/B test (equal traffic split across designs) vs Thompson Sampling
Metric: Cumulative clicks over 1000 users
Show Thompson Sampling achieves higher clicks (less time wasted on bad designs)
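A sketch of how this comparison could be simulated (the function name and the click-through rates are made-up example values, not part of the template):
python
import numpy as np

def simulate_ab_vs_thompson(ctrs, n_users=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_designs = len(ctrs)

    # Traditional A/B test: assign each user to a uniformly random design
    ab_clicks = sum(rng.random() < ctrs[rng.integers(n_designs)] for _ in range(n_users))

    # Thompson Sampling: adaptively favor designs that are getting clicks
    alpha, beta = np.ones(n_designs), np.ones(n_designs)
    ts_clicks = 0
    for _ in range(n_users):
        design = int(np.argmax(rng.beta(alpha, beta)))
        click = rng.random() < ctrs[design]
        ts_clicks += click
        alpha[design] += click
        beta[design] += 1 - click
    return ab_clicks, ts_clicks

# Example: 5 website designs with hypothetical click-through rates
ab, ts = simulate_ab_vs_thompson(np.array([0.02, 0.03, 0.05, 0.04, 0.08]))
print(f"Uniform split clicks: {ab}, Thompson Sampling clicks: {ts}")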
Random Baseline :
Cumulative Regret: ~500 after 1000 steps
Optimal Action Selection: ~10% (random chance for 10 arms)
Learning: No improvement over time
Epsilon-Greedy :
Cumulative Regret: 70-100 after 1000 steps
Optimal Action Selection: ~70-80% (late timesteps)
Learning: Fast initial learning, then plateaus
UCB :
Cumulative Regret: 40-60 after 1000 steps (15-25% better than epsilon-greedy)
Optimal Action Selection: ~80-90%
Learning: Steady improvement, logarithmic regret bound
Thompson Sampling :
Cumulative Regret: 30-50 after 1000 steps (best single-context algorithm)
Optimal Action Selection: ~85-95%
Learning: Fast convergence, near-optimal Bayesian regret
Contextual Bandits (LinUCB) :
Cumulative Regret: 20-40 after 1000 steps (30-40% better than non-contextual)
Context-Specific Accuracy: >90% for each context
Learning: Rapid learning when contexts are distinct
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your UCB implementation:
Ensure you try each action at least once (handle N(a) == 0)
Verify ln(t) is natural log, not log10
Check exploration constant c: Try c=1.0, 2.0, 3.0 (default 2.0)
Ensure Q(a) is average reward: total_reward[a] / N(a)
Solution :
Verify you're sampling from Beta distribution, not using mean
Check Beta parameters aren't diverging: alpha + beta < 10000 in early steps
Ensure rewards are in [0,1] range (normalize if needed)
Try different priors: alpha=1, beta=1 (uniform) vs alpha=0.5, beta=0.5 (Jeffreys)
Solution :
Verify contexts are actually different (check feature vectors)
Ensure you're using context features in prediction: θ · context
Check linear model update: A += context * context.T, b += reward * context
Add regularization: A += 0.01 * np.eye(context_dim) for numerical stability
Solution :
This is normal! Bandit algorithms have inherent randomness
Run multiple seeds (10-20) and plot average regret with confidence intervals
Smooth with rolling average: pd.Series(regret).rolling(50).mean()
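For instance, a small helper along these lines could average curves across seeds (`run_experiment` is a hypothetical name for whatever function in your notebook returns a per-step cumulative-regret array for one seed):
python
import numpy as np

def average_regret(agent_factory, n_seeds=20):
    # run_experiment(...) is assumed to exist in the notebook and to return
    # a 1-D array of cumulative regret for a single seed
    curves = np.stack([run_experiment(agent_factory, seed=s) for s in range(n_seeds)])
    mean = curves.mean(axis=0)
    half_width = 1.96 * curves.std(axis=0) / np.sqrt(n_seeds)   # ~95% confidence band
    return mean, mean - half_width, mean + half_width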
Solution :
Check reward distributions are different enough (optimal arm clearly better)
Increase number of timesteps (1000 -> 5000) to see asymptotic behavior
Verify you're tracking cumulative regret, not instantaneous regret
Ensure optimal action is correct: optimal_action = np.argmax(true_means)
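If in doubt, cumulative regret can be computed like this (a minimal sketch; `true_means` holds each arm's true expected reward):
python
import numpy as np

def cumulative_regret(actions, true_means):
    true_means = np.asarray(true_means)
    # Per-step regret = best expected reward minus expected reward of the chosen arm
    per_step = true_means.max() - true_means[np.asarray(actions)]
    return np.cumsum(per_step)   # cumulative, not instantaneous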
Concept 01 : Introduction to RL (exploration-exploitation)
Concept 02 : Q-Learning (value-based methods)
Concept 07 : Multi-Armed Bandits (theory behind UCB, Thompson Sampling)
Auer et al. (2002): "Finite-time Analysis of the Multi-Armed Bandit Problem" (UCB algorithm)
Thompson (1933): "On the Likelihood that One Unknown Probability Exceeds Another" (Thompson Sampling)
Li et al. (2010): "A Contextual-Bandit Approach to Personalized News Article Recommendation" (LinUCB)
Russo et al. (2018): "A Tutorial on Thompson Sampling" (comprehensive survey)
A/B Testing : Optimize website designs, ad campaigns
Recommender Systems : Personalized content recommendations
Clinical Trials : Adaptive treatment assignment
Online Advertising : Ad placement optimization
Hyperparameter Tuning : Efficient neural network optimization
Epsilon-greedy : Fixed exploration rate (always explores ε% of time, even when confident)
UCB : Adaptive exploration (explores less as confidence grows)
Result : UCB wastes less time exploring clearly-bad actions
UCB : Deterministic, uses conservative worst-case confidence bounds
Thompson Sampling : Probabilistic (matches exploration probability to uncertainty)
Result : Thompson Sampling often strikes this balance faster in practice
Non-contextual : One policy for all contexts (averages over different situations)
Contextual : Different policy per context (tailored decisions)
Result : Contextual bandits exploit problem structure (if reward truly depends on context)
Epsilon-greedy : O(T) regret (linear growth, not optimal)
UCB : O(log T) regret (logarithmic growth, optimal for stationary bandits)
Thompson Sampling : O(log T) regret (Bayesian optimal)
LinUCB : O(sqrt(T)) regret (optimal for contextual bandits)
Epsilon-Greedy :
✅ Simple to implement and understand
✅ Works well with epsilon decay
❌ Not theoretically optimal (linear regret)
Use : Quick prototypes, teaching, non-critical applications
UCB :
✅ Theoretical guarantees (logarithmic regret)
✅ Only one hyperparameter to tune (the exploration constant c)
❌ Assumes stationary rewards
Use : A/B testing, ad optimization, when you need guarantees
Thompson Sampling :
✅ Often best practical performance
✅ Handles delayed/batch feedback well
❌ Requires problem-specific prior (Beta for Bernoulli, Gaussian for continuous)
Use : Recommender systems, clinical trials, when you have domain knowledge
Contextual Bandits :
✅ Exploits side information (context/state)
✅ Much better when reward depends on context
❌ Requires feature engineering
❌ More complex (higher implementation cost)
Use : Personalized recommendations, when context matters
Complete required TODOs (minimum: TODO 1-2)
Run entire notebook to generate all outputs
Export regret comparison plot : Save final comparison as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and comparison.png
Submission Checklist :
After mastering multi-armed bandits:
Move to Activity 08: RL Debugging and Best Practices
Learn systematic debugging for RL algorithms
Apply to Project 2: Contextual Bandit Application (real-world scenario)
Key Insight : Bandits are stateless RL (no environment dynamics). Next, you'll learn stateful RL (MDPs, POMDPs), where actions affect future states. Bandits are the foundation; master them first!
Good luck! Multi-armed bandits teach you the core of exploration-exploitation. These algorithms power real-world systems serving billions of users (Netflix, Google, Amazon). You're learning production-ready techniques! 🎰🚀