By completing this activity, you will:
Understand the exploration-exploitation dilemma in sequential decision-making
Implement the epsilon-greedy baseline algorithm
Master Upper Confidence Bound (UCB) for principled exploration
Apply Thompson Sampling using Bayesian inference
Extend to contextual bandits with state-dependent actions
Compare algorithms using regret curves and performance metrics
Apply bandit algorithms to real-world scenarios (A/B testing, recommender systems)
Open in Google Colab : Upload this notebook to Google Colab
Run All Cells : Click Runtime -> Run all (or press Ctrl+F9)
Watch the Magic : You'll see:
✅ 10-armed bandit testbed created
✅ Epsilon-greedy baseline running
✅ Performance visualization (regret, action distribution)
✅ Comparison framework ready for your algorithms
Expected First Run Time : ~30 seconds
The template comes with 65% working code :
✅ 10-Armed Bandit Environment : Custom testbed with reward distributions
✅ Epsilon-Greedy Baseline : Working implementation (from Activity 01)
✅ Performance Metrics : Regret tracking, optimal action selection
✅ Visualization Tools : Regret curves, action histograms, confidence bounds
✅ Comparison Framework : Side-by-side algorithm evaluation
✅ Random Baseline : Pure random agent for reference
⚠️ TODO 1 : Implement UCB (Upper Confidence Bound) algorithm
⚠️ TODO 2 : Implement Thompson Sampling with Beta distributions
⚠️ TODO 3 : Extend to contextual bandits (state-dependent actions)
Location : Section 5 - "UCB Algorithm"
Current State : UCB skeleton exists but selection logic is missing
Your Task : Implement UCB action selection using the formula:
text
UCB(a) = Q(a) + c * sqrt(ln(t) / N(a))
Where:
Q(a) = average reward for action a
c = exploration parameter (typically 2.0)
t = total timesteps so far
N(a) = number of times action a was taken
Key Concept : UCB balances:
Exploitation : Q(a) (choose actions with high average reward)
Exploration : c * sqrt(ln(t) / N(a)) (bonus for rarely-tried actions)
Starter Code Provided :
python
import numpy as np

class UCBAgent:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c                   # exploration parameter
        self.Q = np.zeros(n_arms)    # running average reward per arm
        self.N = np.zeros(n_arms)    # number of pulls per arm
        self.t = 0                   # total timesteps so far

    def select_action(self):
        # TODO 1: implement UCB action selection
        pass
Success Criteria :
Implementation Hints :
First, check if any action has N(a) == 0 -> return that action
Otherwise, calculate UCB for all actions: Q + c * sqrt(ln(t) / N)
Avoid division by zero: once every arm has been pulled at least once, N(a) > 0 for all arms, so the formula is safe to evaluate
Return argmax(UCB_values)
Expected Behavior :
Early timesteps: UCB aggressively explores (high uncertainty bonus)
Middle timesteps: Balances exploration/exploitation
Late timesteps: Mostly exploits (low uncertainty bonus, occasionally checks sub-optimal arms)
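Putting the formula and the hints above together, one possible completion is sketched below (the `update` method shown is an assumption about how the template tracks Q, N, and t; adapt it to the template's actual interface):
python
import numpy as np

class UCBAgent:
    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c
        self.Q = np.zeros(n_arms)
        self.N = np.zeros(n_arms)
        self.t = 0

    def select_action(self):
        # Pull every arm once before trusting the formula (avoids N(a) == 0)
        untried = np.where(self.N == 0)[0]
        if len(untried) > 0:
            return int(untried[0])
        # UCB(a) = Q(a) + c * sqrt(ln(t) / N(a))
        ucb = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return int(np.argmax(ucb))

    def update(self, action, reward):
        # Incremental average: Q(a) <- Q(a) + (r - Q(a)) / N(a)
        self.t += 1
        self.N[action] += 1
        self.Q[action] += (reward - self.Q[action]) / self.N[action]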
Location : Section 6 - "Thompson Sampling"
Your Task : Implement Thompson Sampling using Beta distributions for Bayesian inference
Concept : Maintain a belief distribution over each arm's reward probability:
Each arm has a Beta(α, β) distribution
α = successes + 1, β = failures + 1
Sample from each Beta distribution -> select action with highest sample
Why Thompson Sampling?
Naturally balances exploration/exploitation via probability matching
Near-optimal in the Bayesian sense (strong Bayesian regret guarantees)
Often outperforms UCB in practice
Algorithm :
text
For each timestep t:
1. For each arm a, sample θ_a ~ Beta(α_a, β_a)
2. Select action a* = argmax(θ_a)
3. Observe reward r
4. Update: if r == 1, α_a* += 1, else β_a* += 1
Starter Code Provided :
python
import numpy as np

class ThompsonSamplingAgent:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)   # successes + 1 (uniform Beta(1, 1) prior)
        self.beta = np.ones(n_arms)    # failures + 1

    def select_action(self):
        # TODO 2: sample from each arm's Beta posterior, return the argmax
        pass

    def update(self, action, reward):
        # TODO 2: increment alpha or beta for the chosen arm based on the reward
        pass
Success Criteria :
Implementation Hints :
Use np.random.beta(alpha[a], beta[a]) to sample for each arm
For binary rewards: if reward == 1, update alpha[action] += 1, else beta[action] += 1
For continuous rewards in [0,1]: treat the reward as a success probability (e.g., flip a Bernoulli coin with probability r to decide whether to count a success or a failure)
Initialize with α=1, β=1 (uniform prior over [0,1])
Expected Behavior :
Early: Wide Beta distributions -> high variance in samples -> exploration
Late: Narrow Beta distributions -> low variance -> exploitation
Posterior mean α/(α+β) converges to true reward mean
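Following the algorithm above, a minimal sketch of the completed agent for binary 0/1 rewards (for rewards in [0,1], apply the Bernoulli trick noted in the hints):
python
import numpy as np

class ThompsonSamplingAgent:
    def __init__(self, n_arms):
        self.n_arms = n_arms
        self.alpha = np.ones(n_arms)   # successes + 1
        self.beta = np.ones(n_arms)    # failures + 1

    def select_action(self):
        # One draw from each arm's Beta posterior; pick the largest sample
        samples = np.random.beta(self.alpha, self.beta)
        return int(np.argmax(samples))

    def update(self, action, reward):
        # Bernoulli likelihood: reward 1 counts as a success, 0 as a failure
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1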
Location : Section 7 - "Contextual Bandits"
Your Task : Extend bandits to handle state-dependent actions
Concept :
Standard bandit: Same action values regardless of context
Contextual bandit: Action values depend on current state/context
Example: Recommend movies based on user demographics (state)
Scenario : Medical treatment selection
Context : Patient features (age group: young/adult/senior)
Actions : 3 treatments
Rewards : Treatment effectiveness varies by patient age
Goal : Learn which treatment works best for each age group
Algorithm : LinUCB (Linear UCB for Contextual Bandits)
text
For each timestep t:
1. Observe context x_t (feature vector)
2. For each arm a:
   - Estimate reward: θ_a · x_t
   - Compute UCB: θ_a · x_t + α * sqrt(x_t^T * A_a^-1 * x_t)
3. Select action a* = argmax(UCB)
4. Update parameters A_a and b_a with linear regression
Starter Code Provided :
python
import numpy as np

class ContextualBanditAgent:
    def __init__(self, n_arms, context_dim):
        self.n_arms = n_arms
        self.context_dim = context_dim
        # TODO 3: initialize per-arm statistics (e.g., A_a and b_a for LinUCB)

    def select_action(self, context):
        # TODO 3: score each arm for this context, return the argmax
        pass

    def update(self, action, context, reward):
        # TODO 3: update the chosen arm's model with (context, reward)
        pass
Success Criteria :
Implementation Hints :
Represent context as one-hot vector: [1,0,0] for young, [0,1,0] for adult, etc.
Maintain separate linear model for each arm: θ_a = A_a^-1 * b_a
Update: A_a += context * context^T, b_a += reward * context
Use ridge regression (add λI to A) for numerical stability
Expected Behavior :
Agent discovers treatment 1 best for young, treatment 2 for adults, treatment 3 for seniors
Regret much lower than treating all patients the same way
Confidence bounds tighten faster for frequently-seen contexts
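One way to fill in the LinUCB agent, as a minimal sketch (the exploration strength `alpha` and ridge term `lam` shown here are assumed defaults, not values dictated by the template):
python
import numpy as np

class ContextualBanditAgent:
    def __init__(self, n_arms, context_dim, alpha=1.0, lam=1.0):
        self.n_arms = n_arms
        self.context_dim = context_dim
        self.alpha = alpha
        # One ridge-regression model per arm: A_a = lam * I, b_a = 0
        self.A = [lam * np.eye(context_dim) for _ in range(n_arms)]
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]

    def select_action(self, context):
        x = np.asarray(context, dtype=float)
        scores = np.zeros(self.n_arms)
        for a in range(self.n_arms):
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]                    # theta_a = A_a^-1 * b_a
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)  # confidence width
            scores[a] = theta @ x + bonus
        return int(np.argmax(scores))

    def update(self, action, context, reward):
        x = np.asarray(context, dtype=float)
        self.A[action] += np.outer(x, x)                 # A_a += x x^T
        self.b[action] += reward * x                     # b_a += r * x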
Implement sliding-window UCB where reward distributions change over time:
Use weighted average: Recent rewards count more than old rewards
Detect distribution shifts and re-explore
Compare: Fixed UCB vs Adaptive UCB
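A possible sketch of the sliding-window variant (the window size and the deque-based storage are illustrative choices, not requirements):
python
import numpy as np
from collections import deque

class SlidingWindowUCBAgent:
    def __init__(self, n_arms, c=2.0, window=200):
        self.n_arms = n_arms
        self.c = c
        # Only the most recent `window` observations influence the estimates
        self.history = deque(maxlen=window)

    def select_action(self):
        counts = np.zeros(self.n_arms)
        sums = np.zeros(self.n_arms)
        for arm, reward in self.history:
            counts[arm] += 1
            sums[arm] += reward
        # Re-explore any arm that has dropped out of the window
        if np.any(counts == 0):
            return int(np.argmax(counts == 0))
        q = sums / counts
        ucb = q + self.c * np.sqrt(np.log(len(self.history)) / counts)
        return int(np.argmax(ucb))

    def update(self, action, reward):
        self.history.append((action, reward))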
Combine UCB and Thompson Sampling:
Maintain posterior distributions (like Thompson Sampling)
Select action using UCB on posterior means + uncertainty
Compare: Pure Thompson vs Bayesian UCB
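A minimal sketch of this hybrid for Bernoulli rewards, scoring each arm by an upper quantile of its Beta posterior (the 95% quantile and the SciPy dependency are assumptions):
python
import numpy as np
from scipy import stats

class BayesianUCBAgent:
    def __init__(self, n_arms, quantile=0.95):
        self.alpha = np.ones(n_arms)   # successes + 1
        self.beta = np.ones(n_arms)    # failures + 1
        self.quantile = quantile

    def select_action(self):
        # Score = upper quantile of each arm's Beta posterior
        # (posterior mean plus uncertainty, instead of a random sample)
        scores = stats.beta.ppf(self.quantile, self.alpha, self.beta)
        return int(np.argmax(scores))

    def update(self, action, reward):
        if reward == 1:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1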
Replace linear model with neural network:
Use PyTorch to model reward: NN(context, action) → reward
Apply UCB to neural network predictions
Train with replay buffer (store past (context, action, reward) tuples)
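A rough sketch of how these pieces could fit together (the network size, learning rate, and the count-based bonus used as a stand-in for true network uncertainty are all simplifying assumptions):
python
import numpy as np
import torch
import torch.nn as nn

class NeuralBanditAgent:
    def __init__(self, n_arms, context_dim, c=1.0, lr=1e-2):
        self.n_arms = n_arms
        self.c = c
        # Small MLP mapping a context to one predicted reward per arm
        self.net = nn.Sequential(nn.Linear(context_dim, 32), nn.ReLU(), nn.Linear(32, n_arms))
        self.opt = torch.optim.Adam(self.net.parameters(), lr=lr)
        self.buffer = []              # replay buffer of (context, action, reward)
        self.N = np.zeros(n_arms)     # per-arm pull counts for the exploration bonus
        self.t = 0

    def select_action(self, context):
        self.t += 1
        with torch.no_grad():
            pred = self.net(torch.as_tensor(context, dtype=torch.float32)).numpy()
        # Count-based UCB bonus as a stand-in for network uncertainty
        bonus = self.c * np.sqrt(np.log(self.t + 1) / (self.N + 1))
        return int(np.argmax(pred + bonus))

    def update(self, action, context, reward, batch_size=32):
        self.N[action] += 1
        self.buffer.append((np.asarray(context, dtype=np.float32), action, reward))
        # Train on a random minibatch from the replay buffer
        idx = np.random.choice(len(self.buffer), size=min(batch_size, len(self.buffer)))
        contexts = torch.as_tensor(np.stack([self.buffer[i][0] for i in idx]))
        actions = torch.as_tensor([self.buffer[i][1] for i in idx]).view(-1, 1)
        rewards = torch.as_tensor([self.buffer[i][2] for i in idx], dtype=torch.float32)
        pred = self.net(contexts).gather(1, actions).squeeze(1)
        loss = nn.functional.mse_loss(pred, rewards)
        self.opt.zero_grad()
        loss.backward()
        self.opt.step()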
Apply bandits to simulated A/B testing:
Scenario: 5 website designs, measure click-through rates
Compare: Traditional A/B test (equal traffic split across designs) vs Thompson Sampling
Metric: Cumulative clicks over 1000 users
Show Thompson Sampling achieves higher clicks (less time wasted on bad designs)
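A sketch of how this comparison could be simulated (the function name and the click-through rates are made-up example values, not part of the template):
python
import numpy as np

def simulate_ab_vs_thompson(ctrs, n_users=1000, seed=0):
    rng = np.random.default_rng(seed)
    n_designs = len(ctrs)

    # Traditional A/B test: assign each user to a uniformly random design
    ab_clicks = sum(rng.random() < ctrs[rng.integers(n_designs)] for _ in range(n_users))

    # Thompson Sampling: adaptively favor designs that are getting clicks
    alpha, beta = np.ones(n_designs), np.ones(n_designs)
    ts_clicks = 0
    for _ in range(n_users):
        design = int(np.argmax(rng.beta(alpha, beta)))
        click = rng.random() < ctrs[design]
        ts_clicks += click
        alpha[design] += click
        beta[design] += 1 - click
    return ab_clicks, ts_clicks

# Example: 5 website designs with hypothetical click-through rates
ab, ts = simulate_ab_vs_thompson(np.array([0.02, 0.03, 0.05, 0.04, 0.08]))
print(f"Uniform split clicks: {ab}, Thompson Sampling clicks: {ts}")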
Random Baseline :
Cumulative Regret: ~500 after 1000 steps
Optimal Action Selection: ~10% (random chance for 10 arms)
Learning: No improvement over time
Epsilon-Greedy :
Cumulative Regret: 70-100 after 1000 steps
Optimal Action Selection: ~70-80% (late timesteps)
Learning: Fast initial learning, then plateaus
UCB :
Cumulative Regret: 40-60 after 1000 steps (15-25% better than epsilon-greedy)
Optimal Action Selection: ~80-90%
Learning: Steady improvement, logarithmic regret bound
Thompson Sampling :
Cumulative Regret: 30-50 after 1000 steps (best single-context algorithm)
Optimal Action Selection: ~85-95%
Learning: Fast convergence, near-optimal Bayesian regret
Contextual Bandits (LinUCB) :
Cumulative Regret: 20-40 after 1000 steps (30-40% better than non-contextual)
Context-Specific Accuracy: >90% for each context
Learning: Rapid learning when contexts are distinct
Minimum Requirements (for passing):
Target Grade (for excellent work):
Exceptional Work (bonus points):
Solution : Check your UCB implementation:
Ensure you try each action at least once (handle N(a) == 0)
Verify ln(t) is natural log, not log10
Check exploration constant c: Try c=1.0, 2.0, 3.0 (default 2.0)
Ensure Q(a) is average reward: total_reward[a] / N(a)
Solution :
Verify you're sampling from Beta distribution, not using mean
Check Beta parameters aren't diverging: alpha + beta < 10000 in early steps
Ensure rewards are in [0,1] range (normalize if needed)
Try different priors: alpha=1, beta=1 (uniform) vs alpha=0.5, beta=0.5 (Jeffreys)
Solution :
Verify contexts are actually different (check feature vectors)
Ensure you're using context features in prediction: θ · context
Check linear model update: A += context * context.T, b += reward * context
Add regularization: A += 0.01 * np.eye(context_dim) for numerical stability
Solution :
This is normal! Bandit algorithms have inherent randomness
Run multiple seeds (10-20) and plot average regret with confidence intervals
Smooth with rolling average: pd.Series(regret).rolling(50).mean()
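For instance, a small helper along these lines could average curves across seeds (`run_experiment` is a hypothetical name for whatever function in your notebook returns a per-step cumulative-regret array for one seed):
python
import numpy as np

def average_regret(agent_factory, n_seeds=20):
    # run_experiment(...) is assumed to exist in the notebook and to return
    # a 1-D array of cumulative regret for a single seed
    curves = np.stack([run_experiment(agent_factory, seed=s) for s in range(n_seeds)])
    mean = curves.mean(axis=0)
    half_width = 1.96 * curves.std(axis=0) / np.sqrt(n_seeds)   # ~95% confidence band
    return mean, mean - half_width, mean + half_width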
Solution :
Check reward distributions are different enough (optimal arm clearly better)
Increase number of timesteps (1000 -> 5000) to see asymptotic behavior
Verify you're tracking cumulative regret, not instantaneous regret
Ensure optimal action is correct: optimal_action = np.argmax(true_means)
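If in doubt, cumulative regret can be computed like this (a minimal sketch; `true_means` holds each arm's true expected reward):
python
import numpy as np

def cumulative_regret(actions, true_means):
    true_means = np.asarray(true_means)
    # Per-step regret = best expected reward minus expected reward of the chosen arm
    per_step = true_means.max() - true_means[np.asarray(actions)]
    return np.cumsum(per_step)   # cumulative, not instantaneous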
Concept 01 : Introduction to RL (exploration-exploitation)
Concept 02 : Q-Learning (value-based methods)
Concept 07 : Multi-Armed Bandits (theory behind UCB, Thompson Sampling)
Auer et al. (2002): "Finite-time Analysis of the Multi-Armed Bandit Problem" (UCB algorithm)
Thompson (1933): "On the Likelihood that One Unknown Probability Exceeds Another" (Thompson Sampling)
Li et al. (2010): "A Contextual-Bandit Approach to Personalized News Article Recommendation" (LinUCB)
Russo et al. (2018): "A Tutorial on Thompson Sampling" (comprehensive survey)
A/B Testing : Optimize website designs, ad campaigns
Recommender Systems : Personalized content recommendations
Clinical Trials : Adaptive treatment assignment
Online Advertising : Ad placement optimization
Hyperparameter Tuning : Efficient neural network optimization
Epsilon-greedy : Fixed exploration rate (always explores ε% of time, even when confident)
UCB : Adaptive exploration (explores less as confidence grows)
Result : UCB wastes less time exploring clearly-bad actions
UCB : Deterministic, uses conservative worst-case confidence bounds
Thompson Sampling : Probabilistic (matches exploration probability to uncertainty)
Result : Thompson Sampling often strikes this balance faster in practice
Non-contextual : One policy for all contexts (averages over different situations)
Contextual : Different policy per context (tailored decisions)
Result : Contextual bandits exploit problem structure (if reward truly depends on context)
Epsilon-greedy : O(T) regret (linear growth, not optimal)
UCB : O(log T) regret (logarithmic growth, optimal for stationary bandits)
Thompson Sampling : O(log T) regret (Bayesian optimal)
LinUCB : O(sqrt(T)) regret (optimal for contextual bandits)
Epsilon-Greedy :
✅ Simple to implement and understand
✅ Works well with epsilon decay
❌ Not theoretically optimal (linear regret)
Use : Quick prototypes, teaching, non-critical applications
UCB :
✅ Theoretical guarantees (logarithmic regret)
✅ Only one hyperparameter to tune (the exploration constant c)
❌ Assumes stationary rewards
Use : A/B testing, ad optimization, when you need guarantees
Thompson Sampling :
✅ Often best practical performance
✅ Handles delayed/batch feedback well
❌ Requires problem-specific prior (Beta for Bernoulli, Gaussian for continuous)
Use : Recommender systems, clinical trials, when you have domain knowledge
Contextual Bandits :
✅ Exploits side information (context/state)
✅ Much better when reward depends on context
❌ Requires feature engineering
❌ More complex (higher implementation cost)
Use : Personalized recommendations, when context matters
Complete required TODOs (minimum: TODO 1-2)
Run entire notebook to generate all outputs
Export regret comparison plot : Save final comparison as PNG
Download notebook : File -> Download -> Download .ipynb
Submit via portal : Upload .ipynb and comparison.png
Submission Checklist :
After mastering multi-armed bandits:
Move to Activity 08: RL Debugging and Best Practices
Learn systematic debugging for RL algorithms
Apply to Project 2: Contextual Bandit Application (real-world scenario)
Key Insight : Bandits are stateless RL (no environment dynamics). Next, you'll learn stateful RL (MDPs, POMDPs), where actions affect future states. Bandits are the foundation; master them first!
Good luck! Multi-armed bandits teach you the core of exploration-exploitation. These algorithms power real-world systems serving billions of users (Netflix, Google, Amazon). You're learning production-ready techniques! 🎰🚀