ℹ️ Definition Multi-Armed Bandits are a simplified reinforcement learning framework where an agent must choose among multiple actions (arms) to maximize cumulative reward, balancing exploration (trying new arms) and exploitation (pulling the best known arm).
By the end of this lesson, you will:
In Lessons 1-6, we studied full reinforcement learning with states, actions, rewards, and long-term planning:
State → Action → Next State → Reward → ...
But what if there's no state? Just repeated choices between actions?
Multi-Armed Bandits are the simplest RL problem:
This simplification enables powerful theoretical results and practical applications.
Imagine a casino with K slot machines (one-armed bandits):
🎰 Machine 1: Average payout = $0.50 (unknown)
🎰 Machine 2: Average payout = $0.80 (unknown)
🎰 Machine 3: Average payout = $0.20 (unknown)
...
🎰 Machine K: Average payout = $0.65 (unknown)
Goal: Maximize total money won over T plays.
Challenge: You don't know which machine is best!

Setup:
Key Quantity: Regret = the reward we could have achieved with perfect knowledge minus the reward we actually collected
Regret = T * μ* - Σₜ rₜ
Where μ* = max_i μᵢ (best arm's mean)
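To make the formula concrete, here is a minimal sketch that simulates T plays of a deliberately naive random policy against made-up arm means (the values and the policy are hypothetical, chosen only to illustrate the regret calculation):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true arm means (unknown to the agent)
true_means = np.array([0.50, 0.80, 0.20, 0.65])
mu_star = true_means.max()   # μ* = best arm's mean
T = 1000

# A naive uniformly-random policy, just to illustrate the regret calculation
rewards = []
for t in range(T):
    arm = rng.integers(len(true_means))
    rewards.append(rng.random() < true_means[arm])  # Bernoulli reward

regret = T * mu_star - sum(rewards)
print(f"Regret after {T} plays: {regret:.1f}")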
Simplicity enables:
Real-world applications:
Exploitation: Choose the best known arm (maximize immediate reward)
Exploration: Try other arms to gain information (maximize long-term reward)
Example:
Arm 1: Pulled 100 times, average reward = 0.8
Arm 2: Pulled 10 times, average reward = 0.9
Arm 3: Pulled 2 times, average reward = 1.0
Which arm to pull next?
Exploitation: Pull Arm 1 (most reliable estimate)
Exploration: Pull Arm 3 (uncertain, might be amazing!)
We'll learn three main strategies:
With probability ε: Explore (choose a random arm)
With probability 1-ε: Exploit (choose the best known arm)
Initialize:
Q(a) = 0 for all arms (estimated values)
N(a) = 0 for all arms (pull counts)
For t = 1, 2, ..., T:
if random() < ε:
aₜ = random arm # Explore
else:
aₜ = argmax_a Q(a) # Exploit
rₜ = pull arm aₜ
N(aₜ) += 1
# Update estimate (incremental mean)
Q(aₜ) ← Q(aₜ) + (1/N(aₜ)) * (rₜ - Q(aₜ))
import numpy as np
class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.Q = np.zeros(n_arms) # Estimated values
self.N = np.zeros(n_arms) # Pull counts
def select_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms) # Explore
else:
return np.argmax(self.Q) # Exploit
def update(self, arm, reward):
self.N[arm] += 1
# Incremental update: Q_new = Q_old + (1/N) * (reward - Q_old)
self.Q[arm] += (1 / self.N[arm]) * (reward - self.Q[arm])
# Example usage
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
for t in range(1000):
arm = bandit.select_arm()
reward = env.pull(arm) # Get reward from environment
bandit.update(arm, reward)
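The env object above is left abstract. One way to make the loop runnable is a tiny Bernoulli test bed; the BernoulliBandit class below is an illustrative sketch, not part of any library:

import numpy as np

class BernoulliBandit:
    """K-armed test bed: arm i pays 1 with probability probs[i], else 0."""
    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        return float(self.rng.random() < self.probs[arm])

env = BernoulliBandit([0.50, 0.80, 0.20])
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
for t in range(1000):
    arm = bandit.select_arm()
    bandit.update(arm, env.pull(arm))
print(bandit.Q)  # estimates should approach [0.50, 0.80, 0.20]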
Often, we want more exploration early and less later:
epsilon_t = epsilon_0 * (decay_rate ** t)
# Example: Start at ε=1.0, end at ε=0.01
epsilon_0 = 1.0
decay_rate = 0.995
epsilon_min = 0.01
epsilon_t = max(epsilon_min, epsilon_0 * (decay_rate ** t))
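Plugged into the selection loop, the decay simply overwrites the agent's epsilon each step. This sketch reuses the EpsilonGreedy class above, with hypothetical Bernoulli arms:

probs = [0.50, 0.80, 0.20]   # hypothetical true success probabilities
epsilon_0, decay_rate, epsilon_min = 1.0, 0.995, 0.01

bandit = EpsilonGreedy(n_arms=3, epsilon=epsilon_0)
for t in range(1000):
    # Anneal exploration before each selection
    bandit.epsilon = max(epsilon_min, epsilon_0 * (decay_rate ** t))
    arm = bandit.select_arm()
    reward = float(np.random.random() < probs[arm])  # Bernoulli reward
    bandit.update(arm, reward)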
Strengths:
Weaknesses:
Be optimistic in the face of uncertainty:
UCB(a) = Q(a) + c * sqrt(log(t) / N(a))
The first term, Q(a), is the exploitation term; the second, c * sqrt(log(t) / N(a)), is the exploration bonus.
Where:
Exploration bonus grows when:
Effect: Automatically balances exploration and exploitation!
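To see the effect numerically, apply the formula to the three-arm example from the exploration/exploitation section (c = 2 and t = 112 total pulls, just for illustration):

import numpy as np

Q = np.array([0.8, 0.9, 1.0])   # estimated values from the earlier example
N = np.array([100, 10, 2])      # pull counts
t, c = N.sum(), 2.0

ucb = Q + c * np.sqrt(np.log(t) / N)
print(ucb)           # ≈ [1.23, 2.27, 4.07]
print(ucb.argmax())  # 2: the rarely pulled, uncertain Arm 3 wins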
For t = 1, 2, ..., T:
# Select arm with highest upper confidence bound
aₜ = argmax_a [Q(a) + c * sqrt(log(t) / N(a))]
rₜ = pull arm aₜ
N(aₜ) += 1
Q(aₜ) ← Q(aₜ) + (1/N(aₜ)) * (rₜ - Q(aₜ))
class UCB:
def __init__(self, n_arms, c=2.0):
self.n_arms = n_arms
self.c = c
self.Q = np.zeros(n_arms)
self.N = np.zeros(n_arms)
self.t = 0
def select_arm(self):
self.t += 1
# Pull each arm once first
if np.any(self.N == 0):
return np.argmin(self.N)
# Compute UCB for each arm
ucb_values = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
return np.argmax(ucb_values)
def update(self, arm, reward):
self.N[arm] += 1
self.Q[arm] += (1 / self.N[arm]) * (reward - self.Q[arm])
UCB Regret Bound:
Regret ≤ O(sqrt(K * T * log(T)))
Where:
Interpretation: UCB has sublinear regret (regret grows more slowly than linearly in T, so the average per-play regret shrinks toward zero).
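The constants matter in practice, so the cleanest way to compare strategies is empirically. The sketch below (hypothetical Bernoulli arms, reusing the EpsilonGreedy and UCB classes defined earlier) measures cumulative pseudo-regret; exact numbers depend on the arm gaps, the horizon T, and the constants ε and c:

import numpy as np

def run(agent, probs, T=5000, seed=0):
    """Cumulative pseudo-regret of `agent` on Bernoulli arms with means `probs`."""
    rng = np.random.default_rng(seed)
    mu_star, regret = max(probs), 0.0
    for _ in range(T):
        arm = agent.select_arm()
        reward = float(rng.random() < probs[arm])
        agent.update(arm, reward)
        regret += mu_star - probs[arm]  # expected regret of this choice
    return regret

probs = [0.50, 0.80, 0.20, 0.65]
print("epsilon-greedy:", run(EpsilonGreedy(4, epsilon=0.1), probs))
print("UCB:           ", run(UCB(4, c=2.0), probs))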
Strengths:
Weaknesses:
Bayesian approach: Maintain probability distribution over arm values, sample from distributions.
Algorithm:
For binary rewards (0 or 1), model each arm's success probability with a Beta distribution:
Beta distribution properties:
For t = 1, 2, ..., T:
# Sample from each arm's belief distribution
for arm a:
μ̃ₐ ~ Beta(αₐ, βₐ)
# Choose arm with highest sample
aₜ = argmax_a μ̃ₐ
rₜ = pull arm aₜ
# Update belief
if rₜ == 1:
αₐₜ ← αₐₜ + 1
else:
βₐₜ ← βₐₜ + 1
class ThompsonSampling:
def __init__(self, n_arms):
self.n_arms = n_arms
self.alpha = np.ones(n_arms) # Success counts (Beta prior α)
self.beta = np.ones(n_arms) # Failure counts (Beta prior β)
def select_arm(self):
# Sample from each arm's Beta distribution
samples = np.random.beta(self.alpha, self.beta)
return np.argmax(samples)
def update(self, arm, reward):
# Update belief (Beta posterior)
if reward == 1:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
# For continuous rewards, use a Gaussian distribution over each arm's mean
class ThompsonSamplingGaussian:
def __init__(self, n_arms):
self.n_arms = n_arms
self.mu = np.zeros(n_arms) # Mean estimates
self.sigma = np.ones(n_arms) # Uncertainty
self.N = np.zeros(n_arms)
def select_arm(self):
# Sample from each arm's Gaussian belief
samples = np.random.normal(self.mu, self.sigma)
return np.argmax(samples)
def update(self, arm, reward):
self.N[arm] += 1
# Update mean (incremental average)
self.mu[arm] += (1 / self.N[arm]) * (reward - self.mu[arm])
# Decrease uncertainty
self.sigma[arm] = 1 / np.sqrt(self.N[arm])
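A quick usage sketch for the Bernoulli version above, with hypothetical success probabilities:

probs = [0.50, 0.80, 0.20]      # hypothetical true success probabilities
ts = ThompsonSampling(n_arms=3)

for t in range(1000):
    arm = ts.select_arm()
    reward = int(np.random.random() < probs[arm])
    ts.update(arm, reward)

# Posterior means α/(α+β) should concentrate near the true probabilities
print(ts.alpha / (ts.alpha + ts.beta))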
Strengths:
Weaknesses:
Standard bandits: No state; once you have identified the best arm, it is the best choice on every round.
Contextual bandits: Have state (context), best arm depends on context.
Example - News Recommendation:
Standard bandit: Recommend same article to everyone
Contextual bandit:
- Context: User age, location, interests
- Action: Recommend article
- Reward: Click or no click
- Goal: Personalize recommendations
Setup:
Assumption: Reward is linear in context features:
E[r | x, a] = xᵀ θₐ
Where:
Extends UCB to contextual settings:
class LinUCB:
def __init__(self, n_arms, context_dim, alpha=1.0):
self.n_arms = n_arms
self.alpha = alpha
# For each arm, maintain:
        self.A = [np.eye(context_dim) for _ in range(n_arms)]   # A_a = I + Σ x xᵀ (regularized design matrix)
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]  # b_a = Σ r·x (reward-weighted context sum)
def select_arm(self, context):
ucb_values = []
for a in range(self.n_arms):
# Compute weight estimate
A_inv = np.linalg.inv(self.A[a])
theta_a = A_inv @ self.b[a]
# Predicted reward
pred_reward = context @ theta_a
# Uncertainty bonus
uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
# UCB
ucb = pred_reward + uncertainty
ucb_values.append(ucb)
return np.argmax(ucb_values)
def update(self, arm, context, reward):
self.A[arm] += np.outer(context, context)
self.b[arm] += reward * context
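A minimal synthetic check of LinUCB (the hidden per-arm weights theta_true are made up; rewards follow the assumed linear model plus noise):

import numpy as np

rng = np.random.default_rng(0)
n_arms, d = 3, 5
theta_true = rng.normal(size=(n_arms, d))   # hidden per-arm weight vectors
agent = LinUCB(n_arms=n_arms, context_dim=d, alpha=1.0)

for t in range(2000):
    context = rng.normal(size=d)
    arm = agent.select_arm(context)
    reward = context @ theta_true[arm] + 0.1 * rng.normal()  # linear reward + noise
    agent.update(arm, context, reward)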
For complex contexts (images, text), use neural networks:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class NeuralBandit:
    def __init__(self, context_dim, n_arms):
        self.n_arms = n_arms  # stored so select_arm can pick a random arm when exploring
self.network = nn.Sequential(
nn.Linear(context_dim, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, n_arms) # Output: expected reward for each arm
)
self.optimizer = optim.Adam(self.network.parameters())
def select_arm(self, context, epsilon=0.1):
if np.random.random() < epsilon:
return np.random.randint(self.n_arms)
with torch.no_grad():
context_tensor = torch.FloatTensor(context)
q_values = self.network(context_tensor)
return q_values.argmax().item()
def update(self, context, arm, reward):
context_tensor = torch.FloatTensor(context)
q_values = self.network(context_tensor)
# MSE loss for chosen arm
        target = q_values.detach().clone()  # detach so gradients flow only through the prediction
target[arm] = reward
loss = F.mse_loss(q_values, target)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
Problem: Test K website designs, find best one.
Bandit approach:
Benefit over a fixed A/B test: traffic adaptively shifts toward the better-performing design.
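As a sketch, running the test with the ThompsonSampling class from earlier (the click probabilities are hypothetical) shows traffic concentrating on the best design:

import numpy as np

click_prob = [0.04, 0.06, 0.05]   # hypothetical click-through rates per design
ts = ThompsonSampling(n_arms=3)   # one arm per website design
traffic = np.zeros(3)

for visitor in range(100_000):
    design = ts.select_arm()
    ts.update(design, int(np.random.random() < click_prob[design]))
    traffic[design] += 1

print(traffic / traffic.sum())    # most traffic ends up on design 1 (6% CTR)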
Problem: Recommend content to users.
Contextual bandit approach:
Example - News:
# User context
context = {"age": 25, "location": "USA", "interests": ["tech", "sports"]}  # encoded as a numeric feature vector in practice
# Select article
article = neural_bandit.select_arm(context)
# Observe engagement
reward = 1 if user_clicked else 0
# Update model
neural_bandit.update(context, article, reward)
Problem: Test K treatments, minimize patient harm.
Bandit approach:
Benefit: Adaptive trial design minimizes exposure to inferior treatments.
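A rough way to quantify that benefit is to count how many patients an adaptive allocation assigns to the inferior treatment compared with fixed 50/50 randomization. This sketch reuses the ThompsonSampling class with hypothetical success rates:

import numpy as np

rng = np.random.default_rng(1)
success = [0.55, 0.70]            # hypothetical treatment success rates
ts = ThompsonSampling(n_arms=2)
ts_inferior = uniform_inferior = 0

for patient in range(10_000):
    arm = ts.select_arm()                              # adaptive allocation
    ts.update(arm, int(rng.random() < success[arm]))
    ts_inferior += int(arm == 0)                       # patients on the inferior treatment
    uniform_inferior += int(rng.integers(2) == 0)      # fixed randomization, for comparison

print("inferior-arm assignments  TS:", ts_inferior, "  50/50:", uniform_inferior)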
Problem: Choose which ad to show.
Contextual bandit:
Scale: Billions of decisions per day.
Use bandits when:
Examples: A/B testing, ad selection, recommendation systems
Use full RL when:
Examples: Robotics, game playing, autonomous driving
Contextual Bandits + RL: