ℹ️ Definition Multi-Armed Bandits are a simplified reinforcement learning framework where an agent must choose among multiple actions (arms) to maximize cumulative reward, balancing exploration (trying new arms) and exploitation (pulling the best known arm).
By the end of this lesson, you will:
In Lessons 1-6, we studied full reinforcement learning with states, actions, rewards, and long-term planning:
State → Action → Next State → Reward → ...
But what if there's no state? Just repeated choices between actions?
Multi-Armed Bandits are the simplest RL problem:
This simplification enables powerful theoretical results and practical applications.
Imagine a casino with K slot machines (one-armed bandits):
🎰 Machine 1: Average payout = $0.50 (unknown)
🎰 Machine 2: Average payout = $0.80 (unknown)
🎰 Machine 3: Average payout = $0.20 (unknown)
...
🎰 Machine K: Average payout = $0.65 (unknown)
Goal: Maximize total money won over T plays.
Challenge: You don't know which machine is best!

Setup:
Key Quantity: Regret = the reward we could have achieved with perfect knowledge minus the reward we actually collected
Regret = T * μ* - Σₜ rₜ
Where μ* = max_i μᵢ (best arm's mean)
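To make the formula concrete, here is a minimal sketch that simulates T plays of a deliberately naive random policy against made-up arm means (the values and the policy are hypothetical, chosen only to illustrate the regret calculation):

import numpy as np

rng = np.random.default_rng(0)

# Hypothetical true arm means (unknown to the agent)
true_means = np.array([0.50, 0.80, 0.20, 0.65])
mu_star = true_means.max()   # μ* = best arm's mean
T = 1000

# A naive uniformly-random policy, just to illustrate the regret calculation
rewards = []
for t in range(T):
    arm = rng.integers(len(true_means))
    rewards.append(rng.random() < true_means[arm])  # Bernoulli reward

regret = T * mu_star - sum(rewards)
print(f"Regret after {T} plays: {regret:.1f}")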
Simplicity enables:
Real-world applications:
Exploitation: Choose the best known arm (maximize immediate reward)
Exploration: Try other arms to gain information (maximize long-term reward)
Example:
Arm 1: Pulled 100 times, average reward = 0.8
Arm 2: Pulled 10 times, average reward = 0.9
Arm 3: Pulled 2 times, average reward = 1.0
Which arm to pull next?
Exploitation: Pull Arm 1 (most reliable estimate)
Exploration: Pull Arm 3 (uncertain, might be amazing!)
We'll learn three main strategies:
With probability ε: Explore (choose a random arm)
With probability 1-ε: Exploit (choose the best known arm)
Initialize:
Q(a) = 0 for all arms (estimated values)
N(a) = 0 for all arms (pull counts)
For t = 1, 2, ..., T:
if random() < ε:
aₜ = random arm # Explore
else:
aₜ = argmax_a Q(a) # Exploit
rₜ = pull arm aₜ
N(aₜ) += 1
# Update estimate (incremental mean)
Q(aₜ) ← Q(aₜ) + (1/N(aₜ)) * (rₜ - Q(aₜ))
import numpy as np
class EpsilonGreedy:
def __init__(self, n_arms, epsilon=0.1):
self.n_arms = n_arms
self.epsilon = epsilon
self.Q = np.zeros(n_arms) # Estimated values
self.N = np.zeros(n_arms) # Pull counts
def select_arm(self):
if np.random.random() < self.epsilon:
return np.random.randint(self.n_arms) # Explore
else:
return np.argmax(self.Q) # Exploit
def update(self, arm, reward):
self.N[arm] += 1
# Incremental update: Q_new = Q_old + (1/N) * (reward - Q_old)
self.Q[arm] += (1 / self.N[arm]) * (reward - self.Q[arm])
# Example usage
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
for t in range(1000):
arm = bandit.select_arm()
reward = env.pull(arm) # Get reward from environment
bandit.update(arm, reward)
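The env object above is left abstract. One way to make the loop runnable is a tiny Bernoulli test bed; the BernoulliBandit class below is an illustrative sketch, not part of any library:

import numpy as np

class BernoulliBandit:
    """K-armed test bed: arm i pays 1 with probability probs[i], else 0."""
    def __init__(self, probs, seed=0):
        self.probs = np.asarray(probs)
        self.rng = np.random.default_rng(seed)

    def pull(self, arm):
        return float(self.rng.random() < self.probs[arm])

env = BernoulliBandit([0.50, 0.80, 0.20])
bandit = EpsilonGreedy(n_arms=3, epsilon=0.1)
for t in range(1000):
    arm = bandit.select_arm()
    bandit.update(arm, env.pull(arm))
print(bandit.Q)  # estimates should approach [0.50, 0.80, 0.20]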
Often, we want more exploration early and less later:
epsilon_t = epsilon_0 * (decay_rate ** t)
# Example: Start at ε=1.0, end at ε=0.01
epsilon_0 = 1.0
decay_rate = 0.995
epsilon_min = 0.01
epsilon_t = max(epsilon_min, epsilon_0 * (decay_rate ** t))
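Plugged into the selection loop, the decay simply overwrites the agent's epsilon each step. This sketch reuses the EpsilonGreedy class above, with hypothetical Bernoulli arms:

probs = [0.50, 0.80, 0.20]   # hypothetical true success probabilities
epsilon_0, decay_rate, epsilon_min = 1.0, 0.995, 0.01

bandit = EpsilonGreedy(n_arms=3, epsilon=epsilon_0)
for t in range(1000):
    # Anneal exploration before each selection
    bandit.epsilon = max(epsilon_min, epsilon_0 * (decay_rate ** t))
    arm = bandit.select_arm()
    reward = float(np.random.random() < probs[arm])  # Bernoulli reward
    bandit.update(arm, reward)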
Strengths:
Weaknesses:
Be optimistic in the face of uncertainty:
UCB(a) = Q(a) + c * sqrt(log(t) / N(a))
The first term, Q(a), is the exploitation term; the second, c * sqrt(log(t) / N(a)), is the exploration bonus.
Where:
Exploration bonus grows when:
Effect: Automatically balances exploration and exploitation!
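To see the effect numerically, apply the formula to the three-arm example from the exploration/exploitation section (c = 2 and t = 112 total pulls, just for illustration):

import numpy as np

Q = np.array([0.8, 0.9, 1.0])   # estimated values from the earlier example
N = np.array([100, 10, 2])      # pull counts
t, c = N.sum(), 2.0

ucb = Q + c * np.sqrt(np.log(t) / N)
print(ucb)           # ≈ [1.23, 2.27, 4.07]
print(ucb.argmax())  # 2: the rarely pulled, uncertain Arm 3 wins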
For t = 1, 2, ..., T:
# Select arm with highest upper confidence bound
aₜ = argmax_a [Q(a) + c * sqrt(log(t) / N(a))]
rₜ = pull arm aₜ
N(aₜ) += 1
Q(aₜ) ← Q(aₜ) + (1/N(aₜ)) * (rₜ - Q(aₜ))
class UCB:
def __init__(self, n_arms, c=2.0):
self.n_arms = n_arms
self.c = c
self.Q = np.zeros(n_arms)
self.N = np.zeros(n_arms)
self.t = 0
def select_arm(self):
self.t += 1
# Pull each arm once first
if np.any(self.N == 0):
return np.argmin(self.N)
# Compute UCB for each arm
ucb_values = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
return np.argmax(ucb_values)
def update(self, arm, reward):
self.N[arm] += 1
self.Q[arm] += (1 / self.N[arm]) * (reward - self.Q[arm])
UCB Regret Bound:
Regret ≤ O(sqrt(K * T * log(T)))
Where:
Interpretation: UCB has sublinear regret (regret grows more slowly than linearly in T, so the average per-play regret shrinks toward zero).
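The constants matter in practice, so the cleanest way to compare strategies is empirically. The sketch below (hypothetical Bernoulli arms, reusing the EpsilonGreedy and UCB classes defined earlier) measures cumulative pseudo-regret; exact numbers depend on the arm gaps, the horizon T, and the constants ε and c:

import numpy as np

def run(agent, probs, T=5000, seed=0):
    """Cumulative pseudo-regret of `agent` on Bernoulli arms with means `probs`."""
    rng = np.random.default_rng(seed)
    mu_star, regret = max(probs), 0.0
    for _ in range(T):
        arm = agent.select_arm()
        reward = float(rng.random() < probs[arm])
        agent.update(arm, reward)
        regret += mu_star - probs[arm]  # expected regret of this choice
    return regret

probs = [0.50, 0.80, 0.20, 0.65]
print("epsilon-greedy:", run(EpsilonGreedy(4, epsilon=0.1), probs))
print("UCB:           ", run(UCB(4, c=2.0), probs))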
Strengths:
Weaknesses:
Bayesian approach: Maintain probability distribution over arm values, sample from distributions.
Algorithm:
For binary rewards (0 or 1), model each arm's success probability with a Beta distribution:
Beta distribution properties:
For t = 1, 2, ..., T:
# Sample from each arm's belief distribution
for arm a:
μ̃ₐ ~ Beta(αₐ, βₐ)
# Choose arm with highest sample
aₜ = argmax_a μ̃ₐ
rₜ = pull arm aₜ
# Update belief
if rₜ == 1:
αₐₜ ← αₐₜ + 1
else:
βₐₜ ← βₐₜ + 1
class ThompsonSampling:
def __init__(self, n_arms):
self.n_arms = n_arms
self.alpha = np.ones(n_arms) # Success counts (Beta prior α)
self.beta = np.ones(n_arms) # Failure counts (Beta prior β)
def select_arm(self):
# Sample from each arm's Beta distribution
samples = np.random.beta(self.alpha, self.beta)
return np.argmax(samples)
def update(self, arm, reward):
# Update belief (Beta posterior)
if reward == 1:
self.alpha[arm] += 1
else:
self.beta[arm] += 1
# For continuous rewards, use a Gaussian distribution over each arm's mean
class ThompsonSamplingGaussian:
def __init__(self, n_arms):
self.n_arms = n_arms
self.mu = np.zeros(n_arms) # Mean estimates
self.sigma = np.ones(n_arms) # Uncertainty
self.N = np.zeros(n_arms)
def select_arm(self):
# Sample from each arm's Gaussian belief
samples = np.random.normal(self.mu, self.sigma)
return np.argmax(samples)
def update(self, arm, reward):
self.N[arm] += 1
# Update mean (incremental average)
self.mu[arm] += (1 / self.N[arm]) * (reward - self.mu[arm])
# Decrease uncertainty
self.sigma[arm] = 1 / np.sqrt(self.N[arm])
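A quick usage sketch for the Bernoulli version above, with hypothetical success probabilities:

probs = [0.50, 0.80, 0.20]      # hypothetical true success probabilities
ts = ThompsonSampling(n_arms=3)

for t in range(1000):
    arm = ts.select_arm()
    reward = int(np.random.random() < probs[arm])
    ts.update(arm, reward)

# Posterior means α/(α+β) should concentrate near the true probabilities
print(ts.alpha / (ts.alpha + ts.beta))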
Strengths:
Weaknesses:
Standard bandits: No state; once you have identified the best arm, it is the best choice on every round.
Contextual bandits: Have state (context), best arm depends on context.
Example - News Recommendation:
Standard bandit: Recommend same article to everyone
Contextual bandit:
- Context: User age, location, interests
- Action: Recommend article
- Reward: Click or no click
- Goal: Personalize recommendations
Setup:
Assumption: Reward is linear in context features:
E[r | x, a] = xᵀ θₐ
Where:
Extends UCB to contextual settings:
class LinUCB:
def __init__(self, n_arms, context_dim, alpha=1.0):
self.n_arms = n_arms
self.alpha = alpha
# For each arm, maintain:
        self.A = [np.eye(context_dim) for _ in range(n_arms)]   # A_a = I + Σ x xᵀ (regularized design matrix)
        self.b = [np.zeros(context_dim) for _ in range(n_arms)]  # b_a = Σ r·x (reward-weighted context sum)
def select_arm(self, context):
ucb_values = []
for a in range(self.n_arms):
# Compute weight estimate
A_inv = np.linalg.inv(self.A[a])
theta_a = A_inv @ self.b[a]
# Predicted reward
pred_reward = context @ theta_a
# Uncertainty bonus
uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)
# UCB
ucb = pred_reward + uncertainty
ucb_values.append(ucb)
return np.argmax(ucb_values)
def update(self, arm, context, reward):
self.A[arm] += np.outer(context, context)
self.b[arm] += reward * context
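A minimal synthetic check of LinUCB (the hidden per-arm weights theta_true are made up; rewards follow the assumed linear model plus noise):

import numpy as np

rng = np.random.default_rng(0)
n_arms, d = 3, 5
theta_true = rng.normal(size=(n_arms, d))   # hidden per-arm weight vectors
agent = LinUCB(n_arms=n_arms, context_dim=d, alpha=1.0)

for t in range(2000):
    context = rng.normal(size=d)
    arm = agent.select_arm(context)
    reward = context @ theta_true[arm] + 0.1 * rng.normal()  # linear reward + noise
    agent.update(arm, context, reward)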
For complex contexts (images, text), use neural networks:
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim

class NeuralBandit:
    def __init__(self, context_dim, n_arms):
        self.n_arms = n_arms  # stored so select_arm can pick a random arm when exploring
self.network = nn.Sequential(
nn.Linear(context_dim, 128),
nn.ReLU(),
nn.Linear(128, 64),
nn.ReLU(),
nn.Linear(64, n_arms) # Output: expected reward for each arm
)
self.optimizer = optim.Adam(self.network.parameters())
def select_arm(self, context, epsilon=0.1):
if np.random.random() < epsilon:
return np.random.randint(self.n_arms)
with torch.no_grad():
context_tensor = torch.FloatTensor(context)
q_values = self.network(context_tensor)
return q_values.argmax().item()
def update(self, context, arm, reward):
context_tensor = torch.FloatTensor(context)
q_values = self.network(context_tensor)
# MSE loss for chosen arm
        target = q_values.detach().clone()  # detach so gradients flow only through the prediction
target[arm] = reward
loss = F.mse_loss(q_values, target)
self.optimizer.zero_grad()
loss.backward()
self.optimizer.step()
Problem: Test K website designs, find best one.
Bandit approach:
Benefit over a fixed A/B test: traffic adaptively shifts toward the better-performing design.
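As a sketch, running the test with the ThompsonSampling class from earlier (the click probabilities are hypothetical) shows traffic concentrating on the best design:

import numpy as np

click_prob = [0.04, 0.06, 0.05]   # hypothetical click-through rates per design
ts = ThompsonSampling(n_arms=3)   # one arm per website design
traffic = np.zeros(3)

for visitor in range(100_000):
    design = ts.select_arm()
    ts.update(design, int(np.random.random() < click_prob[design]))
    traffic[design] += 1

print(traffic / traffic.sum())    # most traffic ends up on design 1 (6% CTR)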
Problem: Recommend content to users.
Contextual bandit approach:
Example - News:
# User context
context = {"age": 25, "location": "USA", "interests": ["tech", "sports"]}  # encoded as a numeric feature vector in practice
# Select article
article = neural_bandit.select_arm(context)
# Observe engagement
reward = 1 if user_clicked else 0
# Update model
neural_bandit.update(context, article, reward)
Problem: Test K treatments, minimize patient harm.
Bandit approach:
Benefit: Adaptive trial design minimizes exposure to inferior treatments.
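A rough way to quantify that benefit is to count how many patients an adaptive allocation assigns to the inferior treatment compared with fixed 50/50 randomization. This sketch reuses the ThompsonSampling class with hypothetical success rates:

import numpy as np

rng = np.random.default_rng(1)
success = [0.55, 0.70]            # hypothetical treatment success rates
ts = ThompsonSampling(n_arms=2)
ts_inferior = uniform_inferior = 0

for patient in range(10_000):
    arm = ts.select_arm()                              # adaptive allocation
    ts.update(arm, int(rng.random() < success[arm]))
    ts_inferior += int(arm == 0)                       # patients on the inferior treatment
    uniform_inferior += int(rng.integers(2) == 0)      # fixed randomization, for comparison

print("inferior-arm assignments  TS:", ts_inferior, "  50/50:", uniform_inferior)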
Problem: Choose which ad to show.
Contextual bandit:
Scale: Billions of decisions per day.
Use bandits when:
Examples: A/B testing, ad selection, recommendation systems
Use full RL when:
Examples: Robotics, game playing, autonomous driving
Contextual Bandits + RL: