The Bandit Problem
Picture yourself walking into a casino with a row of k slot machines in front of you. Each machine has a different (unknown) payout rate. You have 1000 coins to play. How do you maximize your winnings?
This is the multi-armed bandit problem—named after the old slang for slot machines (“one-armed bandits”). It’s the simplest setting in reinforcement learning, yet it captures the fundamental challenge that makes learning hard: exploration vs exploitation.
The Setup
A sequential decision problem with k actions (arms), each with an unknown reward distribution. At each step, the agent selects an arm and receives a reward drawn from that arm’s distribution.
Here’s the scenario:
- You face k slot machines (arms)
- Each arm has a true average payout that you don’t know
- When you pull arm $a$, you receive a random reward centered around its true value $q_*(a)$
- Your goal: maximize total reward over many pulls
The challenge is immediate: if you don’t know which arm is best, how do you find out? And how do you balance learning (trying different arms) with earning (playing the arm that seems best)?
Suppose you have 3 slot machines with these true (unknown to you) payout probabilities:
- Arm 1: Pays $1 with probability 0.2, $0 otherwise. Mean: $0.20
- Arm 2: Pays $1 with probability 0.5, $0 otherwise. Mean: $0.50
- Arm 3: Pays $1 with probability 0.8, $0 otherwise. Mean: $0.80
Arm 3 is clearly best, but you don’t know this! After a few pulls, you might have:
- Arm 1: 2 wins out of 5 pulls (observed rate: 40%)
- Arm 2: 1 win out of 3 pulls (observed rate: 33%)
- Arm 3: 0 wins out of 2 pulls (observed rate: 0%)
Based on these observations, Arm 1 looks best—but that’s just noise. The true best arm (Arm 3) has been unlucky so far. Should you keep trying it?
Action Values
The first concept we need is a way to estimate how good each arm is based on our experience.
The action value $q_*(a) = \mathbb{E}[R_t \mid A_t = a]$ is the expected reward from choosing action $a$. This is what we’re trying to learn.
We don’t know the true action values, so we estimate them from experience. The simplest estimate? The sample average of rewards we’ve received from that arm.
If we’ve pulled arm 2 five times and received rewards of 1, 0, 1, 1, 0, our estimate is:

$Q(2) = \frac{1 + 0 + 1 + 1 + 0}{5} = 0.6$

More formally, let $N_t(a)$ be the number of times arm $a$ has been selected before time $t$, and let $R_1, R_2, \ldots, R_{N_t(a)}$ be the rewards received from those selections. The sample-average estimate is:

$Q_t(a) = \frac{1}{N_t(a)} \sum_{i=1}^{N_t(a)} R_i$

By the law of large numbers, as $N_t(a) \to \infty$, this estimate converges to the true value:

$Q_t(a) \to q_*(a)$
The challenge is that we can’t pull all arms infinitely often. We need to decide which arms to learn about.
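As a quick sanity check on the law-of-large-numbers claim, here is a minimal sketch; the single Bernoulli arm and its 0.8 mean are illustrative assumptions:

```python
import numpy as np

# One Bernoulli arm with true mean 0.8 (illustrative value).
# The sample average of many pulls should approach the true mean.
rng = np.random.default_rng(0)
true_mean = 0.8

rewards = (rng.random(100_000) < true_mean).astype(float)
estimate = rewards.mean()

print(f"true mean = {true_mean}, sample average = {estimate:.3f}")
```

With 100,000 pulls the sample average lands within about a hundredth of the true mean; with only a handful of pulls, as in the 3-arm example above, it can be wildly off.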
The Incremental Update Rule
Recomputing the average from scratch every time is wasteful. There’s a more efficient approach.
When you get a new reward, you can update your estimate incrementally:
- Calculate the error: how different was the reward from your estimate?
- Nudge your estimate toward the observed reward
This is expressed as:

$Q_{n+1} = Q_n + \frac{1}{n} \left( R_n - Q_n \right)$

Read this as: “New estimate = old estimate + step size times error”

Let’s derive this. The sample average after $n - 1$ rewards is:

$Q_n = \frac{1}{n-1} \sum_{i=1}^{n-1} R_i$

After receiving the $n$-th reward $R_n$:

$Q_{n+1} = \frac{1}{n} \sum_{i=1}^{n} R_i = \frac{1}{n} \left( R_n + \sum_{i=1}^{n-1} R_i \right)$

Since $\sum_{i=1}^{n-1} R_i = (n-1) Q_n$:

$Q_{n+1} = \frac{1}{n} \left( R_n + (n-1) Q_n \right) = Q_n + \frac{1}{n} \left( R_n - Q_n \right)$
This update takes constant time regardless of how many rewards we’ve seen.
```python
import numpy as np

def update_q_incrementally(Q, N, action, reward):
    """
    Update action-value estimate using incremental update.

    Args:
        Q: Array of current value estimates for each arm
        N: Array of pull counts for each arm
        action: The arm that was pulled
        reward: The reward received

    Returns:
        Updated Q and N arrays
    """
    N[action] += 1
    # Step size is 1/n for the sample average
    step_size = 1.0 / N[action]
    # Update estimate toward observed reward
    Q[action] += step_size * (reward - Q[action])
    return Q, N

# Example usage
n_arms = 10
Q = np.zeros(n_arms)  # Initial estimates
N = np.zeros(n_arms)  # Pull counts

# Simulate some pulls
for _ in range(100):
    action = np.random.randint(n_arms)  # Random for now
    reward = np.random.randn()  # Fake reward
    Q, N = update_q_incrementally(Q, N, action, reward)
```

Regret: The Performance Metric
How do we measure how well a bandit algorithm performs? We can’t just look at total reward—that depends on how good the arms are. Instead, we measure how much we lost compared to always playing the best arm.
The regret after $T$ steps is the difference between the reward we could have gotten by always playing the optimal arm and the reward we actually received.

Imagine you eventually discover which arm is best and realize it has mean reward $q^*$. Looking back, you wish you’d played that arm every single time. Regret measures this gap:

$\text{Regret}(T) = T q^* - \sum_{t=1}^{T} R_t$

Or equivalently, thinking about per-step loss (the two agree in expectation):

$\text{Regret}(T) = \sum_{t=1}^{T} \left( q^* - q_*(A_t) \right)$
Each time you play a suboptimal arm, you accumulate regret equal to how much worse that arm is.
Let $q^* = \max_a q_*(a)$ be the expected reward of the optimal arm. The expected regret after $T$ steps is:

$\mathbb{E}[\text{Regret}(T)] = \sum_{a} \Delta_a \, \mathbb{E}[N_T(a)]$

where $\Delta_a = q^* - q_*(a)$ is the “gap” for arm $a$.

Key insight: regret accumulates from pulling suboptimal arms. A good algorithm minimizes $\mathbb{E}[N_T(a)]$ for suboptimal arms while still learning which arms are suboptimal.
The best possible regret is sublinear in $T$. In fact, it’s been proven that no algorithm can do better than $\Omega(\sqrt{kT})$ regret in the worst case.
Continuing our 3-arm example with true means $q_*(1) = 0.2$, $q_*(2) = 0.5$, $q_*(3) = 0.8$:

The optimal arm is Arm 3 with $q^* = 0.8$.

If after 100 pulls we played:

- Arm 1: 20 times (gap $\Delta_1 = 0.6$)
- Arm 2: 30 times (gap $\Delta_2 = 0.3$)
- Arm 3: 50 times (gap $\Delta_3 = 0$)

Expected regret:

$\mathbb{E}[\text{Regret}(100)] = 20 \times 0.6 + 30 \times 0.3 + 50 \times 0 = 12 + 9 + 0 = 21$
We “left 21 dollars on the table” compared to always playing Arm 3.
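The arithmetic can be checked directly; a small sketch using the numbers from the example:

```python
# Expected regret for the 3-arm example: sum over arms of (pull count) * (gap).
true_means = [0.2, 0.5, 0.8]
pulls = [20, 30, 50]  # how often each arm was played in 100 steps

q_star = max(true_means)
gaps = [q_star - q for q in true_means]
regret = sum(n * gap for n, gap in zip(pulls, gaps))

print(f"gaps = {gaps}, expected regret = {regret}")  # regret is about 21
```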
The Fundamental Tension
Now we can articulate the core dilemma precisely:
The Exploration-Exploitation Dilemma
- Exploit: Play the arm with the highest estimated value to maximize immediate reward
- Explore: Play other arms to improve estimates and potentially find a better arm
Pure exploitation risks never finding the best arm. Pure exploration wastes time on arms you know are bad.
Why Pure Greedy Fails
If you always play the arm with the highest current estimate, you might get stuck on a suboptimal arm forever. Here’s why:
- Early rewards are noisy
- A suboptimal arm might look good by chance
- If you never try other arms, you never learn they’re better
- You accumulate regret indefinitely
Good exploration strategies balance learning with earning—exploring enough to find the best arm, then exploiting it.
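To see the greedy failure mode concretely, here is a minimal sketch of a purely greedy agent on the 0.2/0.5/0.8 Bernoulli arms from the running example (the seed and 1000-pull budget are arbitrary choices, not from the text):

```python
import numpy as np

# Pure greedy on Bernoulli arms with true means 0.2, 0.5, 0.8.
# After one pull of each arm, the agent always plays the arm with the
# highest current estimate -- it never deliberately explores.
rng = np.random.default_rng(7)
means = np.array([0.2, 0.5, 0.8])
Q = np.zeros(3)             # value estimates
N = np.zeros(3, dtype=int)  # pull counts

def pull(arm):
    return float(rng.random() < means[arm])

# One pull per arm to initialize the estimates.
for a in range(3):
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Then pure exploitation for the rest of the budget.
for _ in range(997):
    a = int(np.argmax(Q))  # always the current best-looking arm
    r = pull(a)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]

# Greedy typically locks onto a single arm -- not necessarily the best:
# an arm whose early pulls happened to pay off can absorb the whole budget.
print("pull counts:", N)
```

Rerunning with different seeds shows the problem: sometimes greedy happens to lock onto Arm 3, sometimes it commits to a worse arm and accumulates linear regret.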
Bandit Environments
Let’s build a simple bandit environment for experimentation:
```python
import numpy as np
from typing import List, Optional

class MultiArmedBandit:
    """
    A k-armed bandit with Gaussian reward distributions.

    Each arm has a mean reward drawn from N(0, 1).
    When pulled, rewards are drawn from N(mean, 1).
    """

    def __init__(self, n_arms: int = 10, seed: Optional[int] = None):
        self.n_arms = n_arms
        self.rng = np.random.default_rng(seed)
        # True arm means (unknown to the agent)
        self.means = self.rng.standard_normal(n_arms)
        self.optimal_arm = np.argmax(self.means)
        self.optimal_value = self.means[self.optimal_arm]

    def pull(self, arm: int) -> float:
        """Pull an arm and receive a reward."""
        return self.rng.normal(self.means[arm], 1.0)

    def get_optimal_arm(self) -> int:
        """Return the index of the optimal arm (for evaluation only)."""
        return self.optimal_arm

    def get_regret(self, arm: int) -> float:
        """Return the instantaneous regret for pulling this arm."""
        return self.optimal_value - self.means[arm]

class BernoulliBandit:
    """
    A k-armed bandit with Bernoulli (0/1) rewards.

    Each arm has a probability p of paying 1, else pays 0.
    """

    def __init__(self, probabilities: List[float], seed: Optional[int] = None):
        self.probabilities = np.array(probabilities)
        self.n_arms = len(probabilities)
        self.rng = np.random.default_rng(seed)
        self.optimal_arm = np.argmax(self.probabilities)
        self.optimal_value = self.probabilities[self.optimal_arm]

    def pull(self, arm: int) -> float:
        """Pull an arm and receive a 0/1 reward."""
        return float(self.rng.random() < self.probabilities[arm])

    def get_optimal_arm(self) -> int:
        return self.optimal_arm

    def get_regret(self, arm: int) -> float:
        return self.optimal_value - self.probabilities[arm]

# Example usage
bandit = MultiArmedBandit(n_arms=10, seed=42)
print(f"Optimal arm: {bandit.optimal_arm} with mean {bandit.optimal_value:.3f}")

# Pull each arm a few times
for arm in range(bandit.n_arms):
    rewards = [bandit.pull(arm) for _ in range(5)]
    print(f"Arm {arm}: true mean = {bandit.means[arm]:.2f}, "
          f"sample mean = {np.mean(rewards):.2f}")
```

Summary
We’ve established the multi-armed bandit problem:
- Setup: k arms with unknown reward distributions
- Goal: Maximize cumulative reward (minimize regret)
- Challenge: Exploration vs exploitation
- Key quantity: Action values estimated by sample averages
- Metric: Regret measures how much we lost compared to optimal play
In the next section, we’ll see our first exploration strategies: greedy and epsilon-greedy methods.
The bandit problem may seem simple, but it’s the foundation for all of RL. The exploration strategies we develop here—epsilon-greedy, UCB, Thompson Sampling—appear throughout deep RL, from DQN to policy gradients. Master them here, and you’ll recognize them everywhere.