Upper Confidence Bound (UCB)
Epsilon-greedy explores by occasionally trying random arms. But this is undirected exploration—it doesn’t care whether an arm is well-understood or completely unknown. Could we do better by exploring where uncertainty is highest?
This is the key insight behind Upper Confidence Bound (UCB): explore optimistically. When you’re uncertain about an arm, assume it might be great and try it.
The Core Idea: Optimism in the Face of Uncertainty
When uncertain about an action’s value, assume it’s at the upper end of what’s plausible. This naturally encourages trying actions whose values are poorly estimated.
Imagine you’re choosing between two restaurants:
- Restaurant A: You’ve been 100 times. Average rating: 4.0 stars. You’re confident it’s around 4.0.
- Restaurant B: You’ve been once. That visit was 3.5 stars. But with only one data point, the true rating could be anywhere from 2.0 to 5.0.
Epsilon-greedy would pick A most of the time (it has a higher average) and occasionally try B randomly.
UCB thinks differently: “Restaurant B might be a hidden gem! The single bad experience could have been an unlucky night. Since I’m so uncertain, let me assume it could be as good as 5.0 and try it again.”
This is optimism in the face of uncertainty. By assuming uncertain options might be great, we naturally gravitate toward learning more about them.
The UCB1 Algorithm
UCB adds an exploration bonus to each arm’s value estimate. The bonus is larger for arms we haven’t tried much:

$$\text{bonus}(a) = c \sqrt{\frac{\ln t}{N_t(a)}}$$
We then pick the arm with the highest UCB value. This elegantly balances:
- Exploitation: Arms with high estimated values score well
- Exploration: Arms with few samples get a large bonus
The UCB1 action selection rule is:

$$A_t = \arg\max_a \left[ Q_t(a) + c \sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where:
- $Q_t(a)$ is the sample average reward for arm $a$
- $N_t(a)$ is the number of times arm $a$ has been pulled
- $t$ is the current time step
- $c$ is an exploration parameter (often $c = \sqrt{2}$ or $c = 2$)
The exploration bonus $c\sqrt{\ln t / N_t(a)}$ has two key properties:
- It decreases as $N_t(a)$ increases (we’ve learned about this arm)
- It increases slowly with $t$ (we should keep exploring a bit as time goes on)
Suppose we have 3 arms at time $t = 100$:
- Arm 1: $Q(1) = 0.7$, pulled $N(1) = 50$ times
- Arm 2: $Q(2) = 0.5$, pulled $N(2) = 30$ times
- Arm 3: $Q(3) = 0.4$, pulled $N(3) = 3$ times

With $c = 2$, the UCB values are:
- Arm 1: $0.7 + 2\sqrt{\ln 100 / 50} \approx 0.7 + 0.61 = 1.31$
- Arm 2: $0.5 + 2\sqrt{\ln 100 / 30} \approx 0.5 + 0.78 = 1.28$
- Arm 3: $0.4 + 2\sqrt{\ln 100 / 3} \approx 0.4 + 2.48 = 2.88$
UCB selects Arm 3! Even though it has the lowest estimated value, its exploration bonus (due to few samples) makes it the most attractive option.
After trying Arm 3 more times, its bonus will shrink and the best arm will emerge.
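As a quick sanity check, the selection rule is easy to evaluate directly. The snapshot below is illustrative: the per-arm estimates, pull counts, $t = 100$, and $c = 2$ are all assumed values, chosen so that the arm with the fewest pulls has the lowest estimate.

```python
import numpy as np

# Illustrative snapshot (these Q, N, t, and c values are assumptions)
Q = np.array([0.7, 0.5, 0.4])  # sample-average estimates per arm
N = np.array([50, 30, 3])      # pull counts per arm
t, c = 100, 2.0

bonus = c * np.sqrt(np.log(t) / N)
ucb = Q + bonus
print(ucb.round(2))    # the rarely pulled arm gets by far the largest bonus
print(np.argmax(ucb))  # 2 -> the third arm wins despite its lowest estimate
```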
Why UCB Works
Think of the UCB value as an optimistic upper bound on what the arm’s true value might be. With high probability, the true value is below this bound.
If an arm has been tried many times, we have a good estimate—the bound is tight. If an arm hasn’t been tried much, the bound is loose—we’re uncertain, so we give it the benefit of the doubt.
By always picking the arm with the highest upper bound, we:
- Pick truly good arms (high estimate)
- Try uncertain arms (large bonus)
- Automatically reduce exploration as we learn (bonus shrinks)
The exploration bonus comes from concentration inequalities. Specifically, Hoeffding’s inequality tells us that for rewards bounded in $[0, 1]$:

$$P\big(\mu_a > Q_t(a) + u\big) \le e^{-2 N_t(a) u^2}$$

where $\mu_a$ is arm $a$’s true mean. By setting this failure probability to $t^{-4}$ and solving for $u$, we get the UCB exploration term:

$$P\left(\mu_a > Q_t(a) + \sqrt{\frac{2 \ln t}{N_t(a)}}\right) \le t^{-4}$$

In words: with high probability, the true value is below our UCB value. The arm with the highest UCB is the one that could plausibly be the best.
The Exploration Bonus Dynamics
Let’s visualize how the exploration bonus behaves:
At time $t = 1000$ with $c = 2$:
- Pulled 1 time: bonus $= 2\sqrt{\ln 1000 / 1} \approx 5.26$
- Pulled 10 times: bonus $= 2\sqrt{\ln 1000 / 10} \approx 1.66$
- Pulled 100 times: bonus $= 2\sqrt{\ln 1000 / 100} \approx 0.53$
- Pulled 500 times: bonus $= 2\sqrt{\ln 1000 / 500} \approx 0.24$
The bonus drops quickly with more samples. An arm with only 1 pull gets 22x more bonus than one with 500 pulls!
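These numbers are easy to reproduce; a tiny script (assuming $t = 1000$ and $c = 2$, as above) evaluates the bonus at several pull counts:

```python
import numpy as np

t, c = 1000, 2.0  # assumed time step and exploration parameter
for n in [1, 10, 100, 500]:
    bonus = c * np.sqrt(np.log(t) / n)
    print(f"pulled {n:3d} times: bonus = {bonus:.3f}")
```

Note that the ratio between the first and last bonus is $\sqrt{500} \approx 22.4$ regardless of $t$ and $c$, since both cancel in the ratio.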
```python
import numpy as np

class UCB:
    """Upper Confidence Bound bandit algorithm."""

    def __init__(self, n_arms, c=2.0):
        """
        Initialize UCB agent.

        Args:
            n_arms: Number of arms
            c: Exploration parameter (higher = more exploration)
        """
        self.n_arms = n_arms
        self.c = c
        self.Q = np.zeros(n_arms)  # Value estimates
        self.N = np.zeros(n_arms)  # Pull counts
        self.t = 0                 # Total steps

    def select_action(self):
        """Select arm using UCB1 rule."""
        self.t += 1
        # First, ensure each arm is tried at least once
        for arm in range(self.n_arms):
            if self.N[arm] == 0:
                return arm
        # Compute UCB values for all arms
        exploration_bonus = self.c * np.sqrt(np.log(self.t) / self.N)
        ucb_values = self.Q + exploration_bonus
        # Select arm with highest UCB value
        return np.argmax(ucb_values)

    def update(self, action, reward):
        """Update value estimate for the selected arm."""
        self.N[action] += 1
        # Incremental update for sample average
        self.Q[action] += (reward - self.Q[action]) / self.N[action]

    def get_ucb_values(self):
        """Return current UCB values for visualization."""
        if self.t == 0 or np.any(self.N == 0):
            return None
        exploration_bonus = self.c * np.sqrt(np.log(self.t) / self.N)
        return self.Q + exploration_bonus

# Example usage
def run_ucb_demo():
    np.random.seed(42)
    # Create a 5-armed bandit
    true_means = [0.2, 0.4, 0.6, 0.8, 0.5]
    n_arms = len(true_means)
    agent = UCB(n_arms, c=2.0)
    for step in range(200):
        action = agent.select_action()
        reward = np.random.normal(true_means[action], 0.5)
        agent.update(action, reward)
    print("Final value estimates:")
    for arm in range(n_arms):
        print(f"  Arm {arm}: Q={agent.Q[arm]:.3f}, "
              f"true={true_means[arm]:.3f}, "
              f"pulls={int(agent.N[arm])}")
    print(f"\nBest arm found: {np.argmax(agent.Q)} (true best: 3)")

run_ucb_demo()
```

Choosing the Exploration Parameter c
Guidelines for Setting c
- $c = 2$: Common default, works well for rewards in $[0, 1]$
- $c = \sqrt{2}$: Theoretically derived from the Hoeffding bound
- Larger $c$: More exploration, better if arms are hard to distinguish
- Smaller $c$: Less exploration, better if you’re confident about arm rankings

If rewards are scaled differently, scale $c$ accordingly or normalize your rewards.
UCB and Reward Scaling
UCB’s exploration bonus doesn’t automatically adapt to the scale of rewards. If your rewards span a much smaller range than $[0, 1]$, the default $c$ might over-explore; if they span a much larger range, it will under-explore.
Common fixes:
- Normalize rewards to
- Scale $c$ proportionally to the reward range
- Use UCB variants that estimate the reward scale
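The first fix is straightforward when the reward range is known ahead of time; a minimal sketch (the $[0, 100]$ range here is an assumption for illustration):

```python
def normalize_reward(reward, r_min=0.0, r_max=100.0):
    """Map a raw reward from a known range [r_min, r_max] into [0, 1]."""
    return (reward - r_min) / (r_max - r_min)

print(normalize_reward(75.0))  # 0.75 on a 0-100 scale
```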
UCB vs Epsilon-Greedy
The key difference is directed vs undirected exploration:
Epsilon-greedy explores randomly:
- Sometimes tries bad arms even when we know they’re bad
- Exploration doesn’t adapt to what we’ve learned
- Simple to tune (just one parameter)
UCB explores intelligently:
- Focuses exploration on uncertain arms
- Automatically reduces exploration as estimates improve
- Never wastes time on arms we’re confident are bad
Consider a 10-armed bandit after 1000 steps. One arm is clearly best (mean 0.9), others are mediocre (mean around 0.5).
Epsilon-greedy ($\epsilon = 0.2$):
- Pulls the best arm about 820 times
- Distributes ~180 pulls uniformly across all 10 arms
- Wastes ~20 pulls on each bad arm, even though we know they’re bad
UCB ($c = 2$):
- Pulls the best arm about 850 times
- Pulls uncertain arms a few times each to confirm they’re bad
- Stops exploring bad arms once confident
- Concentrates remaining exploration on arms that might rival the best
UCB is more sample-efficient because it doesn’t waste exploration on arms that are clearly suboptimal.
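This difference can be checked empirically. The sketch below pits a minimal epsilon-greedy agent against a minimal UCB agent on a 10-armed problem with one clearly best arm; the exact pull counts depend on the seed, the noise level, and the chosen $\epsilon$ and $c$:

```python
import numpy as np

def run(select, update, true_means, steps, rng):
    """Run one agent for `steps` pulls and return the per-arm pull counts."""
    pulls = np.zeros(len(true_means), dtype=int)
    for t in range(1, steps + 1):
        a = select(t)
        r = rng.normal(true_means[a], 0.1)
        pulls[a] += 1
        update(a, r)
    return pulls

rng = np.random.default_rng(0)
true_means = np.array([0.9] + [0.5] * 9)  # one clearly best arm
k, steps = 10, 1000

# Minimal epsilon-greedy agent (state held in module-level arrays)
Qe, Ne, eps = np.zeros(k), np.zeros(k), 0.2
def eg_select(t):
    return int(rng.integers(k)) if rng.random() < eps else int(np.argmax(Qe))
def eg_update(a, r):
    Ne[a] += 1
    Qe[a] += (r - Qe[a]) / Ne[a]

# Minimal UCB agent
Qu, Nu, c = np.zeros(k), np.zeros(k), 2.0
def ucb_select(t):
    if (Nu == 0).any():
        return int(np.argmax(Nu == 0))  # try each arm once first
    return int(np.argmax(Qu + c * np.sqrt(np.log(t) / Nu)))
def ucb_update(a, r):
    Nu[a] += 1
    Qu[a] += (r - Qu[a]) / Nu[a]

eg_pulls = run(eg_select, eg_update, true_means, steps, rng)
ucb_pulls = run(ucb_select, ucb_update, true_means, steps, rng)
print("eps-greedy pulls per arm:", eg_pulls)
print("UCB pulls per arm:       ", ucb_pulls)
```

With a fixed $\epsilon$, the epsilon-greedy pull counts on the bad arms stay roughly uniform no matter how long you run, while UCB’s taper off as its estimates tighten.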
Regret Analysis
UCB achieves logarithmic regret—the best possible rate for this problem.
The expected regret after $T$ steps is bounded by:

$$\mathbb{E}[\text{Regret}(T)] \le \sum_{a : \Delta_a > 0} \frac{8 \ln T}{\Delta_a} + O(1)$$

where $\Delta_a = \mu^* - \mu_a$ is the gap between arm $a$ and the optimal arm.
Key insights:
- Regret grows as $O(\ln T)$ (sublinear in time)
- Arms with larger gaps contribute less regret (we identify them quickly)
- Arms with small gaps are harder to distinguish and contribute more
Compare to epsilon-greedy, which achieves $O(\epsilon T)$ regret if $\epsilon$ is fixed (linear growth forever).
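To get a feel for the bound’s size, we can evaluate the dominant $\sum_a 8 \ln T / \Delta_a$ term of UCB’s logarithmic regret bound for the 5-armed demo’s assumed true means:

```python
import numpy as np

# Gaps from the 5-armed demo's assumed true means
true_means = np.array([0.2, 0.4, 0.6, 0.8, 0.5])
gaps = true_means.max() - true_means  # Delta_a for each arm

T = 10_000
# Dominant term of the logarithmic regret bound, summed over suboptimal arms
bound = sum(8 * np.log(T) / d for d in gaps if d > 0)
print(f"UCB regret bound at T={T}: ~{bound:.0f}")
print(f"always pulling the worst arm: {gaps.max() * T:.0f}")
```

The logarithmic bound is a small fraction of what any fixed-exploration strategy accumulates linearly over the same horizon.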
Why Logarithmic Regret is Optimal
Lai and Robbins (1985) proved that no algorithm can achieve better than $\Omega(\ln T)$ regret in the worst case for multi-armed bandits. UCB achieves this optimal rate, making it asymptotically the best we can do!
The intuition: even with perfect exploration, you need on the order of $\ln T / \Delta^2$ samples to distinguish between arms whose means differ by $\Delta$.
UCB Variants
Several UCB variants exist for different settings:
```python
import numpy as np

class UCB1Tuned:
    """
    UCB1-Tuned: Incorporates variance estimates for tighter bounds.

    Uses the observed variance to potentially explore less when
    an arm's rewards are consistent.
    """

    def __init__(self, n_arms, c=2.0):
        self.n_arms = n_arms
        self.c = c
        self.Q = np.zeros(n_arms)
        self.Q_sq = np.zeros(n_arms)  # For variance computation
        self.N = np.zeros(n_arms)
        self.t = 0

    def select_action(self):
        self.t += 1
        for arm in range(self.n_arms):
            if self.N[arm] == 0:
                return arm
        # Compute variance estimate for each arm
        variance = self.Q_sq - self.Q ** 2
        variance = np.maximum(variance, 0)  # Handle numerical issues
        # UCB1-Tuned exploration term
        ln_t = np.log(self.t)
        V = variance + np.sqrt(2 * ln_t / self.N)
        exploration = self.c * np.sqrt(ln_t / self.N * np.minimum(0.25, V))
        ucb_values = self.Q + exploration
        return np.argmax(ucb_values)

    def update(self, action, reward):
        self.N[action] += 1
        n = self.N[action]
        # Update mean
        self.Q[action] += (reward - self.Q[action]) / n
        # Update mean of squares (for variance)
        self.Q_sq[action] += (reward ** 2 - self.Q_sq[action]) / n

class UCBBayesian:
    """
    Bayesian UCB: Uses posterior distribution for uncertainty.

    Assumes Gaussian rewards with known variance. Uses the
    posterior mean and standard deviation.
    """

    def __init__(self, n_arms, c=2.0, prior_mean=0.0, prior_var=1.0, noise_var=1.0):
        self.n_arms = n_arms
        self.c = c
        self.noise_var = noise_var
        # Prior: mean and precision (inverse variance)
        self.mean = np.full(n_arms, prior_mean)
        self.precision = np.full(n_arms, 1.0 / prior_var)
        self.t = 0

    def select_action(self):
        self.t += 1
        # Posterior standard deviation
        posterior_std = 1.0 / np.sqrt(self.precision)
        # UCB value
        ucb_values = self.mean + self.c * posterior_std
        return np.argmax(ucb_values)

    def update(self, action, reward):
        # Bayesian update for Gaussian posterior
        noise_precision = 1.0 / self.noise_var
        new_precision = self.precision[action] + noise_precision
        new_mean = (self.precision[action] * self.mean[action] +
                    noise_precision * reward) / new_precision
        self.precision[action] = new_precision
        self.mean[action] = new_mean
```

When to Use UCB
UCB is Great When:
- You need sample-efficient exploration
- The problem is stationary (arm values don’t change)
- You want theoretical guarantees on performance
- Arms are reasonably distinguishable
Consider Alternatives When:
- The problem is non-stationary (use sliding window UCB)
- You have prior knowledge about arms (use Thompson Sampling)
- You need very simple implementation (use epsilon-greedy)
- Rewards are highly variable and you want variance-aware exploration
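As an illustration of the non-stationary case, here is a sketch of discounted UCB, a common variant closely related to sliding-window UCB. The discount factor `gamma` and the forced-refresh threshold below are design choices, not canonical values:

```python
import numpy as np

class DiscountedUCB:
    """UCB with exponentially discounted statistics for drifting arms.

    A sketch of one common non-stationary variant: old rewards are
    down-weighted by `gamma`, so the estimates can track changing means.
    """

    def __init__(self, n_arms, c=2.0, gamma=0.99):
        self.c = c
        self.gamma = gamma
        self.reward_sum = np.zeros(n_arms)  # discounted reward totals
        self.counts = np.zeros(n_arms)      # discounted pull counts

    def select_action(self):
        # Force a pull of any arm whose discounted count has decayed away
        stale = self.counts < 1e-8
        if stale.any():
            return int(np.argmax(stale))
        total = self.counts.sum()
        Q = self.reward_sum / self.counts
        bonus = self.c * np.sqrt(np.log(total) / self.counts)
        return int(np.argmax(Q + bonus))

    def update(self, action, reward):
        # Decay all statistics, then credit the pulled arm
        self.reward_sum *= self.gamma
        self.counts *= self.gamma
        self.reward_sum[action] += reward
        self.counts[action] += 1.0

# Smoke test on a stationary 3-armed problem
agent = DiscountedUCB(3)
rng = np.random.default_rng(0)
means = [0.1, 0.9, 0.5]
pulls = np.zeros(3, dtype=int)
for _ in range(500):
    a = agent.select_action()
    pulls[a] += 1
    agent.update(a, rng.normal(means[a], 0.1))
print("pulls:", pulls)  # the middle arm should end up with the most pulls
```

Because the discounted counts of unpulled arms shrink toward zero, their bonuses grow again over time, which is exactly the extra re-exploration a drifting environment requires.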
Summary
UCB provides principled, directed exploration through optimism:
- Core principle: “Be optimistic about uncertain options”
- UCB1 formula: $A_t = \arg\max_a \big[ Q_t(a) + c\sqrt{\ln t / N_t(a)} \big]$
- Exploration bonus shrinks as we learn about each arm
- Achieves optimal $O(\ln T)$ regret
- More sample-efficient than epsilon-greedy
In the next section, we’ll see Thompson Sampling—another approach to principled exploration that takes a Bayesian perspective.
UCB extends naturally to contextual bandits (as LinUCB) and reinforcement learning (as UCB-based exploration bonuses in Q-learning). The principle of “optimism in the face of uncertainty” appears throughout RL research.