Thompson Sampling
UCB explores by being optimistic—always assuming uncertain arms might be great. Thompson Sampling takes a different, more elegant approach: it explores by being uncertain.
Rather than computing upper bounds, Thompson Sampling maintains a probability distribution over each arm’s true value and samples from these beliefs. When you’re uncertain, your samples vary widely, naturally leading to exploration. When you’re confident, your samples concentrate, leading to exploitation.
The Bayesian Perspective
Maintain a posterior distribution over each arm’s reward. At each step, sample a value from each distribution and play the arm with the highest sample.
Imagine you believe each arm has some true success probability, but you don’t know what it is. For each arm, you have a “belief distribution” representing how likely different values are.
Thompson Sampling works like this:
- For each arm, draw a random sample from your belief about its value
- Play the arm with the highest sample
- Update your belief based on the outcome
The magic: when you’re uncertain about an arm, your samples are spread out—sometimes high, sometimes low. This means you’ll sometimes sample a high value even for an arm with a low average, causing you to try it. As you learn, your beliefs tighten, and samples become more consistent.
Beta Distributions for Bernoulli Bandits
The cleanest case is when rewards are binary (0 or 1), like click/no-click or success/failure.
For Bernoulli bandits, we maintain a Beta distribution over each arm's success probability: $\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$.
The Beta distribution has these properties:
- Defined on $[0, 1]$ (perfect for probabilities)
- Mean: $\frac{\alpha}{\alpha + \beta}$
- Variance: $\frac{\alpha\beta}{(\alpha + \beta)^2(\alpha + \beta + 1)}$
- Higher $\alpha + \beta$ means lower variance (more confident)
Prior: Start with $\text{Beta}(1, 1)$ (uniform distribution over $[0, 1]$)
Posterior update after observing reward $r$:
- If $r = 1$ (success): $\alpha \leftarrow \alpha + 1$
- If $r = 0$ (failure): $\beta \leftarrow \beta + 1$
This is the conjugate prior for Bernoulli likelihood—the posterior is also Beta.
Consider an arm with true success probability 0.7. Starting from Beta(1, 1):
After 0 observations: Beta(1, 1) - uniform, anything is possible
After 10 observations (7 successes, 3 failures): Beta(8, 4)
- Mean: 0.67
- 95% credible interval: roughly [0.38, 0.89]
After 100 observations (70 successes, 30 failures): Beta(71, 31)
- Mean: 0.70
- 95% credible interval: roughly [0.60, 0.78]
The distribution sharpens around the true value as we collect more data.
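The means and credible intervals quoted above can be checked directly with `scipy.stats.beta`:

```python
from scipy import stats

for a, b in [(8, 4), (71, 31)]:
    dist = stats.beta(a, b)
    # Central 95% credible interval from the posterior quantiles
    low, high = dist.ppf(0.025), dist.ppf(0.975)
    print(f"Beta({a}, {b}): mean={dist.mean():.2f}, "
          f"95% CI=[{low:.2f}, {high:.2f}]")
```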
The Thompson Sampling Algorithm
At each step:
- For each arm $i$, sample $\hat\theta_i \sim \text{Beta}(\alpha_i, \beta_i)$
- Select the arm with the highest sample: $a = \arg\max_i \hat\theta_i$
- Observe reward $r$
- Update: if $r = 1$, increment $\alpha_a$; otherwise increment $\beta_a$
The beauty is in how sampling naturally balances exploration and exploitation:
- Uncertain arms (wide distribution) → samples vary a lot → sometimes high → sometimes selected
- Confident good arms (narrow distribution, high mean) → samples consistently high → often selected
- Confident bad arms (narrow distribution, low mean) → samples consistently low → rarely selected
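The effect of posterior width on the samples is easy to see numerically. In this small illustration (the Beta parameters are made up), both arms have the same posterior mean, but the uncertain arm's samples spread far more widely, so it occasionally produces the highest sample and gets selected:

```python
import numpy as np

np.random.seed(0)

# An uncertain arm (few observations) vs a confident arm (many observations)
uncertain = np.random.beta(2, 2, size=10_000)    # wide posterior
confident = np.random.beta(50, 50, size=10_000)  # narrow posterior

print(f"uncertain arm: mean={uncertain.mean():.2f}, std={uncertain.std():.2f}")
print(f"confident arm: mean={confident.mean():.2f}, std={confident.std():.2f}")
```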
```python
import numpy as np

class ThompsonSampling:
    """
    Thompson Sampling for Bernoulli bandits.

    Maintains Beta distribution posteriors for each arm's
    success probability.
    """

    def __init__(self, n_arms):
        self.n_arms = n_arms
        # Beta(1, 1) prior = uniform distribution
        self.alpha = np.ones(n_arms)  # Successes + 1
        self.beta = np.ones(n_arms)   # Failures + 1

    def select_action(self):
        """Sample from each posterior and pick the highest."""
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, action, reward):
        """Update the posterior for the selected arm."""
        if reward > 0:
            self.alpha[action] += 1
        else:
            self.beta[action] += 1

    def get_expected_values(self):
        """Return the posterior mean for each arm."""
        return self.alpha / (self.alpha + self.beta)

    def get_confidence(self):
        """Return a measure of confidence (total observations)."""
        return self.alpha + self.beta - 2  # Subtract prior

# Example usage
def run_thompson_demo():
    np.random.seed(42)
    # True success probabilities (arm 2 is best)
    true_probs = [0.3, 0.5, 0.7, 0.2]
    n_arms = len(true_probs)
    agent = ThompsonSampling(n_arms)
    for step in range(500):
        action = agent.select_action()
        # Bernoulli reward
        reward = 1.0 if np.random.random() < true_probs[action] else 0.0
        agent.update(action, reward)
    print("Final posterior parameters:")
    for arm in range(n_arms):
        print(f"  Arm {arm}: alpha={agent.alpha[arm]:.0f}, "
              f"beta={agent.beta[arm]:.0f}, "
              f"mean={agent.get_expected_values()[arm]:.3f}, "
              f"true={true_probs[arm]}")
    print(f"\nArm pulls: {agent.alpha + agent.beta - 2}")

run_thompson_demo()
```
Visualizing the Magic
Let’s trace through Thompson Sampling on a simple example to see how it naturally explores.
Setup: 2 arms with true probabilities $p_1 = 0.3$ (worse) and $p_2 = 0.7$ (better).
Step 1: Both arms have Beta(1,1). Draw samples:
- Arm 1: sample = 0.62
- Arm 2: sample = 0.41
- Pick Arm 1 (higher sample)
- Observe failure (unlucky) → Arm 1 becomes Beta(1, 2)
Step 2: Draw samples:
- Arm 1: Beta(1, 2), sample = 0.28
- Arm 2: Beta(1, 1), sample = 0.75
- Pick Arm 2 (higher sample)
- Observe success → Arm 2 becomes Beta(2, 1)
Step 3: Draw samples:
- Arm 1: Beta(1, 2), sample = 0.35
- Arm 2: Beta(2, 1), sample = 0.81
- Pick Arm 2 again
- …
As we continue:
- Arm 2’s distribution concentrates around 0.7
- Arm 1’s distribution concentrates around 0.3
- Samples from Arm 2 are almost always higher
- We mostly exploit Arm 2, occasionally trying Arm 1 due to sampling variance
After 50 pulls, suppose we have:
- Arm 1: Beta(8, 17) - 7 successes, 16 failures
- Arm 2: Beta(21, 10) - 20 successes, 9 failures
If we draw 1000 sample pairs from these posteriors:
- Arm 1 wins (higher sample): only a handful of times (well under 5%)
- Arm 2 wins (higher sample): the overwhelming majority (over 95%)
Thompson Sampling naturally converges to exploiting the best arm while maintaining a small probability of exploration.
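A quick Monte Carlo check of those numbers (a sketch; the exact fraction varies with the seed):

```python
import numpy as np

np.random.seed(0)
n = 100_000

# Posteriors after the hypothetical 50 pulls: Beta(8, 17) vs Beta(21, 10)
arm1 = np.random.beta(8, 17, size=n)
arm2 = np.random.beta(21, 10, size=n)

frac_arm2 = np.mean(arm2 > arm1)
print(f"Arm 2 sampled higher in {frac_arm2:.1%} of draws")
```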
Thompson Sampling for Gaussian Rewards
When rewards are continuous (not just 0/1), we can place a Normal prior on each arm's mean. (When the variance is also unknown, the conjugate choice is a Normal-Inverse-Gamma prior; here we keep the variance known for simplicity.)
For Gaussian rewards with unknown mean $\mu$ and known variance $\sigma^2$:
Prior: $\mu \sim \mathcal{N}(\mu_0, \sigma_0^2)$
Posterior after $n$ observations with sample mean $\bar{x}$:
$$\mu \mid \text{data} \sim \mathcal{N}\!\left(\frac{\sigma_0^{-2}\mu_0 + n\sigma^{-2}\bar{x}}{\sigma_0^{-2} + n\sigma^{-2}},\ \frac{1}{\sigma_0^{-2} + n\sigma^{-2}}\right)$$
Thompson Sampling: sample from each arm’s posterior, pick the highest sample.
```python
import numpy as np

class ThompsonSamplingGaussian:
    """
    Thompson Sampling for Gaussian bandits with known variance.

    Uses a Normal prior on the mean, which gives a Normal posterior.
    """

    def __init__(self, n_arms, prior_mean=0.0, prior_var=1.0, reward_var=1.0):
        self.n_arms = n_arms
        self.reward_var = reward_var
        # Prior parameters for each arm
        self.prior_mean = prior_mean
        self.prior_var = prior_var
        # Posterior parameters (mean and precision = 1/variance)
        self.post_mean = np.full(n_arms, prior_mean)
        self.post_precision = np.full(n_arms, 1.0 / prior_var)

    def select_action(self):
        """Sample from each posterior and pick the highest."""
        post_var = 1.0 / self.post_precision
        samples = np.random.normal(self.post_mean, np.sqrt(post_var))
        return np.argmax(samples)

    def update(self, action, reward):
        """Bayesian update for the Gaussian posterior."""
        reward_precision = 1.0 / self.reward_var
        # Update precision (precisions add)
        new_precision = self.post_precision[action] + reward_precision
        # Update mean (precision-weighted average)
        new_mean = (self.post_precision[action] * self.post_mean[action] +
                    reward_precision * reward) / new_precision
        self.post_precision[action] = new_precision
        self.post_mean[action] = new_mean

    def get_posterior_stats(self):
        """Return posterior mean and standard deviation for each arm."""
        post_std = 1.0 / np.sqrt(self.post_precision)
        return self.post_mean, post_std

# Example usage
def run_gaussian_thompson():
    np.random.seed(42)
    # True means (arm 2 is best)
    true_means = [0.0, 0.5, 1.0, 0.3]
    n_arms = len(true_means)
    reward_var = 1.0
    agent = ThompsonSamplingGaussian(n_arms, reward_var=reward_var)
    for step in range(300):
        action = agent.select_action()
        reward = np.random.normal(true_means[action], np.sqrt(reward_var))
        agent.update(action, reward)
    means, stds = agent.get_posterior_stats()
    print("Final posterior estimates:")
    for arm in range(n_arms):
        print(f"  Arm {arm}: mean={means[arm]:.3f} +/- {stds[arm]:.3f}, "
              f"true={true_means[arm]}")

run_gaussian_thompson()
```
Why Thompson Sampling Works So Well
Thompson Sampling has some remarkable properties:
- Probability matching: The probability of selecting an arm equals the posterior probability that it's actually optimal. This is a natural way to allocate exploration.
- Automatically adapts: No tuning parameters (beyond the prior). The posterior uncertainty naturally determines exploration.
- Information-directed: Unlike UCB, which is optimistic about all uncertain arms equally, Thompson Sampling focuses on arms that might plausibly be optimal.
- Handles correlations: When arms have related rewards, Bayesian reasoning can share information.
Thompson Sampling achieves the optimal regret bound, matching UCB asymptotically. But in practice, it often performs better, especially when:
- There are many arms: Thompson Sampling scales better because it naturally ignores arms with very low probability of being optimal.
- Arms are similar: Thompson Sampling uses posterior beliefs to focus exploration on the truly competitive arms.
- Priors are informative: If you have good prior knowledge, Thompson Sampling can incorporate it directly.
The asymptotic regret bound for Thompson Sampling on Bernoulli bandits matches the Lai-Robbins lower bound:
$$\lim_{T \to \infty} \frac{\mathbb{E}[R(T)]}{\ln T} = \sum_{i : \Delta_i > 0} \frac{\Delta_i}{\mathrm{KL}(\mu_i, \mu^*)}$$
where $\mathrm{KL}(\mu_i, \mu^*)$ is the Kullback-Leibler divergence between the reward distribution of arm $i$ and that of the optimal arm, and $\Delta_i = \mu^* - \mu_i$ is the gap.
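To make the bound concrete, here is a small calculation of the per-arm constant $\Delta_i / \mathrm{KL}(\mu_i, \mu^*)$ for the Bernoulli instance used in the earlier demo (true probabilities 0.3, 0.5, 0.7, 0.2):

```python
import numpy as np

def kl_bernoulli(p, q):
    """KL divergence between Bernoulli(p) and Bernoulli(q)."""
    return p * np.log(p / q) + (1 - p) * np.log((1 - p) / (1 - q))

# Suboptimal arm means; 0.7 is the optimal arm
means = [0.3, 0.5, 0.2]
best = 0.7

for mu in means:
    gap = best - mu
    print(f"arm mean {mu}: gap/KL = {gap / kl_bernoulli(mu, best):.2f}")
```

Arms closer to the optimal mean have a smaller KL divergence and thus contribute a larger constant: they are harder to distinguish from the best arm and cost more exploration.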
Thompson Sampling vs UCB
Both achieve optimal regret, but they have different characteristics:
Thompson Sampling advantages:
- Often better empirical performance
- No exploration parameter to tune
- Naturally Bayesian—can incorporate prior knowledge
- Handles structured problems (correlations between arms) gracefully
- Randomized: can help in adversarial settings
UCB advantages:
- Deterministic: easier to analyze and debug
- Doesn’t require specifying a prior
- Simpler to explain (“optimism in uncertainty”)
- Works with any bounded reward distribution
Consider 100 arms where:
- 95 arms have success probability 0.1
- 5 arms have success probability 0.5
UCB with standard parameters:
- Must try each arm multiple times to reduce uncertainty
- Wastes many pulls on the 95 bad arms
Thompson Sampling:
- After a few failures on a bad arm, its posterior concentrates around low values
- Samples from bad arms are consistently low
- Quickly focuses on the 5 competitive arms
Thompson Sampling is more aggressive about eliminating clearly inferior arms.
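A quick simulation of this scenario (a sketch; the exact numbers depend on the random seed) shows Thompson Sampling concentrating its pulls on the competitive arms:

```python
import numpy as np

np.random.seed(0)

# 95 bad arms (p=0.1) and 5 good arms (p=0.5), as in the scenario above
true_probs = np.array([0.1] * 95 + [0.5] * 5)
n_arms = len(true_probs)

alpha = np.ones(n_arms)
beta = np.ones(n_arms)
pulls = np.zeros(n_arms)

for _ in range(5000):
    samples = np.random.beta(alpha, beta)
    a = np.argmax(samples)
    r = 1.0 if np.random.random() < true_probs[a] else 0.0
    alpha[a] += r
    beta[a] += 1 - r
    pulls[a] += 1

good_frac = pulls[95:].sum() / pulls.sum()
print(f"Fraction of pulls on the 5 good arms: {good_frac:.1%}")
```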
Practical Considerations
Choosing Priors
For Bernoulli bandits:
- Beta(1, 1) is uninformative (uniform) - good default
- Beta(0.5, 0.5) (Jeffreys prior) is also popular
- If you expect arms to be similar, use an empirical Bayes approach
For Gaussian bandits:
- Set prior mean to your best guess
- Set prior variance based on how uncertain you are
- Larger prior variance = more exploration initially
Model Mismatch
Thompson Sampling assumes rewards follow a particular distribution (Beta for Bernoulli, Normal for Gaussian). If the true reward distribution is very different, performance can suffer.
For example, if rewards are heavy-tailed (occasional large payouts), a Gaussian model might underexplore because it doesn’t expect extreme values.
When in doubt:
- Use robust priors with heavier tails
- Consider non-parametric Bayesian methods
- Fall back to UCB which makes weaker assumptions
Here’s a more robust implementation that handles edge cases:
```python
import numpy as np
from scipy import stats

class RobustThompsonSampling:
    """
    Thompson Sampling with practical improvements.

    Handles numerical stability and provides diagnostics.
    """

    def __init__(self, n_arms, prior_alpha=1.0, prior_beta=1.0):
        self.n_arms = n_arms
        self.alpha = np.full(n_arms, prior_alpha)
        self.beta = np.full(n_arms, prior_beta)
        self.pulls = np.zeros(n_arms, dtype=int)
        self.history = []

    def select_action(self):
        """Sample from each posterior and pick the highest."""
        samples = np.random.beta(self.alpha, self.beta)
        action = np.argmax(samples)
        self.history.append({
            'samples': samples.copy(),
            'action': action
        })
        return action

    def update(self, action, reward):
        """Update with a continuous reward in [0, 1]."""
        self.pulls[action] += 1
        # Treat a continuous reward as a fractional success:
        # add `reward` to alpha and the remainder to beta
        self.alpha[action] += reward
        self.beta[action] += 1 - reward

    def get_credible_intervals(self, confidence=0.95):
        """Return credible intervals for each arm."""
        intervals = []
        for arm in range(self.n_arms):
            low = stats.beta.ppf((1 - confidence) / 2,
                                 self.alpha[arm], self.beta[arm])
            high = stats.beta.ppf((1 + confidence) / 2,
                                  self.alpha[arm], self.beta[arm])
            intervals.append((low, high))
        return intervals

    def get_probability_of_optimal(self, n_samples=10000):
        """Estimate the probability each arm is optimal via Monte Carlo."""
        samples = np.random.beta(
            self.alpha.reshape(-1, 1),
            self.beta.reshape(-1, 1),
            size=(self.n_arms, n_samples)
        )
        best = np.argmax(samples, axis=0)
        probs = np.bincount(best, minlength=self.n_arms) / n_samples
        return probs
```
Summary
Thompson Sampling provides a Bayesian approach to exploration:
- Core idea: Sample from beliefs, pick the best sample
- Beta distributions work perfectly for Bernoulli (0/1) rewards
- Exploration emerges naturally from uncertainty in beliefs
- No exploration parameter to tune
- Often outperforms UCB in practice, especially with many arms
Thompson Sampling was invented in 1933 but was largely forgotten until the 2010s when researchers rediscovered its remarkable effectiveness. Today, it’s widely used in industry for A/B testing, recommendations, and ad selection.
In the next section, we’ll compare all our exploration strategies and provide practical guidance on when to use each.