Multi-Armed Bandits
What You'll Learn
- Formalize the multi-armed bandit problem and its key assumptions
- Implement and analyze greedy and epsilon-greedy exploration strategies
- Understand Upper Confidence Bound (UCB) and optimism in the face of uncertainty
- Apply Thompson Sampling using Bayesian reasoning
- Compare exploration strategies and know when to use each
You’re at a casino with k slot machines. Each has a different (unknown) probability of paying out. You have limited pulls. How do you maximize your winnings?
Welcome to the multi-armed bandit problem—the simplest RL setting, and yet remarkably rich. Here, we’ll develop the fundamental exploration strategies that power everything from online advertising to clinical trials.
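To make the setup concrete, here is a minimal sketch of a Bernoulli bandit environment. The class name `BernoulliBandit` and its interface are illustrative choices, not from any particular library: each arm pays out 1 with a hidden probability, and the agent only ever sees the rewards.

```python
import random

class BernoulliBandit:
    """A k-armed bandit where each arm pays 1 with an unknown probability."""

    def __init__(self, payout_probs, seed=None):
        self.payout_probs = payout_probs  # true (hidden) success rate per arm
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.payout_probs)

    def pull(self, arm):
        """Pull one arm; return reward 1 (payout) or 0."""
        return 1 if self.rng.random() < self.payout_probs[arm] else 0

# Example: three machines with hidden payout rates of 20%, 50%, 80%
bandit = BernoulliBandit([0.2, 0.5, 0.8], seed=0)
rewards = [bandit.pull(2) for _ in range(1000)]
print(sum(rewards) / len(rewards))  # close to 0.8
```

The agent's whole problem is that it sees only `rewards`, never `payout_probs`.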
Multi-Armed Bandit Playground
Explore the exploration-exploitation tradeoff. Try manual play or compare algorithms.
Why Bandits?
Before we tackle full reinforcement learning with states and transitions, let’s master the core challenge: exploration vs exploitation. Bandits strip away everything else to focus on this fundamental tradeoff.
The name “multi-armed bandit” comes from imagining a gambler facing multiple slot machines (one-armed bandits), trying to figure out which one pays best while maximizing winnings.
Chapter Overview
This chapter develops the core exploration strategies used throughout RL:
The Bandit Problem
Setup, action-values, and regret as our metric
Greedy Methods
From pure exploitation to epsilon-greedy exploration
Upper Confidence Bound
Optimism in the face of uncertainty
Thompson Sampling
The Bayesian approach to exploration
Comparing Strategies
Empirical comparison and practical recommendations
The Big Picture
Every exploration strategy reflects a different philosophy:
- Greedy: Always exploit what you know
- Epsilon-greedy: Mostly exploit, sometimes explore randomly
- UCB: Explore where uncertainty is high
- Thompson Sampling: Explore probabilistically based on beliefs
There’s no universal best—the right choice depends on your problem. By the end of this chapter, you’ll understand the tradeoffs and know when to use each.
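The first three philosophies can be sketched as simple action-selection rules (Thompson Sampling additionally needs posterior beliefs, so the chapter treats it separately). Function names and the exploration constant `c` here are illustrative assumptions, not a fixed convention:

```python
import math
import random

def greedy(estimates):
    """Always pick the arm with the highest estimated value."""
    return max(range(len(estimates)), key=lambda a: estimates[a])

def epsilon_greedy(estimates, epsilon, rng=random):
    """With probability epsilon explore uniformly; otherwise exploit."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return greedy(estimates)

def ucb(estimates, counts, t, c=2.0):
    """Pick the arm maximizing estimate + confidence bonus (UCB1-style)."""
    def score(a):
        if counts[a] == 0:
            return float("inf")  # an unpulled arm is tried first
        return estimates[a] + c * math.sqrt(math.log(t) / counts[a])
    return max(range(len(estimates)), key=score)

# Example: arm 1 looks best, but arm 2 has barely been explored
estimates = [0.3, 0.6, 0.5]
counts = [50, 50, 2]
print(greedy(estimates))              # 1: exploits the current best estimate
print(ucb(estimates, counts, t=102))  # 2: high uncertainty wins the bonus
```

Note how UCB disagrees with greedy here: arm 2's estimate is lower, but its confidence bonus is large because it has only been pulled twice.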
Multi-armed bandit problem: a sequential decision problem where an agent repeatedly chooses among k actions (“arms”) with unknown reward distributions, aiming to maximize cumulative reward over time.
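The regret metric mentioned above can be made concrete with a short sketch. A common form is expected (pseudo-)regret, computed from the true arm means, which we as designers know but the agent does not; the function name is illustrative:

```python
def cumulative_regret(chosen_arms, true_means):
    """Regret after T pulls: T * best_mean minus the summed means of the
    arms actually pulled. Uses true means, not noisy rewards."""
    best = max(true_means)
    return sum(best - true_means[a] for a in chosen_arms)

# Example: true means hidden from the agent, known only to us
true_means = [0.2, 0.5, 0.8]
# An agent that pulled arm 0, then arm 1, then arm 2 three times:
print(cumulative_regret([0, 1, 2, 2, 2], true_means))  # 0.6 + 0.3 = about 0.9
```

An agent that always played the best arm would have zero regret; good exploration strategies keep regret growing slowly (sublinearly) over time.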
Prerequisites
This chapter builds on:
- The agent-environment loop from The RL Framework
- The exploration-exploitation tradeoff from Exploration vs Exploitation
Key Takeaways
- Bandits isolate the exploration-exploitation tradeoff from sequential decision-making
- Epsilon-greedy explores randomly with fixed probability
- UCB explores where uncertainty is high (“optimism in the face of uncertainty”)
- Thompson Sampling samples from posterior beliefs to balance exploration and exploitation
- Regret measures how much reward we lost compared to always playing the best arm
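As a closing sketch, Thompson Sampling for Bernoulli arms keeps a Beta posterior per arm (counts of successes and failures), samples a plausible payout rate from each posterior, and plays the arm with the highest sample. The function name and the uniform Beta(1, 1) prior are illustrative assumptions:

```python
import random

def thompson_step(successes, failures, rng=random):
    """Sample a payout rate per arm from its Beta posterior; play the max."""
    samples = [rng.betavariate(1 + s, 1 + f)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=lambda a: samples[a])

# Run on a 3-armed Bernoulli bandit with hidden payout rates
rng = random.Random(0)
true_probs = [0.2, 0.5, 0.8]
successes = [0, 0, 0]
failures = [0, 0, 0]
for _ in range(2000):
    arm = thompson_step(successes, failures, rng)
    if rng.random() < true_probs[arm]:
        successes[arm] += 1
    else:
        failures[arm] += 1

pulls = [s + f for s, f in zip(successes, failures)]
print(pulls)  # the best arm (index 2) should dominate the pull counts
```

Early on the posteriors are wide, so all arms get sampled; as evidence accumulates, the posterior for the best arm concentrates and exploration tapers off on its own.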