Chapter 2

Multi-Armed Bandits

Master the exploration-exploitation tradeoff in the simplest RL setting


What You'll Learn

  • Define the multi-armed bandit problem and its components
  • Explain why bandits are the simplest form of RL
  • Implement and compare action-value estimation methods
  • Apply ε-greedy, UCB, and Thompson Sampling strategies
  • Understand regret and how different algorithms compare

The Casino Problem

Imagine you’re in a casino with 10 slot machines. Each machine has a different (unknown) payout rate. You have 1000 coins to spend. How do you maximize your winnings?

You could stick with the first machine that seems good—but what if there’s a better one you never tried? You could try every machine equally—but then you’re wasting pulls on bad machines you’ve already identified.

This is the multi-armed bandit problem—named after the old term “one-armed bandit” for slot machines. It’s the purest form of the exploration-exploitation dilemma, stripped of any complexity about states or sequences.

The bandit problem gives us a laboratory for studying exploration:

  • There are no states—the situation doesn’t change based on your actions
  • There’s no sequence—each pull is independent
  • There’s only one decision: which arm to pull

By removing these complexities, we can focus entirely on the question: How do we learn which option is best while also maximizing reward along the way?

Every technique we develop here—ε-greedy, UCB, Thompson Sampling—will reappear when we tackle full RL problems.

The Formal Setup

Mathematical Details

A k-armed bandit is defined by:

  • k actions (arms): $a \in \{1, 2, \ldots, k\}$
  • Reward distributions: Each arm $a$ has an unknown distribution with mean $\mu_a$
  • Goal: Maximize total reward over $T$ pulls

When you pull arm $a$, you receive a random reward $R$ drawn from that arm’s distribution. The expected reward is $\mu_a$, but you don’t know it—you have to learn it from experience.

The optimal arm is $a^* = \arg\max_a \mu_a$, with expected reward $\mu^*$.
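
As a concrete (made-up) instance: with $k = 3$ arms and means $\mu = (0.2, 0.5, 0.9)$, the optimal arm is $a^* = 3$ with $\mu^* = 0.9$; an agent that always pulled arm 1 would give up $0.7$ in expected reward on every pull.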

Action-Value Estimation

Before we can choose wisely, we need to estimate how good each arm is. The natural approach: track the average reward we’ve received from each arm.

After pulling arm aa several times, we have a collection of rewards from it. The sample average is our best estimate of its true value.

But there’s a clever trick: we don’t need to store all past rewards. We can update our estimate incrementally after each pull.

Mathematical Details

Let $Q_n(a)$ be our estimate of arm $a$’s value after the arm has been selected $n-1$ times. The sample average is:

$$Q_n(a) = \frac{R_1 + R_2 + \ldots + R_{n-1}}{n-1}$$

where $R_1, \ldots, R_{n-1}$ are the rewards received from arm $a$.

The incremental update formula gives the same result without storing all rewards:

$$Q_{n+1}(a) = Q_n(a) + \frac{1}{n}\left[R_n - Q_n(a)\right]$$

In words: new estimate = old estimate + step size × (reward - old estimate)

The term $[R_n - Q_n(a)]$ is the prediction error—how far off our estimate was. We move our estimate toward the observed reward.
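
For example, if the estimate after two pulls is $Q_3(a) = 5.0$ and the third pull returns $R_3 = 8.0$, the update gives $Q_4(a) = 5.0 + \tfrac{1}{3}(8.0 - 5.0) = 6.0$—exactly the average of the three rewards, since the first two must sum to 10 and the running total is 18 over 3 pulls.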

</>Implementation
import numpy as np

class BanditEnvironment:
    """A k-armed bandit with Gaussian rewards."""

    def __init__(self, k=10):
        # True reward means (unknown to the agent)
        self.q_true = np.random.randn(k)
        self.k = k

    def pull(self, action):
        """Pull an arm and receive a reward."""
        # Reward is drawn from N(q_true[action], 1)
        return np.random.randn() + self.q_true[action]

    def optimal_action(self):
        """Return the best arm (for evaluation)."""
        return np.argmax(self.q_true)


class ActionValueAgent:
    """Agent that estimates action values using sample averages."""

    def __init__(self, k):
        self.k = k
        self.Q = np.zeros(k)      # Value estimates
        self.N = np.zeros(k)      # Action counts

    def update(self, action, reward):
        """Update value estimate for an action."""
        self.N[action] += 1
        # Incremental update formula
        self.Q[action] += (reward - self.Q[action]) / self.N[action]
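
To show how the two classes fit together, here is a minimal usage sketch (the arm index 3 is an arbitrary choice; systematic action selection is the subject of the rest of the chapter):

env = BanditEnvironment(k=10)
agent = ActionValueAgent(k=10)

action = 3                    # arbitrary arm, just for illustration
reward = env.pull(action)     # noisy reward drawn from N(q_true[3], 1)
agent.update(action, reward)  # Q[3] moves toward the observed reward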

The Greedy Strategy (and Why It Fails)

The simplest strategy: always choose the arm with the highest estimated value.

Greedy action selection: $A_t = \arg\max_a Q_t(a)$

This sounds reasonable—exploit what you know. But there’s a fatal flaw.

Early on, our estimates are based on few samples. The arm that looks best might not be best—we just got lucky (or unlucky) on a few early pulls. If we commit to that arm, we never learn about the others.

A greedy agent gets stuck with whatever arm happened to look good initially. It might never discover the truly optimal arm.
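
As a minimal sketch (the class name is ours; the comparison later in the chapter gets the same behavior from ε-greedy with ε = 0), a purely greedy agent looks like this:

class GreedyAgent(ActionValueAgent):
    """Always exploit: pick the arm with the highest current estimate."""

    def select_action(self):
        # No exploration: an early lucky estimate can dominate forever
        return np.argmax(self.Q)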

ε-Greedy: Simple Exploration

The fix is simple: occasionally take a random action instead of the greedy one.

ε-greedy action selection:

  • With probability $1 - \epsilon$: choose the greedy action (exploit)
  • With probability $\epsilon$: choose a random action (explore)

This ensures every arm gets tried occasionally, so we can learn their true values. Common choices are $\epsilon = 0.1$ (10% exploration) or $\epsilon = 0.01$ (1% exploration).

</>Implementation
class EpsilonGreedyAgent(ActionValueAgent):
    """ε-greedy bandit agent."""

    def __init__(self, k, epsilon=0.1):
        super().__init__(k)
        self.epsilon = epsilon

    def select_action(self):
        """Select action using ε-greedy policy."""
        if np.random.random() < self.epsilon:
            return np.random.randint(self.k)  # Explore
        else:
            return np.argmax(self.Q)          # Exploit

The Cost of Exploration

There’s a tradeoff in choosing $\epsilon$:

  • High ε (more exploration): Learns quickly about all arms, but wastes many pulls on known-bad arms even late in the game.
  • Low ε (less exploration): Exploits good arms efficiently, but might get stuck if it hasn’t explored enough early on.

A fixed $\epsilon$ is never optimal: we explore too much once we’ve already learned which arms are good, or too little before we’ve figured things out.

💡Tip

A common improvement is ε-decay: start with high exploration (ε = 1.0), gradually reduce it as you learn. This balances early exploration with late exploitation.
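
One way to implement this, as a rough sketch (the decay rate and floor here are illustrative choices, not prescribed values):

class EpsilonDecayAgent(EpsilonGreedyAgent):
    """ε-greedy with multiplicative ε decay."""

    def __init__(self, k, epsilon=1.0, decay=0.995, epsilon_min=0.01):
        super().__init__(k, epsilon)
        self.decay = decay
        self.epsilon_min = epsilon_min

    def select_action(self):
        action = super().select_action()
        # Shrink ε after every selection, but keep a small exploration floor
        self.epsilon = max(self.epsilon_min, self.epsilon * self.decay)
        return action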

Measuring Performance: Regret

How do we know if our algorithm is working? We could measure total reward, but that depends on the problem. Regret gives us a problem-independent measure.

Regret is the difference between:

  • The reward we would have gotten by always playing the best arm
  • The reward we actually got

It measures the cost of not knowing the optimal action from the start—the price of learning.

Mathematical Details

After $T$ steps, the cumulative regret is:

$$L_T = \sum_{t=1}^{T} \left[\mu^* - \mu_{A_t}\right] = T\mu^* - \sum_{t=1}^{T} \mu_{A_t}$$

where $\mu^*$ is the mean of the optimal arm and $A_t$ is the action we chose at time $t$.

Key insight: An algorithm with sublinear regret (regret grows slower than $T$) is learning effectively. Linear regret (regret grows proportionally with $T$) means we’re losing a constant fraction of reward to suboptimal choices forever.

Good algorithms achieve $O(\sqrt{T})$ or $O(\log T)$ regret.
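
Regret can only be measured when the true means are known—exactly what a simulated testbed provides. A minimal sketch using the BanditEnvironment defined earlier (the helper name is ours):

def cumulative_regret(env, actions):
    """Cumulative regret L_T computed from the environment's true (hidden) means."""
    mu_star = env.q_true.max()                            # best achievable mean reward
    per_step_gap = mu_star - env.q_true[np.asarray(actions)]
    return np.cumsum(per_step_gap)                        # L_1, L_2, ..., L_T

Plotting this curve for each agent makes the linear-versus-sublinear distinction visible at a glance.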

Upper Confidence Bound (UCB)

ε-greedy explores randomly. Can we be smarter? UCB explores systematically based on uncertainty.

The idea: for each arm, consider not just your estimate of its value, but also how uncertain you are about that estimate.

Arms you’ve pulled many times have tight estimates—you know what they’re worth. Arms you’ve rarely tried have wide uncertainty—they might be great!

UCB is optimistic: it assumes uncertain arms are as good as they plausibly could be. This encourages trying under-sampled arms.

Mathematical Details

UCB action selection:

$$A_t = \arg\max_a \left[ Q_t(a) + c\sqrt{\frac{\ln t}{N_t(a)}} \right]$$

where:

  • $Q_t(a)$ is the estimated value of arm $a$
  • $N_t(a)$ is the number of times arm $a$ has been selected
  • $t$ is the current time step
  • $c$ controls the exploration-exploitation balance (often $c = \sqrt{2}$)

The second term is the exploration bonus:

  • Large when $N_t(a)$ is small (rarely tried)
  • Shrinks as $N_t(a)$ grows (well explored)
  • Grows with $\ln t$ (keeps exploring as time passes)
</>Implementation
class UCBAgent(ActionValueAgent):
    """UCB bandit agent."""

    def __init__(self, k, c=2.0):
        super().__init__(k)
        self.c = c
        self.t = 0

    def select_action(self):
        """Select action using UCB."""
        self.t += 1

        # Try each arm at least once
        for a in range(self.k):
            if self.N[a] == 0:
                return a

        # UCB values
        ucb_values = self.Q + self.c * np.sqrt(np.log(self.t) / self.N)
        return np.argmax(ucb_values)
ℹ️Note

UCB has theoretical guarantees: it achieves $O(\log T)$ regret, which is the optimal growth rate for stochastic bandits. In practice, it often outperforms ε-greedy, especially when exploration is expensive.

Thompson Sampling: The Bayesian Approach

Thompson Sampling takes a probability-matching approach: select each arm with probability proportional to the probability that it’s optimal.

The idea is beautifully simple:

  1. Maintain a probability distribution over each arm’s true value
  2. Sample a value from each arm’s distribution
  3. Play the arm with the highest sampled value
  4. Update the distributions based on the observed reward

This automatically balances exploration and exploitation. Arms with uncertain values sometimes get high samples (exploration). Arms with reliably high values get consistently high samples (exploitation).

Mathematical Details

For Gaussian rewards with known variance, Thompson Sampling maintains a posterior distribution for each arm’s mean.

For Bernoulli rewards (success/failure), we use Beta distributions:

  • Prior: $\text{Beta}(1, 1)$ (uniform)
  • After $s$ successes and $f$ failures: $\text{Beta}(1+s, 1+f)$

Action selection:

  1. For each arm $a$, sample $\tilde{\mu}_a \sim \text{Beta}(\alpha_a, \beta_a)$
  2. Select $A_t = \arg\max_a \tilde{\mu}_a$
</>Implementation
class ThompsonSamplingAgent:
    """Thompson Sampling for Bernoulli bandits."""

    def __init__(self, k):
        self.k = k
        # Beta distribution parameters: alpha = successes + 1, beta = failures + 1
        self.alpha = np.ones(k)  # Prior: Beta(1,1)
        self.beta = np.ones(k)

    def select_action(self):
        """Sample from posteriors and pick the max."""
        samples = np.random.beta(self.alpha, self.beta)
        return np.argmax(samples)

    def update(self, action, reward):
        """Update posterior for the selected arm."""
        if reward > 0:  # Success
            self.alpha[action] += 1
        else:           # Failure
            self.beta[action] += 1
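
The Gaussian case mentioned above can be handled the same way. Here is a minimal sketch, assuming unit-variance rewards and a standard normal prior $\mathcal{N}(0, 1)$ on each arm’s mean (the class name is ours); it reuses the counts and sample means already tracked by ActionValueAgent and could be dropped into the comparison below:

class GaussianThompsonAgent(ActionValueAgent):
    """Thompson Sampling for Gaussian rewards with known unit variance."""

    def select_action(self):
        # With a N(0, 1) prior and N[a] unit-variance observations whose
        # sample mean is Q[a], the posterior over the arm's mean is
        # N(N[a] * Q[a] / (N[a] + 1), 1 / (N[a] + 1)).
        post_mean = self.N * self.Q / (self.N + 1)
        post_std = np.sqrt(1.0 / (self.N + 1))
        samples = np.random.normal(post_mean, post_std)
        return np.argmax(samples)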
🔍Deep Dive

Why Thompson Sampling Works

Thompson Sampling is probability matching: the probability of selecting an arm equals the probability that it’s optimal (under the posterior).

This sounds like a lot of exploration—but it works because:

  • Early on, posteriors are wide, so we explore naturally
  • As we learn, posteriors concentrate, and we exploit the true best arm
  • There’s no arbitrary exploration parameter to tune

Thompson Sampling often outperforms UCB in practice, especially in complex problems.

Comparing Algorithms

Let’s put it all together with a comparison.

</>Implementation
def run_experiment(env, agent, n_steps=1000):
    """Run a bandit experiment."""
    rewards = []
    optimal_actions = []

    for _ in range(n_steps):
        action = agent.select_action()
        reward = env.pull(action)
        agent.update(action, reward)

        rewards.append(reward)
        optimal_actions.append(action == env.optimal_action())

    return np.array(rewards), np.array(optimal_actions)


def compare_algorithms(k=10, n_steps=1000, n_runs=100):
    """Compare multiple bandit algorithms."""
    algorithms = {
        'Greedy': lambda: EpsilonGreedyAgent(k, epsilon=0),
        'ε-greedy (0.1)': lambda: EpsilonGreedyAgent(k, epsilon=0.1),
        'ε-greedy (0.01)': lambda: EpsilonGreedyAgent(k, epsilon=0.01),
        'UCB': lambda: UCBAgent(k, c=2),
    }

    results = {name: {'rewards': [], 'optimal': []} for name in algorithms}

    for run in range(n_runs):
        env = BanditEnvironment(k)

        for name, make_agent in algorithms.items():
            agent = make_agent()
            rewards, optimal = run_experiment(env, agent, n_steps)
            results[name]['rewards'].append(rewards)
            results[name]['optimal'].append(optimal)

    return results


# Run comparison
# results = compare_algorithms()
# Plot average reward and % optimal action over time
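
A plotting sketch for the comparison, assuming matplotlib is installed (the function name and labels are our own choices):

import matplotlib.pyplot as plt

def plot_results(results):
    """Plot average reward and % optimal action per step, averaged over runs."""
    fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(12, 4))
    for name, data in results.items():
        ax1.plot(np.mean(data['rewards'], axis=0), label=name)
        ax2.plot(100 * np.mean(data['optimal'], axis=0), label=name)
    ax1.set_xlabel('Step')
    ax1.set_ylabel('Average reward')
    ax2.set_xlabel('Step')
    ax2.set_ylabel('% optimal action')
    ax1.legend()
    plt.tight_layout()
    plt.show()

# plot_results(compare_algorithms())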

What you’ll typically see:

| Algorithm | Early Performance | Late Performance | Regret Growth |
|---|---|---|---|
| Greedy | Lucky or unlucky | Stuck (no improvement) | Linear |
| ε-greedy (0.1) | Moderate | Good but 10% suboptimal | Linear |
| ε-greedy (0.01) | Slow start | Very good | Near-linear |
| UCB | Moderate (tries all) | Excellent | $O(\log T)$ |
| Thompson | Good | Excellent | $O(\log T)$ |

The pure greedy agent is worst. ε-greedy is solid. UCB and Thompson are best for long-term performance.

Real-World Applications

Bandits aren’t just a toy problem—they power major systems:

A/B Testing (Adaptive Experiments)
Traditional A/B tests split traffic 50/50 and wait for statistical significance. Bandits allocate more traffic to winning variants as evidence accumulates, reducing “regret” from the losing variant.

Recommendation Systems
Should we show you content we know you’ll like, or try something new? Bandits balance personalization with discovery.

Clinical Trials
Adaptive clinical trials use bandit algorithms to assign more patients to treatments that appear effective, reducing exposure to inferior treatments.

Ad Selection
Which ad should we show? Bandits learn user preferences while maximizing click-through rates.

Summary

Key Takeaways

  • The multi-armed bandit is the simplest RL problem: k actions, unknown rewards, no states.
  • Action-value estimation tracks the sample average reward for each arm using incremental updates.
  • Greedy selection exploits but gets stuck. ε-greedy adds random exploration.
  • UCB explores systematically by being optimistic about uncertain arms.
  • Thompson Sampling samples from probability distributions, naturally balancing exploration and exploitation.
  • Regret measures the cost of learning—good algorithms have sublinear regret.
  • Bandits power real systems: A/B testing, recommendations, ads, clinical trials.

Bandits assume the situation doesn’t change—there’s no context affecting which arm is best. But what if the best arm depends on who’s asking? That’s where contextual bandits come in.

Next Chapter: Contextual Bandits

Exercises

Conceptual Questions

  1. Why does a greedy strategy fail in the bandit problem? Be specific about the mechanism that leads to poor performance.

  2. What does the exploration bonus in UCB represent intuitively? Why does it shrink as we pull an arm more often?

  3. How does Thompson Sampling balance exploration and exploitation? Unlike ε-greedy, there’s no ε parameter—how does it know when to explore?

  4. When would you prefer ε-greedy over UCB in practice? Consider implementation complexity, computational cost, and problem characteristics.

Coding Challenges

  1. Implement a 10-armed bandit testbed and compare greedy, ε-greedy (with ε = 0.01, 0.1, 0.5), and UCB. Plot average reward and % optimal action over 1000 steps, averaged over 100 runs.

  2. Implement UCB and verify it achieves lower regret than ε-greedy on a challenging problem (e.g., where arms have similar means).

  3. Implement ε-decay: start with ε = 1.0, decay to ε = 0.01 over the experiment. Compare with fixed ε-greedy.

Exploration

  1. Non-stationary bandits: Design a bandit where the reward distributions change over time (e.g., every 100 steps, the best arm changes). How would you modify the algorithms to handle this? Implement and test your idea.