Chapter 101
📝Draft

Multi-Armed Bandits

Master the exploration-exploitation tradeoff in the simplest RL setting


What You'll Learn

  • Formalize the multi-armed bandit problem and its key assumptions
  • Implement and analyze greedy and epsilon-greedy exploration strategies
  • Understand Upper Confidence Bound (UCB) and optimism in the face of uncertainty
  • Apply Thompson Sampling using Bayesian reasoning
  • Compare exploration strategies and know when to use each

You’re at a casino with k slot machines. Each has a different (unknown) probability of paying out. You have limited pulls. How do you maximize your winnings?

Welcome to the multi-armed bandit problem—the simplest RL setting, and yet remarkably rich. Here, we’ll develop the fundamental exploration strategies that power everything from online advertising to clinical trials.
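To make the setup concrete, here is a minimal sketch of a Bernoulli bandit environment in Python. The class name and interface are illustrative assumptions, not part of this chapter's code:

```python
import random

class BernoulliBandit:
    """k-armed bandit where arm i pays 1 with a hidden probability probs[i]."""

    def __init__(self, probs, seed=0):
        self.probs = list(probs)       # true payout rates, unknown to the agent
        self.rng = random.Random(seed)

    @property
    def k(self):
        return len(self.probs)

    def pull(self, arm):
        """Return a reward of 1 or 0 for the chosen arm."""
        return 1 if self.rng.random() < self.probs[arm] else 0

# A 3-armed bandit; arm 2 is best, but the agent doesn't know that.
bandit = BernoulliBandit([0.2, 0.5, 0.7])
reward = bandit.pull(0)  # 0 or 1
```

Everything that follows in this chapter is, in effect, a strategy for deciding which `arm` to pass to `pull` at each step.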

Multi-Armed Bandit Playground

Explore the exploration-exploitation tradeoff. Try manual play or compare algorithms.

(Interactive playground: seven slot-machine arms, each displaying its pull count n and current value estimate, with running totals for Steps, Rewards, Win Rate, and Regret. In Manual Mode, you choose which arm to pull and try to find the best arm while minimizing regret.)

Why Bandits?

Before we tackle full reinforcement learning with states and transitions, let’s master the core challenge: exploration vs exploitation. Bandits strip away everything else to focus on this fundamental tradeoff.

ℹ️Note

The name “multi-armed bandit” comes from imagining a gambler facing multiple slot machines (one-armed bandits), trying to figure out which one pays best while maximizing winnings.

Chapter Overview

This chapter develops the core exploration strategies used throughout RL.

The Big Picture

Every exploration strategy reflects a different philosophy:

  • Greedy: Always exploit what you know
  • Epsilon-greedy: Mostly exploit, sometimes explore randomly
  • UCB: Explore where uncertainty is high
  • Thompson Sampling: Explore probabilistically based on beliefs

There’s no universal best—the right choice depends on your problem. By the end of this chapter, you’ll understand the tradeoffs and know when to use each.
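As a first taste of two of these philosophies, here is a hedged sketch of epsilon-greedy and UCB1 action selection, assuming sample-average value estimates. The function names and signatures are illustrative, not from this chapter:

```python
import math
import random

def epsilon_greedy(values, epsilon, rng=random):
    """With probability epsilon pick a random arm; otherwise exploit the
    arm with the highest estimated value."""
    if rng.random() < epsilon:
        return rng.randrange(len(values))
    return max(range(len(values)), key=lambda a: values[a])

def ucb1(counts, values, t):
    """UCB1: estimated value plus a bonus that is large for rarely
    pulled (uncertain) arms -- optimism in the face of uncertainty."""
    for a, n in enumerate(counts):
        if n == 0:               # pull each arm once before using the formula
            return a
    return max(range(len(values)),
               key=lambda a: values[a] + math.sqrt(2 * math.log(t) / counts[a]))
```

Note the difference in spirit: epsilon-greedy explores blindly (any non-greedy arm is equally likely), while UCB1 directs exploration toward arms whose estimates are least trustworthy.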

📖Multi-Armed Bandit

A sequential decision problem where an agent repeatedly chooses among k actions (“arms”) with unknown reward distributions, aiming to maximize cumulative reward over time.
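The goal in this definition, "maximize cumulative reward," is usually made precise through regret, which the playground also tracks. Under a common convention (notation assumed here, not defined in this chapter): if arm a has mean reward mu_a and mu* = max_a mu_a is the best arm's mean, the expected regret after T pulls a_1, ..., a_T is

```latex
R(T) = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right]
```

Regret is zero only if the best arm is played at every step; good algorithms keep it growing sublinearly in T.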

Key Takeaways

  • Bandits isolate the exploration-exploitation tradeoff from sequential decision-making
  • Epsilon-greedy explores randomly with fixed probability
  • UCB explores where uncertainty is high (“optimism in the face of uncertainty”)
  • Thompson Sampling samples from posterior beliefs to balance exploration and exploitation
  • Regret measures how much reward we lost compared to always playing the best arm
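The Thompson Sampling takeaway can be illustrated with a short Beta-Bernoulli sketch. The helper name and the example histories below are hypothetical, chosen only to show the mechanism:

```python
import random

def thompson_sample(successes, failures, rng=random):
    """Beta-Bernoulli Thompson Sampling: draw a plausible mean for each arm
    from its Beta(1 + successes, 1 + failures) posterior, then play the
    arm with the highest draw."""
    draws = [rng.betavariate(1 + s, 1 + f)
             for s, f in zip(successes, failures)]
    return max(range(len(draws)), key=lambda a: draws[a])

# Arm 1 has paid out far more often, so its posterior is high and narrow;
# arm 0's wide posterior still earns occasional exploratory pulls.
rng = random.Random(0)
picks = [thompson_sample([2, 80], [3, 20], rng) for _ in range(1000)]
```

Because exploration falls out of posterior sampling rather than a hand-tuned schedule, arms are tried in proportion to how plausible it is that they are actually best.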
Next Chapter: Contextual Bandits