The Core Dilemma
Every RL agent faces a fundamental question at every step: should I use what I know, or try something new?
Exploitation means choosing the best action according to current knowledge—maximizing immediate expected reward.
Exploration means trying actions that might not seem optimal—gathering information that could improve future decisions.
You’re in a new city for a week, looking for a good restaurant. You’ve found a decent place that serves acceptable food. Every evening, you face a choice:
- Exploit: Go back to the known restaurant. Guaranteed acceptable meal.
- Explore: Try a new place. It might be amazing—or terrible.
If you only exploit, you might never discover the incredible restaurant around the corner. If you only explore, you’ll waste evenings on mediocre meals when you already know a good option.
- 🎯 **Exploit:** get reward now based on current knowledge.
- 🔍 **Explore:** learn more for better future decisions.
Try It Yourself
Experience the dilemma firsthand. You have three slot machines—each with a hidden probability of winning. Your goal: maximize total wins. But which machine is best?
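If you want to play with the same setup in code, here is a minimal sketch of the three-machine game. The win probabilities (`TRUE_PROBS`) are hypothetical values chosen for illustration; in the game, they are hidden from you.

```python
import random

random.seed(0)

# Hypothetical hidden win probabilities -- the player never sees these.
TRUE_PROBS = [0.3, 0.5, 0.65]

def pull(machine: int) -> int:
    """Pull one machine; return 1 on a win, 0 on a loss."""
    return 1 if random.random() < TRUE_PROBS[machine] else 0

# Naive approach: sample each machine 20 times, then guess which is best.
estimates = []
for m in range(3):
    results = [pull(m) for _ in range(20)]
    estimates.append(sum(results) / len(results))

best_guess = max(range(3), key=lambda m: estimates[m])
print(f"Empirical win rates: {estimates}, best guess: machine {best_guess}")
```

Note that 20 pulls per machine may not be enough to reliably identify the best one, and every pull spent estimating a bad machine costs you winnings. That tension is exactly the dilemma.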
The Dangers of Extremes
We can formalize this tradeoff using regret—the cumulative difference between what you earned and what you could have earned by always choosing the best option:

$$\text{Regret}(T) = \sum_{t=1}^{T} \left( V^* - Q(a_t) \right)$$

where $V^*$ is the best action's true value and $Q(a_t)$ is the true value of the action you chose at step $t$.
The goal is to minimize regret: learn the best action quickly enough that you don’t waste too many pulls on suboptimal choices.
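To make this concrete, here is a short sketch that computes cumulative regret for a sequence of choices, using hypothetical true values for three actions:

```python
# Hypothetical true action values, hidden from the agent.
true_values = {"A": 0.3, "B": 0.5, "C": 0.65}
best_value = max(true_values.values())  # V* = 0.65

def regret(chosen_actions):
    """Cumulative regret: sum of (V* - Q(a_t)) over the chosen actions."""
    return sum(best_value - true_values[a] for a in chosen_actions)

# Pure exploitation of a decent-but-not-best action for 100 steps:
print(round(regret(["B"] * 100), 2))  # -> 15.0
# Always choosing the best action accrues zero regret:
print(regret(["C"] * 100))            # -> 0.0
```

Settling early on machine B feels fine step by step (each pull loses only 0.15 in expectation), but the loss compounds linearly with time, which is why early exploration pays off.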
Strategies Preview
The art of RL is balancing exploration and exploitation. Throughout this book, you’ll learn progressively sophisticated strategies:
- **ε-greedy:** Exploit most of the time, but with probability ε, choose randomly. Simple and surprisingly effective.
- **Optimism in the face of uncertainty (e.g., UCB):** Be optimistic about uncertain options. The less you know about an action, the more you should try it.
- **Thompson sampling:** Maintain beliefs about each action's value. Sample from these beliefs to decide what to try.
- **Curiosity-driven exploration:** Reward the agent for discovering novel states. Used in complex environments where random exploration fails.
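As a preview of how little machinery the simplest strategy needs, here is an ε-greedy sketch on the three-machine game from earlier. The win probabilities are the same hypothetical values, hidden from the agent, which learns running-average estimates instead:

```python
import random

random.seed(1)

EPSILON = 0.1
TRUE_PROBS = [0.3, 0.5, 0.65]   # hypothetical hidden win probabilities

counts = [0, 0, 0]              # pulls per machine
values = [0.0, 0.0, 0.0]        # running-average reward per machine

for _ in range(1000):
    if random.random() < EPSILON:
        arm = random.randrange(3)                     # explore: random arm
    else:
        arm = max(range(3), key=lambda a: values[a])  # exploit: best estimate
    reward = 1 if random.random() < TRUE_PROBS[arm] else 0
    counts[arm] += 1
    values[arm] += (reward - values[arm]) / counts[arm]  # incremental mean

print(f"Pull counts per machine: {counts}")
```

With ε = 0.1, roughly 90% of pulls go to the current best estimate, while the remaining 10% keep every machine's estimate from going stale. You will implement and compare this and the other strategies in the Multi-Armed Bandits chapter.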
Beyond RL
Exploration vs. exploitation isn’t just an RL concept—it’s fundamental to learning and decision-making:
- Scientists balance replicating known experiments vs. testing new hypotheses
- Companies balance improving existing products vs. creating new ones
- You balance practicing skills you have vs. learning new ones
- Hiring managers balance promoting known performers vs. giving chances to unknowns
Understanding this tradeoff helps you think about decisions in everyday life.
In the next chapter, we’ll explore the RL Landscape—a map of the different algorithm families you’ll learn. Then you’ll get hands-on with an interactive GridWorld demo.
After that, we dive into Multi-Armed Bandits: the simplest RL setting where you’ll implement and compare these exploration strategies yourself.