Foundations • Part 4 of 4

Exploration vs Exploitation

The fundamental tradeoff at the heart of reinforcement learning

The Core Dilemma

Every RL agent faces a fundamental question at every step: should I use what I know, or try something new?

📖 Exploration vs Exploitation

Exploitation means choosing the best action according to current knowledge—maximizing immediate expected reward.

Exploration means trying actions that might not seem optimal—gathering information that could improve future decisions.

📌 The Restaurant Problem

You’re in a new city for a week, looking for a good restaurant. You’ve found a decent place that serves acceptable food. Every evening, you face a choice:

  • Exploit: Go back to the known restaurant. Guaranteed acceptable meal.
  • Explore: Try a new place. It might be amazing—or terrible.

If you only exploit, you might never discover the incredible restaurant around the corner. If you only explore, you’ll waste evenings on mediocre meals when you already know a good option.

🎯 Exploit: use what you know to get reward now, based on current knowledge.

vs

🔍 Explore: try new things to learn more for better future decisions.

Try It Yourself

Experience the dilemma firsthand. You have three slot machines—each with a hidden probability of winning. Your goal: maximize total wins. But which machine is best?

The Slot Machine Game

Three slot machines, each with a hidden win probability. Which one is best?

[Interactive demo: three slot machines 🎰, each tracking Pulls, Wins, and an estimated value, plus running totals of pulls, wins, and overall win rate. Pull a machine to see if you win.]
The Dilemma: Should you keep pulling the machine that seems best so far (exploit), or try others to discover if they're actually better (explore)? If you only exploit, you might miss the truly best option. If you only explore, you waste pulls on bad machines.
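The slot-machine setup is easy to simulate. Below is a minimal Python sketch; the probabilities in `TRUE_PROBS` are made up for illustration (in the demo they are hidden from you), and the round-robin policy is just one naive way to gather estimates:

```python
import random

# Hypothetical hidden win probabilities for the three machines
# (in the interactive demo these are unknown to the player).
TRUE_PROBS = [0.3, 0.5, 0.6]

def pull(machine, rng):
    """Return 1 (win) or 0 (loss) for the given machine index."""
    return 1 if rng.random() < TRUE_PROBS[machine] else 0

def estimate_values(pulls, wins):
    """Estimated value of each machine: wins / pulls (None if untried)."""
    return [w / p if p > 0 else None for p, w in zip(pulls, wins)]

rng = random.Random(0)
pulls, wins = [0, 0, 0], [0, 0, 0]
for machine in [0, 1, 2] * 5:  # pure exploration: try each machine 5 times
    wins[machine] += pull(machine, rng)
    pulls[machine] += 1

print(estimate_values(pulls, wins))
```

With only five pulls per machine, the estimates are noisy: the machine that looks best after 15 pulls may not be the true best, which is exactly the dilemma above.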

The Dangers of Extremes

⚠️ Too much exploitation
You get stuck with mediocrity. The “best” option you know might not be the true best. You’ll never discover what you’re missing.

⚠️ Too much exploration
You waste resources trying random options. Even when you find the best, you keep wandering instead of capitalizing on your knowledge.

Mathematical Details

We can formalize this tradeoff using regret—the cumulative difference between what you earned and what you could have earned by always choosing the best option:

Cumulative Regret

\text{Regret}(T) = \sum_{t=1}^{T} \left( \mu^* - \mu_{a_t} \right)

where \mu^* is the best action’s true value and \mu_{a_t} is the true value of the action you chose at step t

The goal is to minimize regret: learn the best action quickly enough that you don’t waste too many pulls on suboptimal choices.
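As a quick sanity check of the definition, here is a small Python sketch that computes Regret(T) for a fixed sequence of choices; the true values in `mu` are assumptions for illustration:

```python
def cumulative_regret(true_values, chosen_actions):
    """Regret(T) = sum over t of (mu* - mu_{a_t})."""
    mu_star = max(true_values)
    return sum(mu_star - true_values[a] for a in chosen_actions)

# Assumed true values for three actions; action 2 is best (mu* = 0.6).
mu = [0.3, 0.5, 0.6]

# Always exploiting a decent-but-suboptimal action accrues regret
# linearly in T (here, 0.1 per pull).
print(round(cumulative_regret(mu, [1] * 10), 2))

# Always choosing the true best action accrues zero regret.
print(cumulative_regret(mu, [2] * 10))
```

A good strategy keeps cumulative regret growing slower than linearly in T: it pays a bounded price for exploration, then mostly exploits.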

Strategies Preview

The art of RL is balancing exploration and exploitation. Throughout this book, you’ll learn progressively sophisticated strategies:

ε-Greedy

Exploit most of the time, but with probability ε, choose randomly. Simple and surprisingly effective.

Covered in: Multi-Armed Bandits
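As a preview, ε-greedy fits in a few lines. A minimal Python sketch, with hypothetical value estimates:

```python
import random

def epsilon_greedy(estimates, epsilon, rng):
    """With probability epsilon, explore (pick a random action);
    otherwise exploit the action with the highest current estimate."""
    if rng.random() < epsilon:
        return rng.randrange(len(estimates))
    return max(range(len(estimates)), key=lambda a: estimates[a])

rng = random.Random(42)
estimates = [0.2, 0.55, 0.4]  # hypothetical current value estimates

# epsilon = 0 always exploits the current best (action 1 here);
# epsilon = 0.1 would explore on roughly 1 in 10 steps.
print(epsilon_greedy(estimates, 0.0, rng))
```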
Upper Confidence Bound (UCB)

Be optimistic about uncertain options. The less you know about an action, the more you should try it.

Covered in: Multi-Armed Bandits
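A sketch of this idea in the UCB1 style, assuming win/lose (Bernoulli) rewards as in the slot-machine game; the counts below are hypothetical:

```python
import math

def ucb_select(pulls, wins, t, c=2.0):
    """Pick argmax of (estimate + c * sqrt(ln t / n_a)).
    The bonus term grows as an action's pull count n_a shrinks,
    so uncertain actions get tried more. Untried actions get an
    infinite bonus and are therefore tried first."""
    scores = []
    for n, w in zip(pulls, wins):
        if n == 0:
            scores.append(float("inf"))
        else:
            scores.append(w / n + c * math.sqrt(math.log(t) / n))
    return max(range(len(scores)), key=lambda a: scores[a])

# Hypothetical counts: action 0 pulled often, action 1 rarely, action 2 never.
pulls, wins = [50, 2, 0], [30, 1, 0]
print(ucb_select(pulls, wins, t=52))  # action 2 is untried, so it is selected
```

Note that with `pulls=[50, 2]` and `wins=[30, 1]`, action 1 would win despite its lower raw estimate (0.5 vs 0.6): its large uncertainty bonus is exactly the "optimism about uncertain options" described above.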
Thompson Sampling

Maintain beliefs about each action’s value. Sample from these beliefs to decide what to try.

Covered in: Multi-Armed Bandits
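For win/lose rewards, these beliefs can be Beta distributions over each action's win rate. A minimal Python sketch with hypothetical win/loss records:

```python
import random

def thompson_select(wins, losses, rng):
    """Sample a plausible win rate for each action from its Beta
    posterior (Beta(wins + 1, losses + 1), a uniform prior updated
    by the data) and pick the action with the highest sample."""
    samples = [rng.betavariate(w + 1, l + 1) for w, l in zip(wins, losses)]
    return max(range(len(samples)), key=lambda a: samples[a])

rng = random.Random(0)
# Hypothetical records: action 1 looks much better, so it is chosen
# most of the time, but uncertain actions still get occasional tries.
wins, losses = [2, 40, 1], [8, 10, 2]
choices = [thompson_select(wins, losses, rng) for _ in range(1000)]
print(choices.count(1) / 1000)  # large majority, but not 100%
```

Exploration here is automatic: an action with little data has a wide posterior, so it occasionally produces the highest sample and gets tried.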
Intrinsic Motivation

Reward the agent for discovering novel states. Used in complex environments where random exploration fails.

Covered in: Deep Q-Networks
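One simple, toy form of this idea is a count-based novelty bonus: reward each state visit by an amount that decays with the visit count (deep-RL versions replace the table with learned density or prediction models). A sketch:

```python
import math
from collections import Counter

def novelty_reward(counts, state, beta=1.0):
    """Count-based intrinsic reward: beta / sqrt(N(s)), where N(s)
    is the number of times state s has been visited (including now).
    beta scales the bonus relative to the environment's own reward."""
    counts[state] += 1
    return beta / math.sqrt(counts[state])

counts = Counter()
# A first visit gives the full bonus; repeat visits decay it,
# while a never-seen state is fully novel again.
print(novelty_reward(counts, "A"))  # first visit to A
print(novelty_reward(counts, "A"))  # second visit: smaller bonus
print(novelty_reward(counts, "B"))  # new state: full bonus again
```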
💡 This Tradeoff Is Everywhere

Exploration vs. exploitation isn’t just an RL concept—it’s fundamental to learning and decision-making:

  • Scientists balance replicating known experiments vs. testing new hypotheses
  • Companies balance improving existing products vs. creating new ones
  • You balance practicing skills you have vs. learning new ones
  • Hiring managers balance promoting known performers vs. giving chances to unknowns

Understanding this tradeoff helps you think about decisions in everyday life.

ℹ️ Coming Up

In the next chapter, we’ll explore the RL Landscape—a map of the different algorithm families you’ll learn. Then you’ll get hands-on with an interactive GridWorld demo.

After that, we dive into Multi-Armed Bandits: the simplest RL setting where you’ll implement and compare these exploration strategies yourself.