Exploration vs Exploitation
What You'll Learn
- Articulate the exploration-exploitation dilemma formally
- Implement and compare multiple exploration strategies
- Understand when and why each strategy is appropriate
- Recognize the impact of exploration on learning speed and final performance
- Tune exploration parameters for practical problems
The Fundamental Dilemma
Imagine you’ve found a great restaurant near your home. The food is good, the prices are fair, and you know exactly what to order. Do you keep going there, safe and satisfied? Or do you try that new place around the corner—which might be amazing, or might be terrible?
This is the exploration-exploitation dilemma, and it’s at the heart of reinforcement learning.
Every time your agent makes a decision, it faces a choice:
Exploitation: Take the action that looks best based on what you know. This is safe—you’re using your hard-earned knowledge. But you might be missing something better.
Exploration: Try something different to gather more information. This might lead to discovering better options—or waste time on inferior ones.
Neither extreme works:
- Pure exploitation (ε = 0): The agent gets stuck with whatever it found first. If the first path to the goal wasn’t optimal, it never finds the better one.
- Pure exploration (ε = 1): The agent never uses what it learns. It randomly stumbles around forever.
The art is finding the right balance, and that balance often needs to change over time.
Why Exploration Is Essential
Consider Q-learning with pure exploitation. The agent starts with all Q-values equal (say, zero). It picks an action at random (ties are broken randomly), takes it, and updates that Q-value. Now that Q-value is different from the others.
Next time in that state, it picks that same action—not because it’s good, but because it was tried first. It never samples the alternatives. The agent gets locked into its first choices, which were essentially random.
Exploration breaks this cycle. By sometimes trying non-greedy actions, the agent ensures it gathers information about all options, not just the ones it stumbled upon early.
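To see this lock-in concretely, here is a tiny multi-armed bandit experiment (a sketch only; the reward means, noise level, and sample-average update are made up for illustration). With ε = 0 the agent latches onto whichever arm it happened to value first; with a little exploration it finds the best arm.

```python
import numpy as np

def run_bandit(epsilon, true_means=(0.2, 0.5, 0.8), steps=1000, seed=0):
    """Toy 3-armed bandit: sample-average value estimates, epsilon-greedy actions."""
    np.random.seed(seed)
    Q = np.zeros(len(true_means))   # value estimates, all start equal
    N = np.zeros(len(true_means))   # how often each arm was pulled
    for _ in range(steps):
        if np.random.random() < epsilon:
            action = np.random.randint(len(true_means))   # explore
        else:
            action = int(np.argmax(Q))                    # exploit (ties -> lowest index)
        reward = np.random.normal(true_means[action], 0.1)
        N[action] += 1
        Q[action] += (reward - Q[action]) / N[action]     # incremental sample average
    return int(np.argmax(Q))

print("Greedy (epsilon=0) prefers arm:", run_bandit(epsilon=0.0))    # typically stuck on arm 0
print("epsilon-greedy (0.1) prefers arm:", run_bandit(epsilon=0.1))  # finds arm 2 (mean 0.8)
```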
ε-Greedy: The Simple Baseline
In the previous chapter, we used ε-greedy exploration. Let’s examine it more carefully.
The idea is simple: most of the time, take the best action. But with some small probability ε, choose randomly instead.
- With probability 1 - ε: Take the greedy action (exploit)
- With probability ε: Take a random action (explore)
This ensures every action has some chance of being selected, while still mostly leveraging what we’ve learned.
Mathematical Details
The ε-greedy action selection probability is:
- For the greedy action a* (the one with the highest Q-value): probability is 1 - ε + ε/|A|
- For all other actions: probability is ε/|A|
where |A| is the number of actions.
Note that the greedy action also gets its share of the exploration probability—so it’s selected with probability 1 - ε + ε/|A|, slightly higher than 1 - ε.
In practice, we often simplify this:
- With probability ε: random action (from all actions, including greedy)
- With probability 1 - ε: greedy action
Implementation
```python
import numpy as np
class EpsilonGreedy:
"""Epsilon-greedy exploration strategy."""
def __init__(self, epsilon=0.1):
self.epsilon = epsilon
def select_action(self, Q_values):
"""
Select action using epsilon-greedy.
Args:
Q_values: Array of Q-values for each action
Returns:
Selected action index
"""
if np.random.random() < self.epsilon:
# Explore: random action
return np.random.randint(len(Q_values))
else:
# Exploit: greedy action
return np.argmax(Q_values)
def get_action_probs(self, Q_values):
"""Return probability of each action (for visualization)."""
n_actions = len(Q_values)
probs = np.ones(n_actions) * self.epsilon / n_actions
best_action = np.argmax(Q_values)
probs[best_action] += 1 - self.epsilon
        return probs
```

Choosing ε
A common starting point is ε = 0.1 (10% exploration). This works well for many problems, but is rarely optimal (a quick check of the resulting action probabilities follows this list).
- Too high: The agent explores too much, learning is slow, and the learned policy is never used at full potential.
- Too low: The agent might not explore enough to find good regions of the state space.
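To see what ε = 0.1 means for the action distribution, here is a quick check using the EpsilonGreedy class above (the four Q-values are made up for illustration):

```python
strategy = EpsilonGreedy(epsilon=0.1)
Q_values = np.array([1.0, 0.5, 0.2, 0.0])   # action 0 is currently greedy

print(strategy.get_action_probs(Q_values))
# [0.925 0.025 0.025 0.025]: the greedy action gets 1 - eps + eps/4,
# every other action gets eps/4
```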
ε-Decay: Reducing Exploration Over Time
A fixed ε faces a fundamental problem: the right amount of exploration early in training (lots!) isn’t the right amount later (less!).
Early on, the agent knows nothing. Q-values are unreliable. Exploring widely makes sense—you’re gathering information, and your “greedy” action isn’t really better than others.
Later, Q-values are accurate. The greedy action really is the best. Continuing to explore 10% of the time means 10% of your actions are suboptimal, even when you know better.
Solution: decay ε over time. Start with high exploration, gradually reduce it as learning progresses.
Mathematical Details
Common decay schedules:

Exponential decay:

$$\varepsilon_t = \max\left(\varepsilon_{\text{end}},\ \varepsilon_{\text{start}} \cdot \lambda^{t}\right)$$

where t is the step or episode number and λ is the decay rate.

Linear decay:

$$\varepsilon_t = \max\left(\varepsilon_{\text{end}},\ \varepsilon_{\text{start}} - (\varepsilon_{\text{start}} - \varepsilon_{\text{end}}) \cdot \frac{t}{T}\right)$$

where T is the total number of decay steps.

Inverse decay:

$$\varepsilon_t = \frac{\varepsilon_{\text{start}}}{1 + k\,t}$$

where k controls how quickly exploration falls off. A small code sketch of all three schedules follows.
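As a minimal sketch of these three schedules (the function names and the constant k are illustrative choices):

```python
def exponential_decay(t, eps_start=1.0, eps_end=0.01, decay_rate=0.995):
    # eps_t = max(eps_end, eps_start * decay_rate**t)
    return max(eps_end, eps_start * decay_rate ** t)

def linear_decay(t, eps_start=1.0, eps_end=0.01, total_steps=1000):
    # Walk linearly from eps_start to eps_end over total_steps, then hold
    frac = min(t / total_steps, 1.0)
    return eps_start - frac * (eps_start - eps_end)

def inverse_decay(t, eps_start=1.0, k=0.01):
    # eps_t = eps_start / (1 + k * t)
    return eps_start / (1.0 + k * t)
```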
Implementation
```python
class EpsilonDecay:
"""Epsilon-greedy with decay over time."""
def __init__(self, epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995):
self.epsilon_start = epsilon_start
self.epsilon_end = epsilon_end
self.decay_rate = decay_rate
self.epsilon = epsilon_start
self.step = 0
def select_action(self, Q_values):
"""Select action with current epsilon, then decay."""
if np.random.random() < self.epsilon:
return np.random.randint(len(Q_values))
return np.argmax(Q_values)
def decay(self):
"""Decay epsilon after each episode (or step)."""
self.epsilon = max(
self.epsilon_end,
self.epsilon * self.decay_rate
)
self.step += 1
def get_epsilon(self):
"""Get current epsilon value."""
        return self.epsilon
```

Usage:

```python
exploration = EpsilonDecay(epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995)
for episode in range(1000):
state = env.reset()
done = False
while not done:
action = exploration.select_action(agent.Q[state])
# ... take action, observe, update ...
    exploration.decay()  # Decay at end of each episode
```

Decay rate matters a lot.
With decay_rate=0.995 (and epsilon_start=1.0, epsilon_end=0.01):
- After 100 episodes: ε ≈ 0.61
- After 500 episodes: ε ≈ 0.08
- After 1000 episodes: ε ≈ 0.01 (clipped at epsilon_end)
With decay_rate=0.999:
- After 1000 episodes: ε ≈ 0.37 (still exploring heavily!)
Tune this based on how long training runs and when you expect good Q-values; the snippet below reproduces these numbers.
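A quick way to check numbers like these for your own settings (plain arithmetic, using the same epsilon floor as EpsilonDecay.decay):

```python
for decay_rate, episodes in [(0.995, 100), (0.995, 500), (0.995, 1000), (0.999, 1000)]:
    eps = max(0.01, 1.0 * decay_rate ** episodes)   # epsilon_start=1.0, epsilon_end=0.01
    print(f"decay_rate={decay_rate}, after {episodes} episodes: epsilon={eps:.3f}")
```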
Boltzmann (Softmax) Exploration
ε-greedy has a blind spot: it treats all non-greedy actions equally. But what if one action has Q-value 10 and another has Q-value -100? Should we really explore both with equal probability?
Boltzmann exploration (also called softmax exploration) chooses actions with probability proportional to their estimated value. Higher Q-values → higher probability. Lower Q-values → lower probability.
This is more nuanced than ε-greedy:
- Actions that look promising get more exploration
- Actions that look terrible get less
- But nothing is completely ruled out
The temperature parameter τ controls how peaked the distribution is:
- High τ → nearly uniform (exploration)
- Low τ → nearly deterministic (exploitation)
- τ → 0 → pure greedy
- τ → ∞ → pure random
Mathematical Details
The Boltzmann (softmax) action selection probability is:

$$P(a \mid s) = \frac{\exp\left(Q(s, a) / \tau\right)}{\sum_{b} \exp\left(Q(s, b) / \tau\right)}$$

where τ is the temperature parameter.
Properties:
- As τ → 0: P(a* | s) → 1 for the greedy action a* (deterministic greedy)
- As τ → ∞: P(a | s) → 1/|A| for every action (uniform random)
Example: If we have two actions with Q(action 0) = 1 and Q(action 1) = 0, the probabilities at different temperatures are approximately as follows (the snippet after the table reproduces them):
| Temperature τ | P(action 0) | P(action 1) |
|---|---|---|
| 0.1 | 0.9999 | 0.0001 |
| 0.5 | 0.88 | 0.12 |
| 1.0 | 0.73 | 0.27 |
| 2.0 | 0.62 | 0.38 |
| 10.0 | 0.52 | 0.48 |
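As a quick sanity check (assuming the Q-values of 1 and 0 above), the softmax formula reproduces the table to within rounding:

```python
import numpy as np

Q = np.array([1.0, 0.0])   # assumed Q-values for the two actions
for tau in [0.1, 0.5, 1.0, 2.0, 10.0]:
    probs = np.exp(Q / tau) / np.sum(np.exp(Q / tau))
    print(f"tau={tau:>4}: P(action 0)={probs[0]:.2f}, P(action 1)={probs[1]:.2f}")
```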
Implementation
```python
class BoltzmannExploration:
"""Boltzmann (softmax) exploration strategy."""
def __init__(self, temperature=1.0):
self.temperature = temperature
def select_action(self, Q_values):
"""Select action using Boltzmann distribution."""
# Compute softmax probabilities
# Subtract max for numerical stability
Q_stable = Q_values - np.max(Q_values)
exp_Q = np.exp(Q_stable / self.temperature)
probs = exp_Q / np.sum(exp_Q)
# Sample from distribution
return np.random.choice(len(Q_values), p=probs)
def get_action_probs(self, Q_values):
"""Return probability of each action."""
Q_stable = Q_values - np.max(Q_values)
exp_Q = np.exp(Q_stable / self.temperature)
return exp_Q / np.sum(exp_Q)
def set_temperature(self, temperature):
"""Update temperature (for annealing)."""
        self.temperature = temperature
```

Like ε-greedy, Boltzmann exploration benefits from annealing: start with high temperature (more exploration), gradually reduce it. This gives you the best of both worlds: broad exploration early, focused exploitation late.
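For example, an annealing loop might look like this (the temperature schedule constants are illustrative, not tuned values):

```python
exploration = BoltzmannExploration(temperature=5.0)
tau_end, tau_decay = 0.1, 0.99

for episode in range(1000):
    # ... run one episode, selecting actions with exploration.select_action(Q_values) ...
    new_tau = max(tau_end, exploration.temperature * tau_decay)
    exploration.set_temperature(new_tau)   # lower temperature -> more exploitation
```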
Upper Confidence Bound (UCB)
Both ε-greedy and Boltzmann use randomness for exploration. But randomness isn’t the only option. What if we could explore systematically based on our uncertainty?
UCB embodies the principle of “optimism in the face of uncertainty.”
The idea: for each action, consider not just your best estimate of its value, but also how uncertain you are about that estimate. Then pick the action with the highest upper confidence bound—the action that could be best if your estimates are off.
Why upper bound? Because we’re optimistic. We assume uncertain actions might actually be great. This encourages us to try them and reduce our uncertainty.
Actions we’ve tried many times have tight confidence bounds—we know what they’re worth. Actions we’ve rarely tried have wide bounds—they might be amazing! UCB systematically explores the under-sampled actions.
Mathematical Details
The UCB action selection rule is:

$$a = \arg\max_{a'} \left[\, Q(s, a') + c \sqrt{\frac{\ln N(s)}{N(s, a')}} \,\right]$$

where:
- Q(s, a) is the estimated action value
- N(s) is the number of times state s was visited
- N(s, a) is the number of times action a was taken in state s
- c is the exploration constant (controls exploration vs exploitation)
The second term is the exploration bonus:
- Large when N(s, a) is small (rarely tried actions)
- Shrinks as N(s, a) grows (well-explored actions)
- Grows with N(s) (ensures we keep exploring as total visits increase)
Implementation
```python
class UCBExploration:
"""Upper Confidence Bound exploration strategy."""
def __init__(self, c=2.0, n_actions=4):
self.c = c # Exploration constant
self.n_actions = n_actions
self.action_counts = {} # N(s, a)
self.state_counts = {} # N(s)
def select_action(self, state, Q_values):
"""Select action using UCB."""
# Initialize counts for new states
if state not in self.state_counts:
self.state_counts[state] = 0
self.action_counts[state] = np.zeros(self.n_actions)
N_s = self.state_counts[state]
# If any action hasn't been tried, try it
untried = np.where(self.action_counts[state] == 0)[0]
if len(untried) > 0:
return np.random.choice(untried)
# Compute UCB values
N_sa = self.action_counts[state]
exploration_bonus = self.c * np.sqrt(np.log(N_s) / N_sa)
ucb_values = Q_values + exploration_bonus
return np.argmax(ucb_values)
def update_counts(self, state, action):
"""Update visit counts after taking action."""
if state not in self.state_counts:
self.state_counts[state] = 0
self.action_counts[state] = np.zeros(self.n_actions)
self.state_counts[state] += 1
        self.action_counts[state][action] += 1
```

UCB’s limitation: It requires tracking visit counts for every state-action pair. This works for tabular settings but breaks down when:
- The state space is huge (you visit most states at most once)
- States are continuous (you never visit exactly the same state twice)
For deep RL, we need different approaches—which we’ll see in the DQN chapter.
Comparing Strategies
Let’s put all four strategies head-to-head.
Implementation
```python
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict
class QLearningWithExploration:
"""Q-learning agent with configurable exploration strategy."""
def __init__(self, n_actions, exploration_strategy, alpha=0.1, gamma=0.99):
self.n_actions = n_actions
self.exploration = exploration_strategy
self.alpha = alpha
self.gamma = gamma
self.Q = defaultdict(lambda: np.zeros(n_actions))
def select_action(self, state):
Q_values = self.Q[state]
# UCB needs special handling (needs state)
if hasattr(self.exploration, 'update_counts'):
action = self.exploration.select_action(state, Q_values)
self.exploration.update_counts(state, action)
return action
return self.exploration.select_action(Q_values)
def update(self, state, action, reward, next_state, done):
if done:
td_target = reward
else:
td_target = reward + self.gamma * np.max(self.Q[next_state])
td_error = td_target - self.Q[state][action]
self.Q[state][action] += self.alpha * td_error
def compare_strategies(env, strategies, num_episodes=500, num_runs=10):
"""Compare multiple exploration strategies."""
results = {name: [] for name in strategies}
for name, make_strategy in strategies.items():
for run in range(num_runs):
strategy = make_strategy() # Fresh strategy each run
agent = QLearningWithExploration(
n_actions=4,
exploration_strategy=strategy
)
episode_rewards = []
for ep in range(num_episodes):
state = env.reset()
done = False
total_reward = 0
while not done:
action = agent.select_action(state)
next_state, reward, done, _ = env.step(action)
agent.update(state, action, reward, next_state, done)
total_reward += reward
state = next_state
episode_rewards.append(total_reward)
# Decay epsilon if applicable
if hasattr(strategy, 'decay'):
strategy.decay()
results[name].append(episode_rewards)
return results
# Define strategies to compare
strategies = {
'ε-greedy (0.1)': lambda: EpsilonGreedy(epsilon=0.1),
'ε-decay (1.0→0.01)': lambda: EpsilonDecay(1.0, 0.01, 0.995),
'Boltzmann (τ=1.0)': lambda: BoltzmannExploration(temperature=1.0),
'UCB (c=2.0)': lambda: UCBExploration(c=2.0),
}
# Run comparison (assuming env is defined)
# results = compare_strategies(env, strategies)
```

When to Use Each Strategy
| Strategy | Best For | Avoid When |
|---|---|---|
| ε-greedy | Simple problems, quick prototyping, robust baseline | You need optimal exploration efficiency |
| ε-decay | Training runs with defined length, decreasing uncertainty | Environment changes or is highly stochastic |
| Boltzmann | When action values have meaningful relative differences | Q-values have very different scales early vs late |
| UCB | Tabular settings, need systematic exploration | Large/continuous state spaces |
Practical advice: Start with ε-decay. It’s simple, robust, and works well across many problems. Only switch to fancier methods if you have a specific reason:
- Use UCB if you have a small state-action space and want guaranteed coverage
- Use Boltzmann if the relative differences in Q-values are meaningful signals
- Use more sophisticated methods (count-based bonuses, curiosity) if you have large state spaces with sparse rewards
Practical Guidelines
Tuning Exploration
- Start with high exploration: begin with ε at or near 1.0. You want to see the full state space early.
- Decay aggressively but not too fast: You want most exploration done by the halfway point of training. A good rule of thumb: ε should reach its minimum around 70-80% of the way through training (the sketch after this list turns that rule into a decay rate).
- Never go to zero: Even at convergence, a tiny bit of exploration (like ε = 0.01) helps handle non-stationarity and prevents getting stuck.
- Monitor learning curves: If the agent converges quickly but to a suboptimal policy, you probably under-explored. If learning is slow, you might be exploring too much.
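One way to turn that rule of thumb into a concrete decay rate (a small sketch; the helper name and the 75% target are illustrative choices, not fixed prescriptions):

```python
def decay_rate_for(eps_start=1.0, eps_end=0.01, n_episodes=1000, frac=0.75):
    """Multiplicative decay rate so that eps_start reaches eps_end
    after roughly frac * n_episodes applications of eps *= decay_rate."""
    decay_steps = frac * n_episodes
    return (eps_end / eps_start) ** (1.0 / decay_steps)

print(decay_rate_for())  # ~0.9939 for a 1000-episode run
```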
Common Pitfalls
1. Decaying too fast in complex environments: In environments with sparse rewards or many local optima, aggressive decay can lock you into suboptimal behavior before you’ve found the good stuff.
2. Not accounting for Q-scale in Boltzmann: If Q-values are in the thousands, τ = 1 makes the distribution essentially greedy. Scale τ based on expected Q-value magnitudes.
3. Using UCB in function approximation settings: UCB needs visit counts. With neural networks and continuous states, each state is essentially unique. UCB doesn’t apply directly—you need pseudo-count methods or other approaches.
4. Confusing evaluation and training: During training, exploration is necessary. During evaluation (testing what you’ve learned), you should usually be greedy (ε = 0); a minimal evaluation loop is sketched below. Don’t report training performance as your agent’s capability!
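For example, a greedy evaluation loop might look like this (a sketch assuming a tabular agent with a Q table and the same old-style Gym reset/step API used in the code above):

```python
import numpy as np

def evaluate_greedy(agent, env, n_episodes=20):
    """Run the learned policy with epsilon = 0 and report the average return."""
    returns = []
    for _ in range(n_episodes):
        state = env.reset()
        done, total_reward = False, 0.0
        while not done:
            action = int(np.argmax(agent.Q[state]))   # pure exploitation at eval time
            state, reward, done, _ = env.step(action)
            total_reward += reward
        returns.append(total_reward)
    return np.mean(returns)
```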
Summary
Key Takeaways
- The exploration-exploitation dilemma is fundamental: exploit to maximize reward, explore to find better options. Neither extreme works alone.
- ε-greedy is simple and robust: with probability ε, take a random action. A good default is ε = 0.1.
- ε-decay adapts over time: start with high exploration, reduce as Q-values become reliable. Match the decay schedule to training length.
- Boltzmann (softmax) explores proportionally to value estimates. Temperature τ controls the trade-off: high τ → random, low τ → greedy.
- UCB explores systematically based on uncertainty. It prefers under-sampled actions but requires visit counts.
- There’s no universally best strategy. The right choice depends on state space size, training time, and reward structure.
These exploration strategies work well when we can track visit counts or store Q-values for every state-action pair. But what happens when the state space is so large—images, continuous values, millions of states—that we can’t possibly track everything?
That’s where function approximation comes in, and Q-learning meets deep learning.
Exercises
Conceptual Questions
- Why does pure exploitation (ε = 0) fail even after extensive training? Think about how the agent’s early random choices affect what it learns.
- Explain intuitively what “optimism in the face of uncertainty” means in UCB. Why does preferring uncertain actions lead to good exploration?
- When might you prefer Boltzmann exploration over ε-greedy? Give a concrete scenario where the relative Q-values matter.
- What’s the downside of exploring too much? If some exploration is good, why not maximize it?
Coding Challenges
- Implement all four exploration strategies and compare them on GridWorld. Plot learning curves (reward per episode) averaged over 10 runs. Which converges fastest? Which finds the best final policy?
- Create an adaptive ε-decay that slows down when learning stalls. Hint: monitor the TD error or reward over recent episodes. If performance isn’t improving, decay slower.
- Implement a hybrid UCB-ε strategy: Use UCB for action selection, but maintain a minimum exploration probability. Compare with pure UCB on a problem where some states are visited very rarely.
Exploration
- Design your own exploration strategy. What principles would guide it? Consider:
  - What information is available? (Q-values, counts, recent rewards)
  - How should exploration change over time?
  - Should it depend on the state?
  Implement it and compare against ε-decay on your choice of environment.