Chapter 103

Exploration vs Exploitation

Master the fundamental dilemma of reinforcement learning and learn strategies to balance discovery with optimization

What You'll Learn

  • Articulate the exploration-exploitation dilemma formally
  • Implement and compare multiple exploration strategies
  • Understand when and why each strategy is appropriate
  • Recognize the impact of exploration on learning speed and final performance
  • Tune exploration parameters for practical problems

The Fundamental Dilemma

Imagine you’ve found a great restaurant near your home. The food is good, the prices are fair, and you know exactly what to order. Do you keep going there, safe and satisfied? Or do you try that new place around the corner—which might be amazing, or might be terrible?

This is the exploration-exploitation dilemma, and it’s at the heart of reinforcement learning.

Every time your agent makes a decision, it faces a choice:

Exploitation: Take the action that looks best based on what you know. This is safe—you’re using your hard-earned knowledge. But you might be missing something better.

Exploration: Try something different to gather more information. This might lead to discovering better options—or waste time on inferior ones.

Neither extreme works:

  • Pure exploitation ($\epsilon = 0$): The agent gets stuck with whatever it found first. If the first path to the goal wasn’t optimal, it never finds the better one.
  • Pure exploration ($\epsilon = 1$): The agent never uses what it learns. It randomly stumbles around forever.

The art is finding the right balance, and that balance often needs to change over time.

Why Exploration Is Essential

Consider Q-learning with pure exploitation. The agent starts with all Q-values equal (say, zero). Its first “greedy” pick is just a random tie-break: it takes that action and updates its Q-value, which is now different from the others.

Next time in that state, if that first update nudged the value upward, the agent picks the same action again—not because it’s good, but because it happened to be tried first. It may never sample the alternatives, locked into early choices that were essentially random.

Exploration breaks this cycle. By sometimes trying non-greedy actions, the agent ensures it gathers information about all options, not just the ones it stumbled upon early.

ε-Greedy: The Simple Baseline

In the previous chapter, we used $\epsilon$-greedy exploration. Let’s examine it more carefully.

The idea is simple: most of the time, take the best action. But with some small probability $\epsilon$, choose randomly instead.

  • With probability $1 - \epsilon$: Take the greedy action (exploit)
  • With probability $\epsilon$: Take a random action (explore)

This ensures every action has some chance of being selected, while still mostly leveraging what we’ve learned.

Mathematical Details

The $\epsilon$-greedy action selection probability is:

  • For the greedy action $a = \arg\max_{a'} Q(s, a')$: probability is $1 - \epsilon + \frac{\epsilon}{|\mathcal{A}|}$
  • For all other actions: probability is $\frac{\epsilon}{|\mathcal{A}|}$

where $|\mathcal{A}|$ is the number of actions.

Note that the greedy action also gets its share of the exploration probability—so it’s selected with probability $1 - \epsilon + \epsilon/|\mathcal{A}|$, slightly higher than $1 - \epsilon$. For example, with four actions and $\epsilon = 0.1$, the greedy action is selected with probability 0.925 and each other action with probability 0.025.

In practice, we implement this by sampling:

  • With probability $\epsilon$: pick uniformly at random from all actions (including the greedy one)
  • With probability $1 - \epsilon$: pick the greedy action

This sampling procedure produces exactly the probabilities above.

Implementation
import numpy as np

class EpsilonGreedy:
    """Epsilon-greedy exploration strategy."""

    def __init__(self, epsilon=0.1):
        self.epsilon = epsilon

    def select_action(self, Q_values):
        """
        Select action using epsilon-greedy.

        Args:
            Q_values: Array of Q-values for each action

        Returns:
            Selected action index
        """
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(len(Q_values))
        else:
            # Exploit: greedy action
            return np.argmax(Q_values)

    def get_action_probs(self, Q_values):
        """Return probability of each action (for visualization)."""
        n_actions = len(Q_values)
        probs = np.ones(n_actions) * self.epsilon / n_actions
        best_action = np.argmax(Q_values)
        probs[best_action] += 1 - self.epsilon
        return probs

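Usage: a minimal sketch (the Q-values below are made up for illustration) that confirms the probabilities worked out above:

import numpy as np

explorer = EpsilonGreedy(epsilon=0.1)
Q_values = np.array([1.0, 3.0, 2.0, 0.5])   # illustrative Q-values for one state

# Empirically, the greedy action (index 1) is chosen about 92.5% of the time
counts = np.zeros(len(Q_values))
for _ in range(10_000):
    counts[explorer.select_action(Q_values)] += 1

print(counts / counts.sum())                  # roughly [0.025, 0.925, 0.025, 0.025]
print(explorer.get_action_probs(Q_values))    # exact: [0.025, 0.925, 0.025, 0.025]
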
Choosing ε

💡Tip

A common starting point is $\epsilon = 0.1$ (10% exploration). This works well for many problems, but is rarely optimal.

  • Too high ($\epsilon = 0.5$): The agent explores too much, learning is slow, and the learned policy is never exploited at its full potential.
  • Too low ($\epsilon = 0.01$): The agent might not explore enough to find good regions of the state space.

ε-Decay: Reducing Exploration Over Time

A fixed $\epsilon$ faces a fundamental problem: the right amount of exploration early in training (lots!) isn’t the right amount later (less!).

Early on, the agent knows nothing. Q-values are unreliable. Exploring widely makes sense—you’re gathering information, and your “greedy” action isn’t really better than others.

Later, Q-values are accurate. The greedy action really is the best. Continuing to explore 10% of the time means 10% of your actions are suboptimal, even when you know better.

Solution: decay $\epsilon$ over time. Start with high exploration, gradually reduce it as learning progresses.

Mathematical Details

Common decay schedules:

Exponential decay: $\epsilon_t = \max(\epsilon_{\min}, \epsilon_0 \cdot \text{decay}^t)$

where $t$ is the step or episode number.

Linear decay: $\epsilon_t = \max\left(\epsilon_{\min}, \epsilon_0 - \frac{\epsilon_0 - \epsilon_{\min}}{T} \cdot t\right)$

where $T$ is the total number of decay steps.

Inverse decay: $\epsilon_t = \frac{1}{1 + \beta \cdot t}$
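
To see how the three schedules behave, here is a small sketch with purely illustrative parameter values:

def exponential_decay(t, eps_start=1.0, eps_min=0.01, decay=0.995):
    """Exponential schedule: eps_start * decay**t, floored at eps_min."""
    return max(eps_min, eps_start * decay ** t)

def linear_decay(t, eps_start=1.0, eps_min=0.01, T=1000):
    """Linear schedule: interpolate from eps_start down to eps_min over T steps."""
    return max(eps_min, eps_start - (eps_start - eps_min) / T * t)

def inverse_decay(t, beta=0.01):
    """Inverse schedule: 1 / (1 + beta * t)."""
    return 1.0 / (1.0 + beta * t)

# Print the schedules at a few points in training
for t in [0, 100, 500, 1000]:
    print(t, round(exponential_decay(t), 3), round(linear_decay(t), 3), round(inverse_decay(t), 3))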

Implementation
class EpsilonDecay:
    """Epsilon-greedy with decay over time."""

    def __init__(self, epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995):
        self.epsilon_start = epsilon_start
        self.epsilon_end = epsilon_end
        self.decay_rate = decay_rate
        self.epsilon = epsilon_start
        self.step = 0

    def select_action(self, Q_values):
        """Select action with current epsilon, then decay."""
        if np.random.random() < self.epsilon:
            return np.random.randint(len(Q_values))
        return np.argmax(Q_values)

    def decay(self):
        """Decay epsilon after each episode (or step)."""
        self.epsilon = max(
            self.epsilon_end,
            self.epsilon * self.decay_rate
        )
        self.step += 1

    def get_epsilon(self):
        """Get current epsilon value."""
        return self.epsilon

Usage:

exploration = EpsilonDecay(epsilon_start=1.0, epsilon_end=0.01, decay_rate=0.995)

for episode in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = exploration.select_action(agent.Q[state])
        # ... take action, observe, update ...
    exploration.decay()  # Decay at end of each episode

Boltzmann (Softmax) Exploration

$\epsilon$-greedy has a blind spot: it treats all non-greedy actions equally. But what if one action has Q-value 10 and another has Q-value -100? Should we really explore both with equal probability?

Boltzmann exploration (also called softmax exploration) chooses actions with probability proportional to their estimated value. Higher Q-values → higher probability. Lower Q-values → lower probability.

This is more nuanced than $\epsilon$-greedy:

  • Actions that look promising get more exploration
  • Actions that look terrible get less
  • But nothing is completely ruled out

The temperature parameter $\tau$ controls how peaked the distribution is:

  • High $\tau$ → nearly uniform (exploration)
  • Low $\tau$ → nearly deterministic (exploitation)
  • $\tau \rightarrow 0$ → pure greedy
  • $\tau \rightarrow \infty$ → pure random

Mathematical Details

The Boltzmann (softmax) action selection probability is:

$$\pi(a|s) = \frac{\exp(Q(s,a) / \tau)}{\sum_{a'} \exp(Q(s,a') / \tau)}$$

where $\tau$ is the temperature parameter.

Properties:

  • As $\tau \rightarrow 0$: $\pi(a|s) \rightarrow 1$ for $a = \arg\max_a Q(s,a)$ (deterministic greedy)
  • As $\tau \rightarrow \infty$: $\pi(a|s) \rightarrow 1/|\mathcal{A}|$ (uniform random)

Example: If we have two actions with $Q = [2.0, 1.0]$:

| Temperature | P(action 0) | P(action 1) |
|---|---|---|
| $\tau = 0.1$ | 0.9999 | 0.0001 |
| $\tau = 0.5$ | 0.88 | 0.12 |
| $\tau = 1.0$ | 0.73 | 0.27 |
| $\tau = 2.0$ | 0.62 | 0.38 |
| $\tau = 10.0$ | 0.52 | 0.48 |

Implementation
class BoltzmannExploration:
    """Boltzmann (softmax) exploration strategy."""

    def __init__(self, temperature=1.0):
        self.temperature = temperature

    def select_action(self, Q_values):
        """Select action using Boltzmann distribution."""
        # Compute softmax probabilities
        # Subtract max for numerical stability
        Q_stable = Q_values - np.max(Q_values)
        exp_Q = np.exp(Q_stable / self.temperature)
        probs = exp_Q / np.sum(exp_Q)

        # Sample from distribution
        return np.random.choice(len(Q_values), p=probs)

    def get_action_probs(self, Q_values):
        """Return probability of each action."""
        Q_stable = Q_values - np.max(Q_values)
        exp_Q = np.exp(Q_stable / self.temperature)
        return exp_Q / np.sum(exp_Q)

    def set_temperature(self, temperature):
        """Update temperature (for annealing)."""
        self.temperature = temperature
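
Usage: as a quick check, sweeping the temperature on the illustrative pair $Q = [2.0, 1.0]$ reproduces the table above:

import numpy as np

explorer = BoltzmannExploration(temperature=1.0)
Q = np.array([2.0, 1.0])

# Watch the distribution sharpen toward greedy as the temperature drops
for tau in [10.0, 2.0, 1.0, 0.5, 0.1]:
    explorer.set_temperature(tau)
    print(f"tau={tau}: {np.round(explorer.get_action_probs(Q), 4)}")
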
ℹ️Note

Like $\epsilon$-greedy, Boltzmann exploration benefits from annealing: start with high temperature (more exploration), gradually reduce it. This gives you the best of both worlds: broad exploration early, focused exploitation late.

Upper Confidence Bound (UCB)

Both $\epsilon$-greedy and Boltzmann use randomness for exploration. But randomness isn’t the only option. What if we could explore systematically based on our uncertainty?

UCB embodies the principle of “optimism in the face of uncertainty.”

The idea: for each action, consider not just your best estimate of its value, but also how uncertain you are about that estimate. Then pick the action with the highest upper confidence bound—the action that could be best if your estimates are off.

Why upper bound? Because we’re optimistic. We assume uncertain actions might actually be great. This encourages us to try them and reduce our uncertainty.

Actions we’ve tried many times have tight confidence bounds—we know what they’re worth. Actions we’ve rarely tried have wide bounds—they might be amazing! UCB systematically explores the under-sampled actions.

Mathematical Details

The UCB action selection rule is:

$$a = \arg\max_a \left[ Q(s, a) + c \sqrt{\frac{\ln N(s)}{N(s, a)}} \right]$$

where:

  • $Q(s, a)$ is the estimated action value
  • $N(s)$ is the number of times state $s$ was visited
  • $N(s, a)$ is the number of times action $a$ was taken in state $s$
  • $c$ is the exploration constant (controls exploration vs exploitation)

The second term is the exploration bonus:

  • Large when $N(s, a)$ is small (rarely tried actions)
  • Shrinks as $N(s, a)$ grows (well-explored actions)
  • Grows with $\ln N(s)$ (ensures we keep exploring as total visits increase)

For example, with $c = 2$ and $N(s) = 100$, an action tried 5 times gets a bonus of $2\sqrt{\ln 100 / 5} \approx 1.92$, while one tried 50 times gets only about $0.61$.

Implementation
class UCBExploration:
    """Upper Confidence Bound exploration strategy."""

    def __init__(self, c=2.0, n_actions=4):
        self.c = c  # Exploration constant
        self.n_actions = n_actions
        self.action_counts = {}  # N(s, a)
        self.state_counts = {}   # N(s)

    def select_action(self, state, Q_values):
        """Select action using UCB."""
        # Initialize counts for new states
        if state not in self.state_counts:
            self.state_counts[state] = 0
            self.action_counts[state] = np.zeros(self.n_actions)

        N_s = self.state_counts[state]

        # If any action hasn't been tried, try it
        untried = np.where(self.action_counts[state] == 0)[0]
        if len(untried) > 0:
            return np.random.choice(untried)

        # Compute UCB values
        N_sa = self.action_counts[state]
        exploration_bonus = self.c * np.sqrt(np.log(N_s) / N_sa)
        ucb_values = Q_values + exploration_bonus

        return np.argmax(ucb_values)

    def update_counts(self, state, action):
        """Update visit counts after taking action."""
        if state not in self.state_counts:
            self.state_counts[state] = 0
            self.action_counts[state] = np.zeros(self.n_actions)

        self.state_counts[state] += 1
        self.action_counts[state][action] += 1

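Usage: a minimal sketch of the call pattern; the state key and the all-zero Q-values below are placeholders, not part of any real environment:

import numpy as np

ucb = UCBExploration(c=2.0, n_actions=4)
state = (0, 0)                 # any hashable state identifier works
Q_values = np.zeros(4)         # placeholder Q-values for that state

# The first four calls try each untried action once; after that, the
# exploration bonus steers selection toward under-sampled actions.
for step in range(8):
    action = ucb.select_action(state, Q_values)
    ucb.update_counts(state, action)
    print(step, action, ucb.action_counts[state])
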
Comparing Strategies

Let’s put all four strategies head-to-head.

Implementation
import numpy as np
import matplotlib.pyplot as plt
from collections import defaultdict

class QLearningWithExploration:
    """Q-learning agent with configurable exploration strategy."""

    def __init__(self, n_actions, exploration_strategy, alpha=0.1, gamma=0.99):
        self.n_actions = n_actions
        self.exploration = exploration_strategy
        self.alpha = alpha
        self.gamma = gamma
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def select_action(self, state):
        Q_values = self.Q[state]

        # UCB needs special handling (needs state)
        if hasattr(self.exploration, 'update_counts'):
            action = self.exploration.select_action(state, Q_values)
            self.exploration.update_counts(state, action)
            return action

        return self.exploration.select_action(Q_values)

    def update(self, state, action, reward, next_state, done):
        if done:
            td_target = reward
        else:
            td_target = reward + self.gamma * np.max(self.Q[next_state])

        td_error = td_target - self.Q[state][action]
        self.Q[state][action] += self.alpha * td_error


def compare_strategies(env, strategies, num_episodes=500, num_runs=10):
    """Compare multiple exploration strategies."""
    results = {name: [] for name in strategies}

    for name, make_strategy in strategies.items():
        for run in range(num_runs):
            strategy = make_strategy()  # Fresh strategy each run
            agent = QLearningWithExploration(
                n_actions=4,
                exploration_strategy=strategy
            )

            episode_rewards = []
            for ep in range(num_episodes):
                state = env.reset()
                done = False
                total_reward = 0

                while not done:
                    action = agent.select_action(state)
                    next_state, reward, done, _ = env.step(action)
                    agent.update(state, action, reward, next_state, done)
                    total_reward += reward
                    state = next_state

                episode_rewards.append(total_reward)

                # Decay epsilon if applicable
                if hasattr(strategy, 'decay'):
                    strategy.decay()

            results[name].append(episode_rewards)

    return results


# Define strategies to compare
strategies = {
    'ε-greedy (0.1)': lambda: EpsilonGreedy(epsilon=0.1),
    'ε-decay (1.0→0.01)': lambda: EpsilonDecay(1.0, 0.01, 0.995),
    'Boltzmann (τ=1.0)': lambda: BoltzmannExploration(temperature=1.0),
    'UCB (c=2.0)': lambda: UCBExploration(c=2.0),
}

# Run comparison (assuming env is defined)
# results = compare_strategies(env, strategies)
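
One way to visualize the comparison is to average each strategy’s rewards over runs and plot a smoothed learning curve. The sketch below reuses the numpy and matplotlib imports from the block above; the smoothing window is arbitrary:

def plot_results(results, window=20):
    """Plot smoothed learning curves, averaged over runs, for each strategy."""
    for name, runs in results.items():
        mean_rewards = np.mean(np.array(runs), axis=0)   # average over runs
        smoothed = np.convolve(mean_rewards, np.ones(window) / window, mode='valid')
        plt.plot(smoothed, label=name)
    plt.xlabel('Episode')
    plt.ylabel('Total reward (smoothed)')
    plt.legend()
    plt.title('Exploration strategy comparison')
    plt.show()

# plot_results(results)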

When to Use Each Strategy

| Strategy | Best For | Avoid When |
|---|---|---|
| ε-greedy | Simple problems, quick prototyping, robust baseline | You need optimal exploration efficiency |
| ε-decay | Training runs with defined length, decreasing uncertainty | Environment changes or is highly stochastic |
| Boltzmann | When action values have meaningful relative differences | Q-values have very different scales early vs late |
| UCB | Tabular settings, need systematic exploration | Large/continuous state spaces |

💡Tip

Practical advice: Start with $\epsilon$-decay. It’s simple, robust, and works well across many problems. Only switch to fancier methods if you have a specific reason:

  • Use UCB if you have a small state-action space and want guaranteed coverage
  • Use Boltzmann if the relative differences in Q-values are meaningful signals
  • Use more sophisticated methods (count-based bonuses, curiosity) if you have large state spaces with sparse rewards

Practical Guidelines

Tuning Exploration

  1. Start with high exploration: $\epsilon = 1.0$ or $\tau = 10.0$. You want to see the full state space early.

  2. Decay aggressively but not too fast: You want most exploration done by the halfway point of training. A good rule of thumb: $\epsilon$ should reach its minimum around 70-80% of the way through training (see the sketch after this list for turning that target into a decay rate).

  3. Never go to zero: Even at convergence, a tiny bit of exploration (like $\epsilon = 0.01$) helps handle non-stationarity and prevents getting stuck.

  4. Monitor learning curves: If the agent converges quickly but to a suboptimal policy, you probably under-explored. If learning is slow, you might be exploring too much.
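
As a quick sketch of point 2, you can solve the exponential schedule for the per-episode decay rate that takes $\epsilon$ from its starting value to its floor at a chosen fraction of training (the numbers here are illustrative):

def decay_rate_for(eps_start=1.0, eps_end=0.01, num_episodes=1000, reach_at=0.75):
    """Decay rate r such that eps_start * r**t reaches eps_end at t = reach_at * num_episodes."""
    target_step = reach_at * num_episodes
    return (eps_end / eps_start) ** (1.0 / target_step)

print(decay_rate_for())   # about 0.9939 for these illustrative numbers

With this rate, EpsilonDecay(epsilon_start=1.0, epsilon_end=0.01, decay_rate=decay_rate_for()) hits its minimum roughly three quarters of the way through a 1000-episode run.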

Common Pitfalls

  • Decaying $\epsilon$ too fast: the agent commits to a policy before it has seen enough of the state space and converges to something suboptimal.
  • Never decaying: a fixed fraction of actions stays random forever, capping final performance.
  • Dropping exploration to exactly zero: the agent can’t recover if the environment shifts or its early estimates were wrong.
  • Forgetting to actually apply the decay: with per-episode schedules, make sure the decay step runs at the end of every episode.

Summary

Key Takeaways

  • The exploration-exploitation dilemma is fundamental: exploit to maximize reward, explore to find better options. Neither extreme works alone.
  • $\epsilon$-greedy is simple and robust: with probability $\epsilon$, take a random action. A good default is $\epsilon = 0.1$.
  • $\epsilon$-decay adapts over time: start with high exploration, reduce as Q-values become reliable. Match the decay schedule to training length.
  • Boltzmann (softmax) explores proportionally to value estimates. Temperature $\tau$ controls the trade-off: high → random, low → greedy.
  • UCB explores systematically based on uncertainty. It prefers under-sampled actions but requires visit counts.
  • There’s no universally best strategy. The right choice depends on state space size, training time, and reward structure.

These exploration strategies work well when we can track visit counts or store Q-values for every state-action pair. But what happens when the state space is so large—images, continuous values, millions of states—that we can’t possibly track everything?

That’s where function approximation comes in, and Q-learning meets deep learning.

Next Chapter: Deep Q-Networks

Exercises

Conceptual Questions

  1. Why does pure exploitation ($\epsilon = 0$) fail even after extensive training? Think about how the agent’s early random choices affect what it learns.

  2. Explain intuitively what “optimism in the face of uncertainty” means in UCB. Why does preferring uncertain actions lead to good exploration?

  3. When might you prefer Boltzmann exploration over $\epsilon$-greedy? Give a concrete scenario where the relative Q-values matter.

  4. What’s the downside of exploring too much? If some exploration is good, why not maximize it?

Coding Challenges

  1. Implement all four exploration strategies and compare them on GridWorld. Plot learning curves (reward per episode) averaged over 10 runs. Which converges fastest? Which finds the best final policy?

  2. Create an adaptive $\epsilon$-decay that slows down when learning stalls. Hint: monitor the TD error or reward over recent episodes. If performance isn’t improving, decay slower.

  3. Implement a hybrid UCB-$\epsilon$ strategy: Use UCB for action selection, but maintain a minimum exploration probability. Compare with pure UCB on a problem where some states are visited very rarely.

Exploration

  1. Design your own exploration strategy. What principles would guide it? Consider:

    • What information is available? (Q-values, counts, recent rewards)
    • How should exploration change over time?
    • Should it depend on the state?

    Implement it and compare against $\epsilon$-decay on your choice of environment.