Dynamic Programming • Part 1 of 3

The Policy Improvement Theorem

Why greedy improvement always works


The policy improvement theorem is one of the most beautiful results in reinforcement learning. It tells us that acting greedily with respect to a value function can never make things worse, and usually makes things better.

This simple guarantee is the foundation of both policy iteration and value iteration.

The Greedy Policy

📖Greedy Policy

Given a value function $V$, the greedy policy selects the action that maximizes the expected one-step return:

$$\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$$

Think of it as a one-step lookahead. For each action, you ask: “If I take this action, I get some immediate reward. Then I end up in a new state, which has some value according to my current estimates. Which action maximizes this total?”

The greedy policy always picks the action with the highest answer.

</>Implementation
def compute_greedy_policy(mdp, V, gamma=0.99):
    """
    Compute the greedy policy with respect to value function V.

    Args:
        mdp: MDP with states, actions(s), transitions(s, a)
        V: dict mapping state -> value
        gamma: discount factor

    Returns:
        policy: dict mapping state -> best action
    """
    policy = {}

    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            policy[s] = None  # No action in terminal states
            continue

        best_action = None
        best_value = float('-inf')

        for a in mdp.actions(s):
            # Compute Q(s, a) = expected reward + discounted future value
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])

            if action_value > best_value:
                best_value = action_value
                best_action = a

        policy[s] = best_action

    return policy

The Q-Function Connection

The greedy policy is choosing the action that maximizes $Q^\pi(s, a)$, the action-value function.

Mathematical Details

Recall that the action-value function is:

$$Q^\pi(s, a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')$$

This is exactly what we are maximizing! The greedy policy is:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$

The Q-function tells us: “If I take action $a$ in state $s$, then follow policy $\pi$ forever after, what return do I expect?”

The greedy policy says: “Always take the action with the highest Q-value.”
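Once the Q-values are in hand, extracting the greedy policy is a one-liner. A minimal sketch with a made-up Q-table (the values below are purely illustrative; in practice they come from $Q^\pi(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s')$):

```python
# Hypothetical Q-values for two states and two actions each.
Q = {
    's0': {'left': 1.2, 'right': 3.4},
    's1': {'left': 0.7, 'right': 0.5},
}

# Greedy policy: in each state, pick the action with the highest Q-value.
greedy = {s: max(actions, key=actions.get) for s, actions in Q.items()}
# greedy == {'s0': 'right', 's1': 'left'}
```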

The Theorem

Now we come to the key result.

📖Policy Improvement Theorem

Let $\pi$ and $\pi'$ be two policies such that for all states $s$:

$$Q^\pi(s, \pi'(s)) \geq V^\pi(s)$$

Then $\pi'$ is at least as good as $\pi$ for all states:

$$V^{\pi'}(s) \geq V^\pi(s) \text{ for all } s$$

If the inequality is strict for any state where $\pi'(s) \neq \pi(s)$, then $\pi'$ is strictly better overall.

The theorem says: if you find an action that looks better than what the current policy would do (even just for one step), and you take that action instead, you cannot make things worse overall.

This is remarkable. You might worry: “What if being greedy now hurts me later?” The theorem guarantees this never happens. Local greed leads to global improvement.

Why It Works

Mathematical Details

Proof Sketch:

Start from the assumption: $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$ for all $s$.

This means:

$$V^\pi(s) \leq Q^\pi(s, \pi'(s)) = R(s, \pi'(s)) + \gamma \sum_{s'} P(s'|s, \pi'(s)) V^\pi(s')$$

Now we use the key trick. The value $V^\pi(s')$ for each next state $s'$ also satisfies the same inequality:

$$V^\pi(s') \leq Q^\pi(s', \pi'(s')) = R(s', \pi'(s')) + \gamma \sum_{s''} P(s''|s', \pi'(s')) V^\pi(s'')$$

Substituting back:

$$V^\pi(s) \leq R(s, \pi'(s)) + \gamma \sum_{s'} P(s'|s, \pi'(s)) \left[ R(s', \pi'(s')) + \gamma \sum_{s''} P(s''|s', \pi'(s')) V^\pi(s'') \right]$$

We can keep unrolling this. After infinite unrolling, we get:

$$V^\pi(s) \leq \mathbb{E}_{\pi'}\left[ R_1 + \gamma R_2 + \gamma^2 R_3 + \cdots \mid S_0 = s \right] = V^{\pi'}(s)$$

The left side is bounded by what we get if we follow $\pi'$ forever, which is exactly $V^{\pi'}(s)$.
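The unrolling argument can be checked numerically. The sketch below uses a made-up two-state MDP where the premise $Q^\pi(s, \pi'(s)) \geq V^\pi(s)$ holds; it repeatedly applies the Bellman backup for $\pi'$ starting from $V^\pi$ and verifies that every iterate is at least the previous one, converging to $V^{\pi'}$:

```python
gamma = 0.9

# V^pi for a hypothetical two-state MDP (no terminal states).
V_pi = {'s0': 1.0, 's1': 2.0}

# Deterministic pi': s0 -> s1 with reward 0, s1 -> s1 with reward 0.5.
# The premise holds: Q^pi(s0, pi'(s0)) = 0.0 + 0.9 * 2.0 = 1.8 >= 1.0
#                    Q^pi(s1, pi'(s1)) = 0.5 + 0.9 * 2.0 = 2.3 >= 2.0
step = {'s0': ('s1', 0.0), 's1': ('s1', 0.5)}

v = dict(V_pi)
for _ in range(200):
    new_v = {s: r + gamma * v[s2] for s, (s2, r) in step.items()}
    # Monotonicity: each unrolling step can only raise the bound.
    assert all(new_v[s] >= v[s] - 1e-12 for s in v)
    v = new_v

# The limit is V^{pi'}: v['s1'] -> 0.5 / (1 - 0.9) = 5.0, v['s0'] -> 0.9 * 5.0 = 4.5
```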

A Worked Example

📌Improving a Random Policy

Consider a simple 3-state MDP:

State A: Can go to B (action 'right') or stay in A (action 'stay')
State B: Can go to C (action 'right') or go to A (action 'left')
State C: Terminal with reward +10

All transitions have reward -1 except the transition from B to C, which gives +10. Let $\gamma = 0.9$.

Step 1: Start with a random policy

$\pi(\text{right} \mid A) = \pi(\text{stay} \mid A) = 0.5$
$\pi(\text{right} \mid B) = \pi(\text{left} \mid B) = 0.5$

Step 2: Evaluate this policy

After running policy evaluation to convergence, we find:

  • $V^\pi(A) \approx 2.95$
  • $V^\pi(B) \approx 5.83$
  • $V^\pi(C) = 0$

Step 3: Compute Q-values

For state A:

  • $Q^\pi(A, \text{right}) = -1 + 0.9 \times 5.83 \approx 4.25$
  • $Q^\pi(A, \text{stay}) = -1 + 0.9 \times 2.95 \approx 1.66$

For state B:

  • $Q^\pi(B, \text{right}) = 10 + 0.9 \times 0 = 10$
  • $Q^\pi(B, \text{left}) = -1 + 0.9 \times 2.95 \approx 1.66$

Step 4: Construct greedy policy

$\pi'(A) = \text{right}$ (since $4.25 > 1.66$)
$\pi'(B) = \text{right}$ (since $10 > 1.66$)

Step 5: Verify improvement

The new policy always goes right. Evaluating it:

  • $V^{\pi'}(B) = 10 + 0.9 \times 0 = 10$
  • $V^{\pi'}(A) = -1 + 0.9 \times V^{\pi'}(B) = -1 + 0.9 \times 10 = 8$

Indeed, $V^{\pi'}(A) = 8 > 2.95 = V^\pi(A)$ and $V^{\pi'}(B) = 10 > 5.83 = V^\pi(B)$.

The greedy policy is strictly better!
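The whole example can be reproduced in a few lines. A self-contained sketch (state C is terminal, so it is omitted from the transition table and pinned at value 0; the random policy is evaluated by iterating the Bellman backup):

```python
gamma = 0.9

# step[s][a] = (next_state, reward); C is terminal with V(C) = 0.
step = {
    'A': {'right': ('B', -1), 'stay': ('A', -1)},
    'B': {'right': ('C', 10), 'left': ('A', -1)},
}

# Step 2: evaluate the uniform random policy (prob 0.5 per action).
V = {'A': 0.0, 'B': 0.0, 'C': 0.0}
for _ in range(1000):
    V = {**V, **{s: sum(0.5 * (r + gamma * V[s2]) for s2, r in acts.values())
                 for s, acts in step.items()}}
# V['A'] ~ 2.95, V['B'] ~ 5.83

# Steps 3-4: compute Q-values and the greedy policy.
Q = {s: {a: r + gamma * V[s2] for a, (s2, r) in acts.items()}
     for s, acts in step.items()}
greedy = {s: max(Q[s], key=Q[s].get) for s in Q}
# greedy == {'A': 'right', 'B': 'right'}

# Step 5: evaluate the greedy policy and confirm the improvement.
V_new = {'A': 0.0, 'B': 0.0, 'C': 0.0}
for _ in range(1000):
    updates = {}
    for s in step:
        s2, r = step[s][greedy[s]]
        updates[s] = r + gamma * V_new[s2]
    V_new.update(updates)
# V_new['A'] == 8.0, V_new['B'] == 10.0: better in every state.
assert V_new['A'] > V['A'] and V_new['B'] > V['B']
```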

When Does Improvement Stop?

💡Tip

The greedy improvement process stops when the greedy policy equals the current policy. At this point:

$$\pi'(s) = \arg\max_a Q^\pi(s, a) = \pi(s) \text{ for all } s$$

This means the current policy is already greedy with respect to its own value function. Such a policy satisfies the Bellman optimality equation and is therefore optimal.

Mathematical Details

When $\pi' = \pi$ (the policy no longer changes), we have:

$$V^\pi(s) = \max_a Q^\pi(s, a) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right]$$

This is exactly the Bellman optimality equation! Therefore, $V^\pi = V^*$ and $\pi = \pi^*$.
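The stopping condition can be tested numerically: at a fixed point, $V(s)$ equals $\max_a Q(s,a)$ in every state. A sketch using the three-state example from the worked section, where the always-right policy is optimal:

```python
gamma = 0.9

# step[s][a] = (next_state, reward); C is terminal with value 0.
step = {
    'A': {'right': ('B', -1), 'stay': ('A', -1)},
    'B': {'right': ('C', 10), 'left': ('A', -1)},
}

# Value function of the always-right policy.
V = {'A': 8.0, 'B': 10.0, 'C': 0.0}

# Bellman optimality residual: max over states of |V(s) - max_a Q(s, a)|.
residual = max(
    abs(V[s] - max(r + gamma * V[s2] for s2, r in acts.values()))
    for s, acts in step.items()
)
# residual == 0.0: the policy is greedy w.r.t. its own value function, so it is optimal.
```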

Why This Is Remarkable

Consider why this theorem is surprising:

  1. Greedy is not usually optimal. In many settings, being greedy leads to suboptimal outcomes: eating dessert first feels great now but is not optimal for health. In MDPs, the policy improvement theorem guarantees that one-step greed with respect to a value function never hurts, and acting greedily with respect to the optimal value function is exactly optimal.

  2. Local changes give global improvement. We only look one step ahead when computing the greedy action. Yet the improvement is not just one step; the new policy is better from every state for all time.

  3. We cannot get stuck. Either the greedy policy is different (and better), or it is the same (and optimal). There is no local optimum trap.

Stochastic Policies

So far we have discussed deterministic greedy policies. What about stochastic policies?

Mathematical Details

For stochastic policies, the improvement theorem still holds. If $\pi'$ is greedy with respect to $V^\pi$, meaning:

$$\pi'(a|s) > 0 \text{ only if } a \in \arg\max_{a'} Q^\pi(s, a')$$

Then $V^{\pi'}(s) \geq V^\pi(s)$ for all $s$.

When multiple actions are tied for the maximum Q-value, we can break ties arbitrarily (even stochastically) without affecting the improvement guarantee.

ℹ️Note

In practice, we usually work with deterministic policies for DP because they are simpler and just as good. For any MDP, there exists a deterministic optimal policy. Stochastic policies become important in settings with partial observability or when we need exploration.
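Tie-breaking can be sketched as follows: collect every action whose Q-value is within a tolerance of the maximum, then spread probability uniformly over them. The Q-table below is made up for illustration, with a deliberate tie in one state:

```python
# Hypothetical Q-values; 'up' and 'down' are tied in state s0.
Q = {'s0': {'up': 2.0, 'down': 2.0, 'left': 1.0},
     's1': {'up': 0.0, 'down': 3.0}}

def greedy_stochastic(q_s, tol=1e-9):
    """Uniform distribution over all actions tied for the max Q-value."""
    best = max(q_s.values())
    winners = [a for a, v in q_s.items() if v >= best - tol]
    return {a: 1.0 / len(winners) for a in winners}

policy = {s: greedy_stochastic(q_s) for s, q_s in Q.items()}
# policy['s0'] == {'up': 0.5, 'down': 0.5}; policy['s1'] == {'down': 1.0}
```

Any such tie-breaking scheme keeps the improvement guarantee, since all probability mass stays on maximizing actions.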

Implementation: Policy Improvement Step

</>Implementation
def policy_improvement(mdp, V, policy, gamma=0.99):
    """
    Perform one step of policy improvement.

    Args:
        mdp: MDP object
        V: Current value function (dict: state -> value)
        policy: Current policy (dict: state -> action)
        gamma: Discount factor

    Returns:
        new_policy: Improved policy
        is_stable: True if policy did not change
    """
    is_stable = True
    new_policy = {}

    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            new_policy[s] = None
            continue

        old_action = policy.get(s)

        # Find the greedy action
        best_action = None
        best_value = float('-inf')

        for a in mdp.actions(s):
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])

            if action_value > best_value:
                best_value = action_value
                best_action = a

        new_policy[s] = best_action

        # Check if policy changed
        if old_action != best_action:
            is_stable = False

    return new_policy, is_stable


def compute_q_values(mdp, V, gamma=0.99):
    """
    Compute Q(s, a) for all state-action pairs.

    Useful for understanding what the greedy policy sees.
    """
    Q = {}

    for s in mdp.states:
        Q[s] = {}
        for a in mdp.actions(s):
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])
            Q[s][a] = action_value

    return Q
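The functions above assume an MDP object exposing `states`, `actions(s)`, `transitions(s, a)`, and optionally `terminal_states`. A minimal class satisfying that interface, built from the three-state example used earlier (the class name and internal table layout are illustrative choices, not part of the text above):

```python
class SimpleMDP:
    """Deterministic three-state MDP: A --right--> B --right--> C (terminal)."""

    def __init__(self):
        self.states = ['A', 'B', 'C']
        self.terminal_states = {'C'}
        # (state, action) -> (next_state, reward)
        self._table = {
            ('A', 'right'): ('B', -1), ('A', 'stay'): ('A', -1),
            ('B', 'right'): ('C', 10), ('B', 'left'): ('A', -1),
        }

    def actions(self, s):
        return [a for (s0, a) in self._table if s0 == s]

    def transitions(self, s, a):
        # Deterministic MDP: a single (next_state, probability, reward) triple.
        s_next, reward = self._table[(s, a)]
        return [(s_next, 1.0, reward)]

mdp = SimpleMDP()
# e.g. compute_greedy_policy(mdp, V={'A': 8.0, 'B': 10.0, 'C': 0.0}, gamma=0.9)
# would return {'A': 'right', 'B': 'right', 'C': None}.
```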


Summary

ℹ️Note

Key Takeaways:

  1. The greedy policy selects $\arg\max_a Q^\pi(s, a)$ at each state
  2. The policy improvement theorem guarantees: if $\pi'$ is greedy w.r.t. $V^\pi$, then $V^{\pi'} \geq V^\pi$
  3. Improvement stops when the policy equals its own greedy policy, which means optimality
  4. This is the foundation for both policy iteration and value iteration

Now that we understand why improvement works, we can put it together with evaluation to build complete algorithms. The next section introduces Policy Iteration, which alternates between evaluation and improvement until convergence.