The Policy Improvement Theorem
The policy improvement theorem is one of the most beautiful results in reinforcement learning. It tells us that acting greedily with respect to a value function can never make things worse, and usually makes things better.
This simple guarantee is the foundation of both policy iteration and value iteration.
The Greedy Policy
Given a value function $V$, the greedy policy selects the action that maximizes the expected one-step return:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V(s')\big]$$
Think of it as a one-step lookahead. For each action, you ask: “If I take this action, I get some immediate reward. Then I end up in a new state, which has some value according to my current estimates. Which action maximizes this total?”
The greedy policy always picks the action with the highest answer.
```python
def compute_greedy_policy(mdp, V, gamma=0.99):
    """
    Compute the greedy policy with respect to value function V.

    Args:
        mdp: MDP with states, actions(s), transitions(s, a)
        V: dict mapping state -> value
        gamma: discount factor

    Returns:
        policy: dict mapping state -> best action
    """
    policy = {}
    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            policy[s] = None  # No action in terminal states
            continue
        best_action = None
        best_value = float('-inf')
        for a in mdp.actions(s):
            # Compute Q(s, a) = expected reward + discounted future value
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])
            if action_value > best_value:
                best_value = action_value
                best_action = a
        policy[s] = best_action
    return policy
```

The Q-Function Connection
The greedy policy is choosing the action that maximizes $Q(s, a)$, the action-value function.
Recall that the action-value function is:

$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^\pi(s')\big]$$

This is exactly what we are maximizing! The greedy policy is:

$$\pi'(s) = \arg\max_a Q^\pi(s, a)$$

The Q-function tells us: “If I take action $a$ in state $s$, then follow policy $\pi$ forever after, what return do I expect?”
The greedy policy says: “Always take the action with the highest Q-value.”
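In code, this connection is immediate: given a table of Q-values for a state, the greedy choice is a one-line argmax. A minimal sketch with made-up Q-values:

```python
# Hypothetical Q-values for a single state: action -> Q(s, a)
Q_s = {'left': -1.0, 'right': 2.5, 'stay': 0.0}

# The greedy policy simply takes the argmax over actions
greedy_action = max(Q_s, key=Q_s.get)
print(greedy_action)  # -> right
```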
The Theorem
Now we come to the key result.
Let $\pi$ and $\pi'$ be two policies such that for all states $s$:

$$Q^\pi(s, \pi'(s)) \ge V^\pi(s)$$

Then $\pi'$ is at least as good as $\pi$ for all states:

$$V^{\pi'}(s) \ge V^\pi(s)$$

If the inequality is strict for any state, that is, $Q^\pi(s, \pi'(s)) > V^\pi(s)$ for some $s$, then $V^{\pi'}(s) > V^\pi(s)$ there, and $\pi'$ is strictly better overall.
The theorem says: if you find an action that looks better than what the current policy would do (even just for one step), and you take that action instead, you cannot make things worse overall.
This is remarkable. You might worry: “What if being greedy now hurts me later?” The theorem guarantees this never happens. Local greed leads to global improvement.
Why It Works
Proof Sketch:
Start from the assumption: $Q^\pi(s, \pi'(s)) \ge V^\pi(s)$ for all $s$.
This means:

$$V^\pi(s) \le Q^\pi(s, \pi'(s)) = \mathbb{E}\big[R_{t+1} + \gamma V^\pi(S_{t+1}) \mid S_t = s,\, A_t = \pi'(s)\big]$$

Now we use the key trick. The value $V^\pi(S_{t+1})$ for each next state also satisfies the same inequality:

$$V^\pi(S_{t+1}) \le \mathbb{E}\big[R_{t+2} + \gamma V^\pi(S_{t+2}) \mid S_{t+1},\, A_{t+1} = \pi'(S_{t+1})\big]$$

Substituting back:

$$V^\pi(s) \le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 V^\pi(S_{t+2}) \mid S_t = s\big]$$

We can keep unrolling this. After infinite unrolling, we get:

$$V^\pi(s) \le \mathbb{E}_{\pi'}\big[R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots \mid S_t = s\big]$$

The right side is the return we get if we follow $\pi'$ forever, which is exactly $V^{\pi'}(s)$. Hence $V^{\pi'}(s) \ge V^\pi(s)$.
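The inequality chain can be checked numerically. The sketch below uses a hypothetical two-state MDP (states, actions, rewards, and $\gamma$ all invented for illustration): it evaluates $V^\pi$ for a deliberately poor policy, forms the greedy $\pi'$, evaluates $V^{\pi'}$, and confirms both the premise and the conclusion of the theorem:

```python
# Hypothetical 2-state MDP: transitions[s][a] = [(s_next, prob, reward)]
gamma = 0.9
transitions = {
    'A': {'stay': [('A', 1.0, 0.0)], 'go': [('B', 1.0, 1.0)]},
    'B': {'stay': [('B', 1.0, 2.0)], 'go': [('A', 1.0, 0.0)]},
}

def evaluate(policy, iters=2000):
    """Iterative policy evaluation: V(s) <- sum_s' p * (r + gamma * V(s'))."""
    V = {s: 0.0 for s in transitions}
    for _ in range(iters):
        V = {s: sum(p * (r + gamma * V[s2])
                    for s2, p, r in transitions[s][policy[s]])
             for s in transitions}
    return V

def q_value(V, s, a):
    return sum(p * (r + gamma * V[s2]) for s2, p, r in transitions[s][a])

pi = {'A': 'stay', 'B': 'go'}              # a deliberately poor policy
V_pi = evaluate(pi)
pi_greedy = {s: max(transitions[s], key=lambda a: q_value(V_pi, s, a))
             for s in transitions}
V_greedy = evaluate(pi_greedy)

# Premise of the theorem: Q(s, pi'(s)) >= V(s) for all s ...
assert all(q_value(V_pi, s, pi_greedy[s]) >= V_pi[s] - 1e-9 for s in transitions)
# ... and its conclusion: V^{pi'}(s) >= V^{pi}(s) for all s
assert all(V_greedy[s] >= V_pi[s] - 1e-9 for s in transitions)
```

Here the poor policy earns nothing ($V^\pi \equiv 0$), while the greedy policy goes to B and stays there, collecting reward 2 forever.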
A Worked Example
Consider a simple 3-state MDP:

- State A: Can go to B (action 'right') or stay in A (action 'stay')
- State B: Can go to C (action 'right') or go to A (action 'left')
- State C: Terminal

All transitions have reward -1 except the transition B to C, which gives +10. Let $\gamma = 0.9$.
Step 1: Start with a random policy
Say $\pi(A) = \text{'stay'}$ and $\pi(B) = \text{'left'}$.
Step 2: Evaluate this policy
After running policy evaluation (with $V(C) = 0$ for the terminal state), we find:

$$V^\pi(A) = -10, \qquad V^\pi(B) = -10$$

(Under this policy both states loop forever collecting $-1$ per step, and $-1/(1 - 0.9) = -10$.)
Step 3: Compute Q-values
For state A:

$$Q^\pi(A, \text{stay}) = -1 + 0.9\, V^\pi(A) = -10, \qquad Q^\pi(A, \text{right}) = -1 + 0.9\, V^\pi(B) = -10$$

For state B:

$$Q^\pi(B, \text{left}) = -1 + 0.9\, V^\pi(A) = -10, \qquad Q^\pi(B, \text{right}) = +10 + 0.9\, V^\pi(C) = +10$$

Step 4: Construct greedy policy
At B, 'right' is strictly better, so $\pi'(B) = \text{'right'}$. At A the two actions tie; break the tie toward 'right', so $\pi'(A) = \text{'right'}$.
Step 5: Verify improvement
The new policy always goes right. Evaluating it:

$$V^{\pi'}(B) = +10, \qquad V^{\pi'}(A) = -1 + 0.9 \cdot 10 = +8$$

Indeed, $V^{\pi'}(A) = 8 > -10$ and $V^{\pi'}(B) = 10 > -10$.
The greedy policy is strictly better!
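One consistent reconstruction of this example can be checked in a few lines. The sketch below assumes $\gamma = 0.9$, $V(C) = 0$, and an initial policy that stays at A and goes left at B; the transition encoding is invented for illustration:

```python
# Deterministic transitions: transitions[s][a] = (next_state, reward)
gamma = 0.9
transitions = {
    'A': {'stay': ('A', -1.0), 'right': ('B', -1.0)},
    'B': {'left': ('A', -1.0), 'right': ('C', 10.0)},
}

def evaluate(policy, iters=500):
    """In-place policy evaluation; C is terminal and keeps value 0."""
    V = {'A': 0.0, 'B': 0.0, 'C': 0.0}
    for _ in range(iters):
        for s, a in policy.items():
            s2, r = transitions[s][a]
            V[s] = r + gamma * V[s2]
    return V

V_old = evaluate({'A': 'stay', 'B': 'left'})    # the initial policy
V_new = evaluate({'A': 'right', 'B': 'right'})  # the greedy policy
print(V_old)  # A and B both converge to about -10
print(V_new)  # A is about +8, B is +10
```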
When Does Improvement Stop?
The greedy improvement process stops when the greedy policy equals the current policy. At this point:

$$\pi(s) = \arg\max_a Q^\pi(s, a) \quad \text{for all } s$$

This means the current policy is already greedy with respect to its own value function. Such a policy satisfies the Bellman optimality equation and is therefore optimal.
When $\pi' = \pi$ (the policy no longer changes), we have:

$$V^\pi(s) = \max_a \sum_{s'} P(s' \mid s, a)\,\big[R(s, a, s') + \gamma V^\pi(s')\big]$$

This is exactly the Bellman optimality equation! Therefore, $V^\pi = V^*$ and $\pi$ is an optimal policy $\pi^*$.
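The stopping test can be demonstrated directly: alternate evaluation and greedy improvement until the policy stops changing, then the stable policy is greedy with respect to its own value function. A sketch on the 3-state MDP above (assuming $\gamma = 0.9$ and an invented deterministic transition encoding):

```python
gamma = 0.9
# transitions[s][a] = (next_state, reward); C is terminal with value 0
transitions = {
    'A': {'stay': ('A', -1.0), 'right': ('B', -1.0)},
    'B': {'left': ('A', -1.0), 'right': ('C', 10.0)},
}

def evaluate(policy, iters=500):
    V = {'A': 0.0, 'B': 0.0, 'C': 0.0}
    for _ in range(iters):
        for s, a in policy.items():
            s2, r = transitions[s][a]
            V[s] = r + gamma * V[s2]
    return V

def greedy(V):
    """One-step lookahead: pick the action maximizing r + gamma * V(s')."""
    return {s: max(acts, key=lambda a: acts[a][1] + gamma * V[acts[a][0]])
            for s, acts in transitions.items()}

policy = {'A': 'stay', 'B': 'left'}
while True:
    V = evaluate(policy)
    new_policy = greedy(V)
    if new_policy == policy:  # greedy w.r.t. its own values => Bellman optimality
        break
    policy = new_policy

print(policy)  # -> {'A': 'right', 'B': 'right'}
```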
Why This Is Remarkable
Consider why this theorem is surprising:
- Greedy is not usually optimal. In many settings, being greedy leads to suboptimal outcomes. Eating dessert first feels great now but is not optimal for health. The policy improvement theorem tells us that in MDPs, acting greedily with respect to an accurate value function never makes things worse, and acting greedily with respect to the optimal value function actually IS optimal.
- Local changes give global improvement. We only look one step ahead when computing the greedy action. Yet the improvement is not just one step; the new policy is better from every state, for all time.
- We cannot get stuck. Either the greedy policy is different (and better), or it is the same (and optimal). There is no local optimum trap.
Stochastic Policies
So far we have discussed deterministic greedy policies. What about stochastic policies?
For stochastic policies, the improvement theorem still holds. If $\pi'$ is greedy with respect to $V^\pi$, meaning:

$$\sum_a \pi'(a \mid s)\, Q^\pi(s, a) = \max_a Q^\pi(s, a) \quad \text{for all } s$$

then $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$.
When multiple actions are tied for the maximum Q-value, we can break ties arbitrarily (even stochastically) without affecting the improvement guarantee.
In practice, we usually work with deterministic policies for DP because they are simpler and just as good. For any MDP, there exists a deterministic optimal policy. Stochastic policies become important in settings with partial observability or when we need exploration.
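Tie-breaking can be made explicit in code: collect every action within a small tolerance of the maximum Q-value, then choose among them however you like. A hypothetical helper (the Q-values are made up):

```python
import random

def greedy_actions(Q_s, tol=1e-9):
    """All actions tied (within tol) for the maximum Q-value in one state."""
    best = max(Q_s.values())
    return [a for a, q in Q_s.items() if q >= best - tol]

Q_s = {'up': 1.0, 'down': 1.0, 'left': 0.5}
ties = greedy_actions(Q_s)
print(sorted(ties))  # -> ['down', 'up']

# Any choice among the tied actions (even a random one) preserves
# the improvement guarantee, since every tied action attains the max Q-value.
action = random.choice(ties)
```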
Implementation: Policy Improvement Step
```python
def policy_improvement(mdp, V, policy, gamma=0.99):
    """
    Perform one step of policy improvement.

    Args:
        mdp: MDP object
        V: Current value function (dict: state -> value)
        policy: Current policy (dict: state -> action)
        gamma: Discount factor

    Returns:
        new_policy: Improved policy
        is_stable: True if policy did not change
    """
    is_stable = True
    new_policy = {}
    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            new_policy[s] = None
            continue
        old_action = policy.get(s)
        # Find the greedy action
        best_action = None
        best_value = float('-inf')
        for a in mdp.actions(s):
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])
            if action_value > best_value:
                best_value = action_value
                best_action = a
        new_policy[s] = best_action
        # Check if policy changed
        if old_action != best_action:
            is_stable = False
    return new_policy, is_stable
```
```python
def compute_q_values(mdp, V, gamma=0.99):
    """
    Compute Q(s, a) for all state-action pairs.
    Useful for understanding what the greedy policy sees.
    """
    Q = {}
    for s in mdp.states:
        Q[s] = {}
        for a in mdp.actions(s):
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])
            Q[s][a] = action_value
    return Q
```

Common Misconceptions
Misconception 1: “The greedy policy is only one step better”
Wrong. The greedy policy is better for all future time, from every state. The one-step lookahead identifies improvements that compound over the entire trajectory.
Misconception 2: “We need to know the optimal values to improve”
Wrong. We can improve with respect to any policy's value function. Even if $V^\pi$ is far from optimal, the greedy policy with respect to $V^\pi$ is guaranteed to be at least as good as the policy $\pi$ that generated it.
Misconception 3: “Multiple iterations might lead to worse policies”
Wrong. The policy improvement theorem guarantees monotonic improvement. Each iteration is at least as good as the previous one, until we reach optimality.
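The monotone guarantee can be seen concretely. The sketch below runs several evaluation-improvement rounds on the 3-state MDP from the worked example (assuming $\gamma = 0.9$ and an invented deterministic transition encoding) and checks that no state's value ever decreases between rounds:

```python
gamma = 0.9
# transitions[s][a] = (next_state, reward); C is terminal with value 0
transitions = {
    'A': {'stay': ('A', -1.0), 'right': ('B', -1.0)},
    'B': {'left': ('A', -1.0), 'right': ('C', 10.0)},
}

def evaluate(policy, iters=500):
    V = {'A': 0.0, 'B': 0.0, 'C': 0.0}
    for _ in range(iters):
        for s, a in policy.items():
            s2, r = transitions[s][a]
            V[s] = r + gamma * V[s2]
    return V

def greedy(V):
    return {s: max(acts, key=lambda a: acts[a][1] + gamma * V[acts[a][0]])
            for s, acts in transitions.items()}

policy = {'A': 'stay', 'B': 'left'}
history = []
for _ in range(5):
    V = evaluate(policy)
    history.append(V)
    policy = greedy(V)

# Values never decrease from one iteration to the next, in any state
for V_prev, V_next in zip(history, history[1:]):
    assert all(V_next[s] >= V_prev[s] - 1e-9 for s in V_prev)
```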
Summary
Key Takeaways:
- The greedy policy selects $\pi'(s) = \arg\max_a Q^\pi(s, a)$ at each state
- The policy improvement theorem guarantees: if $\pi'$ is greedy w.r.t. $V^\pi$, then $V^{\pi'}(s) \ge V^\pi(s)$ for all $s$
- Improvement stops when the policy equals its own greedy policy, which means optimality
- This is the foundation for both policy iteration and value iteration
Now that we understand why improvement works, we can put it together with evaluation to build complete algorithms. The next section introduces Policy Iteration, which alternates between evaluation and improvement until convergence.