Policy Improvement
What You'll Learn
- Explain the policy improvement theorem and why it guarantees progress
- Implement policy iteration from scratch
- Implement value iteration from scratch
- Compare policy iteration and value iteration on the same MDP
- Solve simple MDPs to find optimal policies using DP methods
We can now evaluate any policy, computing $v_\pi$ for any choice of $\pi$. But our goal is not just to evaluate policies. We want to find the best policy.
The remarkable insight of this chapter is that once we have $v_\pi$, we can construct a better policy. And by alternating between evaluation and improvement, we can climb all the way to the optimum.
Value Iteration Visualization
Watch value iteration solve the GridWorld problem step by step.
The Core Insight
Policy improvement is the process of constructing a new policy that is at least as good as the current one, by acting greedily with respect to the current value function.
Suppose you know $v_\pi(s)$ for all states. From any state $s$, you can ask: "What if I took a different action just once, then followed $\pi$ afterward? Would I do better?"
The greedy action is the one that maximizes this one-step lookahead. The policy improvement theorem tells us something magical: if we switch to always taking the greedy action, we are guaranteed to do at least as well as before, and usually better.
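As a minimal sketch, the one-step lookahead can be written as a small function. The chain MDP below (states 0 to 3, state 3 terminal, reward +1 for reaching it) is a hypothetical example invented for illustration:

```python
GAMMA = 0.9
N_STATES = 4                      # states 0, 1, 2; state 3 is terminal

def step(s, a):
    """Deterministic transition: a = -1 (left) or +1 (right).
    Reward is +1 for entering the terminal state 3, else 0."""
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

def greedy_action(s, v):
    """One-step lookahead: argmax over actions of r + gamma * v(s')."""
    best_a, best_q = None, float("-inf")
    for a in (-1, +1):
        s2, r = step(s, a)
        q = r + GAMMA * v[s2]
        if q > best_q:
            best_a, best_q = a, q
    return best_a

v = [0.0, 0.0, 0.0, 0.0]          # crude value estimates from some policy
print(greedy_action(2, v))        # right (+1) reaches the goal: prints 1
```

Even with all-zero value estimates, the lookahead sees the immediate reward for entering the goal and prefers moving right from state 2.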
This leads to two powerful algorithms:
- Policy Iteration: Alternate between full policy evaluation and greedy improvement
- Value Iteration: Combine evaluation and improvement into a single operation
Both converge to the optimal policy $\pi_*$ and the optimal value function $v_*$.
Chapter Overview
The Policy Improvement Theorem
The theoretical foundation: why greedy improvement always works
Policy Iteration
Alternating evaluation and improvement until convergence
Value Iteration
Finding optimal values directly with a single update rule
Two Paths to the Same Destination
Policy Iteration
- Evaluate the current policy completely (compute $v_\pi$)
- Improve: make the policy greedy w.r.t. $v_\pi$
- Repeat until policy stops changing
Typically converges in very few policy iterations, but each iteration requires full evaluation.
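The loop above can be sketched in a few lines. This is one possible implementation on a hypothetical 1D chain MDP (states 0 to 3, state 3 terminal, actions left/right, reward +1 for reaching the goal); the function names are illustrative:

```python
GAMMA = 0.9
N = 4                                  # state 3 is terminal

def step(s, a):
    """Deterministic chain: a = -1 (left) or +1 (right)."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0)

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation: sweep until values stop changing."""
    v = [0.0] * N
    while True:
        delta = 0.0
        for s in range(N - 1):         # skip the terminal state
            s2, r = step(s, policy[s])
            new = r + GAMMA * v[s2]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v

def improve(v):
    """Greedy policy w.r.t. v: one-step lookahead at every state."""
    return [max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
            for s in range(N)]

def policy_iteration():
    policy = [-1] * N                  # start with "always move left"
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:       # policy stable: it is optimal
            return policy, v
        policy = new_policy

policy, v = policy_iteration()
print(policy[:3])                      # [1, 1, 1]: always move right
```

Starting from "always left", each improvement step propagates the greedy choice one state further from the goal, and the loop terminates as soon as improvement leaves the policy unchanged.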
Value Iteration
- Apply Bellman optimality backup to all states
- Repeat until values converge
- Extract greedy policy at the end
Simpler per-iteration, but may require many more iterations to converge.
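Value iteration folds the max over actions directly into the sweep. A sketch on the same style of hypothetical chain MDP (states 0 to 3, state 3 terminal, reward +1 for reaching the goal; names are illustrative):

```python
GAMMA = 0.9
N = 4                                               # state 3 is terminal

def step(s, a):
    """Deterministic chain: a = -1 (left) or +1 (right)."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0)

def value_iteration(theta=1e-8):
    v = [0.0] * N
    while True:
        delta = 0.0
        for s in range(N - 1):                      # skip the terminal state
            # Bellman optimality backup: max over actions in one sweep
            best = max(step(s, a)[1] + GAMMA * v[step(s, a)[0]]
                       for a in (-1, +1))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract the greedy policy once the values have converged
    policy = [max((-1, +1),
                  key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
              for s in range(N - 1)]
    return v, policy

v, policy = value_iteration()
print(policy)                                       # [1, 1, 1]
print(round(v[0], 4))                               # 0.81
```

Note that no explicit policy is maintained during the sweeps; the policy appears only at the end, read off greedily from the converged values.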
Both algorithms find the same optimal policy. They differ in how they organize computation. The best choice depends on the problem structure and computational constraints.
The Greedy Policy
The key operation in both algorithms is constructing a greedy policy from a value function.
Given a value function $v$, the greedy policy selects:

$$\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v(s')\bigr]$$

This picks the action that maximizes expected immediate reward plus discounted future value.
The greedy policy answers: "Looking one step ahead and using my current value estimates for what comes after, which action looks best?"
It is called "greedy" because it always takes what looks best right now. Surprisingly, this myopic approach leads to globally optimal behavior when combined with accurate value estimates.
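For a stochastic model, the expectation over next states and rewards must be taken explicitly. A sketch using a tabular model mapping `(state, action)` to `(probability, next_state, reward)` triples; the tiny two-state MDP here is hypothetical:

```python
GAMMA = 0.9

# Hypothetical MDP: in state 0, action "a" reaches state 1 with prob 0.8
# (reward 1) and stays put with prob 0.2; action "b" stays for sure
# (reward 0.5). State 1 is absorbing with zero reward.
P = {
    (0, "a"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (0, "b"): [(1.0, 0, 0.5)],
    (1, "a"): [(1.0, 1, 0.0)],
    (1, "b"): [(1.0, 1, 0.0)],
}
ACTIONS = ("a", "b")

def q_value(s, a, v):
    """Expected one-step return: sum over (s', r) of p * [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def greedy_policy(v, states=(0, 1)):
    """Pick, in every state, the action with the largest one-step lookahead."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, v)) for s in states}

v = {0: 0.0, 1: 0.0}                   # any value estimates will do
print(greedy_policy(v))                # {0: 'a', 1: 'a'}
```

With zero value estimates, action "a" in state 0 has expected return 0.8, beating the certain 0.5 of action "b", so the greedy policy prefers "a".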
Why This Matters
Dynamic Programming gives us exact solutions to MDPs. When we have access to the complete model (all transitions and rewards), DP finds the true optimal policy.
But DP has limitations:
DP requires the full model. In most real-world problems, we do not know the dynamics $p(s', r \mid s, a)$ precisely. We might have a simulator, or we might only have real experience.
DP requires enumeration. We must loop over all states and all actions. This becomes impractical when the state space is large (millions of states) or continuous.
These limitations motivate reinforcement learning: methods that learn from samples without requiring the full model, and that can use function approximation to handle large state spaces.
Despite these limitations, DP is essential:
- It provides the theoretical foundation for understanding RL
- Many RL algorithms are approximate versions of DP methods
- For small MDPs, DP gives the ground truth to compare against
Key Questions We Will Answer
- Why does greedy improvement never make things worse?
- How many policy iterations are typically needed?
- When should you use policy iteration vs. value iteration?
- What is the relationship between these algorithms and Q-learning?
Key Takeaways
- The policy improvement theorem guarantees that greedy policies are never worse
- Policy iteration alternates: evaluate → improve → evaluate → improve → …
- Value iteration applies the Bellman optimality backup directly to find $v_*$
- Both converge to the optimal policy for finite MDPs
- DP is foundational but requires knowing the full model