Policy Improvement
What You'll Learn
- Explain the policy improvement theorem and why it guarantees progress
- Implement policy iteration from scratch
- Implement value iteration from scratch
- Compare policy iteration and value iteration on the same MDP
- Solve simple MDPs to find optimal policies using DP methods
We can now evaluate any policy, computing $v_\pi$ for any choice of $\pi$. But our goal is not just to evaluate policies. We want to find the best policy.
The remarkable insight of this chapter is that once we have $v_\pi$, we can construct a better policy. And by alternating between evaluation and improvement, we can climb all the way to the optimum.
Value Iteration Visualization
Watch value iteration solve the GridWorld problem step by step.
The Core Insight
Policy improvement is the process of constructing a new policy that is at least as good as the current one, by acting greedily with respect to the current value function.
Suppose you know $v_\pi(s)$ for all states. From any state $s$, you can ask: "What if I took a different action just once, then followed $\pi$ afterward? Would I do better?"
The greedy action is the one that maximizes this one-step lookahead. The policy improvement theorem tells us something magical: if we switch to always taking the greedy action, we are guaranteed to do at least as well as before, and usually better.
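As a minimal sketch, the one-step lookahead can be written as a small function. The chain MDP below (states 0 to 3, state 3 terminal, reward +1 for reaching it) is a hypothetical example invented for illustration:

```python
GAMMA = 0.9
N_STATES = 4                      # states 0, 1, 2; state 3 is terminal

def step(s, a):
    """Deterministic transition: a = -1 (left) or +1 (right).
    Reward is +1 for entering the terminal state 3, else 0."""
    s2 = max(0, min(N_STATES - 1, s + a))
    r = 1.0 if s2 == N_STATES - 1 else 0.0
    return s2, r

def greedy_action(s, v):
    """One-step lookahead: argmax over actions of r + gamma * v(s')."""
    best_a, best_q = None, float("-inf")
    for a in (-1, +1):
        s2, r = step(s, a)
        q = r + GAMMA * v[s2]
        if q > best_q:
            best_a, best_q = a, q
    return best_a

v = [0.0, 0.0, 0.0, 0.0]          # crude value estimates from some policy
print(greedy_action(2, v))        # right (+1) reaches the goal: prints 1
```

Even with all-zero value estimates, the lookahead sees the immediate reward for entering the goal and prefers moving right from state 2.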
This leads to two powerful algorithms:
- Policy Iteration: Alternate between full policy evaluation and greedy improvement
- Value Iteration: Combine evaluation and improvement into a single operation
Both converge to the optimal policy $\pi_*$ and the optimal value function $v_*$.
Chapter Overview
The Policy Improvement Theorem
The theoretical foundation: why greedy improvement always works
Policy Iteration
Alternating evaluation and improvement until convergence
Value Iteration
Finding optimal values directly with a single update rule
Two Paths to the Same Destination
Policy Iteration
- Evaluate the current policy completely (compute $v_\pi$)
- Improve: make the policy greedy w.r.t. $v_\pi$
- Repeat until policy stops changing
Typically converges in very few policy iterations, but each iteration requires full evaluation.
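The loop above can be sketched in a few lines. This is one possible implementation on a hypothetical 1D chain MDP (states 0 to 3, state 3 terminal, actions left/right, reward +1 for reaching the goal); the function names are illustrative:

```python
GAMMA = 0.9
N = 4                                  # state 3 is terminal

def step(s, a):
    """Deterministic chain: a = -1 (left) or +1 (right)."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0)

def evaluate(policy, theta=1e-8):
    """Iterative policy evaluation: sweep until values stop changing."""
    v = [0.0] * N
    while True:
        delta = 0.0
        for s in range(N - 1):         # skip the terminal state
            s2, r = step(s, policy[s])
            new = r + GAMMA * v[s2]
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < theta:
            return v

def improve(v):
    """Greedy policy w.r.t. v: one-step lookahead at every state."""
    return [max((-1, +1), key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
            for s in range(N)]

def policy_iteration():
    policy = [-1] * N                  # start with "always move left"
    while True:
        v = evaluate(policy)
        new_policy = improve(v)
        if new_policy == policy:       # policy stable: it is optimal
            return policy, v
        policy = new_policy

policy, v = policy_iteration()
print(policy[:3])                      # [1, 1, 1]: always move right
```

Starting from "always left", each improvement step propagates the greedy choice one state further from the goal, and the loop terminates as soon as improvement leaves the policy unchanged.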
Value Iteration
- Apply Bellman optimality backup to all states
- Repeat until values converge
- Extract greedy policy at the end
Simpler per-iteration, but may require many more iterations to converge.
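Value iteration folds the max over actions directly into the sweep. A sketch on the same style of hypothetical chain MDP (states 0 to 3, state 3 terminal, reward +1 for reaching the goal; names are illustrative):

```python
GAMMA = 0.9
N = 4                                               # state 3 is terminal

def step(s, a):
    """Deterministic chain: a = -1 (left) or +1 (right)."""
    s2 = max(0, min(N - 1, s + a))
    return s2, (1.0 if s2 == N - 1 else 0.0)

def value_iteration(theta=1e-8):
    v = [0.0] * N
    while True:
        delta = 0.0
        for s in range(N - 1):                      # skip the terminal state
            # Bellman optimality backup: max over actions in one sweep
            best = max(step(s, a)[1] + GAMMA * v[step(s, a)[0]]
                       for a in (-1, +1))
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < theta:
            break
    # Extract the greedy policy once the values have converged
    policy = [max((-1, +1),
                  key=lambda a: step(s, a)[1] + GAMMA * v[step(s, a)[0]])
              for s in range(N - 1)]
    return v, policy

v, policy = value_iteration()
print(policy)                                       # [1, 1, 1]
print(round(v[0], 4))                               # 0.81
```

Note that no explicit policy is maintained during the sweeps; the policy appears only at the end, read off greedily from the converged values.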
Both algorithms find the same optimal policy. They differ in how they organize computation. The best choice depends on the problem structure and computational constraints.
The Greedy Policy
The key operation in both algorithms is constructing a greedy policy from a value function.
Given a value function $v$, the greedy policy selects:

$$\pi'(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma v(s')\bigr]$$

This picks the action that maximizes expected immediate reward plus discounted future value.
The greedy policy answers: "Looking one step ahead and using my current value estimates for what comes after, which action looks best?"
It is called "greedy" because it always takes what looks best right now. Surprisingly, this myopic approach leads to globally optimal behavior when combined with accurate value estimates.
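For a stochastic model, the expectation over next states and rewards must be taken explicitly. A sketch using a tabular model mapping `(state, action)` to `(probability, next_state, reward)` triples; the tiny two-state MDP here is hypothetical:

```python
GAMMA = 0.9

# Hypothetical MDP: in state 0, action "a" reaches state 1 with prob 0.8
# (reward 1) and stays put with prob 0.2; action "b" stays for sure
# (reward 0.5). State 1 is absorbing with zero reward.
P = {
    (0, "a"): [(0.8, 1, 1.0), (0.2, 0, 0.0)],
    (0, "b"): [(1.0, 0, 0.5)],
    (1, "a"): [(1.0, 1, 0.0)],
    (1, "b"): [(1.0, 1, 0.0)],
}
ACTIONS = ("a", "b")

def q_value(s, a, v):
    """Expected one-step return: sum over (s', r) of p * [r + gamma * v(s')]."""
    return sum(p * (r + GAMMA * v[s2]) for p, s2, r in P[(s, a)])

def greedy_policy(v, states=(0, 1)):
    """Pick, in every state, the action with the largest one-step lookahead."""
    return {s: max(ACTIONS, key=lambda a: q_value(s, a, v)) for s in states}

v = {0: 0.0, 1: 0.0}                   # any value estimates will do
print(greedy_policy(v))                # {0: 'a', 1: 'a'}
```

With zero value estimates, action "a" in state 0 has expected return 0.8, beating the certain 0.5 of action "b", so the greedy policy prefers "a".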
Why This Matters
Dynamic Programming gives us exact solutions to MDPs. When we have access to the complete model (all transitions and rewards), DP finds the true optimal policy.
But DP has limitations:
DP requires the full model. In most real-world problems, we do not know the dynamics $p(s', r \mid s, a)$ precisely. We might have a simulator, or we might only have real experience.
DP requires enumeration. We must loop over all states and all actions. This becomes impractical when the state space is large (millions of states) or continuous.
These limitations motivate reinforcement learning: methods that learn from samples without requiring the full model, and that can use function approximation to handle large state spaces.
Despite these limitations, DP is essential:
- It provides the theoretical foundation for understanding RL
- Many RL algorithms are approximate versions of DP methods
- For small MDPs, DP gives the ground truth to compare against
Key Questions We Will Answer
- Why does greedy improvement never make things worse?
- How many policy iterations are typically needed?
- When should you use policy iteration vs. value iteration?
- What is the relationship between these algorithms and Q-learning?
Key Takeaways
- The policy improvement theorem guarantees that greedy policies are never worse
- Policy iteration alternates: evaluate → improve → evaluate → improve → …
- Value iteration applies the Bellman optimality backup directly to find $v_*$
- Both converge to the optimal policy for finite MDPs
- DP is foundational but requires knowing the full model