Chapter 22
πŸ“Draft

Policy Improvement

Finding better policies through value functions

Prerequisites:

Policy Evaluation

What You'll Learn

  • Explain the policy improvement theorem and why it guarantees progress
  • Implement policy iteration from scratch
  • Implement value iteration from scratch
  • Compare policy iteration and value iteration on the same MDP
  • Solve simple MDPs to find optimal policies using DP methods

We can now evaluate any policy, computing $V^\pi$ for any choice of $\pi$. But our goal is not just to evaluate policies. We want to find the best policy.

The remarkable insight of this chapter is that once we have $V^\pi$, we can construct a better policy. And by alternating between evaluation and improvement, we can climb all the way to the optimum.

Value Iteration Visualization

Watch value iteration solve the GridWorld problem step by step.

[Interactive GridWorld demo: a grid with a start cell (S), obstacles (🧱), and a goal (🎯, +10). Each cell shows its current value estimate, and the display tracks the iteration count, the max delta, and γ = 0.90.]

How it works: Value iteration repeatedly applies the Bellman equation $V(s) = \max_a [R(s,a) + \gamma V(s')]$ until values converge. Each cell shows its estimated value, and arrows show the optimal policy. Higher gamma values make the agent care more about future rewards. More negative step rewards encourage shorter paths.

The Core Insight

📖 Policy Improvement

The process of constructing a new policy that is at least as good as the current policy, by acting greedily with respect to the current value function.

Suppose you know $V^\pi(s)$ for all states. From any state $s$, you can ask: "What if I took a different action just once, then followed $\pi$ afterward? Would I do better?"

The greedy action is the one that maximizes this one-step lookahead. The policy improvement theorem tells us something magical: if we switch to always taking the greedy action, we are guaranteed to do at least as well as before, and usually better.
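This guarantee can be checked numerically. The sketch below uses a hypothetical 3-state, 2-action MDP (all transition probabilities and rewards are invented for illustration): it evaluates an arbitrary policy exactly, builds the greedy policy via one-step lookahead, and verifies that the new policy is at least as good in every state.

```python
import numpy as np

# A hypothetical 3-state, 2-action MDP (numbers chosen only for illustration).
# P[a][s, s'] is the transition probability, R[s, a] the expected reward.
gamma = 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])

def evaluate(policy):
    """Exact policy evaluation: solve the linear system V = R_pi + gamma * P_pi V."""
    P_pi = P[policy, np.arange(3)]        # transition matrix under the policy
    R_pi = R[np.arange(3), policy]        # reward vector under the policy
    return np.linalg.solve(np.eye(3) - gamma * P_pi, R_pi)

pi = np.array([0, 0, 0])                  # an arbitrary starting policy
V_pi = evaluate(pi)
Q = R + gamma * np.einsum('ast,t->sa', P, V_pi)  # one-step lookahead under V^pi
pi_greedy = Q.argmax(axis=1)              # switch to the greedy action everywhere
V_greedy = evaluate(pi_greedy)

# Policy improvement theorem: the greedy policy is at least as good in every state.
assert np.all(V_greedy >= V_pi - 1e-12)
```

Here the greedy policy strictly improves on the starting policy, but the theorem only promises "at least as good": if the policy is already optimal, greedy lookahead leaves it unchanged.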

This leads to two powerful algorithms:

  • Policy Iteration: Alternate between full policy evaluation and greedy improvement
  • Value Iteration: Combine evaluation and improvement into a single operation

Both converge to the optimal policy $\pi^*$ and optimal value function $V^*$.

Chapter Overview

Two Paths to the Same Destination

Policy Iteration

  1. Evaluate current policy completely ($V^\pi$)
  2. Improve: make policy greedy w.r.t. $V^\pi$
  3. Repeat until policy stops changing

Typically converges in very few policy iterations, but each iteration requires full evaluation.
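The three steps above can be sketched in a few lines. The MDP below is a made-up 3-state, 2-action example (all numbers illustrative), and evaluation is done exactly by solving the linear system rather than by iterative sweeps:

```python
import numpy as np

# Hypothetical MDP: P[a][s, s'] transition probs, R[s, a] expected rewards.
n_states, gamma = 3, 0.9
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])

def evaluate(policy):
    """Step 1: full evaluation by solving V = R_pi + gamma * P_pi V."""
    P_pi = P[policy, np.arange(n_states)]
    R_pi = R[np.arange(n_states), policy]
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, R_pi)

def policy_iteration():
    policy = np.zeros(n_states, dtype=int)
    while True:
        V = evaluate(policy)                              # step 1: evaluate
        Q = R + gamma * np.einsum('ast,t->sa', P, V)      # one-step lookahead
        new_policy = Q.argmax(axis=1)                     # step 2: greedy improve
        if np.array_equal(new_policy, policy):            # step 3: stop when stable
            return policy, V
        policy = new_policy

pi_star, V_star = policy_iteration()
```

On this toy MDP the loop stabilizes after a couple of improvements, illustrating why the policy-iteration count is typically small even when each evaluation is expensive.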

Value Iteration

  1. Apply Bellman optimality backup to all states
  2. Repeat until values converge
  3. Extract greedy policy at the end

Simpler per-iteration, but may require many more iterations to converge.
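A minimal sketch of this loop, run on the same kind of hypothetical 3-state, 2-action MDP (all numbers illustrative):

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, tol=1e-8):
    """Repeat the Bellman optimality backup until the max delta falls below tol."""
    V = np.zeros(R.shape[0])
    while True:
        # V(s) <- max_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]
        Q = R + gamma * np.einsum('ast,t->sa', P, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # extract greedy policy at the end
        V = V_new

# Hypothetical MDP: P[a][s, s'] transition probs, R[s, a] expected rewards.
P = np.array([
    [[0.8, 0.2, 0.0], [0.1, 0.8, 0.1], [0.0, 0.2, 0.8]],  # action 0
    [[0.0, 0.9, 0.1], [0.0, 0.1, 0.9], [0.0, 0.0, 1.0]],  # action 1
])
R = np.array([[0.0, 0.0], [0.0, 0.0], [1.0, 2.0]])
V_star, pi_star = value_iteration(P, R)
```

Note the contrast with policy iteration: there is no explicit policy during the loop, only values; the greedy policy is read off once at the end.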

ℹ️ Note

Both algorithms find the same optimal policy. They differ in how they organize computation. The best choice depends on the problem structure and computational constraints.

The Greedy Policy

The key operation in both algorithms is constructing a greedy policy from a value function.

∑ Mathematical Details

Given a value function $V$, the greedy policy selects:

$$\pi'(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$$

This picks the action that maximizes expected immediate reward plus discounted future value.

The greedy policy answers: "Looking one step ahead and using my current value estimates for what comes after, which action looks best?"

It is called "greedy" because it always takes what looks best right now. Surprisingly, this myopic approach leads to globally optimal behavior when combined with accurate value estimates.
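In code, the one-step lookahead is a single max over actions. A sketch, using an invented 2-state, 2-action MDP and made-up value estimates:

```python
import numpy as np

def greedy_policy(P, R, V, gamma):
    """pi'(s) = argmax_a [ R(s,a) + gamma * sum_s' P(s'|s,a) V(s') ]."""
    Q = R + gamma * np.einsum('ast,t->sa', P, V)  # Q[s, a]: one-step lookahead
    return Q.argmax(axis=1)

# Toy deterministic MDP: action 0 moves to state 0, action 1 moves to state 1.
P = np.array([
    [[1.0, 0.0], [1.0, 0.0]],   # action 0
    [[0.0, 1.0], [0.0, 1.0]],   # action 1
])
R = np.array([[0.0, 0.0], [1.0, 1.0]])  # any action taken in state 1 pays +1
V = np.array([0.0, 10.0])               # current value estimates
print(greedy_policy(P, R, V, gamma=0.9))  # both states prefer action 1
```

Because state 1 looks valuable under the current estimates, the greedy policy steers toward it from everywhere; if $V$ changes, the greedy policy changes with it.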

Why This Matters

Dynamic Programming gives us exact solutions to MDPs. When we have access to the complete model (all transitions and rewards), DP finds the true optimal policy.

But DP has limitations:

  • It requires a complete model of the environment: every transition probability and reward must be known in advance
  • It sweeps over the entire state space, which becomes intractable as the number of states grows

Despite these limitations, DP is essential:

  1. It provides the theoretical foundation for understanding RL
  2. Many RL algorithms are approximate versions of DP methods
  3. For small MDPs, DP gives the ground truth to compare against

Key Questions We Will Answer

  • Why does greedy improvement never make things worse?
  • How many policy iterations are typically needed?
  • When should you use policy iteration vs. value iteration?
  • What is the relationship between these algorithms and Q-learning?

Key Takeaways

  • The policy improvement theorem guarantees that greedy policies are never worse
  • Policy iteration alternates: evaluate → improve → evaluate → …
  • Value iteration applies the Bellman optimality backup directly to find $V^*$
  • Both converge to the optimal policy for finite MDPs
  • DP is foundational but requires knowing the full model