Chapter 21
📝Draft

Policy Evaluation

Computing the value of a policy when you know the model


What You'll Learn

  • Implement iterative policy evaluation from scratch
  • Explain why policy evaluation is guaranteed to converge
  • Choose appropriate stopping criteria for practical applications
  • Trace through policy evaluation on a simple MDP step by step
  • Understand the computational requirements and trade-offs

We have the Bellman equations. We know that the value function $V^\pi$ must satisfy them. But how do we actually compute $V^\pi$ for a given policy?

The answer is surprisingly elegant: just keep applying the Bellman equation until the values stop changing.

The Big Idea

📖Policy Evaluation

The task of computing the state-value function $V^\pi$ for a given policy $\pi$. Also called “prediction” because we are predicting the expected return from each state under that policy.

Imagine you want to know how valuable each state is, but you have no idea where to start. So you guess: set all values to zero. Now ask yourself, “If my values were correct, what would the Bellman equation tell me?” The Bellman equation combines immediate rewards with future values to give you a new, better estimate.

Repeat this process. Each round, your estimates become more accurate. Eventually, they stop changing because you have found the values that exactly satisfy the Bellman equation. Those are the true values under policy $\pi$.
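To see the repeated-guessing idea concretely, here is a minimal sketch on a made-up two-state chain (the states A and B, rewards, and $\gamma = 0.5$ are assumptions chosen for illustration): in state A the policy moves to B with reward 1, and in B it stays put with reward 1.

```python
gamma = 0.5
V = {'A': 0.0, 'B': 0.0}  # initial guess: all values zero

for k in range(20):
    # Build the new estimates from the old ones (the RHS uses the old V)
    V = {'A': 1 + gamma * V['B'],   # Bellman backup for state A
         'B': 1 + gamma * V['B']}   # Bellman backup for state B

# After 20 rounds both values are within 1e-5 of 1 / (1 - gamma) = 2
```

Each round, the estimates move closer to the fixed point $1/(1-\gamma) = 2$, and the size of the change shrinks geometrically, which is exactly the behavior the chapter formalizes.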

This process is called iterative policy evaluation, and it is our first Dynamic Programming algorithm.

What is Dynamic Programming?

ℹ️Note

Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies given a perfect model of the environment. The term comes from Richard Bellman, who developed these techniques in the 1950s.

The “dynamic” refers to the sequential, multi-stage nature of the decision problem. The “programming” means optimization in the mathematical sense, not computer programming.

DP methods work by using the Bellman equations to iteratively improve value estimates. They matter because:

  1. They are exact: When we know the MDP, DP finds the true values
  2. They underlie modern methods: TD learning, Q-learning, and other RL methods are approximations of DP ideas
  3. They build intuition: Seeing DP work clearly prepares you for more complex methods

Chapter Overview

This chapter covers the mechanics of computing $V^\pi$.

A Preview: The Update Rule

The heart of policy evaluation is the Bellman expectation backup:

Mathematical Details

$$V_{k+1}(s) = \sum_a \pi(a \mid s) \left[ R(s,a) + \gamma \sum_{s'} P(s' \mid s, a)\, V_k(s') \right]$$

This equation says: the new value estimate for state $s$ equals the expected immediate reward plus the discounted value of where we end up, averaged over the actions we might take under policy $\pi$.

We apply this update to every state, creating a “sweep” through the state space. After enough sweeps, the values converge to $V^\pi$.
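The full algorithm can be sketched in a few lines of NumPy. This is a minimal illustration, assuming the MDP is given as arrays `P[s, a, s']` (transition probabilities), `R[s, a]` (expected rewards), and `pi[s, a]` (action probabilities); the function name and signature are our own, not from a particular library.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iterative policy evaluation via synchronous Bellman expectation backups.

    P[s, a, s'] : transition probabilities
    R[s, a]     : expected immediate rewards
    pi[s, a]    : probability of taking action a in state s
    theta       : stopping threshold on the largest value change
    """
    n_states = P.shape[0]
    V = np.zeros(n_states)  # initialize all values to zero
    while True:
        # Expected next-state value for every (s, a) pair: sum_{s'} P(s'|s,a) V(s')
        EV = np.einsum('sat,t->sa', P, V)
        # One full sweep: back up every state at once under the policy
        V_new = np.einsum('sa,sa->s', pi, R + gamma * EV)
        delta = np.max(np.abs(V_new - V))
        V = V_new
        if delta < theta:  # stop when no state's value moved more than theta
            break
    return V
```

Note the synchronous update: `V_new` is computed entirely from the old `V`, matching the $V_{k+1}$-from-$V_k$ form of the backup above. In-place (asynchronous) variants that overwrite values during the sweep also converge and often do so faster.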

Key Questions We Will Answer

  • How does the iterative algorithm work, step by step?
  • Why is convergence guaranteed? What property ensures we reach the right answer?
  • How do we choose when to stop? What is a good threshold?
  • What is the computational cost, and how does it scale with the MDP size?

The Path Ahead

Once we can evaluate any policy, the natural next question is: How do we find a better policy? That is the subject of the next chapter on Policy Improvement, where we will use value functions to construct better and better policies until we reach the optimum.


Key Takeaways

  • Policy evaluation computes $V^\pi$ for a given policy $\pi$
  • The algorithm: initialize values, apply Bellman updates, repeat until convergence
  • Convergence is guaranteed because the Bellman operator is a contraction
  • Stop when the maximum value change falls below a threshold $\theta$
  • This is the first step toward finding optimal policies