Policy Evaluation
What You'll Learn
- Implement iterative policy evaluation from scratch
- Explain why policy evaluation is guaranteed to converge
- Choose appropriate stopping criteria for practical applications
- Trace through policy evaluation on a simple MDP step by step
- Understand the computational requirements and trade-offs
We have the Bellman equations. We know that the value function must satisfy them. But how do we actually compute for a given policy?
The answer is surprisingly elegant: just keep applying the Bellman equation until the values stop changing.
The Big Idea
The task of computing the state-value function for a given policy . Also called “prediction” because we are predicting the expected return from each state under that policy.
Imagine you want to know how valuable each state is, but you have no idea where to start. So you guess: set all values to zero. Now ask yourself, “If my values were correct, what would the Bellman equation tell me?” The Bellman equation combines immediate rewards with future values to give you a new, better estimate.
Repeat this process. Each round, your estimates become more accurate. Eventually, they stop changing because you have found the values that exactly satisfy the Bellman equation. Those are the true values under policy .
This process is called iterative policy evaluation, and it is our first Dynamic Programming algorithm.
What is Dynamic Programming?
Dynamic Programming (DP) refers to a collection of algorithms that compute optimal policies given a perfect model of the environment. The term comes from Richard Bellman, who developed these techniques in the 1950s.
The “dynamic” refers to the sequential, multi-stage nature of the decision problem. The “programming” means optimization in the mathematical sense, not computer programming.
DP methods work by using the Bellman equations to iteratively improve value estimates. They are foundational because:
- They are exact: When we know the MDP, DP finds the true values
- They are foundational: TD learning, Q-learning, and other RL methods are approximations of DP ideas
- They build intuition: Seeing DP work clearly prepares you for more complex methods
DP requires knowing the full MDP, meaning all transition probabilities and rewards . In most real-world problems, we do not have this luxury. That is where reinforcement learning comes in: learning from experience when the model is unknown.
But before we can run, we must walk. Understanding DP is essential for understanding everything that follows.
Chapter Overview
This chapter covers the mechanics of computing :
Iterative Policy Evaluation
The algorithm: initialize, sweep, repeat until convergence
Convergence and Stopping
Why it converges, when to stop, and computational costs
A Preview: The Update Rule
The heart of policy evaluation is the Bellman expectation backup:
This equation says: the new value estimate for state equals the expected immediate reward plus the discounted value of where we end up, averaged over the actions we might take under policy .
We apply this update to every state, creating a “sweep” through the state space. After enough sweeps, the values converge to .
Key Questions We Will Answer
- How does the iterative algorithm work, step by step?
- Why is convergence guaranteed? What property ensures we reach the right answer?
- How do we choose when to stop? What is a good threshold?
- What is the computational cost, and how does it scale with the MDP size?
The Path Ahead
Once we can evaluate any policy, the natural next question is: How do we find a better policy? That is the subject of the next chapter on Policy Improvement, where we will use value functions to construct better and better policies until we reach the optimum.
Key Takeaways
- Policy evaluation computes for a given policy
- The algorithm: initialize values, apply Bellman updates, repeat until convergence
- Convergence is guaranteed because the Bellman operator is a contraction
- Stop when the maximum value change falls below a threshold
- This is the first step toward finding optimal policies