The Bellman Equations
What You'll Learn
- Derive the Bellman expectation equations for $v_\pi$ and $q_\pi$
- Derive the Bellman optimality equations for $v_*$ and $q_*$
- Explain what “bootstrapping” means and why it enables efficient learning
- Understand why Bellman equations are the foundation of all RL algorithms
Richard Bellman discovered something beautiful in the 1950s: the value of a state depends on the values of states you can reach from it. This simple recursive insight---that you can break down a complex problem into simpler subproblems---is the foundation of everything in reinforcement learning.
The Recursive Nature of Value
Consider standing at a crossroads. How valuable is your current position? It depends on:
- What reward you get right now (maybe there’s a rest stop here)
- Where you can go from here (the available roads)
- How valuable those destinations are (what awaits down each path)
But wait---the value of those destinations depends on their successors. And those depend on further successors. It seems like an infinite regress!
Bellman’s insight was that this recursion is actually a feature, not a bug. The value of any state can be expressed in terms of immediate rewards plus the values of neighboring states. This creates a system of equations that we can solve.
Recursive relationships that express the value of a state (or state-action pair) in terms of immediate rewards and the values of successor states. They form a system of equations that value functions must satisfy.
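Because the expectation equation is linear in the values, a finite MDP with known dynamics can be solved for $v_\pi$ in closed form. Here is a minimal sketch on a made-up 3-state chain (the transition matrix, rewards, and discount are all assumed for illustration):

```python
import numpy as np

# Hypothetical 3-state chain with the policy already "baked in":
# P[s, s'] = probability of moving from s to s' under the policy,
# r[s]     = expected immediate reward in state s under the policy.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state that keeps paying r = 1
])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Bellman expectation in matrix form: v = r + gamma * P v,
# i.e. (I - gamma * P) v = r, a linear system we can solve exactly.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```

Note the absorbing state's value is $1/(1-\gamma) = 10$: it collects the reward 1 forever, discounted. This direct solve scales poorly, which is exactly why the iterative methods later in the book matter.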
The Core Insight: Value Flows Backward
Think about how you might evaluate different starting positions in a maze:
Consider a 3-state path: Start, Middle, Goal
- Goal value: $R$ (the reward)
- Middle value: $\gamma R$
- Start value: $\gamma^2 R$
Value flows backward from rewards through the state space.
This backward flow is the essence of the Bellman equations. Each state’s value depends on its successors’ values, which depend on their successors’ values, all the way to the rewards.
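The three-state path above takes only a few lines to compute. A sketch, assuming a goal reward of 10 and a discount of 0.9 (both values are illustrative, not from the text):

```python
gamma = 0.9   # discount factor (assumed)
R = 10.0      # reward received at the Goal (assumed)

# Value flows backward: each state's value is gamma times its successor's.
v_goal = R                   # R
v_middle = gamma * v_goal    # gamma * R
v_start = gamma * v_middle   # gamma^2 * R
print(v_start, v_middle, v_goal)  # 8.1 9.0 10.0
```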
Chapter Overview
This chapter explores the mathematical heart of reinforcement learning---the Bellman equations:
Bellman Expectation Equations
How values relate under a specific policy
Bellman Optimality Equations
The equations that define optimal behavior
Why Bellman Matters
Bootstrapping and the foundation of RL
The Two Types of Bellman Equations
There are two families of Bellman equations, and understanding the difference is crucial:
Expectation Equations
The Bellman expectation equations describe values under a specific policy $\pi$. They answer: “If I follow this particular strategy, how good is each state?”
The key feature: we average over actions according to the policy $\pi$.
Optimality Equations
The Bellman optimality equations describe the best possible values. They answer: “If I act optimally from now on, how good is each state?”
The key feature: we take the maximum over actions, not the average.
The shift from $\sum_a \pi(a \mid s)$ to $\max_a$ is subtle but profound. It means the optimality equations don’t require knowing the optimal policy---the $\max$ operator embeds the optimal action choice directly into the equation.
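The difference between the two families shows up as two different one-step backups over the same action values. A sketch with hypothetical numbers for a single state with three actions:

```python
import numpy as np

# Hypothetical action values q(s, a) for one state with three actions,
# and a policy pi(a|s) that spreads probability across them.
q = np.array([1.0, 2.0, 4.0])
pi = np.array([0.5, 0.3, 0.2])

# Expectation backup: average over actions according to the policy.
v_pi = np.dot(pi, q)   # sum_a pi(a|s) q(s, a) = 1.9

# Optimality backup: commit to the best action instead of averaging.
v_star = np.max(q)     # max_a q(s, a) = 4.0

print(v_pi, v_star)
```

Since the max of a set always dominates any weighted average of it, $v_*(s) \ge v_\pi(s)$ for every policy $\pi$.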
Why This Chapter Matters
The Bellman equations are not just theoretical curiosities---they’re the foundation of every major RL algorithm.
If you understand the Bellman equations deeply, you understand the core of reinforcement learning. Everything else is about how to solve or approximate these equations under different conditions---when you know the model, when you don’t, when states are discrete, when they’re continuous.
The Journey Ahead
In this chapter, we’ll build up the Bellman equations step by step:
1. Bellman Expectation: We’ll derive the equations from the definition of return, showing exactly how today’s value depends on tomorrow’s value.
2. Bellman Optimality: We’ll see how replacing the policy average with a maximum gives us equations that define optimal behavior without needing to specify an optimal policy.
3. Why It Matters: We’ll explore bootstrapping---the key insight that lets us update estimates using other estimates---and see why this makes RL algorithms possible.
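As a preview of the bootstrapping idea, here is a TD(0)-style update sketch: one state's estimate is nudged toward a target built from another estimate, without waiting for the episode's full return. All the numbers (current estimates, step size, sampled transition) are assumed for illustration:

```python
# Bootstrapping sketch: update the estimate V(s) toward a target that
# itself uses the current estimate V(s'), not a true value.
V = {"A": 0.0, "B": 5.0}   # current (imperfect) value estimates, assumed
gamma, alpha = 0.9, 0.1    # discount and step size, assumed

# Suppose we observe a single transition A -> B with reward r = 1.
r, s, s_next = 1.0, "A", "B"
target = r + gamma * V[s_next]        # bootstrap: leans on the estimate V(B)
V[s] = V[s] + alpha * (target - V[s])
print(V["A"])  # 0.55
```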
By the end of this chapter, you’ll be comfortable with all four Bellman equations:
For state values:
- Expectation: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr]$
- Optimality: $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]$
For action values:
- Expectation: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\bigr]$
- Optimality: $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$
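All four equations can be written as one-step backup operators on a small tabular MDP. A sketch, assuming a made-up 2-state, 2-action MDP and a uniform policy (everything here is illustrative):

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.0, 1.0]; P[1, 1] = [0.5, 0.5]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],   # pi(a|s): uniform policy, assumed
               [0.5, 0.5]])

def backup_v_pi(v):
    # Expectation: v(s) = sum_a pi(a|s) [R(s,a) + gamma sum_s' P v(s')]
    return np.einsum("sa,sa->s", pi, R + gamma * (P @ v))

def backup_v_star(v):
    # Optimality: v(s) = max_a [R(s,a) + gamma sum_s' P v(s')]
    return np.max(R + gamma * (P @ v), axis=1)

def backup_q_pi(q):
    # Expectation: q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) sum_a' pi(a'|s') q(s',a')
    v = np.einsum("sa,sa->s", pi, q)
    return R + gamma * (P @ v)

def backup_q_star(q):
    # Optimality: q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) max_a' q(s',a')
    return R + gamma * (P @ q.max(axis=1))

# Iterating any backup converges to the corresponding value function
# (each operator is a gamma-contraction), foreshadowing value iteration.
v = np.zeros(nS)
for _ in range(500):
    v = backup_v_star(v)
print(v)  # a fixed point: v == backup_v_star(v)
```

The fixed point of each operator is exactly the value function its Bellman equation defines; solving RL problems amounts to finding those fixed points.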
Key Takeaways
- Bellman equations express values recursively in terms of successor values
- Expectation equations average over the policy: $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
- Optimality equations maximize over actions: $v_*(s) = \max_a q_*(s, a)$
- Value flows backward from rewards through the state space
- These equations are the foundation of all RL algorithms