The Bellman Equations
What You'll Learn
- Derive the Bellman expectation equations for $v_\pi$ and $q_\pi$
- Derive the Bellman optimality equations for $v_*$ and $q_*$
- Explain what “bootstrapping” means and why it enables efficient learning
- Understand why Bellman equations are the foundation of all RL algorithms
Richard Bellman discovered something beautiful in the 1950s: the value of a state depends on the values of states you can reach from it. This simple recursive insight---that you can break down a complex problem into simpler subproblems---is the foundation of everything in reinforcement learning.
The Recursive Nature of Value
Consider standing at a crossroads. How valuable is your current position? It depends on:
- What reward you get right now (maybe there’s a rest stop here)
- Where you can go from here (the available roads)
- How valuable those destinations are (what awaits down each path)
But wait---the value of those destinations depends on their successors. And those depend on further successors. It seems like an infinite regress!
Bellman’s insight was that this recursion is actually a feature, not a bug. The value of any state can be expressed in terms of immediate rewards plus the values of neighboring states. This creates a system of equations that we can solve.
Recursive relationships that express the value of a state (or state-action pair) in terms of immediate rewards and the values of successor states. They form a system of equations that value functions must satisfy.
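Because the expectation equation is linear in the values, a finite MDP with known dynamics can be solved for $v_\pi$ in closed form. Here is a minimal sketch on a made-up 3-state chain (the transition matrix, rewards, and discount are all assumed for illustration):

```python
import numpy as np

# Hypothetical 3-state chain with the policy already "baked in":
# P[s, s'] = probability of moving from s to s' under the policy,
# r[s]     = expected immediate reward in state s under the policy.
P = np.array([
    [0.5, 0.5, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.0, 1.0],   # absorbing state that keeps paying r = 1
])
r = np.array([0.0, 0.0, 1.0])
gamma = 0.9

# Bellman expectation in matrix form: v = r + gamma * P v,
# i.e. (I - gamma * P) v = r, a linear system we can solve exactly.
v = np.linalg.solve(np.eye(3) - gamma * P, r)
print(v)
```

Note the absorbing state's value is $1/(1-\gamma) = 10$: it collects the reward 1 forever, discounted. This direct solve scales poorly, which is exactly why the iterative methods later in the book matter.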
The Core Insight: Value Flows Backward
Think about how you might evaluate different starting positions in a maze:
Consider a 3-state path: Start, Middle, Goal
- Goal value: $R$ (the reward)
- Middle value: $\gamma R$
- Start value: $\gamma^2 R$
Value flows backward from rewards through the state space.
This backward flow is the essence of the Bellman equations. Each state’s value depends on its successors’ values, which depend on their successors’ values, all the way to the rewards.
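The three-state path above takes only a few lines to compute. A sketch, assuming a goal reward of 10 and a discount of 0.9 (both values are illustrative, not from the text):

```python
gamma = 0.9   # discount factor (assumed)
R = 10.0      # reward received at the Goal (assumed)

# Value flows backward: each state's value is gamma times its successor's.
v_goal = R                   # R
v_middle = gamma * v_goal    # gamma * R
v_start = gamma * v_middle   # gamma^2 * R
print(v_start, v_middle, v_goal)  # 8.1 9.0 10.0
```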
Chapter Overview
This chapter explores the mathematical heart of reinforcement learning---the Bellman equations:
Bellman Expectation Equations
How values relate under a specific policy
Bellman Optimality Equations
The equations that define optimal behavior
Why Bellman Matters
Bootstrapping and the foundation of RL
The Two Types of Bellman Equations
There are two families of Bellman equations, and understanding the difference is crucial:
Expectation Equations
The Bellman expectation equations describe values under a specific policy $\pi$. They answer: “If I follow this particular strategy, how good is each state?”
The key feature: we average over actions according to the policy $\pi$.
Optimality Equations
The Bellman optimality equations describe the best possible values. They answer: “If I act optimally from now on, how good is each state?”
The key feature: we take the maximum over actions, not the average.
The shift from $\sum_a \pi(a \mid s)$ to $\max_a$ is subtle but profound. It means the optimality equations don’t require knowing the optimal policy---the $\max$ operator embeds the optimal action choice directly into the equation.
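The difference between the two families shows up as two different one-step backups over the same action values. A sketch with hypothetical numbers for a single state with three actions:

```python
import numpy as np

# Hypothetical action values q(s, a) for one state with three actions,
# and a policy pi(a|s) that spreads probability across them.
q = np.array([1.0, 2.0, 4.0])
pi = np.array([0.5, 0.3, 0.2])

# Expectation backup: average over actions according to the policy.
v_pi = np.dot(pi, q)   # sum_a pi(a|s) q(s, a) = 1.9

# Optimality backup: commit to the best action instead of averaging.
v_star = np.max(q)     # max_a q(s, a) = 4.0

print(v_pi, v_star)
```

Since the max of a set always dominates any weighted average of it, $v_*(s) \ge v_\pi(s)$ for every policy $\pi$.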
Why This Chapter Matters
The Bellman equations are not just theoretical curiosities---they’re the foundation of every major RL algorithm.
If you understand the Bellman equations deeply, you understand the core of reinforcement learning. Everything else is about how to solve or approximate these equations under different conditions---when you know the model, when you don’t, when states are discrete, when they’re continuous.
The Journey Ahead
In this chapter, we’ll build up the Bellman equations step by step:
1. Bellman Expectation: We’ll derive the equations from the definition of return, showing exactly how today’s value depends on tomorrow’s value.
2. Bellman Optimality: We’ll see how replacing the policy average with a maximum gives us equations that define optimal behavior without needing to specify an optimal policy.
3. Why It Matters: We’ll explore bootstrapping---the key insight that lets us update estimates using other estimates---and see why this makes RL algorithms possible.
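As a preview of the bootstrapping idea, here is a TD(0)-style update sketch: one state's estimate is nudged toward a target built from another estimate, without waiting for the episode's full return. All the numbers (current estimates, step size, sampled transition) are assumed for illustration:

```python
# Bootstrapping sketch: update the estimate V(s) toward a target that
# itself uses the current estimate V(s'), not a true value.
V = {"A": 0.0, "B": 5.0}   # current (imperfect) value estimates, assumed
gamma, alpha = 0.9, 0.1    # discount and step size, assumed

# Suppose we observe a single transition A -> B with reward r = 1.
r, s, s_next = 1.0, "A", "B"
target = r + gamma * V[s_next]        # bootstrap: leans on the estimate V(B)
V[s] = V[s] + alpha * (target - V[s])
print(V["A"])  # 0.55
```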
By the end of this chapter, you’ll be comfortable with all four Bellman equations:
For state values:
- Expectation: $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_\pi(s')\bigr]$
- Optimality: $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma\, v_*(s')\bigr]$
For action values:
- Expectation: $q_\pi(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \sum_{a'} \pi(a' \mid s')\, q_\pi(s', a')\bigr]$
- Optimality: $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma \max_{a'} q_*(s', a')\bigr]$
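All four equations can be written as one-step backup operators on a small tabular MDP. A sketch, assuming a made-up 2-state, 2-action MDP and a uniform policy (everything here is illustrative):

```python
import numpy as np

# Hypothetical MDP: 2 states, 2 actions.
# P[s, a, s'] = transition probability, R[s, a] = expected reward.
nS, nA, gamma = 2, 2, 0.9
P = np.zeros((nS, nA, nS))
P[0, 0] = [0.8, 0.2]; P[0, 1] = [0.1, 0.9]
P[1, 0] = [0.0, 1.0]; P[1, 1] = [0.5, 0.5]
R = np.array([[0.0, 1.0],
              [2.0, 0.0]])
pi = np.array([[0.5, 0.5],   # pi(a|s): uniform policy, assumed
               [0.5, 0.5]])

def backup_v_pi(v):
    # Expectation: v(s) = sum_a pi(a|s) [R(s,a) + gamma sum_s' P v(s')]
    return np.einsum("sa,sa->s", pi, R + gamma * (P @ v))

def backup_v_star(v):
    # Optimality: v(s) = max_a [R(s,a) + gamma sum_s' P v(s')]
    return np.max(R + gamma * (P @ v), axis=1)

def backup_q_pi(q):
    # Expectation: q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) sum_a' pi(a'|s') q(s',a')
    v = np.einsum("sa,sa->s", pi, q)
    return R + gamma * (P @ v)

def backup_q_star(q):
    # Optimality: q(s,a) = R(s,a) + gamma sum_s' P(s'|s,a) max_a' q(s',a')
    return R + gamma * (P @ q.max(axis=1))

# Iterating any backup converges to the corresponding value function
# (each operator is a gamma-contraction), foreshadowing value iteration.
v = np.zeros(nS)
for _ in range(500):
    v = backup_v_star(v)
print(v)  # a fixed point: v == backup_v_star(v)
```

The fixed point of each operator is exactly the value function its Bellman equation defines; solving RL problems amounts to finding those fixed points.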
Key Takeaways
- Bellman equations express values recursively in terms of successor values
- Expectation equations average over the policy: $v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$
- Optimality equations maximize over actions: $v_*(s) = \max_a q_*(s, a)$
- Value flows backward from rewards through the state space
- These equations are the foundation of all RL algorithms