Bellman Expectation Equations
The Bellman expectation equations describe how values relate to each other when following a specific policy. They’re the mathematical foundation that connects the value of any state to the values of states you can reach from it.
Starting from Returns
Recall from the previous chapter that the state-value function is defined as the expected return:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$
The key insight comes from the recursive structure of returns.
The return has a natural decomposition:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

We can factor out $\gamma$ from all but the first term:

$$G_t = R_{t+1} + \gamma \left( R_{t+2} + \gamma R_{t+3} + \cdots \right) = R_{t+1} + \gamma G_{t+1}$$
Today’s return = today’s reward + discounted tomorrow’s return.
This simple recursive relationship is the heart of the Bellman equations. If we know tomorrow’s value, we can compute today’s value.
Let’s derive this more carefully. Starting from the definition of return:

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Separating the first reward:

$$G_t = R_{t+1} + \gamma \sum_{k=0}^{\infty} \gamma^k R_{t+k+2}$$

Substituting $G_{t+1} = \sum_{k=0}^{\infty} \gamma^k R_{t+k+2}$:

$$G_t = R_{t+1} + \gamma G_{t+1}$$
This recursive structure of returns translates directly into a recursive structure of value functions.
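This identity is easy to verify numerically. A quick sketch (the reward sequence and discount are arbitrary, chosen only for illustration):

```python
# Check G_t = R_{t+1} + gamma * G_{t+1} on an arbitrary finite reward sequence.
gamma = 0.9
rewards = [1.0, 0.0, 5.0, 2.0]  # R_1, R_2, R_3, R_4 (illustrative values)

def discounted_return(rewards, gamma):
    """G = sum_k gamma^k * R_{k+1}, computed directly from the definition."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

G_0 = discounted_return(rewards, gamma)      # return from time 0
G_1 = discounted_return(rewards[1:], gamma)  # return from time 1

# Today's return = today's reward + discounted tomorrow's return
assert abs(G_0 - (rewards[0] + gamma * G_1)) < 1e-12
```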
Deriving the Bellman Equation for $V^\pi$
Now we can derive the Bellman expectation equation. Starting from the value function definition:

$$V^\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$$

Using the recursive structure $G_t = R_{t+1} + \gamma G_{t+1}$:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s]$$

By linearity of expectation:

$$V^\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

The first term is the expected immediate reward. For the second term, we use the law of iterated expectations: conditioning on the next state $S_{t+1}$,

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \mathbb{E}_\pi\big[\, \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \,\big|\, S_t = s \big] = \mathbb{E}_\pi[V^\pi(S_{t+1}) \mid S_t = s]$$

This gives us the Bellman expectation equation:

$$V^\pi(s) = \sum_a \pi(a \mid s) \left[ R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s') \right]$$
The value of a state equals the expected immediate reward plus the discounted expected value of the next state.
Let’s break this equation into its components:

- $\sum_a \pi(a \mid s)$: Average over all actions, weighted by how likely the policy is to choose each one
- $R(s, a)$: The expected immediate reward for taking action $a$ in state $s$
- $\gamma$: The discount factor, how much we care about future rewards
- $\sum_{s'} P(s' \mid s, a)\, V^\pi(s')$: The expected value of the next state, averaging over where we might end up
The equation says: “My value is what I expect to get right now, plus the discounted value of where I’ll be next.”
Visualizing with Backup Diagrams
A backup diagram visualizes how a state’s value depends on its successors. The name comes from the idea that value “backs up” from future states to the current state.
The backup diagram for $V^\pi$ shows:
Reading the diagram:
- Top node (large circle): The state whose value we’re computing
- Small filled circles: Actions, weighted by policy probability
- Bottom nodes: Possible next states, weighted by transition probability
- Edges: Carry rewards and transition probabilities
Value flows upward: we sum over next state values, weight by transitions and policy, add rewards, and get $V^\pi(s)$.
The Bellman Equation for $Q^\pi$

The action-value function has its own Bellman equation. Recall that:

$$Q^\pi(s, a) = \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a]$$

The same recursive argument gives:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) \sum_{a'} \pi(a' \mid s')\, Q^\pi(s', a')$$
The value of taking action $a$ in state $s$ equals the immediate reward plus the discounted expected Q-value of the next state-action pair.

This equation is slightly different from the $V^\pi$ equation because we’ve already committed to taking action $a$. So we:

- Collect the immediate reward $R(s, a)$
- Transition to some next state $s'$ with probability $P(s' \mid s, a)$
- In that next state, the policy chooses action $a'$ with probability $\pi(a' \mid s')$
- Get the discounted value $\gamma\, Q^\pi(s', a')$ of that next state-action pair
The two sums represent: first averaging over where we end up, then averaging over what action the policy takes there.
The V-Q Relationship
The state and action value functions are intimately connected:
From Q to V: The state value is the policy-weighted average of action values:

$$V^\pi(s) = \sum_a \pi(a \mid s)\, Q^\pi(s, a)$$

From V to Q: The action value is the reward plus discounted next state value:

$$Q^\pi(s, a) = R(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^\pi(s')$$
These relationships let us convert between V and Q. They also show that knowing one is equivalent to knowing the other (given the policy and dynamics).
These V-Q relationships are incredibly useful. Many algorithms compute V from Q or vice versa. For example:
- Policy improvement uses $Q^\pi$ to find better actions
- TD learning often updates $V$ directly
- Q-learning updates $Q$ directly
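As a quick sketch of these conversions (all names and numbers here are hypothetical, chosen only for illustration):

```python
# Q -> V: state value is the policy-weighted average of action values.
# V -> Q: action value is reward plus discounted expected next-state value.
gamma = 0.9

# Hypothetical two-action state 's' with made-up Q-values and policy probs.
Q = {('s', 'left'): 4.0, ('s', 'right'): 8.0}
pi = {('s', 'left'): 0.25, ('s', 'right'): 0.75}

def v_from_q(s, Q, pi, actions):
    """V(s) = sum_a pi(a|s) * Q(s, a)."""
    return sum(pi[(s, a)] * Q[(s, a)] for a in actions)

def q_from_v(reward, transitions, V, gamma):
    """Q(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) * V(s').

    transitions: list of (next_state, probability) pairs."""
    return reward + gamma * sum(p * V[s2] for s2, p in transitions)

V_s = v_from_q('s', Q, pi, ['left', 'right'])             # 0.25*4 + 0.75*8 = 7.0
Q_sa = q_from_v(0.0, [('s2', 1.0)], {'s2': 10.0}, gamma)  # 0 + 0.9*10 = 9.0
```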
A Worked Example
Let’s compute values by hand for a simple MDP.
Consider this MDP with states $\{A, B, C\}$, where $C$ is terminal:

Setup:

- Actions: `right` only (deterministic policy)
- Transitions: Deterministic (`right` always succeeds)
- Rewards: 0 for all transitions except +10 for the step B → C (reaching the terminal state)
- Discount: $\gamma = 0.9$
Computing Values (working backward from the terminal state):

State C is terminal, so its value is 0 (no future rewards):

$$V^\pi(C) = 0$$

State B: One step from C, collecting the +10 reward:

$$V^\pi(B) = 10 + \gamma\, V^\pi(C) = 10 + 0.9 \cdot 0 = 10$$

State A: Two steps from C:

$$V^\pi(A) = 0 + \gamma\, V^\pi(B) = 0.9 \cdot 10 = 9$$

The values decrease as we move away from the reward, discounted by $\gamma = 0.9$ at each step.
Bellman Equations Form a System
For an MDP with $n$ states, the Bellman expectation equation gives us $n$ equations (one per state) with $n$ unknowns (the values).

For a given policy $\pi$, we can write this as a linear system. Define:

- $V^\pi \in \mathbb{R}^n$ as the vector of values
- $R^\pi$ as the vector of expected immediate rewards under $\pi$
- $P^\pi$ as the state transition matrix under $\pi$

Then the Bellman equation becomes:

$$V^\pi = R^\pi + \gamma P^\pi V^\pi$$

Rearranging:

$$(I - \gamma P^\pi)\, V^\pi = R^\pi \quad\Longrightarrow\quad V^\pi = (I - \gamma P^\pi)^{-1} R^\pi$$

This shows that the Bellman expectation equation has a unique solution (when $\gamma < 1$ and the inverse exists, which it always does for proper MDPs).
While we can solve the Bellman equation by matrix inversion, this requires $O(n^3)$ computation and $O(n^2)$ memory for $n$ states. For large state spaces (millions of states), this is impractical. That’s why we use iterative methods like Dynamic Programming or sampling methods like TD learning.
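For small problems, though, the direct solve works fine. Here is a sketch using NumPy for a three-state chain A → B → C with a +10 reward on the step out of B and $\gamma = 0.9$ (the terminal state becomes an absorbing row with zero reward):

```python
import numpy as np

gamma = 0.9
# States ordered [A, B, C]; under the policy, A -> B -> C, and C absorbs.
P = np.array([
    [0.0, 1.0, 0.0],   # A -> B
    [0.0, 0.0, 1.0],   # B -> C
    [0.0, 0.0, 1.0],   # C -> C (absorbing, zero reward)
])
R = np.array([0.0, 10.0, 0.0])  # expected immediate reward per state

# Solve (I - gamma * P) V = R rather than forming the inverse explicitly.
V = np.linalg.solve(np.eye(3) - gamma * P, R)
# V is approximately [9.0, 10.0, 0.0]
```

Using `np.linalg.solve` instead of `np.linalg.inv` is both faster and numerically safer for this kind of system.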
Implementation
Here’s how to implement a single Bellman backup for the state-value function:
def bellman_backup_v(s, V, mdp, policy, gamma=0.99):
    """
    Compute new value for state s using Bellman expectation equation.

    Args:
        s: Current state
        V: Dictionary mapping states to current value estimates
        mdp: MDP with transitions(s, a) and reward(s, a) methods
        policy: Function policy(s, a) -> probability of action a in state s
        gamma: Discount factor

    Returns:
        Updated value for state s
    """
    value = 0.0
    # Sum over actions, weighted by policy
    for a in mdp.actions(s):
        action_value = 0.0
        # Get immediate reward
        action_value += mdp.reward(s, a)
        # Sum over possible next states, weighted by transition probability
        for s_next, prob in mdp.transitions(s, a):
            action_value += gamma * prob * V.get(s_next, 0.0)
        # Weight by policy probability
        value += policy(s, a) * action_value
    return value

And for the action-value function:
def bellman_backup_q(s, a, Q, mdp, policy, gamma=0.99):
    """
    Compute new Q-value for state-action pair (s, a).

    Args:
        s: Current state
        a: Action taken
        Q: Dictionary mapping (state, action) to current Q estimates
        mdp: MDP with transitions and reward methods
        policy: Function policy(s, a) -> probability
        gamma: Discount factor

    Returns:
        Updated Q-value for (s, a)
    """
    value = mdp.reward(s, a)
    # Sum over possible next states
    for s_next, prob in mdp.transitions(s, a):
        # At next state, average over actions according to policy
        next_state_value = sum(
            policy(s_next, a_next) * Q.get((s_next, a_next), 0.0)
            for a_next in mdp.actions(s_next)
        )
        value += gamma * prob * next_state_value
    return value

Key implementation notes:
- Handle terminal states: they typically have value 0 (no future rewards)
- Use `.get(key, 0.0)` for dictionaries to handle unvisited states
- The policy function returns probabilities that should sum to 1
Here’s a complete example computing values for a simple MDP:
class SimpleMDP:
    """A simple 3-state MDP for demonstration."""

    def __init__(self):
        self.states = ['A', 'B', 'C']  # C is terminal
        self._actions = {'A': ['right'], 'B': ['right'], 'C': []}

    def actions(self, s):
        return self._actions.get(s, [])

    def transitions(self, s, a):
        """Return list of (next_state, probability) tuples."""
        if s == 'A' and a == 'right':
            return [('B', 1.0)]
        elif s == 'B' and a == 'right':
            return [('C', 1.0)]
        return []

    def reward(self, s, a):
        """Reward for taking action a in state s."""
        if s == 'B' and a == 'right':
            return 10.0  # Reward for reaching terminal
        return 0.0

    def is_terminal(self, s):
        return s == 'C'
def compute_values(mdp, policy, gamma=0.9, max_iterations=100, theta=1e-6):
    """Compute V^pi using iterative Bellman backups."""
    # Initialize values to 0
    V = {s: 0.0 for s in mdp.states}
    for iteration in range(max_iterations):
        delta = 0.0  # Track maximum change
        for s in mdp.states:
            if mdp.is_terminal(s):
                continue  # Terminal states stay at 0
            old_value = V[s]
            V[s] = bellman_backup_v(s, V, mdp, policy, gamma)
            delta = max(delta, abs(V[s] - old_value))
        if delta < theta:
            print(f"Converged in {iteration + 1} iterations")
            break
    return V
# Example usage
mdp = SimpleMDP()
policy = lambda s, a: 1.0  # Deterministic policy, always takes the available action
V = compute_values(mdp, policy, gamma=0.9)
print(f"V(A) = {V['A']:.2f}")  # Should be 9.00
print(f"V(B) = {V['B']:.2f}")  # Should be 10.00
print(f"V(C) = {V['C']:.2f}")  # Should be 0.00

Output:
Converged in 3 iterations
V(A) = 9.00
V(B) = 10.00
V(C) = 0.00

Stochastic Policies and Transitions
The Bellman equations handle stochastic policies and transitions naturally. Let’s see an example.
Suppose in state A, the agent takes action right, but:
- With probability 0.8, transitions to B (intended)
- With probability 0.2, stays in A (slipped)
With $\gamma = 0.9$ and $V^\pi(B) = 10$, the value of A must satisfy:

$$V^\pi(A) = 0.8\,\big[0 + 0.9\, V^\pi(B)\big] + 0.2\,\big[0 + 0.9\, V^\pi(A)\big]$$

Solving:

$$V^\pi(A) = 7.2 + 0.18\, V^\pi(A) \quad\Longrightarrow\quad V^\pi(A) = \frac{7.2}{0.82} \approx 8.78$$

The value is lower than before (9.0) because sometimes we slip and stay in place, delaying the reward.
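The slip value can also be found by iterating the one-state fixed-point update until it converges; a minimal sketch, assuming $V^\pi(B) = 10$ (the one-step-to-goal value in this chain):

```python
gamma = 0.9
p_slip = 0.2   # probability of staying in A after taking `right`
V_B = 10.0     # assumed value of B, as in the deterministic chain

# Iterate V(A) <- 0.8*(0 + gamma*V_B) + 0.2*(0 + gamma*V(A)) to its fixed point
V_A = 0.0
for _ in range(200):
    V_A = (1 - p_slip) * gamma * V_B + p_slip * gamma * V_A

# Closed form: V(A) = 7.2 / (1 - 0.18) ~= 8.78
assert abs(V_A - 7.2 / 0.82) < 1e-9
```

Because the update is a contraction (each pass shrinks the error by a factor of 0.18), the iteration converges quickly.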
Summary
The Bellman expectation equations establish the fundamental recursive structure of value functions:
- For $V^\pi$: The value of a state is the expected immediate reward plus the discounted expected value of the next state
- For $Q^\pi$: The value of a state-action pair is the immediate reward plus the discounted expected value of the next state-action pair
- The equations form a system of linear equations that has a unique solution
- Values can be computed by iterative application of the Bellman backup
These expectation equations describe values under a fixed policy. But what if we want to find the best policy? That’s where the Bellman optimality equations come in: they replace the policy average with a maximum, capturing what optimal behavior looks like.