Bellman Optimality Equations
The Bellman optimality equations describe the best possible values---what you’d achieve if you acted perfectly. They’re the target that RL algorithms try to solve, and they contain a beautiful insight: you can define optimal behavior without explicitly specifying an optimal policy.
From Expectation to Optimality
The Bellman expectation equations average over actions according to a policy $\pi$. But what if we want to find the best policy?
Consider two policies at the same state:
- Policy A: Takes action `left` with value 5
- Policy B: Takes action `right` with value 8

Clearly Policy B is better at this state. The optimal policy would choose `right`.
The key insight: The optimal value of a state is the value of the best action from that state. We don’t need to average over a policy---we just take the maximum.
This leads to a fundamental shift in the Bellman equation:
The Bellman Optimality Equation for $v_*$
The optimal value of a state is the maximum expected return achievable by taking the best action.
Let’s derive this from the definition of optimal value. The optimal value function is defined as:

$$v_*(s) = \max_\pi v_\pi(s)$$

For any policy $\pi$, we have:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a)$$

The maximum over policies is achieved by a policy that puts all probability on the best action:

$$v_*(s) = \max_a q_*(s, a)$$

Substituting the definition of $q_*$:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$
The equation says: “The optimal value of state $s$ is achieved by taking the action that maximizes immediate reward plus discounted future value.”

Notice that we don’t need to know the optimal policy! The $\max$ operator finds the optimal action for us. Wherever $v_*$ appears on the right-hand side, we assume optimal behavior from that point on.
The Bellman Optimality Equation for $q_*$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$

The optimal action-value equals the expected immediate reward plus the discounted optimal value of the next state.
Compare the two optimality equations:

- For $v_*$: the $\max$ is on the outside; we are choosing which action to take in state $s$
- For $q_*$: the $\max$ is on the inside; we have already taken action $a$, and we assume optimal action selection in the next state

This distinction matters for algorithms. Q-learning updates estimates of $q_*$ directly; value iteration updates estimates of $v_*$ directly.

The placement of $\max$ reflects what has already been decided. So we have the fundamental relationships:

$$v_*(s) = \max_a q_*(s, a) \qquad\qquad q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$
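These relationships can be checked numerically. Below is a minimal sketch, assuming a hypothetical MDP format where `transitions[(s, a)]` lists `(next_state, probability, reward)` triples; the state names, actions, and successor values are all made up for illustration:

```python
# Sanity-checking v*(s) = max_a q*(s, a) and q*(s, a) = E[r + gamma * v*(s')].
# Hypothetical MDP format: transitions[(s, a)] -> list of (s_next, prob, reward).
GAMMA = 0.9

transitions = {
    ("s0", "a1"): [("s1", 1.0, 1.0)],
    ("s0", "a2"): [("s2", 0.5, 0.0), ("s1", 0.5, 2.0)],
}

def q_from_v(s, a, v):
    # q*(s, a): expected immediate reward plus discounted optimal next value
    return sum(prob * (r + GAMMA * v[s_next])
               for s_next, prob, r in transitions[(s, a)])

def v_from_q(s, actions, v):
    # v*(s): the best action's Q-value
    return max(q_from_v(s, a, v) for a in actions)

v_star = {"s1": 3.0, "s2": 8.0}  # assumed known optimal values of successors
print(q_from_v("s0", "a1", v_star))          # 1 + 0.9 * 3 = 3.7
print(q_from_v("s0", "a2", v_star))          # averages the two outcomes
print(v_from_q("s0", ["a1", "a2"], v_star))  # the larger of the two Q-values
```

Note that `v_from_q` never consults a policy: the $\max$ alone picks the action.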
The Circular Problem---Solved
Here’s a puzzle that seems to make optimal control impossible:
The Circular Problem:
- To find the optimal policy $\pi_*$, you need to know $v_*$
- To compute $v_*$, you seemingly need to follow the optimal policy $\pi_*$
How can we break this cycle?
The Bellman optimality equation elegantly solves this circularity: the $\max$ operator embeds the optimal policy directly into the equation.
Think about it: if we have $v_*(s')$ for all successor states $s'$, then:

$$\pi_*(s) = \arg\max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

The optimal policy just picks the action that achieves the maximum. We don’t need to know it in advance; it falls out of solving for $v_*$.
This is why the Bellman optimality equations are so powerful: they define what optimal values look like without requiring us to specify the optimal policy. We can solve for $v_*$ first, then extract $\pi_*$ by acting greedily.
Visualizing with Backup Diagrams
The backup diagram for optimality equations looks similar to expectation diagrams, but with a crucial difference.
Key difference from expectation backup:
- Instead of summing over actions weighted by the policy probabilities $\pi(a \mid s)$
- We take the $\max$ over actions

The arc labeled “MAX” indicates that we select the best action rather than averaging over actions.
For $q_*$, the action is already chosen (we’re at the action node). The $\max$ is over the next action $a'$ at each successor state.
A Worked Example
Let’s compute optimal values for a simple decision problem.
An agent starts at state S and can go left or right:
Setup:
- States: S (start), L (left, terminal), R (right, terminal)
- Actions: `left`, `right` from S only
- Rewards: +5 for reaching L, +10 for reaching R
- Discount: $\gamma = 0.9$
Computing Optimal Values:
Terminal states have their reward as value:

$$v_*(L) = 5 \qquad v_*(R) = 10$$

For state S, apply the Bellman optimality equation:

$$v_*(S) = \max\bigl\{\gamma\, v_*(L),\ \gamma\, v_*(R)\bigr\} = \max\{0.9 \times 5,\ 0.9 \times 10\} = \max\{4.5,\ 9.0\} = 9.0$$

The optimal action is `right`, and the optimal value is $v_*(S) = 9.0$.
Computing Optimal Q-Values:

$$q_*(S, \texttt{left}) = \gamma\, v_*(L) = 4.5 \qquad q_*(S, \texttt{right}) = \gamma\, v_*(R) = 9.0$$

The optimal policy is:

$$\pi_*(S) = \texttt{right}$$
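The arithmetic can be verified in a few lines; this sketch uses $\gamma = 0.9$ and the reward-as-terminal-value convention from the example above:

```python
# Verifying the worked example: gamma = 0.9, terminal values v*(L) = 5, v*(R) = 10.
GAMMA = 0.9
v_star = {"L": 5.0, "R": 10.0}

q_left = GAMMA * v_star["L"]    # q*(S, left)  = gamma * v*(L)
q_right = GAMMA * v_star["R"]   # q*(S, right) = gamma * v*(R)
v_S = max(q_left, q_right)      # v*(S) = max_a q*(S, a)
best = "right" if q_right > q_left else "left"

print(q_left, q_right, v_S, best)
```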
Why Optimality Equations Are Nonlinear
Unlike the Bellman expectation equations, the optimality equations are nonlinear due to the $\max$ operator. This has important implications for how we solve them.
Consider the expectation equation in matrix form:

$$v_\pi = r_\pi + \gamma P_\pi v_\pi$$

This is linear in $v_\pi$; we can solve it by matrix inversion:

$$v_\pi = (I - \gamma P_\pi)^{-1} r_\pi$$

The optimality equation cannot be written this way because $\max$ is not a linear operation:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

The action that achieves the maximum depends on the current values, creating a nonlinear relationship.

Why does this matter? Because we can’t just solve a linear system to find optimal values.
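The linear case can be checked directly. A two-state sketch with made-up transition probabilities and rewards (the matrix `P` and vector `r` are illustrative, not from the text):

```python
import numpy as np

# For a fixed policy, v = r + gamma * P v is linear, so v = (I - gamma * P)^(-1) r.
gamma = 0.9
P = np.array([[0.5, 0.5],    # P_pi: state-transition matrix under the policy
              [0.0, 1.0]])
r = np.array([1.0, 0.0])     # r_pi: expected one-step reward in each state

v = np.linalg.solve(np.eye(2) - gamma * P, r)

# The solution satisfies the Bellman expectation equation exactly:
assert np.allclose(v, r + gamma * P @ v)
print(v)
```

No analogous closed form exists for the optimality equation: the $\max$ couples which action is “best” to the very values being solved for.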
Instead, we need iterative methods:
- Start with any initial guess for $v$
- Apply the Bellman optimality operator repeatedly
- The values converge to the true optimum $v_*$
This is the basis of Value Iteration, which we’ll study in the Dynamic Programming chapter.
Extracting the Optimal Policy
Once we have $v_*$ or $q_*$, extracting the optimal policy is straightforward.
```python
def extract_policy_from_v(V_star, mdp, gamma=0.99):
    """
    Extract the optimal policy from the optimal value function V*.

    Args:
        V_star: Dictionary mapping states to optimal values
        mdp: MDP with transitions and reward methods
        gamma: Discount factor

    Returns:
        policy: Dictionary mapping states to best actions
    """
    policy = {}
    for s in mdp.states:
        if mdp.is_terminal(s):
            continue
        best_action = None
        best_value = float('-inf')
        for a in mdp.actions(s):
            # Compute Q*(s, a) from V*
            action_value = mdp.reward(s, a)
            for s_next, prob in mdp.transitions(s, a):
                action_value += gamma * prob * V_star.get(s_next, 0.0)
            if action_value > best_value:
                best_value = action_value
                best_action = a
        policy[s] = best_action
    return policy
```
```python
def extract_policy_from_q(Q_star, mdp):
    """
    Extract the optimal policy from the optimal Q-function.
    This is even simpler!
    """
    policy = {}
    for s in mdp.states:
        if mdp.is_terminal(s):
            continue
        # Just pick the action with the highest Q-value
        best_action = max(
            mdp.actions(s),
            key=lambda a: Q_star.get((s, a), float('-inf'))
        )
        policy[s] = best_action
    return policy
```

This is why Q-learning is so popular: if you learn $q_*$ directly, the optimal policy is simply $\pi_*(s) = \arg\max_a q_*(s, a)$. No need to know the MDP dynamics!
Comparing Expectation and Optimality
Let’s see how the same MDP has different values under a fixed policy versus the optimal policy:
Consider a GridWorld where the agent can move in four directions:
Random policy (25% each direction):
With $\gamma < 1$, the agent wanders randomly, taking many steps on average to reach the goal, so its return is heavily discounted.
Optimal policy (right, then up):
The agent goes directly to the goal, collecting the reward after only a few steps.

The optimal value is nearly twice as high as the random policy’s because the agent doesn’t waste time wandering.
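The gap can be reproduced on a hypothetical 1D corridor (the environment, `step` function, and policies below are invented for illustration, not the GridWorld from the text):

```python
# Hypothetical 1D corridor: states 0..3, terminal goal at state 3,
# reward +1 on arrival, gamma = 0.9. Compare a random policy to going right.
GAMMA = 0.9
N = 4

def step(s, a):
    # a is -1 (left) or +1 (right); walls clamp the position
    s_next = min(max(s + a, 0), N - 1)
    reward = 1.0 if s_next == N - 1 else 0.0
    return s_next, reward

def evaluate(policy, iters=500):
    # Iterative policy evaluation; policy(s) -> list of (action, probability)
    v = [0.0] * N
    for _ in range(iters):
        new = [0.0] * N
        for s in range(N - 1):  # state 3 is terminal, its value stays 0
            for a, p in policy(s):
                s_next, r = step(s, a)
                new[s] += p * (r + GAMMA * v[s_next])
        v = new
    return v

v_random = evaluate(lambda s: [(-1, 0.5), (+1, 0.5)])
v_direct = evaluate(lambda s: [(+1, 1.0)])
print(v_random, v_direct)  # the direct policy dominates in every state
```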
Implementation: Value Iteration
Here’s a basic implementation of value iteration, which repeatedly applies the Bellman optimality operator:
```python
def value_iteration(mdp, gamma=0.99, theta=1e-6, max_iterations=1000):
    """
    Find optimal values using the Bellman optimality equation.

    Args:
        mdp: MDP with states, actions, transitions, reward methods
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum iterations

    Returns:
        V_star: Dictionary of optimal values
        policy: Dictionary of optimal actions
    """
    # Initialize values arbitrarily
    V = {s: 0.0 for s in mdp.states}

    for iteration in range(max_iterations):
        delta = 0.0
        for s in mdp.states:
            if mdp.is_terminal(s):
                continue
            old_value = V[s]
            # Apply the Bellman optimality operator
            action_values = []
            for a in mdp.actions(s):
                q_value = mdp.reward(s, a)
                for s_next, prob in mdp.transitions(s, a):
                    q_value += gamma * prob * V.get(s_next, 0.0)
                action_values.append(q_value)
            V[s] = max(action_values) if action_values else 0.0
            delta = max(delta, abs(V[s] - old_value))
        if delta < theta:
            print(f"Value iteration converged in {iteration + 1} iterations")
            break

    # Extract the optimal policy
    policy = extract_policy_from_v(V, mdp, gamma)
    return V, policy
```

Key points:
- We don’t need to specify a policy; the $\max$ finds the best action
- Convergence is guaranteed for $\gamma < 1$
- After convergence, we extract the policy by acting greedily with respect to $v_*$
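As a usage sketch, the backup loop can be run by hand on the worked example from earlier (states S/L/R, $\gamma = 0.9$), without assuming any particular MDP interface:

```python
# Running the Bellman optimality backup on the S/L/R example
# (gamma = 0.9; terminals keep their reward as their value).
GAMMA = 0.9
v = {"S": 0.0, "L": 5.0, "R": 10.0}
moves = {"left": "L", "right": "R"}  # deterministic transitions from S

for i in range(100):
    old = v["S"]
    v["S"] = max(GAMMA * v[target] for target in moves.values())
    if abs(v["S"] - old) < 1e-9:
        break

best_action = max(moves, key=lambda a: GAMMA * v[moves[a]])
print(v["S"], best_action)  # converges to v*(S) = 9.0 with action "right"
```

With only one non-terminal state the backup converges after a single sweep; larger MDPs need more sweeps but the loop is the same.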
Summary
The Bellman optimality equations define what optimal behavior looks like:

$$v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,\bigl[r + \gamma\, v_*(s')\bigr]$$

$$q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,\Bigl[r + \gamma \max_{a'} q_*(s', a')\Bigr]$$
Key insights:
- The $\max$ operator replaces policy averaging, finding the best action automatically
- Optimality equations are nonlinear, requiring iterative solution methods
- Once we have $v_*$ or $q_*$, the optimal policy is just greedy action selection
- The circular problem (you need $v_*$ to find $\pi_*$, and seemingly need $\pi_*$ to find $v_*$) is elegantly solved
The Bellman equations tell us what optimal values must satisfy, but not how to find them. In the next section, we’ll explore why this recursive structure is so powerful---and how the concept of bootstrapping enables practical algorithms.