Markov Decision Processes • Part 2 of 3
📝Draft

Bellman Optimality Equations

The equations that define optimal behavior

Bellman Optimality Equations

The Bellman optimality equations describe the best possible values---what you’d achieve if you acted perfectly. They’re the target that RL algorithms try to solve, and they contain a beautiful insight: you can define optimal behavior without explicitly specifying an optimal policy.

From Expectation to Optimality

The Bellman expectation equations average over actions according to a policy π. But what if we want to find the best policy?

Consider two policies at the same state:

  • Policy A: Takes action left with value 5
  • Policy B: Takes action right with value 8

Clearly Policy B is better at this state. The optimal policy would choose right.

The key insight: The optimal value of a state is the value of the best action from that state. We don’t need to average over a policy---we just take the maximum.

This leads to a fundamental shift in the Bellman equation:

Expectation Equation (average over policy):

V^\pi(s) = \sum_a \pi(a|s) [\ldots]

Optimality Equation (maximum over actions):

V^*(s) = \max_a [\ldots]
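The shift can be checked numerically with the left/right example above. This is a minimal sketch; the 50/50 policy is an illustrative assumption:

```python
# Action values at one state, from the left/right example: left = 5, right = 8.
q_values = {"left": 5.0, "right": 8.0}

# Expectation equation: weight each action by a policy (here, an assumed 50/50 split).
policy = {"left": 0.5, "right": 0.5}
v_pi = sum(policy[a] * q_values[a] for a in q_values)  # 6.5

# Optimality equation: take the maximum instead of averaging.
v_star = max(q_values.values())  # 8.0
```

The averaged value (6.5) is dragged down by the inferior action; the max (8.0) commits to the better one.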

The Bellman Optimality Equation for V*

📖Bellman Optimality Equation for V*

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The optimal value of a state is the maximum expected return achievable by taking the best action.

Mathematical Details

Let’s derive this from the definition of optimal value. The optimal value function is defined as:

V^*(s) = \max_\pi V^\pi(s)

For any policy π, we have:

V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)

The maximum over policies is achieved by a policy that puts all probability on the best action:

V^*(s) = \max_a Q^*(s,a)

Substituting the definition of Q^*(s,a):

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The equation says: “The optimal value of state s is achieved by taking the action that maximizes immediate reward plus discounted future value.”

Notice that we don’t need to know the optimal policy! The max operator finds the optimal action for us. Wherever V^* appears on the right side, we assume future optimal behavior.
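As a concrete check, here is one application of this equation at a single state, using a small made-up model (all rewards, transition probabilities, and successor values below are illustrative assumptions):

```python
gamma = 0.9

# Hypothetical model for one state s with two actions a1 and a2.
R = {"a1": 1.0, "a2": 0.0}              # immediate rewards R(s, a)
P = {"a1": {"s1": 0.8, "s2": 0.2},      # transition probabilities P(s'|s, a)
     "a2": {"s1": 0.1, "s2": 0.9}}
V_next = {"s1": 2.0, "s2": 10.0}        # assumed optimal values V*(s')

# V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V*(s') ]
backups = {a: R[a] + gamma * sum(p * V_next[sp] for sp, p in P[a].items())
           for a in R}
v_s = max(backups.values())  # a2 wins: 0.0 + 0.9 * (0.1*2 + 0.9*10) = 8.28
```

Even though a1 has the higher immediate reward, a2's likely transition to the high-value successor makes it the optimal choice.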

The Bellman Optimality Equation for Q*

📖Bellman Optimality Equation for Q*

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')

The optimal action-value equals the immediate reward plus the discounted optimal value of the next state.

Compare the two optimality equations:

For V*: the max is on the outside---we’re choosing which action to take in state s.

For Q*: the max is on the inside---we’ve already taken action a, and we assume optimal action selection in the next state.

This distinction matters for algorithms: Q-learning updates estimates of Q^* directly, while value iteration updates estimates of V^*.

Mathematical Details

The placement of \max reflects what’s already been decided:

V^*(s) = \max_a \Big[ \underbrace{R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')}_{\text{This is } Q^*(s,a)} \Big]

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \underbrace{\max_{a'} Q^*(s',a')}_{\text{This is } V^*(s')}

So we have the fundamental relationships:

  • V^*(s) = \max_a Q^*(s,a)
  • Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')
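A quick numeric sketch of both relationships; the Q-values and the one-successor model below are made up for illustration:

```python
gamma = 0.9

# Hypothetical Q* values at one state s.
Q_star = {"left": 4.5, "right": 9.0}

# Relationship 1: V*(s) = max_a Q*(s, a)
v_s = max(Q_star.values())  # 9.0

# Relationship 2: Q*(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V*(s').
# With an assumed deterministic transition (probability 1) to a successor
# worth V* = 10 and zero immediate reward:
q_right = 0.0 + gamma * 1.0 * 10.0  # recovers Q*(s, "right")
```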

The Circular Problem---Solved

Here’s a puzzle that seems to make optimal control impossible: to compute optimal values, you need to know which actions the optimal policy takes; but to identify the optimal policy, you need to compare actions by their optimal values.

The Bellman optimality equation elegantly solves this circularity. The \max operator embeds the optimal policy directly into the equation.

Think about it: if we have V^*(s') for all successor states, then:

\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The optimal policy just picks the action that achieves the maximum. We don’t need to know it in advance---it falls out of solving for V^*.

💡Tip

This is why the Bellman optimality equations are so powerful: they define what optimal values look like without requiring us to specify the optimal policy. We can solve for V^* first, then extract \pi^* by acting greedily.

Visualizing with Backup Diagrams

The backup diagram for optimality equations looks similar to expectation diagrams, but with a crucial difference.

📌Backup Diagram for V*
[Diagram: the root node V^*(s) branches under an arc labeled MAX to actions a_1 and a_2; a_1 leads to successors s'_1 and s'_2, and a_2 leads to s'_3.]

Key difference from expectation backup:

  • Instead of summing over actions with policy weights \pi(a|s)
  • We take the max over actions

The arc labeled “MAX” indicates we select the best action, not average over them.

📌Backup Diagram for Q*
[Diagram: the root node is the pair (s, a); taking action a leads to successors s'_1 and s'_2, each with an arc labeled MAX over the next action.]

For Q^*, the action is already chosen (we’re at the (s, a) node). The MAX is over the next action a' at each successor state.

A Worked Example

Let’s compute optimal values for a simple decision problem.

📌Two-Path Choice

An agent starts at state S and can go left or right:

[Diagram: L (+5) <--left-- S --right--> R (+10)]

Setup:

  • States: S (start), L (left, terminal), R (right, terminal)
  • Actions: left, right from S only
  • Rewards: +5 for reaching L, +10 for reaching R
  • Discount: \gamma = 0.9

Computing Optimal Values:

Terminal states have their reward as value:

  • V^*(L) = 5
  • V^*(R) = 10

For state S, apply the Bellman optimality equation:

V^*(S) = \max\left\{ \underbrace{0 + 0.9 \times 5}_{\text{go left}},\ \underbrace{0 + 0.9 \times 10}_{\text{go right}} \right\}

V^*(S) = \max\{4.5, 9.0\} = 9.0

The optimal action is right, and the optimal value is 9.0.

Computing Optimal Q-Values:

  • Q^*(S, \text{left}) = 0 + 0.9 \times 5 = 4.5
  • Q^*(S, \text{right}) = 0 + 0.9 \times 10 = 9.0

The optimal policy is: \pi^*(S) = \text{right}
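These hand calculations can be reproduced in a few lines. This is a sketch using the example's values; state and action names match the diagram:

```python
gamma = 0.9

# Terminal values from the two-path example.
V_star = {"L": 5.0, "R": 10.0}

# Q*(S, a) = immediate reward (0 here) + gamma * V*(successor)
Q = {"left": 0.0 + gamma * V_star["L"],    # 4.5
     "right": 0.0 + gamma * V_star["R"]}   # 9.0

v_S = max(Q.values())       # V*(S)
pi_S = max(Q, key=Q.get)    # optimal action
```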

Why Optimality Equations Are Nonlinear

Mathematical Details

Consider the expectation equation in matrix form:

\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi

This is linear in \mathbf{v}^\pi---we can solve it by matrix inversion.
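For a concrete two-state example (the rewards and transition matrix below are made up), the linear system can be solved directly; here via Cramer's rule to keep it dependency-free:

```python
gamma = 0.9

# Hypothetical policy-induced model: reward vector r^pi and transition matrix P^pi.
r = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]

# Rearranged: (I - gamma * P) v = r
A = [[1.0 - gamma * P[0][0], -gamma * P[0][1]],
     [-gamma * P[1][0], 1.0 - gamma * P[1][1]]]

# Solve the 2x2 system with Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
v = [(r[0] * A[1][1] - A[0][1] * r[1]) / det,
     (A[0][0] * r[1] - A[1][0] * r[0]) / det]
# v now satisfies the expectation equation v = r + gamma * P v exactly.
```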

The optimality equation cannot be written this way because \max is not a linear operation:

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The action that achieves the maximum depends on the current values, creating a nonlinear relationship.

Why does this matter? Because we can’t just solve a linear system to find optimal values.

Instead, we need iterative methods:

  1. Start with any initial guess for V^*
  2. Apply the Bellman optimality operator repeatedly
  3. The values converge to the true optimum

This is the basis of Value Iteration, which we’ll study in the Dynamic Programming chapter.

Extracting the Optimal Policy

Once we have V^* or Q^*, extracting the optimal policy is straightforward.

</>Implementation
def extract_policy_from_v(V_star, mdp, gamma=0.99):
    """
    Extract optimal policy from optimal value function V*.

    Args:
        V_star: Dictionary mapping states to optimal values
        mdp: MDP with transitions and reward methods
        gamma: Discount factor

    Returns:
        policy: Dictionary mapping states to best actions
    """
    policy = {}

    for s in mdp.states:
        if mdp.is_terminal(s):
            continue

        best_action = None
        best_value = float('-inf')

        for a in mdp.actions(s):
            # Compute Q*(s, a) from V*
            action_value = mdp.reward(s, a)
            for s_next, prob in mdp.transitions(s, a):
                action_value += gamma * prob * V_star.get(s_next, 0.0)

            if action_value > best_value:
                best_value = action_value
                best_action = a

        policy[s] = best_action

    return policy


def extract_policy_from_q(Q_star, mdp):
    """
    Extract optimal policy from optimal Q-function.
    This is even simpler!
    """
    policy = {}

    for s in mdp.states:
        if mdp.is_terminal(s):
            continue

        # Just pick the action with highest Q-value
        best_action = max(
            mdp.actions(s),
            key=lambda a: Q_star.get((s, a), float('-inf'))
        )
        policy[s] = best_action

    return policy
💡Tip

This is why Q-learning is so popular: if you learn Q^* directly, the optimal policy is simply \pi^*(s) = \arg\max_a Q^*(s, a). No need to know the MDP dynamics!

Comparing Expectation and Optimality

Let’s see how the same MDP has different values under a policy versus optimal:

📌Policy vs Optimal Values

Consider a GridWorld where the agent can move in four directions:

[Diagram: a grid with start state S and a goal worth +10 two moves away.]

Random policy (25% each direction):

With \gamma = 0.9, the agent wanders randomly. V^\pi(S) \approx 4.3 (it takes many steps on average).

Optimal policy (right, then up):

The agent goes directly to the goal. V^*(S) = 0.9^2 \times 10 = 8.1.

The optimal value is nearly twice as high because it doesn’t waste time wandering.

Implementation: Value Iteration

</>Implementation

Here’s a basic implementation of value iteration, which repeatedly applies the Bellman optimality operator:

def value_iteration(mdp, gamma=0.99, theta=1e-6, max_iterations=1000):
    """
    Find optimal values using the Bellman optimality equation.

    Args:
        mdp: MDP with states, actions, transitions, reward methods
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum iterations

    Returns:
        V_star: Dictionary of optimal values
        policy: Dictionary of optimal actions
    """
    # Initialize values arbitrarily
    V = {s: 0.0 for s in mdp.states}

    for iteration in range(max_iterations):
        delta = 0.0

        for s in mdp.states:
            if mdp.is_terminal(s):
                continue

            old_value = V[s]

            # Apply Bellman optimality operator
            action_values = []
            for a in mdp.actions(s):
                q_value = mdp.reward(s, a)
                for s_next, prob in mdp.transitions(s, a):
                    q_value += gamma * prob * V.get(s_next, 0.0)
                action_values.append(q_value)

            V[s] = max(action_values) if action_values else 0.0
            delta = max(delta, abs(V[s] - old_value))

        if delta < theta:
            print(f"Value iteration converged in {iteration + 1} iterations")
            break

    # Extract optimal policy
    policy = extract_policy_from_v(V, mdp, gamma)

    return V, policy

Key points:

  • We don’t need to specify a policy---the \max finds the best action
  • Convergence is guaranteed for \gamma < 1
  • After convergence, we extract the policy by acting greedily with respect to V^*
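To see the values propagate backward over sweeps, here is a minimal self-contained sketch of the same iteration on a made-up three-state chain (state names, actions, and rewards are illustrative, not the `mdp` interface above):

```python
gamma = 0.9

# Tiny chain: from s0, "exit" ends the episode for +2, or "advance" moves to s1;
# from s1, "advance" reaches the terminal goal for +10.
transitions = {
    ("s0", "exit"): ("goal", 2.0),      # (successor, reward)
    ("s0", "advance"): ("s1", 0.0),
    ("s1", "advance"): ("goal", 10.0),
}
actions = {"s0": ["exit", "advance"], "s1": ["advance"]}
V = {"s0": 0.0, "s1": 0.0, "goal": 0.0}

# Sweep until the Bellman optimality operator stops changing the values.
for sweep in range(100):
    delta = 0.0
    for s in ("s0", "s1"):
        candidates = []
        for a in actions[s]:
            sp, r = transitions[(s, a)]
            candidates.append(r + gamma * V[sp])
        new_v = max(candidates)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break
```

On the first sweep s0 prefers the immediate +2; once V(s1) reflects the +10 goal, the second sweep flips s0 to "advance" (value 0.9 × 10 = 9).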

Summary

The Bellman optimality equations define what optimal behavior looks like:

  • V^*(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')]
  • Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')

Key insights:

  • The \max operator replaces policy averaging, finding the best action automatically
  • Optimality equations are nonlinear, requiring iterative solution methods
  • Once we have V^* or Q^*, the optimal policy is just greedy action selection
  • The circular problem (need \pi^* to find V^*, need V^* to find \pi^*) is elegantly solved
ℹ️Note

The Bellman equations tell us what optimal values must satisfy, but not how to find them. In the next section, we’ll explore why this recursive structure is so powerful---and how the concept of bootstrapping enables practical algorithms.