Markov Decision Processes • Part 2 of 3
📝Draft

Bellman Optimality Equations

The equations that define optimal behavior

Bellman Optimality Equations

The Bellman optimality equations describe the best possible values---what you’d achieve if you acted perfectly. They’re the target that RL algorithms try to solve, and they contain a beautiful insight: you can define optimal behavior without explicitly specifying an optimal policy.

From Expectation to Optimality

The Bellman expectation equations average over actions according to a policy π. But what if we want to find the best policy?

Consider two policies at the same state:

  • Policy A: Takes action left with value 5
  • Policy B: Takes action right with value 8

Clearly Policy B is better at this state. The optimal policy would choose right.

The key insight: The optimal value of a state is the value of the best action from that state. We don’t need to average over a policy---we just take the maximum.

This leads to a fundamental shift in the Bellman equation:

Expectation Equation (average over policy):

V^\pi(s) = \sum_a \pi(a|s) [\ldots]

Optimality Equation (maximum over actions):

V^*(s) = \max_a [\ldots]
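The shift can be checked numerically with the left/right example above. This is a minimal sketch; the 50/50 policy is an illustrative assumption:

```python
# Action values at one state, from the left/right example: left = 5, right = 8.
q_values = {"left": 5.0, "right": 8.0}

# Expectation equation: weight each action by a policy (here, an assumed 50/50 split).
policy = {"left": 0.5, "right": 0.5}
v_pi = sum(policy[a] * q_values[a] for a in q_values)  # 6.5

# Optimality equation: take the maximum instead of averaging.
v_star = max(q_values.values())  # 8.0
```

The averaged value (6.5) is dragged down by the inferior action; the max (8.0) commits to the better one.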

The Bellman Optimality Equation for V*

📖Bellman Optimality Equation for V*

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The optimal value of a state is the maximum expected return achievable by taking the best action.

Mathematical Details

Let’s derive this from the definition of optimal value. The optimal value function is defined as:

V^*(s) = \max_\pi V^\pi(s)

For any policy π, we have:

V^\pi(s) = \sum_a \pi(a|s) Q^\pi(s,a)

The maximum over policies is achieved by a policy that puts all probability on the best action:

V^*(s) = \max_a Q^*(s,a)

Substituting the definition of Q^*(s,a):

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The equation says: “The optimal value of state s is achieved by taking the action that maximizes immediate reward plus discounted future value.”

Notice that we don’t need to know the optimal policy! The max operator finds the optimal action for us. Wherever V^* appears on the right side, we assume future optimal behavior.
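As a concrete check, here is one application of this equation at a single state, using a small made-up model (all rewards, transition probabilities, and successor values below are illustrative assumptions):

```python
gamma = 0.9

# Hypothetical model for one state s with two actions a1 and a2.
R = {"a1": 1.0, "a2": 0.0}              # immediate rewards R(s, a)
P = {"a1": {"s1": 0.8, "s2": 0.2},      # transition probabilities P(s'|s, a)
     "a2": {"s1": 0.1, "s2": 0.9}}
V_next = {"s1": 2.0, "s2": 10.0}        # assumed optimal values V*(s')

# V*(s) = max_a [ R(s, a) + gamma * sum_{s'} P(s'|s, a) * V*(s') ]
backups = {a: R[a] + gamma * sum(p * V_next[sp] for sp, p in P[a].items())
           for a in R}
v_s = max(backups.values())  # a2 wins: 0.0 + 0.9 * (0.1*2 + 0.9*10) = 8.28
```

Even though a1 has the higher immediate reward, a2's likely transition to the high-value successor makes it the optimal choice.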

The Bellman Optimality Equation for Q*

📖Bellman Optimality Equation for Q*

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')

The optimal action-value equals the immediate reward plus the discounted optimal value of the next state.

Compare the two optimality equations:

For V*: the max is on the outside---we’re choosing which action to take in state s.

For Q*: the max is on the inside---we’ve already taken action a, and we assume optimal action selection in the next state.

This distinction matters for algorithms: Q-learning updates estimates of Q^* directly, while value iteration updates estimates of V^*.

Mathematical Details

The placement of \max reflects what’s already been decided:

V^*(s) = \max_a \Big[ \underbrace{R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')}_{\text{This is } Q^*(s,a)} \Big]

Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \underbrace{\max_{a'} Q^*(s',a')}_{\text{This is } V^*(s')}

So we have the fundamental relationships:

  • V^*(s) = \max_a Q^*(s,a)
  • Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')
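A quick numeric sketch of both relationships; the Q-values and the one-successor model below are made up for illustration:

```python
gamma = 0.9

# Hypothetical Q* values at one state s.
Q_star = {"left": 4.5, "right": 9.0}

# Relationship 1: V*(s) = max_a Q*(s, a)
v_s = max(Q_star.values())  # 9.0

# Relationship 2: Q*(s, a) = R(s, a) + gamma * sum_{s'} P(s'|s, a) V*(s').
# With an assumed deterministic transition (probability 1) to a successor
# worth V* = 10 and zero immediate reward:
q_right = 0.0 + gamma * 1.0 * 10.0  # recovers Q*(s, "right")
```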

The Circular Problem---Solved

Here’s a puzzle that seems to make optimal control impossible: to compute optimal values, you need to know which actions the optimal policy takes; but to identify the optimal policy, you need to compare actions by their optimal values.

The Bellman optimality equation elegantly solves this circularity. The \max operator embeds the optimal policy directly into the equation.

Think about it: if we have V^*(s') for all successor states, then:

\pi^*(s) = \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The optimal policy just picks the action that achieves the maximum. We don’t need to know it in advance---it falls out of solving for V^*.

💡Tip

This is why the Bellman optimality equations are so powerful: they define what optimal values look like without requiring us to specify the optimal policy. We can solve for V^* first, then extract \pi^* by acting greedily.

Visualizing with Backup Diagrams

The backup diagram for optimality equations looks similar to expectation diagrams, but with a crucial difference.

📌Backup Diagram for V*
[Diagram: the root node V^*(s) branches under an arc labeled MAX to actions a_1 and a_2; a_1 leads to successors s'_1 and s'_2, and a_2 leads to s'_3.]

Key difference from expectation backup:

  • Instead of summing over actions with policy weights \pi(a|s)
  • We take the max over actions

The arc labeled “MAX” indicates we select the best action, not average over them.

📌Backup Diagram for Q*
[Diagram: the root node is the pair (s, a); taking action a leads to successors s'_1 and s'_2, each with an arc labeled MAX over the next action.]

For Q^*, the action is already chosen (we’re at the (s, a) node). The MAX is over the next action a' at each successor state.

A Worked Example

Let’s compute optimal values for a simple decision problem.

📌Two-Path Choice

An agent starts at state S and can go left or right:

[Diagram: L (+5) <--left-- S --right--> R (+10)]

Setup:

  • States: S (start), L (left, terminal), R (right, terminal)
  • Actions: left, right from S only
  • Rewards: +5 for reaching L, +10 for reaching R
  • Discount: \gamma = 0.9

Computing Optimal Values:

Terminal states have their reward as value:

  • V^*(L) = 5
  • V^*(R) = 10

For state S, apply the Bellman optimality equation:

V^*(S) = \max\left\{ \underbrace{0 + 0.9 \times 5}_{\text{go left}},\ \underbrace{0 + 0.9 \times 10}_{\text{go right}} \right\}

V^*(S) = \max\{4.5, 9.0\} = 9.0

The optimal action is right, and the optimal value is 9.0.

Computing Optimal Q-Values:

  • Q^*(S, \text{left}) = 0 + 0.9 \times 5 = 4.5
  • Q^*(S, \text{right}) = 0 + 0.9 \times 10 = 9.0

The optimal policy is: \pi^*(S) = \text{right}
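These hand calculations can be reproduced in a few lines. This is a sketch using the example's values; state and action names match the diagram:

```python
gamma = 0.9

# Terminal values from the two-path example.
V_star = {"L": 5.0, "R": 10.0}

# Q*(S, a) = immediate reward (0 here) + gamma * V*(successor)
Q = {"left": 0.0 + gamma * V_star["L"],    # 4.5
     "right": 0.0 + gamma * V_star["R"]}   # 9.0

v_S = max(Q.values())       # V*(S)
pi_S = max(Q, key=Q.get)    # optimal action
```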

Why Optimality Equations Are Nonlinear

Mathematical Details

Consider the expectation equation in matrix form:

\mathbf{v}^\pi = \mathbf{r}^\pi + \gamma \mathbf{P}^\pi \mathbf{v}^\pi

This is linear in \mathbf{v}^\pi---we can solve it by matrix inversion.
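For a concrete two-state example (the rewards and transition matrix below are made up), the linear system can be solved directly; here via Cramer's rule to keep it dependency-free:

```python
gamma = 0.9

# Hypothetical policy-induced model: reward vector r^pi and transition matrix P^pi.
r = [1.0, 0.0]
P = [[0.5, 0.5],
     [0.0, 1.0]]

# Rearranged: (I - gamma * P) v = r
A = [[1.0 - gamma * P[0][0], -gamma * P[0][1]],
     [-gamma * P[1][0], 1.0 - gamma * P[1][1]]]

# Solve the 2x2 system with Cramer's rule.
det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
v = [(r[0] * A[1][1] - A[0][1] * r[1]) / det,
     (A[0][0] * r[1] - A[1][0] * r[0]) / det]
# v now satisfies the expectation equation v = r + gamma * P v exactly.
```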

The optimality equation cannot be written this way because \max is not a linear operation:

V^*(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s') \right]

The action that achieves the maximum depends on the current values, creating a nonlinear relationship.

Why does this matter? Because we can’t just solve a linear system to find optimal values.

Instead, we need iterative methods:

  1. Start with any initial guess for V^*
  2. Apply the Bellman optimality operator repeatedly
  3. The values converge to the true optimum

This is the basis of Value Iteration, which we’ll study in the Dynamic Programming chapter.

Extracting the Optimal Policy

Once we have V^* or Q^*, extracting the optimal policy is straightforward.

</>Implementation
def extract_policy_from_v(V_star, mdp, gamma=0.99):
    """
    Extract optimal policy from optimal value function V*.

    Args:
        V_star: Dictionary mapping states to optimal values
        mdp: MDP with transitions and reward methods
        gamma: Discount factor

    Returns:
        policy: Dictionary mapping states to best actions
    """
    policy = {}

    for s in mdp.states:
        if mdp.is_terminal(s):
            continue

        best_action = None
        best_value = float('-inf')

        for a in mdp.actions(s):
            # Compute Q*(s, a) from V*
            action_value = mdp.reward(s, a)
            for s_next, prob in mdp.transitions(s, a):
                action_value += gamma * prob * V_star.get(s_next, 0.0)

            if action_value > best_value:
                best_value = action_value
                best_action = a

        policy[s] = best_action

    return policy


def extract_policy_from_q(Q_star, mdp):
    """
    Extract optimal policy from optimal Q-function.
    This is even simpler!
    """
    policy = {}

    for s in mdp.states:
        if mdp.is_terminal(s):
            continue

        # Just pick the action with highest Q-value
        best_action = max(
            mdp.actions(s),
            key=lambda a: Q_star.get((s, a), float('-inf'))
        )
        policy[s] = best_action

    return policy
💡Tip

This is why Q-learning is so popular: if you learn Q^* directly, the optimal policy is simply \pi^*(s) = \arg\max_a Q^*(s, a). No need to know the MDP dynamics!

Comparing Expectation and Optimality

Let’s see how the same MDP has different values under a policy versus optimal:

📌Policy vs Optimal Values

Consider a GridWorld where the agent can move in four directions:

[Diagram: a grid with start state S and a goal worth +10 two moves away.]

Random policy (25% each direction):

With \gamma = 0.9, the agent wanders randomly. V^\pi(S) \approx 4.3 (it takes many steps on average).

Optimal policy (right, then up):

The agent goes directly to the goal. V^*(S) = 0.9^2 \times 10 = 8.1.

The optimal value is nearly twice as high because it doesn’t waste time wandering.

Implementation: Value Iteration

</>Implementation

Here’s a basic implementation of value iteration, which repeatedly applies the Bellman optimality operator:

def value_iteration(mdp, gamma=0.99, theta=1e-6, max_iterations=1000):
    """
    Find optimal values using the Bellman optimality equation.

    Args:
        mdp: MDP with states, actions, transitions, reward methods
        gamma: Discount factor
        theta: Convergence threshold
        max_iterations: Maximum iterations

    Returns:
        V_star: Dictionary of optimal values
        policy: Dictionary of optimal actions
    """
    # Initialize values arbitrarily
    V = {s: 0.0 for s in mdp.states}

    for iteration in range(max_iterations):
        delta = 0.0

        for s in mdp.states:
            if mdp.is_terminal(s):
                continue

            old_value = V[s]

            # Apply Bellman optimality operator
            action_values = []
            for a in mdp.actions(s):
                q_value = mdp.reward(s, a)
                for s_next, prob in mdp.transitions(s, a):
                    q_value += gamma * prob * V.get(s_next, 0.0)
                action_values.append(q_value)

            V[s] = max(action_values) if action_values else 0.0
            delta = max(delta, abs(V[s] - old_value))

        if delta < theta:
            print(f"Value iteration converged in {iteration + 1} iterations")
            break

    # Extract optimal policy
    policy = extract_policy_from_v(V, mdp, gamma)

    return V, policy

Key points:

  • We don’t need to specify a policy---the \max finds the best action
  • Convergence is guaranteed for \gamma < 1
  • After convergence, we extract the policy by acting greedily with respect to V^*
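To see the values propagate backward over sweeps, here is a minimal self-contained sketch of the same iteration on a made-up three-state chain (state names, actions, and rewards are illustrative, not the `mdp` interface above):

```python
gamma = 0.9

# Tiny chain: from s0, "exit" ends the episode for +2, or "advance" moves to s1;
# from s1, "advance" reaches the terminal goal for +10.
transitions = {
    ("s0", "exit"): ("goal", 2.0),      # (successor, reward)
    ("s0", "advance"): ("s1", 0.0),
    ("s1", "advance"): ("goal", 10.0),
}
actions = {"s0": ["exit", "advance"], "s1": ["advance"]}
V = {"s0": 0.0, "s1": 0.0, "goal": 0.0}

# Sweep until the Bellman optimality operator stops changing the values.
for sweep in range(100):
    delta = 0.0
    for s in ("s0", "s1"):
        candidates = []
        for a in actions[s]:
            sp, r = transitions[(s, a)]
            candidates.append(r + gamma * V[sp])
        new_v = max(candidates)
        delta = max(delta, abs(new_v - V[s]))
        V[s] = new_v
    if delta < 1e-6:
        break
```

On the first sweep s0 prefers the immediate +2; once V(s1) reflects the +10 goal, the second sweep flips s0 to "advance" (value 0.9 × 10 = 9).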

Summary

The Bellman optimality equations define what optimal behavior looks like:

  • V^*(s) = \max_a [R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^*(s')]
  • Q^*(s,a) = R(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q^*(s',a')

Key insights:

  • The \max operator replaces policy averaging, finding the best action automatically
  • Optimality equations are nonlinear, requiring iterative solution methods
  • Once we have V^* or Q^*, the optimal policy is just greedy action selection
  • The circular problem (need \pi^* to find V^*, need V^* to find \pi^*) is elegantly solved
ℹ️Note

The Bellman equations tell us what optimal values must satisfy, but not how to find them. In the next section, we’ll explore why this recursive structure is so powerful---and how the concept of bootstrapping enables practical algorithms.