Markov Decision Processes • Part 3 of 3

Why Bellman Matters

Bootstrapping: the key insight behind all RL algorithms


The Bellman equations aren’t just elegant mathematics---they’re the engine that powers all of reinforcement learning. Understanding why they matter is as important as knowing what they say.

The Power of Recursion

The Bellman equations reveal a beautiful property of value functions: values are self-consistent.

📖Self-Consistency of Values

If $V$ is the true value function for a policy $\pi$, then for every state $s$:

$$V(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$$

The value at any state is exactly what you’d compute from its neighbors’ values.

This self-consistency has a remarkable implication: if we find a value function that satisfies the Bellman equation at every state, we’ve found the true value function.

It’s like a giant Sudoku puzzle. Each cell’s value is constrained by its neighbors. If all constraints are satisfied simultaneously, the solution is correct.
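
To make this concrete, here is a minimal sketch (not from the text; it assumes a small tabular MDP whose policy, transition probabilities, and rewards are given as NumPy arrays) that measures how far a candidate value function is from self-consistency:

import numpy as np

def bellman_residual(V, policy, P, R, gamma=0.9):
    """Max gap between V and its one-step Bellman backup.

    V:      (S,)      candidate state values
    policy: (S, A)    action probabilities pi(a|s)
    P:      (S, A, S) transition probabilities P(s'|s,a)
    R:      (S, A)    expected immediate rewards R(s,a)
    """
    # Expected next-state value for each (s, a): sum_s' P(s'|s,a) V(s')
    expected_next = np.einsum("sat,t->sa", P, V)
    # Backup: sum_a pi(a|s) [ R(s,a) + gamma * expected_next(s,a) ]
    backup = np.einsum("sa,sa->s", policy, R + gamma * expected_next)
    return np.max(np.abs(backup - V))

A residual of zero means the candidate satisfies the Bellman equation everywhere and is therefore the true value function for the policy; a large residual points to states whose estimates are still inconsistent with their neighbors.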

Bootstrapping: The Key Insight

📖Bootstrapping

Using an estimate of the value function to update that same estimate. The Bellman equations tell us that values depend on other values---so we can use current estimates to improve our estimates.

The name comes from the phrase “pulling yourself up by your bootstraps.” It seems impossible---how can you improve estimates using those same (potentially wrong) estimates?

Here’s the magic: even if your initial estimates are completely wrong, applying the Bellman equation consistently will drive them toward the truth. Why? Because each update:

  1. Uses the actual reward from the environment (ground truth)
  2. Makes the value locally consistent with neighbors
  3. When all values are locally consistent, they’re globally correct
📌Bootstrapping in Action

Consider three states in a line: A, B, C (terminal, reward +10).

Initial estimates (random):

  • $V(A) = 0$, $V(B) = 0$, $V(C) = 10$

After one Bellman backup (from C toward A, with $\gamma = 0.9$):

  • $V(B) = 0 + 0.9 \times 10 = 9$

After another backup:

  • $V(A) = 0 + 0.9 \times 9 = 8.1$

Chain after the backups: A (8.1) → B (9.0) → C (10)

Value “propagated” from the reward at C backward through the chain. Each update used the current estimate of the next state---that’s bootstrapping.
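
A minimal sketch of these two backups in code (hypothetical, and following the example’s convention that C’s value is held fixed at 10 and intermediate rewards are 0):

gamma = 0.9
V = {"A": 0.0, "B": 0.0, "C": 10.0}   # initial estimates; C is treated as fixed at 10

V["B"] = 0.0 + gamma * V["C"]   # backup from C: V(B) = 0 + 0.9 * 10 = 9.0
V["A"] = 0.0 + gamma * V["B"]   # backup from B: V(A) = 0 + 0.9 * 9 ≈ 8.1
print(V)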

Why Bootstrapping Works

Mathematical Details

The reason bootstrapping converges is that the Bellman operator is a contraction mapping. This means:

  1. There is a unique fixed point: a value function that the operator leaves unchanged
  2. Each application of the operator shrinks the distance to that fixed point
  3. Starting from any initial values, repeated application converges to the fixed point

More formally, define the Bellman operator $\mathcal{T}^\pi$ for a policy $\pi$:

$$(\mathcal{T}^\pi V)(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$$

Then for any two value functions $V_1$ and $V_2$:

$$\|\mathcal{T}^\pi V_1 - \mathcal{T}^\pi V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$

Since $\gamma < 1$, the operator shrinks distances by at least a factor of $\gamma$. After $k$ iterations:

$$\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty \rightarrow 0$$

Think of it like this: no matter how wrong your initial guess is, each iteration shrinks your error by at least a factor of $\gamma$. After enough iterations, you’re arbitrarily close.

This is why you can initialize values to zero (or random numbers) and still converge. The contraction property guarantees it.

💡Tip

The contraction property explains why we need $\gamma < 1$ for infinite-horizon problems. If $\gamma = 1$, the operator might not contract, and convergence isn’t guaranteed (though special cases like episodic tasks can still work).
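
As a rough numerical illustration of the contraction property (a sketch with a made-up 5-state transition matrix P_pi and reward vector R_pi, not anything from the text), you can apply the operator to two different value functions and watch their sup-norm distance shrink by at least a factor of $\gamma$ per application:

import numpy as np

rng = np.random.default_rng(0)
S, gamma = 5, 0.9

# A made-up 5-state Markov chain induced by some fixed policy
P_pi = rng.random((S, S))
P_pi /= P_pi.sum(axis=1, keepdims=True)   # rows sum to 1
R_pi = rng.random(S)

def bellman_op(V):
    # (T^pi V)(s) = R_pi(s) + gamma * sum_s' P_pi(s, s') V(s')
    return R_pi + gamma * P_pi @ V

V1 = rng.normal(size=S) * 100   # two wildly different initial guesses
V2 = rng.normal(size=S) * 100
for k in range(6):
    print(f"iteration {k}: ||V1 - V2||_inf = {np.max(np.abs(V1 - V2)):.4f}")
    V1, V2 = bellman_op(V1), bellman_op(V2)

Because both sequences contract toward the same fixed point $V^\pi$, the printed distance drops geometrically, regardless of the starting guesses.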

Local Updates, Global Solutions

One of the most remarkable aspects of Bellman equations is that local updates lead to global solutions.

Each Bellman backup only looks at immediate neighbors:

  • Current state ss
  • Available actions aa
  • Immediate successors ss'
  • One-step rewards rr

Yet repeated local updates produce globally optimal values. How?

The key is that value information propagates through the state space. When you update state ss, it affects neighbors of ss in the next iteration. Those updates affect their neighbors. Eventually, information from anywhere in the MDP reaches everywhere.

📌Value Propagation in GridWorld

Consider a 4x4 GridWorld with a goal in the corner:

  0     0     0   +10
  0     0     0     0
  0     0     0     0
  S     0     0     0
Initial values (all 0 except goal)

After several iterations of Bellman backups (with $\gamma = 0.9$):

 6.6   7.3   8.1   10
 5.9   6.6   7.3   8.1
 5.3   5.9   6.6   7.3
 4.8   5.3   5.9   6.6
Converged values (forms a gradient toward goal)

Values form a “gradient” that points toward the goal. Following the gradient (moving to higher-valued neighbors) leads to the optimal path.
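
Here is a small value-iteration sketch that produces this kind of gradient on a grid (an illustrative assumption: the goal is absorbing, the +10 is earned on the step that enters it, and all other rewards are 0; the exact numbers depend on that reward convention and may differ slightly from the table above):

import numpy as np

def gridworld_value_iteration(n=4, goal=(0, 3), goal_reward=10.0, gamma=0.9, tol=1e-6):
    """Deterministic n x n GridWorld: each Bellman backup looks only at the four neighbours."""
    V = np.zeros((n, n))
    moves = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    while True:
        V_new = V.copy()
        for i in range(n):
            for j in range(n):
                if (i, j) == goal:
                    continue  # absorbing goal: its value stays 0, the reward is earned on entry
                targets = []
                for di, dj in moves:
                    ni = min(max(i + di, 0), n - 1)   # bumping into a wall = staying in place
                    nj = min(max(j + dj, 0), n - 1)
                    r = goal_reward if (ni, nj) == goal else 0.0
                    targets.append(r + gamma * V[ni, nj])
                V_new[i, j] = max(targets)            # Bellman optimality backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new

print(np.round(gridworld_value_iteration(), 1))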

The Foundation of RL Algorithms

Every major RL algorithm relates to the Bellman equations in some way. Understanding this connection helps you see the unity beneath the diversity of methods.

Dynamic Programming

Key idea: Solve Bellman equations exactly when the MDP model is known.

Policy Evaluation: Apply Bellman expectation equation repeatedly until convergence.

$$V_{k+1}(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s') \right]$$

Value Iteration: Apply Bellman optimality equation repeatedly.

$$V_{k+1}(s) = \max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V_k(s') \right]$$

Both directly implement the Bellman backup as an update rule.

Monte Carlo Methods

Key idea: Estimate values by sampling complete returns, without bootstrapping.

$$V(s) \leftarrow V(s) + \alpha \left( G_t - V(s) \right)$$

Monte Carlo doesn’t use the Bellman equation directly---it estimates the expected return by averaging actual returns. This is not bootstrapping because $G_t$ is the complete return, not an estimate.

Trade-off: No bias (uses true returns) but high variance (returns can vary widely).
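
A minimal sketch of this update (a hypothetical tabular setting, with an episode stored as a list of (state, reward) pairs and V a dict of value estimates):

def mc_update(V, episode, alpha=0.1, gamma=0.99):
    """Every-visit Monte Carlo: nudge each visited state toward its actual sampled return."""
    G = 0.0
    # Walk the episode backwards so G is the discounted return from each step onward
    for state, reward in reversed(episode):
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])   # G is the full return, not an estimate: no bootstrapping
    return V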

Temporal Difference Learning

Key idea: Sample-based approximation of the Bellman equation. This is bootstrapping.

$$V(s) \leftarrow V(s) + \alpha \left( r + \gamma V(s') - V(s) \right)$$

Instead of summing over all actions and next states (like DP), TD samples one transition and updates based on what actually happened. The quantity $r + \gamma V(s')$ is a bootstrap estimate of the true return.

Trade-off: Some bias (uses estimated $V(s')$) but lower variance (uses actual reward).
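
A minimal sketch of one TD(0) step under the same tabular assumptions:

def td0_update(V, s, r, s_next, done, alpha=0.1, gamma=0.99):
    """One TD(0) step: move V(s) toward the bootstrapped target r + gamma * V(s')."""
    target = r if done else r + gamma * V[s_next]   # V(s') is itself an estimate: bootstrapping
    V[s] += alpha * (target - V[s])
    return V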

Q-Learning

Key idea: Learn optimal Q-values via sampled Bellman optimality updates.

$$Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right)$$

This is a sample-based version of the Bellman optimality equation for Q. The $\max$ inside the update is what makes it learn optimal values regardless of the behavior policy.
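
And a corresponding sketch of one Q-learning step (assuming Q is a NumPy array of shape (num_states, num_actions)):

import numpy as np

def q_learning_update(Q, s, a, r, s_next, done, alpha=0.1, gamma=0.99):
    """One Q-learning step: a sampled Bellman optimality backup."""
    target = r if done else r + gamma * np.max(Q[s_next])   # max over a' makes it learn Q* off-policy
    Q[s, a] += alpha * (target - Q[s, a])
    return Q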

ℹ️Note

Notice the pattern: DP uses the full Bellman equation, TD/Q-learning sample it. This is the fundamental trade-off in RL: exact but expensive (DP) vs. approximate but efficient (TD).

Bellman Equations in Deep RL

Even in deep reinforcement learning with neural networks, Bellman equations remain central.

📌DQN: Deep Q-Networks

DQN approximates $Q^*$ with a neural network $Q_\theta(s, a)$ and trains it to satisfy the Bellman optimality equation:

$$\text{Loss} = \mathbb{E}\left[ \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') - Q_\theta(s, a) \right)^2 \right]$$

The network learns to make the Bellman equation hold (approximately) across all state-action pairs. The target $r + \gamma \max_{a'} Q(s', a')$ is still the Bellman backup---just computed with a neural network instead of a table.

</>Implementation

Here’s the Bellman-based loss function used in DQN:

import torch

def compute_dqn_loss(q_network, target_network, batch, gamma=0.99):
    """
    Compute the DQN loss based on Bellman optimality equation.

    Args:
        q_network: The Q-network being trained
        target_network: Frozen copy for stable targets
        batch: (states, actions, rewards, next_states, dones)
        gamma: Discount factor

    Returns:
        Mean squared Bellman error
    """
    states, actions, rewards, next_states, dones = batch

    # Current Q-values for taken actions
    q_values = q_network(states)
    q_values = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)

    # Target: Bellman optimality equation
    with torch.no_grad():
        next_q_values = target_network(next_states)
        max_next_q = next_q_values.max(dim=1)[0]

        # If done, no future value
        target = rewards + gamma * max_next_q * (1 - dones)

    # Loss: How far are we from satisfying Bellman?
    loss = torch.nn.functional.mse_loss(q_values, target)

    return loss

Why This Matters for You

Understanding the Bellman equations deeply gives you:

  1. Intuition for debugging: When an RL algorithm fails, ask: “Is the Bellman equation being satisfied? Where is it breaking down?”

  2. Algorithm design: Many RL innovations are creative ways to approximate or modify Bellman updates (n-step returns, eligibility traces, distributional RL).

  3. Unified perspective: All RL algorithms are trying to solve or approximate the Bellman equations. Once you see this, the field becomes more coherent.

💡Tip

When learning a new RL algorithm, always ask: “How does this relate to the Bellman equations?” You’ll find the answer illuminating.

Looking Ahead

The Bellman equations tell us what optimal values look like, but not how to find them efficiently. In the chapters ahead, we’ll explore algorithms that tackle this challenge:

  • Dynamic Programming: Solve Bellman exactly when you know the MDP
  • Monte Carlo: Estimate values from sampled episodes
  • TD Learning: Bootstrap with single-step samples
  • Q-Learning: Learn optimal values without a model

Summary

The Bellman equations are the mathematical heart of reinforcement learning:

  • Self-consistency: Values must satisfy the Bellman equation at every state
  • Bootstrapping: We can use estimates to improve estimates, converging to truth
  • Contraction: The Bellman operator shrinks errors, guaranteeing convergence
  • Local to global: Each update only looks at neighbors, yet globally optimal values emerge
  • Universal foundation: Every RL algorithm is solving or approximating these equations
ℹ️Note

Richard Bellman’s insight in the 1950s was that complex sequential decision problems have recursive structure. This “principle of optimality” underlies not just RL, but optimal control, operations research, and many other fields. In RL, it’s the foundation everything else builds on.

Understanding why Bellman matters gives you more than theoretical knowledge---it gives you the intuition to understand, debug, and design RL algorithms. When something goes wrong, you can trace it back to the Bellman equations. When designing something new, you can think about how it relates to these fundamental principles.