Dynamic Programming • Part 2 of 2
📝Draft

Convergence and Stopping

How do we know when we're done?

We have seen that iterative policy evaluation works. But why does it work? What mathematical property guarantees that we will reach the correct answer? And in practice, how do we decide when to stop iterating?

This section answers these questions.

Why Policy Evaluation Converges

The key property is that the Bellman operator is a contraction mapping. This is a powerful mathematical concept that guarantees convergence.

📖Contraction Mapping

A function $T$ is a contraction if applying it to any two inputs brings them closer together. Formally, there exists $\gamma \in [0, 1)$ such that for all inputs $V_1$ and $V_2$:

$$\|T(V_1) - T(V_2)\| \leq \gamma \|V_1 - V_2\|$$

The constant $\gamma$ is called the contraction factor.

Imagine you have two different guesses for the value function. After one Bellman update, those guesses become more similar. After another update, they become even more similar. No matter how far apart you started, you keep getting pushed toward the same point.

That point is the fixed point of the operator, and it is the true value function $V^\pi$.

The Bellman Operator

Mathematical Details

Let us define the Bellman expectation operator $T^\pi$ formally. For any value function $V$, we define:

$$(T^\pi V)(s) = \sum_a \pi(a|s) \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V(s') \right]$$

This operator takes a value function and returns a new value function. Iterative policy evaluation repeatedly applies this operator:

$$V_{k+1} = T^\pi V_k$$

The key theorem is:

Theorem (Bellman Operator is a Contraction): For any two value functions $V_1$ and $V_2$:

$$\|T^\pi V_1 - T^\pi V_2\|_\infty \leq \gamma \|V_1 - V_2\|_\infty$$

where $\|\cdot\|_\infty$ is the max-norm: $\|V\|_\infty = \max_s |V(s)|$.

Proof Sketch: Consider any state $s$:

$$|(T^\pi V_1)(s) - (T^\pi V_2)(s)| = \left| \sum_a \pi(a|s) \gamma \sum_{s'} P(s'|s,a) (V_1(s') - V_2(s')) \right|$$

The immediate rewards cancel since they depend only on $s$ and $a$, not on $V$. Using the triangle inequality and the facts that $\sum_a \pi(a|s) = 1$ and $\sum_{s'} P(s'|s,a) = 1$:

$$|(T^\pi V_1)(s) - (T^\pi V_2)(s)| \leq \gamma \max_{s'} |V_1(s') - V_2(s')| = \gamma \|V_1 - V_2\|_\infty$$

Since this holds for all states, it holds for the maximum, proving the contraction.
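The contraction inequality can also be checked numerically. The sketch below uses a hypothetical two-state MDP (the transition matrix and rewards are made up for illustration), applies the Bellman expectation operator to two arbitrary value functions, and confirms that their max-norm distance shrinks by at least a factor of $\gamma$:

```python
# Numeric check of the contraction property on a made-up two-state MDP.
# P[s][s'] are transition probabilities under the policy; R[s] is the
# expected one-step reward in state s. All numbers are illustrative.
gamma = 0.9
P = [[0.7, 0.3],
     [0.4, 0.6]]
R = [1.0, -1.0]

def bellman_op(V):
    # (T^pi V)(s) = R(s) + gamma * sum_{s'} P(s'|s) V(s')
    return [R[s] + gamma * sum(P[s][sp] * V[sp] for sp in range(2))
            for s in range(2)]

V1, V2 = [5.0, -3.0], [-2.0, 8.0]
before = max(abs(a - b) for a, b in zip(V1, V2))
after = max(abs(a - b) for a, b in zip(bellman_op(V1), bellman_op(V2)))
print(after <= gamma * before)  # True: the distance shrank by at least gamma
```

Repeating the update keeps multiplying the gap by at most $\gamma$, which is exactly why any two starting guesses are pushed toward the same fixed point.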

The Fixed Point Theorem

Mathematical Details

Banach Fixed Point Theorem: If $T$ is a contraction mapping on a complete metric space, then:

  1. $T$ has a unique fixed point $V^*$ such that $T(V^*) = V^*$
  2. For any starting point $V_0$, the sequence $V_{k+1} = T(V_k)$ converges to $V^*$

For our Bellman operator:

  • The fixed point is exactly $V^\pi$ (the true value function)
  • Starting from any $V_0$, we converge to $V^\pi$

The contraction property is like a rubber band. No matter where you start, each iteration pulls you closer to the center. The discount factor $\gamma$ controls how strong the pull is:

  • Higher $\gamma$ (close to 1): Weak contraction, slow convergence
  • Lower $\gamma$ (close to 0): Strong contraction, fast convergence

Convergence Speed

How fast do we converge? The answer depends on $\gamma$.

Mathematical Details

After $k$ iterations, the error is bounded by:

$$\|V_k - V^\pi\|_\infty \leq \gamma^k \|V_0 - V^\pi\|_\infty$$

Each iteration reduces the error by a factor of at least $\gamma$. This gives us exponential convergence.

Example: With $\gamma = 0.99$:

  • After 100 iterations: error reduced by factor $0.99^{100} \approx 0.37$
  • After 500 iterations: error reduced by factor $0.99^{500} \approx 0.0067$
  • After 1000 iterations: error reduced by factor $0.99^{1000} \approx 0.00004$

With $\gamma = 0.9$:

  • After 100 iterations: error reduced by factor $0.9^{100} \approx 0.000027$

💡Tip

The number of iterations needed scales roughly as $O(1/(1-\gamma))$. This makes intuitive sense:

  • With $\gamma = 0.9$, information needs to propagate about 10 steps to decay to $1/e$
  • With $\gamma = 0.99$, information needs to propagate about 100 steps

Higher discount factors make the agent care about more distant rewards, which requires more iterations to properly evaluate.
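This estimate is easy to turn into a back-of-envelope calculation. The helper below is a hypothetical utility (not from any library) that computes how many iterations are needed before the error factor $\gamma^k$ drops below a target:

```python
import math

# Hypothetical helper: smallest k with gamma^k <= eps, i.e. how many
# iterations until the geometric error factor falls below a target eps.
def iters_needed(gamma, eps):
    return math.ceil(math.log(eps) / math.log(gamma))

for gamma in (0.9, 0.99):
    print(gamma, iters_needed(gamma, 1e-6))
```

At a target factor of $10^{-6}$ this reports 132 iterations for $\gamma = 0.9$ and 1375 for $\gamma = 0.99$: a roughly tenfold gap, matching the $O(1/(1-\gamma))$ scaling.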

The Stopping Criterion

We cannot iterate forever, so we need a practical stopping rule.

📖Stopping Criterion

Stop iterating when the maximum change in value across all states falls below a threshold $\theta$:

$$\Delta = \max_s |V_{k+1}(s) - V_k(s)| < \theta$$

When values stop changing, we have essentially reached the fixed point. The threshold $\theta$ controls how close we need to get before declaring victory.
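In code, the statistic $\Delta$ is a single expression. The value tables below are made up purely to show the check:

```python
# The stopping statistic: largest per-state change between two
# consecutive sweeps. These value tables are illustrative only.
V_old = {'A': 1.00, 'B': -0.50, 'C': 2.00}
V_new = {'A': 1.02, 'B': -0.49, 'C': 2.00}

delta = max(abs(V_new[s] - V_old[s]) for s in V_old)
theta = 1e-3
print(delta < theta)  # False: delta is about 0.02, so we keep iterating
```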

Choosing Theta

📌Choosing the Right Threshold

Consider a GridWorld where rewards range from -1 (per step) to +10 (goal). The values might range from -20 to +10.

  • $\theta = 10^{-1}$: Very coarse. Stops as soon as per-sweep changes drop below 0.1. Fast but imprecise.
  • $\theta = 10^{-3}$: Reasonable for most applications.
  • $\theta = 10^{-6}$: High precision. May need many more iterations.
  • $\theta = 10^{-8}$: Near machine precision. Usually overkill.

💡Tip

Practical guidance for choosing $\theta$:

  1. Consider your precision needs. If you only care about which action is best (not the exact values), a coarse $\theta$ is fine.

  2. Scale with rewards. If rewards are in millions, $\theta = 0.01$ might be too tight. If rewards are tiny, it might be too loose.

  3. Start coarse, refine if needed. Begin with $\theta = 10^{-3}$ and only decrease if you see policy instabilities.

  4. Think about downstream use. If these values feed into policy improvement, slightly inaccurate values often still produce the correct greedy policy.

Error Bounds

Mathematical Details

How close are we to the true $V^\pi$ when we stop? If $\Delta < \theta$, then:

$$\|V_k - V^\pi\|_\infty \leq \frac{\gamma}{1 - \gamma} \cdot \theta$$

This tells us the worst-case error in our value estimates.

Example: With $\gamma = 0.99$ and $\theta = 10^{-3}$:

$$\text{Max error} \leq \frac{0.99}{0.01} \times 10^{-3} = 0.099$$

With $\gamma = 0.9$ and $\theta = 10^{-3}$:

$$\text{Max error} \leq \frac{0.9}{0.1} \times 10^{-3} = 0.009$$

Higher $\gamma$ amplifies the error bound.
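The bound is a one-line calculation. A hypothetical helper reproducing the two worked examples:

```python
# Worst-case error bound for the stopping rule: if the sweep-to-sweep
# change falls below theta, the remaining error is at most
# gamma / (1 - gamma) * theta. (Illustrative helper, not a library call.)
def value_error_bound(gamma, theta):
    return gamma / (1.0 - gamma) * theta

print(value_error_bound(0.99, 1e-3))  # ~0.099, as in the first example
print(value_error_bound(0.9, 1e-3))   # ~0.009, as in the second
```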

Computational Complexity

How expensive is each iteration, and how many iterations do we need?

Per-Sweep Cost

Mathematical Details

Each sweep visits every state and, for each state, considers all actions and all possible transitions:

$$\text{Cost per sweep} = O(|\mathcal{S}| \cdot |\mathcal{A}| \cdot |\mathcal{S}|) = O(|\mathcal{S}|^2 \cdot |\mathcal{A}|)$$

Breaking this down:

  • $|\mathcal{S}|$ states to update
  • $|\mathcal{A}|$ actions to consider per state
  • $|\mathcal{S}|$ possible next states per action (worst case)

Sparse MDPs

💡Tip

In practice, most MDPs are sparse: each state-action pair leads to only a few possible next states. For example, in a GridWorld, moving right leads to at most a few cells, not all cells.

For sparse MDPs with $d$ average transitions per state-action pair:

$$\text{Cost per sweep} = O(|\mathcal{S}| \cdot |\mathcal{A}| \cdot d)$$

This is typically much smaller than the worst-case bound.
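A common way to exploit this sparsity is to store, for each state-action pair, only the outcomes with nonzero probability. A minimal sketch with made-up entries:

```python
# Sketch of a sparse transition table: keep only reachable next states
# per (state, action) pair. Entries are illustrative, not from the text.
transitions = {
    (0, 'right'): [(1, 0.8, 0.0), (0, 0.2, 0.0)],  # (next_state, prob, reward)
    (1, 'right'): [(2, 0.8, 1.0), (1, 0.2, 0.0)],
}

# A sweep touches each stored entry once, so per-sweep work is the total
# number of stored entries rather than |S| * |A| * |S|.
work = sum(len(outcomes) for outcomes in transitions.values())
print(work)  # 4
```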

Total Cost

Mathematical Details

If we need $k$ iterations to converge:

$$\text{Total cost} = O(k \cdot |\mathcal{S}|^2 \cdot |\mathcal{A}|)$$

The number of iterations $k$ depends on:

  • The discount factor $\gamma$ (higher means more iterations)
  • The desired precision $\theta$ (smaller means more iterations)
  • The MDP structure (longer paths to rewards mean more iterations)

A rough estimate: $k \sim O\left(\frac{1}{1-\gamma} \log \frac{1}{\theta}\right)$

</>Implementation
def analyze_convergence(mdp, policy, gamma=0.99, theta=1e-6):
    """
    Analyze the convergence of policy evaluation.

    Returns detailed statistics about the convergence process.
    """
    V = {s: 0.0 for s in mdp.states}
    history = []

    iteration = 0
    while True:
        delta = 0.0

        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue

            old_value = V[s]
            new_value = 0.0

            for a in mdp.actions(s):
                action_prob = policy.get(s, {}).get(a, 0.0)
                for s_next, prob, reward in mdp.transitions(s, a):
                    new_value += action_prob * prob * (reward + gamma * V[s_next])

            V[s] = new_value
            delta = max(delta, abs(old_value - new_value))

        iteration += 1
        history.append({
            'iteration': iteration,
            'delta': delta,
            'max_value': max(V.values()),
            'min_value': min(V.values()),
        })

        if delta < theta:
            break

        if iteration > 100000:  # Safety limit
            print("Warning: Did not converge")
            break

    return V, history


def plot_convergence(history):
    """Plot convergence metrics."""
    import matplotlib.pyplot as plt

    iterations = [h['iteration'] for h in history]
    deltas = [h['delta'] for h in history]

    plt.figure(figsize=(10, 4))

    plt.subplot(1, 2, 1)
    plt.semilogy(iterations, deltas)
    plt.xlabel('Iteration')
    plt.ylabel('Max Value Change (Delta)')
    plt.title('Convergence: Log Scale')
    plt.grid(True, alpha=0.3)

    plt.subplot(1, 2, 2)
    plt.plot(iterations, deltas)
    plt.xlabel('Iteration')
    plt.ylabel('Max Value Change (Delta)')
    plt.title('Convergence: Linear Scale')
    plt.grid(True, alpha=0.3)

    plt.tight_layout()
    plt.show()
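The functions above assume an MDP object exposing states, terminal_states, actions(s), and transitions(s, a). To stay self-contained, the sketch below defines a tiny hypothetical three-state chain with that interface and evaluates it with a compact version of the same loop:

```python
# Self-contained sketch: a hypothetical 3-state chain (0 -> 1 -> 2, where
# state 2 is terminal and entering it yields +1), using the interface the
# functions above assume, evaluated with a compact version of the loop.
class ChainMDP:
    states = [0, 1, 2]
    terminal_states = {2}

    def actions(self, s):
        return ['right']

    def transitions(self, s, a):
        # deterministic step right; +1 reward when the move reaches state 2
        return [(s + 1, 1.0, 1.0 if s + 1 == 2 else 0.0)]

def evaluate(mdp, policy, gamma=0.9, theta=1e-8):
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.terminal_states:
                continue
            new_value = sum(
                policy[s].get(a, 0.0) * prob * (reward + gamma * V[s_next])
                for a in mdp.actions(s)
                for s_next, prob, reward in mdp.transitions(s, a))
            delta = max(delta, abs(V[s] - new_value))
            V[s] = new_value
        if delta < theta:
            return V

policy = {0: {'right': 1.0}, 1: {'right': 1.0}}
V = evaluate(ChainMDP(), policy)
print(V)  # state 1 is worth 1.0; state 0 is worth gamma * 1.0 = 0.9
```

With $\gamma = 0.9$ the chain settles at $V(1) = 1.0$ and $V(0) = 0.9$ after a few sweeps, since the only reward is one step away from state 1 and two steps away from state 0.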

The Effect of Gamma

The discount factor $\gamma$ has a profound effect on convergence.

📌Convergence Speed vs Gamma

Consider the same 10-state chain MDP evaluated with different discount factors:

| Discount Factor $\gamma$ | Iterations to $\theta = 10^{-6}$ | Interpretation |
| --- | --- | --- |
| 0.5 | ~25 | Very myopic, fast convergence |
| 0.9 | ~130 | Balanced, moderate convergence |
| 0.95 | ~270 | Far-sighted, slower convergence |
| 0.99 | ~1,400 | Very far-sighted, slow convergence |
| 0.999 | ~14,000 | Extremely far-sighted, very slow |

Why does higher $\gamma$ mean slower convergence?

With high $\gamma$, the value of a state depends heavily on states far in the future. Information about rewards must propagate through many steps. Each iteration propagates values one step, so more steps means more iterations.

With low $\gamma$, only nearby rewards matter. Values stabilize quickly because distant states barely affect current values.
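The trend is easy to reproduce on the simplest possible MDP: a single state with a self-loop and reward 1, whose true value is $1/(1-\gamma)$ (an illustrative setup, not the chain from the table):

```python
# One state, one action, reward 1, self-loop: the Bellman sweep reduces
# to V <- 1 + gamma * V, with true value 1 / (1 - gamma).
def sweeps_to_converge(gamma, theta=1e-6):
    V, k = 0.0, 0
    while True:
        new_V = 1.0 + gamma * V
        k += 1
        if abs(new_V - V) < theta:
            return k
        V = new_V

for gamma in (0.5, 0.9, 0.99):
    print(gamma, sweeps_to_converge(gamma))  # count grows sharply with gamma
```

The per-sweep change shrinks by exactly a factor of $\gamma$ here, so the counts track the $\log\theta / \log\gamma$ estimate: tens of sweeps at $\gamma = 0.5$, low hundreds at $0.9$, over a thousand at $0.99$.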

Practical Considerations

Monitoring Convergence

</>Implementation
def policy_evaluation_verbose(mdp, policy, gamma=0.99, theta=1e-6):
    """
    Policy evaluation with detailed progress reporting.
    """
    V = {s: 0.0 for s in mdp.states}
    iteration = 0

    print("Iteration | Max Delta | Value Range")
    print("-" * 45)

    while True:
        delta = 0

        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue

            old_value = V[s]
            new_value = 0.0

            for a in mdp.actions(s):
                action_prob = policy.get(s, {}).get(a, 0.0)
                for s_next, prob, reward in mdp.transitions(s, a):
                    new_value += action_prob * prob * (reward + gamma * V[s_next])

            V[s] = new_value
            delta = max(delta, abs(old_value - new_value))

        iteration += 1

        # Print progress every 10 iterations or at convergence
        if iteration % 10 == 0 or delta < theta:
            min_v = min(V.values())
            max_v = max(V.values())
            print(f"{iteration:9d} | {delta:9.2e} | [{min_v:.2f}, {max_v:.2f}]")

        if delta < theta:
            print(f"\nConverged after {iteration} iterations!")
            break

        if iteration > 100000:
            print(f"\nWarning: Did not converge after {iteration} iterations")
            break

    return V

When Convergence is Slow

💡Tip

If policy evaluation is taking too long, consider:

  1. Increase $\theta$: Do you really need $10^{-8}$ precision? Often $10^{-3}$ suffices.

  2. Lower $\gamma$ (if appropriate): If your problem allows, a smaller discount factor speeds things up dramatically.

  3. Use prioritized sweeps: Update states that changed the most first. This can significantly accelerate convergence.

  4. Initialize with previous values: If evaluating similar policies, start from the old value function instead of zeros.
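Tip 4 can be illustrated on a one-state self-loop MDP (update $V \leftarrow 1 + \gamma V$, true value $1/(1-\gamma) = 100$ at $\gamma = 0.99$): seeding the loop near the previous answer skips a sizable share of the sweeps. The numbers are illustrative:

```python
# Warm-start illustration: count sweeps to convergence from a cold start
# (zeros) versus an initial guess near the true value. Setup is made up.
def sweeps_from(V0, gamma=0.99, theta=1e-6):
    V, k = V0, 0
    while True:
        new_V = 1.0 + gamma * V
        k += 1
        if abs(new_V - V) < theta:
            return k
        V = new_V

cold = sweeps_from(0.0)    # start from zeros
warm = sweeps_from(99.0)   # start near the previous/true answer of 100.0
print(cold, warm)  # the warm start needs noticeably fewer sweeps
```

Because convergence is geometric, the saving scales with the logarithm of the initial error: here the warm start cuts the sweep count by roughly a third rather than eliminating it.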

Summary

ℹ️Note

Key Takeaways:

  1. Convergence is guaranteed because the Bellman operator is a contraction
  2. Convergence speed depends on $\gamma$: higher discount means slower convergence
  3. Stop when the maximum value change $\Delta$ falls below threshold $\theta$
  4. Choose $\theta$ based on precision needs; $10^{-3}$ to $10^{-6}$ is typical
  5. Computational cost is $O(|\mathcal{S}|^2 \cdot |\mathcal{A}|)$ per sweep for dense MDPs

With policy evaluation mastered, we can compute $V^\pi$ for any policy. The next chapter tackles the crucial question: How do we use these values to find a better policy?