Convergence and Stopping
We have seen that iterative policy evaluation works. But why does it work? What mathematical property guarantees that we will reach the correct answer? And in practice, how do we decide when to stop iterating?
This section answers these questions.
Why Policy Evaluation Converges
The key property is that the Bellman operator is a contraction mapping. This is a powerful mathematical concept that guarantees convergence.
A function $f$ is a contraction if applying it to any two inputs brings them closer together. Formally, there exists a constant $\gamma \in [0, 1)$ such that for all inputs $x$ and $y$:

$$\|f(x) - f(y)\| \le \gamma \|x - y\|$$

The constant $\gamma$ is called the contraction factor.
Imagine you have two different guesses for the value function. After one Bellman update, those guesses become more similar. After another update, they become more similar still. No matter how far apart you started, you keep getting pushed toward the same point.
That point is the fixed point of the operator, and it is the true value function $V^\pi$.
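This pull toward a single point can be checked numerically. The sketch below uses a made-up three-state cycle under a fixed policy (the transition structure and per-step reward of 1 are invented for illustration): two arbitrary value-function guesses, pushed through the same Bellman update, get closer by at least a factor of $\gamma$ on every step.

```python
gamma = 0.9

# Hypothetical 3-state cycle under a fixed policy: state i moves
# deterministically to state (i + 1) % 3, earning reward 1 per step.
def bellman_update(V):
    """One application of the Bellman expectation operator T^pi."""
    return [1.0 + gamma * V[(i + 1) % 3] for i in range(3)]

# Two arbitrary guesses at the value function.
V1 = [100.0, -50.0, 3.0]
V2 = [0.0, 0.0, 0.0]

gap = max(abs(a - b) for a, b in zip(V1, V2))  # max-norm distance: 100.0
for _ in range(5):
    V1, V2 = bellman_update(V1), bellman_update(V2)
    new_gap = max(abs(a - b) for a, b in zip(V1, V2))
    assert new_gap <= gamma * gap + 1e-12  # the contraction guarantee
    gap = new_gap

print(gap)  # 100 * 0.9**5 = 59.049
```

Because the transitions here are a pure permutation, the gap shrinks by exactly $\gamma$ per update; in general the contraction only guarantees "at most $\gamma$ times the old gap".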
The Bellman Operator
Let us define the Bellman expectation operator $T^\pi$ formally. For any value function $V$, we define:

$$(T^\pi V)(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

This operator takes a value function and returns a new value function. Iterative policy evaluation repeatedly applies this operator:

$$V_{k+1} = T^\pi V_k$$
The key theorem is:
Theorem (Bellman Operator is a Contraction): For any two value functions $U$ and $V$:

$$\|T^\pi U - T^\pi V\|_\infty \le \gamma \|U - V\|_\infty$$

where $\|\cdot\|_\infty$ is the max-norm: $\|V\|_\infty = \max_s |V(s)|$.
Proof Sketch: Consider any state $s$:

$$|(T^\pi U)(s) - (T^\pi V)(s)| = \gamma \left| \sum_a \pi(a \mid s) \sum_{s'} P(s' \mid s, a)\,[U(s') - V(s')] \right|$$

The immediate rewards cancel since they depend only on $s$, $a$, and $s'$, not on the value function. Using the triangle inequality and the fact that $\sum_a \pi(a \mid s) = 1$ and $\sum_{s'} P(s' \mid s, a) = 1$:

$$|(T^\pi U)(s) - (T^\pi V)(s)| \le \gamma \max_{s'} |U(s') - V(s')| = \gamma \|U - V\|_\infty$$

Since this holds for all states, it holds for the maximum, proving the contraction.
The Fixed Point Theorem
Banach Fixed Point Theorem: If $T$ is a contraction mapping on a complete metric space, then:
- $T$ has a unique fixed point $x^*$ such that $T(x^*) = x^*$
- For any starting point $x_0$, the sequence $x_{k+1} = T(x_k)$ converges to $x^*$
For our Bellman operator:
- The fixed point is exactly $V^\pi$ (the true value function)
- Starting from any initial guess $V_0$, we converge to $V^\pi$
The contraction property is like a rubber band. No matter where you start, each iteration pulls you closer to the center. The discount factor $\gamma$ controls how strong the pull is:
- Higher $\gamma$ (close to 1): Weak contraction, slow convergence
- Lower $\gamma$ (close to 0): Strong contraction, fast convergence
Convergence Speed
How fast do we converge? The answer depends on $\gamma$.
After $k$ iterations, the error is bounded by:

$$\|V_k - V^\pi\|_\infty \le \gamma^k \|V_0 - V^\pi\|_\infty$$

Each iteration reduces the error by a factor of at least $\gamma$. This gives us exponential convergence.
Example: With $\gamma = 0.9$:
- After 100 iterations: error reduced by factor $0.9^{100} \approx 2.7 \times 10^{-5}$
- After 500 iterations: error reduced by factor $0.9^{500} \approx 1.3 \times 10^{-23}$
- After 1000 iterations: error reduced by factor $0.9^{1000} \approx 1.7 \times 10^{-46}$
With $\gamma = 0.99$:
- After 100 iterations: error reduced by factor $0.99^{100} \approx 0.37$
The number of iterations needed scales roughly as $\frac{1}{1 - \gamma}$. This makes intuitive sense:
- With $\gamma = 0.9$, information needs to propagate about 10 steps to decay to $1/e$
- With $\gamma = 0.99$, information needs to propagate about 100 steps
Higher discount factors make the agent care about more distant rewards, which requires more iterations to properly evaluate.
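The decay factors quoted above are easy to check directly; this snippet uses nothing beyond the formula $\gamma^k$ and the $1/(1-\gamma)$ rule of thumb:

```python
import math

# Error-reduction factors after k iterations.
print(f"0.9^100  = {0.9 ** 100:.1e}")   # ~2.7e-05
print(f"0.9^500  = {0.9 ** 500:.1e}")   # ~1.3e-23
print(f"0.9^1000 = {0.9 ** 1000:.1e}")  # ~1.7e-46
print(f"0.99^100 = {0.99 ** 100:.2f}")  # ~0.37

# Rule of thumb: gamma^k decays to 1/e after roughly 1/(1 - gamma) steps.
for gamma in (0.9, 0.99):
    steps_to_1_over_e = -1 / math.log(gamma)
    print(f"gamma={gamma}: about {steps_to_1_over_e:.0f} steps per factor of 1/e")
```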
The Stopping Criterion
We cannot iterate forever, so we need a practical stopping rule.
Stop iterating when the maximum change in value across all states falls below a threshold $\theta$:

$$\max_s |V_{k+1}(s) - V_k(s)| < \theta$$

When values stop changing, we have essentially reached the fixed point. The threshold $\theta$ controls how close we need to get before declaring victory.
Choosing Theta
Consider a GridWorld where rewards range from -1 (per step) to +10 (goal). The values might range from -20 to +10.
- $\theta = 10^{-1}$: Very coarse. Values accurate to within roughly 0.1. Fast but imprecise.
- $\theta = 10^{-3}$: Reasonable for most applications. Values accurate to within roughly 0.001.
- $\theta = 10^{-6}$: High precision. May need many more iterations.
- $\theta = 10^{-10}$: Near machine precision. Usually overkill.
Practical guidance for choosing $\theta$:
- Consider your precision needs. If you only care about which action is best (not the exact values), a coarse $\theta$ is fine.
- Scale with rewards. If rewards are in the millions, $\theta = 10^{-3}$ might be too tight. If rewards are tiny, it might be too loose.
- Start coarse, refine if needed. Begin with $\theta = 10^{-3}$ and only decrease it if you see policy instabilities.
- Think about downstream use. If these values feed into policy improvement, slightly inaccurate values often still produce the correct greedy policy.
Error Bounds
How close are we to the true $V^\pi$ when we stop? If $\max_s |V_{k+1}(s) - V_k(s)| < \theta$, then:

$$\|V_{k+1} - V^\pi\|_\infty \le \frac{\gamma \theta}{1 - \gamma}$$

This tells us the worst-case error in our value estimates.
Example: With $\gamma = 0.9$ and $\theta = 10^{-3}$:

$$\|V_{k+1} - V^\pi\|_\infty \le \frac{0.9 \times 10^{-3}}{0.1} = 0.009$$

With $\gamma = 0.99$ and $\theta = 10^{-3}$:

$$\|V_{k+1} - V^\pi\|_\infty \le \frac{0.99 \times 10^{-3}}{0.01} = 0.099$$

Higher $\gamma$ amplifies the error bound.
The error bound can be quite loose in practice. Real errors are often much smaller than the bound. However, this gives you a guaranteed upper limit on how wrong your values might be.
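A tiny helper makes the bound concrete. It assumes the standard stopping bound $\gamma\theta/(1-\gamma)$; the function name is ours:

```python
def stopping_error_bound(theta, gamma):
    """Worst-case max-norm distance from V^pi after stopping once the
    per-sweep change drops below theta: theta * gamma / (1 - gamma)."""
    return theta * gamma / (1 - gamma)

print(stopping_error_bound(1e-3, 0.9))   # ~0.009
print(stopping_error_bound(1e-3, 0.99))  # ~0.099
```

Note how the same $\theta$ buys ten times less accuracy at $\gamma = 0.99$ than at $\gamma = 0.9$.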
Computational Complexity
How expensive is each iteration, and how many iterations do we need?
Per-Sweep Cost
Each sweep visits every state and, for each state, considers all actions and all possible transitions:

$$O(|S|^2 \cdot |A|)$$

Breaking this down:
- $|S|$ states to update
- $|A|$ actions to consider per state
- $|S|$ possible next states per action (worst case)
Sparse MDPs
In practice, most MDPs are sparse: each state-action pair leads to only a few possible next states. For example, in a GridWorld, moving right leads to at most a few cells, not all cells.
For sparse MDPs with an average of $b$ transitions per state-action pair:

$$O(|S| \cdot |A| \cdot b)$$

This is typically much smaller than the worst-case bound.
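A back-of-envelope comparison shows the gap for a hypothetical 10x10 GridWorld (100 states, 4 actions, about 3 reachable successors per state-action pair; all numbers invented for illustration):

```python
n_states, n_actions, b = 100, 4, 3

dense_ops = n_states ** 2 * n_actions   # worst case: every state reachable
sparse_ops = n_states * n_actions * b   # only b successors per state-action

print(dense_ops, sparse_ops)            # 40000 vs 1200 backup terms per sweep
print(dense_ops // sparse_ops)          # ~33x fewer
```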
Total Cost
If we need $K$ iterations to converge, the total cost is:

$$O(K \cdot |S|^2 \cdot |A|)$$

The number of iterations depends on:
- The discount factor $\gamma$ (higher means more iterations)
- The desired precision $\theta$ (smaller means more iterations)
- The MDP structure (longer paths to rewards mean more iterations)
A rough estimate:

$$K \approx \frac{\ln(1/\theta)}{1 - \gamma}$$

The following function runs policy evaluation while recording per-sweep statistics:
```python
def analyze_convergence(mdp, policy, gamma=0.99, theta=1e-6):
    """
    Analyze the convergence of policy evaluation.
    Returns detailed statistics about the convergence process.
    """
    V = {s: 0.0 for s in mdp.states}
    history = []
    iteration = 0
    while True:
        delta = 0.0
        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue
            old_value = V[s]
            new_value = 0.0
            for a in mdp.actions(s):
                action_prob = policy.get(s, {}).get(a, 0.0)
                for s_next, prob, reward in mdp.transitions(s, a):
                    new_value += action_prob * prob * (reward + gamma * V[s_next])
            V[s] = new_value  # in-place (Gauss-Seidel style) update
            delta = max(delta, abs(old_value - new_value))
        iteration += 1
        history.append({
            'iteration': iteration,
            'delta': delta,
            'max_value': max(V.values()),
            'min_value': min(V.values()),
        })
        if delta < theta:
            break
        if iteration > 100000:  # Safety limit
            print("Warning: Did not converge")
            break
    return V, history
```
```python
def plot_convergence(history):
    """Plot convergence metrics."""
    import matplotlib.pyplot as plt

    iterations = [h['iteration'] for h in history]
    deltas = [h['delta'] for h in history]

    plt.figure(figsize=(10, 4))
    plt.subplot(1, 2, 1)
    plt.semilogy(iterations, deltas)
    plt.xlabel('Iteration')
    plt.ylabel('Max Value Change (Delta)')
    plt.title('Convergence: Log Scale')
    plt.grid(True, alpha=0.3)
    plt.subplot(1, 2, 2)
    plt.plot(iterations, deltas)
    plt.xlabel('Iteration')
    plt.ylabel('Max Value Change (Delta)')
    plt.title('Convergence: Linear Scale')
    plt.grid(True, alpha=0.3)
    plt.tight_layout()
    plt.show()
```

The Effect of Gamma
The discount factor $\gamma$ has a profound effect on convergence.
Consider the same 10-state chain MDP evaluated with different discount factors:
| Discount Factor $\gamma$ | Iterations to $\delta < 10^{-6}$ | Interpretation |
|---|---|---|
| 0.5 | ~25 | Very myopic, fast convergence |
| 0.9 | ~130 | Balanced, moderate convergence |
| 0.95 | ~270 | Far-sighted, slower convergence |
| 0.99 | ~1,400 | Very far-sighted, slow convergence |
| 0.999 | ~14,000 | Extremely far-sighted, very slow |
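These counts line up with the rough estimate $K \approx \ln(1/\theta)/(1-\gamma)$, assuming the table's stopping threshold is $\theta = 10^{-6}$:

```python
import math

theta = 1e-6
for gamma in (0.5, 0.9, 0.95, 0.99, 0.999):
    k_est = math.log(1 / theta) / (1 - gamma)
    # Same order of magnitude as the measured iteration counts.
    print(f"gamma={gamma}: ~{k_est:,.0f} iterations")
```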
Why does higher $\gamma$ mean slower convergence?
With high $\gamma$, the value of a state depends heavily on states far in the future. Information about rewards must propagate through many steps. Each iteration propagates values one step, so more steps means more iterations.
With low $\gamma$, only nearby rewards matter. Values stabilize quickly because distant states barely affect current values.
When $\gamma = 1$ (no discounting), convergence is not guaranteed for continuing tasks (infinite episodes). The values could grow without bound. For episodic tasks with guaranteed termination, $\gamma = 1$ can work, but convergence may still be slow.
Always use $\gamma < 1$ for continuing tasks.
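To see the failure mode, run Bellman sweeps with $\gamma = 1$ on a made-up minimal continuing task: two states that transition to each other forever, reward 1 per step. The values grow without bound instead of settling at a fixed point.

```python
# Two states that deterministically transition to each other,
# reward 1 per step, gamma = 1: no fixed point exists.
V = [0.0, 0.0]
for sweep in range(1000):
    V = [1.0 + V[1], 1.0 + V[0]]  # Bellman backup with gamma = 1

print(V)  # [1000.0, 1000.0] -- growing linearly with the number of sweeps
```

With any $\gamma < 1$ the same loop converges to $1/(1-\gamma)$ in both states.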
Practical Considerations
Monitoring Convergence
```python
def policy_evaluation_verbose(mdp, policy, gamma=0.99, theta=1e-6):
    """
    Policy evaluation with detailed progress reporting.
    """
    V = {s: 0.0 for s in mdp.states}
    iteration = 0
    print("Iteration | Max Delta | Value Range")
    print("-" * 45)
    while True:
        delta = 0.0
        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue
            old_value = V[s]
            new_value = 0.0
            for a in mdp.actions(s):
                action_prob = policy.get(s, {}).get(a, 0.0)
                for s_next, prob, reward in mdp.transitions(s, a):
                    new_value += action_prob * prob * (reward + gamma * V[s_next])
            V[s] = new_value
            delta = max(delta, abs(old_value - new_value))
        iteration += 1
        # Print progress every 10 iterations or at convergence
        if iteration % 10 == 0 or delta < theta:
            min_v = min(V.values())
            max_v = max(V.values())
            print(f"{iteration:9d} | {delta:9.2e} | [{min_v:.2f}, {max_v:.2f}]")
        if delta < theta:
            print(f"\nConverged after {iteration} iterations!")
            break
        if iteration > 100000:
            print(f"\nWarning: Did not converge after {iteration} iterations")
            break
    return V
```

When Convergence is Slow
If policy evaluation is taking too long, consider:
- Increase $\theta$: Do you really need that much precision? Often $\theta = 10^{-4}$ suffices.
- Lower $\gamma$ (if appropriate): If your problem allows, a smaller discount factor speeds things up dramatically.
- Use prioritized sweeps: Update the states that changed the most first. This can significantly accelerate convergence.
- Initialize with previous values: If evaluating similar policies, start from the old value function instead of zeros.
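The warm-start idea in the last point is easy to demonstrate on a toy deterministic chain. Everything here is invented for illustration: the `evaluate` helper, the chain layout, and the reward perturbation.

```python
def evaluate(next_state, rewards, gamma=0.99, theta=1e-6, V_init=None):
    """In-place policy evaluation for a deterministic policy on a chain.
    next_state[s] is the single successor of s; returns (V, sweep count)."""
    V = dict(V_init) if V_init is not None else {s: 0.0 for s in next_state}
    sweeps = 0
    while True:
        delta = 0.0
        for s in next_state:
            new = rewards[s] + gamma * V[next_state[s]]
            delta = max(delta, abs(new - V[s]))
            V[s] = new
        sweeps += 1
        if delta < theta:
            return V, sweeps

chain = {0: 1, 1: 2, 2: 3, 3: 4, 4: 4}            # state 4 is absorbing
r_old = {0: 0.0, 1: 0.0, 2: 0.0, 3: 0.0, 4: 1.0}
r_new = {**r_old, 4: 1.0001}                       # a slightly perturbed task

V_old, _ = evaluate(chain, r_old)
_, cold_sweeps = evaluate(chain, r_new)                # start from zeros
_, warm_sweeps = evaluate(chain, r_new, V_init=V_old)  # start from old values

print(cold_sweeps, warm_sweeps)  # warm start needs far fewer sweeps
```

Starting near the new fixed point leaves far less error for the contraction to burn down, so the warm-started run converges in a fraction of the sweeps.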
Summary
Key Takeaways:
- Convergence is guaranteed because the Bellman operator is a contraction
- Convergence speed depends on $\gamma$: higher discount means slower convergence
- Stop when the maximum value change falls below a threshold $\theta$
- Choose $\theta$ based on precision needs; $10^{-3}$ to $10^{-6}$ is typical
- Computational cost is $O(|S|^2 \cdot |A|)$ per sweep for dense MDPs
With policy evaluation mastered, we can compute $V^\pi$ for any policy. The next chapter tackles the crucial question: how do we use these values to find a better policy?