Value Iteration
What if we did not fully evaluate each policy before improving? Value iteration takes a shortcut: it applies the Bellman optimality equation directly, finding optimal values without ever explicitly computing intermediate policies.
The Algorithm
An algorithm that repeatedly applies the Bellman optimality backup:

$$V_{k+1}(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V_k(s')\right]$$

When the values converge, they equal $V^*$. The optimal policy is then extracted as the greedy policy with respect to $V^*$.
Value iteration combines evaluation and improvement into a single step. Instead of asking “What is the value of following policy $\pi$?”, it asks “What is the value of acting optimally?”
The max operator handles improvement (pick the best action). The Bellman backup handles evaluation (propagate values). Each iteration makes progress on both fronts simultaneously.
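A single backup makes this concrete. The sketch below performs one Bellman optimality backup for one state; the transition probabilities, rewards, and value estimates are made-up numbers for illustration:

```python
# One Bellman optimality backup for a single state, with illustrative numbers.
# Each action maps to a list of (next_state, probability, reward) outcomes.
gamma = 0.9
V = {"s1": 4.0, "s2": 10.0}  # current value estimates

transitions = {
    "left":  [("s1", 1.0, 0.0)],                   # deterministic, no reward
    "right": [("s1", 0.2, 0.0), ("s2", 0.8, 1.0)]  # stochastic, usually rewarded
}

# Evaluation: expected one-step return of each action under the current V
q = {a: sum(p * (r + gamma * V[s]) for s, p, r in outs)
     for a, outs in transitions.items()}

# Improvement: the max picks the best action implicitly
new_value = max(q.values())
# q["left"] = 1.0 * (0 + 0.9 * 4) = 3.6
# q["right"] = 0.2 * (0 + 0.9 * 4) + 0.8 * (1 + 0.9 * 10) = 8.72
print(q, new_value)
```

The dictionary comprehension is the evaluation half; the final `max` is the improvement half, done in the same step.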
Why It Works
The Bellman optimality operator $T^*$ is defined as:

$$(T^* V)(s) = \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

Key properties:

- $T^*$ is a contraction: $\|T^* V_1 - T^* V_2\|_\infty \le \gamma \|V_1 - V_2\|_\infty$
- $V^*$ is the unique fixed point: $T^* V^* = V^*$
- Convergence is guaranteed: starting from any $V_0$, repeated application of $T^*$ converges to $V^*$
The proof follows the same Banach fixed-point argument as policy evaluation, but with the max operator instead of the expectation over a fixed policy.
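The contraction property can be checked numerically. Below, the operator $T^*$ for a small hypothetical two-state MDP is applied to two arbitrary value functions, and the sup-norm distance between the results shrinks by at least a factor of $\gamma$:

```python
# Numerically verify the gamma-contraction of the Bellman optimality operator
# on a small hypothetical 2-state MDP.
gamma = 0.9

# transitions[s][a] = list of (next_state, prob, reward)
transitions = {
    0: {"a": [(0, 0.5, 1.0), (1, 0.5, 0.0)], "b": [(1, 1.0, 2.0)]},
    1: {"a": [(0, 1.0, 0.0)], "b": [(1, 1.0, -1.0)]},
}

def bellman_optimality(V):
    """Apply T* once: max over actions of expected one-step return."""
    return [max(sum(p * (r + gamma * V[s2]) for s2, p, r in outs)
                for outs in transitions[s].values())
            for s in transitions]

V1 = [0.0, 10.0]
V2 = [5.0, -3.0]

dist_before = max(abs(x - y) for x, y in zip(V1, V2))
dist_after = max(abs(x - y) for x, y in zip(bellman_optimality(V1),
                                            bellman_optimality(V2)))
print(dist_before, dist_after)  # the after-distance is at most gamma * before
assert dist_after <= gamma * dist_before + 1e-9
```

This is only a spot check on one pair of value functions, not a proof, but it shows exactly what the contraction inequality asserts.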
The Full Algorithm
Algorithm: Value Iteration

Input: MDP $(S, A, P, R, \gamma)$, threshold $\theta > 0$

Output: Optimal value function $V \approx V^*$, optimal policy $\pi$

1. Initialize $V(s) = 0$ for all $s \in S$
2. Repeat:
   - For each state $s$: $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$
   - Until $\Delta < \theta$, where $\Delta = \max_s |V_{\text{new}}(s) - V_{\text{old}}(s)|$
3. Extract policy:
   - For each state $s$: $\pi(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$
4. Return $V$, $\pi$
Complete Implementation
```python
def value_iteration(mdp, gamma=0.99, theta=1e-6):
    """
    Find optimal values using value iteration.

    Args:
        mdp: MDP with states, actions(s), transitions(s, a)
        gamma: discount factor
        theta: convergence threshold

    Returns:
        V: optimal value function (dict: state -> value)
        iterations: number of iterations until convergence
    """
    V = {s: 0.0 for s in mdp.states}
    iteration = 0
    while True:
        delta = 0
        iteration += 1
        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue
            old_value = V[s]
            # Bellman optimality backup: take max over actions
            best_value = float('-inf')
            for a in mdp.actions(s):
                action_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, a):
                    action_value += prob * (reward + gamma * V[s_next])
                best_value = max(best_value, action_value)
            V[s] = best_value
            delta = max(delta, abs(old_value - best_value))
        if delta < theta:
            break
    return V, iteration


def extract_policy(mdp, V, gamma=0.99):
    """
    Extract the greedy policy from a value function.

    Args:
        mdp: MDP object
        V: value function (dict: state -> value)
        gamma: discount factor

    Returns:
        policy: dict mapping state -> best action
    """
    policy = {}
    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            policy[s] = None
            continue
        best_action = None
        best_value = float('-inf')
        for a in mdp.actions(s):
            action_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, a):
                action_value += prob * (reward + gamma * V[s_next])
            if action_value > best_value:
                best_value = action_value
                best_action = a
        policy[s] = best_action
    return policy


def value_iteration_full(mdp, gamma=0.99, theta=1e-6):
    """
    Value iteration with policy extraction.
    Returns both optimal values and optimal policy.
    """
    V, iterations = value_iteration(mdp, gamma, theta)
    policy = extract_policy(mdp, V, gamma)
    print(f"Value iteration converged after {iterations} iterations")
    return policy, V, iterations
```

A Worked Example
Consider a simple chain: A -> B -> C (terminal). The move from B to C yields +10; moving from A to B costs -1. Let $\gamma = 1$.

```
[A] --(-1)--> [B] --(+10)--> [C]
```

Iteration 0: $V(A) = 0$, $V(B) = 0$, $V(C) = 0$

Iteration 1:
- $V(A) = -1 + V(B) = -1$
- $V(B) = 10 + V(C) = 10$
- $V(C) = 0$ (terminal)

Iteration 2:
- $V(A) = -1 + V(B) = 9$
- $V(B) = 10$

Iteration 3:
- (no change)

Values have converged! $V(A) = 9$, $V(B) = 10$, $V(C) = 0$.

The optimal policy is: always move right.
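The chain can also be run through code. The sketch below wraps the chain in a minimal MDP object (the `ChainMDP` class and its interface are scaffolding invented for this example, matching the `states`/`actions`/`transitions` convention used in this chapter's implementation) and applies a compact value iteration loop:

```python
# Minimal MDP scaffolding for the A -> B -> C chain (names are illustrative).
class ChainMDP:
    states = ["A", "B", "C"]
    terminal_states = {"C"}

    def actions(self, s):
        return ["right"]

    def transitions(self, s, a):
        # (next_state, probability, reward); B -> C pays +10, A -> B costs -1
        return {"A": [("B", 1.0, -1.0)], "B": [("C", 1.0, 10.0)]}[s]

def value_iteration(mdp, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in mdp.states}
    while True:
        delta = 0.0
        for s in mdp.states:
            if s in mdp.terminal_states:
                continue
            new_v = max(sum(p * (r + gamma * V[s2])
                            for s2, p, r in mdp.transitions(s, a))
                        for a in mdp.actions(s))
            delta = max(delta, abs(V[s] - new_v))
            V[s] = new_v
        if delta < theta:
            break
    return V

V = value_iteration(ChainMDP())
print(V)  # {'A': 9.0, 'B': 10.0, 'C': 0.0}
```

The printed values match the hand trace above: $V(A) = 9$, $V(B) = 10$, $V(C) = 0$.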
Let us trace value iteration on a 4x4 grid with the goal in the bottom-right corner. Reward is -1 per step, $\gamma = 1$.

Initial values: All zeros.

After iteration 1: Only states adjacent to the goal have non-zero values:

```
 0  0  0  0
 0  0  0  0
 0  0  0 -1
 0  0 -1  0
```

After iteration 2: Values spread one step further:

```
 0  0  0  0
 0  0  0 -1
 0  0 -1 -2
 0 -1 -2  0
```

After iteration 3:

```
 0  0  0 -1
 0  0 -1 -2
 0 -1 -2 -3
-1 -2 -3  0
```

Convergence (after about 6 iterations):

```
-6 -5 -4 -3
-5 -4 -3 -2
-4 -3 -2 -1
-3 -2 -1  0
```

These are the optimal values! The optimal policy points toward the goal from every state.
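The converged grid can be reproduced with a short script. This is one plausible setup of the gridworld (four deterministic moves, bumping into a wall leaves you in place, $\gamma = 1$), written from scratch rather than reusing the chapter's MDP interface:

```python
# Reproduce the 4x4 grid values: -1 per step, deterministic moves, gamma = 1,
# terminal goal in the bottom-right corner.
SIZE, GOAL = 4, (3, 3)
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

V = {(r, c): 0.0 for r in range(SIZE) for c in range(SIZE)}
while True:
    delta = 0.0
    for s in V:
        if s == GOAL:
            continue
        best = float('-inf')
        for dr, dc in MOVES:
            nr, nc = s[0] + dr, s[1] + dc
            if not (0 <= nr < SIZE and 0 <= nc < SIZE):
                nr, nc = s  # bumping into a wall leaves you in place
            best = max(best, -1.0 + V[(nr, nc)])  # -1 per step, gamma = 1
        delta = max(delta, abs(V[s] - best))
        V[s] = best
    if delta < 1e-9:
        break

for r in range(SIZE):
    print([int(V[(r, c)]) for c in range(SIZE)])
```

Each value comes out as the negative Manhattan distance to the goal, matching the converged grid above.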
Policy Iteration vs Value Iteration
| Aspect | Policy Iteration | Value Iteration |
|---|---|---|
| Update rule | Bellman expectation (fixed $\pi$) | Bellman optimality (max over actions) |
| Per iteration | Full policy evaluation | Single Bellman backup |
| Intermediate | Explicit policies at each step | Only values (policy at the end) |
| Iterations | Few (2-10) | Many (often 100s) |
| Per-iteration cost | High (many sweeps) | Low (one sweep) |
| Total cost | Often lower | Often higher |
| When to use | Small MDPs, low $\gamma$ | Large MDPs, high $\gamma$ |
When to use which?
- Policy Iteration: When policy evaluation is cheap (small state space, low $\gamma$)
- Value Iteration: When you want simplicity and do not need intermediate policies
- Modified Policy Iteration: A good middle ground for most cases
In practice, the differences are often small. Both find the same answer.
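The "middle ground" mentioned above can be sketched directly: modified policy iteration runs only `k` evaluation sweeps between improvements instead of evaluating to convergence. The function below is an illustrative sketch (the `k` and `sweeps` parameters, and the chain MDP used to exercise it, are assumptions for this example, not the chapter's reference implementation):

```python
# Sketch of modified policy iteration: k evaluation sweeps per improvement.
# The MDP interface (states, actions, transitions, terminal_states) mirrors
# the value iteration code above; the chain MDP is just a test fixture.
def q_value(mdp, V, s, a, gamma):
    return sum(p * (r + gamma * V[s2]) for s2, p, r in mdp.transitions(s, a))

def modified_policy_iteration(mdp, gamma=1.0, k=3, sweeps=100):
    V = {s: 0.0 for s in mdp.states}
    policy = {s: mdp.actions(s)[0] for s in mdp.states
              if s not in mdp.terminal_states}
    for _ in range(sweeps):
        # Partial evaluation: only k sweeps, not full convergence
        for _ in range(k):
            for s in policy:
                V[s] = q_value(mdp, V, s, policy[s], gamma)
        # Greedy improvement
        for s in policy:
            policy[s] = max(mdp.actions(s),
                            key=lambda a: q_value(mdp, V, s, a, gamma))
    return V, policy

class ChainMDP:
    states = ["A", "B", "C"]
    terminal_states = {"C"}
    def actions(self, s):
        return ["right"]
    def transitions(self, s, a):
        return {"A": [("B", 1.0, -1.0)], "B": [("C", 1.0, 10.0)]}[s]

V, policy = modified_policy_iteration(ChainMDP())
print(V)  # {'A': 9.0, 'B': 10.0, 'C': 0.0}
```

Setting `k=1` recovers value iteration's flavor of interleaving; letting `k` grow recovers policy iteration's full evaluation.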
Convergence Analysis
Value iteration converges at the same geometric rate as policy evaluation:

$$\|V_k - V^*\|_\infty \le \gamma^k \|V_0 - V^*\|_\infty$$

The number of iterations to achieve error $\epsilon$ is:

$$k \approx \frac{\log(1/\epsilon)}{\log(1/\gamma)}$$

For $\gamma = 0.99$, this can be hundreds to thousands of iterations. But each iteration is fast (just one sweep), so the total time is often comparable to policy iteration.
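Plugging numbers into this bound makes the dependence on $\gamma$ vivid (the calculation assumes a normalized initial error $\|V_0 - V^*\|_\infty = 1$):

```python
import math

def iterations_needed(gamma, epsilon):
    """Smallest k with gamma**k <= epsilon, i.e. k >= log(1/eps) / log(1/gamma)."""
    return math.ceil(math.log(1 / epsilon) / math.log(1 / gamma))

for gamma in (0.9, 0.99, 0.999):
    print(gamma, iterations_needed(gamma, 1e-6))
# gamma=0.9 needs 132 iterations; gamma=0.99 needs 1375; gamma=0.999 needs 13809
```

Each factor-of-ten step closer to $\gamma = 1$ multiplies the iteration count by roughly ten.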
Value iteration typically needs more iterations than policy iteration, but each iteration is cheaper. The trade-off depends on:
- State space size: Larger $|S|$ means more expensive policy evaluation, favoring value iteration
- Discount factor: Higher $\gamma$ means more iterations needed for both
- Precision needed: Tighter $\theta$ means more iterations
For very large MDPs, value iteration (or approximate variants) is often preferred because policy evaluation becomes the bottleneck.
The Connection to Q-Learning
Value iteration is essentially Q-learning with a model. Compare:
Value Iteration:

$$V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$$

Q-Learning:

$$Q(s, a) \leftarrow Q(s, a) + \alpha \left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$
The key differences:
- Value iteration uses the known model to compute expectations
- Q-learning samples transitions from experience
- Value iteration uses full sweeps; Q-learning uses individual samples
Q-learning is value iteration without the model, using samples instead of expectations.
This connection is important. When you understand value iteration deeply, you understand the core of Q-learning. The algorithmic structure is the same; only the source of transition information differs.
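To make the parallel concrete, here is the chain problem from earlier solved by sampling transitions instead of enumerating them. This is a foreshadowing sketch of the sample-based idea, not this chapter's algorithm; the simulator, learning rate, and update count are all choices made up for the example:

```python
import random

# Sample-based Q-learning on the A -> B -> C chain: the agent never sees
# transition probabilities, only sampled (s, a, r, s') experience.
random.seed(0)
gamma, alpha = 1.0, 0.5
Q = {("A", "right"): 0.0, ("B", "right"): 0.0}

def step(s, a):
    """Environment simulator: the agent sees only its outputs, not its rules."""
    return {"A": (-1.0, "B"), "B": (10.0, "C")}[s]

for _ in range(200):
    s = random.choice(["A", "B"])
    r, s2 = step(s, "right")
    # TD target: sampled reward + max over next-state actions (0 at terminal C)
    target = r + gamma * max((Q[(s2, a)] for a in ["right"]
                              if (s2, a) in Q), default=0.0)
    Q[(s, "right")] += alpha * (target - Q[(s, "right")])

print(Q)  # converges toward {('A','right'): 9.0, ('B','right'): 10.0}
```

The update has the same shape as the Bellman optimality backup; the expectation over next states has simply been replaced by a single sampled transition, averaged over time by the step size `alpha`.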
Implementation Details
Handling Terminal States
```python
def value_iteration_with_terminals(mdp, gamma=0.99, theta=1e-6):
    """
    Value iteration with proper terminal state handling.
    """
    V = {s: 0.0 for s in mdp.states}
    # Terminal states have fixed value (often 0 or terminal reward)
    for s in mdp.terminal_states:
        V[s] = mdp.terminal_reward.get(s, 0.0)
    while True:
        delta = 0
        for s in mdp.states:
            # Skip terminal states - they have fixed values
            if s in mdp.terminal_states:
                continue
            old_value = V[s]
            # Find best action value
            best_value = float('-inf')
            for a in mdp.actions(s):
                action_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, a):
                    action_value += prob * (reward + gamma * V[s_next])
                best_value = max(best_value, action_value)
            V[s] = best_value
            delta = max(delta, abs(old_value - best_value))
        if delta < theta:
            break
    return V
```

Tracking Convergence
```python
def value_iteration_verbose(mdp, gamma=0.99, theta=1e-6, log_interval=10):
    """
    Value iteration with detailed progress tracking.
    """
    V = {s: 0.0 for s in mdp.states}
    history = []
    print("Iter | Max Delta | Value Range")
    print("-" * 40)
    iteration = 0
    while True:
        delta = 0
        iteration += 1
        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue
            old_value = V[s]
            best_value = float('-inf')
            for a in mdp.actions(s):
                action_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, a):
                    action_value += prob * (reward + gamma * V[s_next])
                best_value = max(best_value, action_value)
            V[s] = best_value
            delta = max(delta, abs(old_value - best_value))
        history.append({
            'iteration': iteration,
            'delta': delta,
            'max_value': max(V.values()),
            'min_value': min(V.values()),
        })
        if iteration % log_interval == 0 or delta < theta:
            min_v = min(V.values())
            max_v = max(V.values())
            print(f"{iteration:4d} | {delta:9.2e} | [{min_v:.2f}, {max_v:.2f}]")
        if delta < theta:
            print(f"\nConverged after {iteration} iterations!")
            break
    return V, history
```

Advanced: Prioritized Sweeps
Standard value iteration updates states in a fixed order. But some states change more than others. Prioritized sweeping maintains a priority queue of states ordered by how much their values changed. It updates high-priority states first.
This can dramatically speed up convergence because:
- States near rewards get updated first
- Information propagates faster through the MDP
- We skip states that have already converged
This idea extends to RL as prioritized experience replay.
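A minimal version of the idea can be sketched with Python's `heapq`. This is an illustration of the mechanism, not a tuned implementation: the MDP interface mirrors the code above, and the predecessor map is built here as a helper rather than assumed to exist on the MDP:

```python
import heapq

# Sketch of prioritized sweeping: back up states in order of how much their
# Bellman backup would change them, instead of fixed full sweeps.
def backup(mdp, V, s, gamma):
    return max(sum(p * (r + gamma * V[s2]) for s2, p, r in mdp.transitions(s, a))
               for a in mdp.actions(s))

def prioritized_value_iteration(mdp, gamma=1.0, theta=1e-9):
    V = {s: 0.0 for s in mdp.states}
    # Build the predecessor map: which states can transition into s?
    preds = {s: set() for s in mdp.states}
    for s in mdp.states:
        if s in mdp.terminal_states:
            continue
        for a in mdp.actions(s):
            for s2, p, r in mdp.transitions(s, a):
                preds[s2].add(s)
    # Seed the queue with every state's initial Bellman error
    heap = []
    for s in mdp.states:
        if s not in mdp.terminal_states:
            err = abs(backup(mdp, V, s, gamma) - V[s])
            heapq.heappush(heap, (-err, s))
    while heap:
        neg_err, s = heapq.heappop(heap)
        if -neg_err < theta:
            continue
        V[s] = backup(mdp, V, s, gamma)
        # A change at s may change the backup of its predecessors
        for sp in preds[s]:
            err = abs(backup(mdp, V, sp, gamma) - V[sp])
            if err >= theta:
                heapq.heappush(heap, (-err, sp))
    return V

class ChainMDP:
    states = ["A", "B", "C"]
    terminal_states = {"C"}
    def actions(self, s):
        return ["right"]
    def transitions(self, s, a):
        return {"A": [("B", 1.0, -1.0)], "B": [("C", 1.0, 10.0)]}[s]

V = prioritized_value_iteration(ChainMDP())
print(V)  # {'A': 9.0, 'B': 10.0, 'C': 0.0}
```

On the chain, the state next to the reward (B) is backed up first, and the change then propagates to its predecessor (A) via the queue, exactly the "information flows from rewards outward" behavior described above.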
Common Mistakes
Mistake 1: Extracting policy before convergence
The policy is only guaranteed optimal after values converge. Extracting early may give suboptimal actions:
```python
# Wrong: extract during iteration
for iteration in range(100):
    update_values()
    policy = extract_policy(mdp, V)  # May not be optimal yet!

# Right: extract after convergence
V, _ = value_iteration(mdp)
policy = extract_policy(mdp, V)  # Now optimal
```

Mistake 2: Using greedy policy during value iteration
Value iteration does not maintain an explicit policy. The max operation implicitly considers all actions. Do not confuse this with following a policy:
```python
# Value iteration: max over ALL actions
V[s] = max(Q(s, a) for a in actions)

# NOT following a specific policy:
V[s] = Q(s, policy[s])  # This is policy evaluation!
```

Mistake 3: Forgetting to handle empty action spaces
Some states might have no available actions (terminal states, or dead ends):
```python
actions = mdp.actions(s)
if not actions:
    V[s] = 0  # or terminal reward
    continue
best_value = max(compute_Q(s, a) for a in actions)
```

Summary
Key Takeaways:
- Value iteration applies the Bellman optimality backup: $V(s) \leftarrow \max_a \sum_{s'} P(s' \mid s, a)\left[R(s, a, s') + \gamma V(s')\right]$
- Convergence is guaranteed by the contraction property
- Extract the policy at the end by taking greedy actions with respect to $V^*$
- More iterations than policy iteration, but simpler and cheaper per iteration
- Same answer as policy iteration, different computational path
The Road Ahead
Dynamic Programming gives us exact solutions, but requires knowing the model. What if we do not have $P(s' \mid s, a)$? What if the state space is too large to enumerate?
That is where reinforcement learning comes in. In the following chapters, we will explore:
- Monte Carlo methods: Learn from complete episodes
- Temporal Difference learning: Learn from incomplete episodes
- Q-learning: Value iteration without a model
The ideas from this chapter, especially the Bellman equations and the principle of greedy improvement, carry forward into all of RL.