Dynamic Programming • Part 2 of 3
📝Draft

Policy Iteration

Alternating evaluation and improvement until optimal

Policy iteration is a ping-pong algorithm. It alternates between two steps: evaluate the current policy to get V^\pi, then improve the policy by acting greedily. Repeat until the policy stops changing. When it stops, you have found the optimal policy.

The Algorithm

📖Policy Iteration

An algorithm that alternates between:

  1. Policy Evaluation: Compute V^\pi for the current policy \pi
  2. Policy Improvement: Construct a new policy \pi' that is greedy with respect to V^\pi

Repeat until the policy no longer changes. The final policy is optimal.

Think of it as a feedback loop:

  • Evaluation answers: “How good is this policy?”
  • Improvement answers: “What policy would be best given these values?”

Each round, evaluation produces new values and improvement produces a new policy. They chase each other until both stabilize at the optimum.

The remarkable thing is that this process terminates, and when it does, we have found \pi^* and V^*.

Pseudocode

Mathematical Details

Algorithm: Policy Iteration

Input: MDP (S, A, P, R, \gamma)

Output: Optimal policy \pi^*, optimal value function V^*

  1. Initialize \pi(s) arbitrarily for all s \in S

  2. Repeat:

    • Policy Evaluation:
      • Compute V^\pi using iterative policy evaluation
    • Policy Improvement:
      • \text{policy\_stable} \leftarrow \text{True}
      • For each state s \in S:
        • \text{old\_action} \leftarrow \pi(s)
        • \pi(s) \leftarrow \arg\max_a \left[ R(s,a) + \gamma \sum_{s'} P(s'|s,a) V^\pi(s') \right]
        • If \text{old\_action} \neq \pi(s): \text{policy\_stable} \leftarrow \text{False}
    • Until \text{policy\_stable} is True
  3. Return \pi, V^\pi

Why It Works

Policy iteration converges because of two properties:

💡Tip
  1. Monotonic Improvement: Each policy is at least as good as the previous one (policy improvement theorem)
  2. Finite Policies: There are only finitely many deterministic policies for a finite MDP

Since we only improve and never get worse, and there are only finitely many policies, we must eventually stop improving. That can only happen at the optimal policy.

Mathematical Details

More formally, let \Pi be the set of all deterministic policies. For a finite MDP with |S| states and |A| actions:

|\Pi| = |A|^{|S|}

This is finite. Each policy iteration step either:

  • Produces a strictly better policy (different from before)
  • Produces the same policy (we have converged)

We never revisit a policy because improvement is monotonic. Therefore, we must converge in at most |A|^{|S|} iterations.

In practice, convergence happens much faster, often in just a few iterations.
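For concreteness, here is that count for a 16-state, 4-action gridworld like the one in the worked example below (a quick arithmetic check):

```python
# Number of deterministic policies: |A|^|S|
num_states, num_actions = 16, 4
num_policies = num_actions ** num_states
print(num_policies)  # 4294967296 -- over four billion candidate policies
```

Policy iteration finds the optimum among those four billion candidates in a handful of rounds, which is what makes the monotonic-improvement argument so powerful.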

Complete Implementation

</>Implementation
def policy_evaluation(mdp, policy, gamma=0.99, theta=1e-8):
    """
    Evaluate a policy using iterative policy evaluation.

    Args:
        mdp: MDP with states, actions(s), transitions(s, a)
        policy: dict mapping state -> action
        gamma: discount factor
        theta: convergence threshold

    Returns:
        V: dict mapping state -> value
    """
    V = {s: 0.0 for s in mdp.states}

    while True:
        delta = 0

        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue

            old_value = V[s]
            action = policy[s]

            # Bellman backup for deterministic policy
            new_value = 0.0
            for s_next, prob, reward in mdp.transitions(s, action):
                new_value += prob * (reward + gamma * V[s_next])

            V[s] = new_value
            delta = max(delta, abs(old_value - new_value))

        if delta < theta:
            break

    return V


def policy_iteration(mdp, gamma=0.99, theta=1e-8):
    """
    Find optimal policy using policy iteration.

    Args:
        mdp: MDP object with states, actions(s), transitions(s, a)
        gamma: discount factor
        theta: convergence threshold for policy evaluation

    Returns:
        policy: optimal policy (dict: state -> action)
        V: optimal value function (dict: state -> value)
        history: list of dicts with iteration statistics
    """
    import random

    # Initialize with arbitrary policy
    policy = {}
    for s in mdp.states:
        if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
            policy[s] = None
        else:
            policy[s] = random.choice(list(mdp.actions(s)))

    history = []
    iteration = 0

    while True:
        iteration += 1

        # Step 1: Policy Evaluation
        V = policy_evaluation(mdp, policy, gamma, theta)

        # Step 2: Policy Improvement
        policy_stable = True
        changes = 0

        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue

            old_action = policy[s]

            # Find greedy action
            best_action = None
            best_value = float('-inf')

            for a in mdp.actions(s):
                action_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, a):
                    action_value += prob * (reward + gamma * V[s_next])

                if action_value > best_value:
                    best_value = action_value
                    best_action = a

            policy[s] = best_action

            if old_action != best_action:
                policy_stable = False
                changes += 1

        # Record history
        history.append({
            'iteration': iteration,
            'policy_changes': changes,
            'max_value': max(V.values()),
            'min_value': min(V.values()),
        })

        print(f"Iteration {iteration}: {changes} policy changes")

        if policy_stable:
            print(f"Policy iteration converged after {iteration} iterations")
            break

    return policy, V, history

A Worked Example

📌4x4 GridWorld

Consider a 4x4 grid. The agent starts anywhere and wants to reach the bottom-right corner (the goal). Each step costs -1. The discount factor is \gamma = 1 (no discounting for this episodic task).

 .  .  .  .
 .  .  .  .
 .  .  .  .
 .  .  .  G    G = Goal (reward 0)

Iteration 0: Initialize with random policy

Suppose we start with a policy that always goes “down”:

 v  v  v  v
 v  v  v  v
 v  v  v  v
 v  v  v  G

Iteration 1: Evaluate, then Improve

There is a subtlety here. Under "always down" (with walls that bounce you back), only the rightmost column ever reaches the goal. Every other state ends up bouncing against the bottom wall forever, so with \gamma = 1 its value diverges toward -\infty. In practice, evaluation is run with a sweep cap or a discount slightly below 1, which leaves these stuck states with very large negative values.

Values under the "always down" policy:

  • (0,3), (1,3), (2,3) have values -3, -2, -1 (steps straight down to the goal)
  • Every other state has a value diverging toward -\infty

Now we improve. For each state, we check which action has the best one-step lookahead:

  • From (3,2): Right reaches the goal (value 0), so Q = -1 + 0 = -1. Far better than bouncing downward forever.
  • From (0,2), (1,2), (2,2): Right steps into the finite-valued rightmost column. Best.
  • States further left see very bad values in every direction; improvement breaks those ties arbitrarily.

Iteration 2 and beyond

Each evaluate-improve round pulls the finite values at least one column further left, and the "point toward the goal" policy spreads across the grid. Once every state takes a shortest path, improvement makes no changes and the loop stops.

Result: Converged in a few iterations (two to four here, depending on how ties are broken)!

Optimal policy:
 >  >  >  v
 >  >  >  v
 >  >  >  v
 >  >  >  G

Optimal values (negative number of steps to the goal, with the goal at (3,3)):

-6  -5  -4  -3
-5  -4  -3  -2
-4  -3  -2  -1
-3  -2  -1   0
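The worked example can be reproduced with a short, self-contained script. This is a minimal sketch: GridWorld and run_policy_iteration are illustrative names following the mdp interface assumed by the implementation above (states, actions(s), transitions(s, a), terminal_states), and it uses gamma = 0.95 instead of 1 so that evaluating a policy that never reaches the goal still converges; the greedy result is the same shortest-path policy.

```python
class GridWorld:
    """4x4 grid; moves that hit a wall bounce back; every step costs -1."""
    def __init__(self, size=4):
        self.size = size
        self.states = [(r, c) for r in range(size) for c in range(size)]
        self.terminal_states = {(size - 1, size - 1)}
        self.moves = {'up': (-1, 0), 'down': (1, 0),
                      'left': (0, -1), 'right': (0, 1)}

    def actions(self, s):
        return list(self.moves)

    def transitions(self, s, a):
        dr, dc = self.moves[a]
        r = min(max(s[0] + dr, 0), self.size - 1)
        c = min(max(s[1] + dc, 0), self.size - 1)
        return [((r, c), 1.0, -1.0)]  # (next_state, probability, reward)


def run_policy_iteration(mdp, gamma=0.95, theta=1e-8):
    # Start from a deliberately bad policy: always go 'up'.
    policy = {s: 'up' for s in mdp.states if s not in mdp.terminal_states}
    while True:
        # Policy evaluation (deterministic dynamics: one successor per action).
        V = {s: 0.0 for s in mdp.states}
        while True:
            delta = 0.0
            for s, a in policy.items():
                (s2, p, r), = mdp.transitions(s, a)
                new_value = p * (r + gamma * V[s2])
                delta = max(delta, abs(V[s] - new_value))
                V[s] = new_value
            if delta < theta:
                break
        # Policy improvement; keep the old action on ties so the loop terminates.
        stable = True
        for s in policy:
            q = {}
            for a in mdp.actions(s):
                (s2, p, r), = mdp.transitions(s, a)
                q[a] = p * (r + gamma * V[s2])
            best = max(q, key=q.get)
            if q[best] > q[policy[s]] + 1e-12:
                policy[s] = best
                stable = False
        if stable:
            return policy, V


mdp = GridWorld()
policy, V = run_policy_iteration(mdp)
print(policy[(3, 2)], V[(3, 2)])  # right -1.0
```

On a run, every state's chosen action moves one step closer to the goal, matching the shortest-path policy sketched above.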

Convergence Speed

💡Tip

Policy iteration typically converges very quickly, often in just 2-5 iterations for many practical MDPs. This is surprising given that there could be |A|^{|S|} possible policies.

Why so fast? The policy improvement step makes large jumps in policy space. Unlike gradient descent which takes small steps, policy improvement switches to the globally greedy action at every state simultaneously. This aggressive improvement leads to rapid convergence.

📌Typical Iteration Counts
MDP Size                 States   Actions   Typical Iterations
Small GridWorld          16       4         2-3
Medium GridWorld         100      4         3-5
Large GridWorld          1000     4         4-7
Complex stochastic MDP   500      10        5-10

The iteration count grows slowly with MDP size. The bulk of computation is in policy evaluation, not in the number of improvement steps.

Computational Cost

Mathematical Details

The cost of policy iteration comes from two components:

Per policy evaluation: O(k \cdot |S|^2 \cdot |A|), where k is the number of evaluation sweeps (k depends on \gamma and \theta). For a deterministic policy, each sweep backs up only the policy's single action per state, so a sweep costs O(|S|^2); the |A| factor covers stochastic policies.

Per policy improvement: O(|S| \cdot |A| \cdot |S|) = O(|S|^2 \cdot |A|) for one pass over all states

Total: If we need n policy iterations, each requiring k evaluation sweeps:

\text{Total cost} = O(n \cdot k \cdot |S|^2 \cdot |A|)

Since n is typically small (2-10) but k can be large (hundreds to thousands of sweeps for \gamma close to 1), the evaluation cost dominates.
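To see the imbalance concretely, here are back-of-the-envelope backup counts for a hypothetical mid-sized MDP (the parameters are illustrative, not measurements):

```python
# Rough backup counts for a hypothetical MDP; all parameters are made up.
S, A = 100, 4    # states, actions
k = 500          # evaluation sweeps per policy (grows as gamma approaches 1)
n = 5            # policy-iteration rounds until the policy is stable
eval_backups = n * k * S**2 * A    # dominant term: n * k * |S|^2 * |A|
improve_backups = n * S**2 * A     # one improvement pass per round
print(eval_backups, improve_backups)  # 100000000 200000
```

With these numbers, improvement accounts for 0.2% of the work; shrinking k is therefore the natural lever, which is exactly what modified policy iteration does.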

Variations

Modified Policy Iteration

ℹ️Note

Modified Policy Iteration is a practical variant that does not run policy evaluation to full convergence. Instead, it runs only kk evaluation sweeps before improving.

When k = 1 (one sweep), this becomes very similar to value iteration. When k = \infty (full convergence), this is standard policy iteration.

In practice, k = 10 to k = 50 often works well, balancing evaluation accuracy with computation time.

</>Implementation
def modified_policy_iteration(mdp, gamma=0.99, k=10, theta=1e-6):
    """
    Modified policy iteration with limited evaluation sweeps.

    Args:
        mdp: MDP object
        gamma: discount factor
        k: number of evaluation sweeps per iteration
        theta: overall convergence threshold
    """
    import random

    # Initialize
    policy = {s: random.choice(list(mdp.actions(s)))
              for s in mdp.states
              if not (hasattr(mdp, 'terminal_states') and s in mdp.terminal_states)}
    V = {s: 0.0 for s in mdp.states}

    iteration = 0
    while True:
        iteration += 1

        # Limited policy evaluation: only k sweeps
        for _ in range(k):
            for s in mdp.states:
                if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                    continue

                action = policy.get(s)
                if action is None:
                    continue

                new_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, action):
                    new_value += prob * (reward + gamma * V[s_next])
                V[s] = new_value

        # Policy improvement
        policy_stable = True
        max_delta = 0

        for s in mdp.states:
            if hasattr(mdp, 'terminal_states') and s in mdp.terminal_states:
                continue

            old_action = policy.get(s)
            old_value = V[s]

            # Find best action and its value
            best_action = None
            best_value = float('-inf')

            for a in mdp.actions(s):
                action_value = 0.0
                for s_next, prob, reward in mdp.transitions(s, a):
                    action_value += prob * (reward + gamma * V[s_next])

                if action_value > best_value:
                    best_value = action_value
                    best_action = a

            policy[s] = best_action
            V[s] = best_value  # Update value to greedy value
            max_delta = max(max_delta, abs(old_value - best_value))

            if old_action != best_action:
                policy_stable = False

        if policy_stable and max_delta < theta:
            break

    return policy, V

Asynchronous Policy Iteration

💡Tip

In asynchronous variants, we do not wait to update all states before improving. Instead, we can:

  1. Update states in any order
  2. Improve the policy for a state as soon as its value is updated
  3. Prioritize states that are likely to change the most

These variants can converge faster in practice, especially for large MDPs with sparse reward structures.
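As a concrete (and entirely hypothetical) illustration of point 2, here is a sketch on a 10-state chain: each state's value is backed up and its action re-greedified immediately, in a random order, with no separate full-evaluation phase.

```python
import random

# Hypothetical chain MDP: states 0..9, actions 'left'/'right',
# every step costs -1, state 9 is terminal, walls bounce you back.
N, gamma = 10, 0.95

def step(s, a):
    s2 = min(max(s + (1 if a == 'right' else -1), 0), N - 1)
    return s2, -1.0

V = [0.0] * N                # V[N-1] stays 0.0 (terminal)
policy = ['left'] * (N - 1)  # deliberately bad initial policy

order = list(range(N - 1))
for _ in range(200):
    random.shuffle(order)    # visit states in an arbitrary order each sweep
    for s in order:
        # Back up this state's value and re-greedify its action immediately.
        q = {}
        for a in ('left', 'right'):
            s2, r = step(s, a)
            q[a] = r + gamma * V[s2]
        policy[s] = max(q, key=q.get)
        V[s] = q[policy[s]]

print(policy)  # all states point 'right', toward the terminal end
```

This is asynchronous in both senses: states are visited in arbitrary order, and the policy at a state is improved as soon as that state's value is updated rather than after a full evaluation pass.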

Policy Iteration vs. Other Methods

Aspect               Policy Iteration         Value Iteration    Q-Learning
Model required?      Yes                      Yes                No
Iterations           Few (2-10)               Many (100s)        Many (1000s+)
Per-iteration cost   High (full evaluation)   Low (one backup)   Very low (one sample)
Convergence          Exact                    Exact              Approximate
Memory               O(|S|)                   O(|S|)             O(|S| \times |A|)

Summary

ℹ️Note

Key Takeaways:

  1. Policy iteration alternates between evaluation and improvement
  2. Convergence is guaranteed and typically fast (2-10 iterations)
  3. Evaluation cost dominates, especially for high \gamma
  4. Modified policy iteration trades evaluation accuracy for speed
  5. This is the gold standard for small MDPs with known models

Policy iteration gives us exact optimal policies when we know the MDP. But what if we want a simpler algorithm that combines evaluation and improvement in one step? That is value iteration, which we cover next.