Temporal Difference Learning • Part 1 of 3
📝Draft

From Prediction to Control

Using TD for action-value estimation

TD(0) gives us a powerful way to learn value functions. But there’s a catch: it learns state values $V(s)$—how good it is to be in a state. For control, we need to know how good it is to take specific actions. That’s where action values come in.

The Prediction Problem

So far, we’ve focused on prediction: given a policy $\pi$, estimate its value function $V^\pi(s)$.

TD(0) does this beautifully:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)\right]$$

But now we want control: find a policy that maximizes expected return. This requires:

  1. Evaluate the current policy
  2. Improve the policy based on what we learned
  3. Repeat

This is Generalized Policy Iteration (GPI)—the same framework we saw in Dynamic Programming.

📖Control Problem

The task of finding a policy that maximizes expected return. Unlike prediction (estimating values for a fixed policy), control requires both evaluation and policy improvement. The goal is to find $\pi^*$, the optimal policy.

Why State Values Aren’t Enough

Here’s the problem: to improve a policy using state values, you need a model of the environment.

Mathematical Details

With state values, the greedy policy improvement says: pick the action that leads to the best next state:

$$\pi'(s) = \arg\max_a \sum_{s'} p(s' \mid s, a) \left[R(s, a, s') + \gamma V(s')\right]$$

But this requires knowing $p(s' \mid s, a)$—the transition probabilities. That’s exactly what we’re trying to avoid in model-free RL!
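To see the dependence on the model concretely, here is a sketch of that improvement step as code—a hypothetical example where `P` and `R` are given transition-probability and reward arrays, i.e. exactly the model that model-free RL does without:

```python
import numpy as np

def greedy_from_v(V, P, R, state, gamma=0.99):
    """Greedy improvement from state values -- only possible WITH a model.

    P[s, a, s'] : transition probabilities p(s' | s, a)  (assumed given)
    R[s, a, s'] : rewards for each transition            (assumed given)
    """
    n_actions = P.shape[1]
    # Expected one-step return of each action, computed from the model
    action_values = [
        np.sum(P[state, a] * (R[state, a] + gamma * V))
        for a in range(n_actions)
    ]
    return int(np.argmax(action_values))
```

Every line of the inner computation touches `P` and `R`; without them, state values alone cannot rank the actions.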

Suppose you’re in a state with $V = 5.0$, and you have two actions: LEFT and RIGHT. Which should you pick?

With only state values, you’d need to know:

  • Where does LEFT take me? What’s the value there?
  • Where does RIGHT take me? What’s the value there?

Without a model, you can’t answer these questions. You’d have to try each action many times and see where you end up—that’s expensive and slow.

Action values solve this problem. $Q(s, \text{LEFT})$ directly tells you how good LEFT is. No model needed.

Enter Action Values

📖Action Value Function

The function $Q^\pi(s, a)$ gives the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

With action values, policy improvement is trivial:

$$\pi'(s) = \arg\max_a Q(s, a)$$

No model needed—just pick the action with the highest Q-value. This is the key insight that makes model-free control possible.
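By contrast with the model-based version, greedy improvement from a Q-table is essentially one line—a minimal sketch, assuming `Q[state]` is an array of action values:

```python
import numpy as np

def greedy_from_q(Q, state):
    """Greedy improvement from action values -- no model required.

    Q[state] is an array of action values; we just pick the largest.
    """
    return int(np.argmax(Q[state]))
```

No transition probabilities, no reward model: the Q-table already contains everything needed to rank the actions.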

TD for Action Values

The transition from TD(0) to SARSA is straightforward. Instead of learning $V(s)$, we learn $Q(s, a)$:

TD(0) for states:

  • We observe: state $S_t$, reward $R_{t+1}$, next state $S_{t+1}$
  • Update: $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$

TD for state-action pairs (SARSA):

  • We observe: state $S_t$, action $A_t$, reward $R_{t+1}$, next state $S_{t+1}$, next action $A_{t+1}$
  • Update: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$

The structure is identical—we just track state-action pairs instead of states alone.

Mathematical Details

The SARSA update rule follows directly from the Bellman equation for action values:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a\right]$$

Just as TD(0) samples the Bellman equation for $V$, SARSA samples this equation for $Q$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha\left[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)\right]$$

The key requirement: we need to know $A_{t+1}$—the next action we’ll take—before we can do the update.
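The update rule translates almost symbol for symbol into code. A minimal sketch (hypothetical `sarsa_update` helper; `Q` indexed first by state, then by action):

```python
def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """One SARSA update: move Q(s, a) toward r + gamma * Q(s', a').

    Note that a_next must already be chosen -- the update uses the
    action the current policy will actually take next.
    """
    td_target = r + gamma * Q[s_next][a_next]
    td_error = td_target - Q[s][a]
    Q[s][a] += alpha * td_error
    return td_error
```

The function cannot even be called until `a_next` exists—the code makes the "know $A_{t+1}$ first" requirement structural.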

The GPI Framework with TD

Generalized Policy Iteration alternates between:

  1. Policy Evaluation: Estimate $Q^\pi$ for the current policy
  2. Policy Improvement: Make the policy greedy with respect to $Q$

ℹ️Note

In practice, we don’t fully evaluate the policy before improving it. We interleave evaluation and improvement, updating Q-values and the policy after every step. This is sometimes called “online” or “incremental” GPI.
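One such interleaved episode might be sketched as follows—assuming a hypothetical Gym-style environment with `reset()` and `step()` methods (not a specific library’s exact API):

```python
import numpy as np

def online_gpi_episode(env, Q, epsilon=0.1, alpha=0.1, gamma=0.99):
    """One episode of interleaved evaluation and improvement (SARSA-style).

    After every step we update one Q-value, and the epsilon-greedy
    policy immediately reflects it -- evaluation and improvement are
    never separated into distinct phases.
    """
    def policy(s):
        if np.random.random() < epsilon:
            return np.random.randint(len(Q[s]))   # explore
        return int(np.argmax(Q[s]))               # exploit current Q

    s = env.reset()
    a = policy(s)
    done = False
    while not done:
        s_next, r, done = env.step(a)
        a_next = policy(s_next)
        target = r + (0.0 if done else gamma * Q[s_next][a_next])
        Q[s][a] += alpha * (target - Q[s][a])   # evaluation step
        s, a = s_next, a_next                   # policy is already "improved"
    return Q
```

Because `policy` reads `Q` directly, every update changes the behavior on the very next visit—there is no separate improvement phase.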

📌Example

GPI in Action

Episode 1, Step 1:

  • State: $s = (1, 2)$
  • Current policy (random at first) picks action: RIGHT
  • We observe reward $-1$ and next state $(1, 3)$
  • Policy picks next action: UP
  • SARSA update: $Q((1,2), \text{RIGHT}) \leftarrow Q((1,2), \text{RIGHT}) + \alpha[\dots]$

Now the policy can use the updated Q-value! If $Q((1,2), \text{RIGHT})$ is now the highest, the greedy policy would choose RIGHT next time we’re in $(1,2)$.

The policy and Q-values evolve together, step by step.
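To put hypothetical numbers on the step above (assuming $\alpha = 0.1$, $\gamma = 1$, and all Q-values initialized to zero—none of these values come from the example itself):

```python
alpha, gamma = 0.1, 1.0                 # assumed hyperparameters
Q = {((1, 2), "RIGHT"): 0.0,            # all Q-values start at zero
     ((1, 3), "UP"): 0.0}

r = -1.0                                          # observed reward
td_target = r + gamma * Q[((1, 3), "UP")]         # -1 + 1.0 * 0 = -1
td_error = td_target - Q[((1, 2), "RIGHT")]       # -1 - 0 = -1
Q[((1, 2), "RIGHT")] += alpha * td_error          # 0 + 0.1 * (-1)
print(Q[((1, 2), "RIGHT")])                       # -0.1
```

One step of experience, one Q-value nudged toward the sampled target.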

Why We Need Both Evaluation and Improvement

The Exploration Problem

To learn good Q-values, we need to try different actions. But to act well, we should exploit what we’ve learned. This is the exploration-exploitation tradeoff we saw in bandits.

For TD control, we typically use epsilon-greedy exploration:

  • With probability $\varepsilon$: pick a random action (explore)
  • With probability $1-\varepsilon$: pick $\arg\max_a Q(s, a)$ (exploit)

This ensures we try all actions (at least occasionally) while mostly choosing what seems best. Common choices: $\varepsilon = 0.1$ or $\varepsilon = 0.05$.

</>Implementation
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    """
    Select action using epsilon-greedy policy.

    Args:
        Q: Q-table, Q[state] is array of action values
        state: Current state
        epsilon: Exploration probability
        n_actions: Number of available actions

    Returns:
        Selected action
    """
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(n_actions)
    else:
        # Exploit: best action according to Q
        return np.argmax(Q[state])

From V to Q: A Visual Comparison

| Aspect | State values $V(s)$ | Action values $Q(s, a)$ |
| --- | --- | --- |
| What it measures | How good to be in state $s$ | How good to take action $a$ in state $s$ |
| Table size | One entry per state | One entry per state-action pair |
| Policy improvement | Needs a model | Model-free |
| TD update | $V(s) \leftarrow \dots + \gamma V(s')$ | $Q(s, a) \leftarrow \dots + \gamma Q(s', a')$ |
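The table-size row is easy to check numerically—a quick sketch for an assumed 4×4 grid world with 4 actions (hypothetical dimensions):

```python
import numpy as np

n_states, n_actions = 16, 4             # assumed: 4x4 grid, 4 moves

V = np.zeros(n_states)                  # one entry per state
Q = np.zeros((n_states, n_actions))     # one entry per state-action pair

print(V.size, Q.size)                   # 16 64
```

The Q-table is a factor of $|A|$ larger—the price we pay for model-free policy improvement.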

The SARSA Tuple

SARSA gets its name from the five elements needed for each update:

  1. S — Current state $S_t$
  2. A — Action taken $A_t$
  3. R — Reward received $R_{t+1}$
  4. S’ — Next state $S_{t+1}$
  5. A’ — Next action $A_{t+1}$ (selected by the policy)

The crucial insight: we need to select $A'$ before updating. This is what makes SARSA an on-policy method—it uses the action from the same policy it’s evaluating.

</>Implementation
# The SARSA experience tuple
class SARSATuple:
    """One step of experience for SARSA."""

    def __init__(self, state, action, reward, next_state, next_action):
        self.s = state        # S
        self.a = action       # A
        self.r = reward       # R
        self.s_next = next_state    # S'
        self.a_next = next_action   # A'

    def __repr__(self):
        return f"SARSA({self.s}, {self.a}, {self.r}, {self.s_next}, {self.a_next})"

Summary

The transition from prediction to control requires:

  1. Action values instead of state values—so we can improve policies without a model
  2. Exploration—so we try all actions and learn about them
  3. Policy improvement—using our Q-estimates to make better decisions

SARSA combines these elements into a complete TD control algorithm. In the next section, we’ll see exactly how it works.