From Prediction to Control
TD(0) gives us a powerful way to learn value functions. But there's a catch: it learns state values—how good it is to be in a state. For control, we need to know how good it is to take specific actions. That's where action values come in.
The Prediction Problem
So far, we've focused on prediction: given a policy $\pi$, estimate its value function $V^\pi$.
TD(0) does this beautifully:

$$V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$$
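As a concrete sketch, the TD(0) update can be written as a few lines of code. The tabular setup and the step-size and discount values here are illustrative, not prescribed by the text:

```python
import numpy as np

def td0_update(V, state, reward, next_state, alpha=0.1, gamma=0.99):
    """Move V[state] toward the one-step TD target; return the TD error."""
    td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error

# A tiny hypothetical example: 5 states, values initialized to zero.
V = np.zeros(5)
td0_update(V, state=0, reward=1.0, next_state=1)
```

After this single update, `V[0]` has moved a fraction `alpha` of the way toward the target `reward + gamma * V[1]`.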
But now we want control: find a policy that maximizes expected return. This requires:
- Evaluate the current policy
- Improve the policy based on what we learned
- Repeat
This is Generalized Policy Iteration (GPI)—the same framework we saw in Dynamic Programming.
The task of finding a policy that maximizes expected return. Unlike prediction (estimating values for a fixed policy), control requires both evaluation and policy improvement. The goal is to find $\pi^*$, the optimal policy.
Why State Values Aren’t Enough
Here’s the problem: to improve a policy using state values, you need a model of the environment.
With state values, the greedy policy improvement says: pick the action that leads to the best next state:

$$\pi'(s) = \arg\max_a \sum_{s'} P(s' \mid s, a)\big[ R(s, a, s') + \gamma V(s') \big]$$

But this requires knowing $P(s' \mid s, a)$—the transition probabilities. That's exactly what we're trying to avoid in model-free RL!
Suppose you're in a state $s$ whose value $V(s)$ you know, and you have two actions: LEFT and RIGHT. Which should you pick?
With only state values, you’d need to know:
- Where does LEFT take me? What’s the value there?
- Where does RIGHT take me? What’s the value there?
Without a model, you can’t answer these questions. You’d have to try each action many times and see where you end up—that’s expensive and slow.
Action values solve this problem. $Q(s, \text{LEFT})$ directly tells you how good LEFT is. No model needed.
Enter Action Values
The function $Q^\pi(s, a)$ gives the expected return starting from state $s$, taking action $a$, and then following policy $\pi$:

$$Q^\pi(s, a) = \mathbb{E}_\pi\big[ G_t \mid S_t = s, A_t = a \big]$$
With action values, policy improvement is trivial:

$$\pi'(s) = \arg\max_a Q(s, a)$$
No model needed—just pick the action with the highest Q-value. This is the key insight that makes model-free control possible.
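As a tiny illustration (the Q-values below are made up), extracting the greedy policy from a Q-table is a single argmax over actions:

```python
import numpy as np

# Hypothetical Q-table: rows are states, columns are actions (LEFT, RIGHT).
Q = np.array([[0.2, 0.8],    # in state 0, RIGHT looks best
              [0.5, 0.1]])   # in state 1, LEFT looks best

# Greedy policy improvement: no transition model required,
# just one action index per state.
greedy_policy = np.argmax(Q, axis=1)
print(greedy_policy)
```

Contrast this with the state-value version, which would need $P(s' \mid s, a)$ to look ahead one step.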
TD for Action Values
The transition from TD(0) to SARSA is straightforward. Instead of learning $V(s)$, we learn $Q(s, a)$:
TD(0) for states:
- We observe: state $S_t$, reward $R_{t+1}$, next state $S_{t+1}$
- Update: $V(S_t) \leftarrow V(S_t) + \alpha \big[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \big]$
TD for state-action pairs (SARSA):
- We observe: state $S_t$, action $A_t$, reward $R_{t+1}$, next state $S_{t+1}$, next action $A_{t+1}$
- Update: $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$
The structure is identical—we just track state-action pairs instead of states alone.
The SARSA update rule follows directly from the Bellman equation for action values:

$$Q^\pi(s, a) = \mathbb{E}_\pi\big[ R_{t+1} + \gamma Q^\pi(S_{t+1}, A_{t+1}) \mid S_t = s, A_t = a \big]$$

Just as TD(0) samples the Bellman equation for $V^\pi$, SARSA samples this equation for $Q^\pi$:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \big[ R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \big]$$

The key requirement: we need to know $A_{t+1}$—the next action we'll take—before we can do the update.
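In code, the sampled update is nearly identical to TD(0), just indexed by a state-action pair. A minimal sketch, assuming a tabular Q stored as a `|S| x |A|` array (the hyperparameter values are illustrative):

```python
import numpy as np

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Move Q[s, a] toward the one-step SARSA target; return the TD error."""
    td_target = r + gamma * Q[s_next, a_next]
    td_error = td_target - Q[s, a]
    Q[s, a] += alpha * td_error
    return td_error

# Hypothetical example: 3 states, 2 actions, Q initialized to zero.
Q = np.zeros((3, 2))
sarsa_update(Q, s=0, a=1, r=1.0, s_next=1, a_next=0)
```

Note that `a_next` must already be chosen before the update can run—the requirement described above.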
The GPI Framework with TD
Generalized Policy Iteration alternates between:
- Policy Evaluation: Estimate $Q^\pi$ for the current policy
- Policy Improvement: Make the policy greedy with respect to $Q$
In practice, we don’t fully evaluate the policy before improving it. We interleave evaluation and improvement, updating Q-values and the policy after every step. This is sometimes called “online” or “incremental” GPI.
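The interleaving can be sketched in a few lines. The environment below is a hypothetical 4-state chain where RIGHT moves toward a reward at the final state; all hyperparameters are illustrative. Because the epsilon-greedy policy reads the latest Q-values, every SARSA update also improves the policy:

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 4, 2             # actions: 0 = LEFT, 1 = RIGHT
Q = np.zeros((n_states, n_actions))
alpha, gamma, epsilon = 0.3, 0.9, 0.2  # illustrative hyperparameters

def policy(s):
    """Epsilon-greedy over the current Q-values."""
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))
    return int(np.argmax(Q[s]))

def step(s, a):
    """Deterministic chain dynamics; reward 1 on reaching the last state."""
    s_next = min(s + 1, n_states - 1) if a == 1 else max(s - 1, 0)
    reward = 1.0 if s_next == n_states - 1 else 0.0
    return s_next, reward, s_next == n_states - 1

for _ in range(300):                   # episodes
    s, a, done = 0, policy(0), False
    while not done:
        s_next, r, done = step(s, a)
        a_next = policy(s_next)
        target = r + (0.0 if done else gamma * Q[s_next, a_next])
        Q[s, a] += alpha * (target - Q[s, a])  # evaluation of current policy
        s, a = s_next, a_next                  # improvement: policy() reads new Q
```

There is no separate "evaluation phase": the Q-table and the behavior it induces improve together, one transition at a time.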
GPI in Action
Episode 1, Step 1:
- State: $s$
- Current policy (random at first) picks action: RIGHT
- We observe reward $r$ and next state $s'$
- Policy picks next action: UP
- SARSA update: $Q(s, \text{RIGHT}) \leftarrow Q(s, \text{RIGHT}) + \alpha \big[ r + \gamma Q(s', \text{UP}) - Q(s, \text{RIGHT}) \big]$
Now the policy can use the updated Q-value! If $Q(s, \text{RIGHT})$ is now the highest, the greedy policy would choose RIGHT next time we're in $s$.
The policy and Q-values evolve together, step by step.
Why We Need Both Evaluation and Improvement
It might seem like we could just learn $Q^*$ directly without worrying about the current policy. Q-learning does exactly this—but there's a catch.
SARSA learns about the policy it’s following. If your policy explores (which it must, to find good actions), SARSA’s Q-values reflect that exploration. This can be safer in risky environments.
We’ll explore this trade-off deeply when we compare SARSA and Q-learning.
The Exploration Problem
To learn good Q-values, we need to try different actions. But to act well, we should exploit what we’ve learned. This is the exploration-exploitation tradeoff we saw in bandits.
For TD control, we typically use epsilon-greedy exploration:
- With probability $\epsilon$: pick a random action (explore)
- With probability $1 - \epsilon$: pick $\arg\max_a Q(s, a)$ (exploit)
This ensures we try all actions (at least occasionally) while mostly choosing what seems best. A common choice is a small value such as $\epsilon = 0.1$.
```python
import numpy as np

def epsilon_greedy(Q, state, epsilon, n_actions):
    """
    Select an action using an epsilon-greedy policy.

    Args:
        Q: Q-table, Q[state] is an array of action values
        state: Current state
        epsilon: Exploration probability
        n_actions: Number of available actions

    Returns:
        Selected action
    """
    if np.random.random() < epsilon:
        # Explore: random action
        return np.random.randint(n_actions)
    else:
        # Exploit: best action according to Q
        return np.argmax(Q[state])
```

From V to Q: A Visual Comparison
| Aspect | State Values $V(s)$ | Action Values $Q(s, a)$ |
|---|---|---|
| What it measures | How good to be in state $s$ | How good to take action $a$ in state $s$ |
| Table size | One entry per state | One entry per state-action pair |
| Policy improvement | Needs model | Model-free |
| TD update | $V(S_t) \leftarrow V(S_t) + \alpha[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$ | $Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha[R_{t+1} + \gamma Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t)]$ |
The SARSA Tuple
SARSA gets its name from the five elements needed for each update:
- S — Current state
- A — Action taken
- R — Reward received
- S’ — Next state
- A’ — Next action (selected from the policy)
The crucial insight: we need to select $A'$ before updating. This is what makes SARSA an on-policy method—it uses the action from the same policy it's evaluating.
```python
# The SARSA experience tuple
class SARSATuple:
    """One step of experience for SARSA."""

    def __init__(self, state, action, reward, next_state, next_action):
        self.s = state               # S
        self.a = action              # A
        self.r = reward              # R
        self.s_next = next_state     # S'
        self.a_next = next_action    # A'

    def __repr__(self):
        return (f"SARSA({self.s}, {self.a}, {self.r}, "
                f"{self.s_next}, {self.a_next})")
```

Summary
The transition from prediction to control requires:
- Action values instead of state values—so we can improve policies without a model
- Exploration—so we try all actions and learn about them
- Policy improvement—using our Q-estimates to make better decisions
SARSA combines these elements into a complete TD control algorithm. In the next section, we’ll see exactly how it works.