SARSA: On-Policy TD Control
What You'll Learn
- Explain the transition from prediction to control
- Implement SARSA for learning action values
- Explain what “on-policy” means and its implications
- Recognize when SARSA’s cautious behavior is desirable
- Train an agent to solve GridWorld using SARSA
TD(0) learns V(s), the value of states under a policy. But to control behavior, we need to evaluate actions, not just states. SARSA is TD for action values, and it gets its name from the quintuple (S, A, R, S′, A′) that defines each update.
Why SARSA?
SARSA extends TD learning to control problems by learning action values Q(s, a) instead of state values V(s). The key insight is that SARSA learns the value of the policy it's actually following, exploration included. This makes it "safe": it learns to avoid dangerous states even when a risky shortcut exists.
SARSA: an on-policy TD control algorithm that learns action values. The name comes from the (S, A, R, S′, A′) tuple used in each update. SARSA learns the value of the policy it follows, including exploration.
The SARSA Update
The SARSA update rule:

Q(S, A) ← Q(S, A) + α [R + γ Q(S′, A′) − Q(S, A)]
Notice the key difference from Q-learning (which we'll see next): SARSA uses the actual next action A′, not the best possible action argmaxₐ Q(S′, a). This makes SARSA an on-policy method.
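To make the distinction concrete, here is a sketch of the two targets side by side. The state and action names (and the Q dictionary layout) are illustrative, not part of the chapter's code:

```python
# Illustrative Q-table: a dict of dicts mapping state -> action -> value.
Q = {"s1": {"left": 0.0, "right": 1.0}}
reward, gamma = 1.0, 0.99
next_state, next_action = "s1", "left"  # suppose the policy actually chose "left"

# SARSA bootstraps from the action A' the agent actually took...
sarsa_target = reward + gamma * Q[next_state][next_action]
# ...while Q-learning bootstraps from the best available action.
q_learning_target = reward + gamma * max(Q[next_state].values())
```

Here the two targets differ exactly when the action taken is not the greedy one, which is what "on-policy" vs. "off-policy" boils down to in the update.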
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99, done=False):
    """Single SARSA update."""
    if done:
        td_target = reward  # no bootstrap from a terminal state
    else:
        td_target = reward + gamma * Q[next_state][next_action]
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error

Chapter Overview
This chapter introduces SARSA—the first TD control algorithm. We’ll cover:
From Prediction to Control
Why we need action values and how GPI works with TD
The SARSA Algorithm
Complete algorithm with implementation and worked examples
On-Policy Behavior
The Cliff Walking example and why SARSA is safe
Prerequisites
This chapter assumes you’re comfortable with:
- Introduction to TD Learning — TD(0) and the TD error
- The RL Framework — Value functions and policies
Key Questions We’ll Answer
- Why do we need action values for control?
- What does “on-policy” mean and why does it matter?
- Why is SARSA considered “safe” or “cautious”?
- When should you choose SARSA over other methods?
The Big Picture
SARSA sits at a crucial point in the RL landscape:
- It’s the first control algorithm we’ve seen that learns from experience
- Unlike policy evaluation, it can actually find good policies
- Unlike Q-learning (coming next), it accounts for exploration in its values
The name itself encodes the algorithm: at each step, you have State, take Action, get Reward, enter new State, and pick the next Action. Then you update.
| From | To | What Changes |
|---|---|---|
| TD(0) | SARSA | State values → Action values |
| DP | SARSA | Model-based → Model-free |
| MC Control | SARSA | Episode end → Every step |
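Putting the pieces together, a complete SARSA training loop might look like the sketch below. The tiny deterministic GridWorld, the reward of −1 per step, and the hyperparameters are illustrative assumptions, not the chapter's exact setup:

```python
import random
from collections import defaultdict

# Minimal 4x4 GridWorld (assumed for illustration): start (0, 0), goal (3, 3),
# reward of -1 per step, moves that hit a wall leave the agent in place.
ACTIONS = ["up", "down", "left", "right"]
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
SIZE, GOAL = 4, (3, 3)

def step(state, action):
    r, c = state
    dr, dc = MOVES[action]
    next_state = (min(max(r + dr, 0), SIZE - 1), min(max(c + dc, 0), SIZE - 1))
    return next_state, -1.0, next_state == GOAL  # (S', R, done)

def epsilon_greedy(Q, state, epsilon):
    # On-policy action selection: explore with probability epsilon.
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[state][a])

def train(episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    Q = defaultdict(lambda: {a: 0.0 for a in ACTIONS})
    for _ in range(episodes):
        state = (0, 0)
        action = epsilon_greedy(Q, state, epsilon)  # choose A before the loop
        done = False
        while not done:
            next_state, reward, done = step(state, action)      # R, S'
            next_action = epsilon_greedy(Q, next_state, epsilon)  # A' from the SAME policy
            target = reward if done else reward + gamma * Q[next_state][next_action]
            Q[state][action] += alpha * (target - Q[state][action])
            state, action = next_state, next_action  # (S, A) <- (S', A')
    return Q
```

Note that `next_action` is both used in the update target and then actually executed on the next step; selecting it once and reusing it is what makes this loop on-policy.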
Key Takeaways
- SARSA learns action values using TD updates
- It’s called SARSA because each update uses the quintuple (S, A, R, S′, A′)
- On-policy: SARSA learns the value of the policy it follows
- SARSA is “safe” because it accounts for exploration in its value estimates
- The learned values reflect returns under the exploratory policy being followed, not under the greedy policy
- Choose SARSA when exploration is costly and safety matters