Chapter 112
📝Draft

SARSA

On-policy TD control: learning while following your current policy

SARSA: On-Policy TD Control

What You'll Learn

  • Explain the transition from prediction to control
  • Implement SARSA for learning action values
  • Explain what “on-policy” means and its implications
  • Recognize when SARSA’s cautious behavior is desirable
  • Train an agent to solve GridWorld using SARSA

TD(0) learns V^π—the value of states under a policy. But to control behavior, we need to evaluate actions, not just states. SARSA is TD for action values—and it gets its name from the quintuple (S, A, R, S', A') that defines each update.

Why SARSA?

SARSA extends TD learning to control problems by learning action values Q(s, a) instead of state values V(s). The key insight is that SARSA learns the value of the policy it’s actually following, including the exploration. This makes it “safe”—it learns to avoid dangerous states even if there’s a risky shortcut.

📖SARSA

An on-policy TD control algorithm that learns action values. The name comes from the tuple (S_t, A_t, R_{t+1}, S_{t+1}, A_{t+1}) used in each update. SARSA learns the value of the policy it follows, including exploration.
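
Because SARSA evaluates the policy it actually follows, that policy must itself explore. A common choice is ε-greedy action selection; here is a minimal sketch (the helper name and the dict-of-dicts layout for Q are illustrative assumptions, not fixed by this chapter):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon pick a uniformly random action;
    otherwise pick the action with the highest current Q estimate."""
    if random.random() < epsilon:
        return random.choice(actions)
    # Greedy choice over the current action-value estimates
    return max(actions, key=lambda a: Q[state][a])
```

With ε > 0, SARSA’s value estimates include the cost of these occasional random actions—which is exactly what makes it on-policy.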

The SARSA Update

Mathematical Details

The SARSA update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ Q(S_{t+1}, A_{t+1}) − Q(S_t, A_t)]

Notice the key difference from Q-learning (which we’ll see next): SARSA uses the actual next action A_{t+1}, not the best possible action. This makes SARSA an on-policy method.
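
To preview that distinction in code, here is a sketch of the two TD targets side by side (function names are illustrative; the Q-learning target itself is the subject of the next chapter):

```python
def sarsa_target(Q, reward, next_state, next_action, gamma=0.99):
    # On-policy: bootstrap from the action actually taken next (A')
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy: bootstrap from the best available action,
    # regardless of what the behavior policy actually does
    return reward + gamma * max(Q[next_state].values())
```

When the behavior policy explores into a low-value action, the two targets diverge: SARSA’s target reflects the exploratory choice, Q-learning’s does not.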

</>Implementation
def sarsa_update(Q, state, action, reward, next_state, next_action,
                 alpha=0.1, gamma=0.99, done=False):
    """Single SARSA update."""
    if done:
        td_target = reward
    else:
        td_target = reward + gamma * Q[next_state][next_action]

    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error
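
Putting this update inside an episode loop yields tabular SARSA. The sketch below assumes an ε-greedy behavior policy (with random tie-breaking so early training explores) and a generic `env_step(state, action) -> (reward, next_state, done)` interface; the tiny corridor environment at the end is a hypothetical stand-in for the chapter’s GridWorld, not its actual implementation:

```python
import random
from collections import defaultdict

def epsilon_greedy(Q, state, actions, epsilon):
    """ε-greedy over current Q estimates, breaking ties randomly."""
    if random.random() < epsilon:
        return random.choice(actions)
    best = max(Q[state][a] for a in actions)
    return random.choice([a for a in actions if Q[state][a] == best])

def train_sarsa(env_step, start_state, actions, episodes=500,
                alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=200):
    """Tabular SARSA: pick A' with the behavior policy, then update with it."""
    Q = defaultdict(lambda: {a: 0.0 for a in actions})
    for _ in range(episodes):
        state = start_state
        action = epsilon_greedy(Q, state, actions, epsilon)
        for _ in range(max_steps):
            reward, next_state, done = env_step(state, action)
            if done:
                # No bootstrap at terminal states
                Q[state][action] += alpha * (reward - Q[state][action])
                break
            next_action = epsilon_greedy(Q, next_state, actions, epsilon)
            td_target = reward + gamma * Q[next_state][next_action]
            Q[state][action] += alpha * (td_target - Q[state][action])
            state, action = next_state, next_action
    return Q

# Hypothetical 1-D corridor: states 0..4, move left/right, +1 for reaching 4.
def corridor_step(state, action):
    next_state = min(state + 1, 4) if action == "right" else max(state - 1, 0)
    done = (next_state == 4)
    return (1.0 if done else 0.0), next_state, done
```

Note the defining step: the next action A' is chosen *before* the update and then actually executed on the next iteration—SARSA never updates with an action it doesn’t intend to take.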

Chapter Overview

This chapter introduces SARSA—the first TD control algorithm we’ll study.

Prerequisites

This chapter assumes you’re comfortable with TD(0) prediction and the idea of exploratory policies.

Key Questions We’ll Answer

  • Why do we need action values for control?
  • What does “on-policy” mean and why does it matter?
  • Why is SARSA considered “safe” or “cautious”?
  • When should you choose SARSA over other methods?

The Big Picture

SARSA sits at a crucial point in the RL landscape:

  • It’s the first control algorithm we’ve seen that learns from experience
  • Unlike policy evaluation, it can actually find good policies
  • Unlike Q-learning (coming next), it accounts for exploration in its values

The name itself encodes the algorithm: at each step, you have State, take Action, get Reward, enter new State, and pick the next Action. Then you update.

From        To      What Changes
TD(0)       SARSA   State values → Action values
DP          SARSA   Model-based → Model-free
MC Control  SARSA   Episode end → Every step

Key Takeaways

  • SARSA learns action values Q(s, a) using TD updates
  • It’s called SARSA because each update uses (S, A, R, S', A')
  • On-policy: SARSA learns the value of the policy it follows
  • SARSA is “safe” because it accounts for exploration in its value estimates
  • The learned values reflect Q^π for the exploratory policy, not Q*
  • Choose SARSA when exploration is costly and safety matters
Next Chapter: Q-Learning