On-Policy Behavior
SARSA has a distinctive characteristic that sets it apart from Q-learning: it’s an on-policy algorithm. This seemingly technical distinction has profound practical implications—it makes SARSA “safe” in ways that Q-learning is not.
What Does “On-Policy” Mean?
A learning method is on-policy if it learns about the policy it is currently following. The values it estimates are for the behavior policy, including any exploration. SARSA learns $Q^\pi$, where $\pi$ is the epsilon-greedy policy it follows.
Think of it this way:
On-policy (SARSA): “I’m learning how good my actions are, given how I actually behave—including my mistakes and random exploration.”
Off-policy (Q-learning): “I’m learning how good the optimal actions are, regardless of what I actually do.”
The difference matters when exploration can be costly.
SARSA Learns $Q^\pi$, Not $Q^*$
Let’s be precise about what SARSA converges to.
SARSA’s update uses $Q(S', A')$, where $A'$ is sampled from the current policy (typically epsilon-greedy). This means SARSA approximates:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[\sum_{k=0}^{\infty} \gamma^k R_{t+k+1} \,\middle|\, S_t = s, A_t = a\right]$$

This is the value of action $a$ in state $s$ under policy $\pi$, the policy we’re following.

In contrast, Q-learning approximates:

$$Q^*(s, a) = \max_\pi Q^\pi(s, a)$$

This is the value under the optimal policy $\pi^*$.
The key difference: SARSA’s Q-values include the “cost” of exploration. Q-learning’s don’t.
Common Misconception: “SARSA learns $Q^*$”
This is wrong! SARSA learns $Q^\pi$ for the epsilon-greedy policy. Only as $\epsilon \to 0$ does $Q^\pi \to Q^*$. If you train SARSA with $\epsilon = 0.1$, the Q-values reflect what happens with 10% random actions.
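To make the gap concrete, here is a toy two-step MDP (invented for illustration) where $Q^\pi$ under exploration and $Q^*$ rank the actions differently:

```python
epsilon = 0.3  # deliberately large to make the gap visible

# Toy MDP (hypothetical): from state s, "safe" ends the episode with r = 1;
# "risky" pays r = 2 but leads to state d, where the greedy action is worth 0
# and the exploratory mistake is worth -10. Discount gamma = 1.
q_d = {"good": 0.0, "bad": -10.0}

# Q*: assumes greedy behavior at d, so "risky" looks best (2 > 1)
q_star_risky = 2.0 + max(q_d.values())

# Q^pi: epsilon-greedy at d sometimes takes "bad"
p_greedy = 1 - epsilon + epsilon / 2                          # 0.85
v_pi_d = p_greedy * q_d["good"] + (epsilon / 2) * q_d["bad"]  # -1.5
q_pi_risky = 2.0 + v_pi_d                                     # 0.5
```

Under the exploring policy, “safe” (value 1.0) beats “risky” (0.5), even though “risky” is optimal for a greedy agent. This is exactly the mechanism behind SARSA’s caution in the cliff example below.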
The Cliff Walking Example
This classic example from Sutton & Barto perfectly illustrates the on-policy/off-policy distinction.
The Cliff Walking Environment
Imagine a 4×12 grid:

```
┌───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┬───┐
│   │   │   │   │   │   │   │   │   │   │   │   │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │   │   │   │   │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│   │   │   │   │   │   │   │   │   │   │   │   │
├───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┼───┤
│ S │ C │ C │ C │ C │ C │ C │ C │ C │ C │ C │ G │
└───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┴───┘
```

- S: Start position
- G: Goal (episode ends)
- C: Cliff (reward -100, agent sent back to start; the episode continues)
- Every other step: reward -1
The Dilemma:
- The shortest path runs right along the cliff edge
- But with epsilon-greedy exploration, you might randomly step into the cliff!
What SARSA learns: “If I walk near the cliff, my random exploration might push me in. Better to take the safe path far from the edge.”
What Q-learning learns: “The optimal path is along the cliff. I should go that way.”
Both algorithms converge to the same optimal greedy policy as $\epsilon \to 0$. But while exploration is active:
- SARSA takes the safe path (longer but avoids cliff disasters)
- Q-learning takes the risky path (shorter but falls off during training)
Visualizing the Difference
Learned Policies in Cliff Walking
After training with $\epsilon = 0.1$:
SARSA’s path (safe route):

```
→ → → → → → → → → → → ↓
↑ · · · · · · · · · · ↓
↑ · · · · · · · · · · ↓
S C C C C C C C C C C G
```

Q-learning’s path (optimal but risky):

```
· · · · · · · · · · · ·
· · · · · · · · · · · ·
↑ → → → → → → → → → → ↓
S C C C C C C C C C C G
```

SARSA’s path is longer but accounts for the risk of falling. Q-learning’s path is optimal, but the agent falls off the cliff frequently during training.
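The comparison code below relies on the agent classes and per-episode training loops built in earlier sections. For readers treating this section as standalone, here is a minimal sketch consistent with how they are called (the names and signatures are inferred from usage, not fixed by any library):

```python
import numpy as np
from collections import defaultdict


class SARSAgent:
    """Tabular SARSA with an epsilon-greedy behavior policy."""

    def __init__(self, n_actions, epsilon=0.1, alpha=0.5, gamma=1.0):
        self.n_actions = n_actions
        self.epsilon = epsilon
        self.alpha = alpha
        self.gamma = gamma
        self.Q = defaultdict(lambda: np.zeros(n_actions))

    def act(self, state):
        if np.random.rand() < self.epsilon:
            return int(np.random.randint(self.n_actions))
        return int(np.argmax(self.Q[state]))

    def update(self, s, a, r, s2, a2, done):
        # SARSA target: bootstrap on the action actually taken next
        target = r + (0.0 if done else self.gamma * self.Q[s2][a2])
        self.Q[s][a] += self.alpha * (target - self.Q[s][a])


class QLearningAgent(SARSAgent):
    """Same skeleton, but the target bootstraps on max_a' Q(s', a')."""

    def update(self, s, a, r, s2, a2, done):
        target = r + (0.0 if done else self.gamma * self.Q[s2].max())
        self.Q[s][a] += self.alpha * (target - self.Q[s][a])


def train_sarsa_episode(env, agent):
    """Run one episode, updating on (S, A, R, S', A') transitions."""
    s = env.reset()
    a = agent.act(s)
    total, done = 0.0, False
    while not done:
        s2, r, done, _ = env.step(a)
        a2 = agent.act(s2)
        agent.update(s, a, r, s2, a2, done)
        s, a = s2, a2
        total += r
    return total, agent


def train_qlearning_episode(env, agent):
    """Run one episode; the next action plays no role in the update."""
    s = env.reset()
    total, done = 0.0, False
    while not done:
        a = agent.act(s)
        s2, r, done, _ = env.step(a)
        agent.update(s, a, r, s2, None, done)
        s = s2
        total += r
    return total, agent
```

Note how the two agents differ only in the bootstrap term of `update`; the training loops differ only in whether $A'$ must be chosen before the update.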
```python
import numpy as np


class CliffWalking:
    """The Cliff Walking environment from Sutton & Barto."""

    def __init__(self):
        self.height = 4
        self.width = 12
        self.start = (3, 0)
        self.goal = (3, 11)
        # Cliff is the bottom row between start and goal
        self.cliff = [(3, i) for i in range(1, 11)]
        self.state = None

    def reset(self):
        self.state = self.start
        return self.state

    def step(self, action):
        """Actions: 0=UP, 1=DOWN, 2=LEFT, 3=RIGHT"""
        row, col = self.state
        if action == 0:
            row = max(0, row - 1)
        elif action == 1:
            row = min(self.height - 1, row + 1)
        elif action == 2:
            col = max(0, col - 1)
        elif action == 3:
            col = min(self.width - 1, col + 1)
        self.state = (row, col)

        # Stepping into the cliff: big penalty, back to start, episode continues
        if self.state in self.cliff:
            self.state = self.start
            return self.state, -100, False, {}

        # Reaching the goal ends the episode
        if self.state == self.goal:
            return self.state, -1, True, {}

        return self.state, -1, False, {}


def compare_sarsa_qlearning(num_episodes=500, num_runs=100):
    """Compare SARSA and Q-learning on Cliff Walking.

    Assumes the SARSA/Q-learning agents and per-episode training
    loops from the previous sections.
    """
    sarsa_rewards = np.zeros((num_runs, num_episodes))
    qlearn_rewards = np.zeros((num_runs, num_episodes))

    for run in range(num_runs):
        # Fresh agents and environments for each run
        sarsa_agent = SARSAgent(n_actions=4, epsilon=0.1, alpha=0.5)
        qlearn_agent = QLearningAgent(n_actions=4, epsilon=0.1, alpha=0.5)
        env_sarsa = CliffWalking()
        env_qlearn = CliffWalking()

        for ep in range(num_episodes):
            # Train SARSA
            reward_sarsa, _ = train_sarsa_episode(env_sarsa, sarsa_agent)
            sarsa_rewards[run, ep] = reward_sarsa

            # Train Q-learning
            reward_qlearn, _ = train_qlearning_episode(env_qlearn, qlearn_agent)
            qlearn_rewards[run, ep] = reward_qlearn

    return sarsa_rewards.mean(axis=0), qlearn_rewards.mean(axis=0)
```

Why Is SARSA “Safer”?
SARSA’s safety comes from a subtle but important fact: its Q-values account for the exploration policy.
When SARSA estimates $Q(s, a)$, it considers what will happen if you:
- Take action $a$ from state $s$
- Then follow the epsilon-greedy policy (with its random exploration)
If the epsilon-greedy policy sometimes leads to disaster (like falling off a cliff), SARSA’s Q-values will be lower for nearby states. This naturally leads to safer behavior.
Q-learning, in contrast, estimates what would happen if you always took the best action afterwards. It ignores the fact that you’ll sometimes explore randomly.
Consider a state $s$ one step from the cliff. Let’s compute the Q-value of the action $a$ that moves along the cliff edge:

SARSA’s calculation:

$$Q(s, a) \approx r + \gamma \, \mathbb{E}_{A' \sim \pi}\left[Q(S', A')\right]$$

With $\epsilon = 0.1$, there’s a 10% chance $A'$ is random and might push you into the cliff. This lowers the expected future value.

Q-learning’s calculation:

$$Q(s, a) \approx r + \gamma \max_{a'} Q(S', a')$$

This assumes you’ll take the best action next, ignoring the 10% chance of random movement.

SARSA gives a lower Q-value for risky states, encouraging the agent to stay away.
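Plugging made-up Q-values into these two targets shows the gap numerically (the numbers are purely illustrative):

```python
import numpy as np

# Hypothetical Q(s', .) for a cell next to the cliff edge.
# Actions: UP, DOWN, LEFT, RIGHT; DOWN steps into the cliff.
q_next = np.array([-15.0, -100.0, -17.0, -13.0])
epsilon, gamma, r = 0.1, 1.0, -1.0

# Q-learning's target assumes the greedy next action
qlearning_target = r + gamma * q_next.max()      # -14.0

# SARSA's target, in expectation over the epsilon-greedy policy
probs = np.full(4, epsilon / 4)
probs[np.argmax(q_next)] += 1 - epsilon
sarsa_target = r + gamma * probs @ q_next        # -16.325
```

The 2.5% chance of stepping into the cliff drags SARSA’s estimate down by more than two units, so cliff-adjacent states look systematically worse to SARSA than to Q-learning.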
When Is On-Policy Better?
Choose SARSA (on-policy) when:
- Exploration can be costly (falling off cliffs, breaking robots)
- You want online performance during training to be good
- You care about safety during the learning process
- The environment is risky and mistakes are expensive
Choose Q-learning (off-policy) when:
- You want to learn the optimal policy regardless of behavior
- Exploration costs are low
- You can afford to make mistakes during training
- You want to learn from data collected by other policies
The Exploration-Exploitation-Safety Tradeoff
There’s a three-way tradeoff at play:
- Exploration: Try new actions to find better strategies
- Exploitation: Use what you’ve learned to get good rewards
- Safety: Avoid costly mistakes
SARSA tends toward safety—it accounts for the cost of exploration in its values. Q-learning tends toward optimality—it finds the best policy regardless of exploration costs.
Neither is universally better. The choice depends on your domain.
Real-World Implications
When SARSA’s Caution Matters
Robot navigation: A robot learning to navigate shouldn’t learn that “going near the stairs is fine” if its random exploration might make it fall.
Financial trading: An algorithm that occasionally makes random trades shouldn’t learn strategies that assume perfect execution.
Autonomous vehicles: During training, random exploration shouldn’t lead to paths that assume perfect control.
In all these cases, SARSA’s on-policy nature provides a safety margin.
Expected SARSA: A Middle Ground
Recall Expected SARSA from the previous section:

$$Q(S, A) \leftarrow Q(S, A) + \alpha \left[ R + \gamma \sum_{a'} \pi(a' \mid S')\, Q(S', a') - Q(S, A) \right]$$

Interestingly, Expected SARSA is still on-policy (it uses the policy probabilities), but it has lower variance than regular SARSA because it averages over $A'$ instead of sampling it. It often performs better than both SARSA and Q-learning in practice.
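As a sketch, this update for a tabular Q (assuming an epsilon-greedy target policy; the function name and Q-table layout are illustrative) looks like:

```python
import numpy as np


def expected_sarsa_update(Q, s, a, r, s2, done,
                          alpha=0.5, gamma=1.0, epsilon=0.1):
    """One Expected SARSA update on a dict-of-arrays Q-table.

    Instead of sampling A', average Q(s', .) under the epsilon-greedy
    probabilities: same fixed point as SARSA, but less variance.
    """
    q_next = Q[s2]
    pi = np.full(len(q_next), epsilon / len(q_next))
    pi[np.argmax(q_next)] += 1 - epsilon
    expected_q = 0.0 if done else float(pi @ q_next)
    Q[s][a] += alpha * (r + gamma * expected_q - Q[s][a])
```

Setting $\epsilon = 0$ in the target recovers Q-learning’s max, which is why Expected SARSA is often described as interpolating between the two algorithms.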
The Convergence Picture
Both SARSA and Q-learning converge under appropriate conditions:
SARSA converges to $Q^\pi$, where $\pi$ is the policy being followed. If $\pi$ is GLIE (Greedy in the Limit with Infinite Exploration), meaning it eventually becomes greedy, then SARSA converges to $Q^*$.
Q-learning converges to $Q^*$ directly, regardless of the behavior policy (as long as all state-action pairs are visited infinitely often).
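This off-policy property is easy to demonstrate: on a tiny deterministic MDP (made up here), Q-learning driven by a uniformly random behavior policy still recovers $Q^*$:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma, alpha = 0.9, 0.5

# Hypothetical two-state MDP: transitions[s][a] = (next_state or None, reward)
transitions = {
    0: {0: (1, 0.0), 1: (None, 0.5)},
    1: {0: (None, 1.0), 1: (None, 0.0)},
}
Q = {0: np.zeros(2), 1: np.zeros(2)}

# Behavior policy: uniformly random actions -- never greedy, never annealed
for _ in range(500):
    s = 0
    while s is not None:
        a = int(rng.integers(2))
        s2, r = transitions[s][a]
        target = r + (gamma * Q[s2].max() if s2 is not None else 0.0)
        Q[s][a] += alpha * (target - Q[s][a])
        s = s2

# Analytically, Q*(1, .) = [1, 0] and Q*(0, .) = [0 + 0.9 * 1, 0.5] = [0.9, 0.5]
```

SARSA run on the same random experience would instead converge to the value of the random policy, not $Q^*$.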
GLIE policies satisfy:
- All state-action pairs are explored infinitely often
- The policy converges to the greedy policy: $\pi_t(a \mid s) \to \mathbb{1}\left[a = \arg\max_{a'} Q_t(s, a')\right]$
A decaying epsilon-greedy policy (e.g. with $\epsilon_t = 1/t$) is GLIE.
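Such a schedule drops straight into epsilon-greedy action selection; a minimal sketch (the function names are illustrative):

```python
import numpy as np


def glie_epsilon(t):
    """GLIE schedule epsilon_t = 1/t: decays to 0 (greedy in the limit)
    while every action keeps a nonzero chance of being explored."""
    return 1.0 / t


def epsilon_greedy(q_values, t, rng=np.random.default_rng()):
    """Select an action with the decaying epsilon at timestep t (t >= 1)."""
    if rng.random() < glie_epsilon(t):
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))
```

In practice, $1/t$ can anneal too quickly for large state spaces; slower schedules such as $c/\sqrt{t}$ or a per-state count-based decay are common variations.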
Summary
The on-policy nature of SARSA has important consequences:
| Aspect | SARSA (On-Policy) | Q-Learning (Off-Policy) |
|---|---|---|
| Learns | $Q^\pi$ (current policy) | $Q^*$ (optimal policy) |
| Accounts for exploration | Yes | No |
| Safety during training | Higher | Lower |
| Optimal policy | Only as $\epsilon \to 0$ | Yes |
| Online performance | Better | Worse |
| Sample efficiency | Similar | Often better |
The choice between SARSA and Q-learning isn’t about which is “better”—it’s about which matches your needs. In safety-critical applications, SARSA’s caution is a feature, not a bug.
In the next chapter, we’ll dive deep into Q-learning—the off-policy algorithm that learns optimal values directly. We’ll see the famous Cliff Walking comparison in action and understand why Q-learning is perhaps the most influential algorithm in RL history.