Q-Learning: Off-Policy TD Control
What You'll Learn
- Explain the key difference between SARSA and Q-learning
- Implement Q-learning for control
- Explain what “off-policy” means and why it matters
- Demonstrate Q-learning on Cliff Walking
- Identify the “deadly triad” and its implications
SARSA is safe but learns the value of being epsilon-greedy. What if we could learn the optimal policy while still exploring? Q-learning does exactly this—it learns regardless of what behavior policy we follow. It’s perhaps the most important algorithm in reinforcement learning.
Q-Learning vs SARSA: CliffWalking
Compare how Q-Learning (off-policy) and SARSA (on-policy) learn different paths.
Why Q-Learning?
Q-learning’s key innovation is using max over next actions instead of the actual next action. This means we’re always learning toward the optimal policy, even while behaving suboptimally. The behavior policy can be anything—as long as it tries all actions eventually.
An off-policy TD control algorithm that learns the optimal action-value function directly. Unlike SARSA, Q-learning uses the maximum Q-value for the next state, learning about the greedy policy while following an exploratory one.
The Q-Learning Update
The Q-learning update rule:

Q(S_t, A_t) ← Q(S_t, A_t) + α [R_{t+1} + γ max_a Q(S_{t+1}, a) − Q(S_t, A_t)]

The difference from SARSA is just one word: max over the next actions instead of the actual next action A_{t+1}. But this one change is profound: it means we learn q_* (the optimal action-value function) instead of q_π (the value of the exploratory behavior policy).
import numpy as np

def q_learning_update(Q, state, action, reward, next_state,
                      alpha=0.1, gamma=0.99, done=False):
    """Single Q-learning update."""
    if done:
        # No bootstrapping from a terminal state
        td_target = reward
    else:
        # Key difference from SARSA: use max, not the actual next action
        td_target = reward + gamma * np.max(Q[next_state])
    td_error = td_target - Q[state][action]
    Q[state][action] += alpha * td_error
    return td_error

Chapter Overview
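As a quick sanity check, the update can be exercised on a tiny hand-built Q-table. The states, actions, and values below are arbitrary, chosen only to make the arithmetic easy to follow:

```python
import numpy as np

# Tiny Q-table: 3 states x 2 actions, illustrative values only
Q = {s: np.zeros(2) for s in range(3)}
Q[1][:] = [0.5, 2.0]   # values in the next state; the max is 2.0

alpha, gamma = 0.1, 0.99
state, action, reward, next_state = 0, 0, 1.0, 1

# Q-learning target uses the max over next actions,
# regardless of which action the agent actually takes next
td_target = reward + gamma * np.max(Q[next_state])   # 1.0 + 0.99 * 2.0 = 2.98
td_error = td_target - Q[state][action]              # 2.98 - 0.0
Q[state][action] += alpha * td_error

print(round(Q[state][action], 4))  # 0.298
```

Note that the target would be the same even if the behavior policy's next action were the worse one (value 0.5); that indifference to the action actually taken is what makes the method off-policy.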
This chapter introduces Q-learning—the foundation of modern deep RL. We’ll cover:
The Q-Learning Idea
Learning optimal values regardless of behavior policy
The Q-Learning Algorithm
Complete algorithm with implementation and examples
SARSA vs Q-Learning
The CliffWalking experiment and when to use each
Convergence and the Deadly Triad
When Q-learning works and when it breaks
Prerequisites
This chapter assumes you’re comfortable with:
- SARSA — The on-policy TD control algorithm
- Introduction to TD Learning — TD updates and the TD error
Key Questions We’ll Answer
- What does “off-policy” mean and why is it powerful?
- How does a one-word change (max over next actions vs. the actual next action A_{t+1}) transform the algorithm?
- Why does Q-learning find the risky path in Cliff Walking?
- What is the “deadly triad” and why should we care?
The Big Picture
The relationship between SARSA and Q-learning parallels a fundamental question in RL:
Should we learn about what we do, or what we should do?
- SARSA learns about the policy being followed (including exploration)
- Q-learning learns about the optimal policy (ignoring exploration)
Both are valid choices with different trade-offs. Q-learning converges to the true optimal policy but ignores exploration risk; SARSA is safer during training but may settle for a more conservative policy.
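The trade-off shows up cleanly if the two TD targets are written side by side. This is a sketch, and the epsilon-greedy helper is an assumed stand-in for whatever behavior policy is actually in use:

```python
import numpy as np

rng = np.random.default_rng(0)

def epsilon_greedy(Q, state, epsilon=0.1):
    # Behavior policy used by BOTH algorithms while training
    if rng.random() < epsilon:
        return int(rng.integers(len(Q[state])))
    return int(np.argmax(Q[state]))

def sarsa_target(Q, reward, next_state, gamma=0.99, epsilon=0.1):
    # On-policy: bootstrap from the action the behavior policy will take,
    # so exploratory (possibly bad) actions leak into the learned values
    next_action = epsilon_greedy(Q, next_state, epsilon)
    return reward + gamma * Q[next_state][next_action]

def q_learning_target(Q, reward, next_state, gamma=0.99):
    # Off-policy: bootstrap from the greedy (max) action,
    # regardless of what the behavior policy does next
    return reward + gamma * np.max(Q[next_state])
```

With epsilon = 0 the two targets coincide; the gap between them grows with the amount of exploration, which is exactly why SARSA's values reflect exploration risk and Q-learning's do not.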
Key Takeaways
- Q-learning uses max_a Q(S_{t+1}, a) instead of Q(S_{t+1}, A_{t+1}) in its TD target
- This makes it off-policy: it learns regardless of behavior
- Q-learning finds optimal policies but ignores exploration risk
- The deadly triad (off-policy + function approximation + bootstrapping) can cause instability
- Double Q-learning addresses maximization bias
- Q-learning is the foundation for DQN and modern deep RL
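The double Q-learning fix mentioned above can be sketched as two tables that decouple action selection from action evaluation. This is a minimal illustration of one update step, not a full implementation (the complete algorithm swaps the roles of the two tables at random on each step):

```python
import numpy as np

def double_q_update(Q1, Q2, state, action, reward, next_state,
                    alpha=0.1, gamma=0.99, done=False):
    """One double Q-learning step: Q1 selects the next action,
    Q2 evaluates it, which suppresses maximization bias."""
    if done:
        td_target = reward
    else:
        best_next = int(np.argmax(Q1[next_state]))              # select with Q1
        td_target = reward + gamma * Q2[next_state][best_next]  # evaluate with Q2
    Q1[state][action] += alpha * (td_target - Q1[state][action])
```

Because the table that picks the argmax is not the one that supplies the value, a state-action pair that looks good only through noise in Q1 is unlikely to also be overestimated by Q2, so the upward bias of the plain max largely cancels.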