The Q-Learning Idea
Q-learning is perhaps the most influential algorithm in reinforcement learning. It’s the foundation of DQN, which famously learned to play Atari games, and its descendants power many modern AI systems. The core idea is elegant: learn the optimal value function directly, regardless of how you’re actually behaving.
From SARSA to Q-Learning
Recall SARSA’s approach:
- Take action a in state s
- Observe reward r and next state s'
- Choose next action a' from your policy
- Update using Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
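In code, one SARSA step looks like this (a minimal tabular sketch; the dict-based Q-table, state encoding, and hyperparameter values are illustrative, not from the text):

```python
# One tabular SARSA update. Q is a dict mapping (state, action) -> value;
# missing entries default to 0.0. Hyperparameters are illustrative.

def sarsa_update(Q, s, a, r, s_next, a_next, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the SARSA target r + gamma * Q(s', a')."""
    td_target = r + gamma * Q.get((s_next, a_next), 0.0)
    td_error = td_target - Q.get((s, a), 0.0)
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * td_error
    return Q
```

Note that the target uses Q(s', a') for the action a' that the policy actually chose, exploration and all.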
SARSA uses the action you’ll actually take next. This means if your policy sometimes explores randomly, the Q-values account for that exploration.
Q-learning changes one thing: instead of using the next action a', use the best action:
Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
This single change transforms the algorithm from on-policy to off-policy.
The One-Word Difference
SARSA (on-policy):
Q(s, a) ← Q(s, a) + α[r + γ Q(s', a') − Q(s, a)]
Q-Learning (off-policy):
Q(s, a) ← Q(s, a) + α[r + γ max_a' Q(s', a') − Q(s, a)]
The only difference: max_a' Q(s', a') replaces the value of the actual next action, Q(s', a').
But this small change has profound implications:
- SARSA learns Q^π — the value of the current policy π
- Q-learning learns Q* — the value of the optimal policy
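The one-word difference is literally a one-line difference in code (a minimal tabular sketch with illustrative names and hyperparameters):

```python
# One tabular Q-learning update. The only change from SARSA is the TD
# target: max over next actions instead of the action actually taken next.
# Q is a dict mapping (state, action) -> value; defaults are illustrative.

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """Move Q(s, a) toward the Q-learning target r + gamma * max_a' Q(s', a')."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next    # SARSA would use Q(s', a') here
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q
```

Notice the update never needs to know which action the agent will actually take next; that is what makes it off-policy.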
What Does Off-Policy Mean?
Learning about one policy (the target policy) while following a different policy (the behavior policy). In Q-learning, the target policy is greedy with respect to Q, while the behavior policy can be anything that ensures sufficient exploration.
Think of off-policy learning like this:
On-policy (SARSA): “I’m learning how good my actions are, given that I’ll keep behaving the same way—including my mistakes and random exploration.”
Off-policy (Q-learning): “I’m learning how good actions would be if I always acted optimally afterwards—even though I’m not actually acting optimally right now.”
Q-learning separates learning from behavior. You can follow a wildly exploratory policy, even a random one, and still learn the optimal Q-values.
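A tiny demonstration of this separation (the 3-state chain environment here is invented for illustration): follow a purely random behavior policy, apply Q-learning updates, and the Q-values still approach the optimal ones.

```python
# Off-policy demo: act randomly, yet learn optimal Q-values.
# Environment (illustrative): states 0 and 1; state 2 is terminal.
# Action 1 moves right, action 0 moves left (floor at 0).
# Reward 1.0 for stepping into the terminal state, else 0.0.
import random

random.seed(0)
GAMMA, ALPHA, ACTIONS, TERMINAL = 0.9, 0.5, (0, 1), 2
Q = {(s, a): 0.0 for s in (0, 1) for a in ACTIONS}

def step(s, a):
    """Deterministic chain dynamics."""
    s_next = min(s + 1, TERMINAL) if a == 1 else max(s - 1, 0)
    return s_next, (1.0 if s_next == TERMINAL else 0.0)

for _ in range(500):
    s = 0
    while s != TERMINAL:
        a = random.choice(ACTIONS)          # behavior policy: uniformly random
        s_next, r = step(s, a)
        best_next = 0.0 if s_next == TERMINAL else max(Q[(s_next, b)] for b in ACTIONS)
        Q[(s, a)] += ALPHA * (r + GAMMA * best_next - Q[(s, a)])  # greedy target
        s = s_next

# Q approaches Q* even though we never acted greedily:
# Q*(1, right) = 1.0 and Q*(0, right) = gamma * 1.0 = 0.9
```

Running SARSA on the same random behavior would instead converge to the value of the random policy, which is much lower.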
Why Is This Powerful?
Off-policy learning gives us several advantages:
Learn the Optimal Policy
Q-learning converges to Q* regardless of what policy you follow (given sufficient exploration).
Learn from Any Data
You can learn from data collected by other policies, human demonstrations, or even random exploration.
Experience Replay
You can store transitions and learn from them multiple times—key for sample efficiency in deep RL.
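A replay buffer can be sketched in a few lines (the class name, capacity, and transition layout are illustrative):

```python
# Sketch of an experience replay buffer. Because Q-learning is off-policy,
# old transitions remain valid training data even after the policy that
# generated them has changed.
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)   # oldest transitions drop off

    def push(self, s, a, r, s_next, done):
        self.buffer.append((s, a, r, s_next, done))

    def sample(self, batch_size):
        """Uniformly sample a batch of stored transitions."""
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

An on-policy method could not reuse this data freely, since the stored next actions came from stale policies.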
Flexible Exploration
The behavior policy can be anything: epsilon-greedy, Boltzmann, UCB, or even human control.
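Two of these behavior policies, sketched for a tabular Q function (function names and default parameters are illustrative):

```python
# Two common behavior policies over a dict-based tabular Q function.
import math
import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon pick a random action, otherwise the greedy one."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q.get((s, a), 0.0))

def boltzmann(Q, s, actions, temperature=1.0):
    """Sample actions with probability proportional to exp(Q / temperature)."""
    weights = [math.exp(Q.get((s, a), 0.0) / temperature) for a in actions]
    return random.choices(actions, weights=weights)[0]
```

Either one works as the behavior policy; the Q-learning target stays greedy regardless of which is used to generate experience.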
The Two Policies
In off-policy learning, we maintain two distinct policies:
Behavior Policy (b): The policy we actually follow. This is what generates our experience. It must explore to ensure we visit all important state-action pairs.
Target Policy (π): The policy we’re learning about. In Q-learning, this is the greedy policy:
π(s) = argmax_a Q(s, a)
We follow b but learn about π. The Q-values we learn reflect what would happen if we followed π, not b.
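Extracting the target policy from a Q-table is a one-liner (sketch; names are illustrative):

```python
# The target policy in Q-learning is just "act greedily with respect to Q".

def greedy_policy(Q, actions):
    """Return a function mapping state -> argmax_a Q(s, a)."""
    def pi(s):
        return max(actions, key=lambda a: Q.get((s, a), 0.0))
    return pi
```

Nothing about this policy is ever executed during training; it exists implicitly inside the max of the update rule.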
The Power of Off-Policy Learning
Imagine you’re training a robot to navigate a warehouse. You have:
- Data from human operators (who sometimes make suboptimal choices)
- Data from a previous, imperfect navigation system
- Data from safe, random exploration
With SARSA (on-policy), you’d need to collect fresh data for every policy you evaluate.
With Q-learning (off-policy), you can:
- Pool all your data together
- Learn from all of it simultaneously
- Converge to the optimal policy even though none of your data came from that policy
This is why off-policy learning is so valuable for real-world applications.
The Intuition Behind Max
Why does using max_a' Q(s', a') instead of Q(s', a') lead to learning Q*?
The key is what the TD target represents:
SARSA’s target: r + γ Q(s', a')
- “The reward I got, plus what I expect if I keep following my current policy”
Q-learning’s target: r + γ max_a' Q(s', a')
- “The reward I got, plus what I expect if I act optimally from now on”
By always asking “what if I were optimal?”, Q-learning learns the value of being optimal—Q*.
The actual behavior doesn’t matter (as long as we explore enough) because we’re always updating toward optimal future behavior.
The Bellman optimality equation for Q* is:
Q*(s, a) = E[r + γ max_a' Q*(s', a') | s, a]
Q-learning is essentially sampling this equation:
- We observe a transition (s, a, r, s')
- We use r + γ max_a' Q(s', a') as our estimate of what the right-hand side should be
- We update Q(s, a) toward this target
Over time, with enough samples and the right conditions (every state-action pair visited infinitely often, appropriately decaying step sizes), Q(s, a) converges to Q*(s, a).
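A quick numeric sanity check of the sampled update (the numbers are made up for illustration): if the sampled target happens to be constant, repeated updates converge to exactly that target.

```python
# Suppose from (s, a) we always observe r = 1 and the best next value is 4,
# with gamma = 0.5, so the target is always 1 + 0.5 * 4 = 3.
alpha, gamma = 0.1, 0.5
q, r, best_next = 0.0, 1.0, 4.0
for _ in range(200):
    q += alpha * (r + gamma * best_next - q)   # repeated sampled updates

# q has converged to the fixed point of the target, 3.0
```

In the real algorithm the target moves as neighboring Q-values improve, but the same pull-toward-the-target dynamic drives the whole table toward Q*.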
The Trade-Off: Optimality vs. Safety
Q-learning learns optimal values, but there’s a catch: it ignores the exploration risk.
When we use max_a' Q(s', a'), we’re assuming we’ll take the best action next time. But if we’re actually exploring with epsilon-greedy, we might take a random (possibly bad) action.
This makes Q-learning learn riskier strategies than SARSA—strategies that assume perfect future behavior. In the Cliff Walking environment, this leads Q-learning right along the cliff edge, where random exploration can be fatal.
We’ll explore this trade-off in detail when we compare SARSA and Q-learning on Cliff Walking.
Historical Note
Q-learning was developed by Chris Watkins in his 1989 PhD thesis. The name “Q” was chosen simply as a symbol for the action-value function, following the convention of using V for state values. Watkins’ proof that Q-learning converges to optimal values was a landmark result that established the theoretical foundations of model-free reinforcement learning.
Summary
The Q-learning idea is beautifully simple:
- Use max_a' Q(s', a') in the TD target instead of the value of the actual next action
- This separates behavior from learning — you can explore freely while learning optimal values
- Q-learning converges to Q* — the optimal action-value function
- Off-policy learning enables — learning from any data, experience replay, and flexible exploration
In the next section, we’ll see the complete Q-learning algorithm and implement it step by step.