The TD Idea
Imagine trying to predict how long your commute will take. Monte Carlo would say: “Wait until you arrive, then update your prediction based on the actual time.” But you’re smarter than that. As you drive, you constantly revise your estimate: “I’ve been driving for 10 minutes and I’m at the highway—based on how long the rest usually takes, I think the total will be about 25 minutes.”
This is the essence of Temporal Difference (TD) learning: updating predictions based on other predictions, before the final outcome is known.
The Problem with Waiting
Recall from Dynamic Programming that we can compute value functions if we have a model of the environment. And from Monte Carlo methods, we can estimate values from complete episodes of experience. But both approaches have significant limitations:
- Dynamic Programming requires a complete model of the environment (transition probabilities, rewards)
- Monte Carlo requires waiting until an episode ends to learn anything
What if the episode is very long? What if it never ends (continuing tasks)? What if we want to learn online, updating our estimates as we go?
**Temporal Difference (TD) learning:** A family of RL methods that learn by comparing predictions made at successive time steps. The “temporal difference” is the gap between what we predicted a state’s value would be and what we now think it should be, based on the immediate reward and our estimate of the next state’s value.
The Key Insight: Bootstrapping
Here’s the central idea of TD learning:
Instead of waiting for the actual return (which requires the episode to end), we can use our current estimate of the next state’s value as a stand-in for everything that comes after the first reward.
Think of it like this: if you’re predicting the total score of a basketball game at halftime, you don’t need to wait for the game to end. You can use the current score plus your estimate of how the second half will go.
The actual return is:

$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$

The TD approximation replaces everything after $R_{t+1}$ with our estimate of the next state’s value:

$$G_t \approx R_{t+1} + \gamma V(S_{t+1})$$

This quantity, $R_{t+1} + \gamma V(S_{t+1})$, is called the TD target.
**Bootstrapping:** Using estimated values to update other estimates. In TD learning, we use our current estimate of $V(S_{t+1})$ to improve our estimate of $V(S_t)$, rather than waiting for the actual return. The name comes from the phrase “pulling yourself up by your bootstraps.”
The term “bootstrapping” captures both the power and the risk of this approach:
- Power: We can learn immediately, without waiting
- Risk: We’re building estimates on top of estimates—if our estimates are wrong, we might reinforce those errors
Let’s be precise about what’s happening. The true value function $v_\pi$ satisfies the Bellman equation:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$
This says: the value of a state equals the expected immediate reward plus the discounted value of the next state.
Monte Carlo estimates $v_\pi(s)$ by averaging actual returns:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ G_t - V(S_t) \right]$$

TD learning replaces the actual return $G_t$ with the bootstrapped estimate:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

The term in brackets is called the TD error:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$
The TD Error: A Surprise Signal
The TD error tells us how “surprised” we are at each step. It measures the difference between:
- What we expected: $V(S_t)$, our estimate of the current state’s value
- What we now think: $R_{t+1} + \gamma V(S_{t+1})$, the reward we got plus our estimate of where we ended up

If $\delta_t > 0$, things went better than expected (positive surprise). If $\delta_t < 0$, things went worse than expected (negative surprise). If $\delta_t = 0$, our prediction was perfect.
Over time, as our value estimates improve, the TD error should average to zero—meaning our predictions are consistent with what actually happens.
A Concrete Example
Suppose you’re in GridWorld, in a state $s$ with estimated value $V(s) = 5.0$. You take an action, receive reward $r = 1$, and end up in a state $s'$ with estimated value $V(s') = 4.5$. With $\gamma = 0.9$:

- TD target: $r + \gamma V(s') = 1 + 0.9 \times 4.5 = 5.05$
- TD error: $\delta = 5.05 - 5.0 = 0.05$

The TD error is positive! This means the transition was slightly better than our current estimate suggests. With learning rate $\alpha = 0.1$:

- New estimate: $V(s) \leftarrow 5.0 + 0.1 \times 0.05 = 5.005$
We’ve nudged our estimate upward to account for this surprisingly good transition.
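To sanity-check the arithmetic, here is the same single update computed directly, using illustrative values $V(s) = 5.0$, $r = 1$, $V(s') = 4.5$, $\gamma = 0.9$, $\alpha = 0.1$:

```python
# One TD(0) update with illustrative values.
V_s, r, V_next = 5.0, 1.0, 4.5
gamma, alpha = 0.9, 0.1

td_target = r + gamma * V_next    # 1 + 0.9 * 4.5 = 5.05
td_error = td_target - V_s        # 5.05 - 5.0 = 0.05
V_s_new = V_s + alpha * td_error  # 5.0 + 0.1 * 0.05 = 5.005

print(td_target, td_error, V_s_new)
```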
Why TD Learning is Powerful
TD learning gives us several advantages over other methods:
Online Learning
Updates happen every step, not just at episode end. Learning is immediate and continuous.
No Model Required
Like Monte Carlo, TD learns from experience. No need to know transition probabilities.
Continuing Tasks
Works for tasks that never end. MC requires episodes; TD doesn’t.
Lower Variance
Updates depend on one transition, not an entire episode. Less noise in estimates.
The Cost: Bias
TD learning introduces bias because we’re using estimates to update estimates. If $V(S_{t+1})$ is wrong, we’re learning from a faulty target. This bias decreases as our estimates improve, but it never fully disappears until convergence.
The bias-variance tradeoff is fundamental:
- Monte Carlo: Unbiased (uses actual returns) but high variance (returns are noisy)
- TD Learning: Biased (uses estimated returns) but low variance (single-step updates)
In practice, TD’s lower variance often wins. Learning faster with slightly biased estimates usually beats learning slowly with perfect but noisy estimates.
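This tradeoff is easy to see numerically. The toy experiment below (all numbers invented) draws many noisy episodes from a fixed state and compares the spread of MC targets (full returns) against TD targets (one reward plus a fixed, deliberately wrong next-state estimate):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: from a fixed state, the return is the sum of 10 noisy
# reward steps (no discounting, for simplicity). True return: 10.0.
horizon = 10
true_step_reward = 1.0
noise_std = 1.0

# Deliberately wrong estimate of the remaining return after step 1
# (the true remaining return is 9.0).
V_next_estimate = 8.5

mc_targets, td_targets = [], []
for _ in range(10_000):
    rewards = true_step_reward + noise_std * rng.standard_normal(horizon)
    mc_targets.append(rewards.sum())                 # unbiased, high variance
    td_targets.append(rewards[0] + V_next_estimate)  # biased, low variance

mc_targets, td_targets = np.array(mc_targets), np.array(td_targets)
print(f"MC: mean={mc_targets.mean():.2f}, var={mc_targets.var():.2f}")
print(f"TD: mean={td_targets.mean():.2f}, var={td_targets.var():.2f}")
```

Roughly, the MC targets center on the true value of 10 with variance near 10, while the TD targets cluster tightly (variance near 1) but are pulled toward the wrong estimate.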
Here’s the core TD idea in code:
```python
import numpy as np


def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """
    Perform a single TD(0) update.

    The key insight: instead of waiting for the actual return,
    we use r + gamma * V[next_state] as our target.
    """
    if done:
        # Terminal state: no future value
        td_target = reward
    else:
        # Bootstrapping: use our estimate of the next state's value
        td_target = reward + gamma * V[next_state]

    # TD error: how surprised are we?
    td_error = td_target - V[state]

    # Update: move our estimate toward the target
    V[state] = V[state] + alpha * td_error

    return td_error
```

Notice the structure:
- Compute the target: $R_{t+1} + \gamma V(S_{t+1})$ (or just $R_{t+1}$ if terminal)
- Compute the error: how far off were we?
- Update: take a step toward the target
This pattern—target, error, update—appears throughout reinforcement learning.
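As a usage sketch, here is that target, error, update loop applied to one short, made-up episode on a three-state chain (states, rewards, and step size are all illustrative):

```python
# One episode of TD(0) on a tiny chain: 0 -> 1 -> 2 -> terminal.
alpha, gamma = 0.5, 1.0
V = {0: 0.0, 1: 0.0, 2: 0.0}

episode = [  # (state, reward, next_state, done)
    (0, 0.0, 1, False),
    (1, 0.0, 2, False),
    (2, 1.0, None, True),
]

for state, reward, next_state, done in episode:
    target = reward if done else reward + gamma * V[next_state]  # target
    error = target - V[state]                                    # error
    V[state] += alpha * error                                    # update
    print(f"s={state}: target={target}, error={error:+.2f}, V={V}")
```

After this first episode only the state nearest the reward has changed; over repeated episodes the value information propagates backward through the chain, one step at a time.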
Connection to the Bellman Equation
TD learning can be understood as a sample-based version of the Bellman equation.
The Bellman equation for $v_\pi$ says:

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\left[ r + \gamma\, v_\pi(s') \right]$$

This requires computing an expectation over all possible transitions—which needs a model.

TD learning replaces the expectation with a single sampled transition:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$
Each transition gives us one sample of what the right-hand side of the Bellman equation should be. By averaging many such samples (through the incremental update), we converge to the true value function.
This is why TD learning is sometimes called “sample-based dynamic programming” or “model-free bootstrapping.”
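One way to see the equivalence: on a toy one-state example (numbers invented), the exact Bellman backup, which needs the transition probabilities, and the average of many sampled TD targets, which does not, agree:

```python
import numpy as np

rng = np.random.default_rng(1)

# From one state there are two possible successors, with known
# probabilities, rewards, and current value estimates (illustrative).
gamma = 0.9
p = np.array([0.7, 0.3])       # transition probabilities
r = np.array([1.0, 0.0])       # reward on each transition
V_next = np.array([2.0, 5.0])  # current next-state value estimates

# Model-based (DP) backup: the exact expectation over transitions.
bellman_backup = np.sum(p * (r + gamma * V_next))

# Model-free (TD) view: sample transitions, average the sampled targets.
idx = rng.choice(2, size=100_000, p=p)
sampled_targets = r[idx] + gamma * V_next[idx]

print(bellman_backup, sampled_targets.mean())
```

The sample average converges to the exact backup as the number of sampled transitions grows; the incremental TD update performs this averaging implicitly.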
The Relationship to DP and MC
TD learning sits between Dynamic Programming and Monte Carlo:
| Aspect | DP | MC | TD |
|---|---|---|---|
| Bootstraps? | Yes | No | Yes |
| Uses samples? | No | Yes | Yes |
| Requires model? | Yes | No | No |
| Requires episodes? | No | Yes | No |
| Update timing | Synchronous | End of episode | Every step |
TD learning is sometimes called the “one weird trick” of RL. It combines the sample-based learning of Monte Carlo with the bootstrapping of Dynamic Programming, getting the best of both worlds while avoiding their main limitations.
Summary
The TD idea is deceptively simple but profoundly powerful:
- Don’t wait for the end—update predictions immediately using new observations
- Bootstrap from estimates—use $R_{t+1} + \gamma V(S_{t+1})$ as a stand-in for the unknown future
- The TD error is your guide—it tells you how to adjust your predictions
This single idea—learning from temporal differences—is the foundation for Q-learning, SARSA, and all modern deep RL algorithms.
In the next section, we’ll see how to turn this idea into a concrete algorithm: TD(0).