Introduction to Temporal Difference Learning
What You'll Learn
- Explain the TD learning idea: bootstrapping from estimates
- Implement TD(0) for value prediction
- Identify the TD error and its role in learning
- Compare TD learning with Monte Carlo and Dynamic Programming
- Explain bias-variance tradeoffs in TD vs MC
Monte Carlo methods wait until the end of an episode to learn. Dynamic programming needs a model of the environment. What if we could learn step-by-step, from experience, without a model? That’s TD learning—the heart of modern reinforcement learning.
TD(0) vs Monte Carlo Learning
Compare how TD and MC learn value estimates on the Random Walk problem.
- TD(0) updates after every step using bootstrap estimates (V(s) depends on V(s'))
- Monte Carlo waits until episode end to update, using actual returns
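The contrast above can be sketched in code. This is a minimal, illustrative version of the five-state Random Walk (states 1–5 between two terminals, reward +1 on the right exit, true values $i/6$); the function names and hyperparameters here are choices for the sketch, not prescribed by the chapter.

```python
import random

# True values for the five non-terminal states of the random walk: i/6.
TRUE_V = [i / 6 for i in range(7)]

def run_episode():
    """Random walk from state 3; returns (state, reward, next_state) steps."""
    s, traj = 3, []
    while s not in (0, 6):
        s2 = s + random.choice((-1, 1))
        r = 1.0 if s2 == 6 else 0.0
        traj.append((s, r, s2))
        s = s2
    return traj

def td0(episodes=200, alpha=0.1):
    """TD(0): update after every step, bootstrapping from V[next_state]."""
    V = [0.0] * 7  # terminal states stay at 0
    for _ in range(episodes):
        for s, r, s2 in run_episode():
            target = r if s2 in (0, 6) else r + V[s2]  # gamma = 1 here
            V[s] += alpha * (target - V[s])
    return V

def mc(episodes=200, alpha=0.1):
    """Constant-alpha every-visit Monte Carlo: update only at episode end."""
    V = [0.0] * 7
    for _ in range(episodes):
        traj = run_episode()
        G = traj[-1][1]  # undiscounted return = final reward on this task
        for s, _, _ in traj:
            V[s] += alpha * (G - V[s])
    return V

random.seed(0)
v_td, v_mc = td0(), mc()
rms = lambda V: (sum((V[i] - TRUE_V[i]) ** 2 for i in range(1, 6)) / 5) ** 0.5
print(rms(v_td), rms(v_mc))
```

Note the structural difference: `td0` updates inside the step loop, while `mc` cannot touch `V` until `run_episode` has returned.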
Why TD Learning Matters
Temporal Difference (TD) learning combines the best of both worlds:
- Like Monte Carlo, it learns from experience (no model needed)
- Like Dynamic Programming, it updates estimates from estimates (bootstrapping), allowing learning before an episode ends
This is the key insight that enables practical reinforcement learning. Nearly every successful RL algorithm—from Q-learning to DQN to PPO—builds on TD ideas.
Temporal difference learning: a class of RL methods that learn by comparing predictions made at different points in time. The "temporal difference" is the gap between successive predictions; if our value estimates are correct, this difference should be zero on average.
The TD Error
The TD error, the difference between what we expected and what we got, becomes our learning signal:

$$\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$$

Think of the TD error as a "surprise" signal:
- If $\delta_t > 0$: Things went better than expected. Increase $V(S_t)$.
- If $\delta_t < 0$: Things went worse than expected. Decrease $V(S_t)$.
- If $\delta_t = 0$: Our prediction was perfect. No change needed.
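As a quick numeric check (the numbers here are made up for illustration): suppose $V(s) = 0.5$, we receive a reward of 1, $\gamma = 0.9$, and $V(s') = 0.6$.

```python
# Hypothetical values, chosen only to make the arithmetic concrete.
V_s, reward, gamma, V_s_next = 0.5, 1.0, 0.9, 0.6

td_target = reward + gamma * V_s_next  # 1 + 0.9 * 0.6 = 1.54
td_error = td_target - V_s             # 1.54 - 0.5 = 1.04
print(td_error)
```

Since the TD error is positive, this transition went better than expected, so the update rule nudges $V(s)$ upward.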
This simple idea powers everything from tabular Q-learning to modern deep RL systems.
Chapter Overview
This chapter introduces TD learning—the foundation for Q-learning, SARSA, and all of modern deep RL. We’ll cover:
The TD Idea
Bootstrapping: learning from incomplete returns without waiting for episodes to end
TD(0) Prediction
The simplest TD method for value estimation with worked examples
TD vs Monte Carlo
Bias, variance, and when to use each approach
TD at a Glance
The core TD(0) update rule:

$$V(S_t) \leftarrow V(S_t) + \alpha \left[ R_{t+1} + \gamma V(S_{t+1}) - V(S_t) \right]$$

Where:
- $V(S_t)$ is our current estimate of the state's value
- $\alpha$ is the learning rate
- $R_{t+1} + \gamma V(S_{t+1})$ is the TD target
- $\delta_t = R_{t+1} + \gamma V(S_{t+1}) - V(S_t)$ is the TD error
```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Single TD(0) update."""
    if done:
        td_target = reward  # terminal transition: no bootstrapping
    else:
        td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error
```

Prerequisites
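To watch bootstrapping propagate value backward, here is a minimal driver for the update above (the two-state chain is an illustrative toy, not a task from the chapter): we repeatedly sweep a deterministic chain s0 → s1 → terminal, with a reward of 1 on the final step.

```python
def td_update(V, state, reward, next_state, alpha=0.1, gamma=0.99, done=False):
    """Single TD(0) update, as defined above."""
    if done:
        td_target = reward  # terminal transition: no bootstrapping
    else:
        td_target = reward + gamma * V[next_state]
    td_error = td_target - V[state]
    V[state] += alpha * td_error
    return td_error

# Toy deterministic chain: s0 -> s1 -> terminal, reward 1 on the last step.
V = {0: 0.0, 1: 0.0}
for episode in range(500):
    td_update(V, state=0, reward=0.0, next_state=1)
    td_update(V, state=1, reward=1.0, next_state=None, done=True)

# V[1] approaches 1.0, and V[0] approaches gamma * V[1] = 0.99:
# value flows backward from the rewarding transition, one step per episode.
print(V)
```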
This chapter assumes you’re comfortable with:
- The Bellman Equations — TD targets are sample-based Bellman updates
- Policy Evaluation — We’re solving the same problem, but without a model
Key Questions We’ll Answer
- How can we learn from incomplete episodes?
- What is bootstrapping and why does it introduce bias?
- How does the TD error act as a “surprise” signal?
- When should you use TD over Monte Carlo?
- How does information propagate through TD updates?
The Big Picture
TD learning represents a fundamental shift in how we think about learning from experience:
TD’s ability to learn incrementally, without a model, from incomplete episodes makes it the practical workhorse of reinforcement learning.
Key Takeaways
- TD learning updates estimates using other estimates (bootstrapping)
- The TD error measures “surprise”
- TD has lower variance than MC but introduces bias
- TD can learn online, during an episode, without waiting for termination
- TD works for continuing (non-episodic) tasks where MC cannot
- The bias-variance tradeoff often favors TD in practice