Introduction to Temporal Difference Learning
What You'll Learn
- Explain what bootstrapping means and why it’s powerful
- Understand the TD error as a “surprise” signal
- Implement TD(0) for policy evaluation
- Compare TD, Monte Carlo, and Dynamic Programming approaches
- Recognize when TD methods are preferable to alternatives
What If You Could Learn from Every Step?
Imagine you’re learning to play chess. With Monte Carlo methods, you’d play an entire game, see whether you won or lost, and only then reflect on your moves. That could be 50+ moves before you learn anything!
What if, instead, you could learn something after every single move? What if each step could teach you whether you’re heading toward victory or defeat?
This is the core insight of Temporal Difference (TD) learning. TD methods learn from experience like Monte Carlo, but they update their estimates after every step rather than waiting until the end. This makes them dramatically more efficient—especially in environments with long episodes or continuous tasks that never truly end.
Here’s the key intuition: after taking an action, you receive a reward and find yourself in a new state. Even before the episode ends, you already know something valuable—you got that reward, and you have an estimate of how good your new state is. Why not use that information immediately?
TD learning does exactly this. It combines:
- The reward you just received (actual experience)
- Your estimate of the new state’s value (what you expect to get from here)
And uses this combination to update your estimate of the previous state. You’re learning as you go, step by step.
The Magic of Bootstrapping
The technique that makes this possible is called bootstrapping—updating an estimate based on other estimates. It might sound circular, but it’s surprisingly powerful.
Think of it like this: you’re trying to estimate how long it takes to drive from your house to the airport. You’ve made this trip many times, but traffic varies.
Monte Carlo approach: Time the entire trip, start to finish, every time. After many trips, average your times.
TD approach: Also time the entire trip, but break it into segments. Each time you reach the halfway point, compare:
- How long did the first half take?
- What’s your best estimate for the second half?
- Total = actual first half + estimated second half
If this total is different from what you expected for the full trip, adjust your expectations. You’re learning from partial experience, updating your beliefs before you even arrive.
The key insight: even though your estimate of the second half isn’t perfect, using it lets you learn faster than waiting until you reach the airport every time.
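To make the driving analogy concrete, here is a minimal sketch with made-up numbers (the 40-minute expectation, the segment times, and the adjustment rate are all hypothetical):

```python
# Hypothetical driving-time example: every number here is made up for illustration.
expected_total = 40.0         # current estimate for the full trip (minutes)
estimated_second_half = 18.0  # current estimate for the remaining segment
alpha = 0.5                   # how strongly to adjust toward new information

actual_first_half = 26.0      # what the first segment actually took today

# TD-style target: actual experience so far + estimate for the rest
td_target = actual_first_half + estimated_second_half  # 44.0 minutes
surprise = td_target - expected_total                  # +4.0 -> worse than expected

# Nudge the full-trip estimate toward the target, before even reaching the airport
expected_total += alpha * surprise
print(expected_total)  # 42.0
```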
Mathematical Details
Let’s formalize this. Recall from Dynamic Programming that the value of a state under policy $\pi$ satisfies the Bellman equation:

$$v_\pi(s) = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) \mid S_t = s\right]$$
This tells us: the value of a state equals the expected immediate reward plus the discounted value of the next state.
In Dynamic Programming, we used this equation with a complete model of the environment. But what if we don’t have a model? We can still use the structure of this equation—we just estimate the expectation from actual experience:
- Instead of the expected reward, we use the actual reward $R_{t+1}$ we received
- Instead of the true value $v_\pi(S_{t+1})$, we use our current estimate $V(S_{t+1})$

This gives us the TD target:

$$R_{t+1} + \gamma\, V(S_{t+1})$$
The TD(0) Update Rule
Now we can write the complete TD(0) update rule—the simplest form of TD learning.
After each step, we ask: “What did I expect to get from this state? What does it look like I’ll actually get?” The difference between these—the TD error—tells us how to adjust our estimates.
If the TD error is positive, things turned out better than expected. Increase our estimate. If the TD error is negative, things turned out worse. Decrease our estimate.
Mathematical Details
The TD(0) update rule for policy evaluation is:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)\right]$$

Where:
- $V(S_t)$ is our current estimate of the state’s value
- $R_{t+1}$ is the reward received after transitioning
- $\gamma$ is the discount factor
- $V(S_{t+1})$ is our estimate of the next state’s value
- $\alpha$ is the learning rate

The TD error $\delta_t$ is the difference between the TD target and our current estimate:

$$\delta_t = R_{t+1} + \gamma\, V(S_{t+1}) - V(S_t)$$

So the update becomes simply:

$$V(S_t) \leftarrow V(S_t) + \alpha\, \delta_t$$

Compare this to Monte Carlo’s update:

$$V(S_t) \leftarrow V(S_t) + \alpha\left[G_t - V(S_t)\right]$$

The only difference: TD uses $R_{t+1} + \gamma V(S_{t+1})$ as its target, while MC uses the actual return $G_t$. TD’s target is available immediately; MC’s requires waiting until the episode ends.
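As a quick sanity check with made-up numbers (none of these come from the chapter): suppose $V(S_t) = 4.0$, you then observe $R_{t+1} = 1$, land in a state with $V(S_{t+1}) = 5.0$, and use $\gamma = 0.9$, $\alpha = 0.1$. One TD(0) update gives:

$$\delta_t = 1 + 0.9 \times 5.0 - 4.0 = 1.5, \qquad V(S_t) \leftarrow 4.0 + 0.1 \times 1.5 = 4.15$$

The estimate moves a small step toward the target, without waiting for the episode to end.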
Implementation
```python
import numpy as np
from collections import defaultdict


def td_zero_evaluation(env, policy, num_episodes, alpha=0.1, gamma=0.99):
    """
    Evaluate a policy using TD(0).

    Args:
        env: Environment with reset() and step() methods
        policy: Function mapping state -> action
        num_episodes: Number of episodes to run
        alpha: Learning rate
        gamma: Discount factor

    Returns:
        V: Dictionary mapping states to estimated values
    """
    # Initialize value function arbitrarily (all zeros)
    V = defaultdict(float)

    for episode in range(num_episodes):
        state = env.reset()
        done = False

        while not done:
            # Take action according to policy
            action = policy(state)
            next_state, reward, done, _ = env.step(action)

            # TD(0) update: learn from this single step
            td_target = reward + gamma * V[next_state] * (1 - done)
            td_error = td_target - V[state]
            V[state] += alpha * td_error

            state = next_state

    return V
```

Note the `(1 - done)` factor: when we reach a terminal state, there’s no future value to bootstrap from. This is a common bug source in RL implementations!
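As a quick usage sketch: the tiny corridor environment and dummy policy below are hypothetical stand-ins (not part of this chapter’s environments), just enough to exercise `td_zero_evaluation`:

```python
# Hypothetical toy setup: a 3-state corridor 0 -> 1 -> 2, where state 2 is terminal
# and entering it yields reward +1. It exists only to exercise td_zero_evaluation.
class CorridorEnv:
    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):              # the action is ignored: we always move right
        self.state += 1
        done = self.state == 2
        reward = 1.0 if done else 0.0
        return self.state, reward, done, {}


always_right = lambda state: 0           # dummy policy: a single placeholder action

V = td_zero_evaluation(CorridorEnv(), always_right, num_episodes=500, gamma=0.9)
print(round(V[0], 2), round(V[1], 2))    # should approach 0.9 and 1.0
```

From state 1 the next step is terminal with reward 1, so $V(1)$ should approach 1.0, and $V(0)$ should approach $\gamma \times 1.0 = 0.9$.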
Understanding TD Error: The “Surprise” Signal
The TD error is one of the most important concepts in reinforcement learning. It measures how “surprised” the agent is by what happened.
Before taking an action in state $S_t$, your expectation of value is $V(S_t)$.

After taking the action, you observe reward $R_{t+1}$ and end up in state $S_{t+1}$. Now your revised expectation is $R_{t+1} + \gamma V(S_{t+1})$.

The TD error $\delta_t$ is the difference: how much better (or worse) did things turn out compared to your expectations?
- Large positive TD error: “This is way better than I expected!” (Update values upward)
- Large negative TD error: “This is much worse than expected!” (Update values downward)
- TD error near zero: “This is about what I expected.” (Little update needed)
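A tiny helper makes this interpretation mechanical; the function name and the 0.01 “near zero” threshold are arbitrary choices for illustration:

```python
def describe_td_error(reward, gamma, v_next, v_current, tol=0.01):
    """Compute a TD error and describe it as a 'surprise' signal."""
    td_error = reward + gamma * v_next - v_current
    if td_error > tol:
        verdict = "better than expected -> nudge the estimate up"
    elif td_error < -tol:
        verdict = "worse than expected -> nudge the estimate down"
    else:
        verdict = "about what was expected -> little change needed"
    return td_error, verdict


print(describe_td_error(reward=1.0, gamma=0.9, v_next=5.0, v_current=4.0))
# (1.5, 'better than expected -> nudge the estimate up')
```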
Deep Dive
Connection to Neuroscience
Here’s something remarkable: the TD error has a direct analog in the brain. Dopamine neurons in the midbrain fire in a pattern that closely matches the TD error signal.
When a monkey receives an unexpected reward, dopamine neurons fire strongly (positive TD error). When an expected reward is withheld, they go silent (negative TD error). And when rewards match predictions, there’s no change.
This discovery, made by Wolfram Schultz and colleagues in the 1990s, provided some of the strongest evidence that the brain implements something like TD learning. It’s not just a convenient algorithm—it may be how biological brains actually learn from rewards.
Mathematical Details
As learning progresses, the TD error has important properties:
- Expected TD error is zero at convergence: When $V = v_\pi$, we have

  $$\mathbb{E}_\pi\left[\delta_t \mid S_t = s\right] = \mathbb{E}_\pi\left[R_{t+1} + \gamma\, v_\pi(S_{t+1}) - v_\pi(S_t) \mid S_t = s\right] = 0$$

- Individual TD errors still fluctuate: Even with perfect value estimates, each particular transition might give a positive or negative TD error due to stochasticity. It’s only the expected TD error that goes to zero.
- TD error variance depends on the environment: Highly stochastic environments will have larger TD error variance even after convergence.
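A minimal simulation illustrates the first two points. The setup is hypothetical: a single state that terminates immediately with reward +1 or -1 with equal probability, so its true value is 0. Holding $V$ at the true value, individual TD errors stay at $\pm 1$ while their average is near zero:

```python
import numpy as np

rng = np.random.default_rng(0)

V_s = 0.0          # true value of the single state: rewards of +1/-1 average to 0
gamma = 1.0
V_terminal = 0.0   # nothing to bootstrap from after termination

# Sample many one-step episodes: reward is +1 or -1 with equal probability
rewards = rng.choice([1.0, -1.0], size=10_000)
td_errors = rewards + gamma * V_terminal - V_s

print(td_errors.mean())  # near 0: the expected TD error vanishes at convergence
print(td_errors.std())   # near 1: individual TD errors still fluctuate
```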
TD vs Monte Carlo vs Dynamic Programming
Let’s crystallize the differences between these three approaches.
| Method | Needs Model? | Learns Online? | Uses Bootstrapping? |
|---|---|---|---|
| Dynamic Programming | Yes (complete model) | No | Yes |
| Monte Carlo | No | After episodes | No (uses actual returns) |
| TD Learning | No | After each step | Yes |
TD learning is special because it combines the best of both worlds:
- Like DP, it bootstraps (updates estimates from estimates), allowing fast learning
- Like MC, it learns from experience without needing a model
When to Choose TD over MC
Prefer TD when:
- Episodes are long (TD updates after every step)
- Tasks are continuing/non-episodic (MC can’t handle infinite episodes)
- You want faster convergence in practice
- The environment is deterministic or low-variance
Prefer MC when:
- You need unbiased estimates (MC targets are unbiased)
- The environment is highly stochastic (TD’s bootstrap introduces bias)
- Episodes are short anyway
- You want simpler theory
TD’s bootstrapping introduces bias into the value estimates. If your current value estimates are wrong, the TD target will also be wrong because it uses $V(S_{t+1})$. Monte Carlo doesn’t have this issue: it uses actual returns, which are unbiased (but high variance).
This is the famous bias-variance tradeoff in TD vs MC. TD trades some bias for lower variance, which usually leads to faster learning in practice.
Mathematical Details
The Bias-Variance Tradeoff
Monte Carlo target: $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$

- Uses actual rewards (no bias from value function errors)
- High variance: sums many random variables
- Unbiased estimate of $v_\pi(S_t)$

TD(0) target: $R_{t+1} + \gamma\, V(S_{t+1})$

- Uses estimated values (biased if $V \neq v_\pi$)
- Low variance: only uses one reward + one estimate
- Biased estimate that becomes unbiased as $V \to v_\pi$
In practice, TD’s lower variance typically dominates, leading to faster convergence. But the optimal choice depends on the problem.
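One way to see the variance gap empirically is a sketch under simplifying assumptions (a single state whose episode consists of 20 i.i.d. $\pm 1$ reward steps, $\gamma = 1$, and a value estimate already equal to the true value, so both targets are unbiased and only their spread differs):

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps, n_episodes = 20, 10_000

# Each episode: 20 i.i.d. rewards of +1 or -1, so the true value of the state is 0
rewards = rng.choice([1.0, -1.0], size=(n_episodes, n_steps))

mc_targets = rewards.sum(axis=1)        # the return G_t sums 20 random variables
td_targets = rewards[:, 0] + 1.0 * 0.0  # R_{t+1} + gamma * V(S_{t+1}) with V correct

print(mc_targets.var())  # roughly 20: variance grows with episode length
print(td_targets.var())  # roughly 1: only one random reward enters the target
```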
The Random Walk Experiment
The classic demonstration of TD’s advantage comes from a simple experiment introduced by Rich Sutton in 1988.
Consider a 5-state random walk:
```
[T] - [A] - [B] - [C] - [D] - [E] - [T]
 0                                   1
```

- Start in the center state (C)
- Each step, randomly go left or right with equal probability
- Reaching the left terminal gives reward 0
- Reaching the right terminal gives reward 1
- Goal: learn the value of each state (probability of reaching right terminal)
The true values are: A=1/6, B=2/6, C=3/6, D=4/6, E=5/6.
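As a sanity check on those numbers, here is a short sketch that solves the Bellman equations $v(s) = 0.5\,v(s-1) + 0.5\,v(s+1)$ (with the left terminal worth 0 and the right terminal worth 1) as a linear system:

```python
import numpy as np

# v(s) = 0.5 * v(left neighbor) + 0.5 * v(right neighbor), rearranged into A v = b
# for the five non-terminal states A..E.
A = np.eye(5)
b = np.zeros(5)
for i in range(5):
    if i - 1 >= 0:
        A[i, i - 1] -= 0.5   # left neighbor is a non-terminal state
    if i + 1 < 5:
        A[i, i + 1] -= 0.5   # right neighbor is a non-terminal state
    else:
        b[i] += 0.5 * 1.0    # right neighbor is the terminal worth 1

print(np.linalg.solve(A, b))  # [0.1667 0.3333 0.5 0.6667 0.8333] = 1/6 ... 5/6
```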
Implementation
```python
import numpy as np
import matplotlib.pyplot as plt


class RandomWalkEnv:
    """5-state random walk environment."""

    def __init__(self):
        self.n_states = 5  # A, B, C, D, E
        self.reset()

    def reset(self):
        self.state = 2  # Start at C (middle)
        return self.state

    def step(self, action=None):  # Action is ignored (random walk)
        if np.random.random() < 0.5:
            self.state -= 1  # Go left
        else:
            self.state += 1  # Go right

        # Check terminals
        if self.state < 0:                 # Left terminal
            return self.state, 0, True, {}
        elif self.state >= self.n_states:  # Right terminal
            return self.state, 1, True, {}
        else:
            return self.state, 0, False, {}


def td_random_walk(env, num_episodes, alpha=0.1, gamma=1.0):
    """TD(0) on the random walk, returns value estimates over time."""
    V = np.ones(env.n_states) * 0.5  # Initialize to 0.5
    V_history = [V.copy()]

    for _ in range(num_episodes):
        state = env.reset()
        done = False
        while not done:
            next_state, reward, done, _ = env.step()
            if not done:
                td_target = reward + gamma * V[next_state]
            else:
                td_target = reward  # No bootstrapping from a terminal state
            V[state] += alpha * (td_target - V[state])
            state = next_state if not done else state
        V_history.append(V.copy())

    return V, V_history


def mc_random_walk(env, num_episodes, alpha=0.1):
    """Constant-alpha (every-visit) MC on the random walk, returns value estimates over time."""
    V = np.ones(env.n_states) * 0.5  # Initialize to 0.5
    V_history = [V.copy()]

    for _ in range(num_episodes):
        # Generate an episode
        state = env.reset()
        episode = [state]
        rewards = []
        done = False
        while not done:
            next_state, reward, done, _ = env.step()
            rewards.append(reward)
            if not done:
                episode.append(next_state)

        # MC update: use the actual return (which is just the terminal reward here)
        G = sum(rewards)  # No discounting (gamma=1)
        for s in episode:
            V[s] += alpha * (G - V[s])
        V_history.append(V.copy())

    return V, V_history


# True values for comparison
TRUE_VALUES = np.array([1/6, 2/6, 3/6, 4/6, 5/6])

# Run the experiment
env = RandomWalkEnv()
_, td_history = td_random_walk(env, num_episodes=100, alpha=0.1)
_, mc_history = mc_random_walk(env, num_episodes=100, alpha=0.1)


# Compute RMS error against the true values over time
def rms_error(V_history):
    return [np.sqrt(np.mean((V - TRUE_VALUES)**2)) for V in V_history]


td_errors = rms_error(td_history)
mc_errors = rms_error(mc_history)
```

This experiment shows that TD(0) typically achieves lower error than MC across a wide range of learning rates. Try it yourself with different values of $\alpha$!
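To visualize the comparison, a short follow-up (assuming the variables from the block above are still in scope) plots both RMS-error curves:

```python
# Plot learning curves: RMS error against the true values, per episode
plt.figure(figsize=(7, 4))
plt.plot(td_errors, label="TD(0), alpha=0.1")
plt.plot(mc_errors, label="MC, alpha=0.1")
plt.xlabel("Episodes")
plt.ylabel("RMS error")
plt.title("Random walk: TD(0) vs Monte Carlo")
plt.legend()
plt.tight_layout()
plt.show()
```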
Summary
Key Takeaways
- TD learning updates after every step, not just at episode end. This makes it sample-efficient and applicable to continuing tasks.
- Bootstrapping means updating estimates using other estimates. TD bootstraps from experience (model-free), unlike DP which requires a model.
- The TD error measures “surprise”—how much better or worse things turned out compared to expectations.
- TD has lower variance than Monte Carlo but introduces bias from bootstrapping. This tradeoff usually favors TD in practice.
- TD methods form the foundation for Q-learning and modern deep RL algorithms.
Now that we can evaluate a policy step-by-step, the natural question is: can we find the best policy this way? Instead of just evaluating how good states are, can we learn how good actions are—and use that to act optimally?
That’s exactly what Q-learning does, and it’s the focus of our next chapter.
Exercises
Conceptual Questions
- In your own words, explain why TD can learn before an episode ends, while Monte Carlo cannot. What information is TD using that MC ignores?
- What would happen to TD learning if $\gamma = 0$? How would this change what the agent learns? What about $\gamma = 1$?
- Why might TD have lower variance than Monte Carlo? Think about how many random variables are involved in each update.
- When would you prefer Monte Carlo over TD? Give a specific example of an environment or task where MC might be the better choice.
Coding Challenges
- Implement TD(0) for the random walk environment above. Track the value estimates over episodes and plot them converging to the true values.
- Modify your TD agent to track and plot the TD error over time. How does the distribution of TD errors change as learning progresses?
Exploration
- Learning rate sensitivity: Experiment with different learning rates on the random walk. What’s the range of $\alpha$ where learning is stable? What happens when $\alpha$ is large? What about a very small $\alpha$ like 0.001?
- Different initializations: What happens if you initialize all values to 0 instead of 0.5? What about initializing to 1? How does this affect convergence speed?