TD vs Monte Carlo
We’ve now seen two fundamentally different approaches to learning value functions from experience:
- Monte Carlo (MC): Wait until the episode ends, then update using the actual return
- Temporal Difference (TD): Update after each step using an estimated return
Both converge to the same answer—the true value function $v_\pi$—but they take very different paths to get there. Understanding when to use which is crucial for practical RL.
The Core Difference
The key distinction is what we use as the “target” for learning:
Monte Carlo target: The actual return $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$
TD target: The estimated return $R_{t+1} + \gamma V(S_{t+1})$
MC uses real experience all the way to the end. TD uses one step of real experience, then substitutes its own estimate.
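To make the contrast concrete, here is a minimal sketch of the two update rules side by side (the state names and numbers are hypothetical, chosen only for illustration):

```python
# The only difference between MC and TD(0) is the target
# plugged into the same "move V[state] toward target" update.

def mc_update(V, state, G, alpha):
    """Monte Carlo: move V[state] toward the actual return G."""
    V[state] += alpha * (G - V[state])

def td_update(V, state, reward, next_state, gamma, alpha):
    """TD(0): move V[state] toward the estimated return r + gamma * V[s']."""
    target = reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

V = {"A": 0.0, "B": 0.5}
mc_update(V, "A", G=10.0, alpha=0.1)
# V[A] = 0 + 0.1 * (10 - 0) = 1.0

td_update(V, "A", reward=0.0, next_state="B", gamma=1.0, alpha=0.1)
# target = 0 + 1.0 * 0.5 = 0.5; V[A] = 1.0 + 0.1 * (0.5 - 1.0) = 0.95
```

Note that the TD update never waits for a return: everything it needs is available one step after leaving the state.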
Monte Carlo
- Waits for episode to end
- Uses actual return
- Unbiased estimates
- High variance (many random events)
- Requires episodic tasks
- Updates at episode end
TD Learning
- Updates every step
- Uses TD target
- Biased estimates (bootstrapping)
- Low variance (single transition)
- Works for continuing tasks
- Updates immediately
The Bias-Variance Tradeoff
This is perhaps the most important concept in comparing TD and MC.
A learning method is biased if its expected estimate differs from the true value. TD is biased because it bootstraps from $V(S_{t+1})$, which is itself an estimate. If our estimates are wrong, we’re learning from incorrect targets.
Variance measures how much an estimate fluctuates across different samples. MC has high variance because the return depends on many random transitions. TD has low variance because it only depends on a single transition.
Let’s be more precise. Consider estimating $v_\pi(s)$:
MC Estimator: $V(s) = \frac{1}{N} \sum_{i=1}^{N} G^{(i)}$
The MC estimator is the average of actual returns from state $s$. This is an unbiased estimator of $v_\pi(s)$ because $\mathbb{E}[G_t \mid S_t = s] = v_\pi(s)$.
However, each $G^{(i)}$ is a sum of many random variables (all future rewards), so the variance can be very large, especially for long episodes.
TD Estimator: Updates toward $R_{t+1} + \gamma V(S_{t+1})$
This is biased because we’re using $V(S_{t+1})$ instead of the true value $v_\pi(S_{t+1})$. But the variance is lower because we’re only sampling a single transition.
The key insight: bias is often fixable (estimates improve over time), but variance can severely slow learning. This is why TD often outperforms MC despite being biased.
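The variance gap is easy to check empirically. The sketch below makes some simplifying assumptions (a 20-step episode with random rewards of ±1, and a fixed value estimate at the next state, so the randomness of $S_{t+1}$ itself is ignored) and compares the spread of sampled MC returns against sampled TD targets:

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 1.0
n_steps = 20        # assumed episode length
n_samples = 10_000

# MC return: sum of 20 random rewards, each uniform over {-1, +1}
# (each has variance 1, so the sum has variance ~20)
rewards = rng.choice([-1.0, 1.0], size=(n_samples, n_steps))
mc_returns = rewards.sum(axis=1)

# TD target: one random reward plus a fixed estimate V[s']
# (here V[s'] is treated as deterministic for simplicity)
V_next = 0.0
td_targets = rewards[:, 0] + gamma * V_next

print(f"MC return variance: {mc_returns.var():.2f}")  # near 20
print(f"TD target variance: {td_targets.var():.2f}")  # near 1
```

The MC return's variance scales with episode length, while the TD target's variance stays bounded, which is exactly the effect described above.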
A Tale of Two Updates
Comparing Updates on the Same Experience
An agent starts in state A, transitions to B (reward 0), then to C (reward 0), then to terminal (reward +10). Assume $\gamma = 1$, $\alpha = 0.1$.
Current estimates: $V(A) = 0$, $V(B) = 0$, $V(C) = 0$.
Monte Carlo update for state A:
- Actual return: $G = 0 + \gamma \cdot 0 + \gamma^2 \cdot 10 = 10$
- Update: $V(A) \leftarrow V(A) + \alpha(G - V(A)) = 0 + 0.1(10 - 0) = 1.0$
TD(0) updates (step by step):
- At A→B: $V(A) \leftarrow V(A) + \alpha(0 + \gamma V(B) - V(A)) = 0$ (no change)
- At B→C: $V(B) \leftarrow V(B) + \alpha(0 + \gamma V(C) - V(B)) = 0$ (no change)
- At C→terminal: $V(C) \leftarrow V(C) + \alpha(10 - V(C)) = 0 + 0.1 \cdot 10 = 1.0$
After one episode, MC updated A directly toward the correct value. TD only updated C, and needs more episodes for that information to propagate back to A and B.
But what if the episode had been 1000 steps long with lots of randomness? MC would use all that noise in its update. TD would break it into manageable pieces.
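The worked example can be checked numerically. This sketch assumes $\gamma = 1$, $\alpha = 0.1$, and all values initialized to 0 (consistent with the episode described above):

```python
# Reproducing the A -> B -> C -> terminal episode
gamma, alpha = 1.0, 0.1
transitions = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 10.0, None)]

# Monte Carlo: one update per visited state, toward the full return
V_mc = {"A": 0.0, "B": 0.0, "C": 0.0}
G = 0.0
for state, reward, _ in reversed(transitions):
    G = reward + gamma * G            # return from this state onward
    V_mc[state] += alpha * (G - V_mc[state])

# TD(0): one update per transition, toward r + gamma * V[next]
V_td = {"A": 0.0, "B": 0.0, "C": 0.0}
for state, reward, nxt in transitions:
    target = reward + (gamma * V_td[nxt] if nxt is not None else 0.0)
    V_td[state] += alpha * (target - V_td[state])

print(V_mc)  # every state moved toward the +10 reward
print(V_td)  # only C moved; A and B are unchanged
```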
When MC Beats TD
Monte Carlo has advantages in certain situations:
1. Very noisy bootstrapping targets
If your value estimates are very wrong, TD’s bias can hurt. MC doesn’t depend on value estimates, so it eventually gets the right answer.
2. Simple problems with short episodes
If episodes are short and returns are not too noisy, MC’s unbiasedness shines without too much variance.
3. When you need guaranteed unbiasedness
Some applications (like certain policy gradient methods) theoretically require unbiased estimates.
4. Batch settings with complete episodes
If you have a fixed dataset of complete episodes and want to extract as much signal as possible, MC can be preferable.
When TD Beats MC
TD has advantages in more situations:
1. Online learning
TD can update after every step. MC must wait for the episode to end. For learning during interaction, TD is essential.
2. Continuing tasks
Tasks that never end (like keeping a robot balanced forever) can’t use MC at all. TD handles them naturally.
3. Long episodes with high variance
If episodes are long and returns are noisy, MC’s variance becomes crippling. TD’s step-by-step updates are more stable.
4. Fast credit assignment
Information about rewards propagates backward through TD updates. MC only assigns credit at the end of an episode.
5. Limited data
TD is often more sample-efficient because it extracts more learning from each transition.
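The credit-assignment point can be seen directly by running TD(0) repeatedly on the small A→B→C episode from earlier. This sketch assumes $\gamma = 1$ and a large $\alpha = 0.5$ (chosen only to make the propagation visible quickly):

```python
gamma, alpha = 1.0, 0.5
transitions = [("A", 0.0, "B"), ("B", 0.0, "C"), ("C", 10.0, None)]
V = {"A": 0.0, "B": 0.0, "C": 0.0}

for episode in range(1, 4):
    for state, reward, nxt in transitions:
        target = reward + (gamma * V[nxt] if nxt is not None else 0.0)
        V[state] += alpha * (target - V[state])
    print(episode, {s: round(v, 2) for s, v in V.items()})

# The terminal reward reaches C in episode 1, B in episode 2,
# and A in episode 3: credit propagates backward one state per episode.
```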
The Random Walk Experiment
The classic comparison is on the Random Walk problem we saw earlier.
TD vs MC on Random Walk
In this experiment (from Sutton & Barto), both methods try to estimate the true value function for the 5-state random walk.
Key findings:
- TD converges faster than MC for most learning rates
- MC eventually catches up but takes more episodes
- TD’s advantage grows when starting from poor initial estimates
This experiment demonstrates TD’s practical superiority in a controlled setting, even though MC is unbiased.
Here’s code to compare TD and MC on the Random Walk:
```python
import numpy as np
from collections import defaultdict


class RandomWalk:
    """5-state random walk environment."""

    def __init__(self):
        self.states = [1, 2, 3, 4, 5]  # Named by position
        self.state = 3  # Start in middle

    def reset(self):
        self.state = 3
        return self.state

    def step(self, action=None):
        """Move randomly left or right."""
        if np.random.random() < 0.5:
            self.state -= 1
        else:
            self.state += 1
        if self.state == 0:
            return 0, 0, True, {}  # Left terminal, reward 0
        elif self.state == 6:
            return 6, 1, True, {}  # Right terminal, reward 1
        else:
            return self.state, 0, False, {}


def td_0_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of TD(0)."""
    state = env.reset()
    done = False
    while not done:
        next_state, reward, done, _ = env.step()
        if done:
            td_target = reward
        else:
            td_target = reward + gamma * V[next_state]
        V[state] += alpha * (td_target - V[state])
        state = next_state


def mc_episode(env, V, alpha=0.1, gamma=1.0):
    """Run one episode of Monte Carlo."""
    state = env.reset()
    episode = [(state, None)]  # (state, reward) pairs
    done = False
    # Generate complete episode
    while not done:
        next_state, reward, done, _ = env.step()
        episode.append((next_state, reward))
    # Compute returns and update (backward)
    G = 0
    for t in range(len(episode) - 2, -1, -1):
        state, _ = episode[t]
        _, reward = episode[t + 1]
        G = reward + gamma * G
        V[state] += alpha * (G - V[state])


def run_comparison(num_runs=100, num_episodes=100):
    """Compare TD and MC over multiple runs."""
    true_V = {1: 1/6, 2: 2/6, 3: 3/6, 4: 4/6, 5: 5/6}
    td_errors = np.zeros((num_runs, num_episodes))
    mc_errors = np.zeros((num_runs, num_episodes))
    for run in range(num_runs):
        # Initialize value estimates
        V_td = defaultdict(lambda: 0.5)
        V_mc = defaultdict(lambda: 0.5)
        env_td = RandomWalk()
        env_mc = RandomWalk()
        for ep in range(num_episodes):
            td_0_episode(env_td, V_td, alpha=0.1)
            mc_episode(env_mc, V_mc, alpha=0.1)
            # Compute RMS error
            td_rms = np.sqrt(np.mean([(V_td[s] - true_V[s])**2 for s in true_V]))
            mc_rms = np.sqrt(np.mean([(V_mc[s] - true_V[s])**2 for s in true_V]))
            td_errors[run, ep] = td_rms
            mc_errors[run, ep] = mc_rms
    return td_errors.mean(axis=0), mc_errors.mean(axis=0)


# Run and plot
td_err, mc_err = run_comparison()

import matplotlib.pyplot as plt

plt.figure(figsize=(10, 5))
plt.plot(td_err, label='TD(0)', color='cyan')
plt.plot(mc_err, label='Monte Carlo', color='blue')
plt.xlabel('Episodes')
plt.ylabel('RMS Error (averaged over states)')
plt.title('TD vs MC on Random Walk')
plt.legend()
plt.grid(True, alpha=0.3)
plt.show()
```

Deeper Analysis
Why does TD have lower variance?
The MC return is:
$$G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$$
Each $R_{t+k}$ is random, so $G_t$ is a sum of many random variables. Its variance grows with episode length.
The TD target is:
$$R_{t+1} + \gamma V(S_{t+1})$$
This depends on only one random reward $R_{t+1}$ and one random state $S_{t+1}$. The variance is bounded, regardless of episode length.
Why is TD biased?
The TD target satisfies $\mathbb{E}[R_{t+1} + \gamma V(S_{t+1}) \mid S_t = s] = v_\pi(s)$ only if $V = v_\pi$.
When $V \neq v_\pi$ (which is always true during learning), the TD target is biased. The bias disappears as $V \to v_\pi$, which is why TD still converges correctly.
Common Misconceptions
“TD’s bias means it converges to the wrong answer.”
No. TD is asymptotically unbiased—the bias goes to zero as estimates improve. Both TD and MC converge to the same correct answer; they just take different paths.
“Low variance is always better.”
Not always. If your estimates are very wrong, TD will happily reinforce those errors. MC is immune to this because it uses actual returns. In practice, TD usually wins, but not universally.
“You have to choose between TD and MC.”
Not true! TD(λ) and n-step TD methods interpolate between them. You can get the benefits of both. Many practical algorithms use a mix.
The Spectrum Between TD and MC
TD(0) and MC are extremes of a spectrum:
- 1-step TD (= TD(0)): Use 1 step of real reward, then bootstrap
- 2-step TD: Use 2 steps of real reward, then bootstrap
- n-step TD: Use n steps of real reward, then bootstrap
- ∞-step TD = MC: Use all real rewards, no bootstrapping
The sweet spot often lies somewhere in between. TD(λ) provides a way to blend these via eligibility traces, effectively averaging over all n-step returns.
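As an illustration, an n-step return can be computed from a trajectory like this (a sketch with hypothetical reward and value arrays; note that n = 1 recovers the TD(0) target and n at least the episode length recovers the MC return):

```python
def n_step_return(rewards, values, n, gamma=1.0):
    """n-step return from time 0: n real rewards, then bootstrap.

    rewards: [R_1, ..., R_T] for the remainder of the episode
    values:  [V(S_1), ..., V(S_T)] estimates at the corresponding states
    """
    if n >= len(rewards):
        # Episode ends within n steps: this is just the MC return
        return sum(gamma**k * r for k, r in enumerate(rewards))
    G = sum(gamma**k * rewards[k] for k in range(n))
    return G + gamma**n * values[n - 1]   # bootstrap from V(S_n)

rewards = [0.0, 0.0, 10.0]  # the A -> B -> C -> terminal episode
values = [4.0, 7.0, 0.0]    # hypothetical current estimates V(B), V(C), V(term)

print(n_step_return(rewards, values, n=1))  # 0 + V(B)     = 4.0 (TD(0) target)
print(n_step_return(rewards, values, n=2))  # 0 + 0 + V(C) = 7.0
print(n_step_return(rewards, values, n=3))  # 0 + 0 + 10   = 10.0 (MC return)
```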
For now, the key insight is: TD(0) is not the only option, and neither is MC. Both are special cases of a broader family.
Practical Recommendations
Based on the tradeoffs, here are guidelines for choosing between TD and MC:
Use TD when:
- You need online learning (updating during episodes)
- Tasks are continuing (no natural episode end)
- Episodes are long and returns are noisy
- You want faster credit assignment
- Sample efficiency matters
Use MC when:
- Episodes are short and simple
- You have complete episodes in a batch
- You need guaranteed unbiased estimates
- Bootstrap estimates would be very inaccurate
- You’re using policy gradient methods that require returns
In practice, TD methods (and their extensions to control: SARSA, Q-learning) dominate. The ability to learn online, handle continuing tasks, and extract more from each sample makes TD the workhorse of modern RL.
Summary
| Aspect | TD | MC |
|---|---|---|
| Bias | Yes (from bootstrapping) | No |
| Variance | Low (single step) | High (full episode) |
| Online learning | Yes | No |
| Continuing tasks | Yes | No |
| Sample efficiency | Generally higher | Generally lower |
| Implementation | Slightly simpler | Need to store episodes |
| Convergence | Same fixed point | Same fixed point |
Both TD and MC are fundamental tools in the RL toolkit. TD’s practical advantages—especially for control problems with large state spaces—make it the foundation for most modern algorithms.
In the next chapter, we’ll extend TD learning from prediction (estimating $v_\pi$) to control (finding a good policy). That’s where SARSA comes in.