TD(0) Prediction
Now that we understand the TD idea, let’s turn it into a concrete algorithm. TD(0) is the simplest form of TD learning: it looks just one step ahead and bootstraps from the very next state. The “(0)” comes from the more general TD(λ) family of algorithms: TD(0) is the λ = 0 member, which bootstraps after a single step (unlike n-step TD methods, which accumulate n actual rewards before bootstrapping).
The TD(0) Algorithm
TD(0) is remarkably simple. For each transition the agent experiences:
- Observe the current state s
- Take an action according to the policy π, receive reward r, arrive in state s'
- Update V(s) by nudging it toward r + γV(s')
- Repeat
That’s it. We never wait for an episode to end. We never need to know how the environment works. We just learn from each transition as it happens.
TD(0) is the simplest TD prediction method: it updates the value of the current state based on the immediate reward and the estimated value of the next state, using only a single step of actual experience before bootstrapping.
The TD(0) update rule:

V(s) ← V(s) + α[r + γV(s') − V(s)]
Let’s break down each component:
- V(s): Our current estimate of the state’s value
- α: Learning rate (step size), typically between 0.01 and 0.5
- r: The immediate reward we just received
- γ: Discount factor, typically 0.9 to 0.99
- V(s'): Our estimate of the next state’s value
- r + γV(s'): The TD target — our revised estimate of what V(s) should be
- δ = r + γV(s') − V(s): The TD error — how much we should adjust V(s)
The update moves V(s) a fraction α of the way toward the TD target.
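To make the arithmetic concrete, here is a minimal sketch of a single update. The states, value estimates, reward, and hyperparameters are made-up numbers for illustration only:

```python
# A single TD(0) update with illustrative numbers.
alpha = 0.1    # learning rate
gamma = 0.99   # discount factor

V = {"s": 0.4, "s_next": 0.6}  # hypothetical value estimates

reward = 1.0                              # immediate reward r
td_target = reward + gamma * V["s_next"]  # r + γV(s')
td_error = td_target - V["s"]             # δ = target - V(s)
V["s"] += alpha * td_error                # move a fraction α toward the target

print(round(td_target, 3))   # 1.594
print(round(V["s"], 4))      # 0.5194
```

Note that V("s") moves only a tenth of the way toward the target: a single transition is noisy evidence, so we average over many of them.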
Pseudocode for TD(0) Prediction
Here’s the complete algorithm for estimating V^π for a given policy π:
```
Algorithm: TD(0) for Estimating V^π
─────────────────────────────────────
Input: policy π to evaluate
Parameters: step size α ∈ (0, 1], discount γ

Initialize V(s) arbitrarily for all s ∈ S
Initialize V(terminal) = 0

Loop for each episode:
    Initialize S (first state of episode)
    Loop for each step of episode:
        A ← action given by π for S
        Take action A, observe R, S'
        V(S) ← V(S) + α[R + γV(S') - V(S)]
        S ← S'
    until S is terminal
```
Unlike Monte Carlo, we update after every single step. We don’t need to store the episode or wait for it to end.
A Worked Example: Random Walk
The classic example for TD(0) is the Random Walk problem from Sutton and Barto.
The Random Walk
Consider a chain of 5 states: A, B, C, D, E, with terminal states at both ends.
[Left Terminal] ← A ↔ B ↔ C ↔ D ↔ E → [Right Terminal]

- Starting from any state, the agent moves left or right with equal probability (50/50)
- Reaching the left terminal gives reward 0
- Reaching the right terminal gives reward +1
- All other transitions give reward 0
Question: What is the true value of state C under the random policy?
By symmetry and calculation, the true values are:
- V(A) = 1/6, V(B) = 2/6, V(C) = 3/6, V(D) = 4/6, V(E) = 5/6
State C is in the middle, and on average you’ll reach the right terminal (reward +1) half the time, giving V(C) = 0.5.
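The “calculation” part can be checked mechanically. The sketch below (assembling the linear system by hand; the variable names are my own) solves the Bellman equations V(s) = 0.5·V(left) + 0.5·V(right) for the random policy, with the terminals contributing 0 and +1:

```python
import numpy as np

# Bellman equations for states A..E under the 50/50 random policy, γ = 1:
#   V(s) - 0.5*V(left neighbor) - 0.5*V(right neighbor) = expected reward
A = np.zeros((5, 5))
b = np.zeros(5)
for i in range(5):
    A[i, i] = 1.0
    if i > 0:
        A[i, i - 1] = -0.5   # move to the left neighbor half the time
    # (i == 0: left terminal contributes value 0, nothing to add)
    if i < 4:
        A[i, i + 1] = -0.5   # move to the right neighbor half the time
    else:
        b[i] = 0.5           # right terminal yields reward +1 with prob 0.5

true_V = np.linalg.solve(A, b)
print(true_V)  # approximately [1/6, 2/6, 3/6, 4/6, 5/6]
```

The solution is a straight line from 0 to 1 across the chain, which matches the symmetry argument for V(C) = 0.5.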
Let’s trace through some TD(0) updates:
TD(0) Updates on Random Walk
Suppose our initial estimates are all V(s) = 0.5, with α = 0.1 and γ = 1.
Episode 1: Start in C, move right to D, move right to E, move right to Terminal (reward +1).
At C: Observed transition C → D with r = 0
- TD target: 0 + 1 · V(D) = 0.5
- TD error: 0.5 − 0.5 = 0
- Update: V(C) ← 0.5 + 0.1 · 0 = 0.5 (no change)

At D: Observed transition D → E with r = 0
- TD target: 0 + 1 · V(E) = 0.5
- TD error: 0.5 − 0.5 = 0
- Update: V(D) ← 0.5 + 0.1 · 0 = 0.5 (no change)

At E: Observed transition E → Terminal with r = +1
- TD target: 1 + 1 · 0 = 1 (terminal has value 0)
- TD error: 1 − 0.5 = 0.5
- Update: V(E) ← 0.5 + 0.1 · 0.5 = 0.55
Now V(E) = 0.55. In the next episode, this higher value will propagate backward to D, then to C, and so on. This is how TD learning spreads information.
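The trace above can be reproduced in a few lines; the transition list is hard-coded to match Episode 1 (C → D → E → right terminal):

```python
# Replay Episode 1 of the random walk with TD(0) updates.
alpha, gamma = 0.1, 1.0
V = {s: 0.5 for s in "ABCDE"}  # all estimates start at 0.5

# (state, reward, next_state, done) for each step of the episode
transitions = [("C", 0.0, "D", False),
               ("D", 0.0, "E", False),
               ("E", 1.0, None, True)]  # terminal: bootstrap value is 0

for state, reward, next_state, done in transitions:
    target = reward if done else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

print(V["C"], V["D"], round(V["E"], 2))  # 0.5 0.5 0.55
```

Only V(E) changes on this episode, exactly as in the hand-worked trace: the zero TD errors at C and D reflect that their targets agreed with their current estimates.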
The Learning Rate α
The learning rate α controls how much we update on each step:
- α too small: Learning is very slow. We need many samples to move the estimate significantly.
- α too large: Learning is fast but unstable. Estimates oscillate wildly.
- α just right: We converge to good estimates in reasonable time.
A common heuristic: start with α = 0.1 or 0.01. If learning is unstable, reduce it. If learning is too slow, increase it. For convergence guarantees, α should decay over time (but a constant α often works well in practice).
For TD(0) to converge to the true value function V^π, we need:
- All states are visited infinitely often
- The step sizes satisfy: Σₜ αₜ = ∞ and Σₜ αₜ² < ∞
These are the standard stochastic approximation conditions. A common choice is αₜ = 1/t, or α = 1/n(s) where n(s) is the number of visits to state s.
In practice, a constant α (like 0.1) often works well, especially in non-stationary environments where older experience may not be relevant.
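As a sketch of the decaying option, here is a TD(0) update that uses a per-state step size αₜ = 1/n(s); the function and variable names are my own, not from a library:

```python
from collections import defaultdict

# TD(0) with a per-state decaying step size alpha = 1/n(s).
# This satisfies the stochastic approximation conditions above and makes
# each V(s) a running average of its (gradually shifting) TD targets.
visits = defaultdict(int)
V = defaultdict(float)
gamma = 1.0

def td0_update_decaying(state, reward, next_state, done):
    visits[state] += 1
    alpha = 1.0 / visits[state]  # decays as the visit count n(s) grows
    target = reward if done else reward + gamma * V[next_state]
    V[state] += alpha * (target - V[state])

# The first visit to a state uses alpha = 1, so V(s) jumps straight
# to the first observed target:
td0_update_decaying("E", 1.0, None, True)
print(V["E"])  # 1.0
```

In a non-stationary environment this decay is usually undesirable: as αₜ shrinks, the estimate stops responding to new evidence, which is exactly why a constant α is the common practical choice.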
TD Error as a Prediction Error Signal
The TD error has a beautiful interpretation:

δ = r + γV(s') − V(s)
It measures the prediction error at each step. We predicted the current state was worth V(s). After seeing what actually happened (reward r and landing in s'), we now think it should be worth r + γV(s').
The TD error is the gap between these two assessments. If our value function is perfect, the TD error averages to zero. If not, the TD error tells us exactly how to fix it.
Here’s why the TD error is so important:
- It guides learning: Positive TD error → increase the value estimate. Negative → decrease it.
- It indicates surprise: Large TD error means something unexpected happened.
- It measures progress: As learning proceeds, TD errors should get smaller on average.
- It connects to neuroscience: The TD error resembles dopamine signals in the brain (more on this in future chapters).
Here’s a complete TD(0) implementation:
```python
import numpy as np
from collections import defaultdict

class TD0Predictor:
    """TD(0) for estimating V^π."""

    def __init__(self, alpha=0.1, gamma=0.99):
        self.alpha = alpha
        self.gamma = gamma
        self.V = defaultdict(float)  # Value estimates, default to 0

    def update(self, state, reward, next_state, done):
        """
        Perform a single TD(0) update.

        Args:
            state: Current state
            reward: Reward received
            next_state: Next state
            done: Whether next_state is terminal

        Returns:
            td_error: The TD error for this transition
        """
        # Compute TD target
        if done:
            td_target = reward  # Terminal state has no future value
        else:
            td_target = reward + self.gamma * self.V[next_state]

        # Compute TD error
        td_error = td_target - self.V[state]

        # Update value estimate
        self.V[state] += self.alpha * td_error
        return td_error

    def evaluate_policy(self, env, policy, num_episodes=1000):
        """
        Evaluate a policy using TD(0).

        Args:
            env: Environment with reset() and step() methods
            policy: Function mapping state to action
            num_episodes: Number of episodes to run

        Returns:
            V: Estimated value function
            td_errors: List of TD errors (for monitoring)
        """
        td_errors = []
        for episode in range(num_episodes):
            state = env.reset()
            done = False
            episode_errors = []
            while not done:
                action = policy(state)
                next_state, reward, done, _ = env.step(action)
                td_error = self.update(state, reward, next_state, done)
                episode_errors.append(td_error)
                state = next_state
            td_errors.append(np.mean(np.abs(episode_errors)))
        return dict(self.V), td_errors
```

Let’s use this to evaluate a random policy on a simple environment:
```python
# Example: Random Walk environment
class RandomWalk:
    """5-state random walk from Sutton & Barto."""

    def __init__(self):
        self.states = ['A', 'B', 'C', 'D', 'E']
        self.state = 'C'  # Start in the middle

    def reset(self):
        self.state = 'C'
        return self.state

    def step(self, action):
        # action is ignored - movement is random
        idx = self.states.index(self.state)
        if np.random.random() < 0.5:
            idx -= 1  # Move left
        else:
            idx += 1  # Move right
        if idx < 0:
            return 'terminal_left', 0, True, {}
        elif idx > 4:
            return 'terminal_right', 1, True, {}
        else:
            self.state = self.states[idx]
            return self.state, 0, False, {}

# Evaluate
env = RandomWalk()
predictor = TD0Predictor(alpha=0.1, gamma=1.0)
random_policy = lambda s: None  # Action doesn't matter
V, errors = predictor.evaluate_policy(env, random_policy, num_episodes=1000)

# True values for comparison
true_V = {'A': 1/6, 'B': 2/6, 'C': 3/6, 'D': 4/6, 'E': 5/6}
print("Estimated values vs True values:")
for state in ['A', 'B', 'C', 'D', 'E']:
    print(f"  {state}: {V.get(state, 0):.3f} (true: {true_V[state]:.3f})")
```

Handling Terminal States
A common bug in TD implementations is handling terminal states incorrectly. When the episode ends, there is no next state to bootstrap from. The TD target should just be the reward, not r + γV(s').
Always check:

```python
if done:
    td_target = reward  # no next state to bootstrap from
else:
    td_target = reward + gamma * V[next_state]
```
Why “TD(0)”?
The “0” in TD(0) comes from the more general TD(λ) family: TD(0) is the special case λ = 0. Equivalently, TD(0) is one-step TD — it uses exactly one actual reward before bootstrapping.
- TD(0) (one-step TD): Bootstrap after one reward, using the target R + γV(S')
- n-step TD: Use n actual rewards, then bootstrap: R_1 + γR_2 + ... + γ^(n-1) R_n + γ^n V(S_n)
- Monte Carlo: Use all rewards until the episode ends, with no bootstrapping
TD(0) is the extreme of “bootstrap as soon as possible.” At the other extreme, TD(λ = 1) is equivalent to Monte Carlo (wait for all rewards, no bootstrapping).
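An n-step bootstrapped target can be sketched as follows; the helper name and signature are illustrative, not from the text:

```python
# Sketch of the n-step TD target: sum n actual (discounted) rewards,
# then bootstrap from V(S_{t+n}). With n = 1 this reduces to the
# one-step TD(0) target; once n exceeds the remaining episode length,
# it becomes the full Monte Carlo return.
def n_step_target(rewards, bootstrap_value, gamma, n):
    """rewards: R_{t+1}, R_{t+2}, ... observed along the trajectory.
    bootstrap_value: V(S_{t+n}), ignored if the episode ended within n steps."""
    target = 0.0
    for k, r in enumerate(rewards[:n]):
        target += (gamma ** k) * r          # discounted actual rewards
    if n <= len(rewards):                   # episode still running: bootstrap
        target += (gamma ** n) * bootstrap_value
    return target

# n = 1 recovers the one-step TD(0) target r + γV(s'):
print(n_step_target([0.0], bootstrap_value=0.5, gamma=1.0, n=1))  # 0.5
```

Passing a large n with a short reward list gives the Monte Carlo case: all the actual rewards are summed and no bootstrap term is added.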
The TD(λ) algorithm interpolates between these extremes using eligibility traces—but that’s a topic for another day. For now, TD(0) gives us the core intuition.
Convergence Properties
TD(0) is guaranteed to converge to the true value function under these conditions:
- Tabular representation: Each state has its own independent estimate
- Sufficient exploration: Every state is visited infinitely often (in the limit)
- Appropriate step sizes: Learning rate decays appropriately (or is small enough)
- Fixed policy: We’re evaluating a single, unchanging policy
The convergence is to the same fixed point as Monte Carlo—the true value function . But the path to get there is different:
- MC converges by averaging full returns (unbiased but noisy)
- TD converges by averaging single-step estimates (biased but less noisy)
In practice, TD often converges faster because each update uses less noisy information.
Monitoring Learning Progress
You can track learning progress by monitoring the TD error over time:
```python
import matplotlib.pyplot as plt

def plot_td_learning(td_errors, window=50):
    """Plot TD error over episodes."""
    # Smooth with moving average
    smoothed = np.convolve(td_errors, np.ones(window)/window, mode='valid')
    plt.figure(figsize=(10, 4))
    plt.plot(smoothed)
    plt.xlabel('Episode')
    plt.ylabel('Mean Absolute TD Error')
    plt.title('TD(0) Learning Progress')
    plt.grid(True, alpha=0.3)
    plt.show()
```

A decreasing TD error curve indicates that learning is progressing. If it plateaus high, you may need more exploration or a different learning rate.
Summary
TD(0) is the foundation of temporal difference learning:
- Simple update: V(s) ← V(s) + α[r + γV(s') − V(s)]
- Online learning: Update after every transition, no need to wait
- TD error: Measures prediction quality, guides learning
- Convergence: Guaranteed to find V^π under standard conditions
TD(0) is for prediction—evaluating a fixed policy. But we usually want control—finding the best policy. That’s where SARSA and Q-learning come in.
In the next section, we’ll compare TD to Monte Carlo methods more carefully, examining their respective strengths and weaknesses.