Planning with Learned Models
Once we have a model of how the world works, we can use it to plan. Planning means using the model to look ahead, simulating possible futures to find good actions without actually taking them.
What is Planning?
In model-based RL, planning is the process of using an internal model to simulate trajectories (sequences of states, actions, and rewards) and use these simulations to improve the policy or value estimates—all without interacting with the real environment.
Imagine you’re playing chess. Before making a move, you might think: “If I move my bishop here, my opponent will probably respond by moving their knight there. Then I could take their pawn…” This mental simulation is planning.
The key insight: every move you consider in your head is essentially a “free” sample. You’re learning from simulated experience, not real experience. A chess grandmaster might consider thousands of positions mentally before making a single actual move.
Model-based RL gives agents this same capability: simulate many possible futures in the model, then choose the best action in reality.
Types of Planning
There are several ways to use a model for planning:
- Background planning: generate simulated experience to improve value estimates during “off time” between real interactions.
- Decision-time planning: plan specifically for the current state to choose the best action right now.
Background Planning: Simulated Experience
The simplest form of planning is background planning: use the model to generate simulated experiences and update value estimates, just as if they were real experiences.
Think of it this way: after each real experience, you “daydream” about similar experiences. You sample a state-action pair you’ve seen before, ask the model what would happen, and learn from that imagined outcome.
This amplifies each real experience: one real step might generate 10 or 50 simulated steps, dramatically accelerating learning.
In background planning, we repeatedly:
- Sample a previously visited state-action pair (s, a)
- Query the model: (s', r) = model(s, a)
- Update Q-values as if (s, a, r, s') were a real experience: Q(s, a) ← Q(s, a) + α [r + γ max_a' Q(s', a') − Q(s, a)]
The update is identical to Q-learning; the only difference is that (s', r) came from the model, not the real environment.
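The planning code in this section assumes a model object that can sample previously seen state-action pairs and predict their outcomes. A minimal deterministic version might look like the sketch below; the method names `sample_experienced` and `predict` follow the usage in this section, but the class itself is an illustrative assumption:

```python
import numpy as np

class TabularModel:
    """Minimal deterministic tabular model (illustrative sketch)."""

    def __init__(self):
        self.transitions = {}  # (s, a) -> (s_next, reward)

    def update(self, s, a, r, s_next):
        # Record the most recent observed outcome for (s, a)
        self.transitions[(s, a)] = (s_next, r)

    def sample_experienced(self):
        # Uniformly sample a previously seen state-action pair
        if not self.transitions:
            return None
        keys = list(self.transitions)
        return keys[np.random.randint(len(keys))]

    def predict(self, s, a):
        # Return the stored outcome, or (None, 0.0) if (s, a) was never seen
        return self.transitions.get((s, a), (None, 0.0))
```

A deterministic model like this is the simplest case; a stochastic environment would instead require storing counts and sampling next states in proportion to their observed frequencies.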
import numpy as np
def background_planning_step(Q, model, alpha=0.1, gamma=0.99):
"""
Perform one step of background planning.
Args:
Q: Q-table as numpy array (n_states, n_actions)
model: TabularModel that can sample and predict
alpha: Learning rate
gamma: Discount factor
Returns:
TD error from the simulated update
"""
# Sample a previously experienced state-action pair
result = model.sample_experienced()
if result is None:
return 0.0 # No experience to plan with yet
s, a = result
# Ask the model what would happen
s_prime, r = model.predict(s, a)
if s_prime is None:
return 0.0
# Q-learning update on simulated experience
td_target = r + gamma * np.max(Q[s_prime])
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error
return td_error
def planning_loop(Q, model, n_planning_steps, alpha=0.1, gamma=0.99):
"""
Perform multiple planning steps.
This is typically called after each real environment step.
"""
total_error = 0.0
for _ in range(n_planning_steps):
error = background_planning_step(Q, model, alpha, gamma)
total_error += abs(error)
    return total_error / max(n_planning_steps, 1)

Decision-Time Planning: Search
For more focused planning, we can do decision-time planning: plan specifically for the current state to decide what action to take right now.
Background planning improves your general knowledge. Decision-time planning is like focusing intensely on the current problem: “I’m in this specific state. What should I do?”
The classic example is Monte Carlo Tree Search (MCTS), used by AlphaGo and AlphaZero. Before each move, the agent simulates thousands of possible game trajectories from the current position and chooses the most promising action.
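A full MCTS implementation is beyond this section, but its core idea (UCB-based action selection inside simulated trajectories, with statistics backed up afterward) can be sketched compactly. This is a simplified UCT variant: it backs up the whole trajectory return to every visited pair rather than per-node returns, and it assumes the same deterministic `model.predict(s, a) -> (s_next, r)` interface used elsewhere in this section:

```python
import math

def mcts_action(state, model, n_actions, n_simulations=200, horizon=10,
                gamma=0.99, c_ucb=1.4):
    """Simplified UCT sketch: statistics are keyed by (state, action)."""
    N = {}  # visit counts per (state, action)
    W = {}  # accumulated returns per (state, action)

    def ucb_select(s):
        total = sum(N.get((s, a), 0) for a in range(n_actions)) + 1
        best_a, best_score = 0, float('-inf')
        for a in range(n_actions):
            n = N.get((s, a), 0)
            if n == 0:
                return a  # try each action at least once
            score = W[(s, a)] / n + c_ucb * math.sqrt(math.log(total) / n)
            if score > best_score:
                best_a, best_score = a, score
        return best_a

    for _ in range(n_simulations):
        s, path = state, []
        total_return, discount = 0.0, 1.0
        for _ in range(horizon):
            a = ucb_select(s)
            s_next, r = model.predict(s, a)
            if s_next is None:
                break
            path.append((s, a))
            total_return += discount * r
            discount *= gamma
            s = s_next
        # Back up the trajectory return to every visited pair
        for s_v, a_v in path:
            N[(s_v, a_v)] = N.get((s_v, a_v), 0) + 1
            W[(s_v, a_v)] = W.get((s_v, a_v), 0.0) + total_return

    # Act with the most-visited action at the root
    return max(range(n_actions), key=lambda a: N.get((state, a), 0))
```

The UCB bonus `c_ucb * sqrt(log(total) / n)` is what balances trying promising actions against re-checking under-explored ones; production systems like AlphaZero replace the random statistics here with learned value and policy networks.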
Rollout Planning
The simplest decision-time planning is rollout: simulate full trajectories from the current state and estimate the value of each action by the average return.
To decide on an action in state s:
- For each action a:
- Simulate n trajectories, each starting with action a in state s
- For each trajectory, use the model and a rollout policy to generate states and rewards
- Compute the return for each trajectory
- Choose the action with the highest average return
def rollout_planning(state, model, n_actions, n_rollouts=50, horizon=10, gamma=0.99):
"""
Decision-time planning via rollouts.
Args:
state: Current state
model: Environment model with predict(s, a) method
n_actions: Number of possible actions
n_rollouts: Number of rollouts per action
horizon: How many steps to simulate
gamma: Discount factor
Returns:
Best action according to rollout estimates
"""
action_values = np.zeros(n_actions)
for a in range(n_actions):
returns = []
for _ in range(n_rollouts):
# Simulate trajectory starting with action a
s = state
total_return = 0.0
discount = 1.0
for t in range(horizon):
if t == 0:
action = a # First action is the one we're evaluating
else:
action = np.random.randint(n_actions) # Random rollout policy
s_next, r = model.predict(s, action)
if s_next is None:
break
total_return += discount * r
discount *= gamma
s = s_next
returns.append(total_return)
action_values[a] = np.mean(returns)
    return np.argmax(action_values)

Model Predictive Control (MPC)
A powerful planning approach is Model Predictive Control (MPC): plan an action sequence, execute only the first action, then re-plan.
MPC is like constantly updating your GPS while driving. You plan a route, start driving, but then re-plan based on where you actually end up. If the model was slightly wrong, you correct course immediately rather than blindly following a potentially outdated plan.
This makes MPC robust to model errors: you never commit to a long action sequence based on an imperfect model.
At each time step, in state s_t:
- Optimize over action sequences (a_t, a_{t+1}, ..., a_{t+H-1}) of length H
- Simulate each sequence in the model to estimate total reward
- Execute only the first action a_t
- Observe the new state and repeat
The optimization can be done via random shooting, cross-entropy method, or gradient-based methods.
def mpc_action(state, model, n_actions, horizon=10, n_sequences=100, gamma=0.99):
"""
Model Predictive Control action selection.
Uses random shooting to find good action sequences.
"""
best_return = float('-inf')
best_first_action = 0
for _ in range(n_sequences):
# Sample random action sequence
actions = np.random.randint(n_actions, size=horizon)
# Simulate sequence in model
s = state
total_return = 0.0
discount = 1.0
for a in actions:
s_next, r = model.predict(s, a)
if s_next is None:
break
total_return += discount * r
discount *= gamma
s = s_next
# Keep track of best sequence
if total_return > best_return:
best_return = total_return
best_first_action = actions[0]
return best_first_action
def cross_entropy_mpc(state, model, n_actions, horizon=10, n_samples=100,
n_elite=10, n_iterations=5, gamma=0.99):
"""
MPC with Cross-Entropy Method for optimization.
More sample-efficient than pure random shooting.
"""
# Initialize action distribution (uniform)
action_probs = np.ones((horizon, n_actions)) / n_actions
for _ in range(n_iterations):
# Sample action sequences from current distribution
sequences = []
returns = []
for _ in range(n_samples):
actions = np.array([
np.random.choice(n_actions, p=action_probs[t])
for t in range(horizon)
])
# Evaluate sequence
s = state
total_return = 0.0
discount = 1.0
for a in actions:
s_next, r = model.predict(s, a)
if s_next is None:
break
total_return += discount * r
discount *= gamma
s = s_next
sequences.append(actions)
returns.append(total_return)
# Select elite sequences
elite_idx = np.argsort(returns)[-n_elite:]
elite_sequences = [sequences[i] for i in elite_idx]
# Update action distribution toward elite sequences
action_probs = np.zeros((horizon, n_actions))
for seq in elite_sequences:
for t, a in enumerate(seq):
action_probs[t, a] += 1
action_probs /= n_elite
# Add small uniform noise for exploration
action_probs = 0.9 * action_probs + 0.1 / n_actions
# Return the most likely first action
    return np.argmax(action_probs[0])

Planning Horizon and Model Accuracy
Longer planning horizons let you consider more future consequences, but model errors compound over time. There’s a sweet spot: plan far enough to make good decisions, but not so far that accumulated model errors dominate.
If your model has 95% accuracy per step, after 10 steps you might have only 0.95^10 ≈ 60% accuracy. After 50 steps, you’re at 0.95^50 ≈ 8%. Your “plan” is mostly fantasy at that point.
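The compounding is easy to check numerically (assuming independent per-step errors, which is itself a simplification):

```python
per_step_accuracy = 0.95  # hypothetical one-step model accuracy

for k in (1, 10, 50):
    # Probability a k-step rollout has stayed accurate the whole way
    rollout_accuracy = per_step_accuracy ** k
    print(f"horizon {k:2d}: ~{rollout_accuracy:.0%} of rollouts still on track")
```

Running this prints roughly 95%, 60%, and 8% for horizons 1, 10, and 50, matching the figures above.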
The right horizon depends on:
- Model accuracy: Better models allow longer horizons
- Task horizon: Some tasks require long-term planning
- Discount factor: High γ makes distant rewards matter more
If the model has expected error ε per step, and errors accumulate roughly linearly (a reasonable approximation), the total expected error after k steps is roughly kε.
For meaningful planning, we want this accumulated error to be small relative to the actual state differences that matter for decision-making.
A practical heuristic: if model accuracy drops noticeably after k steps (measured by held-out prediction error), keep the planning horizon at or below k.
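One way to apply this heuristic is to roll the model alongside the real environment from held-out start states and find the step at which predictions begin to diverge. Everything here (the `env_step` callable, the exact-state-match error measure, and the 0.5 threshold) is an illustrative assumption for a deterministic tabular setting:

```python
import numpy as np

def choose_planning_horizon(model, env_step, start_state, policy,
                            max_horizon=20, n_trials=30, error_threshold=0.5):
    """Return the number of steps the model stays reliable.

    env_step(s, a) -> (s_next, r) queries the real environment;
    model.predict(s, a) -> (s_next, r) is the learned model.
    """
    errors = np.zeros(max_horizon)
    for _ in range(n_trials):
        s_real = s_model = start_state
        for t in range(max_horizon):
            a = policy(s_real)
            s_real, _ = env_step(s_real, a)
            s_pred, _ = model.predict(s_model, a)
            if s_pred is None:
                errors[t:] += 1.0  # model has no prediction: count as wrong
                break
            s_model = s_pred  # model rolls forward on its own predictions
            errors[t] += float(s_model != s_real)
    errors /= n_trials
    # The first step where the error rate crosses the threshold bounds k
    for t in range(max_horizon):
        if errors[t] > error_threshold:
            return t
    return max_horizon
```

The key detail is that the model rolls forward on its own predictions rather than being reset to the real state each step, so the measurement captures exactly the compounding error that matters for planning.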
Prioritized Planning
Not all planning is equally valuable. Prioritized planning focuses computational effort on the most useful simulated experiences.
Imagine you just discovered a new shortcut in a maze. Which experiences are most valuable to simulate?
- Experiences that might be affected by this new knowledge (states near the shortcut)
- Experiences where the current values might be wrong (high TD error)
- States you’re likely to visit soon (relevant to current policy)
Random planning wastes time on experiences that don’t change anything. Prioritized planning focuses where it matters.
import heapq
class PrioritizedPlanning:
"""
Prioritized planning based on TD error magnitude.
States with higher TD errors are more valuable to plan from.
"""
def __init__(self, n_states, n_actions, theta=0.01):
self.n_states = n_states
self.n_actions = n_actions
self.theta = theta # Priority threshold
self.priority_queue = [] # Min-heap (negated priorities for max behavior)
self.priorities = {} # (s, a) -> priority
def update_priority(self, s, a, td_error):
"""Update priority for a state-action pair."""
priority = abs(td_error)
if priority > self.theta:
self.priorities[(s, a)] = priority
# Python heapq is a min-heap, so negate for max-priority behavior
heapq.heappush(self.priority_queue, (-priority, s, a))
    def get_highest_priority(self):
        """Get the state-action pair with highest priority."""
        while self.priority_queue:
            neg_priority, s, a = heapq.heappop(self.priority_queue)
            # Lazy deletion: accept the entry only if it matches the pair's
            # current priority; stale heap entries are silently skipped.
            if self.priorities.get((s, a)) == -neg_priority:
                del self.priorities[(s, a)]
                return s, a
        return None
def prioritized_planning_step(Q, model, priority_queue, alpha=0.1, gamma=0.99):
"""
One step of prioritized planning.
"""
result = priority_queue.get_highest_priority()
if result is None:
return 0.0
s, a = result
# Simulate from this state-action pair
s_prime, r = model.predict(s, a)
if s_prime is None:
return 0.0
# Q-learning update
td_target = r + gamma * np.max(Q[s_prime])
td_error = td_target - Q[s, a]
Q[s, a] += alpha * td_error
# Update priorities for predecessor states
# (In practice, you'd track which states lead to s)
    return abs(td_error)

Summary
Planning is how model-based RL extracts maximum value from learned models. By simulating experiences before taking real actions, agents can:
- Learn more efficiently (background planning amplifies real experience)
- Make better immediate decisions (decision-time planning and MPC)
- Avoid costly mistakes (think before acting)
The challenge is balancing planning depth against model accuracy. In the next section, we’ll see how the Dyna architecture elegantly combines planning with learning from real experience.