The Offline Setting
Standard reinforcement learning assumes you can interact with your environment: try actions, observe outcomes, and learn from experience. But what if interaction is impossible, dangerous, or prohibitively expensive? This is the offline RL setting.
Why Offline RL?
Consider these scenarios:
Healthcare: You want to learn the best treatment protocol for cancer patients. You can’t experiment on patients—running an untested treatment just to see what happens is unethical. But you have 20 years of medical records: what treatments doctors prescribed, patient outcomes, lab results, everything.
Autonomous Driving: You want to learn to drive safely. Crashing cars to learn about accidents is expensive and dangerous. But you have millions of miles of logged driving data from human drivers.
Industrial Control: You want to optimize a chemical plant. Running random experiments could damage equipment worth millions. But you have logs of how operators have run the plant for years.
In each case: exploration is off the table, but you have data. Can you learn from it?
Offline RL: reinforcement learning from a fixed, previously collected dataset, with no ability to collect new data during training. The agent learns entirely from this logged experience.
Online vs Offline: A Fundamental Distinction

Online RL:
- Learn while interacting with the environment
- Can try any action to see what happens
- Exploration enables discovering new behaviors
- Policy and data distribution evolve together

Offline RL:
- Learn from a fixed, pre-collected dataset
- Cannot query the environment during training
- Limited to behaviors in the dataset
- Must extrapolate carefully beyond the data
Online RL learns from data collected by the current policy π_θ: transitions (s, a, r, s′) with s ~ d^{π_θ} and a ~ π_θ(·|s). The data distribution changes as the policy improves.

Offline RL learns from a fixed dataset collected by some behavior policy π_β:

D = {(s_i, a_i, r_i, s′_i)}_{i=1..N}, with a_i ~ π_β(·|s_i)

The data distribution is fixed: it does not change during training. The behavior policy could be human experts, previous RL policies, random actions, or any combination.

The objective is the same, maximizing the expected return J(π) = E_π[Σ_t γ^t r_t], but the learning constraints differ fundamentally: the offline learner can never try an action just to see what happens.
The Behavior Policy
The behavior policy, usually written π_β, is the policy that was used to collect the offline dataset. This might be a human expert, a previous RL agent, a hand-coded controller, or a mixture of different policies.
The behavior policy shapes what you can learn:
- If π_β is an expert, the data shows good behavior but may lack diversity
- If π_β is random, the data covers many states but with poor outcomes
- If π_β is a mediocre policy, you might learn to outperform it, or you might not
Ideally, the behavior policy covers the states and actions that matter for the task. If the optimal action in some state was never tried by the behavior policy, you have no data about what would happen—and that’s a problem.
Consider learning to play chess from a dataset of games:
Expert dataset: Games from grandmasters. High-quality moves, but limited coverage of suboptimal positions (experts don’t often get into bad positions).
Amateur dataset: Games from casual players. Many suboptimal moves, but broader coverage of what happens when things go wrong.
Mixed dataset: Both expert and amateur games. Better coverage, but confusing labels—the same position might show both good and bad moves.
Each has tradeoffs. In offline RL, dataset coverage often matters more than raw dataset size.
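To make "coverage matters more than size" concrete, here is a small illustrative sketch. The two synthetic datasets, the 100-state/4-action space, and the sample counts are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 100, 4  # made-up discrete space for illustration

# "Expert" data: many samples, but concentrated on a narrow slice
# (10 states, always the same action).
expert = [(int(rng.integers(0, 10)), 0) for _ in range(10_000)]

# "Random" data: far fewer samples, but spread over the whole space.
random_data = [(int(rng.integers(0, N_STATES)), int(rng.integers(0, N_ACTIONS)))
               for _ in range(2_000)]

def coverage(dataset):
    """Fraction of all (state, action) pairs seen at least once."""
    return len(set(dataset)) / (N_STATES * N_ACTIONS)

print(f"expert: {len(expert):>6} samples, coverage {coverage(expert):.1%}")
print(f"random: {len(random_data):>6} samples, coverage {coverage(random_data):.1%}")
```

The expert dataset is five times larger, yet it can cover at most 10 of the 400 state-action pairs; the smaller random dataset covers nearly all of them.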
Setting Up an Offline RL Problem
```python
import numpy as np


class OfflineDataset:
    """
    A fixed dataset for offline RL.

    Once created, this dataset cannot be modified during training.
    """

    def __init__(self, transitions):
        """
        Args:
            transitions: List of (state, action, reward, next_state, done) tuples
        """
        self.transitions = transitions
        self.n_samples = len(transitions)

        # Pre-compute statistics for normalization
        states = np.array([t[0] for t in transitions])
        self.state_mean = states.mean(axis=0)
        self.state_std = states.std(axis=0) + 1e-8

        # Track which state-action pairs are in the dataset
        self._index_transitions()

    def _index_transitions(self):
        """Index the dataset for analysis."""
        self.state_action_counts = {}
        for s, a, r, s_next, done in self.transitions:
            key = (tuple(np.round(s, 2)), a)
            self.state_action_counts[key] = self.state_action_counts.get(key, 0) + 1

    def sample(self, batch_size):
        """Sample a random batch from the dataset."""
        indices = np.random.randint(0, self.n_samples, size=batch_size)
        batch = [self.transitions[i] for i in indices]
        states = np.array([t[0] for t in batch])
        actions = np.array([t[1] for t in batch])
        rewards = np.array([t[2] for t in batch])
        next_states = np.array([t[3] for t in batch])
        dones = np.array([t[4] for t in batch])
        return states, actions, rewards, next_states, dones

    def normalize_state(self, state):
        """Normalize state using dataset statistics."""
        return (state - self.state_mean) / self.state_std

    def is_in_distribution(self, state, action):
        """
        Check if a state-action pair is close to something in the dataset.

        This is a simplistic check (exact match after rounding); real
        implementations use more sophisticated distance metrics or
        density models.
        """
        rounded_state = tuple(np.round(state, 2))
        return (rounded_state, action) in self.state_action_counts


def collect_offline_dataset(env, behavior_policy, n_episodes=100):
    """
    Collect a fixed dataset using a behavior policy.

    Args:
        env: The environment
        behavior_policy: Policy used for data collection (e.g., human, expert)
        n_episodes: Number of episodes to collect

    Returns:
        OfflineDataset: A fixed dataset for offline RL
    """
    transitions = []
    for episode in range(n_episodes):
        state = env.reset()
        done = False
        while not done:
            # Behavior policy selects action
            action = behavior_policy(state)
            # Execute in environment
            next_state, reward, done, _ = env.step(action)
            # Store transition
            transitions.append((
                state.copy(),
                action,
                reward,
                next_state.copy(),
                done,
            ))
            state = next_state
    print(f"Collected {len(transitions)} transitions from {n_episodes} episodes")
    return OfflineDataset(transitions)


# Example behavior policies
def random_policy(state):
    """Random behavior policy."""
    return np.random.randint(4)  # Assuming 4 actions


def epsilon_greedy_behavior(q_table, epsilon=0.3):
    """Epsilon-greedy behavior policy (for tabular case)."""
    def policy(state):
        if np.random.random() < epsilon:
            return np.random.randint(q_table.shape[1])
        return np.argmax(q_table[state])
    return policy
```

What Can We Learn from Offline Data?
Given a fixed dataset, what’s actually possible?
Best case: The behavior policy explored thoroughly. We have data about every action in every relevant state. We can accurately estimate Q-values and find a near-optimal policy.
Typical case: The behavior policy covered some states well, others poorly. We can improve on the behavior policy for covered states, but must be careful about unfamiliar territory.
Worst case: The behavior policy was narrow (only one way of doing things). We can imitate it, but improving is risky—any deviation takes us into unknown territory.
The key insight: offline RL can only work with what’s in the data. Asking for behavior the dataset doesn’t cover is asking to extrapolate—which neural networks do unreliably.
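A quick illustration of unreliable extrapolation, using a polynomial fit as a stand-in for a flexible function approximator. The target function, degree, and ranges are arbitrary choices for this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)

# Train a flexible model on a narrow slice of the input space: x in [0, 1].
x_train = rng.uniform(0.0, 1.0, size=50)
y_train = np.sin(2 * np.pi * x_train) + 0.1 * rng.normal(size=50)
coeffs = np.polyfit(x_train, y_train, deg=5)  # stand-in for a neural net

# In-distribution query: the fit is reasonable (true value is 0).
print(np.polyval(coeffs, 0.5))

# Out-of-distribution query: the same model is wildly off
# (the true value of sin(2*pi*5) is also 0).
print(np.polyval(coeffs, 5.0))
```

Nothing in training distinguishes a model that behaves sensibly outside the data from one that does not; the same worry applies to Q-values estimated for actions the behavior policy never took.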
The fundamental limitation of offline RL is captured by the coverage assumption. Ideally, we want the behavior policy to have coverage over the states and actions that matter:

d^{π_β}(s, a) > 0 wherever d^{π*}(s, a) > 0

If the optimal policy π* visits a state where the behavior policy never went, we have no data about that state, and our estimates there will be unreliable.
In practice, we often have partial coverage: the behavior policy covers some of the important states. The question becomes: how do we make the best use of what we have without making catastrophic mistakes in uncovered regions?
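One rough way to quantify partial coverage is to ask: in how many dataset states did the behavior policy actually try the action our candidate policy would take? A minimal sketch, where the `action_coverage` helper is hypothetical and assumes hashable discrete states:

```python
def action_coverage(dataset_sa, policy):
    """
    Fraction of dataset states in which the action the candidate policy
    would pick was actually tried by the behavior policy.

    dataset_sa: iterable of (state, action) pairs with hashable states
    policy: function mapping state -> action
    """
    seen = set(dataset_sa)
    states = {s for s, _ in dataset_sa}
    covered = sum((s, policy(s)) in seen for s in states)
    return covered / len(states)

# Toy behavior data: the behavior policy only ever took action 0.
data = [(s, 0) for s in range(10)]
print(action_coverage(data, lambda s: 0))  # 1.0: fully in-distribution
print(action_coverage(data, lambda s: 1))  # 0.0: every choice is off-data
```

A policy with low action coverage is relying on extrapolated value estimates almost everywhere, which is exactly the situation to avoid.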
When is Offline RL Appropriate?

Offline RL is a good fit when:
- Real interaction is expensive or dangerous
- You have a good existing dataset
- Safety is critical during deployment
- You need reproducible training
- The behavior policy covers relevant states

Prefer online (or hybrid) approaches when:
- You have a safe simulator
- Exploration is cheap and safe
- Dataset coverage is poor
- The task requires novel behaviors
- You can do online fine-tuning after offline pretraining
The best results often come from combining offline and online learning:
- Offline pretraining: Learn a reasonable initial policy from logged data
- Online fine-tuning: Improve with limited online interaction
This gets the safety benefits of offline learning with the coverage benefits of online learning.
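The two-phase recipe can be sketched as follows. `DummyAgent` and `DummyEnv` are placeholder stubs so the sketch runs end to end; a real implementation would substitute an actual agent (e.g., a Q-learning or actor-critic update) and environment, and `sample_batch` would be something like `dataset.sample`:

```python
import numpy as np

class DummyEnv:
    """Placeholder environment: 1-D state, episodes last 5 steps."""
    def reset(self):
        self.t = 0
        return np.zeros(1)

    def step(self, action):
        self.t += 1
        return np.zeros(1), 1.0, self.t >= 5, {}

class DummyAgent:
    """Placeholder agent; a real one would wrap a Q-function or policy."""
    def __init__(self):
        self.updates = 0
        self.buffer = []

    def update(self, batch):
        self.updates += 1   # real code: gradient step on the batch

    def act(self, state):
        return 0            # real code: greedy or exploratory action

    def store(self, transition):
        self.buffer.append(transition)

def offline_then_online(agent, sample_batch, env,
                        offline_steps=100, online_episodes=5):
    """Offline pretraining on logged data, then online fine-tuning."""
    # Phase 1: offline pretraining on the fixed dataset.
    for _ in range(offline_steps):
        agent.update(sample_batch())

    # Phase 2: limited online interaction to improve the pretrained policy.
    for _ in range(online_episodes):
        state, done = env.reset(), False
        while not done:
            action = agent.act(state)
            next_state, reward, done, _ = env.step(action)
            agent.store((state, action, reward, next_state, done))
            agent.update([agent.buffer[-1]])  # also learn from fresh data
            state = next_state
    return agent

agent = offline_then_online(DummyAgent(), lambda: [], DummyEnv())
print(agent.updates, len(agent.buffer))  # 125 updates (100 offline + 25 online), 25 transitions
```

The structure matters more than the stubs: a cheap offline phase does the bulk of the learning, and a small, budgeted online phase corrects the errors that only interaction can reveal.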
Summary
The offline RL setting fundamentally changes the learning problem:
- We learn from a fixed dataset collected by a behavior policy
- We cannot explore or collect new data during training
- The dataset coverage determines what we can learn
- We must extrapolate carefully—or not at all—beyond the data
This constraint makes offline RL both harder (can’t explore) and more practical (safe, uses existing data). In the next section, we’ll see why standard RL algorithms fail in this setting due to distribution shift.