Offline Reinforcement Learning
What You'll Learn
- Explain why offline RL is important for real-world applications
- Describe the distribution shift problem
- Implement Conservative Q-Learning (CQL)
- Explain behavior cloning and its limitations
- Understand the tradeoff between conservatism and optimality
- Identify when offline RL is appropriate vs online RL
What if you can’t explore? In healthcare, you can’t experiment on patients to learn a treatment policy. In autonomous driving, you can’t crash cars to learn safe behavior. In industrial control, you can’t risk damaging expensive equipment.
But you have years of logged data from doctors, human drivers, and plant operators. Can you learn good policies from this fixed dataset, without any new interaction?
Offline RL is like learning to drive from dashcam footage. You watch thousands of hours of driving videos: what human drivers did, what happened as a result. But you never actually get behind the wheel during training. Can you learn to drive well?
The answer is: yes, but it’s tricky. The challenge is that your policy might want to do something the humans never did—and you have no data about what happens then.
Chapter Overview
The Big Picture
Offline RL is reinforcement learning from a fixed dataset of previously collected experience, with no ability to interact with the environment during training. It is also called “batch RL” or “data-driven RL.”
Offline RL learns from a fixed dataset without environment interaction. The core challenge is distribution shift: the learned policy might choose actions never seen in the data, and we have no way to know if those actions are good or catastrophic. Conservative methods explicitly discourage out-of-distribution actions.
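The conservative idea can be sketched in a few lines of NumPy: push Q-values down on all actions via a soft maximum, and back up on the actions the dataset actually contains, so out-of-distribution actions that merely look good get penalized. The tabular setup, the `cql_penalty` name, and the `alpha` weight are illustrative assumptions, not the full CQL algorithm covered later in the chapter.

```python
import numpy as np

def cql_penalty(q_values, dataset_actions, alpha=1.0):
    """Sketch of a CQL-style regularizer (illustrative, not the full loss).

    q_values: (batch, n_actions) Q estimates for each state in the batch.
    dataset_actions: (batch,) index of the action actually logged.
    alpha: conservatism weight (hypothetical default).
    """
    # Soft maximum over all actions: large when *any* action,
    # seen or unseen, has a high Q-value.
    soft_max = np.log(np.exp(q_values).sum(axis=1))
    # Q-values of the actions the behavior policy actually took.
    data_q = q_values[np.arange(len(dataset_actions)), dataset_actions]
    # The penalty is large exactly when unseen actions look better
    # than the logged ones -- the overestimation we want to suppress.
    return alpha * float((soft_max - data_q).mean())

# One state where the logged action has Q=0 but an unseen action has Q=5:
q = np.array([[0.0, 5.0]])
penalty = cql_penalty(q, np.array([0]))  # large: the unseen action dominates
```

Minimizing this term alongside the usual Bellman error is what makes the method "conservative": the optimizer only keeps a high Q-value for an action if the data supports it.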
The key difference from online RL:
Online RL: Try something, see what happens, learn from it. If you’re uncertain about an action, you can explore and find out.
Offline RL: You only have the data you have. If you’re uncertain about an action not in the data, you can’t explore—you must either trust your extrapolation or avoid that action entirely.
This fundamental constraint changes everything about how we approach the problem.
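The constraint is visible even in a skeletal training loop: the only data source is the fixed dataset, so there is no `env.step()` anywhere. The dataset layout, sizes, and the commented-out update stub below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged dataset: 1,000 transitions collected earlier by
# some behavior policy (doctors, drivers, plant operators).
dataset = {
    "obs":      rng.normal(size=(1000, 4)),
    "action":   rng.integers(0, 2, size=1000),
    "reward":   rng.normal(size=1000),
    "next_obs": rng.normal(size=(1000, 4)),
}

def sample_batch(data, batch_size=64):
    """Uniform minibatch from the fixed dataset -- the only data source."""
    idx = rng.integers(0, len(data["action"]), size=batch_size)
    return {k: v[idx] for k, v in data.items()}

# Offline training loop: note there is no env.step() and no new data,
# only replays of what the behavior policy already recorded.
for _ in range(100):
    batch = sample_batch(dataset)
    # q_update(batch)  # any off-policy update, e.g. with a conservative penalty
```

Contrast with online RL, where each iteration would also call the environment to gather fresh transitions under the current policy.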
Why Offline RL Matters
Application domains:
- Healthcare: learning treatment policies from medical records
- Autonomous driving: learning from human demonstrations
- Robotics: using collected teleoperation data
- Industrial control: optimizing from historical operations
Advantages:
- Safety: no risky exploration during training
- Leverage existing data: use what you already have
- Reproducibility: the same data yields the same training run
- Cost-effective: no need for expensive simulators
New to offline RL? Begin with The Offline Setting to understand when and why offline RL is needed, then continue through the sections in order.
Connection to RLHF
Offline RL principles are central to how modern language models are trained. In RLHF (Reinforcement Learning from Human Feedback), we collect comparison data offline—humans don’t interact with the model during reward model training. The techniques in this chapter lay the groundwork for the next chapter on RLHF.
Key Takeaways
- Offline RL learns from fixed datasets without environment interaction
- Distribution shift causes Q-learning to overestimate the value of unseen actions
- Conservative methods like CQL penalize out-of-distribution actions
- Dataset quality matters more than dataset quantity
- Offline RL enables deployment in safety-critical domains
- Modern language model training (RLHF) uses offline RL principles