Offline Reinforcement Learning
What You'll Learn
- Explain why offline RL is important for real-world applications
- Describe the distribution shift problem
- Implement Conservative Q-Learning (CQL)
- Explain behavior cloning and its limitations
- Understand the tradeoff between conservatism and optimality
- Identify when offline RL is appropriate vs online RL
What if you can’t explore? In healthcare, you can’t experiment on patients to learn a treatment policy. In autonomous driving, you can’t crash cars to learn safe behavior. In industrial control, you can’t risk damaging expensive equipment.
But you have years of logged data from doctors, human drivers, and plant operators. Can you learn good policies from this fixed dataset, without any new interaction?
Offline RL is like learning to drive from dashcam footage. You watch thousands of hours of driving videos: what human drivers did, what happened as a result. But you never actually get behind the wheel during training. Can you learn to drive well?
The answer is: yes, but it’s tricky. The challenge is that your policy might want to do something the humans never did—and you have no data about what happens then.
Chapter Overview
The Big Picture
Offline RL is reinforcement learning from a fixed dataset of previously collected experience, with no ability to interact with the environment during training. It is also called “batch RL” or “data-driven RL.”
Offline RL learns from a fixed dataset without environment interaction. The core challenge is distribution shift: the learned policy might choose actions never seen in the data, and we have no way to know if those actions are good or catastrophic. Conservative methods explicitly discourage out-of-distribution actions.
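The conservative idea can be sketched in a few lines of NumPy: push Q-values down on all actions via a soft maximum, and back up on the actions the dataset actually contains, so out-of-distribution actions that merely look good get penalized. The tabular setup, the `cql_penalty` name, and the `alpha` weight are illustrative assumptions, not the full CQL algorithm covered later in the chapter.

```python
import numpy as np

def cql_penalty(q_values, dataset_actions, alpha=1.0):
    """Sketch of a CQL-style regularizer (illustrative, not the full loss).

    q_values: (batch, n_actions) Q estimates for each state in the batch.
    dataset_actions: (batch,) index of the action actually logged.
    alpha: conservatism weight (hypothetical default).
    """
    # Soft maximum over all actions: large when *any* action,
    # seen or unseen, has a high Q-value.
    soft_max = np.log(np.exp(q_values).sum(axis=1))
    # Q-values of the actions the behavior policy actually took.
    data_q = q_values[np.arange(len(dataset_actions)), dataset_actions]
    # The penalty is large exactly when unseen actions look better
    # than the logged ones -- the overestimation we want to suppress.
    return alpha * float((soft_max - data_q).mean())

# One state where the logged action has Q=0 but an unseen action has Q=5:
q = np.array([[0.0, 5.0]])
penalty = cql_penalty(q, np.array([0]))  # large: the unseen action dominates
```

Minimizing this term alongside the usual Bellman error is what makes the method "conservative": the optimizer only keeps a high Q-value for an action if the data supports it.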
The key difference from online RL:
Online RL: Try something, see what happens, learn from it. If you’re uncertain about an action, you can explore and find out.
Offline RL: You only have the data you have. If you’re uncertain about an action not in the data, you can’t explore—you must either trust your extrapolation or avoid that action entirely.
This fundamental constraint changes everything about how we approach the problem.
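The constraint is visible even in a skeletal training loop: the only data source is the fixed dataset, so there is no `env.step()` anywhere. The dataset layout, sizes, and the commented-out update stub below are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical logged dataset: 1,000 transitions collected earlier by
# some behavior policy (doctors, drivers, plant operators).
dataset = {
    "obs":      rng.normal(size=(1000, 4)),
    "action":   rng.integers(0, 2, size=1000),
    "reward":   rng.normal(size=1000),
    "next_obs": rng.normal(size=(1000, 4)),
}

def sample_batch(data, batch_size=64):
    """Uniform minibatch from the fixed dataset -- the only data source."""
    idx = rng.integers(0, len(data["action"]), size=batch_size)
    return {k: v[idx] for k, v in data.items()}

# Offline training loop: note there is no env.step() and no new data,
# only replays of what the behavior policy already recorded.
for _ in range(100):
    batch = sample_batch(dataset)
    # q_update(batch)  # any off-policy update, e.g. with a conservative penalty
```

Contrast with online RL, where each iteration would also call the environment to gather fresh transitions under the current policy.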
Why Offline RL Matters
Application domains:
- Healthcare: learning treatment policies from medical records
- Autonomous driving: learning from human demonstrations
- Robotics: using collected teleoperation data
- Industrial control: optimizing from historical operations
Advantages:
- Safety: no risky exploration during training
- Leverage existing data: use what you already have
- Reproducibility: the same data yields the same training run
- Cost-effective: no need for expensive simulators
New to offline RL? Begin with The Offline Setting to understand when and why offline RL is needed, then continue through the sections in order.
Connection to RLHF
Offline RL principles are central to how modern language models are trained. In RLHF (Reinforcement Learning from Human Feedback), we collect comparison data offline—humans don’t interact with the model during reward model training. The techniques in this chapter lay the groundwork for the next chapter on RLHF.
Key Takeaways
- Offline RL learns from fixed datasets without environment interaction
- Distribution shift causes Q-learning to overestimate the value of unseen actions
- Conservative methods like CQL penalize out-of-distribution actions
- Dataset quality matters more than dataset quantity
- Offline RL enables deployment in safety-critical domains
- Modern language model training (RLHF) uses offline RL principles