The Road to Modern RL
Reinforcement learning has a rich history spanning decades of research. Understanding where the field came from helps us appreciate where it is going.
Richard Bellman develops dynamic programming and the Bellman equation—the mathematical backbone of RL. The term “curse of dimensionality” is coined.
Bellman, R. (1957). Dynamic Programming. Princeton University Press.
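In modern notation, the optimality equation Bellman introduced can be written (one common form, for a discounted Markov decision process with discount factor γ):

```latex
V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a)\,\bigl[R(s, a, s') + \gamma\,V^*(s')\bigr]
```

The "curse of dimensionality" names the problem that the state space this recursion sweeps over grows exponentially with the number of state variables.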
Richard Sutton publishes the foundational paper on TD learning, showing how to learn from incomplete episodes by bootstrapping from estimates.
Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
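The core of TD(0) is a one-line update that moves a state's value estimate toward a bootstrapped target rather than waiting for the episode's final outcome. A minimal sketch (the function name and dict-based value table are illustrative, not from the paper):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: move V[s] toward the bootstrapped target
    r + gamma * V[s_next] instead of waiting for the episode to end."""
    td_error = r + gamma * V[s_next] - V[s]
    V[s] += alpha * td_error
    return V
```

Bootstrapping from the current estimate `V[s_next]` is exactly what makes learning from incomplete episodes possible.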
Chris Watkins introduces Q-learning in his PhD thesis, proving that a simple update rule can learn optimal behavior without a model of the environment.
Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, Cambridge University.
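Watkins's update is equally compact: it bootstraps from the greedy action in the next state, regardless of which action the behavior policy actually takes, which is what makes Q-learning off-policy and model-free. A minimal sketch (names and the tuple-keyed table are illustrative):

```python
from collections import defaultdict

def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One Q-learning update: bootstrap from the greedy action in s_next,
    independent of the action the agent takes next (off-policy)."""
    best_next = max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])
    return Q
```

A tabular agent would call this once per observed transition; Watkins's convergence result requires every state-action pair to be visited repeatedly with an appropriately decaying step size.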
Gerald Tesauro at IBM creates TD-Gammon, a backgammon program that learns through self-play. It reaches near-world-champion level and discovers novel strategies that human experts adopt.
Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58-68.
Sutton & Barto publish Reinforcement Learning: An Introduction, unifying decades of research into a coherent framework. It becomes the definitive textbook.
Sutton, R. & Barto, A. (1998; 2nd ed. 2018). Reinforcement Learning: An Introduction. MIT Press. Freely available online.
DeepMind combines deep learning with Q-learning. DQN learns to play Atari games from raw pixels, achieving superhuman performance on many games with a single architecture.
Key innovations: experience replay, target networks, end-to-end learning from pixels
Mnih, V. et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602. Nature version: Mnih, V. et al. (2015). Human-level control through deep reinforcement learning. Nature, 518, 529-533.
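Experience replay, the first of those innovations, can be sketched in a few lines: transitions are stored in a fixed-size buffer and sampled uniformly, which breaks the strong correlation between consecutive frames. A minimal sketch (class and parameter names are illustrative, not DeepMind's code):

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size buffer of (s, a, r, s', done) transitions.
    Uniform sampling decorrelates consecutive frames."""

    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest transitions are evicted

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)
```

The target network, the second innovation, is simply a periodically synced copy of the Q-network used to compute bootstrap targets, which stabilizes the otherwise moving target.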
DeepMind’s AlphaGo defeats world champion Lee Sedol at Go, a game thought to be decades away from AI mastery. AlphaGo’s move 37 in game two becomes legendary.
Combines Monte Carlo Tree Search with deep neural networks trained via RL
Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.
AlphaZero learns chess, shogi, and Go from scratch through self-play alone: no human games, no handcrafted features. After just four hours of training, it defeats Stockfish at chess.
Silver, D. et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815
OpenAI introduces Proximal Policy Optimization, a simple yet effective policy gradient method that becomes the default algorithm for many applications.
Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
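The heart of PPO is the clipped surrogate objective: it caps how far the probability ratio between the new and old policy can push each update, keeping policy changes small. A minimal NumPy sketch (function and argument names are illustrative):

```python
import numpy as np

def ppo_clip_objective(ratio, advantage, eps=0.2):
    """Clipped surrogate objective (to be maximized).
    ratio = pi_new(a|s) / pi_old(a|s); clipping removes the incentive
    to push the ratio outside [1 - eps, 1 + eps]."""
    unclipped = ratio * advantage
    clipped = np.clip(ratio, 1.0 - eps, 1.0 + eps) * advantage
    return np.minimum(unclipped, clipped).mean()
```

Taking the elementwise minimum makes the objective a pessimistic bound: an update only benefits from ratio changes that stay inside the clip range.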
RL conquers complex multiplayer games. OpenAI Five defeats world champions in Dota 2. DeepMind’s AlphaStar reaches Grandmaster level in StarCraft II.
OpenAI (2019). OpenAI Five Defeats Dota 2 World Champions
MuZero learns to play Atari, Go, chess, and shogi without even knowing the rules. It learns a model of the environment and plans with it.
Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588, 604-609.
ChatGPT and other LLMs use RLHF (Reinforcement Learning from Human Feedback) to align language models with human preferences. RL becomes central to making AI assistants helpful and safe.
InstructGPT, ChatGPT, Claude, Gemini—all trained with RL from human feedback
Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.
RL is everywhere: robotics (Figure, Boston Dynamics), autonomous vehicles, chip design, scientific discovery, and AI reasoning. New frontiers include multi-agent RL, world models, and RL for code generation.
The journey continues…
Key Themes Across History
From backgammon to Go to Dota 2, games have driven RL progress. They offer clear rewards, fast simulation, and objective benchmarks.
Many ideas from the 1980s-90s only became practical with modern hardware. DQN, AlphaGo, and OpenAI Five all required massive compute.
The 2013 DQN paper sparked the “deep RL” revolution by combining neural networks with classic RL algorithms.
RLHF for LLMs marked a shift: RL now powers products used by billions, not just game-playing demos.
Further Reading
For a deeper dive, these papers are worth reading:
- Sutton (1988) — TD Learning paper (foundational)
- Mnih et al. (2015) — DQN Nature paper (deep RL breakthrough)
- Silver et al. (2016) — AlphaGo (superhuman game-playing)
- Schulman et al. (2017) — PPO (workhorse algorithm)
- Ouyang et al. (2022) — InstructGPT/RLHF (LLM alignment)
All are available on arXiv or the authors’ websites.
The history of RL is one of incremental progress punctuated by breakthroughs. Many of today’s techniques—experience replay, target networks, policy gradients—have roots going back decades. Understanding this history helps you appreciate why algorithms are designed the way they are.