Foundations • Part 3 of 3

A Brief History of RL

From Bellman to ChatGPT: the milestones that shaped reinforcement learning

The Road to Modern RL

Reinforcement learning has a rich history spanning decades of research. Understanding where the field came from helps us appreciate where it is going.

1950s — The Foundations

Richard Bellman develops dynamic programming and the Bellman equation—the mathematical backbone of RL. He also coins the term “curse of dimensionality.”

Bellman, R. (1957). Dynamic Programming. Princeton University Press.
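For reference, the optimal-value form of the Bellman equation in modern notation (not the notation of the 1957 book): the value of a state is the best immediate reward plus the discounted expected value of the successor state.

```latex
V^*(s) = \max_a \Big[ r(s, a) + \gamma \sum_{s'} P(s' \mid s, a)\, V^*(s') \Big]
```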

1988 — Temporal Difference Learning

Richard Sutton publishes the foundational paper on TD learning, showing how to learn from incomplete episodes by bootstrapping from estimates.

Sutton, R. (1988). Learning to predict by the methods of temporal differences. Machine Learning, 3(1), 9-44.
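The core of Sutton's idea fits in a few lines: update a value estimate toward the reward plus the discounted estimate of the next state, rather than waiting for the episode to finish. A minimal tabular TD(0) sketch (the dict-based value table and default hyperparameters here are illustrative choices, not from the paper):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.99):
    """One TD(0) update: V(s) <- V(s) + alpha * (r + gamma * V(s') - V(s)).

    V is a dict mapping state -> value estimate; unseen states default to 0.
    Returns the TD error, the quantity the update is driven by."""
    td_error = r + gamma * V.get(s_next, 0.0) - V.get(s, 0.0)
    V[s] = V.get(s, 0.0) + alpha * td_error
    return td_error
```

Note that the target `r + gamma * V(s')` uses the agent's own current estimate of `V(s')`—this bootstrapping is what lets TD learn before an episode ends.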

1989 — Q-Learning

Chris Watkins introduces Q-learning in his PhD thesis, proving that a simple update rule can learn optimal behavior without a model of the environment.

Watkins, C. (1989). Learning from Delayed Rewards. PhD thesis, University of Cambridge.
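Watkins's "simple update rule" really is simple. A tabular sketch of one Q-learning step (the dict-keyed Q-table and parameter defaults are illustrative, not from the thesis):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha=0.1, gamma=0.99):
    """One tabular Q-learning update:
    Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a)).

    Q is a dict mapping (state, action) -> value; unseen pairs default to 0."""
    best_next = max(Q.get((s_next, a2), 0.0) for a2 in actions)
    td_target = r + gamma * best_next  # bootstrap from the greedy next action
    Q[(s, a)] = Q.get((s, a), 0.0) + alpha * (td_target - Q.get((s, a), 0.0))
    return Q[(s, a)]
```

Because the target takes a max over next actions regardless of what the agent actually does next, Q-learning is off-policy—the key to Watkins's proof that it converges to optimal values without an environment model.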

1992 — TD-Gammon

Gerald Tesauro at IBM creates TD-Gammon, a backgammon program that learns through self-play. It reaches world-champion level and discovers novel strategies that human experts adopt.

Tesauro, G. (1995). Temporal difference learning and TD-Gammon. Communications of the ACM, 38(3), 58-68.

1998 — The RL Bible

Sutton & Barto publish Reinforcement Learning: An Introduction, unifying decades of research into a coherent framework. It becomes the definitive textbook.

Sutton, R. & Barto, A. (1998; 2nd ed. 2018). Reinforcement Learning: An Introduction. MIT Press. Available free online.

2013 — Deep Q-Networks (DQN)

DeepMind combines deep learning with Q-learning. DQN learns to play Atari games from raw pixels, achieving superhuman performance on many games with a single architecture.

Key innovations: experience replay, target networks, end-to-end learning from pixels

Mnih, V. et al. (2013). Playing Atari with Deep Reinforcement Learning. arXiv:1312.5602. Nature paper: 2015.
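Of DQN's innovations, experience replay is the easiest to see in code: store transitions and train on random minibatches, which breaks the temporal correlations that destabilize naive online updates. A minimal sketch (class name and capacity are illustrative; the paper's buffer held a million transitions):

```python
import random
from collections import deque

class ReplayBuffer:
    """Minimal experience replay buffer in the style DQN popularized."""

    def __init__(self, capacity=10_000):
        # deque with maxlen evicts the oldest transition once full
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # uniform random minibatch decorrelates consecutive transitions
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
```

The target network—DQN's other stabilizer—is just a periodically copied snapshot of the Q-network used to compute update targets, so the target doesn't chase itself every step.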

2016 — AlphaGo Defeats Lee Sedol

DeepMind’s AlphaGo defeats world champion Lee Sedol at Go, a game thought to be decades away from AI mastery. Its unconventional “Move 37” becomes legendary.

Combines Monte Carlo Tree Search with deep neural networks trained via RL

Silver, D. et al. (2016). Mastering the game of Go with deep neural networks and tree search. Nature, 529, 484-489.

2017 — AlphaZero

AlphaZero learns chess, shogi, and Go from scratch through self-play alone—no human games, no handcrafted features. It defeats Stockfish at chess after just four hours of training.

Silver, D. et al. (2017). Mastering Chess and Shogi by Self-Play with a General Reinforcement Learning Algorithm. arXiv:1712.01815

2017 — PPO

OpenAI introduces Proximal Policy Optimization, a simple yet effective policy gradient method that becomes the default algorithm for many applications.

Schulman, J. et al. (2017). Proximal Policy Optimization Algorithms. arXiv:1707.06347
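PPO's simplicity is part of why it became the default: its core is a clipped surrogate objective that discourages the new policy from straying far from the old one. A per-sample sketch of that objective (function name is illustrative; `ratio` is the probability ratio pi_new(a|s) / pi_old(a|s)):

```python
def ppo_clip_objective(ratio, advantage, eps=0.2):
    """PPO's clipped surrogate objective for one sample (to be maximized):
    min( ratio * A, clip(ratio, 1 - eps, 1 + eps) * A ).

    Clipping removes the incentive to push the ratio outside [1-eps, 1+eps]."""
    unclipped = ratio * advantage
    clipped = max(min(ratio, 1.0 + eps), 1.0 - eps) * advantage
    return min(unclipped, clipped)
```

Taking the minimum makes the objective a pessimistic bound: a large policy change can only hurt the objective, never help it, which is what keeps updates conservative without the second-order machinery of its predecessor, TRPO.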

2019 — OpenAI Five & AlphaStar

RL conquers complex multiplayer games. OpenAI Five defeats world champions in Dota 2. DeepMind’s AlphaStar reaches Grandmaster level in StarCraft II.

2020 — MuZero

MuZero learns to play Atari, Go, chess, and shogi without even knowing the rules. It learns a model of the environment and plans with it.

Schrittwieser, J. et al. (2020). Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model. Nature, 588, 604-609.

2022–2024 — The RLHF Revolution

ChatGPT and other LLMs use RLHF (Reinforcement Learning from Human Feedback) to align language models with human preferences. RL becomes central to making AI assistants helpful and safe.

InstructGPT, ChatGPT, Claude, Gemini—all trained with RL from human feedback

Ouyang, L. et al. (2022). Training language models to follow instructions with human feedback. NeurIPS.

2025 — Today

RL is everywhere: robotics (Figure, Boston Dynamics), autonomous vehicles, chip design, scientific discovery, and AI reasoning. New frontiers include multi-agent RL, world models, and RL for code generation.

The journey continues…

Key Themes Across History

Games as Proving Grounds

From backgammon to Go to Dota 2, games have driven RL progress. They offer clear rewards, fast simulation, and objective benchmarks.

Compute Unlocks Potential

Many ideas from the 1980s-90s only became practical with modern hardware. DQN, AlphaGo, and OpenAI Five all required massive compute.

Deep Learning + RL

The 2013 DQN paper sparked the “deep RL” revolution by combining neural networks with classic RL algorithms.

From Games to Real World

RLHF for LLMs marked a shift: RL now powers products used by billions, not just game-playing demos.

Further Reading

💡 Essential Papers

For a deeper dive, these papers are worth reading:

  • Sutton (1988) — TD Learning paper (foundational)
  • Mnih et al. (2015) — DQN Nature paper (deep RL breakthrough)
  • Silver et al. (2016) — AlphaGo (superhuman game-playing)
  • Schulman et al. (2017) — PPO (workhorse algorithm)
  • Ouyang et al. (2022) — InstructGPT/RLHF (LLM alignment)

All are available on arXiv or the authors’ websites.

ℹ️ Standing on Giants

The history of RL is one of incremental progress punctuated by breakthroughs. Many of today’s techniques—experience replay, target networks, policy gradients—have roots going back decades. Understanding this history helps you appreciate why algorithms are designed the way they are.