The Building Blocks of RL
Now that we’ve seen RL in action, let’s define our terms precisely. Every RL system has the same core components.
- Agent: the learner and decision-maker, and the thing we're building. It observes the current state, decides on an action, and learns from experience.
- Environment: the world the agent interacts with. It responds to each action with a new state and a reward.
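In code, the interaction between these two components can be sketched as a single loop. This is a minimal sketch under assumed names: `reset`, `step`, `act`, and `learn` follow common RL-library conventions, not necessarily an API defined in this book.

```python
def run_episode(env, agent):
    """One episode of the agent-environment loop (illustrative names)."""
    state = env.reset()                    # agent observes the initial state
    done, total_reward = False, 0.0
    while not done:
        action = agent.act(state)          # agent decides
        next_state, reward, done = env.step(action)  # environment responds
        agent.learn(state, action, reward, next_state)  # agent learns
        state = next_state
        total_reward += reward
    return total_reward
```

Any environment exposing `reset`/`step` and any agent exposing `act`/`learn` can be plugged into this loop unchanged, which is why we define the components separately.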
A Simple Example: GridWorld
Throughout this book, we’ll use GridWorld as our primary example environment. It’s simple enough to understand completely, yet rich enough to illustrate key concepts.
Imagine a 4×4 grid. Your agent starts in one corner. The goal is in the opposite corner. Each step, the agent can move up, down, left, or right (if not blocked by a wall).
- State: Agent’s position (row, column)
- Actions: Up, Down, Left, Right
- Reward: -1 per step, +10 at goal
- Episode: Ends when agent reaches goal
The per-step penalty encourages the agent to find the shortest path: without it, the agent would have no incentive to reach the goal quickly.
This simple setup illustrates the core RL loop: observe position → choose direction → receive reward → repeat. We’ll return to GridWorld throughout the book to test new algorithms.
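The environment described above can be sketched in a few lines of Python. This is an illustrative implementation, not the book's own code: the `reset`/`step` method names follow the common Gym-style convention, and I assume the goal step earns both the -1 step cost and the +10 bonus (a net +9).

```python
import random

class GridWorld:
    """4x4 grid: start at (0, 0), goal at (3, 3).

    Rewards: -1 per step, plus +10 on the step that reaches the goal.
    """
    SIZE = 4
    ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

    def __init__(self):
        self.state = (0, 0)

    def reset(self):
        self.state = (0, 0)
        return self.state

    def step(self, action):
        dr, dc = self.ACTIONS[action]
        # Clamp to the grid: moves into a wall leave the agent in place.
        row = min(max(self.state[0] + dr, 0), self.SIZE - 1)
        col = min(max(self.state[1] + dc, 0), self.SIZE - 1)
        self.state = (row, col)
        done = self.state == (self.SIZE - 1, self.SIZE - 1)
        reward = -1 + (10 if done else 0)
        return self.state, reward, done

# The core RL loop: observe position -> choose direction -> receive reward -> repeat.
env = GridWorld()
state, done, total = env.reset(), False, 0
while not done:
    action = random.choice(list(GridWorld.ACTIONS))  # random policy for now
    state, reward, done = env.step(action)
    total += reward
print("episode return:", total)
```

The random policy eventually stumbles into the goal, but its return is far below the best possible value of 4 (five -1 steps plus a final +9); the algorithms in later chapters will close that gap.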