See RL in Action
You’ve learned the concepts—now let’s see them in action. This interactive demo lets you step through an RL agent navigating a simple GridWorld.
The agent (🤖) needs to reach the goal (🎯). Watch what happens at each step:
- The agent observes its current position (state)
- The agent chooses an action based on its learned policy
- The agent receives a reward (-1 for each step, +10 for reaching the goal)
- The agent moves to a new state
- Repeat until the goal is reached
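The loop above can be sketched in a few lines of Python. This is an illustrative toy, not the demo's actual code: the 3×3 layout, the hand-written policy, and the function names are all assumptions; only the reward scheme (−1 per step, +10 for reaching the goal) comes from the description above.

```python
# Minimal GridWorld sketch (illustrative, not the demo's implementation).
# 3x3 grid; agent starts at top-left (0, 0), goal is bottom-right (2, 2).

GOAL = (2, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def step(state, action):
    """Apply an action, clamp to the grid, return (next_state, reward, done)."""
    r, c = state
    dr, dc = MOVES[action]
    next_state = (min(max(r + dr, 0), 2), min(max(c + dc, 0), 2))
    reward = -1                    # -1 for each step
    done = next_state == GOAL
    if done:
        reward += 10               # +10 for reaching the goal
    return next_state, reward, done

def policy(state):
    """A hand-written stand-in for a learned policy: go right, then down."""
    _, c = state
    return "right" if c < 2 else "down"

state, total_reward, done = (0, 0), 0, False
while not done:
    action = policy(state)                      # choose action from policy
    state, reward, done = step(state, action)   # observe new state and reward
    total_reward += reward                      # accumulate the episode return

print(total_reward)  # 4 moves on a 3x3 grid: 4 * -1 + 10 = 6
```

The structure is the same as what you step through in the demo: observe, act, receive a reward, move, repeat.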
Interactive GridWorld
Watch an agent navigate to the goal using its learned policy. Click "Step" to see each action.

What to Notice
Click “Show Policy” to see the arrows. The agent has learned which direction to go from each cell. This is its policy—a mapping from states to actions.
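For a small grid, such a policy is just a lookup table from states to actions. A sketch, with a made-up 2×2 layout standing in for the demo's grid:

```python
# A tabular policy: one action per cell, like the arrows "Show Policy" draws.
# The 2x2 layout and actions here are made up for illustration.
policy = {
    (0, 0): "right",
    (0, 1): "down",
    (1, 0): "right",
    # (1, 1) is the goal cell, so no action is needed there.
}

state = (0, 0)
print(policy[state])  # prints "right": the arrow shown in that cell
```

Learning algorithms fill in this table (or a function that plays its role) from experience.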
The -1 step penalty encourages the shortest path. Without it, the agent wouldn’t care how long it takes. Reward design is crucial in RL.
Watch the total reward. The agent maximizes this over the whole episode, not just the next step. That’s why it takes the shortest path.
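To see why maximizing the episode total favors the shortest path, compare the return for paths of different lengths under this reward scheme (−1 per step, +10 at the goal). The helper name below is my own, for illustration:

```python
def episode_return(num_steps, step_penalty=-1, goal_bonus=10):
    """Total reward for an episode that reaches the goal in num_steps moves."""
    return num_steps * step_penalty + goal_bonus

print(episode_return(6))   # shortest path: 6 * -1 + 10 = 4
print(episode_return(10))  # a wandering path: 10 * -1 + 10 = 0
```

Every extra step costs a point, so the policy that maximizes total reward is exactly the one that takes the fewest steps.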
This agent uses a pre-learned policy. In the chapters ahead, you’ll learn how agents learn these policies from scratch through trial and error.
Try these:
- Reset and step through manually to see each action
- Use “Play” and watch the agent navigate automatically
- Toggle “Show Policy” to see the learned strategy
- Notice that the shortest path earns the best possible total reward (6 steps × -1 + 10 = +4)
The RL Loop, Visualized
What you just saw is the core of all reinforcement learning:
In the next chapter, we’ll start with the simplest RL problem: Multi-Armed Bandits. There’s only one state—just a choice between options with uncertain rewards. It’s where you’ll learn the fundamentals of exploration and exploitation that power all of RL.