Foundations • Part 2 of 2

Try It Yourself

Experience the RL loop with an interactive GridWorld demo

See RL in Action

You’ve learned the concepts—now let’s see them in action. This interactive demo lets you step through an RL agent navigating a simple GridWorld.

The agent (🤖) needs to reach the goal (🎯). Watch what happens at each step:

  1. The agent observes its current position (state)
  2. The agent chooses an action based on its learned policy
  3. The agent receives a reward (-1 for each step, +10 for reaching the goal)
  4. The agent moves to a new state
  5. Repeat until the goal is reached
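The loop above can be sketched in a few lines of code. This is a minimal, hypothetical sketch (the grid size, `step` function, and hand-written policy are illustrative assumptions, not the demo's actual code); it assumes a 4×4 grid where every step costs -1 and the goal step additionally pays +10, matching the reward scheme described above.

```python
# Minimal GridWorld sketch. GRID_SIZE, GOAL, and step() are illustrative
# assumptions, not the demo's implementation.
GRID_SIZE = 4
GOAL = (3, 3)

def step(state, action):
    """Apply an action, clamp to the grid, and return (next_state, reward, done)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r = min(max(state[0] + dr, 0), GRID_SIZE - 1)
    c = min(max(state[1] + dc, 0), GRID_SIZE - 1)
    next_state = (r, c)
    reward = -1                      # every step costs -1
    done = next_state == GOAL
    if done:
        reward += 10                 # reaching the goal pays +10 on top
    return next_state, reward, done

# One episode with a fixed (already-learned) policy: head straight for the goal.
state, total_reward, done = (0, 0), 0, False
while not done:
    action = "down" if state[0] < GOAL[0] else "right"  # hand-written optimal policy
    state, reward, done = step(state, action)
    total_reward += reward

print(total_reward)  # 6 steps x -1, plus +10 at the goal = 4
```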

Interactive GridWorld

Watch an agent learn to reach the goal. Click "Step" to see each action.

[Interactive demo: the 🤖 agent navigates a grid toward the 🎯 goal, with counters tracking Steps and Total Reward. Legend: 🤖 Agent · 🎯 Goal (+10) · empty cell (-1 per step). Click "Step" to begin, or "Play" for automatic stepping.]
What's happening? The agent has already learned an optimal policy (shown with arrows when you click "Show Policy"). Each step costs -1 reward, encouraging the shortest path. Reaching the goal gives +10 reward. A well-trained agent maximizes total reward.

What to Notice

The Policy Matters

Click “Show Policy” to see the arrows. The agent has learned which direction to go from each cell. This is its policy—a mapping from states to actions.
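For a small grid, such a policy can literally be a lookup table from states to actions. The table below is a hypothetical hand-written fragment for illustration; the demo's learned policy may assign different actions.

```python
# A policy as a simple state -> action lookup table (illustrative values).
policy = {
    (0, 0): "down", (1, 0): "down", (2, 0): "down",
    (3, 0): "right", (3, 1): "right", (3, 2): "right",
}

def act(state):
    """Follow the policy: return the action chosen for this state."""
    return policy[state]

print(act((0, 0)))  # "down"
```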

Rewards Shape Behavior

The -1 step penalty encourages the shortest path. Without it, the agent wouldn’t care how long it takes. Reward design is crucial in RL.
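A quick back-of-the-envelope check makes this concrete. Comparing a short and a long route under two reward schemes (the path lengths and helper function are illustrative assumptions) shows why the -1 penalty matters:

```python
def episode_return(num_steps, step_reward, goal_bonus=10):
    """Total reward for an episode of num_steps, the last of which reaches the goal."""
    return num_steps * step_reward + goal_bonus

# With a -1 step penalty, the shorter path earns strictly more:
print(episode_return(6, -1))   # 4
print(episode_return(12, -1))  # -2

# With no step penalty, both paths score the same, so the agent
# has no reason to hurry:
print(episode_return(6, 0))    # 10
print(episode_return(12, 0))   # 10
```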

Cumulative Reward

Watch the total reward. The agent maximizes this over the whole episode, not just the next step. That’s why it takes the shortest path.
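In RL terms, this episode total is called the return: the sum of all rewards collected from start to finish. A sketch with an illustrative reward sequence (five ordinary steps, then the goal step worth -1 + 10):

```python
# The agent optimizes the episode's cumulative reward (the "return"),
# not any single step's reward. This reward sequence is illustrative.
rewards = [-1, -1, -1, -1, -1, 9]  # goal step: -1 step cost + 10 bonus
total_return = sum(rewards)
print(total_return)  # 4
```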

This Is Just the Beginning

This agent uses a pre-learned policy. In the chapters ahead, you’ll learn how agents learn these policies from scratch through trial and error.

💡 Experiment!

Try these:

  • Reset and step through manually to see each action
  • Use “Play” and watch the agent navigate automatically
  • Toggle “Show Policy” to see the learned strategy
  • Notice that the agent always earns the maximum possible total reward (6 steps × -1 + 10 = +4)

The RL Loop, Visualized

What you just saw is the core of all reinforcement learning:

  1. Agent sees current state (position)
  2. Agent picks action from policy
  3. Environment gives reward
  4. Agent moves to new state

Repeat until the episode ends.
ℹ️ Ready for More?

In the next chapter, we’ll start with the simplest RL problem: Multi-Armed Bandits. There’s only one state—just a choice between options with uncertain rewards. It’s where you’ll learn the fundamentals of exploration and exploitation that power all of RL.