Foundations • Part 2 of 2

Try It Yourself

Experience the RL loop with an interactive GridWorld demo

See RL in Action

You’ve learned the concepts—now let’s see them in action. This interactive demo lets you step through an RL agent navigating a simple GridWorld.

The agent (🤖) needs to reach the goal (🎯). Watch what happens at each step:

  1. The agent observes its current position (state)
  2. The agent chooses an action based on its learned policy
  3. The agent receives a reward (-1 for each step, +10 for reaching the goal)
  4. The agent moves to a new state
  5. Repeat until the goal is reached
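The loop above can be sketched in a few lines of code. This is a minimal, hypothetical sketch (the grid size, `step` function, and hand-written policy are illustrative assumptions, not the demo's actual code); it assumes a 4×4 grid where every step costs -1 and the goal step additionally pays +10, matching the reward scheme described above.

```python
# Minimal GridWorld sketch. GRID_SIZE, GOAL, and step() are illustrative
# assumptions, not the demo's implementation.
GRID_SIZE = 4
GOAL = (3, 3)

def step(state, action):
    """Apply an action, clamp to the grid, and return (next_state, reward, done)."""
    moves = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
    dr, dc = moves[action]
    r = min(max(state[0] + dr, 0), GRID_SIZE - 1)
    c = min(max(state[1] + dc, 0), GRID_SIZE - 1)
    next_state = (r, c)
    reward = -1                      # every step costs -1
    done = next_state == GOAL
    if done:
        reward += 10                 # reaching the goal pays +10 on top
    return next_state, reward, done

# One episode with a fixed (already-learned) policy: head straight for the goal.
state, total_reward, done = (0, 0), 0, False
while not done:
    action = "down" if state[0] < GOAL[0] else "right"  # hand-written optimal policy
    state, reward, done = step(state, action)
    total_reward += reward

print(total_reward)  # 6 steps x -1, plus +10 at the goal = 4
```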

Interactive GridWorld

Watch an agent learn to reach the goal. Click "Step" to see each action.

[Interactive demo: the 🤖 agent navigates a grid toward the 🎯 goal, with counters tracking Steps and Total Reward. Legend: 🤖 Agent · 🎯 Goal (+10) · empty cell (-1 per step). Click "Step" to begin, or "Play" for automatic stepping.]
What's happening? The agent has already learned an optimal policy (shown with arrows when you click "Show Policy"). Each step costs -1 reward, encouraging the shortest path. Reaching the goal gives +10 reward. A well-trained agent maximizes total reward.

What to Notice

The Policy Matters

Click “Show Policy” to see the arrows. The agent has learned which direction to go from each cell. This is its policy—a mapping from states to actions.
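For a small grid, such a policy can literally be a lookup table from states to actions. The table below is a hypothetical hand-written fragment for illustration; the demo's learned policy may assign different actions.

```python
# A policy as a simple state -> action lookup table (illustrative values).
policy = {
    (0, 0): "down", (1, 0): "down", (2, 0): "down",
    (3, 0): "right", (3, 1): "right", (3, 2): "right",
}

def act(state):
    """Follow the policy: return the action chosen for this state."""
    return policy[state]

print(act((0, 0)))  # "down"
```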

Rewards Shape Behavior

The -1 step penalty encourages the shortest path. Without it, the agent wouldn’t care how long it takes. Reward design is crucial in RL.
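A quick back-of-the-envelope check makes this concrete. Comparing a short and a long route under two reward schemes (the path lengths and helper function are illustrative assumptions) shows why the -1 penalty matters:

```python
def episode_return(num_steps, step_reward, goal_bonus=10):
    """Total reward for an episode of num_steps, the last of which reaches the goal."""
    return num_steps * step_reward + goal_bonus

# With a -1 step penalty, the shorter path earns strictly more:
print(episode_return(6, -1))   # 4
print(episode_return(12, -1))  # -2

# With no step penalty, both paths score the same, so the agent
# has no reason to hurry:
print(episode_return(6, 0))    # 10
print(episode_return(12, 0))   # 10
```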

Cumulative Reward

Watch the total reward. The agent maximizes this over the whole episode, not just the next step. That’s why it takes the shortest path.
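In RL terms, this episode total is called the return: the sum of all rewards collected from start to finish. A sketch with an illustrative reward sequence (five ordinary steps, then the goal step worth -1 + 10):

```python
# The agent optimizes the episode's cumulative reward (the "return"),
# not any single step's reward. This reward sequence is illustrative.
rewards = [-1, -1, -1, -1, -1, 9]  # goal step: -1 step cost + 10 bonus
total_return = sum(rewards)
print(total_return)  # 4
```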

This Is Just the Beginning

This agent uses a pre-learned policy. In the chapters ahead, you’ll learn how agents learn these policies from scratch through trial and error.

💡 Experiment!

Try these:

  • Reset and step through manually to see each action
  • Use “Play” and watch the agent navigate automatically
  • Toggle “Show Policy” to see the learned strategy
  • Notice that the agent always earns the maximum possible total reward (6 steps × -1 + 10 = +4)

The RL Loop, Visualized

What you just saw is the core of all reinforcement learning:

  1. Agent sees current state (position)
  2. Agent picks action from policy
  3. Environment gives reward
  4. Agent moves to new state

Repeat until the episode ends.
ℹ️ Ready for More?

In the next chapter, we’ll start with the simplest RL problem: Multi-Armed Bandits. There’s only one state—just a choice between options with uncertain rewards. It’s where you’ll learn the fundamentals of exploration and exploitation that power all of RL.