GridWorld Playground
Why GridWorld?
GridWorld is the "Hello World" of reinforcement learning. It's simple enough to understand completely, yet rich enough to demonstrate every core concept:
- States: grid positions
- Actions: up, down, left, right
- Rewards: goal (+10), step penalty (-0.1), walls (-0.5)
- Transitions: deterministic or stochastic
When you truly understand GridWorld, you understand RL.
Specification
Mathematical Details
State Space:
- Type: Discrete
- Size: n² for an n×n grid (e.g., 4×4 = 16 states)
- Representation: tuple or flattened index
Action Space:
- Type: Discrete
- Size: 4
- Actions: UP (0), DOWN (1), LEFT (2), RIGHT (3)
Transition Dynamics:
- Deterministic: Action moves agent in specified direction
- Wall collision: Agent stays in place
- Optional: Wind/stochasticity (agent may slip)
Reward Function:
- Goal state: +10
- Each step: -0.1 (encourages efficiency)
- Wall collision: -0.5 (optional)
- Obstacle: penalty and the episode ends
Episode Termination:
- Agent reaches goal
- Agent hits obstacle
- Maximum steps reached (e.g., 100)
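As a minimal illustration of the specification above (assuming an n×n grid and the UP/DOWN/LEFT/RIGHT encoding listed earlier), state indexing and the clipped deterministic transition can be written as plain helper functions; the function names below are hypothetical, not part of any fixed API:

```python
# Illustrative sketch of the state encoding and deterministic transition
# described above, assuming an n x n grid and the action encoding
# UP=0, DOWN=1, LEFT=2, RIGHT=3. Helper names are hypothetical.

ACTION_DELTAS = {0: (-1, 0), 1: (1, 0), 2: (0, -1), 3: (0, 1)}

def to_index(row, col, n):
    """Flatten a (row, col) tuple into a single discrete state index."""
    return row * n + col

def to_tuple(index, n):
    """Recover the (row, col) tuple from a flattened index."""
    return divmod(index, n)

def deterministic_step(row, col, action, n):
    """Move in the chosen direction; moves off the grid leave the agent in place."""
    dr, dc = ACTION_DELTAS[action]
    new_row = min(max(row + dr, 0), n - 1)
    new_col = min(max(col + dc, 0), n - 1)
    return new_row, new_col
```

For a 4×4 grid, to_index(3, 3, 4) gives 15, the flattened index of the bottom-right goal cell.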
Variants
| Variant | Grid | Obstacles | Dynamics | Difficulty |
|---|---|---|---|---|
| Basic | 4×4 | None | Deterministic | Easy |
| Walls | 6×6 | Walls | Deterministic | Medium |
| Windy | 4×4 | None | Stochastic | Medium |
| Cliff | 4×12 | Cliff edge | Deterministic | Hard |
| Maze | 8×8 | Complex walls | Deterministic | Hard |
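The Basic and Walls variants map directly onto the GridWorld class in the Implementation section below; the wall layout shown here is an arbitrary example, and the Windy and Cliff variants would additionally need stochastic transitions and a terminal cliff region that the minimal class does not implement:

```python
# Hypothetical configurations for the Basic and Walls variants, using the
# GridWorld class defined in the Implementation section. The obstacle
# layout for the Walls variant is an arbitrary example, not a canonical map.
basic = GridWorld(size=4)
walls = GridWorld(size=6, obstacles=[(1, 1), (1, 2), (3, 4), (4, 2)])
```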
Baseline Results
Performance on 4×4 GridWorld (average return over 100 episodes):
| Algorithm | Mean Return | Episodes to Converge |
|---|---|---|
| Random | -45.2 | N/A |
| Q-Learning | -2.8 | ~50 |
| SARSA | -3.1 | ~60 |
| Monte Carlo | -3.5 | ~100 |
| Optimal | -2.6 | N/A |
The optimal path takes 6 steps from corner to corner; its return is the optimal value reported in the table above.
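A minimal sketch of how the random-policy row might be reproduced, assuming the GridWorld class from the Implementation section below (exact numbers will vary with seeds and reward settings):

```python
import numpy as np

def evaluate_random_policy(env, episodes=100, max_steps=100):
    """Average undiscounted return of a uniformly random policy (sketch)."""
    returns = []
    for _ in range(episodes):
        env.reset()
        total, terminated, truncated = 0.0, False, False
        for _ in range(max_steps):
            obs, reward, terminated, truncated, _ = env.step(env.action_space.sample())
            total += reward
            if terminated or truncated:
                break
        returns.append(total)
    return float(np.mean(returns))

# print(evaluate_random_policy(GridWorld(size=4)))
```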
Suggested Experiments
1. Value Function Visualization
Watch the value function evolve during training. You should see values propagate backward from the goal (a sketch for extracting the value map from a learned Q-table follows this list).
2. Exploration vs. Exploitation
Try different exploration rates ε (0.1, 0.3, 0.5). How does ε affect learning speed and final performance?
3. Discount Factor Impact
Compare a small discount factor γ with one close to 1. Which finds shorter paths?
4. SARSA vs. Q-Learning
Run both on CliffWalking. SARSA learns the safe path; Q-Learning learns the risky optimal path.
5. Stochastic Transitions
Add 10% random action probability. How do the learned policies differ?
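For experiments 1 through 4, a tabular ε-greedy Q-learning loop along the following lines is one way to start. This is a minimal sketch assuming the Gymnasium-style GridWorld from the Implementation section below; the function name and hyperparameter values are illustrative, not a prescribed setup. Varying epsilon and gamma covers experiments 2 and 3, the comment on the update target marks the change needed for experiment 4, and V(s) = max_a Q(s, a) reshaped to the grid gives the value map for experiment 1.

```python
import numpy as np

def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1, max_steps=100):
    """Tabular Q-learning with an epsilon-greedy behaviour policy (sketch)."""
    Q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state, _ = env.reset()
        for _ in range(max_steps):
            # Epsilon-greedy action selection (experiment 2: vary epsilon).
            if np.random.rand() < epsilon:
                action = env.action_space.sample()
            else:
                action = int(np.argmax(Q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Q-learning bootstraps from the greedy next action (experiment 4:
            # SARSA would instead use the next action actually selected).
            target = reward + gamma * (0.0 if terminated else np.max(Q[next_state]))
            Q[state, action] += alpha * (target - Q[state, action])
            state = next_state
            if terminated or truncated:
                break
    return Q

# Experiment 1: value map for visualization, V(s) = max_a Q(s, a).
# Q = q_learning(GridWorld(size=4))
# print(np.max(Q, axis=1).reshape(4, 4).round(2))
```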
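For experiment 5, one option is a thin Gymnasium ActionWrapper that replaces the chosen action with a random one 10% of the time; the class name and slip_prob parameter here are hypothetical:

```python
import gymnasium as gym

class SlipperyActions(gym.ActionWrapper):
    """With probability slip_prob, ignore the agent's action and act randomly."""

    def __init__(self, env, slip_prob=0.1):
        super().__init__(env)
        self.slip_prob = slip_prob

    def action(self, action):
        # Replace the intended action with a uniformly random one on a "slip".
        if self.env.np_random.random() < self.slip_prob:
            return self.env.action_space.sample()
        return action

# windy_env = SlipperyActions(GridWorld(size=4), slip_prob=0.1)
```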
Implementation
```python
import numpy as np
import gymnasium as gym
from gymnasium import spaces


class GridWorld(gym.Env):
    def __init__(self, size=4, obstacles=None):
        super().__init__()
        self.size = size
        self.obstacles = obstacles or []
        self.observation_space = spaces.Discrete(size * size)
        self.action_space = spaces.Discrete(4)  # UP, DOWN, LEFT, RIGHT
        self.goal = (size - 1, size - 1)
        self.start = (0, 0)
        self.agent_pos = self.start

        # Action effects
        self.actions = {
            0: (-1, 0),  # UP
            1: (1, 0),   # DOWN
            2: (0, -1),  # LEFT
            3: (0, 1),   # RIGHT
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent_pos = self.start
        return self._get_obs(), {}

    def step(self, action):
        # Calculate the new position, clipping to the grid boundaries
        dr, dc = self.actions[action]
        new_row = max(0, min(self.size - 1, self.agent_pos[0] + dr))
        new_col = max(0, min(self.size - 1, self.agent_pos[1] + dc))
        new_pos = (new_row, new_col)

        # Check for walls: the agent stays in place on a collision
        if new_pos not in self.obstacles:
            self.agent_pos = new_pos

        # Compute reward: +10 at the goal, -0.1 per step otherwise
        if self.agent_pos == self.goal:
            reward = 10.0
            terminated = True
        else:
            reward = -0.1
            terminated = False

        return self._get_obs(), reward, terminated, False, {}

    def _get_obs(self):
        # Flatten (row, col) into a single discrete state index
        return self.agent_pos[0] * self.size + self.agent_pos[1]
```
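A quick smoke test of the environment above, using a random rollout (output varies from run to run):

```python
# Random rollout as a sanity check of the GridWorld class above.
env = GridWorld(size=4)
obs, info = env.reset(seed=0)
total_reward = 0.0
for _ in range(100):
    obs, reward, terminated, truncated, info = env.step(env.action_space.sample())
    total_reward += reward
    if terminated or truncated:
        break
print(f"final state: {obs}, return: {total_reward:.1f}")
```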
Further Reading
- Introduction to TD Learning: learn TD methods on GridWorld
- Q-Learning Basics: master Q-Learning here
- Sutton & Barto, Ch. 4: Dynamic Programming on GridWorld