πŸ“ AI Generated

GridWorld Playground

The canonical RL teaching environment. Navigate a grid, avoid obstacles, reach the goal. Perfect for understanding value functions and policies.

Why GridWorld?

GridWorld is the β€œHello World” of reinforcement learning. It’s simple enough to understand completely, yet rich enough to demonstrate every core concept:

  • States β€” grid positions
  • Actions β€” up, down, left, right
  • Rewards β€” goal (+10), step penalty (-0.1), walls (-0.5)
  • Transitions β€” deterministic or stochastic

When you truly understand GridWorld, you understand RL.

Specification

Mathematical Details

State Space:

  • Type: Discrete
  • Size: nΓ—mn \times m (e.g., 4Γ—4 = 16 states)
  • Representation: (row,col)(row, col) tuple or flattened index

Action Space:

  • Type: Discrete
  • Size: 4
  • Actions: UP (0), DOWN (1), LEFT (2), RIGHT (3)

Transition Dynamics:

  • Deterministic: Action moves agent in specified direction
  • Wall collision: Agent stays in place
  • Optional: Wind/stochasticity (agent may slip)

Reward Function:

  • Goal state: +10+10
  • Each step: βˆ’0.1-0.1 (encourages efficiency)
  • Wall collision: βˆ’0.5-0.5 (optional)
  • Obstacle: βˆ’5-5 and episode ends

Episode Termination:

  • Agent reaches goal
  • Agent hits obstacle
  • Maximum steps reached (e.g., 100)

Variants

| Variant | Grid | Obstacles | Dynamics | Difficulty |
| --- | --- | --- | --- | --- |
| Basic | 4Γ—4 | None | Deterministic | Easy |
| Walls | 6Γ—6 | Walls | Deterministic | Medium |
| Windy | 4Γ—4 | None | Stochastic | Medium |
| Cliff | 4Γ—12 | Cliff edge | Deterministic | Hard |
| Maze | 8Γ—8 | Complex walls | Deterministic | Hard |

Baseline Results

Performance on 4Γ—4 GridWorld (average return over 100 episodes):

| Algorithm | Mean Return | Episodes to Converge |
| --- | --- | --- |
| Random | -45.2 | β€” |
| Q-Learning | -2.8 | ~50 |
| SARSA | -3.1 | ~60 |
| Monte Carlo | -3.5 | ~100 |
| Optimal | -2.6 | β€” |

The optimal path takes 6 steps from corner to corner: -0.1 Γ— 6 + 10 = 9.4 return.

Suggested Experiments

1. Value Function Visualization

Watch the value function evolve during training. You should see values propagate backward from the goal.
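
One rough sketch of how to inspect this: keep a tabular Q-function, derive V(s) = max_a Q(s, a), and print it reshaped to the grid. The helper name and the (n_states, n_actions) table shape below are illustrative assumptions, not part of any fixed API.

import numpy as np

def print_value_grid(q_table, size=4):
    # State values under the greedy policy: V(s) = max_a Q(s, a)
    values = q_table.max(axis=1).reshape(size, size)
    for row in values:
        print(" ".join(f"{v:6.2f}" for v in row))

# Example call with an untrained (all-zero) Q-table for a 4Γ—4 grid
print_value_grid(np.zeros((16, 4)), size=4)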

2. Exploration vs. Exploitation

Try different Ξ΅ values (0.1, 0.3, 0.5). How do they affect learning speed and final performance?
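
A minimal Ξ΅-greedy selection sketch you could drop into a training loop; the function name and the (16, 4) Q-table shape are assumptions for illustration.

import numpy as np

def epsilon_greedy(q_table, state, epsilon, rng):
    # Explore uniformly with probability epsilon, otherwise act greedily
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))

rng = np.random.default_rng(0)
q_table = np.zeros((16, 4))  # assumed shape: (n_states, n_actions)
action = epsilon_greedy(q_table, state=0, epsilon=0.1, rng=rng)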

3. Discount Factor Impact

Compare Ξ³ = 0.9 vs. Ξ³ = 0.99. Which finds shorter paths?
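
For a concrete comparison, here is a compact tabular Q-learning loop with the discount factor exposed as a parameter. It is a sketch that assumes the GridWorld environment from the Implementation section below; the hyperparameter defaults (alpha, epsilon, episode count) are illustrative, not tuned.

import numpy as np

def train_q_learning(env, gamma=0.99, alpha=0.1, epsilon=0.1, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy
            if rng.random() < epsilon:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Bootstrap from the greedy next action; no bootstrap at termination
            target = reward + gamma * np.max(q[next_state]) * (not terminated)
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
            done = terminated or truncated
    return q

# Compare the greedy policies learned with different discount factors:
# q_low  = train_q_learning(GridWorld(), gamma=0.9)
# q_high = train_q_learning(GridWorld(), gamma=0.99)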

4. SARSA vs. Q-Learning

Run both on CliffWalking. SARSA learns the safe path; Q-Learning learns the risky optimal path.
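
The difference comes down to the bootstrap target. A sketch of the two targets side by side (function names are illustrative; both algorithms then apply the same TD update):

import numpy as np

def q_learning_target(q, reward, next_state, gamma):
    # Off-policy: bootstrap from the greedy action in the next state
    return reward + gamma * np.max(q[next_state])

def sarsa_target(q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the behavior policy actually takes next
    return reward + gamma * q[next_state, next_action]

# Shared TD update for both algorithms:
# q[state, action] += alpha * (target - q[state, action])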

5. Stochastic Transitions

Add 10% random action probability. How do the learned policies differ?
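
One way to add slip without modifying the environment is an action wrapper. The SlipperyActions class below is a hypothetical sketch built on gymnasium's ActionWrapper, assuming the GridWorld environment from the Implementation section.

import gymnasium as gym
import numpy as np

class SlipperyActions(gym.ActionWrapper):
    """With probability slip_prob, replace the chosen action with a random one."""

    def __init__(self, env, slip_prob=0.1, seed=None):
        super().__init__(env)
        self.slip_prob = slip_prob
        self.rng = np.random.default_rng(seed)

    def action(self, action):
        if self.rng.random() < self.slip_prob:
            return int(self.rng.integers(self.env.action_space.n))
        return action

# env = SlipperyActions(GridWorld(size=4), slip_prob=0.1)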

Implementation

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class GridWorld(gym.Env):
    def __init__(self, size=4, obstacles=None, max_steps=100):
        super().__init__()
        self.size = size
        self.obstacles = obstacles or []  # cells that block movement (walls)
        self.max_steps = max_steps

        self.observation_space = spaces.Discrete(size * size)
        self.action_space = spaces.Discrete(4)  # UP, DOWN, LEFT, RIGHT

        self.goal = (size - 1, size - 1)
        self.start = (0, 0)
        self.agent_pos = self.start
        self.step_count = 0

        # Action effects: (row offset, column offset)
        self.actions = {
            0: (-1, 0),  # UP
            1: (1, 0),   # DOWN
            2: (0, -1),  # LEFT
            3: (0, 1),   # RIGHT
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent_pos = self.start
        self.step_count = 0
        return self._get_obs(), {}

    def step(self, action):
        # Move in the chosen direction, clamping at the grid boundary
        dr, dc = self.actions[action]
        new_row = max(0, min(self.size - 1, self.agent_pos[0] + dr))
        new_col = max(0, min(self.size - 1, self.agent_pos[1] + dc))
        new_pos = (new_row, new_col)

        # Blocked cells act like walls: the agent stays in place
        if new_pos not in self.obstacles:
            self.agent_pos = new_pos

        # Compute reward and termination
        if self.agent_pos == self.goal:
            reward = 10.0
            terminated = True
        else:
            reward = -0.1  # step penalty encourages short paths
            terminated = False

        # Truncate the episode once the step budget is exhausted
        self.step_count += 1
        truncated = self.step_count >= self.max_steps

        return self._get_obs(), reward, terminated, truncated, {}

    def _get_obs(self):
        # Flatten (row, col) into a single discrete index
        return self.agent_pos[0] * self.size + self.agent_pos[1]
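
A minimal usage sketch: roll out one episode with a uniform-random policy on the environment above.

env = GridWorld(size=4)
obs, info = env.reset(seed=0)
total_return, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
print(f"Episode return: {total_return:.1f}")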

Further Reading