πŸ“ AI Generated

GridWorld Playground

The canonical RL teaching environment. Navigate a grid, avoid obstacles, reach the goal. Perfect for understanding value functions and policies.

Why GridWorld?

GridWorld is the β€œHello World” of reinforcement learning. It’s simple enough to understand completely, yet rich enough to demonstrate every core concept:

  • States β€” grid positions
  • Actions β€” up, down, left, right
  • Rewards β€” goal (+10), step penalty (-0.1), walls (-0.5)
  • Transitions β€” deterministic or stochastic

When you truly understand GridWorld, you understand RL.

Specification

Mathematical Details

State Space:

  • Type: Discrete
  • Size: nΓ—mn \times m (e.g., 4Γ—4 = 16 states)
  • Representation: (row,col)(row, col) tuple or flattened index

Action Space:

  • Type: Discrete
  • Size: 4
  • Actions: UP (0), DOWN (1), LEFT (2), RIGHT (3)

Transition Dynamics:

  • Deterministic: Action moves agent in specified direction
  • Wall collision: Agent stays in place
  • Optional: Wind/stochasticity (agent may slip)

Reward Function:

  • Goal state: +10+10
  • Each step: βˆ’0.1-0.1 (encourages efficiency)
  • Wall collision: βˆ’0.5-0.5 (optional)
  • Obstacle: βˆ’5-5 and episode ends

Episode Termination:

  • Agent reaches goal
  • Agent hits obstacle
  • Maximum steps reached (e.g., 100)

Variants

| Variant | Grid | Obstacles | Dynamics | Difficulty |
| --- | --- | --- | --- | --- |
| Basic | 4Γ—4 | None | Deterministic | Easy |
| Walls | 6Γ—6 | Walls | Deterministic | Medium |
| Windy | 4Γ—4 | None | Stochastic | Medium |
| Cliff | 4Γ—12 | Cliff edge | Deterministic | Hard |
| Maze | 8Γ—8 | Complex walls | Deterministic | Hard |

Baseline Results

Performance on 4Γ—4 GridWorld (average return over 100 episodes):

| Algorithm | Mean Return | Episodes to Converge |
| --- | --- | --- |
| Random | -45.2 | β€” |
| Q-Learning | -2.8 | ~50 |
| SARSA | -3.1 | ~60 |
| Monte Carlo | -3.5 | ~100 |
| Optimal | -2.6 | β€” |

The optimal path takes 6 steps from corner to corner: -0.1 Γ— 6 + 10 = 9.4 return.

Suggested Experiments

1. Value Function Visualization

Watch the value function evolve during training. You should see values propagate backward from the goal.
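
One rough sketch of how to inspect this: keep a tabular Q-function, derive V(s) = max_a Q(s, a), and print it reshaped to the grid. The helper name and the (n_states, n_actions) table shape below are illustrative assumptions, not part of any fixed API.

import numpy as np

def print_value_grid(q_table, size=4):
    # State values under the greedy policy: V(s) = max_a Q(s, a)
    values = q_table.max(axis=1).reshape(size, size)
    for row in values:
        print(" ".join(f"{v:6.2f}" for v in row))

# Example call with an untrained (all-zero) Q-table for a 4Γ—4 grid
print_value_grid(np.zeros((16, 4)), size=4)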

2. Exploration vs. Exploitation

Try different Ξ΅ values (0.1, 0.3, 0.5). How do they affect learning speed and final performance?
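
A minimal Ξ΅-greedy selection sketch you could drop into a training loop; the function name and the (16, 4) Q-table shape are assumptions for illustration.

import numpy as np

def epsilon_greedy(q_table, state, epsilon, rng):
    # Explore uniformly with probability epsilon, otherwise act greedily
    if rng.random() < epsilon:
        return int(rng.integers(q_table.shape[1]))
    return int(np.argmax(q_table[state]))

rng = np.random.default_rng(0)
q_table = np.zeros((16, 4))  # assumed shape: (n_states, n_actions)
action = epsilon_greedy(q_table, state=0, epsilon=0.1, rng=rng)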

3. Discount Factor Impact

Compare Ξ³ = 0.9 vs. Ξ³ = 0.99. Which finds shorter paths?
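
For a concrete comparison, here is a compact tabular Q-learning loop with the discount factor exposed as a parameter. It is a sketch that assumes the GridWorld environment from the Implementation section below; the hyperparameter defaults (alpha, epsilon, episode count) are illustrative, not tuned.

import numpy as np

def train_q_learning(env, gamma=0.99, alpha=0.1, epsilon=0.1, episodes=500, seed=0):
    rng = np.random.default_rng(seed)
    q = np.zeros((env.observation_space.n, env.action_space.n))
    for _ in range(episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Epsilon-greedy behavior policy
            if rng.random() < epsilon:
                action = int(rng.integers(env.action_space.n))
            else:
                action = int(np.argmax(q[state]))
            next_state, reward, terminated, truncated, _ = env.step(action)
            # Bootstrap from the greedy next action; no bootstrap at termination
            target = reward + gamma * np.max(q[next_state]) * (not terminated)
            q[state, action] += alpha * (target - q[state, action])
            state = next_state
            done = terminated or truncated
    return q

# Compare the greedy policies learned with different discount factors:
# q_low  = train_q_learning(GridWorld(), gamma=0.9)
# q_high = train_q_learning(GridWorld(), gamma=0.99)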

4. SARSA vs. Q-Learning

Run both on CliffWalking. SARSA learns the safe path; Q-Learning learns the risky optimal path.
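
The difference comes down to the bootstrap target. A sketch of the two targets side by side (function names are illustrative; both algorithms then apply the same TD update):

import numpy as np

def q_learning_target(q, reward, next_state, gamma):
    # Off-policy: bootstrap from the greedy action in the next state
    return reward + gamma * np.max(q[next_state])

def sarsa_target(q, reward, next_state, next_action, gamma):
    # On-policy: bootstrap from the action the behavior policy actually takes next
    return reward + gamma * q[next_state, next_action]

# Shared TD update for both algorithms:
# q[state, action] += alpha * (target - q[state, action])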

5. Stochastic Transitions

Add 10% random action probability. How do the learned policies differ?
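
One way to add slip without modifying the environment is an action wrapper. The SlipperyActions class below is a hypothetical sketch built on gymnasium's ActionWrapper, assuming the GridWorld environment from the Implementation section.

import gymnasium as gym
import numpy as np

class SlipperyActions(gym.ActionWrapper):
    """With probability slip_prob, replace the chosen action with a random one."""

    def __init__(self, env, slip_prob=0.1, seed=None):
        super().__init__(env)
        self.slip_prob = slip_prob
        self.rng = np.random.default_rng(seed)

    def action(self, action):
        if self.rng.random() < self.slip_prob:
            return int(self.rng.integers(self.env.action_space.n))
        return action

# env = SlipperyActions(GridWorld(size=4), slip_prob=0.1)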

Implementation

import numpy as np
import gymnasium as gym
from gymnasium import spaces

class GridWorld(gym.Env):
    def __init__(self, size=4, obstacles=None, max_steps=100):
        super().__init__()
        self.size = size
        self.obstacles = obstacles or []  # cells that block movement (walls)
        self.max_steps = max_steps

        self.observation_space = spaces.Discrete(size * size)
        self.action_space = spaces.Discrete(4)  # UP, DOWN, LEFT, RIGHT

        self.goal = (size - 1, size - 1)
        self.start = (0, 0)
        self.agent_pos = self.start
        self.step_count = 0

        # Action effects: (row offset, column offset)
        self.actions = {
            0: (-1, 0),  # UP
            1: (1, 0),   # DOWN
            2: (0, -1),  # LEFT
            3: (0, 1),   # RIGHT
        }

    def reset(self, seed=None, options=None):
        super().reset(seed=seed)
        self.agent_pos = self.start
        self.step_count = 0
        return self._get_obs(), {}

    def step(self, action):
        # Move in the chosen direction, clamping at the grid boundary
        dr, dc = self.actions[action]
        new_row = max(0, min(self.size - 1, self.agent_pos[0] + dr))
        new_col = max(0, min(self.size - 1, self.agent_pos[1] + dc))
        new_pos = (new_row, new_col)

        # Blocked cells act like walls: the agent stays in place
        if new_pos not in self.obstacles:
            self.agent_pos = new_pos

        # Compute reward and termination
        if self.agent_pos == self.goal:
            reward = 10.0
            terminated = True
        else:
            reward = -0.1  # step penalty encourages short paths
            terminated = False

        # Truncate the episode once the step budget is exhausted
        self.step_count += 1
        truncated = self.step_count >= self.max_steps

        return self._get_obs(), reward, terminated, truncated, {}

    def _get_obs(self):
        # Flatten (row, col) into a single discrete index
        return self.agent_pos[0] * self.size + self.agent_pos[1]
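
A minimal usage sketch: roll out one episode with a uniform-random policy on the environment above.

env = GridWorld(size=4)
obs, info = env.reset(seed=0)
total_return, done = 0.0, False
while not done:
    action = env.action_space.sample()          # random policy
    obs, reward, terminated, truncated, info = env.step(action)
    total_return += reward
    done = terminated or truncated
print(f"Episode return: {total_return:.1f}")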

Further Reading