Elevator Dispatch with Multi-Agent RL

How to formulate elevator dispatch as a multi-agent RL problem, minimizing wait times through coordination without communication.


You’re on the 8th floor, running late. You press the elevator button and wait. And wait.

An empty elevator passes by—heading somewhere else. When one finally arrives, it’s already packed.

Why does this happen?

Elevators solve a coordination problem: multiple agents working together to minimize wait times across an entire building, adapting to changing traffic patterns throughout the day.

This isn’t simple scheduling—it’s sequential decision-making under uncertainty. Passenger arrivals are random, demands shift (morning rush vs quiet periods), and elevators decide in real-time without knowing future requests.

Perfect for reinforcement learning.


See It In Action

Before diving into the details, try controlling elevators yourself! Compare different algorithms:

Elevator Dispatch Simulation

(Interactive simulation: a building view of floors 0-9 with three elevators E0-E2, each of capacity 8, plus live metrics: timestep, passengers waiting, average and max wait, passengers served, and utilization.)

Legend: Blue = Low utilization, Yellow = Medium, Red = High | Orange dots = Waiting passengers

Quick experiment:

  1. Set algorithm to “Nearest Car”, traffic to “Morning Rush”
  2. Press Play and watch wait times
  3. Switch to “Random” — notice the chaos!
  4. Try “RL” — see the learned coordination

Now let’s understand why this problem is challenging…


Why Traditional Rules Fail

Traditional elevator systems use simple heuristics:

First-Come-First-Served

Serve requests in arrival order

❌ Can starve upper floors

SCAN Algorithm

Continue in direction until no more requests

❌ Ignores individual wait times

Nearest Car

Send closest available elevator

❌ No lookahead

These work in simple scenarios but struggle when faced with:

  • Time-varying traffic — Rush hours vs quiet periods
  • Multi-objective tradeoffs — Wait time vs energy vs fairness
  • Coordination — Multiple elevators working together
  • Long-term planning — Positioning for future demand

RL agents learn policies that handle all of this.


The Problem Domain

Our Test Building

  • 10 floors
  • 3 elevators
  • Capacity: 8 passengers per elevator
  • Episode: 10 minutes

Each episode simulates 300 timesteps (10 minutes at 2 seconds per step). Elevators move at 0.5 floors/timestep.

Traffic Patterns

Real buildings have predictable patterns throughout the day:

  • 🌅 Morning Rush (7-9am): 70% lobby → upper floors, high volume
  • 🍽️ Lunch Time (12-1pm): bidirectional, moderate volume
  • 🌆 Evening Rush (5-7pm): 70% upper floors → lobby, high volume
  • 🌙 Quiet Period: low volume, random destinations
ℹ️Note

Passengers arrive following a Poisson process with time-varying rates (λ changes by pattern).
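
To make the arrival model concrete, here is a minimal sketch of such a time-varying Poisson generator. The 70% lobby-origin (or lobby-destination) split comes from the patterns above; the per-pattern rates and the function name `sample_arrivals` are illustrative assumptions, not the rlbook implementation.

```python
import numpy as np

# Hypothetical arrival rates (expected passengers per timestep) for each pattern.
RATES = {"morning_rush": 0.8, "lunch": 0.4, "evening_rush": 0.8, "quiet": 0.1}

def sample_arrivals(pattern, n_floors=10, rng=None):
    """Sample this timestep's new passengers: a Poisson count, then
    origin/destination floors assigned according to the traffic pattern."""
    rng = rng or np.random.default_rng()
    passengers = []
    for _ in range(rng.poisson(RATES[pattern])):
        if pattern == "morning_rush" and rng.random() < 0.7:
            origin, dest = 0, rng.integers(1, n_floors)      # lobby → upper floor
        elif pattern == "evening_rush" and rng.random() < 0.7:
            origin, dest = rng.integers(1, n_floors), 0      # upper floor → lobby
        else:
            origin, dest = rng.choice(n_floors, size=2, replace=False)
        passengers.append((int(origin), int(dest)))
    return passengers
```

Each simulation step, the environment would extend its waiting queue with `sample_arrivals(current_pattern)`.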

Success Metrics

We measure performance across multiple dimensions:

  • ⏱️ Average wait time: the primary user-experience metric
  • ⚠️ Max wait time: fairness; prevent starvation
  • 📊 Throughput: how many passengers served
  • ⚡ Energy cost: total floors traveled (efficiency)
  • 📈 Utilization: how full elevators are when moving

The RL agent must balance all of these simultaneously.


The MDP Formulation

State Space: What Each Elevator Sees

Each elevator has a ~46-dimensional observation vector containing:

Own State (14 dims)

  • Current floor
  • Direction (UP/DOWN/IDLE)
  • Passenger count
  • Destination floors (which buttons are pressed)

Pending Requests (20 dims)

  • Waiting passengers per floor
  • Request directions (up/down buttons)
  • How long they have been waiting

Other Elevators (8 dims)

  • Positions of the other 2 elevators
  • Their directions
  • Their passenger loads

Time Context (4 dims)

  • Traffic pattern, one-hot: Morning/Lunch/Evening/Quiet

Mathematical Details

The observation for elevator $i$ at time $t$ is:

$$\mathbf{o}_i^t = [\mathbf{s}_i^t, \mathbf{r}^t, \mathbf{e}_{-i}^t, \mathbf{c}^t]$$

where:

  • $\mathbf{s}_i^t \in \mathbb{R}^{14}$: own state (floor, direction one-hot, passenger count, destination floors as binary flags)
  • $\mathbf{r}^t \in \mathbb{R}^{20}$: pending requests (waiting passengers per floor, up/down buttons)
  • $\mathbf{e}_{-i}^t \in \mathbb{R}^{8}$: other elevators (2 elevators × 4 features each)
  • $\mathbf{c}^t \in \mathbb{R}^{4}$: traffic context (one-hot)

Total observation dimension: $d = 46$
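
One plausible way to assemble such a fixed-size vector is sketched below. The exact per-feature split is a guess: for instance, `own_state` reaches 14 dims via a floor one-hot, direction one-hot, and normalized load, whereas the chapter's version also encodes destination buttons, so the real layout differs.

```python
import numpy as np

N_FLOORS = 10

def own_state(floor, direction, load, capacity=8):
    """Illustrative 14-dim own state: floor one-hot (10) +
    direction one-hot (3: up/down/idle) + normalized load (1)."""
    return np.concatenate([np.eye(N_FLOORS)[floor],
                           np.eye(3)[direction],
                           [load / capacity]])

def requests_block(waiting_counts, longest_waits, wait_cap=120.0):
    """Illustrative 20-dim request block: normalized waiting count per
    floor (10) + clipped, normalized longest wait per floor (10)."""
    return np.concatenate([np.asarray(waiting_counts) / 8.0,
                           np.clip(longest_waits, 0, wait_cap) / wait_cap])

def observation(own, requests, others, context):
    """Concatenate the four blocks into the 46-dim observation."""
    obs = np.concatenate([own, requests, others, context]).astype(np.float32)
    assert obs.shape == (46,)
    return obs
```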

Action Space: Where To Go Next

Each elevator chooses a target floor (0-9) every timestep.

💡Tip

We use high-level actions (target floor) rather than low-level control (MOVE_UP/MOVE_DOWN/OPEN_DOOR). The environment handles pathfinding—if target is floor 7, the elevator moves toward 7, stopping to pick up/drop off passengers en route.

  • Per elevator: Discrete(10)
  • Joint (3 elevators): 10³ = 1000 combinations
Mathematical Details

At each timestep $t$, each elevator $i$ selects an action:

$$a_i^t \in \mathcal{A}_i = \{0, 1, \ldots, 9\}$$

The joint action is $\mathbf{a}^t = (a_1^t, a_2^t, a_3^t)$.

Reward Design: Balancing Multiple Goals

The reward function must balance competing objectives:

  • Wait Time Penalty: -1.0 per waiting passenger per timestep (primary goal: minimize wait)
  • Starvation Penalty: -5.0 per passenger waiting > 60 seconds (fairness: don't ignore anyone)
  • Delivery Bonus: +10.0 per passenger delivered (reward completion, not just attempts)
  • Energy Cost: -0.1 per floor moved (efficiency: discourage wasted movement)
💡Tip

Why is delivery bonus (+10) so much larger than wait penalty (-1)? Because it takes multiple timesteps to pick up and deliver a passenger. This ratio prevents short-term thinking.

Total reward each timestep:

$$r^t = r_{\text{wait}} + r_{\text{starvation}} + r_{\text{delivery}} + r_{\text{energy}}$$

Mathematical Details

Formally, the reward at timestep $t$ is:

$$r^t = -\sum_{p \in W^t} w(p) + 10 \cdot |D^t| - 0.1 \cdot \sum_{i=1}^{3} m_i^t$$

where:

  • $W^t$: set of waiting passengers at time $t$
  • $w(p) = \begin{cases} 6 & \text{if wait time} > 60\text{s} \\ 1 & \text{otherwise} \end{cases}$
  • $D^t$: set of passengers delivered at time $t$
  • $m_i^t$: floors moved by elevator $i$ at time $t$

The discount factor is $\gamma = 0.99$ (episodes are short, so we weight the near future heavily).
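
The formula translates directly into code; a minimal sketch (hypothetical helper name, mirroring the terms above with the starvation case folded into the weight w(p)):

```python
def step_reward(wait_times, n_delivered, floors_moved, starvation_threshold=60.0):
    """One timestep's reward: wait/starvation penalty + delivery bonus - energy.
    wait_times: seconds each currently waiting passenger has waited.
    floors_moved: floors traveled by each elevator this step."""
    wait_pen = -sum(6.0 if w > starvation_threshold else 1.0 for w in wait_times)
    delivery_bonus = 10.0 * n_delivered
    energy_cost = -0.1 * sum(floors_moved)
    return wait_pen + delivery_bonus + energy_cost
```

For example, two waiting passengers (one over 60 s), one delivery, and one floor of total movement give `-(1 + 6) + 10 - 0.1 = 2.9`.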

Episode Structure

  • Duration: 300 steps (10 min)
  • Start position: floor 0 (lobby)
  • Arrivals: Poisson (λ varies by pattern)
  • Termination: fixed horizon (no early end)

This creates episodes with rush hours, coordination challenges, and consequences for positioning decisions.



The Multi-Agent Challenge

With a single elevator, this is a standard MDP. With three elevators? Much harder.

  1. Credit assignment: when wait times improve, which elevator deserves credit? All three contribute, but their actions are coupled.
  2. Non-stationarity: from elevator 1's view, the world keeps changing because elevators 2 and 3 are learning too. What worked yesterday fails today.
  3. Coordination without communication: elevators must learn to partition floors or specialize without talking to each other. Coordination emerges from shared experience.
  4. Safe exploration: trying "what if I skip this request?" can cause terrible wait times. You can't explore recklessly in production.
Mathematical Details

This is a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

Formally:

  • Agents: $\mathcal{N} = \{1, 2, 3\}$ (three elevators)
  • Joint observation: $\mathbf{o}^t = (o_1^t, o_2^t, o_3^t)$, where each $o_i^t$ depends on the true state $s^t$
  • Joint action: $\mathbf{a}^t = (a_1^t, a_2^t, a_3^t)$
  • Shared reward: $r^t$ (all agents receive the same reward signal)
  • Transition: $s^{t+1} \sim P(\cdot \mid s^t, \mathbf{a}^t)$

The goal is to find a joint policy $\pi = (\pi_1, \pi_2, \pi_3)$ that maximizes the expected return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r^t\right]$$
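
The return inside $J(\pi)$ is easy to compute for a logged episode via a backward recursion; a small sketch (function name illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode, iterating
    backward so each step costs O(1)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

`discounted_return([1.0, 1.0, 1.0], gamma=0.5)` gives `1.75` (= 1 + 0.5 + 0.25).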


Baseline Approaches

Before applying RL, let’s see what simple rules achieve:

🎲 Random
Each elevator picks random floors.
✓ Dead simple
✗ Ignores all info
✗ Terrible
🚗 Nearest Car
Closest idle elevator takes each request.
✓ Intuitive
✗ Elevators cluster
✗ No lookahead
↕️ SCAN
Continue in direction until no more requests, then reverse.
✓ Good for uni-directional
✗ Poor for bidirectional
✗ No coordination
ℹ️Note

These baselines set performance bounds our RL agent must beat.
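
For concreteness, the nearest-car rule fits in a few lines. This sketch assigns one hall call at a time and skips busy elevators; the rlbook baseline may resolve ties and busyness differently:

```python
def nearest_car(elevator_floors, elevator_busy, request_floor):
    """Return the index of the closest idle elevator to a hall call,
    or None if every elevator is busy."""
    best, best_dist = None, float("inf")
    for i, (floor, busy) in enumerate(zip(elevator_floors, elevator_busy)):
        if busy:
            continue
        dist = abs(floor - request_floor)
        if dist < best_dist:
            best, best_dist = i, dist
    return best
```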


The RL Solution: Independent Q-Learning

Our approach: Independent Q-Learning with a shared replay buffer.

💡 Key Idea
Each elevator has its own Q-network learning $Q_i(o_i, a_i)$, but all three share a replay buffer.

  • Decentralized execution: each elevator decides independently
  • Implicit coordination: elevators learn from each other's experiences
  • Simple: easier than QMIX, MADDPG, etc.

Network Architecture

Q-Network (one per elevator)

  • Input: 46 dims
  • Hidden: [128, 128] ReLU
  • Output: 10 Q-values

Training Hyperparameters

  • Replay buffer: 50,000
  • Batch size: 64
  • ε-greedy: 1.0 → 0.01
  • Target update: every 100 steps
  • Optimizer: Adam, lr = 0.001
Implementation

Here’s the core training loop:

from rlbook.envs import ElevatorDispatch
from rlbook.agents import ElevatorDQN

# Create environment
env = ElevatorDispatch(
    n_floors=10,
    n_elevators=3,
    traffic_pattern="morning_rush",
    max_timesteps=300
)

# Create multi-agent DQN
agent = ElevatorDQN(
    n_floors=10,
    n_elevators=3,
    observation_dim=46,
    hidden_dims=(128, 128),
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995
)

# Training loop
for episode in range(1000):
    obs, info = env.reset()
    episode_reward = 0

    for step in range(env.max_timesteps):
        # Each elevator selects action from its Q-network
        actions = agent.select_actions(obs, training=True)

        # Environment step
        next_obs, reward, done, truncated, info = env.step(actions)

        # Store all three elevators' transitions
        agent.store_transitions(obs, actions, reward, next_obs, done)

        # Train all networks
        loss = agent.train_step()

        episode_reward += reward
        obs = next_obs

        if done or truncated:
            break

    # Decay exploration
    agent.decay_epsilon()

Full implementation: code/rlbook/examples/train_elevator.py

Mathematical Details

Each elevator $i$ learns a Q-function $Q_i: \mathcal{O}_i \times \mathcal{A}_i \to \mathbb{R}$ via:

$$Q_i(o_i^t, a_i^t) \leftarrow Q_i(o_i^t, a_i^t) + \alpha \left[ r^t + \gamma \max_{a_i'} Q_i(o_i^{t+1}, a_i') - Q_i(o_i^t, a_i^t) \right]$$

Note that the reward $r^t$ is shared (global), but each elevator updates based on its own observation-action pairs.

The greedy policy for elevator $i$ is:

$$\pi_i(o_i) = \arg\max_{a_i} Q_i(o_i, a_i)$$

During training we use ε-greedy exploration:

  • With probability $\epsilon$: select a random floor
  • With probability $1-\epsilon$: select $\pi_i(o_i^t)$
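
The update rule is easiest to see in tabular form, where each $Q_i$ is a lookup table; the DQN version replaces the table row with a network forward pass and averages the update over a replay batch. A minimal sketch (tabular stand-in, names illustrative):

```python
import numpy as np

def q_update(Q, obs_idx, action, reward, next_obs_idx,
             alpha=0.1, gamma=0.99, done=False):
    """Tabular TD update: Q[s,a] += alpha * (target - Q[s,a])."""
    target = reward if done else reward + gamma * Q[next_obs_idx].max()
    Q[obs_idx, action] += alpha * (target - Q[obs_idx, action])

def epsilon_greedy(q_row, epsilon, rng):
    """ε-greedy over the 10 target floors: explore with prob. ε, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```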

Results

Training time: 1000 episodes (~30 minutes on a laptop CPU)

  • 🎲 Random: 45.2s avg wait, 87 served (baseline, terrible)
  • 🚗 Nearest Car: 18.3s avg wait, 142 served (decent, no coordination)
  • ↕️ SCAN: 16.7s avg wait, 148 served (good for uni-directional traffic)
  • 🧠 DQN (trained RL): 12.4s avg wait, 156 served (26% better than SCAN)

Emergent Behaviors

The trained agent discovered coordination strategies we never programmed:

  • 🗺️ Implicit zoning: elevators naturally partition floors (e.g., one serves 0-3, another 4-6, another 7-9)
  • 🎯 Anticipatory positioning: during quiet periods, position at likely request floors
  • ↗️ Direction awareness: prefer picking up passengers going the same direction
  • ⚖️ Load balancing: if one elevator is full, others compensate nearby
ℹ️Note

This coordination emerged from independent learning—we never explicitly programmed these strategies!


Challenges & Solutions

❌ Challenge: Slow Initial Learning
With ε = 1.0, early episodes are random walks → very negative rewards → slow buffer fill
Solution:
  • Pre-fill the buffer with nearest-car trajectories (100 episodes)
  • Or use shaped rewards (bonus for approaching requests)

❌ Challenge: Non-Stationarity
All three elevators' policies change during training → a non-stationary environment
Solution:
  • A shared replay buffer stabilizes learning
  • Target networks reduce the moving-target problem
  • Slower epsilon decay allows adaptation

❌ Challenge: Exploration in Production
We can't let elevators explore wildly in a live building; real passengers would complain!
Solution:
  • Train fully in simulation first
  • Deploy with low ε (0.05) for minimal exploration
  • Constrain actions to "reasonable" floors only
  • Safety fallback: if wait > threshold, revert to a nearest-car override
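
The safety fallback can be a thin wrapper around the RL policy's joint action. In this sketch the threshold and override rule are illustrative, not from the chapter:

```python
def safe_actions(rl_actions, max_wait_per_floor, elevator_floors, threshold=90.0):
    """If any floor's longest wait exceeds the threshold, override the
    nearest elevator's RL action and send it to that floor."""
    actions = list(rl_actions)
    worst = max(range(len(max_wait_per_floor)), key=lambda f: max_wait_per_floor[f])
    if max_wait_per_floor[worst] > threshold:
        nearest = min(range(len(elevator_floors)),
                      key=lambda i: abs(elevator_floors[i] - worst))
        actions[nearest] = worst
    return actions
```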

Deployment Considerations

Moving from simulation to real buildings requires careful engineering:

🔄 Sim-to-Real Gap
Gaps:
  • Real passengers ≠ Poisson
  • Mechanical delays & failures
  • Special events (fire, maintenance)
Fixes:
  • Use real building data
  • Domain randomization
  • Continuous fine-tuning

🛡️ Safety Constraints
  • Hard timeout: 2 min max wait
  • Minimum service frequency per floor
  • Emergency mode overrides

📊 Monitoring
Track:
  • Wait time (mean, p95, max)
  • Starvation events
  • Utilization & energy
Red flags:
  • Wait spike → revert to baseline
  • Elevator clustering → check coordination


Extensions

Ways to push this further:

  • 🏢 Larger Buildings: 50 floors × 10 elevators. Use graph neural networks (GNNs) to encode relationships; consider CTDE methods like QMIX.
  • ⚡ Express Elevators: some elevators skip floors (1, 10, 20…). RL learns when to use express vs local service, beating fixed rules.
  • ⚖️ Multi-Objective Optimization: trade off wait time vs energy vs fairness with a weighted reward. Building managers tune the weights; a Pareto front identifies non-dominated policies.
  • 🎯 Destination Dispatch: passengers enter their destination before boarding → full observability → easier credit assignment and better routing.
  • 🔄 Lifelong Learning: building patterns shift (new tenants, seasons). Keep the replay buffer in production, fine-tune nightly, and detect distribution drift to retrain automatically.


Try It Yourself

Hands-On Training

Train your own elevator dispatch agent:

# Clone the repository
git clone https://github.com/ebilgin/rlbook
cd rlbook/code

# Install dependencies
pip install -e .

# Train for 1000 episodes (takes ~30 minutes on CPU)
python -m rlbook.examples.train_elevator --episodes 1000

# Try different traffic patterns
python -m rlbook.examples.train_elevator --traffic evening_rush

# Larger building
python -m rlbook.examples.train_elevator --n-floors 20 --n-elevators 5

Colab Notebook: coming soon.

Exercises

1. Reward Shaping: Modify the reward function to prioritize fairness over average wait time. How does this change behavior?

2. Architecture Experiments: Try different network sizes ([64,64] vs [256,256]). How does this affect sample efficiency?

3. Baseline Improvement: Implement a smarter nearest-car policy that considers elevator direction. Can you beat RL?

4. Traffic Generalization: Train on morning_rush, then evaluate on evening_rush. How well does it transfer?

5. Communication: Allow elevators to share their target floors with each other. Does this improve coordination?

Key Takeaways

  1. RL shines for coordination problems: when multiple agents must work together without explicit communication, RL discovers emergent coordination.
  2. Reward engineering is critical: small changes (delivery bonus, starvation penalty) drastically affect learned behavior.
  3. Baselines matter: simple heuristics (SCAN) can be surprisingly good. Always compare to strong baselines, not just random.
  4. The sim-to-real gap is real: training in simulation is easy; deployment requires careful safety engineering and monitoring.
  5. Multi-agent is hard: non-stationarity and credit assignment make MARL harder than single-agent RL. Independent Q-learning with a shared replay buffer is a good starting point.

Further Reading

Related Applications:

  • Traffic Signal Control: a similar multi-agent coordination problem
  • Warehouse Robotics: fleet coordination with physical constraints

This application demonstrates that RL isn't just for games and robotics: it's powerful for any sequential decision problem with delayed rewards and coordination requirements.