Elevator Dispatch with Multi-Agent RL

How to formulate elevator dispatch as a multi-agent RL problem, minimizing wait times through coordination without communication.


You’re on the 8th floor, running late. You press the elevator button and wait. And wait.

An empty elevator passes by—heading somewhere else. When one finally arrives, it’s already packed.

Why does this happen?

Elevators solve a coordination problem: multiple agents working together to minimize wait times across an entire building, adapting to changing traffic patterns throughout the day.

This isn’t simple scheduling—it’s sequential decision-making under uncertainty. Passenger arrivals are random, demands shift (morning rush vs quiet periods), and elevators decide in real-time without knowing future requests.

Perfect for reinforcement learning.


See It In Action

Before diving into the details, try controlling elevators yourself! Compare different algorithms:

Elevator Dispatch Simulation

(Interactive simulation: a building view of floors 0-9 with three elevators E0-E2, each of capacity 8, plus live metrics: timestep, passengers waiting, average and max wait, passengers served, and utilization.)

Legend: Blue = Low utilization, Yellow = Medium, Red = High | Orange dots = Waiting passengers

Quick experiment:

  1. Set algorithm to “Nearest Car”, traffic to “Morning Rush”
  2. Press Play and watch wait times
  3. Switch to “Random” — notice the chaos!
  4. Try “RL” — see the learned coordination

Now let’s understand why this problem is challenging…


Why Traditional Rules Fail

Traditional elevator systems use simple heuristics:

First-Come-First-Served

Serve requests in arrival order

❌ Can starve upper floors

SCAN Algorithm

Continue in direction until no more requests

❌ Ignores individual wait times

Nearest Car

Send closest available elevator

❌ No lookahead

These work in simple scenarios but struggle when faced with:

  • Time-varying traffic — Rush hours vs quiet periods
  • Multi-objective tradeoffs — Wait time vs energy vs fairness
  • Coordination — Multiple elevators working together
  • Long-term planning — Positioning for future demand

RL agents learn policies that handle all of this.


The Problem Domain

Our Test Building

  • 10 floors
  • 3 elevators
  • Capacity: 8 passengers per elevator
  • Episode: 10 minutes

Each episode simulates 300 timesteps (10 minutes at 2 seconds per step). Elevators move at 0.5 floors/timestep.

Traffic Patterns

Real buildings have predictable patterns throughout the day:

  • 🌅 Morning Rush (7-9am): 70% lobby → upper floors, high volume
  • 🍽️ Lunch Time (12-1pm): bidirectional, moderate volume
  • 🌆 Evening Rush (5-7pm): 70% upper floors → lobby, high volume
  • 🌙 Quiet Period: low volume, random destinations
ℹ️Note

Passengers arrive following a Poisson process with time-varying rates (λ changes by pattern).
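
To make the arrival model concrete, here is a minimal sketch of such a time-varying Poisson generator. The 70% lobby-origin (or lobby-destination) split comes from the patterns above; the per-pattern rates and the function name `sample_arrivals` are illustrative assumptions, not the rlbook implementation.

```python
import numpy as np

# Hypothetical arrival rates (expected passengers per timestep) for each pattern.
RATES = {"morning_rush": 0.8, "lunch": 0.4, "evening_rush": 0.8, "quiet": 0.1}

def sample_arrivals(pattern, n_floors=10, rng=None):
    """Sample this timestep's new passengers: a Poisson count, then
    origin/destination floors assigned according to the traffic pattern."""
    rng = rng or np.random.default_rng()
    passengers = []
    for _ in range(rng.poisson(RATES[pattern])):
        if pattern == "morning_rush" and rng.random() < 0.7:
            origin, dest = 0, rng.integers(1, n_floors)      # lobby → upper floor
        elif pattern == "evening_rush" and rng.random() < 0.7:
            origin, dest = rng.integers(1, n_floors), 0      # upper floor → lobby
        else:
            origin, dest = rng.choice(n_floors, size=2, replace=False)
        passengers.append((int(origin), int(dest)))
    return passengers
```

Each simulation step, the environment would extend its waiting queue with `sample_arrivals(current_pattern)`.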

Success Metrics

We measure performance across multiple dimensions:

  • ⏱️ Average wait time: the primary user-experience metric
  • ⚠️ Max wait time: fairness; prevent starvation
  • 📊 Throughput: how many passengers served
  • ⚡ Energy cost: total floors traveled (efficiency)
  • 📈 Utilization: how full elevators are when moving

The RL agent must balance all of these simultaneously.


The MDP Formulation

State Space: What Each Elevator Sees

Each elevator has a ~46-dimensional observation vector containing:

Own State (14 dims)

  • Current floor
  • Direction (UP/DOWN/IDLE)
  • Passenger count
  • Destination floors (which buttons are pressed)

Pending Requests (20 dims)

  • Waiting passengers per floor
  • Request directions (up/down buttons)
  • How long they have been waiting

Other Elevators (8 dims)

  • Positions of the other 2 elevators
  • Their directions
  • Their passenger loads

Time Context (4 dims)

  • Traffic pattern, one-hot: Morning/Lunch/Evening/Quiet

Mathematical Details

The observation for elevator $i$ at time $t$ is:

$$\mathbf{o}_i^t = [\mathbf{s}_i^t, \mathbf{r}^t, \mathbf{e}_{-i}^t, \mathbf{c}^t]$$

where:

  • $\mathbf{s}_i^t \in \mathbb{R}^{14}$: own state (floor, direction one-hot, passenger count, destination floors as binary flags)
  • $\mathbf{r}^t \in \mathbb{R}^{20}$: pending requests (waiting passengers per floor, up/down buttons)
  • $\mathbf{e}_{-i}^t \in \mathbb{R}^{8}$: other elevators (2 elevators × 4 features each)
  • $\mathbf{c}^t \in \mathbb{R}^{4}$: traffic context (one-hot)

Total observation dimension: $d = 46$
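
One plausible way to assemble such a fixed-size vector is sketched below. The exact per-feature split is a guess: for instance, `own_state` reaches 14 dims via a floor one-hot, direction one-hot, and normalized load, whereas the chapter's version also encodes destination buttons, so the real layout differs.

```python
import numpy as np

N_FLOORS = 10

def own_state(floor, direction, load, capacity=8):
    """Illustrative 14-dim own state: floor one-hot (10) +
    direction one-hot (3: up/down/idle) + normalized load (1)."""
    return np.concatenate([np.eye(N_FLOORS)[floor],
                           np.eye(3)[direction],
                           [load / capacity]])

def requests_block(waiting_counts, longest_waits, wait_cap=120.0):
    """Illustrative 20-dim request block: normalized waiting count per
    floor (10) + clipped, normalized longest wait per floor (10)."""
    return np.concatenate([np.asarray(waiting_counts) / 8.0,
                           np.clip(longest_waits, 0, wait_cap) / wait_cap])

def observation(own, requests, others, context):
    """Concatenate the four blocks into the 46-dim observation."""
    obs = np.concatenate([own, requests, others, context]).astype(np.float32)
    assert obs.shape == (46,)
    return obs
```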

Action Space: Where To Go Next

Each elevator chooses a target floor (0-9) every timestep.

💡Tip

We use high-level actions (target floor) rather than low-level control (MOVE_UP/MOVE_DOWN/OPEN_DOOR). The environment handles pathfinding—if target is floor 7, the elevator moves toward 7, stopping to pick up/drop off passengers en route.

  • Per elevator: Discrete(10)
  • Joint (3 elevators): 10³ = 1000 combinations
Mathematical Details

At each timestep $t$, each elevator $i$ selects an action:

$$a_i^t \in \mathcal{A}_i = \{0, 1, \ldots, 9\}$$

The joint action is $\mathbf{a}^t = (a_1^t, a_2^t, a_3^t)$.

Reward Design: Balancing Multiple Goals

The reward function must balance competing objectives:

  • Wait Time Penalty: -1.0 per waiting passenger per timestep (primary goal: minimize wait)
  • Starvation Penalty: -5.0 per passenger waiting > 60 seconds (fairness: don't ignore anyone)
  • Delivery Bonus: +10.0 per passenger delivered (reward completion, not just attempts)
  • Energy Cost: -0.1 per floor moved (efficiency: discourage wasted movement)
💡Tip

Why is delivery bonus (+10) so much larger than wait penalty (-1)? Because it takes multiple timesteps to pick up and deliver a passenger. This ratio prevents short-term thinking.

Total reward each timestep:

$$r^t = r_{\text{wait}} + r_{\text{starvation}} + r_{\text{delivery}} + r_{\text{energy}}$$

Mathematical Details

Formally, the reward at timestep $t$ is:

$$r^t = -\sum_{p \in W^t} w(p) + 10 \cdot |D^t| - 0.1 \cdot \sum_{i=1}^{3} m_i^t$$

where:

  • $W^t$: set of waiting passengers at time $t$
  • $w(p) = \begin{cases} 6 & \text{if wait time} > 60\text{s} \\ 1 & \text{otherwise} \end{cases}$
  • $D^t$: set of passengers delivered at time $t$
  • $m_i^t$: floors moved by elevator $i$ at time $t$

The discount factor is $\gamma = 0.99$ (episodes are short, so we weight the near future heavily).
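
The formula translates directly into code; a minimal sketch (hypothetical helper name, mirroring the terms above with the starvation case folded into the weight w(p)):

```python
def step_reward(wait_times, n_delivered, floors_moved, starvation_threshold=60.0):
    """One timestep's reward: wait/starvation penalty + delivery bonus - energy.
    wait_times: seconds each currently waiting passenger has waited.
    floors_moved: floors traveled by each elevator this step."""
    wait_pen = -sum(6.0 if w > starvation_threshold else 1.0 for w in wait_times)
    delivery_bonus = 10.0 * n_delivered
    energy_cost = -0.1 * sum(floors_moved)
    return wait_pen + delivery_bonus + energy_cost
```

For example, two waiting passengers (one over 60 s), one delivery, and one floor of total movement give `-(1 + 6) + 10 - 0.1 = 2.9`.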

Episode Structure

  • Duration: 300 steps (10 min)
  • Start position: floor 0 (lobby)
  • Arrivals: Poisson (λ varies by pattern)
  • Termination: fixed horizon (no early end)

This creates episodes with rush hours, coordination challenges, and consequences for positioning decisions.



The Multi-Agent Challenge

With a single elevator, this is a standard MDP. With three elevators? Much harder.

  1. Credit assignment: when wait times improve, which elevator deserves credit? All three contribute, but their actions are coupled.
  2. Non-stationarity: from elevator 1's view, the world keeps changing because elevators 2 and 3 are learning too. What worked yesterday fails today.
  3. Coordination without communication: elevators must learn to partition floors or specialize without talking to each other. Coordination emerges from shared experience.
  4. Safe exploration: trying "what if I skip this request?" can cause terrible wait times. You can't explore recklessly in production.
Mathematical Details

This is a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).

Formally:

  • Agents: $\mathcal{N} = \{1, 2, 3\}$ (three elevators)
  • Joint observation: $\mathbf{o}^t = (o_1^t, o_2^t, o_3^t)$, where each $o_i^t$ depends on the true state $s^t$
  • Joint action: $\mathbf{a}^t = (a_1^t, a_2^t, a_3^t)$
  • Shared reward: $r^t$ (all agents receive the same reward signal)
  • Transition: $s^{t+1} \sim P(\cdot \mid s^t, \mathbf{a}^t)$

The goal is to find a joint policy $\pi = (\pi_1, \pi_2, \pi_3)$ that maximizes the expected return:

$$J(\pi) = \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T} \gamma^t r^t\right]$$
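
The return inside $J(\pi)$ is easy to compute for a logged episode via a backward recursion; a small sketch (function name illustrative):

```python
def discounted_return(rewards, gamma=0.99):
    """Compute sum_t gamma^t * r_t for one episode, iterating
    backward so each step costs O(1)."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g
```

`discounted_return([1.0, 1.0, 1.0], gamma=0.5)` gives `1.75` (= 1 + 0.5 + 0.25).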


Baseline Approaches

Before applying RL, let’s see what simple rules achieve:

🎲 Random
Each elevator picks random floors.
✓ Dead simple
✗ Ignores all info
✗ Terrible
🚗 Nearest Car
Closest idle elevator takes each request.
✓ Intuitive
✗ Elevators cluster
✗ No lookahead
↕️ SCAN
Continue in direction until no more requests, then reverse.
✓ Good for uni-directional
✗ Poor for bidirectional
✗ No coordination
ℹ️Note

These baselines set performance bounds our RL agent must beat.
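
For concreteness, the nearest-car rule fits in a few lines. This sketch assigns one hall call at a time and skips busy elevators; the rlbook baseline may resolve ties and busyness differently:

```python
def nearest_car(elevator_floors, elevator_busy, request_floor):
    """Return the index of the closest idle elevator to a hall call,
    or None if every elevator is busy."""
    best, best_dist = None, float("inf")
    for i, (floor, busy) in enumerate(zip(elevator_floors, elevator_busy)):
        if busy:
            continue
        dist = abs(floor - request_floor)
        if dist < best_dist:
            best, best_dist = i, dist
    return best
```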


The RL Solution: Independent Q-Learning

Our approach: Independent Q-Learning with a shared replay buffer.

💡 Key Idea
Each elevator has its own Q-network learning $Q_i(o_i, a_i)$, but all three share a replay buffer.

  • Decentralized execution: each elevator decides independently
  • Implicit coordination: elevators learn from each other's experiences
  • Simple: easier than QMIX, MADDPG, etc.

Network Architecture

Q-Network (one per elevator)

  • Input: 46 dims
  • Hidden: [128, 128] ReLU
  • Output: 10 Q-values

Training Hyperparameters

  • Replay buffer: 50,000
  • Batch size: 64
  • ε-greedy: 1.0 → 0.01
  • Target update: every 100 steps
  • Optimizer: Adam, lr = 0.001
Implementation

Here’s the core training loop:

from rlbook.envs import ElevatorDispatch
from rlbook.agents import ElevatorDQN

# Create environment
env = ElevatorDispatch(
    n_floors=10,
    n_elevators=3,
    traffic_pattern="morning_rush",
    max_timesteps=300
)

# Create multi-agent DQN
agent = ElevatorDQN(
    n_floors=10,
    n_elevators=3,
    observation_dim=46,
    hidden_dims=(128, 128),
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995
)

# Training loop
for episode in range(1000):
    obs, info = env.reset()
    episode_reward = 0

    for step in range(env.max_timesteps):
        # Each elevator selects action from its Q-network
        actions = agent.select_actions(obs, training=True)

        # Environment step
        next_obs, reward, done, truncated, info = env.step(actions)

        # Store all three elevators' transitions
        agent.store_transitions(obs, actions, reward, next_obs, done)

        # Train all networks
        loss = agent.train_step()

        episode_reward += reward
        obs = next_obs

        if done or truncated:
            break

    # Decay exploration
    agent.decay_epsilon()

Full implementation: code/rlbook/examples/train_elevator.py

Mathematical Details

Each elevator $i$ learns a Q-function $Q_i: \mathcal{O}_i \times \mathcal{A}_i \to \mathbb{R}$ via:

$$Q_i(o_i^t, a_i^t) \leftarrow Q_i(o_i^t, a_i^t) + \alpha \left[ r^t + \gamma \max_{a_i'} Q_i(o_i^{t+1}, a_i') - Q_i(o_i^t, a_i^t) \right]$$

Note that the reward $r^t$ is shared (global), but each elevator updates based on its own observation-action pairs.

The greedy policy for elevator $i$ is:

$$\pi_i(o_i) = \arg\max_{a_i} Q_i(o_i, a_i)$$

During training we use ε-greedy exploration:

  • With probability $\epsilon$: select a random floor
  • With probability $1-\epsilon$: select $\pi_i(o_i^t)$
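
The update rule is easiest to see in tabular form, where each $Q_i$ is a lookup table; the DQN version replaces the table row with a network forward pass and averages the update over a replay batch. A minimal sketch (tabular stand-in, names illustrative):

```python
import numpy as np

def q_update(Q, obs_idx, action, reward, next_obs_idx,
             alpha=0.1, gamma=0.99, done=False):
    """Tabular TD update: Q[s,a] += alpha * (target - Q[s,a])."""
    target = reward if done else reward + gamma * Q[next_obs_idx].max()
    Q[obs_idx, action] += alpha * (target - Q[obs_idx, action])

def epsilon_greedy(q_row, epsilon, rng):
    """ε-greedy over the 10 target floors: explore with prob. ε, else greedy."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_row)))
    return int(np.argmax(q_row))
```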

Results

Training time: 1000 episodes (~30 minutes on a laptop CPU)

  • 🎲 Random: 45.2s avg wait, 87 served (baseline, terrible)
  • 🚗 Nearest Car: 18.3s avg wait, 142 served (decent, no coordination)
  • ↕️ SCAN: 16.7s avg wait, 148 served (good for uni-directional traffic)
  • 🧠 DQN (trained RL): 12.4s avg wait, 156 served (26% better than SCAN)

Emergent Behaviors

The trained agent discovered coordination strategies we never programmed:

  • 🗺️ Implicit zoning: elevators naturally partition floors (e.g., one serves 0-3, another 4-6, another 7-9)
  • 🎯 Anticipatory positioning: during quiet periods, position at likely request floors
  • ↗️ Direction awareness: prefer picking up passengers going the same direction
  • ⚖️ Load balancing: if one elevator is full, others compensate nearby
ℹ️Note

This coordination emerged from independent learning—we never explicitly programmed these strategies!


Challenges & Solutions

❌ Challenge: Slow Initial Learning
With ε = 1.0, early episodes are random walks → very negative rewards → slow buffer fill
Solution:
  • Pre-fill the buffer with nearest-car trajectories (100 episodes)
  • Or use shaped rewards (bonus for approaching requests)

❌ Challenge: Non-Stationarity
All three elevators' policies change during training → a non-stationary environment
Solution:
  • A shared replay buffer stabilizes learning
  • Target networks reduce the moving-target problem
  • Slower epsilon decay allows adaptation

❌ Challenge: Exploration in Production
We can't let elevators explore wildly in a live building; real passengers would complain!
Solution:
  • Train fully in simulation first
  • Deploy with low ε (0.05) for minimal exploration
  • Constrain actions to "reasonable" floors only
  • Safety fallback: if wait > threshold, revert to a nearest-car override
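
The safety fallback can be a thin wrapper around the RL policy's joint action. In this sketch the threshold and override rule are illustrative, not from the chapter:

```python
def safe_actions(rl_actions, max_wait_per_floor, elevator_floors, threshold=90.0):
    """If any floor's longest wait exceeds the threshold, override the
    nearest elevator's RL action and send it to that floor."""
    actions = list(rl_actions)
    worst = max(range(len(max_wait_per_floor)), key=lambda f: max_wait_per_floor[f])
    if max_wait_per_floor[worst] > threshold:
        nearest = min(range(len(elevator_floors)),
                      key=lambda i: abs(elevator_floors[i] - worst))
        actions[nearest] = worst
    return actions
```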

Deployment Considerations

Moving from simulation to real buildings requires careful engineering:

🔄 Sim-to-Real Gap
Gaps:
  • Real passengers ≠ Poisson
  • Mechanical delays & failures
  • Special events (fire, maintenance)
Fixes:
  • Use real building data
  • Domain randomization
  • Continuous fine-tuning

🛡️ Safety Constraints
  • Hard timeout: 2 min max wait
  • Minimum service frequency per floor
  • Emergency mode overrides

📊 Monitoring
Track:
  • Wait time (mean, p95, max)
  • Starvation events
  • Utilization & energy
Red flags:
  • Wait spike → revert to baseline
  • Elevator clustering → check coordination


Extensions

Ways to push this further:

  • 🏢 Larger Buildings: 50 floors × 10 elevators. Use graph neural networks (GNNs) to encode relationships; consider CTDE methods like QMIX.
  • ⚡ Express Elevators: some elevators skip floors (1, 10, 20…). RL learns when to use express vs local service, beating fixed rules.
  • ⚖️ Multi-Objective Optimization: trade off wait time vs energy vs fairness with a weighted reward. Building managers tune the weights; a Pareto front identifies non-dominated policies.
  • 🎯 Destination Dispatch: passengers enter their destination before boarding → full observability → easier credit assignment and better routing.
  • 🔄 Lifelong Learning: building patterns shift (new tenants, seasons). Keep the replay buffer in production, fine-tune nightly, and detect distribution drift to retrain automatically.


Try It Yourself

Hands-On Training

Train your own elevator dispatch agent:

# Clone the repository
git clone https://github.com/ebilgin/rlbook
cd rlbook/code

# Install dependencies
pip install -e .

# Train for 1000 episodes (takes ~30 minutes on CPU)
python -m rlbook.examples.train_elevator --episodes 1000

# Try different traffic patterns
python -m rlbook.examples.train_elevator --traffic evening_rush

# Larger building
python -m rlbook.examples.train_elevator --n-floors 20 --n-elevators 5

Colab Notebook: coming soon.

Exercises

1. Reward Shaping: Modify the reward function to prioritize fairness over average wait time. How does this change behavior?

2. Architecture Experiments: Try different network sizes ([64,64] vs [256,256]). How does this affect sample efficiency?

3. Baseline Improvement: Implement a smarter nearest-car policy that considers elevator direction. Can you beat RL?

4. Traffic Generalization: Train on morning_rush, then evaluate on evening_rush. How well does it transfer?

5. Communication: Allow elevators to share their target floors with each other. Does this improve coordination?

Key Takeaways

  1. RL shines for coordination problems: when multiple agents must work together without explicit communication, RL discovers emergent coordination.
  2. Reward engineering is critical: small changes (delivery bonus, starvation penalty) drastically affect learned behavior.
  3. Baselines matter: simple heuristics (SCAN) can be surprisingly good. Always compare to strong baselines, not just random.
  4. The sim-to-real gap is real: training in simulation is easy; deployment requires careful safety engineering and monitoring.
  5. Multi-agent is hard: non-stationarity and credit assignment make MARL harder than single-agent RL. Independent Q-learning with a shared replay buffer is a good starting point.

Further Reading

Related Applications:

  • Traffic Signal Control: a similar multi-agent coordination problem
  • Warehouse Robotics: fleet coordination with physical constraints

This application demonstrates that RL isn't just for games and robotics: it's powerful for any sequential decision problem with delayed rewards and coordination requirements.