Elevator Dispatch with Multi-Agent RL
You’re on the 8th floor, running late. You press the elevator button and wait. And wait.
An empty elevator passes by—heading somewhere else. When one finally arrives, it’s already packed.
Why does this happen?
Elevators solve a coordination problem: multiple agents working together to minimize wait times across an entire building, adapting to changing traffic patterns throughout the day.
This isn’t simple scheduling—it’s sequential decision-making under uncertainty. Passenger arrivals are random, demands shift (morning rush vs quiet periods), and elevators decide in real-time without knowing future requests.
Perfect for reinforcement learning.
See It In Action
Before diving into the details, try controlling elevators yourself! Compare different algorithms:
Elevator Dispatch Simulation
Legend: Blue = Low utilization, Yellow = Medium, Red = High | Orange dots = Waiting passengers
Quick experiment:
- Set algorithm to “Nearest Car”, traffic to “Morning Rush”
- Press Play and watch wait times
- Switch to “Random” — notice the chaos!
- Try “RL” — see the learned coordination
Now let’s understand why this problem is challenging…
Why Traditional Rules Fail
Traditional elevator systems use simple heuristics:
First-Come-First-Served
Serve requests in arrival order
❌ Can starve upper floors
SCAN Algorithm
Continue in direction until no more requests
❌ Ignores individual wait times
Nearest Car
Send closest available elevator
❌ No lookahead
These work in simple scenarios but struggle when faced with:
- Time-varying traffic — Rush hours vs quiet periods
- Multi-objective tradeoffs — Wait time vs energy vs fairness
- Coordination — Multiple elevators working together
- Long-term planning — Positioning for future demand
RL agents learn policies that handle all of this.
The Problem Domain
Our Test Building
Each episode simulates 300 timesteps (10 minutes at 2 seconds per step). Elevators move at 0.5 floors/timestep.
Traffic Patterns
Real buildings have predictable patterns throughout the day:
Passengers arrive following a Poisson process with time-varying rates (λ changes by pattern).
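A minimal sketch of such an arrival process (the rate values here are illustrative placeholders; the environment's actual λ schedule per pattern is not given in the text):

```python
import numpy as np

# Illustrative arrival rates (passengers per timestep) for each traffic
# pattern; the real per-pattern lambda schedule is an assumption here.
ARRIVAL_RATES = {"morning_rush": 0.8, "lunch": 0.5, "evening_rush": 0.8, "quiet": 0.1}

def sample_arrivals(pattern, rng):
    """Number of new passengers arriving this timestep (one Poisson draw)."""
    return int(rng.poisson(ARRIVAL_RATES[pattern]))

rng = np.random.default_rng(42)
arrivals = [sample_arrivals("morning_rush", rng) for _ in range(300)]  # one episode
```

Because λ changes with the pattern, the same policy faces bursty demand in the morning and near-silence at night.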
Success Metrics
We measure performance across multiple dimensions:
The RL agent must balance all of these simultaneously.
The MDP Formulation
State Space: What Each Elevator Sees
Each elevator has a ~46-dimensional observation vector containing:
Own State (14 dims)
- Current floor
- Direction (UP/DOWN/IDLE)
- Passenger count
- Destination floors (which buttons pressed)

Pending Requests (20 dims)

- Waiting passengers per floor
- Request directions (up/down buttons)
- How long they've been waiting

Other Elevators (8 dims)

- Positions of the other 2 elevators
- Their directions
- Their passenger loads

Time Context (4 dims)

- Traffic pattern (one-hot): Morning/Lunch/Evening/Quiet
Partial observability: Elevators don’t know exact passenger destinations until they board—only which floors have waiting passengers and their desired direction.
The observation for elevator $i$ at time $t$ is:

$$o_t^i = \left[\, s_t^i \;\big\|\; p_t \;\big\|\; e_t^{-i} \;\big\|\; c_t \,\right]$$

where:

- $s_t^i \in \mathbb{R}^{14}$: Own state (floor, direction one-hot, passenger count, destination floors binary)
- $p_t \in \mathbb{R}^{20}$: Requests (waiting passengers per floor, up/down buttons)
- $e_t^{-i} \in \mathbb{R}^{8}$: Other elevators (2 elevators × 4 features each)
- $c_t \in \mathbb{R}^{4}$: Traffic context (one-hot)

Total observation dimension: $14 + 20 + 8 + 4 = 46$.
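As a concrete sketch, here is one way the observation vector could be assembled. Only the four group sizes (14 + 20 + 8 + 4 = 46) come from the text; the exact field layout, the 2-bit direction encoding, and the normalization constants are assumptions:

```python
import numpy as np

N_FLOORS = 10

def direction_bits(direction):
    # Assumed 2-bit direction encoding: UP=[1,0], DOWN=[0,1], IDLE=[0,0]
    return [float(direction == "UP"), float(direction == "DOWN")]

def build_observation(own, waiting_up, waiting_down, others, pattern_idx):
    """Assemble one elevator's 46-dim observation from the four groups above."""
    own_part = (
        [own["floor"] / (N_FLOORS - 1)]     # current floor (1)
        + direction_bits(own["direction"])  # direction (2)
        + [own["passengers"] / 8.0]         # passenger count (1)
        + list(own["destinations"])         # destination buttons (10)
    )                                       # -> 14 dims
    request_part = list(waiting_up) + list(waiting_down)  # per-floor up/down -> 20 dims
    other_part = []
    for e in others:                        # 2 other elevators x 4 features -> 8 dims
        other_part += (
            [e["floor"] / (N_FLOORS - 1)]
            + direction_bits(e["direction"])
            + [e["passengers"] / 8.0]
        )
    context_part = [0.0] * 4                # traffic pattern one-hot -> 4 dims
    context_part[pattern_idx] = 1.0
    return np.array(own_part + request_part + other_part + context_part,
                    dtype=np.float32)

own = {"floor": 3, "direction": "UP", "passengers": 2, "destinations": [0.0] * 10}
others = [{"floor": 0, "direction": "IDLE", "passengers": 0},
          {"floor": 9, "direction": "DOWN", "passengers": 4}]
obs = build_observation(own, [0.0] * 10, [0.0] * 10, others, pattern_idx=0)
```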
Action Space: Where To Go Next
Each elevator chooses a target floor (0-9) every timestep.
We use high-level actions (target floor) rather than low-level control (MOVE_UP/MOVE_DOWN/OPEN_DOOR). The environment handles pathfinding—if target is floor 7, the elevator moves toward 7, stopping to pick up/drop off passengers en route.
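The environment-side movement described above can be sketched in a few lines, using the 0.5 floors/timestep speed from the building description (pickup/drop-off stops en route are omitted here):

```python
def step_toward_target(floor, target, speed=0.5):
    """One timestep of movement toward the chosen target floor.

    Speed is 0.5 floors/timestep (from the building description); the real
    environment additionally stops en route for pickups and drop-offs.
    """
    if abs(target - floor) <= speed:
        return float(target)  # snap to the target when within one step
    return floor + speed if target > floor else floor - speed

position = 0.0
trace = []
for _ in range(5):
    position = step_toward_target(position, target=2)
    trace.append(position)
```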
At each timestep $t$, each elevator $i$ selects an action:

$$a_t^i \in \mathcal{A} = \{0, 1, \ldots, 9\}$$

The joint action is $\mathbf{a}_t = (a_t^1, a_t^2, a_t^3)$.
Reward Design: Balancing Multiple Goals
The reward function must balance competing objectives:
Why is delivery bonus (+10) so much larger than wait penalty (-1)? Because it takes multiple timesteps to pick up and deliver a passenger. This ratio prevents short-term thinking.
Formally, the shared reward at timestep $t$ is:

$$r_t = -\,|\mathcal{W}_t| \;+\; 10\,|\mathcal{D}_t| \;-\; c_{\text{move}} \sum_{i=1}^{3} m_t^i$$

where:

- $\mathcal{W}_t$: set of waiting passengers at time $t$
- $\mathcal{D}_t$: set of passengers delivered at time $t$
- $m_t^i$: floors moved by elevator $i$ at time $t$
- $c_{\text{move}}$: a small per-floor movement cost
The discount factor is $\gamma = 0.99$ (episodes are short, so we weight near-future rewards heavily).
Episode Structure
This creates episodes with rush hours, coordination challenges, and consequences for positioning decisions.
The Multi-Agent Challenge
With a single elevator, this is a standard MDP. With three elevators? Much harder.
This is a Decentralized Partially Observable Markov Decision Process (Dec-POMDP).
Formally:
- Agents: $i \in \{1, 2, 3\}$ (three elevators)
- Joint observation: $\mathbf{o}_t = (o_t^1, o_t^2, o_t^3)$, where each $o_t^i$ depends on the true state $s_t$
- Joint action: $\mathbf{a}_t = (a_t^1, a_t^2, a_t^3)$
- Shared reward: $r_t$ (all agents receive the same reward signal)
- Transition: $s_{t+1} \sim P(\cdot \mid s_t, \mathbf{a}_t)$

The goal is to find a joint policy $\boldsymbol{\pi} = (\pi^1, \pi^2, \pi^3)$ that maximizes expected return:

$$J(\boldsymbol{\pi}) = \mathbb{E}\left[\sum_{t=0}^{T} \gamma^{t}\, r_t\right]$$
Baseline Approaches
Before applying RL, let’s see what simple rules achieve:
These baselines set performance bounds our RL agent must beat.
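As a reference point, the Nearest Car heuristic amounts to a few lines (a sketch; the function name is ours). Its weakness is visible right in the code: it considers only current distance, not direction, load, or future demand:

```python
def nearest_car_assign(elevator_floors, request_floor):
    """Nearest Car baseline: dispatch the elevator closest to the request
    floor, ignoring direction, load, and future demand."""
    distances = [abs(f - request_floor) for f in elevator_floors]
    return distances.index(min(distances))  # index of the chosen elevator

# Elevators at floors 0, 5, and 9; a request arrives on floor 6
choice = nearest_car_assign([0, 5, 9], request_floor=6)
```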
The RL Solution: Independent Q-Learning
Our approach: Independent Q-Learning with a shared replay buffer.
Network Architecture
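The hyperparameters used below (observation_dim=46, hidden_dims=(128, 128), 10 target-floor actions) imply a small per-elevator MLP. A dependency-free sketch of that shape, not the actual ElevatorDQN internals:

```python
import numpy as np

rng = np.random.default_rng(0)

def init_mlp(dims=(46, 128, 128, 10)):
    """Randomly initialized weights for a 46 -> 128 -> 128 -> 10 MLP;
    the output head gives one Q-value per target floor."""
    return [(rng.normal(0.0, 0.1, size=(n_in, n_out)), np.zeros(n_out))
            for n_in, n_out in zip(dims[:-1], dims[1:])]

def q_forward(params, obs):
    """Forward pass: ReLU hidden layers, linear Q-value head."""
    x = obs
    for W, b in params[:-1]:
        x = np.maximum(x @ W + b, 0.0)
    W, b = params[-1]
    return x @ W + b

params = init_mlp()
q_values = q_forward(params, np.zeros(46))  # one Q-value per target floor
```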
Here’s the core training loop:
```python
from rlbook.envs import ElevatorDispatch
from rlbook.agents import ElevatorDQN

# Create environment
env = ElevatorDispatch(
    n_floors=10,
    n_elevators=3,
    traffic_pattern="morning_rush",
    max_timesteps=300
)

# Create multi-agent DQN
agent = ElevatorDQN(
    n_floors=10,
    n_elevators=3,
    observation_dim=46,
    hidden_dims=(128, 128),
    gamma=0.99,
    epsilon=1.0,
    epsilon_decay=0.995
)

# Training loop
for episode in range(1000):
    obs, info = env.reset()
    episode_reward = 0

    for step in range(env.max_timesteps):
        # Each elevator selects action from its Q-network
        actions = agent.select_actions(obs, training=True)

        # Environment step
        next_obs, reward, done, truncated, info = env.step(actions)

        # Store all three elevators' transitions
        agent.store_transitions(obs, actions, reward, next_obs, done)

        # Train all networks
        loss = agent.train_step()

        episode_reward += reward
        obs = next_obs

        if done or truncated:
            break

    # Decay exploration
    agent.decay_epsilon()
```

Full implementation: code/rlbook/examples/train_elevator.py
Each elevator $i$ learns a Q-function $Q^i_\theta(o^i, a)$ by minimizing the standard DQN loss:

$$\mathcal{L}(\theta^i) = \mathbb{E}\left[\left(r_t + \gamma \max_{a'} Q^i_{\theta^-}(o_{t+1}^i, a') - Q^i_{\theta}(o_t^i, a_t^i)\right)^2\right]$$

where $\theta^-$ are the target-network parameters. Note that the reward $r_t$ is shared (global), but each elevator updates based on its own observation-action pairs.

The policy for elevator $i$ is:

$$\pi^i(o^i) = \arg\max_{a} Q^i_\theta(o^i, a)$$

During training, we use ε-greedy exploration:

- With probability $\varepsilon$: select a random floor
- With probability $1 - \varepsilon$: select $\arg\max_a Q^i_\theta(o_t^i, a)$
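The ε-greedy rule can be written as a small helper (a sketch; the actual agent implementation may differ):

```python
import numpy as np

def epsilon_greedy(q_values, epsilon, rng):
    """Select a target floor: uniform random with probability epsilon,
    greedy argmax over the Q-values otherwise."""
    if rng.random() < epsilon:
        return int(rng.integers(len(q_values)))
    return int(np.argmax(q_values))

rng = np.random.default_rng(0)
q = np.array([0.1, 2.0, -1.0, 0.5, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3])
greedy_action = epsilon_greedy(q, epsilon=0.0, rng=rng)  # pure exploitation
```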
Results
Training time: 1000 episodes (~30 minutes on laptop CPU)
Emergent Behaviors
The trained agent discovered coordination strategies we never programmed:
This coordination emerged from independent learning—we never explicitly programmed these strategies!
Challenges & Solutions
Cold-start exploration (sparse rewards early in training):

- Pre-fill buffer with nearest-car policy (100 episodes)
- Or use shaped rewards (bonus for approaching requests)

Non-stationarity (each agent's environment shifts as the others learn):

- Shared replay buffer stabilizes learning
- Target networks reduce the moving-target problem
- Slower epsilon decay allows adaptation

Safe deployment:

- Train fully in simulation first
- Deploy with low ε (0.05) for minimal exploration
- Constrain actions to "reasonable" floors only
- Safety fallback: if wait > threshold → nearest-car override
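The target-network fix mentioned above can be sketched as a Polyak (soft) update of the target weights, here with plain NumPy arrays standing in for network parameters. Whether the actual agent uses soft or periodic hard updates is not specified in the text:

```python
import numpy as np

def soft_update(online, target, tau=0.01):
    """Polyak update: target <- tau * online + (1 - tau) * target.
    Keeps the bootstrap target slowly tracking the online network,
    which damps the moving-target problem."""
    return {name: tau * online[name] + (1.0 - tau) * target[name]
            for name in online}

online = {"w": np.array([1.0, 2.0])}
target = {"w": np.array([0.0, 0.0])}
updated = soft_update(online, target, tau=0.5)
```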
Deployment Considerations
Moving from simulation to real buildings requires careful engineering:
Sim-to-real gap:

- Real passengers ≠ Poisson
- Mechanical delays & failures
- Special events (fire, maintenance)

Closing the gap:

- Use real building data
- Domain randomization
- Continuous fine-tuning

Safety constraints: implemented as wrappers, not learned!

Production monitoring:

- Wait time (mean, p95, max)
- Starvation events
- Utilization & energy

Automatic fallbacks:

- Wait spike → revert to baseline
- Elevator clustering → check coordination
Extensions
Ways to push this further:
Try It Yourself
Hands-On Training
Train your own elevator dispatch agent:
```bash
# Clone the repository
git clone https://github.com/ebilgin/rlbook
cd rlbook/code

# Install dependencies
pip install -e .

# Train for 1000 episodes (takes ~30 minutes on CPU)
python -m rlbook.examples.train_elevator --episodes 1000

# Try different traffic patterns
python -m rlbook.examples.train_elevator --traffic evening_rush

# Larger building
python -m rlbook.examples.train_elevator --n-floors 20 --n-elevators 5
```

Colab Notebook: Open in Colab (coming soon)
Exercises
1. Reward Shaping: Modify the reward function to prioritize fairness over average wait time. How does this change behavior?
2. Architecture Experiments: Try different network sizes ([64,64] vs [256,256]). How does this affect sample efficiency?
3. Baseline Improvement: Implement a smarter nearest-car policy that considers elevator direction. Can you beat RL?
4. Traffic Generalization: Train on morning_rush, then evaluate on evening_rush. How well does it transfer?
5. Communication: Allow elevators to share their target floors with each other. Does this improve coordination?
Key Takeaways
Further Reading
Papers:
- Multi-Agent Actor-Critic for Mixed Cooperative-Competitive Environments (Lowe et al., 2017) - MADDPG
- QMIX: Monotonic Value Function Factorisation for Decentralised Multi-Agent RL (Rashid et al., 2018)
- Real-World Elevator Group Control with Deep RL (Hakonen et al., 2018)
Related Chapters:
- Deep Q-Networks - Foundation for DQN algorithm
- Multi-Agent RL - Advanced coordination methods (coming soon)
Related Applications:
- Traffic Signal Control - Similar multi-agent coordination problem
- Warehouse Robotics - Fleet coordination with physical constraints
This application demonstrates that RL isn’t just for games and robotics—it’s powerful for any sequential decision problem with delayed rewards and coordination requirements.