Experiment Tracking for RL
Why Tracking Matters More in RL
Reinforcement learning experiments are notoriously hard to reproduce. Small changes in random seeds, hyperparameters, or even library versions can produce dramatically different results.
Good experiment tracking isn’t optional—it’s the difference between “I got 500 reward once” and “here’s exactly how to reproduce 500 reward.”
What to Track
Metrics
Training metrics:
- Episode reward (mean, min, max, std)
- Episode length
- Policy loss, value loss
- Entropy (exploration indicator)
- Learning rate schedule
Evaluation metrics:
- Evaluation reward (deterministic policy)
- Success rate (if applicable)
- Custom domain metrics
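A lightweight way to get the mean/min/max/std statistics listed above is to keep a rolling buffer of recent episode returns and log summary stats from it. A minimal sketch (the 100-episode window is an arbitrary choice):

```python
from collections import deque

import numpy as np

# Rolling buffer of recent episode returns; window size is an arbitrary choice
recent_returns = deque(maxlen=100)

def episode_reward_stats(episode_return: float) -> dict:
    """Add the latest return and compute summary statistics over the window."""
    recent_returns.append(episode_return)
    returns = np.array(recent_returns)
    return {
        "reward/mean": returns.mean(),
        "reward/min": returns.min(),
        "reward/max": returns.max(),
        "reward/std": returns.std(),
    }
```

The returned dict can be passed straight to whatever logger you use (e.g., `wandb.log`).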
Artifacts
- Model checkpoints
- Training videos/renders
- Configuration files
- Environment specifications
Metadata
- Git commit hash
- Library versions (gymnasium, torch, etc.)
- Hardware specs
- Random seeds
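Most of this metadata can be captured automatically at the start of a run and stored alongside the config. A rough sketch (which packages you record is up to your stack; the ones below are examples):

```python
import importlib.metadata
import platform
import subprocess

def collect_run_metadata(seed: int) -> dict:
    """Gather reproducibility metadata to store with the run config."""
    commit = subprocess.run(
        ["git", "rev-parse", "HEAD"], capture_output=True, text=True
    ).stdout.strip()
    return {
        "git_commit": commit,
        "python": platform.python_version(),
        "platform": platform.platform(),
        "versions": {
            # Adjust this tuple to the libraries you actually depend on
            pkg: importlib.metadata.version(pkg)
            for pkg in ("gymnasium", "torch", "numpy")
        },
        "seed": seed,
    }
```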
Tool Comparison
| Feature | Weights & Biases | MLflow | TensorBoard |
|---|---|---|---|
| Hosting | Cloud (free tier) | Self-hosted or cloud | Local |
| Collaboration | Built-in | Manual setup | Limited |
| Hyperparameter sweeps | Yes | Via integrations (e.g., Optuna) | No |
| Model registry | Yes | Yes | No |
| Cost | Free/paid | Open source | Free |
Recommendation: start with W&B for ease of use; consider MLflow if you need self-hosting or already use it.
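If you go the MLflow route, the equivalent logging calls look roughly like this (a sketch, not a full training script; the placeholder reward value stands in for your training loop):

```python
import mlflow

# Roughly equivalent MLflow calls for a single run
mlflow.set_experiment("rl-experiments")
with mlflow.start_run():
    mlflow.log_params({"algorithm": "PPO", "learning_rate": 3e-4, "gamma": 0.99})
    # Inside the training loop, log one value per episode:
    mlflow.log_metric("reward", 123.0, step=0)  # placeholder value
    # Artifacts (checkpoints, configs) are logged from files on disk:
    # mlflow.log_artifact("policy.pt")
```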
Setup with Weights & Biases
Implementation
```python
import wandb
import gymnasium as gym

# Initialize the run; the config is stored with it for reproducibility
wandb.init(
    project="rl-experiments",
    config={
        "algorithm": "PPO",
        "env": "CartPole-v1",
        "learning_rate": 3e-4,
        "gamma": 0.99,
        "clip_eps": 0.2,
        "seed": 42,
    },
)

# rgb_array render mode lets us record evaluation videos later
env = gym.make(wandb.config["env"], render_mode="rgb_array")
env.reset(seed=wandb.config["seed"])  # seed the environment once
# `policy` is assumed to be defined elsewhere (e.g., a PPO agent)

# Training loop
for episode in range(1000):
    state, _ = env.reset()
    episode_reward = 0.0
    episode_length = 0
    while True:
        action = policy.select_action(state)
        state, reward, done, truncated, _ = env.step(action)
        episode_reward += reward
        episode_length += 1
        if done or truncated:
            break

    # Log per-episode metrics
    wandb.log({
        "episode": episode,
        "reward": episode_reward,
        "episode_length": episode_length,
    })

# Save the model checkpoint as a versioned artifact
# (assumes the weights were already written to policy.pt, e.g. with torch.save)
artifact = wandb.Artifact("policy", type="model")
artifact.add_file("policy.pt")
wandb.log_artifact(artifact)
wandb.finish()
```

Video logging (critical for RL debugging):
```python
import numpy as np

# Record an evaluation episode (env was created with render_mode="rgb_array" above)
frames = []
state, _ = env.reset()
while True:
    frames.append(env.render())
    action = policy.select_action(state, deterministic=True)
    state, reward, done, truncated, _ = env.step(action)
    if done or truncated:
        break

# Log as video; wandb.Video expects a (time, channels, height, width) array
frames = np.array(frames).transpose(0, 3, 1, 2)
wandb.log({"eval_video": wandb.Video(frames, fps=30)})
```
Hyperparameter Sweeps
Implementation
```yaml
# sweep.yaml
program: train.py
method: bayes  # or grid, random
metric:
  name: eval_reward
  goal: maximize
parameters:
  learning_rate:
    distribution: log_uniform_values
    min: 0.00001  # 1e-5
    max: 0.01     # 1e-2
  gamma:
    values: [0.95, 0.99, 0.999]
  clip_eps:
    values: [0.1, 0.2, 0.3]
```
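The sweep agent launches train.py once per trial, and `wandb.init()` exposes the sampled hyperparameters through `wandb.config`. A minimal sketch of the script side (the `evaluate_policy` helper is hypothetical; the only hard requirement is that the script logs the metric named in sweep.yaml):

```python
# train.py (sketch): hyperparameters sampled by the sweep arrive via wandb.config
import wandb

def main():
    wandb.init(project="rl-experiments")
    cfg = wandb.config  # learning_rate, gamma, clip_eps are injected by the sweep

    # Build the agent from cfg, train it, and evaluate periodically.
    for eval_step in range(10):
        eval_reward = evaluate_policy(cfg)  # hypothetical evaluation helper
        # Must match the metric name declared in sweep.yaml
        wandb.log({"eval_reward": eval_reward})

if __name__ == "__main__":
    main()
```

Then launch the sweep and one or more agents: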
```bash
# Run sweep
wandb sweep sweep.yaml
wandb agent <sweep-id>
```

Best Practices
- Log everything, filter later. Storage is cheap; missing data is expensive.
- Use deterministic evaluation. Training rewards are noisy; eval with fixed seeds (see the sketch after this list).
- Record videos regularly. Reward curves can look good while behavior is broken.
- Tag experiments. Use tags like “baseline”, “ablation”, “final” for filtering.
- Save configs as artifacts. Your future self will thank you.
- Set random seeds explicitly. And log them.
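A minimal sketch of the deterministic evaluation mentioned above, reusing the env/policy interface from the earlier examples (the seed list is an arbitrary choice):

```python
import numpy as np

def evaluate(env, policy, seeds=(0, 1, 2, 3, 4)) -> float:
    """Run one deterministic episode per fixed seed and return the mean reward."""
    returns = []
    for seed in seeds:
        state, _ = env.reset(seed=seed)
        total = 0.0
        while True:
            action = policy.select_action(state, deterministic=True)
            state, reward, done, truncated, _ = env.step(action)
            total += reward
            if done or truncated:
                break
        returns.append(total)
    return float(np.mean(returns))
```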