Policy Gradient Methods in Practice
What You'll Learn
- Apply policy gradient methods to robotics and continuous control tasks
- Understand how RLHF (Reinforcement Learning from Human Feedback) uses PPO
- Recognize practical challenges: reward shaping, sim-to-real, sample efficiency
- Know when to choose policy gradient vs. value-based methods
- Navigate the landscape of modern RL algorithms
From Theory to Practice
We’ve built up the theory of policy gradients: from the basic motivation through REINFORCE, actor-critic, and PPO.
Now let’s see how these methods are actually used.
Policy gradient methods aren’t just academic exercises—they power some of the most impressive AI systems today:
- Robotics: Robots learning to walk, manipulate objects, navigate
- Games: OpenAI Five (Dota 2), AlphaStar (StarCraft II)
- Language Models: ChatGPT, Claude, and other assistants trained with RLHF
- Autonomous Systems: Self-driving cars, drone control
The common thread: situations where we need to learn complex behaviors from interaction, often with continuous actions.
Continuous Control: Robotics
Continuous control is where policy gradients shine. Let’s see how to extend our discrete-action PPO to handle continuous actions.
Gaussian Policies for Continuous Actions
Mathematical Details
For continuous actions, we parameterize a Gaussian (normal) distribution:
$$\pi_\theta(a \mid s) = \mathcal{N}\big(a;\ \mu_\theta(s),\ \mathrm{diag}(\sigma^2)\big)$$
where:
- $\mu_\theta(s)$ is a neural network mapping state to action mean
- $\sigma$ is the standard deviation (can be learned or fixed)
To sample an action: $a = \mu_\theta(s) + \sigma \odot \epsilon$, with $\epsilon \sim \mathcal{N}(0, I)$.
Log probability (needed for policy gradients):
$$\log \pi_\theta(a \mid s) = -\sum_i \left[ \frac{\big(a_i - \mu_{\theta,i}(s)\big)^2}{2\sigma_i^2} + \log \sigma_i + \tfrac{1}{2}\log 2\pi \right]$$
Implementation
import torch
import torch.nn as nn
from torch.distributions import Normal
class ContinuousPolicyNetwork(nn.Module):
"""Actor network for continuous actions."""
def __init__(self, state_dim, action_dim, hidden_dim=64):
super().__init__()
self.shared = nn.Sequential(
nn.Linear(state_dim, hidden_dim),
nn.Tanh(),
nn.Linear(hidden_dim, hidden_dim),
nn.Tanh()
)
# Mean of action distribution
self.mean = nn.Linear(hidden_dim, action_dim)
# Log standard deviation (learned, but shared across states)
self.log_std = nn.Parameter(torch.zeros(action_dim))
def forward(self, state):
features = self.shared(state)
mean = self.mean(features)
std = torch.exp(self.log_std)
return mean, std
def get_action(self, state, deterministic=False):
"""Sample action from policy."""
mean, std = self.forward(state)
if deterministic:
return mean, None, None
dist = Normal(mean, std)
action = dist.sample()
# For bounded action spaces, often use tanh squashing
# action = torch.tanh(action)
log_prob = dist.log_prob(action).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1)
return action, log_prob, entropy
class ContinuousPPO:
"""PPO for continuous action spaces."""
def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
gae_lambda=0.95, clip_epsilon=0.2, epochs=10):
self.gamma = gamma
self.gae_lambda = gae_lambda
self.clip_epsilon = clip_epsilon
self.epochs = epochs
self.actor = ContinuousPolicyNetwork(state_dim, action_dim)
self.critic = nn.Sequential(
nn.Linear(state_dim, 64),
nn.Tanh(),
nn.Linear(64, 64),
nn.Tanh(),
nn.Linear(64, 1)
)
self.optimizer = torch.optim.Adam(
list(self.actor.parameters()) + list(self.critic.parameters()),
lr=lr
)
def select_action(self, state):
"""Select action for environment interaction."""
with torch.no_grad():
state_tensor = torch.FloatTensor(state).unsqueeze(0)
action, log_prob, _ = self.actor.get_action(state_tensor)
return action.squeeze(0).numpy(), log_prob.item()
def update(self, states, actions, old_log_probs, advantages, returns):
"""PPO update for continuous actions."""
states = torch.FloatTensor(states)
actions = torch.FloatTensor(actions)
old_log_probs = torch.FloatTensor(old_log_probs)
advantages = torch.FloatTensor(advantages)
returns = torch.FloatTensor(returns)
# Normalize advantages
advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
for _ in range(self.epochs):
# Get current log probs
mean, std = self.actor(states)
dist = Normal(mean, std)
new_log_probs = dist.log_prob(actions).sum(dim=-1)
entropy = dist.entropy().sum(dim=-1).mean()
# PPO clipped loss
ratio = torch.exp(new_log_probs - old_log_probs)
clipped_ratio = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()
# Value loss
values = self.critic(states).squeeze()
value_loss = nn.functional.mse_loss(values, returns)
# Combined loss
loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
self.optimizer.zero_grad()
loss.backward()
nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
self.optimizer.step()
Training on Pendulum
Implementation
import gymnasium as gym
import numpy as np
def train_continuous_ppo(env_name='Pendulum-v1', total_steps=100000):
"""Train continuous PPO on Pendulum."""
env = gym.make(env_name)
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.shape[0]
agent = ContinuousPPO(state_dim, action_dim)
# Collect and train
n_steps = 2048
episode_rewards = []
current_reward = 0
state, _ = env.reset()
for step in range(0, total_steps, n_steps):
states, actions, rewards, dones, log_probs, values = [], [], [], [], [], []
for _ in range(n_steps):
action, log_prob = agent.select_action(state)
# Clip action to valid range
action_clipped = np.clip(action, env.action_space.low, env.action_space.high)
next_state, reward, terminated, truncated, _ = env.step(action_clipped)
done = terminated or truncated
states.append(state)
actions.append(action)
rewards.append(reward)
dones.append(done)
log_probs.append(log_prob)
with torch.no_grad():
value = agent.critic(torch.FloatTensor(state).unsqueeze(0))
values.append(value.item())
current_reward += reward
if done:
episode_rewards.append(current_reward)
current_reward = 0
state, _ = env.reset()
else:
state = next_state
# Compute advantages and update
advantages, returns = compute_gae(
torch.FloatTensor(rewards),
torch.FloatTensor(values + [values[-1]]), # Crude bootstrap: reuse the last state's value in place of V(next_state)
torch.BoolTensor(dones),
agent.gamma,
agent.gae_lambda
)
agent.update(states, actions, log_probs, advantages.numpy(), returns.numpy())
if len(episode_rewards) > 0:  # log each rollout (step % 10000 rarely aligns with 2048-step rollouts)
print(f"Step {step}, Avg Reward: {np.mean(episode_rewards[-20:]):.1f}")
return agent, episode_rewards
# Note: compute_gae is the same function from the PPO chapter
For more complex continuous control tasks (HalfCheetah, Ant, Humanoid), you'd typically use larger networks, longer training, and possibly techniques like reward normalization and observation normalization. Libraries like Stable-Baselines3 provide well-tuned implementations.
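To see what a tuned implementation looks like, here is a minimal sketch using Stable-Baselines3's PPO together with VecNormalize for observation and reward normalization. The environment id, hyperparameters, and file names below are illustrative, not prescriptive.

```python
# Minimal sketch: PPO from Stable-Baselines3 with observation/reward normalization.
# The environment id, hyperparameters, and file names are illustrative.
import gymnasium as gym
from stable_baselines3 import PPO
from stable_baselines3.common.vec_env import DummyVecEnv, VecNormalize

# Wrap the environment so VecNormalize can track running statistics
venv = DummyVecEnv([lambda: gym.make("Pendulum-v1")])
venv = VecNormalize(venv, norm_obs=True, norm_reward=True)

model = PPO("MlpPolicy", venv, learning_rate=3e-4, n_steps=2048, verbose=1)
model.learn(total_timesteps=100_000)

# Save the policy and the normalization statistics together
model.save("ppo_pendulum")
venv.save("vecnormalize.pkl")
```

At evaluation time, the saved normalization statistics must be loaded and frozen so the policy sees observations on the same scale it was trained on.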
RLHF: Training Language Models
One of the most impactful applications of policy gradients is RLHF (Reinforcement Learning from Human Feedback), used to train models like ChatGPT.
The RLHF pipeline:
- Supervised Fine-Tuning (SFT): Train a language model on high-quality demonstrations
- Reward Modeling: Collect human preferences, train a model to predict which outputs humans prefer
- PPO Optimization: Use the reward model as a reward signal, optimize the policy (language model) with PPO
The key insight: humans can’t easily write reward functions for “be helpful, harmless, and honest,” but they can compare two outputs and say which is better. RLHF converts these comparisons into a reward signal.
How Reward Models Work
Mathematical Details
Given two responses $y_1$ and $y_2$ to the same prompt $x$, humans label which is better.
The Bradley-Terry model converts preferences to probabilities:
$$P(y_1 \succ y_2 \mid x) = \sigma\big(r_\phi(x, y_1) - r_\phi(x, y_2)\big)$$
where $r_\phi$ is a learned reward model and $\sigma$ is the sigmoid function.
Training objective (cross-entropy on preferences):
$$\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\Big[\log \sigma\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big]$$
where $y_w$ is the preferred (winning) response and $y_l$ is the less preferred one.
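To make the objective concrete, here is a minimal PyTorch sketch of the preference loss. The `reward_model` callable and its `(prompts, responses)` interface are assumptions for illustration; in a real pipeline the reward model is a transformer with a scalar head.

```python
import torch.nn.functional as F

def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry loss on pairwise preferences (a sketch).

    `reward_model(prompts, responses)` is assumed to return one scalar
    score per (prompt, response) pair.
    """
    r_chosen = reward_model(prompts, chosen)      # scores for preferred responses
    r_rejected = reward_model(prompts, rejected)  # scores for rejected responses
    # -log sigmoid(r_w - r_l), averaged over the batch
    return -F.logsigmoid(r_chosen - r_rejected).mean()
```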
PPO for Language Models
Mathematical Details
Once we have a reward model $r_\phi$, we optimize the language model policy $\pi_\theta$ with PPO on the objective:
$$\max_\theta \;\; \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[r_\phi(x, y)\big] \;-\; \beta\, \mathbb{E}_{x \sim \mathcal{D}}\Big[\mathrm{KL}\big(\pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x)\big)\Big]$$
The KL penalty prevents the policy from deviating too far from the reference policy $\pi_{\mathrm{ref}}$ (usually the SFT model).
Why the KL penalty?
- Prevents reward hacking: finding outputs that fool the reward model
- Maintains capabilities: the policy shouldn’t forget how to write coherently
- Provides stability: limits how much the policy can change
Without the KL penalty, the model might find adversarial examples that score high on the reward model but are actually bad. For example, it might learn to always say “I’m happy to help!” because that pattern often correlates with positive preferences, even when it doesn’t actually help.
The KL penalty says: “Don’t stray too far from the original model that we know works.”
Implementation
# Conceptual RLHF training loop (simplified)
# In practice, this requires significant infrastructure
def rlhf_training_step(policy, reward_model, reference_policy, prompts, kl_coef=0.1):
"""
Simplified RLHF training step.
Args:
policy: Current language model policy
reward_model: Trained reward model
reference_policy: Reference policy (frozen SFT model)
prompts: Batch of prompts
kl_coef: KL penalty coefficient
"""
# Generate responses from current policy
responses = policy.generate(prompts)
# Compute rewards from reward model
rewards = reward_model(prompts, responses)
# Compute KL penalty
policy_log_probs = policy.log_prob(prompts, responses)
ref_log_probs = reference_policy.log_prob(prompts, responses)
kl_penalty = policy_log_probs - ref_log_probs
# Adjusted rewards
adjusted_rewards = rewards - kl_coef * kl_penalty
# PPO update using adjusted rewards
# (Standard PPO, treating each token as a "step")
policy.ppo_update(prompts, responses, adjusted_rewards)
RLHF is complex in practice. Real implementations deal with:
- Token-level rewards and advantages
- Very long sequences (thousands of tokens)
- Distributed training across many GPUs
- Careful hyperparameter tuning
Libraries like TRL (Transformer Reinforcement Learning) and TRLX provide implementations.
Sim-to-Real Transfer
A major challenge in robotics: training in simulation is cheap and safe, but robots need to work in the real world.
The reality gap: Simulations are never perfect. Physics engines have approximations, sensors have noise patterns that differ from reality, and real-world environments are messier than simulated ones.
A policy trained purely in simulation may fail spectacularly on a real robot.
Solutions:
- Domain Randomization: Randomize simulation parameters (friction, mass, delays) so the policy learns to be robust
- System Identification: Measure real-world parameters and match the simulation
- Fine-tuning: Train in sim, then fine-tune with limited real-world data
- Sim-to-Real Adaptation: Learn to transfer, not just train
Implementation
# Domain randomization example
class RandomizedEnv:
"""Environment wrapper with domain randomization."""
def __init__(self, base_env, randomize_every=1):
self.base_env = base_env
self.randomize_every = randomize_every
self.episode_count = 0
def reset(self):
self.episode_count += 1
if self.episode_count % self.randomize_every == 0:
self.randomize_dynamics()
return self.base_env.reset()
def randomize_dynamics(self):
"""Randomize physics parameters."""
# Example randomizations (environment-specific)
self.base_env.model.body_mass[:] *= np.random.uniform(0.8, 1.2)
self.base_env.model.dof_damping[:] *= np.random.uniform(0.5, 1.5)
self.base_env.model.geom_friction[:] *= np.random.uniform(0.7, 1.3)
def step(self, action):
# Add action noise to simulate actuator imprecision
noisy_action = action + np.random.normal(0, 0.02, size=action.shape)
return self.base_env.step(noisy_action)
The key insight of domain randomization: if the policy works across many simulated variations, it's more likely to work in reality (which is just another variation it hasn't seen).
Method Selection Guide
When should you use policy gradients vs. other approaches?
| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, moderate state space | DQN | Sample efficient, off-policy |
| Continuous actions | PPO or SAC | Natural handling of continuous actions |
| High-dimensional observations | PPO or SAC | Both scale well with deep networks |
| Sample efficiency critical | SAC or model-based | Off-policy reuses data; models can plan |
| Stability critical | PPO | Robust to hyperparameters |
| Human feedback available | PPO (RLHF) | Stable, handles preference learning well |
| Real-time constraints | Simpler methods | PPO can be slow for real-time control |
| Offline data only | CQL, IQL, DT | Can’t interact, need offline methods |
PPO vs. SAC
Two popular choices for continuous control:
PPO (on-policy):
- Collects new data with current policy
- Throws away data after a few epochs
- More stable, easier to tune
- Less sample efficient
SAC (Soft Actor-Critic, off-policy):
- Stores all experience in replay buffer
- Reuses old data many times
- More sample efficient
- Can be less stable
Rule of thumb:
- Simulated environments (cheap samples): PPO is simpler
- Real robots (expensive samples): SAC’s efficiency matters more
Practical Tips
Hard-won lessons from applying policy gradients:
Reward Design
- Start simple: sparse rewards are often better than complex shaped rewards
- Avoid reward hacking: test that the policy actually solves the task, not just maximizes reward
- Normalize rewards: helps with stability
Training Stability
- Use gradient clipping (max norm ~0.5)
- Normalize observations (running mean/std; see the sketch after this list)
- Normalize advantages (per-batch)
- Start with lower learning rates (3e-4 for Adam)
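A minimal sketch of a running mean/std observation normalizer, assuming you maintain it yourself rather than relying on a library wrapper; the class name is illustrative. Call `update()` on each batch of collected observations and `normalize()` before states reach the networks.

```python
import numpy as np

class RunningNormalizer:
    """Running mean/std normalizer for observations (parallel-variance update).

    A sketch; production code usually also clips the normalized values
    and saves the statistics alongside the policy.
    """
    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape, dtype=np.float64)
        self.var = np.ones(shape, dtype=np.float64)
        self.count = eps

    def update(self, batch):
        batch = np.asarray(batch, dtype=np.float64)
        batch_mean = batch.mean(axis=0)
        batch_var = batch.var(axis=0)
        batch_count = batch.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.mean += delta * batch_count / total
        # Combine the two groups' variances (parallel variance formula)
        m_a = self.var * self.count
        m_b = batch_var * batch_count
        self.var = (m_a + m_b + delta**2 * self.count * batch_count / total) / total
        self.count = total

    def normalize(self, obs):
        return (obs - self.mean) / np.sqrt(self.var + 1e-8)
```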
Debugging
- Check that the value function is actually learning (plot value loss)
- Monitor entropy (should decrease but not collapse to zero)
- Track KL divergence between updates (should stay reasonable; see the diagnostics sketch below)
- Visualize the policy regularly (not just reward curves)
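One cheap way to monitor these signals is to compute a few statistics inside the PPO update loop. The function below is a sketch, and the returned keys are illustrative; it uses the log probabilities and entropies already computed during the update.

```python
import torch

def ppo_diagnostics(new_log_probs, old_log_probs, entropy, clip_epsilon=0.2):
    """Cheap per-update diagnostics (a sketch).

    Returns an approximate KL between old and new policies, the fraction of
    samples hitting the clip boundary, and the mean entropy.
    """
    with torch.no_grad():
        log_ratio = new_log_probs - old_log_probs
        ratio = torch.exp(log_ratio)
        # Estimator of KL(old || new): E[(r - 1) - log r] with r = pi_new / pi_old
        approx_kl = ((ratio - 1) - log_ratio).mean().item()
        clip_fraction = ((ratio - 1.0).abs() > clip_epsilon).float().mean().item()
        mean_entropy = entropy.mean().item()
    return {"approx_kl": approx_kl, "clip_fraction": clip_fraction, "entropy": mean_entropy}
```

If the approximate KL spikes or the clip fraction stays near 1.0, the policy is changing too fast; lower the learning rate or take fewer epochs per batch.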
Scaling Up
- Parallelism helps: collect experience from multiple environments (see the vectorized-environment sketch below)
- Larger batch sizes are more stable
- Learning rate scheduling can help fine-tune
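Gymnasium's vectorized environments make parallel collection straightforward; here is a minimal sketch, where the environment id, number of environments, and random actions are placeholders for your policy's batched actions.

```python
import gymnasium as gym
import numpy as np

# Minimal sketch: step several environments in parallel.
# SyncVectorEnv runs them in one process; AsyncVectorEnv uses subprocesses.
n_envs = 8
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(n_envs)])

obs, _ = envs.reset(seed=0)
for _ in range(10):
    # Placeholder random actions; in practice the policy produces a batch of actions
    actions = np.stack([envs.single_action_space.sample() for _ in range(n_envs)])
    obs, rewards, terminated, truncated, infos = envs.step(actions)
envs.close()
```

Vector environments return batched observations and rewards, so the rollout buffer and advantage computation operate on arrays of shape (n_steps, n_envs, ...).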
RL is notoriously hard to debug. A bug in your advantage computation or loss function might not cause an error—the algorithm just won’t learn. Unit test your components carefully!
Summary
Key Takeaways
- Continuous control uses Gaussian policies: output mean and std, sample actions
- RLHF combines preference learning with PPO to train helpful language models
- The KL penalty in RLHF prevents reward hacking and maintains capabilities
- Sim-to-real transfer uses domain randomization to bridge the reality gap
- Method selection depends on action space, sample efficiency needs, and stability requirements
- PPO is a good default: stable, simple, and effective across many domains
- SAC is more sample efficient but can be less stable
- Practical success requires careful reward design, normalization, and debugging
Section Summary: The Policy Gradient Journey
We’ve traveled from the basic idea of learning policies directly to state-of-the-art algorithms used in production systems:
- Introduction: Why learn policies instead of values?
- Policy Gradient Theorem: The mathematical foundation—how to differentiate expected return
- REINFORCE: The simplest implementation—elegant but high variance
- Actor-Critic: Adding a critic for lower variance, trading off some bias
- PPO: Clipped objectives for stability—the workhorse of modern RL
- Applications: From robots to language models
You now have the conceptual understanding and practical tools to apply policy gradient methods to real problems.
What’s Next
The RL landscape continues to evolve:
- Off-policy actor-critic (SAC, TD3): Combining the best of policy gradients and Q-learning
- Model-based RL: Learning world models to plan and improve sample efficiency
- Offline RL: Learning from fixed datasets without environment interaction
- Multi-agent RL: Multiple policies learning together
- Foundation models for RL: Using large pretrained models for better generalization
The tools you’ve learned here—policy parameterization, advantage estimation, trust regions—are the building blocks for understanding these advanced topics.
Exercises
Conceptual Questions
- Why is PPO particularly well-suited for RLHF? What properties make it appropriate for fine-tuning language models?
- What is the sim-to-real gap? Name three techniques to address it.
- When would you choose PPO over SAC? When SAC over PPO?
- Why does RLHF need a KL penalty? What could go wrong without it?
Coding Challenges
- Implement continuous-action PPO and train on the Pendulum environment. Plot the learning curve and the final policy's behavior.
- Implement a simple reward model that learns from pairwise preferences. Generate synthetic preferences (prefer responses with certain properties), train the reward model, and verify it ranks correctly.
Exploration
- Domain randomization experiment. If you have access to a physics simulator (PyBullet, MuJoCo), try training with different amounts of randomization. How does randomization affect training speed? Final performance? Robustness to parameter changes?
- Research dive. Pick a recent application of RL in a domain you're interested in (robotics, games, NLP). What algorithm did they use? What were the key challenges? How did they address them?