Chapter 205

Policy Gradient Methods in Practice

Apply policy gradient methods to real-world challenges in robotics, RLHF, and beyond


What You'll Learn

  • Apply policy gradient methods to robotics and continuous control tasks
  • Understand how RLHF (Reinforcement Learning from Human Feedback) uses PPO
  • Recognize practical challenges: reward shaping, sim-to-real, sample efficiency
  • Know when to choose policy gradient vs. value-based methods
  • Navigate the landscape of modern RL algorithms

From Theory to Practice

We’ve built up the theory of policy gradients: from the basic motivation through REINFORCE, actor-critic, and PPO.

Now let’s see how these methods are actually used.

Policy gradient methods aren’t just academic exercises—they power some of the most impressive AI systems today:

  • Robotics: Robots learning to walk, manipulate objects, navigate
  • Games: OpenAI Five (Dota 2), AlphaStar (StarCraft II)
  • Language Models: ChatGPT, Claude, and other assistants trained with RLHF
  • Autonomous Systems: Self-driving cars, drone control

The common thread: situations where we need to learn complex behaviors from interaction, often with continuous actions.

Continuous Control: Robotics

Continuous control is where policy gradients shine. Let’s see how to extend our discrete-action PPO to handle continuous actions.

Gaussian Policies for Continuous Actions

Mathematical Details

For continuous actions, we parameterize a Gaussian (normal) distribution:

\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2)

where:

  • \mu_\theta(s) is a neural network mapping state to action mean
  • \sigma_\theta is the standard deviation (can be learned or fixed)

To sample an action: a = \mu_\theta(s) + \sigma_\theta \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Log probability (needed for policy gradients): \log \pi_\theta(a|s) = -\frac{1}{2}\left(\frac{a - \mu_\theta(s)}{\sigma_\theta}\right)^2 - \log \sigma_\theta - \frac{1}{2}\log(2\pi)
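
As a quick sanity check (a standalone sketch, separate from the agent code below), the closed-form log-probability above matches PyTorch's built-in Normal.log_prob:

import math

import torch
from torch.distributions import Normal

mu, std, a = torch.tensor(0.3), torch.tensor(0.5), torch.tensor(0.1)

closed_form = -0.5 * ((a - mu) / std) ** 2 - torch.log(std) - 0.5 * math.log(2 * math.pi)
from_library = Normal(mu, std).log_prob(a)

print(torch.allclose(closed_form, from_library))  # True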

</>Implementation
import torch
import torch.nn as nn
from torch.distributions import Normal


class ContinuousPolicyNetwork(nn.Module):
    """Actor network for continuous actions."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()

        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )

        # Mean of action distribution
        self.mean = nn.Linear(hidden_dim, action_dim)

        # Log standard deviation (learned, but shared across states)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        features = self.shared(state)
        mean = self.mean(features)
        std = torch.exp(self.log_std)
        return mean, std

    def get_action(self, state, deterministic=False):
        """Sample action from policy."""
        mean, std = self.forward(state)

        if deterministic:
            return mean, None, None

        dist = Normal(mean, std)
        action = dist.sample()

        # For bounded action spaces, often use tanh squashing
        # action = torch.tanh(action)

        log_prob = dist.log_prob(action).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)

        return action, log_prob, entropy


class ContinuousPPO:
    """PPO for continuous action spaces."""

    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 gae_lambda=0.95, clip_epsilon=0.2, epochs=10):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.epochs = epochs

        self.actor = ContinuousPolicyNetwork(state_dim, action_dim)
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1)
        )

        self.optimizer = torch.optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr
        )

    def select_action(self, state):
        """Select action for environment interaction."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action, log_prob, _ = self.actor.get_action(state_tensor)
        return action.squeeze(0).numpy(), log_prob.item()

    def update(self, states, actions, old_log_probs, advantages, returns):
        """PPO update for continuous actions."""
        states = torch.FloatTensor(states)
        actions = torch.FloatTensor(actions)
        old_log_probs = torch.FloatTensor(old_log_probs)
        advantages = torch.FloatTensor(advantages)
        returns = torch.FloatTensor(returns)

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            # Get current log probs
            mean, std = self.actor(states)
            dist = Normal(mean, std)
            new_log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1).mean()

            # PPO clipped loss
            ratio = torch.exp(new_log_probs - old_log_probs)
            clipped_ratio = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
            policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

            # Value loss
            values = self.critic(states).squeeze()
            value_loss = nn.functional.mse_loss(values, returns)

            # Combined loss
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
            self.optimizer.step()
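
One note on the commented-out tanh squashing in get_action: when actions must lie in a bounded range such as [-1, 1], you can squash the Gaussian sample with tanh, but the log-probability then needs a change-of-variables correction (the same trick SAC uses). A minimal sketch, with a helper name of our own choosing:

import torch
from torch.distributions import Normal


def get_squashed_action(dist, eps=1e-6):
    """Sample a tanh-squashed action and its corrected log-probability."""
    raw_action = dist.rsample()              # reparameterized Gaussian sample
    action = torch.tanh(raw_action)          # squashed into (-1, 1)
    log_prob = dist.log_prob(raw_action).sum(dim=-1)
    # Subtract log|d tanh(u)/du| = log(1 - tanh(u)^2) for each action dimension
    log_prob -= torch.log(1 - action.pow(2) + eps).sum(dim=-1)
    return action, log_prob


# Example: a 2-dimensional action from an arbitrary Gaussian
dist = Normal(torch.zeros(2), torch.ones(2))
action, log_prob = get_squashed_action(dist)

Rescale the squashed action to the environment's actual bounds before stepping.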

Training on Pendulum

</>Implementation
import gymnasium as gym
import numpy as np


def train_continuous_ppo(env_name='Pendulum-v1', total_steps=100000):
    """Train continuous PPO on Pendulum."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]

    agent = ContinuousPPO(state_dim, action_dim)

    # Collect and train
    n_steps = 2048
    episode_rewards = []
    current_reward = 0
    state, _ = env.reset()

    for step in range(0, total_steps, n_steps):
        states, actions, rewards, dones, log_probs, values = [], [], [], [], [], []

        for _ in range(n_steps):
            action, log_prob = agent.select_action(state)

            # Clip action to valid range
            action_clipped = np.clip(action, env.action_space.low, env.action_space.high)

            next_state, reward, terminated, truncated, _ = env.step(action_clipped)
            done = terminated or truncated

            states.append(state)
            actions.append(action)
            rewards.append(reward)
            dones.append(done)
            log_probs.append(log_prob)

            with torch.no_grad():
                value = agent.critic(torch.FloatTensor(state).unsqueeze(0))
            values.append(value.item())

            current_reward += reward
            if done:
                episode_rewards.append(current_reward)
                current_reward = 0
                state, _ = env.reset()
            else:
                state = next_state

        # Compute advantages and update; bootstrap the rollout with the
        # critic's value of the state it stopped in
        with torch.no_grad():
            last_value = agent.critic(torch.FloatTensor(state).unsqueeze(0)).item()

        advantages, returns = compute_gae(
            torch.FloatTensor(rewards),
            torch.FloatTensor(values + [last_value]),  # Bootstrap value
            torch.BoolTensor(dones),
            agent.gamma,
            agent.gae_lambda
        )

        agent.update(states, actions, log_probs, advantages.numpy(), returns.numpy())

        # Print progress roughly every 10k environment steps
        if episode_rewards and (step // n_steps) % 5 == 0:
            print(f"Step {step}, Avg Reward: {np.mean(episode_rewards[-20:]):.1f}")

    return agent, episode_rewards


# Note: compute_gae is the same function from the PPO chapter
ℹ️Note

For more complex continuous control tasks (HalfCheetah, Ant, Humanoid), you’d typically use larger networks, longer training, and possibly techniques like reward normalization and observation normalization. Libraries like Stable-Baselines3 provide well-tuned implementations.

RLHF: Training Language Models

One of the most impactful applications of policy gradients is RLHF (Reinforcement Learning from Human Feedback), used to train models like ChatGPT.

The RLHF pipeline:

  1. Supervised Fine-Tuning (SFT): Train a language model on high-quality demonstrations
  2. Reward Modeling: Collect human preferences, train a model to predict which outputs humans prefer
  3. PPO Optimization: Use the reward model as a reward signal, optimize the policy (language model) with PPO

The key insight: humans can’t easily write reward functions for “be helpful, harmless, and honest,” but they can compare two outputs and say which is better. RLHF converts these comparisons into a reward signal.

How Reward Models Work

Mathematical Details

Given two responses y_1 and y_2 to the same prompt x, humans label which is better.

The Bradley-Terry model converts preferences to probabilities:

P(y_1 \succ y_2 | x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))

where r_\phi is a learned reward model and \sigma is the sigmoid function.

Training objective (cross-entropy on preferences):

L(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]

where y_w is the preferred (winning) response and y_l is the less preferred one.
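
To make this concrete, here is a minimal sketch of the preference loss in PyTorch. The reward_model argument is a stand-in for any module that scores a prompt/response pair; the function name is ours, not a specific library's API.

import torch.nn.functional as F


def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry cross-entropy on pairwise human preferences."""
    r_chosen = reward_model(prompts, chosen)      # scores for preferred responses, shape (batch,)
    r_rejected = reward_model(prompts, rejected)  # scores for rejected responses, shape (batch,)
    # -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(r_chosen - r_rejected).mean()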

PPO for Language Models

Mathematical Details

Once we have a reward model r_\phi, we optimize the language model policy \pi_\theta using PPO:

J(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y) - \beta\, D_\text{KL}(\pi_\theta \| \pi_\text{ref})\right]

The KL penalty prevents the policy from deviating too far from the reference policy \pi_\text{ref} (usually the SFT model).

Why the KL penalty?

  • Prevents reward hacking: finding outputs that fool the reward model
  • Maintains capabilities: the policy shouldn’t forget how to write coherently
  • Provides stability: limits how much the policy can change

Without the KL penalty, the model might find adversarial examples that score high on the reward model but are actually bad. For example, it might learn to always say “I’m happy to help!” because that pattern often correlates with positive preferences, even when it doesn’t actually help.

The KL penalty says: “Don’t stray too far from the original model that we know works.”

</>Implementation
# Conceptual RLHF training loop (simplified)
# In practice, this requires significant infrastructure

def rlhf_training_step(policy, reward_model, reference_policy, prompts, kl_coef=0.1):
    """
    Simplified RLHF training step.

    Args:
        policy: Current language model policy
        reward_model: Trained reward model
        reference_policy: Reference policy (frozen SFT model)
        prompts: Batch of prompts
        kl_coef: KL penalty coefficient
    """
    # Generate responses from current policy
    responses = policy.generate(prompts)

    # Compute rewards from reward model
    rewards = reward_model(prompts, responses)

    # Compute KL penalty
    policy_log_probs = policy.log_prob(prompts, responses)
    ref_log_probs = reference_policy.log_prob(prompts, responses)
    kl_penalty = policy_log_probs - ref_log_probs

    # Adjusted rewards
    adjusted_rewards = rewards - kl_coef * kl_penalty

    # PPO update using adjusted rewards
    # (Standard PPO, treating each token as a "step")
    policy.ppo_update(prompts, responses, adjusted_rewards)

Sim-to-Real Transfer

A major challenge in robotics: training in simulation is cheap and safe, but robots need to work in the real world.

The reality gap: Simulations are never perfect. Physics engines have approximations, sensors have noise patterns that differ from reality, and real-world environments are messier than simulated ones.

A policy trained purely in simulation may fail spectacularly on a real robot.

Solutions:

  1. Domain Randomization: Randomize simulation parameters (friction, mass, delays) so the policy learns to be robust
  2. System Identification: Measure real-world parameters and match the simulation
  3. Fine-tuning: Train in sim, then fine-tune with limited real-world data
  4. Sim-to-Real Adaptation: Learn to transfer, not just train
</>Implementation
# Domain randomization example
import numpy as np


class RandomizedEnv:
    """Environment wrapper with domain randomization."""

    def __init__(self, base_env, randomize_every=1):
        self.base_env = base_env
        self.randomize_every = randomize_every
        self.episode_count = 0

    def reset(self):
        self.episode_count += 1

        if self.episode_count % self.randomize_every == 0:
            self.randomize_dynamics()

        return self.base_env.reset()

    def randomize_dynamics(self):
        """Randomize physics parameters."""
        # Example randomizations (environment-specific)
        self.base_env.model.body_mass[:] *= np.random.uniform(0.8, 1.2)
        self.base_env.model.dof_damping[:] *= np.random.uniform(0.5, 1.5)
        self.base_env.model.geom_friction[:] *= np.random.uniform(0.7, 1.3)

    def step(self, action):
        # Add action noise to simulate actuator imprecision
        noisy_action = action + np.random.normal(0, 0.02, size=action.shape)
        return self.base_env.step(noisy_action)
💡Tip

The key insight of domain randomization: if the policy works across many simulated variations, it’s more likely to work in reality (which is just another variation it hasn’t seen).

Method Selection Guide

When should you use policy gradients vs. other approaches?

| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, moderate state space | DQN | Sample efficient, off-policy |
| Continuous actions | PPO or SAC | Natural handling of continuous actions |
| High-dimensional observations | PPO or SAC | Both scale well with deep networks |
| Sample efficiency critical | SAC or model-based | Off-policy reuses data; models can plan |
| Stability critical | PPO | Robust to hyperparameters |
| Human feedback available | PPO (RLHF) | Stable, handles preference learning well |
| Real-time constraints | Simpler methods | PPO can be slow for real-time control |
| Offline data only | CQL, IQL, DT | Can't interact, need offline methods |

PPO vs. SAC

Two popular choices for continuous control:

PPO (on-policy):

  • Collects new data with current policy
  • Throws away data after a few epochs
  • More stable, easier to tune
  • Less sample efficient

SAC (Soft Actor-Critic, off-policy):

  • Stores all experience in replay buffer
  • Reuses old data many times
  • More sample efficient
  • Can be less stable

Rule of thumb:

  • Simulated environments (cheap samples): PPO is simpler
  • Real robots (expensive samples): SAC’s efficiency matters more

Practical Tips

Hard-won lessons from applying policy gradients:

Reward Design

  • Start simple: sparse rewards are often better than complex shaped rewards
  • Avoid reward hacking: test that the policy actually solves the task, not just maximizes reward
  • Normalize rewards: helps with stability

Training Stability

  • Use gradient clipping (max norm ~0.5)
  • Normalize observations (running mean/std; a sketch follows this list)
  • Normalize advantages (per-batch)
  • Start with lower learning rates (3e-4 for Adam)
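
The observation normalization above usually means maintaining running statistics during data collection. A minimal sketch of a running mean/std normalizer (the class name is ours), using the standard parallel-variance update:

import numpy as np


class RunningNormalizer:
    """Tracks a running mean/std and normalizes incoming observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        """Update statistics from a batch of observations, shape (batch, *shape)."""
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta ** 2 * self.count * batch_count / total) / total
        self.mean = self.mean + delta * batch_count / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

Apply normalize before observations reach the actor and critic, and keep calling update as new rollouts arrive. The same machinery is commonly used for reward normalization (dividing rewards by a running standard deviation of returns).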

Debugging

  • Check that the value function is actually learning (plot value loss)
  • Monitor entropy (should decrease but not collapse to zero)
  • Track KL divergence between updates (should stay reasonable; see the diagnostics sketch after this list)
  • Visualize the policy regularly (not just reward curves)
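
The entropy and KL items above can be computed from quantities the ContinuousPPO.update loop already has in hand. A minimal sketch of a diagnostics helper (the function name and the thresholds in the comments are our own rough heuristics):

import torch


def ppo_diagnostics(old_log_probs, new_log_probs, ratio, entropy, clip_epsilon):
    """Cheap per-update health checks for PPO training."""
    with torch.no_grad():
        # Approximate KL between old and current policy; a common heuristic is to
        # stop the epoch loop early if this drifts past roughly 0.01-0.05
        approx_kl = (old_log_probs - new_log_probs).mean().item()
        # Fraction of samples where the clipped objective is active
        clip_frac = ((ratio - 1.0).abs() > clip_epsilon).float().mean().item()
        mean_entropy = entropy.mean().item()
    return {"approx_kl": approx_kl, "clip_frac": clip_frac, "entropy": mean_entropy}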

Scaling Up

  • Parallelism helps: collect experience from multiple environments (see the vectorized-environment sketch after this list)
  • Larger batch sizes are more stable
  • Learning rate scheduling can help fine-tune
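
Parallel data collection is straightforward with Gymnasium's vector API, which steps several environment copies together and returns batched arrays. A minimal sketch using random actions as a stand-in for the policy (per-environment episode bookkeeping omitted):

import gymnasium as gym

# Eight copies of Pendulum stepped in lockstep; observations have shape (8, state_dim)
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(8)])

obs, _ = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()  # replace with batched actions from the policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
envs.close()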

Summary

Key Takeaways

  • Continuous control uses Gaussian policies: output mean and std, sample actions
  • RLHF combines preference learning with PPO to train helpful language models
  • The KL penalty in RLHF prevents reward hacking and maintains capabilities
  • Sim-to-real transfer uses domain randomization to bridge the reality gap
  • Method selection depends on action space, sample efficiency needs, and stability requirements
  • PPO is a good default: stable, simple, and effective across many domains
  • SAC is more sample efficient but can be less stable
  • Practical success requires careful reward design, normalization, and debugging

Section Summary: The Policy Gradient Journey

We’ve traveled from the basic idea of learning policies directly to state-of-the-art algorithms used in production systems:

  1. Introduction: Why learn policies instead of values?
  2. Policy Gradient Theorem: The mathematical foundation—how to differentiate expected return
  3. REINFORCE: The simplest implementation—elegant but high variance
  4. Actor-Critic: Adding a critic for lower variance, trading off some bias
  5. PPO: Clipped objectives for stability—the workhorse of modern RL
  6. Applications: From robots to language models

You now have the conceptual understanding and practical tools to apply policy gradient methods to real problems.

What’s Next

The RL landscape continues to evolve:

  • Off-policy actor-critic (SAC, TD3): Combining the best of policy gradients and Q-learning
  • Model-based RL: Learning world models to plan and improve sample efficiency
  • Offline RL: Learning from fixed datasets without environment interaction
  • Multi-agent RL: Multiple policies learning together
  • Foundation models for RL: Using large pretrained models for better generalization

The tools you’ve learned here—policy parameterization, advantage estimation, trust regions—are the building blocks for understanding these advanced topics.

Exercises

Conceptual Questions

  1. Why is PPO particularly well-suited for RLHF? What properties make it appropriate for fine-tuning language models?

  2. What is the sim-to-real gap? Name three techniques to address it.

  3. When would you choose PPO over SAC? When SAC over PPO?

  4. Why does RLHF need a KL penalty? What could go wrong without it?

Coding Challenges

  1. Implement continuous-action PPO and train on the Pendulum environment. Plot the learning curve and the final policy’s behavior.

  2. Implement a simple reward model that learns from pairwise preferences. Generate synthetic preferences (prefer responses with certain properties), train the reward model, and verify it ranks correctly.

Exploration

  1. Domain randomization experiment. If you have access to a physics simulator (PyBullet, MuJoCo), try training with different amounts of randomization. How does randomization affect training speed? Final performance? Robustness to parameter changes?

  2. Research dive. Pick a recent application of RL in a domain you’re interested in (robotics, games, NLP). What algorithm did they use? What were the key challenges? How did they address them?