Chapter 205

Policy Gradient Methods in Practice

Apply policy gradient methods to real-world challenges in robotics, RLHF, and beyond


What You'll Learn

  • Apply policy gradient methods to robotics and continuous control tasks
  • Understand how RLHF (Reinforcement Learning from Human Feedback) uses PPO
  • Recognize practical challenges: reward shaping, sim-to-real, sample efficiency
  • Know when to choose policy gradient vs. value-based methods
  • Navigate the landscape of modern RL algorithms

From Theory to Practice

We’ve built up the theory of policy gradients: from the basic motivation through REINFORCE, actor-critic, and PPO.

Now let’s see how these methods are actually used.

Policy gradient methods aren’t just academic exercises—they power some of the most impressive AI systems today:

  • Robotics: Robots learning to walk, manipulate objects, navigate
  • Games: OpenAI Five (Dota 2), AlphaStar (StarCraft II)
  • Language Models: ChatGPT, Claude, and other assistants trained with RLHF
  • Autonomous Systems: Self-driving cars, drone control

The common thread: situations where we need to learn complex behaviors from interaction, often with continuous actions.

Continuous Control: Robotics

Continuous control is where policy gradients shine. Let’s see how to extend our discrete-action PPO to handle continuous actions.

Gaussian Policies for Continuous Actions

Mathematical Details

For continuous actions, we parameterize a Gaussian (normal) distribution:

\pi_\theta(a|s) = \mathcal{N}(\mu_\theta(s), \sigma_\theta^2)

where:

  • \mu_\theta(s) is a neural network mapping state to action mean
  • \sigma_\theta is the standard deviation (can be learned or fixed)

To sample an action: a = \mu_\theta(s) + \sigma_\theta \cdot \epsilon, \quad \epsilon \sim \mathcal{N}(0, I)

Log probability (needed for policy gradients): \log \pi_\theta(a|s) = -\frac{1}{2}\left(\frac{a - \mu_\theta(s)}{\sigma_\theta}\right)^2 - \log \sigma_\theta - \frac{1}{2}\log(2\pi)
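
As a quick sanity check (a standalone sketch, separate from the agent code below), the closed-form log-probability above matches PyTorch's built-in Normal.log_prob:

import math

import torch
from torch.distributions import Normal

mu, std, a = torch.tensor(0.3), torch.tensor(0.5), torch.tensor(0.1)

closed_form = -0.5 * ((a - mu) / std) ** 2 - torch.log(std) - 0.5 * math.log(2 * math.pi)
from_library = Normal(mu, std).log_prob(a)

print(torch.allclose(closed_form, from_library))  # True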

</>Implementation
import torch
import torch.nn as nn
from torch.distributions import Normal


class ContinuousPolicyNetwork(nn.Module):
    """Actor network for continuous actions."""

    def __init__(self, state_dim, action_dim, hidden_dim=64):
        super().__init__()

        self.shared = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.Tanh(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.Tanh()
        )

        # Mean of action distribution
        self.mean = nn.Linear(hidden_dim, action_dim)

        # Log standard deviation (learned, but shared across states)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def forward(self, state):
        features = self.shared(state)
        mean = self.mean(features)
        std = torch.exp(self.log_std)
        return mean, std

    def get_action(self, state, deterministic=False):
        """Sample action from policy."""
        mean, std = self.forward(state)

        if deterministic:
            return mean, None, None

        dist = Normal(mean, std)
        action = dist.sample()

        # For bounded action spaces, often use tanh squashing
        # action = torch.tanh(action)

        log_prob = dist.log_prob(action).sum(dim=-1)
        entropy = dist.entropy().sum(dim=-1)

        return action, log_prob, entropy


class ContinuousPPO:
    """PPO for continuous action spaces."""

    def __init__(self, state_dim, action_dim, lr=3e-4, gamma=0.99,
                 gae_lambda=0.95, clip_epsilon=0.2, epochs=10):
        self.gamma = gamma
        self.gae_lambda = gae_lambda
        self.clip_epsilon = clip_epsilon
        self.epochs = epochs

        self.actor = ContinuousPolicyNetwork(state_dim, action_dim)
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 64),
            nn.Tanh(),
            nn.Linear(64, 64),
            nn.Tanh(),
            nn.Linear(64, 1)
        )

        self.optimizer = torch.optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=lr
        )

    def select_action(self, state):
        """Select action for environment interaction."""
        with torch.no_grad():
            state_tensor = torch.FloatTensor(state).unsqueeze(0)
            action, log_prob, _ = self.actor.get_action(state_tensor)
        return action.squeeze(0).numpy(), log_prob.item()

    def update(self, states, actions, old_log_probs, advantages, returns):
        """PPO update for continuous actions."""
        states = torch.FloatTensor(states)
        actions = torch.FloatTensor(actions)
        old_log_probs = torch.FloatTensor(old_log_probs)
        advantages = torch.FloatTensor(advantages)
        returns = torch.FloatTensor(returns)

        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)

        for _ in range(self.epochs):
            # Get current log probs
            mean, std = self.actor(states)
            dist = Normal(mean, std)
            new_log_probs = dist.log_prob(actions).sum(dim=-1)
            entropy = dist.entropy().sum(dim=-1).mean()

            # PPO clipped loss
            ratio = torch.exp(new_log_probs - old_log_probs)
            clipped_ratio = torch.clamp(ratio, 1 - self.clip_epsilon, 1 + self.clip_epsilon)
            policy_loss = -torch.min(ratio * advantages, clipped_ratio * advantages).mean()

            # Value loss
            values = self.critic(states).squeeze()
            value_loss = nn.functional.mse_loss(values, returns)

            # Combined loss
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy

            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(self.actor.parameters(), 0.5)
            self.optimizer.step()
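
One note on the commented-out tanh squashing in get_action: when actions must lie in a bounded range such as [-1, 1], you can squash the Gaussian sample with tanh, but the log-probability then needs a change-of-variables correction (the same trick SAC uses). A minimal sketch, with a helper name of our own choosing:

import torch
from torch.distributions import Normal


def get_squashed_action(dist, eps=1e-6):
    """Sample a tanh-squashed action and its corrected log-probability."""
    raw_action = dist.rsample()              # reparameterized Gaussian sample
    action = torch.tanh(raw_action)          # squashed into (-1, 1)
    log_prob = dist.log_prob(raw_action).sum(dim=-1)
    # Subtract log|d tanh(u)/du| = log(1 - tanh(u)^2) for each action dimension
    log_prob -= torch.log(1 - action.pow(2) + eps).sum(dim=-1)
    return action, log_prob


# Example: a 2-dimensional action from an arbitrary Gaussian
dist = Normal(torch.zeros(2), torch.ones(2))
action, log_prob = get_squashed_action(dist)

Rescale the squashed action to the environment's actual bounds before stepping.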

Training on Pendulum

</>Implementation
import gymnasium as gym
import numpy as np


def train_continuous_ppo(env_name='Pendulum-v1', total_steps=100000):
    """Train continuous PPO on Pendulum."""
    env = gym.make(env_name)
    state_dim = env.observation_space.shape[0]
    action_dim = env.action_space.shape[0]

    agent = ContinuousPPO(state_dim, action_dim)

    # Collect and train
    n_steps = 2048
    episode_rewards = []
    current_reward = 0
    state, _ = env.reset()

    for step in range(0, total_steps, n_steps):
        states, actions, rewards, dones, log_probs, values = [], [], [], [], [], []

        for _ in range(n_steps):
            action, log_prob = agent.select_action(state)

            # Clip action to valid range
            action_clipped = np.clip(action, env.action_space.low, env.action_space.high)

            next_state, reward, terminated, truncated, _ = env.step(action_clipped)
            done = terminated or truncated

            states.append(state)
            actions.append(action)
            rewards.append(reward)
            dones.append(done)
            log_probs.append(log_prob)

            with torch.no_grad():
                value = agent.critic(torch.FloatTensor(state).unsqueeze(0))
            values.append(value.item())

            current_reward += reward
            if done:
                episode_rewards.append(current_reward)
                current_reward = 0
                state, _ = env.reset()
            else:
                state = next_state

        # Compute advantages and update; bootstrap the rollout with the
        # critic's value of the state it stopped in
        with torch.no_grad():
            last_value = agent.critic(torch.FloatTensor(state).unsqueeze(0)).item()

        advantages, returns = compute_gae(
            torch.FloatTensor(rewards),
            torch.FloatTensor(values + [last_value]),  # Bootstrap value
            torch.BoolTensor(dones),
            agent.gamma,
            agent.gae_lambda
        )

        agent.update(states, actions, log_probs, advantages.numpy(), returns.numpy())

        # Print progress roughly every 10k environment steps
        if episode_rewards and (step // n_steps) % 5 == 0:
            print(f"Step {step}, Avg Reward: {np.mean(episode_rewards[-20:]):.1f}")

    return agent, episode_rewards


# Note: compute_gae is the same function from the PPO chapter
ℹ️Note

For more complex continuous control tasks (HalfCheetah, Ant, Humanoid), you’d typically use larger networks, longer training, and possibly techniques like reward normalization and observation normalization. Libraries like Stable-Baselines3 provide well-tuned implementations.

RLHF: Training Language Models

One of the most impactful applications of policy gradients is RLHF (Reinforcement Learning from Human Feedback), used to train models like ChatGPT.

The RLHF pipeline:

  1. Supervised Fine-Tuning (SFT): Train a language model on high-quality demonstrations
  2. Reward Modeling: Collect human preferences, train a model to predict which outputs humans prefer
  3. PPO Optimization: Use the reward model as a reward signal, optimize the policy (language model) with PPO

The key insight: humans can’t easily write reward functions for “be helpful, harmless, and honest,” but they can compare two outputs and say which is better. RLHF converts these comparisons into a reward signal.

How Reward Models Work

Mathematical Details

Given two responses y_1 and y_2 to the same prompt x, humans label which is better.

The Bradley-Terry model converts preferences to probabilities:

P(y_1 \succ y_2 | x) = \sigma(r_\phi(x, y_1) - r_\phi(x, y_2))

where r_\phi is a learned reward model and \sigma is the sigmoid function.

Training objective (cross-entropy on preferences):

L(\phi) = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l))\right]

where y_w is the preferred (winning) response and y_l is the less preferred one.
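
To make this concrete, here is a minimal sketch of the preference loss in PyTorch. The reward_model argument is a stand-in for any module that scores a prompt/response pair; the function name is ours, not a specific library's API.

import torch.nn.functional as F


def preference_loss(reward_model, prompts, chosen, rejected):
    """Bradley-Terry cross-entropy on pairwise human preferences."""
    r_chosen = reward_model(prompts, chosen)      # scores for preferred responses, shape (batch,)
    r_rejected = reward_model(prompts, rejected)  # scores for rejected responses, shape (batch,)
    # -log sigmoid(r_w - r_l); logsigmoid is the numerically stable form
    return -F.logsigmoid(r_chosen - r_rejected).mean()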

PPO for Language Models

Mathematical Details

Once we have a reward model r_\phi, we optimize the language model policy \pi_\theta using PPO:

J(\theta) = \mathbb{E}_{x \sim D,\, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y) - \beta\, D_\text{KL}(\pi_\theta \| \pi_\text{ref})\right]

The KL penalty prevents the policy from deviating too far from the reference policy \pi_\text{ref} (usually the SFT model).

Why the KL penalty?

  • Prevents reward hacking: finding outputs that fool the reward model
  • Maintains capabilities: the policy shouldn’t forget how to write coherently
  • Provides stability: limits how much the policy can change

Without the KL penalty, the model might find adversarial examples that score high on the reward model but are actually bad. For example, it might learn to always say “I’m happy to help!” because that pattern often correlates with positive preferences, even when it doesn’t actually help.

The KL penalty says: “Don’t stray too far from the original model that we know works.”

</>Implementation
# Conceptual RLHF training loop (simplified)
# In practice, this requires significant infrastructure

def rlhf_training_step(policy, reward_model, reference_policy, prompts, kl_coef=0.1):
    """
    Simplified RLHF training step.

    Args:
        policy: Current language model policy
        reward_model: Trained reward model
        reference_policy: Reference policy (frozen SFT model)
        prompts: Batch of prompts
        kl_coef: KL penalty coefficient
    """
    # Generate responses from current policy
    responses = policy.generate(prompts)

    # Compute rewards from reward model
    rewards = reward_model(prompts, responses)

    # Compute KL penalty
    policy_log_probs = policy.log_prob(prompts, responses)
    ref_log_probs = reference_policy.log_prob(prompts, responses)
    kl_penalty = policy_log_probs - ref_log_probs

    # Adjusted rewards
    adjusted_rewards = rewards - kl_coef * kl_penalty

    # PPO update using adjusted rewards
    # (Standard PPO, treating each token as a "step")
    policy.ppo_update(prompts, responses, adjusted_rewards)

Sim-to-Real Transfer

A major challenge in robotics: training in simulation is cheap and safe, but robots need to work in the real world.

The reality gap: Simulations are never perfect. Physics engines have approximations, sensors have noise patterns that differ from reality, and real-world environments are messier than simulated ones.

A policy trained purely in simulation may fail spectacularly on a real robot.

Solutions:

  1. Domain Randomization: Randomize simulation parameters (friction, mass, delays) so the policy learns to be robust
  2. System Identification: Measure real-world parameters and match the simulation
  3. Fine-tuning: Train in sim, then fine-tune with limited real-world data
  4. Sim-to-Real Adaptation: Learn to transfer, not just train
</>Implementation
# Domain randomization example
import numpy as np


class RandomizedEnv:
    """Environment wrapper with domain randomization."""

    def __init__(self, base_env, randomize_every=1):
        self.base_env = base_env
        self.randomize_every = randomize_every
        self.episode_count = 0

    def reset(self):
        self.episode_count += 1

        if self.episode_count % self.randomize_every == 0:
            self.randomize_dynamics()

        return self.base_env.reset()

    def randomize_dynamics(self):
        """Randomize physics parameters."""
        # Example randomizations (environment-specific)
        self.base_env.model.body_mass[:] *= np.random.uniform(0.8, 1.2)
        self.base_env.model.dof_damping[:] *= np.random.uniform(0.5, 1.5)
        self.base_env.model.geom_friction[:] *= np.random.uniform(0.7, 1.3)

    def step(self, action):
        # Add action noise to simulate actuator imprecision
        noisy_action = action + np.random.normal(0, 0.02, size=action.shape)
        return self.base_env.step(noisy_action)
💡Tip

The key insight of domain randomization: if the policy works across many simulated variations, it’s more likely to work in reality (which is just another variation it hasn’t seen).

Method Selection Guide

When should you use policy gradients vs. other approaches?

| Scenario | Recommended | Why |
|---|---|---|
| Discrete actions, moderate state space | DQN | Sample efficient, off-policy |
| Continuous actions | PPO or SAC | Natural handling of continuous actions |
| High-dimensional observations | PPO or SAC | Both scale well with deep networks |
| Sample efficiency critical | SAC or model-based | Off-policy reuses data; models can plan |
| Stability critical | PPO | Robust to hyperparameters |
| Human feedback available | PPO (RLHF) | Stable, handles preference learning well |
| Real-time constraints | Simpler methods | PPO can be slow for real-time control |
| Offline data only | CQL, IQL, DT | Can't interact, need offline methods |

PPO vs. SAC

Two popular choices for continuous control:

PPO (on-policy):

  • Collects new data with current policy
  • Throws away data after a few epochs
  • More stable, easier to tune
  • Less sample efficient

SAC (Soft Actor-Critic, off-policy):

  • Stores all experience in replay buffer
  • Reuses old data many times
  • More sample efficient
  • Can be less stable

Rule of thumb:

  • Simulated environments (cheap samples): PPO is simpler
  • Real robots (expensive samples): SAC’s efficiency matters more

Practical Tips

Hard-won lessons from applying policy gradients:

Reward Design

  • Start simple: sparse rewards are often better than complex shaped rewards
  • Avoid reward hacking: test that the policy actually solves the task, not just maximizes reward
  • Normalize rewards: helps with stability

Training Stability

  • Use gradient clipping (max norm ~0.5)
  • Normalize observations (running mean/std; a sketch follows this list)
  • Normalize advantages (per-batch)
  • Start with lower learning rates (3e-4 for Adam)
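
The observation normalization above usually means maintaining running statistics during data collection. A minimal sketch of a running mean/std normalizer (the class name is ours), using the standard parallel-variance update:

import numpy as np


class RunningNormalizer:
    """Tracks a running mean/std and normalizes incoming observations."""

    def __init__(self, shape, eps=1e-8):
        self.mean = np.zeros(shape)
        self.var = np.ones(shape)
        self.count = eps

    def update(self, x):
        """Update statistics from a batch of observations, shape (batch, *shape)."""
        batch_mean, batch_var, batch_count = x.mean(axis=0), x.var(axis=0), x.shape[0]
        delta = batch_mean - self.mean
        total = self.count + batch_count
        self.var = (self.var * self.count + batch_var * batch_count
                    + delta ** 2 * self.count * batch_count / total) / total
        self.mean = self.mean + delta * batch_count / total
        self.count = total

    def normalize(self, x):
        return (x - self.mean) / np.sqrt(self.var + 1e-8)

Apply normalize before observations reach the actor and critic, and keep calling update as new rollouts arrive. The same machinery is commonly used for reward normalization (dividing rewards by a running standard deviation of returns).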

Debugging

  • Check that the value function is actually learning (plot value loss)
  • Monitor entropy (should decrease but not collapse to zero)
  • Track KL divergence between updates (should stay reasonable; see the diagnostics sketch after this list)
  • Visualize the policy regularly (not just reward curves)
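
The entropy and KL items above can be computed from quantities the ContinuousPPO.update loop already has in hand. A minimal sketch of a diagnostics helper (the function name and the thresholds in the comments are our own rough heuristics):

import torch


def ppo_diagnostics(old_log_probs, new_log_probs, ratio, entropy, clip_epsilon):
    """Cheap per-update health checks for PPO training."""
    with torch.no_grad():
        # Approximate KL between old and current policy; a common heuristic is to
        # stop the epoch loop early if this drifts past roughly 0.01-0.05
        approx_kl = (old_log_probs - new_log_probs).mean().item()
        # Fraction of samples where the clipped objective is active
        clip_frac = ((ratio - 1.0).abs() > clip_epsilon).float().mean().item()
        mean_entropy = entropy.mean().item()
    return {"approx_kl": approx_kl, "clip_frac": clip_frac, "entropy": mean_entropy}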

Scaling Up

  • Parallelism helps: collect experience from multiple environments (see the vectorized-environment sketch after this list)
  • Larger batch sizes are more stable
  • Learning rate scheduling can help fine-tune
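
Parallel data collection is straightforward with Gymnasium's vector API, which steps several environment copies together and returns batched arrays. A minimal sketch using random actions as a stand-in for the policy (per-environment episode bookkeeping omitted):

import gymnasium as gym

# Eight copies of Pendulum stepped in lockstep; observations have shape (8, state_dim)
envs = gym.vector.SyncVectorEnv([lambda: gym.make("Pendulum-v1") for _ in range(8)])

obs, _ = envs.reset(seed=0)
for _ in range(100):
    actions = envs.action_space.sample()  # replace with batched actions from the policy
    obs, rewards, terminations, truncations, infos = envs.step(actions)
envs.close()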

Summary

Key Takeaways

  • Continuous control uses Gaussian policies: output mean and std, sample actions
  • RLHF combines preference learning with PPO to train helpful language models
  • The KL penalty in RLHF prevents reward hacking and maintains capabilities
  • Sim-to-real transfer uses domain randomization to bridge the reality gap
  • Method selection depends on action space, sample efficiency needs, and stability requirements
  • PPO is a good default: stable, simple, and effective across many domains
  • SAC is more sample efficient but can be less stable
  • Practical success requires careful reward design, normalization, and debugging

Section Summary: The Policy Gradient Journey

We’ve traveled from the basic idea of learning policies directly to state-of-the-art algorithms used in production systems:

  1. Introduction: Why learn policies instead of values?
  2. Policy Gradient Theorem: The mathematical foundation—how to differentiate expected return
  3. REINFORCE: The simplest implementation—elegant but high variance
  4. Actor-Critic: Adding a critic for lower variance, trading off some bias
  5. PPO: Clipped objectives for stability—the workhorse of modern RL
  6. Applications: From robots to language models

You now have the conceptual understanding and practical tools to apply policy gradient methods to real problems.

What’s Next

The RL landscape continues to evolve:

  • Off-policy actor-critic (SAC, TD3): Combining the best of policy gradients and Q-learning
  • Model-based RL: Learning world models to plan and improve sample efficiency
  • Offline RL: Learning from fixed datasets without environment interaction
  • Multi-agent RL: Multiple policies learning together
  • Foundation models for RL: Using large pretrained models for better generalization

The tools you’ve learned here—policy parameterization, advantage estimation, trust regions—are the building blocks for understanding these advanced topics.

Exercises

Conceptual Questions

  1. Why is PPO particularly well-suited for RLHF? What properties make it appropriate for fine-tuning language models?

  2. What is the sim-to-real gap? Name three techniques to address it.

  3. When would you choose PPO over SAC? When SAC over PPO?

  4. Why does RLHF need a KL penalty? What could go wrong without it?

Coding Challenges

  1. Implement continuous-action PPO and train on the Pendulum environment. Plot the learning curve and the final policy’s behavior.

  2. Implement a simple reward model that learns from pairwise preferences. Generate synthetic preferences (prefer responses with certain properties), train the reward model, and verify it ranks correctly.

Exploration

  1. Domain randomization experiment. If you have access to a physics simulator (PyBullet, MuJoCo), try training with different amounts of randomization. How does randomization affect training speed? Final performance? Robustness to parameter changes?

  2. Research dive. Pick a recent application of RL in a domain you’re interested in (robotics, games, NLP). What algorithm did they use? What were the key challenges? How did they address them?