Chapter 102
📝Draft

Contextual Bandits

Personalized decisions based on context features


What You'll Learn

  • Understand how contextual bandits extend multi-armed bandits with features
  • Formalize the contextual bandit problem mathematically
  • Implement LinUCB for linear reward models
  • Recognize real-world applications in recommendations, advertising, and personalization
  • Understand the bridge from bandits to full reinforcement learning

A news website needs to decide which headline to show each visitor. But here’s the catch: different people like different things. A sports fan wants game scores; a tech enthusiast wants startup news.

This isn’t just about finding the best arm—it’s about finding the best arm for each person. Welcome to contextual bandits.

From One-Size-Fits-All to Personalization

In multi-armed bandits, we sought a single best action. But many real problems have context—features that should inform our decision:

  • User profile: Age, location, browsing history
  • Time of day: Morning news vs evening entertainment
  • Device type: Mobile users want shorter content

Contextual bandits learn to map these features to actions, personalizing decisions while still exploring efficiently.
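The interaction protocol is worth making concrete: each round, the learner observes a context, picks one action, and sees the reward for that action only. Below is a minimal sketch of that loop using a hypothetical toy setup (the `ToyNewsEnv` and `RandomPolicy` names are illustrative, not a standard API); a real learner such as LinUCB would replace the random baseline.

```python
import random

class RandomPolicy:
    """Baseline that ignores context; a learning policy would use it."""
    def __init__(self, n_actions):
        self.n_actions = n_actions
    def choose(self, context):
        return random.randrange(self.n_actions)
    def update(self, context, action, reward):
        pass  # a learning policy would refit its reward model here

class ToyNewsEnv:
    """Two visitor types, two headlines; reward 1 if the headline matches."""
    def observe(self):
        self.visitor = random.randrange(2)  # 0 = sports fan, 1 = tech fan
        return [1.0, 0.0] if self.visitor == 0 else [0.0, 1.0]
    def feedback(self, action):
        # bandit feedback: only the chosen headline's reward is revealed
        return 1.0 if action == self.visitor else 0.0

def run_contextual_bandit(env, policy, n_rounds):
    total = 0.0
    for _ in range(n_rounds):
        context = env.observe()           # features arrive first
        action = policy.choose(context)   # personalized decision
        reward = env.feedback(action)     # partial feedback, one arm only
        policy.update(context, action, reward)
        total += reward
    return total
```

A context-aware policy can earn reward 1 on nearly every round of this toy environment, while the random baseline averages 0.5, which is exactly the gap personalization buys.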

ℹ️Note

Contextual bandits are the workhorse behind modern recommendation systems, online advertising, and personalized medicine. They’re simpler than full RL but more powerful than basic bandits.

Chapter Overview

The Key Insight

📖Contextual Bandit

A sequential decision problem where the agent observes context (features) before choosing an action, and the expected reward depends on both the context and the chosen action.

Contextual bandits bridge the gap between:

  • Simple bandits: One best arm for everyone
  • Full RL: Sequential decisions that change the environment

By conditioning on context, we personalize without needing the full complexity of states and transitions.
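To preview the algorithm covered later, here is a compact sketch of LinUCB with NumPy. It assumes a linear reward model per arm and scores each arm by its ridge-regression estimate plus a confidence-ellipsoid bonus; the toy simulator (`true_theta`, the noise level, `alpha=1.0`) is an illustrative assumption, not part of the algorithm.

```python
import numpy as np

def linucb_choose(x, A, b, alpha):
    """Pick the arm with the highest upper confidence bound for context x."""
    scores = []
    for A_a, b_a in zip(A, b):
        A_inv = np.linalg.inv(A_a)
        theta = A_inv @ b_a                      # ridge estimate of arm weights
        mean = theta @ x                         # predicted reward
        width = alpha * np.sqrt(x @ A_inv @ x)   # confidence-ellipsoid bonus
        scores.append(mean + width)
    return int(np.argmax(scores))

def linucb_update(x, arm, reward, A, b):
    """Fold the observed (context, reward) pair into the chosen arm's stats."""
    A[arm] += np.outer(x, x)
    b[arm] += reward * x

# Toy run: 3 arms, 2-d contexts, linear rewards (simulator is hypothetical).
rng = np.random.default_rng(0)
d, n_arms, alpha = 2, 3, 1.0
true_theta = rng.normal(size=(n_arms, d))   # unknown to the learner
A = [np.eye(d) for _ in range(n_arms)]      # regularized design matrix per arm
b = [np.zeros(d) for _ in range(n_arms)]    # reward-weighted context sums
for t in range(500):
    x = rng.normal(size=d)
    arm = linucb_choose(x, A, b, alpha)
    reward = true_theta[arm] @ x + 0.1 * rng.normal()
    linucb_update(x, arm, reward, A, b)
```

Note that exploration here is deterministic given the statistics: the bonus term shrinks as an arm accumulates data in a given direction of context space, so under-explored arms keep getting tried.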

Prerequisites

This chapter builds on:

  • Multi-Armed Bandits: exploration vs exploitation, and the UCB algorithm that LinUCB extends


Key Takeaways

  • Context allows personalization: different users get different recommendations
  • LinUCB extends UCB to linear reward models with confidence ellipsoids
  • Context is not state: it doesn’t change due to your actions
  • Contextual bandits power real systems: news, ads, medical trials
  • When actions affect future context, you need full RL (MDPs)
Next Chapter: Introduction to MDPs