Chapter 3

Contextual Bandits

Learn to make personalized decisions based on context features


What You'll Learn

  • Understand how contextual bandits extend the basic bandit problem
  • Learn the role of context features in action selection
  • Implement linear contextual bandits with ε-greedy exploration
  • Recognize real-world applications (recommendations, ads, personalization)
  • See the bridge between bandits and full reinforcement learning

When the Best Action Depends on Who’s Asking

A movie recommendation system faces a fundamental challenge: the best movie to recommend depends on who’s asking. A thriller fan and a comedy lover shouldn’t get the same suggestion. A news website has the same problem—the article that engages one reader might bore another.

This is a contextual bandit: the right action depends on context. And it’s everywhere—personalized ads, news feeds, treatment selection, product recommendations. The same exploration-exploitation tradeoff from multi-armed bandits applies, but now we must learn which action is best for which situation.

Think back to our multi-armed bandit with slot machines. Every time you pulled a lever, you got a reward from the same distribution. The best arm was the same for everyone.

Now imagine the casino has magic slot machines that pay out differently based on who’s playing. A machine that loves red shirts pays more if you’re wearing red. Another pays more to tall people. Suddenly, finding “the best arm” isn’t enough—you need to find the best arm for you.

This is the contextual bandit problem:

  1. You observe context (features about the situation)
  2. You choose an action from a fixed set
  3. You receive a reward that depends on both context and action
  4. Your goal: learn which action is best for each context

The key difference from regular bandits: we’re learning a policy—a mapping from context to action—not just finding a single best action.

The Contextual Bandit Problem

Let’s formalize what we’re solving.

At each round $t$:

  1. The environment presents a context $x_t$ (a vector of features)
  2. The agent chooses an action $a_t$ from a set of actions
  3. The agent receives a reward $r_t$ that depends on both $x_t$ and $a_t$
  4. The agent updates its knowledge to make better decisions

What is context? It’s any information about the current situation that might affect which action is best:

  • For movie recommendations: user age, viewing history, time of day
  • For ad placement: user demographics, page content, device type
  • For medical treatment: patient symptoms, medical history, genetics

What are actions? The choices we can make:

  • For recommendations: which movie/article/product to show
  • For ads: which ad to display
  • For treatment: which medication to prescribe

What is reward? A signal of how good our choice was:

  • For recommendations: did the user click? Watch the whole thing? Rate it highly?
  • For ads: did the user click? Convert to a sale?
  • For treatment: did the patient improve?

∑Mathematical Details

Formally, a contextual bandit is defined by:

  • Context space $\mathcal{X}$: The set of possible contexts (often $\mathbb{R}^d$ for $d$ features)
  • Action space $\mathcal{A}$: A finite set of $K$ actions
  • Reward function: For each context-action pair, there’s a reward distribution $P(r \mid x, a)$

The agent’s goal is to learn a policy $\pi: \mathcal{X} \rightarrow \mathcal{A}$ that maximizes expected reward:

$$\pi^* = \arg\max_\pi \; \mathbb{E}_{x \sim P(x)}\big[\mathbb{E}[r \mid x, \pi(x)]\big]$$

Unlike supervised learning, we only observe the reward for the action we took—not what would have happened if we’d chosen differently. This is bandit feedback.

Why Not Just Use Supervised Learning?

A tempting thought: “I have context and want to predict the best action. That’s classification!” But there’s a crucial difference.

In supervised learning, you have labeled data: input-output pairs where someone tells you the correct answer. You could train a classifier on (user_features, best_movie) pairs.

In contextual bandits, you only learn the reward for the action you took. If you recommended Action Hero 3 and the user didn’t click, you don’t know if they would have clicked on Romantic Comedy 2 or Documentary About Penguins.

This is called partial feedback or bandit feedback:

  • Supervised: You see the label for the correct answer
  • Bandits: You see the reward for your answer only

This matters because:

  1. You can’t just collect a dataset—you need to interact to learn
  2. Exploration is essential—you must try actions to learn about them
  3. The exploration-exploitation tradeoff from regular bandits still applies
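
To make the difference concrete, here is a toy sketch of what a single logged example looks like in each setting (the field names and values are made up for illustration):

</>Implementation
# Hypothetical log entries, for illustration only.

# Supervised learning: every example carries the correct label.
supervised_example = {
    "features": [0.3, 0.9, 0.2],     # user features
    "label": "Romantic Comedy 2",    # someone tells us the right answer
}

# Bandit feedback: we only see the reward for the action we actually took.
bandit_example = {
    "features": [0.3, 0.9, 0.2],
    "action_shown": "Action Hero 3",
    "reward": 0,                     # no click; rewards for the other movies stay unknown
}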

Learning from Context: Linear Models

How do we estimate which action is best for a given context? We need a model that predicts rewards based on context.

The simplest approach: assume rewards are a linear function of context features.

For each action $a$, we learn a weight vector $\theta_a$ such that:

$$\text{Expected reward for action } a \text{ given context } x \approx \theta_a^T x$$

If context $x$ has features [age=25, watches_action=0.8, watches_comedy=0.3], and we’re predicting reward for showing an action movie, our model computes:

$$\theta_{\text{action}}^T x = \theta_1 \cdot 25 + \theta_2 \cdot 0.8 + \theta_3 \cdot 0.3$$

Different actions have different weight vectors, so the same context can yield different predicted rewards for different actions.
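
As a quick numeric illustration, here is that computation for two hypothetical actions sharing the same context (the weight values are made up, not learned):

</>Implementation
import numpy as np

# Context features from the example above: [age, watches_action, watches_comedy].
x = np.array([25.0, 0.8, 0.3])

# Hypothetical weight vectors -- one per action, chosen only for illustration.
theta = {
    "action_movie": np.array([0.01, 1.0, 0.1]),
    "comedy_movie": np.array([0.01, 0.1, 1.0]),
}

for name, theta_a in theta.items():
    print(f"{name}: predicted reward = {theta_a @ x:.2f}")
# Same context, different weights -> different predicted rewards per action.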

∑Mathematical Details

For each action $a \in \mathcal{A}$, we model the expected reward as:

$$Q(x, a) = \theta_a^T x$$

where:

  • $x \in \mathbb{R}^d$ is the context feature vector
  • $\theta_a \in \mathbb{R}^d$ is the learned weight vector for action $a$
  • $Q(x, a)$ is our estimate of expected reward

To choose an action, we use ε-greedy. With probability $1 - \epsilon$, select $\arg\max_a Q(x, a)$. With probability $\epsilon$, select a random action.

To update our model, we use online least squares. After observing $(x_t, a_t, r_t)$:

$$\theta_{a_t} \leftarrow \theta_{a_t} + \alpha \, (r_t - \theta_{a_t}^T x_t) \, x_t$$

This moves $\theta_{a_t}$ in the direction that would have predicted $r_t$ better.
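
As a single worked step of this update rule (the numbers below are made up and not tied to any dataset in this chapter):

</>Implementation
import numpy as np

theta_a = np.zeros(3)               # current weights for the chosen action
x_t = np.array([1.0, 0.5, -0.2])    # observed context
r_t = 1.0                           # observed reward
alpha = 0.1                         # learning rate

prediction = theta_a @ x_t          # 0.0 before any learning
theta_a = theta_a + alpha * (r_t - prediction) * x_t
print(theta_a)                      # [ 0.1   0.05 -0.02], nudged toward predicting r_t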

A Concrete Example: News Article Recommendation

You run a news website. You have 5 articles to recommend. Each user has features:

  • age: normalized between 0-1
  • tech_interest: how much they read tech articles (0-1)
  • politics_interest: how much they read politics (0-1)
  • sports_interest: how much they read sports (0-1)

Your articles are:

  1. “New iPhone Released” (tech)
  2. “Election Updates” (politics)
  3. “Championship Game Results” (sports)
  4. “Celebrity Gossip” (entertainment)
  5. “Stock Market Analysis” (finance)

A user arrives with features [age=0.3, tech=0.9, politics=0.2, sports=0.1]. Your linear models predict:

  • Article 1: $\theta_1^T x = 0.8$ (high tech interest matches)
  • Article 2: $\theta_2^T x = 0.3$ (low politics interest)
  • Article 3: $\theta_3^T x = 0.2$ (low sports interest)
  • …

With ε-greedy, you’d usually show Article 1, but sometimes explore others to refine your model.

Implementation: Linear Contextual Bandit

Let’s implement a complete linear contextual bandit.

</>Implementation
import numpy as np

class LinearContextualBandit:
    """
    Linear contextual bandit with ε-greedy exploration.

    Each action has its own linear model predicting rewards from context.
    """

    def __init__(self, n_actions, n_features, epsilon=0.1, alpha=0.1):
        self.n_actions = n_actions
        self.n_features = n_features
        self.epsilon = epsilon
        self.alpha = alpha

        # One weight vector per action, initialized to zeros
        self.theta = np.zeros((n_actions, n_features))

    def predict(self, context):
        """Predict expected reward for each action given context."""
        # Shape: (n_actions,) - one prediction per action
        return self.theta @ context

    def select_action(self, context):
        """Select action using Îľ-greedy policy."""
        if np.random.random() < self.epsilon:
            # Explore: random action
            return np.random.randint(self.n_actions)
        else:
            # Exploit: best predicted action
            predictions = self.predict(context)
            return np.argmax(predictions)

    def update(self, context, action, reward):
        """Update model for the chosen action based on observed reward."""
        # Prediction error for this action
        prediction = self.theta[action] @ context
        error = reward - prediction

        # Gradient descent update
        self.theta[action] += self.alpha * error * context

Simulating a Personalization Problem

Let’s test our bandit on a simulated recommendation problem:

</>Implementation
class PersonalizationEnvironment:
    """
    Simulated environment where optimal action depends on user features.

    True reward = theta_true[action] @ context + noise
    """

    def __init__(self, n_actions, n_features, noise_std=0.1):
        self.n_actions = n_actions
        self.n_features = n_features
        self.noise_std = noise_std

        # Hidden true weights (the bandit tries to learn these)
        self.theta_true = np.random.randn(n_actions, n_features)

    def get_context(self):
        """Generate a random context (user features)."""
        return np.random.randn(self.n_features)

    def get_reward(self, context, action):
        """Return noisy reward for action in context."""
        expected_reward = self.theta_true[action] @ context
        return expected_reward + np.random.randn() * self.noise_std

    def optimal_action(self, context):
        """Return the truly best action for this context."""
        expected_rewards = self.theta_true @ context
        return np.argmax(expected_rewards)


def run_experiment(n_rounds=1000, n_actions=5, n_features=4):
    """Run a contextual bandit experiment."""
    env = PersonalizationEnvironment(n_actions, n_features)
    bandit = LinearContextualBandit(n_actions, n_features, epsilon=0.1)

    rewards = []
    optimal_actions = []

    for t in range(n_rounds):
        # Get context for this round
        context = env.get_context()

        # Select action
        action = bandit.select_action(context)

        # Get reward
        reward = env.get_reward(context, action)

        # Update model
        bandit.update(context, action, reward)

        # Track metrics
        rewards.append(reward)
        optimal_actions.append(action == env.optimal_action(context))

    return rewards, optimal_actions


# Run and analyze
rewards, optimal = run_experiment()
print(f"Final 100 rounds - Avg reward: {np.mean(rewards[-100:]):.3f}")
print(f"Final 100 rounds - Optimal action rate: {np.mean(optimal[-100:]):.1%}")

Watch what happens as the experiment progresses:

  • Early on, the bandit explores randomly and makes mistakes
  • As it gathers data, it learns which action works best for which contexts
  • Eventually, it converges to selecting near-optimal actions

The exploration parameter ε controls this tradeoff. A higher ε means more exploration but lower average reward in the short term; a lower ε exploits sooner but risks settling on suboptimal actions before the model is accurate.
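
To see the tradeoff directly, a small sweep over ε is one option. This is a rough sketch that reuses LinearContextualBandit and PersonalizationEnvironment from above; exact numbers will vary from run to run because the environment’s weights are random.

</>Implementation
for eps in (0.01, 0.1, 0.3):
    env = PersonalizationEnvironment(n_actions=5, n_features=4)
    bandit = LinearContextualBandit(5, 4, epsilon=eps)
    rewards = []
    for t in range(2000):
        context = env.get_context()
        action = bandit.select_action(context)
        reward = env.get_reward(context, action)
        bandit.update(context, action, reward)
        rewards.append(reward)
    print(f"epsilon={eps}: avg reward over last 500 rounds = {np.mean(rewards[-500:]):.3f}")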

The Value of Context: Personalization Wins

Why bother with context? Let’s compare contextual bandits against regular bandits that ignore context.

Consider two strategies:

  1. Context-blind bandit: Ignores user features, learns one best action for everyone
  2. Contextual bandit: Uses user features, learns different best actions for different users

When different users genuinely prefer different things, the contextual bandit dominates. It can recommend action movies to action fans and comedies to comedy fans, while the context-blind bandit must pick one genre for everyone.

The advantage is largest when:

  • User preferences vary widely
  • Context features are predictive of preferences
  • The action space is diverse

</>Implementation
def compare_contextual_vs_blind(n_rounds=2000):
    """Compare contextual bandit against context-blind bandit."""
    n_actions, n_features = 5, 4

    # Same environment for both
    env = PersonalizationEnvironment(n_actions, n_features)

    # Contextual bandit
    contextual = LinearContextualBandit(n_actions, n_features)

    # Context-blind bandit (standard epsilon-greedy)
    blind_estimates = np.zeros(n_actions)
    blind_counts = np.zeros(n_actions) + 1e-6  # Avoid division by zero

    contextual_rewards = []
    blind_rewards = []

    for t in range(n_rounds):
        context = env.get_context()

        # Contextual bandit decision
        ctx_action = contextual.select_action(context)
        ctx_reward = env.get_reward(context, ctx_action)
        contextual.update(context, ctx_action, ctx_reward)
        contextual_rewards.append(ctx_reward)

        # Context-blind bandit decision (ε-greedy ignoring context)
        if np.random.random() < 0.1:
            blind_action = np.random.randint(n_actions)
        else:
            blind_action = np.argmax(blind_estimates)

        blind_reward = env.get_reward(context, blind_action)

        # Update context-blind estimates (average reward per action)
        blind_counts[blind_action] += 1
        blind_estimates[blind_action] += (
            (blind_reward - blind_estimates[blind_action]) / blind_counts[blind_action]
        )
        blind_rewards.append(blind_reward)

    return contextual_rewards, blind_rewards


# Compare performance
ctx_rewards, blind_rewards = compare_contextual_vs_blind()

# Rolling average for last 500 rounds
print(f"Contextual bandit avg reward: {np.mean(ctx_rewards[-500:]):.3f}")
print(f"Context-blind bandit avg reward: {np.mean(blind_rewards[-500:]):.3f}")

LinUCB: Smarter Exploration

Our ε-greedy approach explores randomly. But can we be smarter—exploring where we’re most uncertain?

Recall UCB from multi-armed bandits: it added an “uncertainty bonus” to each action, favoring actions we’re less sure about.

LinUCB extends this idea to contextual bandits. Instead of just picking the action with highest predicted reward, it picks the action with the highest upper confidence bound:

$$a = \arg\max_a \left[ \underbrace{\theta_a^T x}_{\text{predicted reward}} + \underbrace{\alpha \sqrt{x^T A_a^{-1} x}}_{\text{uncertainty bonus}} \right]$$

The uncertainty bonus is larger when:

  • We’ve seen fewer examples of this action
  • The current context is unlike contexts we’ve seen before for this action

This means LinUCB automatically explores in unfamiliar situations while exploiting in familiar ones.

∑Mathematical Details

LinUCB maintains, for each action $a$:

  • $A_a$: A matrix tracking the “information” gathered about action $a$
  • $b_a$: A vector tracking the rewards we’ve observed

The weight estimate is: $\theta_a = A_a^{-1} b_a$

The uncertainty for context $x$ and action $a$ is: $\sqrt{x^T A_a^{-1} x}$

The update after observing $(x, a, r)$:

  • $A_a \leftarrow A_a + x x^T$
  • $b_a \leftarrow b_a + r \cdot x$

The parameter $\alpha$ controls how much we favor exploration. Higher $\alpha$ means more exploration.

</>Implementation
class LinUCB:
    """
    LinUCB algorithm for contextual bandits.

    Uses upper confidence bounds for principled exploration.
    """

    def __init__(self, n_actions, n_features, alpha=1.0):
        self.n_actions = n_actions
        self.n_features = n_features
        self.alpha = alpha

        # For each action: A matrix and b vector
        self.A = [np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        """Select action with highest UCB."""
        ucbs = []

        for a in range(self.n_actions):
            # Compute theta estimate
            A_inv = np.linalg.inv(self.A[a])
            theta = A_inv @ self.b[a]

            # Predicted reward
            pred = theta @ context

            # Uncertainty bonus
            uncertainty = self.alpha * np.sqrt(context @ A_inv @ context)

            ucbs.append(pred + uncertainty)

        return np.argmax(ucbs)

    def update(self, context, action, reward):
        """Update model for chosen action."""
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

💡Tip

LinUCB typically outperforms ε-greedy, especially early in learning when uncertainty is high. However, it’s more computationally expensive due to matrix inversions. For large feature spaces, consider using Sherman-Morrison updates or diagonal approximations.
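
Here is a sketch of the Sherman-Morrison trick mentioned above. It is not part of the LinUCB class as written; it simply maintains the inverse of A incrementally instead of re-inverting it every round.

</>Implementation
import numpy as np

def sherman_morrison_update(A_inv, x):
    """Return the inverse of (A + x x^T), given the current A_inv and the new context x.

    Rank-one update via the Sherman-Morrison formula. It relies on A being symmetric,
    which holds in LinUCB because A starts at the identity and only accumulates x x^T terms.
    """
    Ax = A_inv @ x
    return A_inv - np.outer(Ax, Ax) / (1.0 + x @ Ax)

# Quick sanity check against a direct inverse.
A_inv = np.eye(3)
x = np.array([0.5, -1.0, 2.0])
A_inv = sherman_morrison_update(A_inv, x)
print(np.allclose(A_inv, np.linalg.inv(np.eye(3) + np.outer(x, x))))  # True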

Real-World Applications

Contextual bandits power many real systems. Here are some prominent examples.

Personalized Recommendations

Netflix, Spotify, YouTube: These platforms face a contextual bandit problem every time you open the app.

  • Context: Your viewing history, time of day, device, current mood (inferred)
  • Actions: Which content to show in the top slot
  • Reward: Did you watch/listen? For how long? Did you rate it highly?

They can’t show you everything, so they must balance:

  • Exploitation: Show content similar to what you’ve liked
  • Exploration: Try new genres to discover hidden preferences

Online Advertising

Google, Facebook, Amazon: Ad placement is the canonical contextual bandit application.

  • Context: User demographics, browsing history, page content
  • Actions: Which ad to display
  • Reward: Click-through, conversion, revenue

The stakes are high—better contextual bandits mean billions in additional revenue. Companies invest heavily in this area.

Clinical Trials and Treatment Selection

Medical applications: Which treatment works best for which patient?

  • Context: Patient features (age, symptoms, genetics, history)
  • Actions: Available treatments
  • Reward: Treatment outcome (improvement, side effects)

This is sometimes called precision medicine. Instead of finding the best treatment on average, we find the best treatment for this patient.

Exploration is particularly delicate here—we don’t want to give suboptimal treatments just to gather data. Techniques like Thompson Sampling can help by quantifying uncertainty.
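
A minimal sketch of that idea for the linear setting used in this chapter, assuming a Gaussian posterior over each action’s weights (an illustrative variant, not code used elsewhere in the chapter):

</>Implementation
import numpy as np

class LinearThompsonSampling:
    """Thompson Sampling sketch: one Bayesian linear model per action."""

    def __init__(self, n_actions, n_features, v=1.0):
        self.v = v  # posterior scale; larger values mean more exploration
        self.A = [np.eye(n_features) for _ in range(n_actions)]
        self.b = [np.zeros(n_features) for _ in range(n_actions)]

    def select_action(self, context):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            mean = A_inv @ b_a
            # Draw a plausible weight vector from the posterior, then act greedily on it.
            theta_sample = np.random.multivariate_normal(mean, self.v ** 2 * A_inv)
            scores.append(theta_sample @ context)
        return int(np.argmax(scores))

    def update(self, context, action, reward):
        self.A[action] += np.outer(context, context)
        self.b[action] += reward * context

Actions with little data have wide posteriors, so they occasionally win the draw; exploration concentrates exactly where uncertainty is highest.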

ℹ️Note

In high-stakes applications like medicine, we often use offline evaluation techniques to estimate policy performance from historical data before deploying new policies. This is an active research area called off-policy evaluation.
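
For a flavor of how such an estimate works, here is a simplified inverse propensity scoring (IPS) estimator. The data format and names are assumptions for illustration: each logged tuple is assumed to record the probability the logging policy assigned to the action it took.

</>Implementation
def ips_estimate(logged_data, new_policy):
    """Estimate a new policy's average reward from logged bandit data via IPS.

    `logged_data` is a list of tuples
    (context, action_taken, reward, prob_of_action_under_logging_policy),
    and `new_policy(context)` returns the action the new policy would choose.
    """
    total = 0.0
    for context, action, reward, prob in logged_data:
        if new_policy(context) == action:
            # Reweight matching rounds by how unlikely the logging policy was to pick them.
            total += reward / prob
    return total / len(logged_data)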

Contextual Bandits vs. Full RL

We’ve progressed from multi-armed bandits (no context) to contextual bandits (context matters). What’s next?

The key assumption in contextual bandits: your action doesn’t affect future contexts.

When you recommend a movie, the next user’s features don’t depend on your recommendation. Each interaction is independent.

But what if actions do affect the future?

  • A chess move changes the board state
  • A robot’s step changes its position
  • A treatment affects the patient’s future health

This is full reinforcement learning: actions change the state, the state shapes future rewards, and good decisions require planning ahead.

Problem            | Context/State | Actions Affect Future?
Multi-armed bandit | None          | N/A
Contextual bandit  | Yes, observed | No
Full RL (MDP)      | Yes, observed | Yes

The techniques we’ve learned—value estimation, exploration-exploitation tradeoff, balancing uncertainty—all carry forward into full RL. Contextual bandits are a stepping stone.


Summary

Key Takeaways

  • Contextual bandits extend multi-armed bandits by conditioning on observable context features
  • We learn a policy (context → action mapping), not just a single best action
  • Linear models provide a simple way to predict rewards from context: $Q(x,a) = \theta_a^T x$
  • Exploration strategies like ε-greedy and LinUCB balance learning and earning
  • Applications are everywhere: recommendations, ads, medical treatment, personalization
  • The key simplification vs. full RL: actions don’t affect future contexts

Exercises

Conceptual Questions

  1. Why can’t we use supervised learning directly for contextual bandits? Explain the difference between full feedback (supervised) and bandit feedback.

  2. When would a contextual bandit outperform a context-blind bandit? Describe a scenario where ignoring context leads to poor performance.

  3. What’s the key difference between contextual bandits and full RL? Why does this difference matter for algorithm design?

Coding Challenges

  1. Implement a movie recommendation simulation: Create an environment with 3 user types (action fans, comedy fans, drama fans) and 5 movies. Show that a contextual bandit learns to recommend appropriately to each type.

  2. Compare ε-greedy and LinUCB: Run both algorithms on the same environment. Plot learning curves. When does LinUCB’s advantage show most clearly?

  3. Feature engineering experiment: Start with a contextual bandit using 2 features. Add 3 irrelevant noise features. How does performance change? What does this tell you about feature selection?

Open-Ended Exploration

  1. Design your own contextual bandit application: Think of a decision problem you encounter regularly where context matters. Define:
    • What context features would you use?
    • What actions are available?
    • How would you measure reward?
    • What makes exploration tricky in your setting?