Contextual Bandits
What You'll Learn
- Understand how contextual bandits extend multi-armed bandits with features
- Formalize the contextual bandit problem mathematically
- Implement LinUCB for linear reward models
- Recognize real-world applications in recommendations, advertising, and personalization
- Understand the bridge from bandits to full reinforcement learning
A news website needs to decide which headline to show each visitor. But here’s the catch: different people like different things. A sports fan wants game scores; a tech enthusiast wants startup news.
This isn’t just about finding the best arm—it’s about finding the best arm for each person. Welcome to contextual bandits.
From One-Size-Fits-All to Personalization
In multi-armed bandits, we sought a single best action. But many real problems have context—features that should inform our decision:
- User profile: Age, location, browsing history
- Time of day: Morning news vs evening entertainment
- Device type: Mobile users want shorter content
Contextual bandits learn to map these features to actions, personalizing decisions while still exploring efficiently.
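The interaction loop above can be sketched as follows. This is an illustrative simulator, not any particular library's API: the names (`run_round`, `true_thetas`) are hypothetical, and we assume arm rewards are linear in the context features for concreteness.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_arms = 3, 2
# Hidden per-arm weight vectors: unknown to the learner, used only
# by the simulator to generate rewards (illustrative assumption).
true_thetas = rng.normal(size=(n_arms, d))

def run_round(policy):
    x = rng.normal(size=d)                           # 1. observe context
    a = policy(x)                                    # 2. choose an arm given x
    r = true_thetas[a] @ x + rng.normal(scale=0.1)   # 3. reward depends on (x, a)
    return x, a, r

# A non-contextual policy ignores x entirely -- exactly what
# contextual bandits improve on:
x, a, r = run_round(lambda x: 0)
```

Note that the reward in step 3 depends on both the context and the chosen arm; a policy that ignores `x` can at best find the single arm that is good on average.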
Contextual bandits are the workhorse behind modern recommendation systems, online advertising, and personalized medicine. They’re simpler than full RL but more powerful than basic bandits.
Chapter Overview
- Why Context Matters: from bandits to personalized decisions
- Linear UCB: contextual exploration with linear models
- Real-World Applications: recommendations, ads, and clinical trials
The Key Insight
A contextual bandit is a sequential decision problem in which the agent observes context (features) before choosing an action, and the expected reward depends on both the context and the chosen action.
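Under the linear-reward assumption that LinUCB will use later in this chapter, the setting can be written compactly as:

```latex
\text{For } t = 1, \dots, T:\quad
\text{observe } x_t \in \mathbb{R}^d,\quad
\text{choose } a_t \in \{1, \dots, K\},\quad
\mathbb{E}[r_t \mid x_t, a_t] = \theta_{a_t}^\top x_t,
```

and the goal is to minimize regret against the best context-dependent choice:

```latex
R_T = \sum_{t=1}^{T} \Big( \max_{a} \theta_a^\top x_t \;-\; \theta_{a_t}^\top x_t \Big).
```

Note that the benchmark inside the sum picks the best arm *for each context* $x_t$, not a single best arm overall; this is what distinguishes contextual regret from ordinary bandit regret.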
Contextual bandits bridge the gap between:
- Simple bandits: One best arm for everyone
- Full RL: Sequential decisions that change the environment
By conditioning on context, we personalize without needing the full complexity of states and transitions.
Prerequisites
This chapter builds on:
- Exploration strategies from Multi-Armed Bandits
- In particular, the UCB algorithm, which LinUCB extends
Key Takeaways
- Context allows personalization: different users get different recommendations
- LinUCB extends UCB to linear reward models with confidence ellipsoids
- Context is not state: your actions don't influence which contexts you observe later
- Contextual bandits power real systems: news, ads, medical trials
- When actions affect future context, you need full RL (MDPs)
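To make the LinUCB takeaway concrete, here is a minimal sketch of the disjoint-model variant (a separate linear model per arm). The class and parameter names are illustrative, not from any particular library: each arm keeps `A_a = I + sum(x x^T)` and `b_a = sum(r x)`, estimates `theta_a = A_a^{-1} b_a`, and scores arms by the estimate plus a confidence-ellipsoid bonus `alpha * sqrt(x^T A_a^{-1} x)`.

```python
import numpy as np

class LinUCB:
    """Minimal disjoint LinUCB sketch (one ridge model per arm)."""

    def __init__(self, n_arms, d, alpha=1.0):
        self.alpha = alpha                                  # exploration strength
        self.A = np.stack([np.eye(d) for _ in range(n_arms)])  # per-arm I + sum x x^T
        self.b = np.zeros((n_arms, d))                      # per-arm sum r * x

    def select(self, x):
        scores = []
        for A_a, b_a in zip(self.A, self.b):
            A_inv = np.linalg.inv(A_a)
            theta = A_inv @ b_a                             # ridge estimate
            bonus = self.alpha * np.sqrt(x @ A_inv @ x)     # confidence width at x
            scores.append(theta @ x + bonus)
        return int(np.argmax(scores))

    def update(self, a, x, r):
        self.A[a] += np.outer(x, x)
        self.b[a] += r * x
```

A usage pattern mirrors the interaction loop from earlier: call `select(x)` on each observed context, then feed the realized reward back with `update(a, x, r)`. Inverting `A_a` on every call is fine for small `d`; production systems typically maintain the inverse incrementally.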