Bandit Problems • Part 1 of 3

Why Context Matters

From bandits to personalized decisions

In multi-armed bandits, we searched for a single best arm—one action that’s optimal for everyone. But many real problems have a crucial missing ingredient: context. Different situations call for different decisions.

A news website shouldn’t show the same headline to everyone. A sports fan wants game scores; a tech enthusiast wants startup news. The best arm depends on who’s asking.

Welcome to contextual bandits—where we learn to personalize decisions.

The Limitation of One-Size-Fits-All

Consider running a news website. You have 5 headline slots and need to choose which article to feature. In the standard bandit framework:

  • Run the site for a week
  • Track which headlines get the most clicks
  • Converge on the “best” headline
  • Show it to everyone

But this ignores a massive opportunity. Different users have different interests:

| User Type | Best Article Category | Click Rate on Sports | Click Rate on Tech |
| --- | --- | --- | --- |
| Sports fan | Sports | 80% | 10% |
| Tech enthusiast | Tech | 20% | 70% |
| General reader | Either | 40% | 40% |

A standard bandit might conclude “Sports headlines get 50% clicks overall, Tech gets 40%—show Sports to everyone.”

But if you could identify the user type first, you could show Sports to sports fans (80% clicks) and Tech to tech enthusiasts (70% clicks). That’s much better than 50% for everyone!

📌The Cost of Ignoring Context

Let’s quantify the opportunity cost:

Standard Bandit (one arm for all):

  • Population: 50% sports fans, 50% tech enthusiasts
  • Best overall: Sports (50% sports fans x 80% + 50% tech enthusiasts x 20% = 50% average CTR)
  • Show Sports to everyone: 50% CTR

Contextual Bandit (personalized):

  • Sports fans see Sports: 80% CTR
  • Tech enthusiasts see Tech: 70% CTR
  • Overall: 50% x 80% + 50% x 70% = 75% CTR

By using context, we improved CTR from 50% to 75%—a 50% relative improvement! This is the power of personalization.
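The arithmetic above is easy to verify in a few lines (the population split and click rates are the example's assumed numbers):

```python
# Numbers from the example: half sports fans, half tech enthusiasts.
p_sports_fan, p_tech_fan = 0.5, 0.5

# One-size-fits-all: everyone sees Sports.
standard_ctr = p_sports_fan * 0.80 + p_tech_fan * 0.20

# Personalized: each group sees its preferred category.
contextual_ctr = p_sports_fan * 0.80 + p_tech_fan * 0.70

relative_gain = (contextual_ctr - standard_ctr) / standard_ctr
print(standard_ctr, contextual_ctr, relative_gain)  # 0.5 0.75 0.5
```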

Introducing Context

📖Context (Feature Vector)

A vector of features $x \in \mathbb{R}^d$ that describes the current situation. Context is observed before making a decision and helps predict which action will be best.

Context is any information available before you choose an action:

User context (who is this?):

  • Demographics (age, location)
  • Past behavior (browsing history, purchase history)
  • Device type (mobile, desktop)
  • Time of visit

Situational context (what’s happening?):

  • Time of day, day of week
  • Current weather
  • Trending topics
  • Recent site activity

Item context (what are the options?):

  • Article category, length, recency
  • Product features, price, ratings
  • Ad creative details

The key insight: context helps us predict how good each action will be for this specific situation.

The Contextual Bandit Framework

Mathematical Details

A contextual bandit problem consists of:

  1. Context space $\mathcal{X} \subseteq \mathbb{R}^d$: The set of possible contexts
  2. Action space $\mathcal{A} = \{1, \ldots, k\}$: The set of arms/actions
  3. Reward function: For each arm $a$, there's a function $f_a: \mathcal{X} \to \mathbb{R}$ that determines the expected reward given context

At each round tt:

  1. Environment generates context $x_t \in \mathcal{X}$
  2. Agent observes $x_t$ and selects action $a_t \in \mathcal{A}$
  3. Agent receives reward $r_t = f_{a_t}(x_t) + \varepsilon_t$, where $\varepsilon_t$ is zero-mean noise

The goal is to maximize cumulative reward: $\sum_{t=1}^T r_t = \sum_{t=1}^T \left( f_{a_t}(x_t) + \varepsilon_t \right)$

Or equivalently, minimize regret relative to always picking the best arm for each context: $\text{Regret}(T) = \sum_{t=1}^T \left[ \max_a f_a(x_t) - f_{a_t}(x_t) \right]$

The key difference from standard bandits:

  • Standard bandit: Find the single best arm
  • Contextual bandit: Find the best arm for each context

We're learning a policy $\pi: \mathcal{X} \to \mathcal{A}$ that maps contexts to actions, not just picking a single action.
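As a concrete illustration of this loop, here is a toy simulation with two discrete contexts and two arms, using a hypothetical context-ignoring random policy as a baseline; the true click rates are assumptions chosen to mirror the news example:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy problem: two user types (contexts) and two arms with assumed true CTRs.
true_ctr = np.array([[0.8, 0.1],   # sports fan: P(click | Sports), P(click | Tech)
                     [0.2, 0.7]])  # tech enthusiast

def random_policy(context):
    """Context-ignoring baseline: pick an arm uniformly at random."""
    return rng.integers(2)

T = 1000
regret = 0.0
for t in range(T):
    x = rng.integers(2)                       # 1. environment generates a context
    a = random_policy(x)                      # 2. agent observes it, selects an action
    r = float(rng.random() < true_ctr[x, a])  # 3. Bernoulli reward with mean f_a(x)
    regret += true_ctr[x].max() - true_ctr[x, a]

print(regret / T)  # per-round regret of ignoring context; roughly 0.3 here
```

A policy that matched each context to its best arm would drive this per-round regret toward zero; that is exactly what the algorithms in the next sections learn to do.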

Linear Reward Models

How do we model the relationship between context and reward? The simplest approach: assume rewards are linear in the features.

Expected reward for arm $a$ with context $x$: $x^\top \theta_a$

where $\theta_a \in \mathbb{R}^d$ is a weight vector for arm $a$.

For example, if context is $x = [\text{age}, \text{is\_mobile}, \text{time\_of\_day}]$:

  • Sports arm might have $\theta_{\text{sports}} = [0.02, -0.1, 0.05]$
  • Tech arm might have $\theta_{\text{tech}} = [-0.01, 0.15, 0.02]$

Then a 25-year-old on mobile at noon would have:

  • Sports expected reward: $25 \times 0.02 + 1 \times (-0.1) + 12 \times 0.05 = 1.0$
  • Tech expected reward: $25 \times (-0.01) + 1 \times 0.15 + 12 \times 0.02 = 0.14$

Sports is predicted to be better for this user.
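The same calculation in a few lines of NumPy (the weight vectors are the illustrative values above, not learned parameters):

```python
import numpy as np

x = np.array([25.0, 1.0, 12.0])              # [age, is_mobile, time_of_day]
theta_sports = np.array([0.02, -0.10, 0.05])
theta_tech = np.array([-0.01, 0.15, 0.02])

# Expected reward is a dot product: x @ theta_a
print(round(x @ theta_sports, 2))  # 1.0
print(round(x @ theta_tech, 2))    # 0.14
```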

Mathematical Details

Formally, we assume:

$$\mathbb{E}[r_t \mid x_t, a_t] = x_t^\top \theta_{a_t}$$

This is the linear reward model. Each arm has its own parameter vector $\theta_a$, and we learn these from data.

Given observations $(x_1, a_1, r_1), \ldots, (x_n, a_n, r_n)$, we can estimate $\theta_a$ using ridge regression on the data where arm $a$ was selected:

$$\hat{\theta}_a = \left(\sum_{t: a_t = a} x_t x_t^\top + \lambda I\right)^{-1} \sum_{t: a_t = a} r_t x_t$$

The regularization term $\lambda I$ prevents overfitting (and keeps the matrix invertible) when we have few samples.
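A minimal sketch of this estimator on simulated data for a single arm; the dimension, sample count, noise level, and $\lambda$ are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 5, 200, 1.0

# Ground-truth weights for one arm (unknown to the learner).
theta_true = rng.normal(size=d)

# Simulated (context, reward) pairs from rounds where this arm was pulled.
X = rng.normal(size=(n, d))
r = X @ theta_true + rng.normal(scale=0.1, size=n)

# Ridge estimate: (sum_t x_t x_t^T + lambda I)^{-1} sum_t r_t x_t
A = X.T @ X + lam * np.eye(d)
theta_hat = np.linalg.solve(A, X.T @ r)

print(np.abs(theta_hat - theta_true).max())  # small estimation error
```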

</>Implementation
import numpy as np

class ContextualBandit:
    """
    A contextual bandit environment with linear rewards.

    Each arm has a true weight vector. Rewards are linear
    in features plus Gaussian noise.
    """

    def __init__(self, n_arms: int, n_features: int, noise_std: float = 0.1):
        self.n_arms = n_arms
        self.n_features = n_features
        self.noise_std = noise_std

        # True weight vectors (unknown to the agent)
        self.theta = np.random.randn(n_arms, n_features)
        # Scale so expected rewards have roughly unit magnitude
        self.theta = self.theta / (n_features ** 0.5)

    def get_context(self) -> np.ndarray:
        """Generate a random context vector."""
        return np.random.randn(self.n_features)

    def get_reward(self, context: np.ndarray, arm: int) -> float:
        """Return noisy reward for selecting arm given context."""
        expected = context @ self.theta[arm]
        return expected + np.random.randn() * self.noise_std

    def get_optimal_arm(self, context: np.ndarray) -> int:
        """Return the best arm for this context (for evaluation)."""
        expected_rewards = context @ self.theta.T
        return int(np.argmax(expected_rewards))

    def get_optimal_reward(self, context: np.ndarray) -> float:
        """Return the expected reward of the optimal arm."""
        expected_rewards = context @ self.theta.T
        return np.max(expected_rewards)


# Example usage
bandit = ContextualBandit(n_arms=5, n_features=10)

# Simulate some interactions
for _ in range(10):
    context = bandit.get_context()
    optimal = bandit.get_optimal_arm(context)
    rewards = [bandit.get_reward(context, a) for a in range(bandit.n_arms)]
    print(f"Optimal arm: {optimal}, Rewards: {[f'{r:.2f}' for r in rewards]}")

Context is NOT State

Here’s the key test: Does your action affect the next context you see?

Contextual bandit (action doesn’t affect future):

  • Showing an ad to a user
  • Recommending a movie to a random visitor
  • Selecting a news headline
  • Choosing a treatment for independent patients

Full RL (action affects future):

  • Playing a game (moves change the board)
  • Robot navigation (position changes)
  • Dialogue systems (conversation evolves)
  • Long-term recommendations (preferences shift)

Why Not Separate Bandits?

You might wonder: “Why not just run a separate bandit for each possible context?”

The problem is that context is usually continuous or high-dimensional. If your context is a 100-dimensional user embedding, you'll never see the exact same context twice!

We need to generalize across contexts. If two users have similar features, their preferences are probably similar. This is where function approximation (like linear models) comes in—we learn smooth relationships between context and reward.

📌The Generalization Advantage

Suppose you have:

  • 1 million possible users (contexts)
  • 10 arms
  • 10,000 total impressions

Separate bandits approach:

  • Each user-arm combination needs its own estimate
  • 10 million parameters to learn
  • Most users seen only once—no learning possible!

Contextual bandit with 50-dimensional features:

  • 10 arms x 50 features = 500 parameters total
  • Learning generalizes across similar users
  • Even users seen once benefit from information about similar users

The linear model lets us transfer knowledge: “Users who like sports also tend to like fitness articles” is captured in the learned weights.

Exploration in Context Space

Exploration becomes more nuanced with context. In standard bandits, we asked: “Should I try this arm more?”

In contextual bandits, we ask: “Should I try this arm for contexts like this one?”

An arm might be well-understood for some contexts but not others. If we’ve never seen a teenager on mobile asking for news, we’re uncertain about which headline they’d prefer—even if we’ve learned a lot about other demographics.

This leads to algorithms like LinUCB that explore based on uncertainty in predictions for the current context, not just global arm uncertainty.

Mathematical Details

In the linear model, our uncertainty about the expected reward $x^\top \theta_a$ depends on:

  1. How well we've estimated $\theta_a$: More samples = lower uncertainty
  2. How similar $x$ is to past contexts: If $x$ is unlike anything we've seen, we're uncertain

The covariance matrix $A_a^{-1}$ captures both:

$$\text{Var}(x^\top \hat{\theta}_a) \propto x^\top A_a^{-1} x$$

where $A_a = \sum_{t: a_t = a} x_t x_t^\top + \lambda I$.

High variance means high uncertainty—and that’s where we should explore.
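A small synthetic demonstration of how $x^\top A_a^{-1} x$ behaves: directions similar to past contexts get low predicted variance, while a never-observed direction stays uncertain (all numbers here are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
d, lam = 3, 1.0

# Past contexts for this arm all point roughly along [1, 0, 0].
X_seen = np.array([1.0, 0.0, 0.0]) + rng.normal(scale=0.05, size=(100, d))
A = X_seen.T @ X_seen + lam * np.eye(d)
A_inv = np.linalg.inv(A)

x_familiar = np.array([1.0, 0.0, 0.0])  # similar to past contexts
x_novel = np.array([0.0, 1.0, 0.0])     # a direction we've never observed

unc_familiar = x_familiar @ A_inv @ x_familiar
unc_novel = x_novel @ A_inv @ x_novel
print(unc_familiar)  # low: a well-explored direction
print(unc_novel)     # high: this is where to explore
```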

Summary: From Bandits to Contextual Bandits

| Aspect | Multi-Armed Bandits | Contextual Bandits |
| --- | --- | --- |
| Goal | Find one best arm | Find best arm per context |
| Input | Nothing (same each round) | Context vector $x$ |
| Output | Action $a$ | Policy $\pi(x) \to a$ |
| Reward model | $\mu_a$ (constant per arm) | $f_a(x)$ (function of context) |
| Exploration | Per-arm uncertainty | Per-context uncertainty |
| Generalization | None needed | Across similar contexts |

The key insight: contextual bandits learn to personalize decisions by using features that predict which action will be best. This is the foundation of modern recommendation systems, ad selection, and many other applications.

In the next section, we’ll see LinUCB—the workhorse algorithm for contextual bandits that elegantly extends UCB to handle context.

ℹ️Note

Contextual bandits sit between simple bandits and full RL. They’re more powerful than bandits (they personalize) but simpler than RL (no sequential dependencies). This sweet spot makes them extremely practical for real-world applications.