Linear UCB (LinUCB)
LinUCB is the workhorse algorithm for contextual bandits. It extends the UCB principle—“optimism in the face of uncertainty”—to handle context-dependent rewards. The key insight: not only do we want to explore arms we’re uncertain about, we want to explore arms we’re uncertain about for the current context.
From UCB to LinUCB
Recall UCB for standard bandits:

$$a_t = \arg\max_a \left[ \hat{\mu}_a + \sqrt{\frac{2 \ln t}{n_a}} \right]$$

We picked the arm with the highest “optimistic estimate”—the estimated value plus an exploration bonus based on uncertainty.
LinUCB does the same thing, but now:
- The estimated value depends on context: $\hat{r}_a(x) = x^\top \hat{\theta}_a$
- The uncertainty depends on context too: how well do we know the reward for this specific context?
Arms that we’re uncertain about for the current context get an exploration bonus, even if we’ve tried them many times for other contexts.
The LinUCB Algorithm
For each arm $a$, maintain a linear model that predicts reward from context. Select the arm with the highest upper confidence bound: estimated reward plus an exploration bonus based on prediction uncertainty.

For each arm $a$, we maintain:

- $A_a$: Regularized design matrix (starts as $\lambda I$)
- $b_a$: Reward-weighted context sum (starts as $\mathbf{0}$)
- $\hat{\theta}_a = A_a^{-1} b_a$: Current parameter estimate
At each round $t$ with context $x_t$:

- For each arm $a$, compute:
  - Estimated reward: $\hat{r}_a = x_t^\top \hat{\theta}_a$
  - Prediction variance: $\sigma_a^2 = x_t^\top A_a^{-1} x_t$
  - UCB value: $p_a = \hat{r}_a + \alpha \sqrt{x_t^\top A_a^{-1} x_t}$
- Select: $a_t = \arg\max_a p_a$
- Observe reward $r_t$ and update:
  - $A_{a_t} \leftarrow A_{a_t} + x_t x_t^\top$
  - $b_{a_t} \leftarrow b_{a_t} + r_t x_t$
The exploration parameter $\alpha$ controls the width of the confidence bound.
Let’s unpack the key quantities:
$A_a$: This matrix accumulates information about which directions in feature space we’ve explored for arm $a$. It starts as identity (regularization) and grows as we see more data.

$\hat{\theta}_a = A_a^{-1} b_a$: This is the ridge regression solution—the weight vector that best predicts rewards from context for arm $a$.

$x^\top A_a^{-1} x$: This is the prediction variance—how uncertain we are about the reward for this specific context. It’s small when $x$ is similar to contexts we’ve seen before, large when $x$ is in an unexplored direction.
The UCB value is optimistic: we assume the arm might perform as well as the upper end of our confidence interval.
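As a sanity check on these quantities, the incrementally accumulated statistics $A$ and $b$ recover exactly the batch ridge regression solution. A minimal sketch with made-up random data:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n, lam = 3, 50, 1.0
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Incremental sufficient statistics, as LinUCB maintains them.
A = lam * np.eye(d)
b = np.zeros(d)
for x, r in zip(X, y):
    A += np.outer(x, x)
    b += r * x
theta_incremental = np.linalg.inv(A) @ b

# Batch ridge regression closed form: (X^T X + lam I) theta = X^T y.
theta_batch = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

assert np.allclose(theta_incremental, theta_batch)
```

The two solutions agree because $A = \lambda I + X^\top X$ and $b = X^\top y$ are exactly the ridge normal-equation terms.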
Visualizing Confidence Ellipsoids
In 2D feature space, the uncertainty about $\theta_a$ forms an ellipsoid. The shape tells us which directions we’re confident about.

Imagine you’re estimating a weight vector $\theta = (\theta_1, \theta_2)$:

- If you’ve seen many contexts with high $x_1$ and low $x_2$, you’re confident about $\theta_1$ but not $\theta_2$
- The ellipsoid is stretched in the $\theta_2$ direction (high uncertainty) and compressed in $\theta_1$ (low uncertainty)
For a new context $x$:

- If $x$ points in a well-explored direction, $x^\top A^{-1} x$ is small
- If $x$ points in an unexplored direction, $x^\top A^{-1} x$ is large
This is how LinUCB knows where to explore: contexts that project onto unexplored feature directions get higher exploration bonuses.
Suppose we have 2D contexts with $\lambda = 1$ and we’ve observed:

- Context $[1, 0]$ many times (say 100)
- Context $[0, 1]$ only once

Our $A$ matrix might look like:

$$A = \begin{bmatrix} 101 & 0 \\ 0 & 2 \end{bmatrix}, \qquad A^{-1} = \begin{bmatrix} 1/101 & 0 \\ 0 & 1/2 \end{bmatrix}$$

For a new context:

- $x = [1, 0]$: variance $= 1/101 \approx 0.01$ (low uncertainty)
- $x = [0, 1]$: variance $= 1/2 = 0.5$ (high uncertainty)
- $x = [1, 1]$: variance $= 1/101 + 1/2 \approx 0.51$ (mixed)

LinUCB will give a larger exploration bonus to arms when the context points in the less-explored direction.
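A quick numerical check of this example, assuming $\lambda = 1$, 100 observations of $[1, 0]$, and one observation of $[0, 1]$:

```python
import numpy as np

# Build A = I + 100 * [1,0][1,0]^T + [0,1][0,1]^T.
A = np.eye(2)
A += 100 * np.outer([1.0, 0.0], [1.0, 0.0])  # 100 pulls along x1
A += np.outer([0.0, 1.0], [0.0, 1.0])        # 1 pull along x2
A_inv = np.linalg.inv(A)

# Prediction variance x^T A^{-1} x for three test contexts.
variances = {}
for x in ([1.0, 0.0], [0.0, 1.0], [1.0, 1.0]):
    x = np.asarray(x)
    variances[tuple(x)] = x @ A_inv @ x

print(variances)  # [1,0] -> 1/101, [0,1] -> 1/2, [1,1] -> 1/101 + 1/2
```

Because $A$ is diagonal here, the variance of $[1, 1]$ is just the sum of the variances along each axis.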
Implementation
```python
import numpy as np


class LinUCBArm:
    """
    Linear UCB model for a single arm.

    Maintains sufficient statistics for ridge regression
    and computes UCB values for contexts.
    """

    def __init__(self, d: int, alpha: float = 1.0, lambda_reg: float = 1.0):
        """
        Initialize LinUCB arm.

        Args:
            d: Dimension of context vectors
            alpha: Exploration parameter
            lambda_reg: Regularization parameter
        """
        self.d = d
        self.alpha = alpha
        # A = lambda * I + sum of outer products
        self.A = lambda_reg * np.eye(d)
        # b = sum of reward-weighted contexts
        self.b = np.zeros(d)
        # Cached inverse and theta (recomputed on update)
        self._A_inv = np.linalg.inv(self.A)
        self._theta = np.zeros(d)

    @property
    def theta(self) -> np.ndarray:
        """Current parameter estimate."""
        return self._theta

    def get_ucb(self, context: np.ndarray) -> float:
        """
        Compute UCB value for this arm given context.

        Returns:
            Estimated reward plus exploration bonus
        """
        # Estimated reward
        mean = context @ self._theta
        # Prediction variance and standard deviation
        variance = context @ self._A_inv @ context
        std = np.sqrt(max(variance, 0))  # Numerical stability
        return mean + self.alpha * std

    def get_expected_reward(self, context: np.ndarray) -> float:
        """Return estimated reward without exploration bonus."""
        return context @ self._theta

    def update(self, context: np.ndarray, reward: float):
        """
        Update model with new observation.

        Args:
            context: Context vector for this observation
            reward: Observed reward
        """
        # Update sufficient statistics
        self.A += np.outer(context, context)
        self.b += reward * context
        # Recompute inverse and theta
        self._A_inv = np.linalg.inv(self.A)
        self._theta = self._A_inv @ self.b


class LinUCB:
    """
    LinUCB algorithm for contextual bandits.

    Each arm has its own linear model (disjoint LinUCB).
    """

    def __init__(self, n_arms: int, d: int, alpha: float = 1.0):
        """
        Initialize LinUCB.

        Args:
            n_arms: Number of arms
            d: Dimension of context vectors
            alpha: Exploration parameter (higher = more exploration)
        """
        self.n_arms = n_arms
        self.d = d
        self.alpha = alpha
        self.arms = [LinUCBArm(d, alpha) for _ in range(n_arms)]
        self.t = 0

    def select_action(self, context: np.ndarray) -> int:
        """
        Select arm with highest UCB for given context.

        Args:
            context: Current context vector

        Returns:
            Index of selected arm
        """
        self.t += 1
        ucb_values = [arm.get_ucb(context) for arm in self.arms]
        return int(np.argmax(ucb_values))

    def update(self, context: np.ndarray, action: int, reward: float):
        """
        Update the selected arm's model.

        Args:
            context: Context for this round
            action: Selected arm index
            reward: Observed reward
        """
        self.arms[action].update(context, reward)

    def get_all_ucbs(self, context: np.ndarray) -> np.ndarray:
        """Return UCB values for all arms (for visualization)."""
        return np.array([arm.get_ucb(context) for arm in self.arms])


# Example usage
def run_linucb_demo():
    np.random.seed(42)
    n_arms = 5
    d = 10
    n_steps = 1000

    # Create a contextual bandit environment
    true_theta = np.random.randn(n_arms, d) * 0.5

    agent = LinUCB(n_arms, d, alpha=1.0)

    total_reward = 0
    total_regret = 0

    for step in range(n_steps):
        # Generate context
        context = np.random.randn(d)

        # Optimal arm and its expected reward
        expected_rewards = context @ true_theta.T
        optimal_arm = np.argmax(expected_rewards)
        optimal_reward = expected_rewards[optimal_arm]

        # Agent's choice
        action = agent.select_action(context)

        # Noisy reward
        reward = expected_rewards[action] + np.random.randn() * 0.1

        # Update
        agent.update(context, action, reward)

        total_reward += reward
        total_regret += optimal_reward - expected_rewards[action]

        if (step + 1) % 200 == 0:
            print(f"Step {step+1}: Avg reward = {total_reward/(step+1):.3f}, "
                  f"Total regret = {total_regret:.1f}")


run_linucb_demo()
```

Choosing Alpha
Guidelines for the Exploration Parameter
The parameter $\alpha$ controls how much we explore:

- $\alpha = 0$: Pure exploitation (no exploration bonus)
- $\alpha = 1$: Moderate exploration (common default)
- $\alpha > 1$: More exploration
Theoretical analysis suggests:

$$\alpha = 1 + \sqrt{\frac{\ln(2TK/\delta)}{2}}$$

where $T$ is the time horizon, $K$ is the number of arms, and $\delta$ is the failure probability.
In practice, start with $\alpha = 1$ and tune based on performance.
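For intuition about the magnitudes involved, one common union-bound form of the theoretical choice is $\alpha = 1 + \sqrt{\ln(2TK/\delta)/2}$; the exact constant is an assumption here, but the computation is a one-liner:

```python
import math


def theoretical_alpha(T: int, K: int, delta: float) -> float:
    """One common union-bound choice for the exploration parameter
    (assumed form; exact constants vary across analyses)."""
    return 1.0 + math.sqrt(math.log(2 * T * K / delta) / 2.0)


# For a horizon of 1000 rounds, 5 arms, and 10% failure probability:
print(round(theoretical_alpha(T=1000, K=5, delta=0.1), 3))
```

Note that this grows only logarithmically in $T$ and $K$, so even generous horizons keep $\alpha$ in the low single digits.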
Scale Sensitivity
LinUCB’s exploration bonus is sensitive to the scale of rewards and features:

- If rewards are in $[0, 100]$ instead of $[0, 1]$, the exploration bonus might be too small
- If some features have much larger magnitude than others, they’ll dominate the uncertainty

Best practices:

- Normalize rewards to roughly $[0, 1]$ or $[-1, 1]$
- Standardize features (zero mean, unit variance)
- Adjust $\alpha$ based on the scale of your problem
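Feature standardization is straightforward with NumPy; a minimal sketch on synthetic data with wildly different feature scales:

```python
import numpy as np

# Hypothetical raw contexts with features on very different scales.
rng = np.random.default_rng(0)
contexts = rng.normal(size=(1000, 3)) * np.array([1.0, 50.0, 0.01])

# Standardize: zero mean, unit variance per feature.
mean = contexts.mean(axis=0)
std = contexts.std(axis=0)
standardized = (contexts - mean) / std

print(standardized.mean(axis=0).round(6))  # ~[0, 0, 0]
print(standardized.std(axis=0).round(6))   # ~[1, 1, 1]
```

In a deployed system you would estimate the mean and standard deviation online or from a held-out sample, since future contexts arrive one at a time.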
Computational Efficiency
LinUCB requires inverting a $d \times d$ matrix for each arm at each update. For high-dimensional features, this can be slow: $O(d^3)$ per arm per update.
Optimization: Sherman-Morrison formula
Instead of inverting from scratch, update the inverse incrementally:

$$A^{-1} \leftarrow A^{-1} - \frac{(A^{-1} x)(A^{-1} x)^\top}{1 + x^\top A^{-1} x}$$

This is $O(d^2)$ per update—much faster for large $d$.
```python
import numpy as np


class EfficientLinUCBArm:
    """
    LinUCB arm with efficient incremental updates.

    Uses Sherman-Morrison formula to update the inverse
    in O(d^2) instead of O(d^3).
    """

    def __init__(self, d: int, alpha: float = 1.0, lambda_reg: float = 1.0):
        self.d = d
        self.alpha = alpha
        self.lambda_reg = lambda_reg
        # Maintain the inverse directly
        self.A_inv = np.eye(d) / lambda_reg
        self.b = np.zeros(d)
        self.theta = np.zeros(d)

    def get_ucb(self, context: np.ndarray) -> float:
        mean = context @ self.theta
        variance = context @ self.A_inv @ context
        return mean + self.alpha * np.sqrt(max(variance, 0))

    def update(self, context: np.ndarray, reward: float):
        # Sherman-Morrison update for A_inv:
        # A_inv_new = A_inv - (A_inv @ x @ x.T @ A_inv) / (1 + x.T @ A_inv @ x)
        A_inv_x = self.A_inv @ context
        denom = 1.0 + context @ A_inv_x
        self.A_inv -= np.outer(A_inv_x, A_inv_x) / denom
        self.b += reward * context
        self.theta = self.A_inv @ self.b
```

Hybrid LinUCB
Sometimes, features can be split into:
- Arm-specific features $x_a$: Different for each arm (e.g., article features)
- Shared features $z$: Same across arms (e.g., user features)

Hybrid LinUCB learns:

- A shared parameter $\beta$ for user features
- Per-arm parameters $\theta_a$ for arm-specific features

This allows knowledge to transfer across arms through the shared $\beta$.

The hybrid model can be written as:

$$\mathbb{E}[r \mid z, x_a] = z^\top \beta + x_a^\top \theta_a$$

where:

- $z$ is the shared context
- $x_a$ is the arm-specific context
- $\beta$ is shared across arms
- $\theta_a$ is specific to arm $a$
The sufficient statistics now include cross-terms between shared and arm features, making the update more complex but allowing richer models.
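One simple way to realize the hybrid model is to run ordinary LinUCB on an augmented feature vector that stacks the shared features with a one-hot arm block: the first $k$ weights play the role of $\beta$, and block $a$ holds $\theta_a$. This is a sketch under that assumption, not the exact hybrid algorithm from Li et al. (2010); the class name `AugmentedHybridLinUCB` is made up:

```python
import numpy as np


class AugmentedHybridLinUCB:
    """Hybrid model via feature augmentation: one joint linear bandit
    over [z; 0; ...; x_a; ...; 0], with a Sherman-Morrison update."""

    def __init__(self, n_arms: int, k: int, d: int,
                 alpha: float = 1.0, lambda_reg: float = 1.0):
        self.n_arms, self.k, self.d, self.alpha = n_arms, k, d, alpha
        dim = k + n_arms * d
        self.A_inv = np.eye(dim) / lambda_reg
        self.b = np.zeros(dim)
        self.theta = np.zeros(dim)  # [beta; theta_0; ...; theta_{K-1}]

    def _phi(self, z, x, arm):
        """Augmented features: shared block plus the chosen arm's block."""
        phi = np.zeros(self.k + self.n_arms * self.d)
        phi[:self.k] = z
        start = self.k + arm * self.d
        phi[start:start + self.d] = x
        return phi

    def select_action(self, z, xs):
        """xs: list of arm-specific feature vectors, one per arm."""
        ucbs = []
        for a in range(self.n_arms):
            phi = self._phi(z, xs[a], a)
            mean = phi @ self.theta
            var = phi @ self.A_inv @ phi
            ucbs.append(mean + self.alpha * np.sqrt(max(var, 0.0)))
        return int(np.argmax(ucbs))

    def update(self, z, x, arm, reward):
        phi = self._phi(z, x, arm)
        A_inv_phi = self.A_inv @ phi  # Sherman-Morrison update
        self.A_inv -= np.outer(A_inv_phi, A_inv_phi) / (1.0 + phi @ A_inv_phi)
        self.b += reward * phi
        self.theta = self.A_inv @ self.b
```

Because every update touches the shared block, information about $\beta$ transfers across arms, which is the point of the hybrid model. The trade-off is a $(k + Kd)$-dimensional model rather than separate $d$-dimensional ones.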
Regret Analysis
LinUCB achieves regret bounded by:

$$R_T = \tilde{O}\left(d \sqrt{T}\right)$$

This bound tells us:

- Regret grows as $\sqrt{T}$ (sublinear—we improve over time)
- Dependence on dimension $d$ is linear (higher dimensions need more exploration)
- Logarithmic factor from concentration inequalities

The $\tilde{O}(d\sqrt{T})$ rate is near-optimal for contextual bandits. Comparing to standard bandits:

- Standard UCB: $O(\sqrt{KT \log T})$ where $K$ is the number of arms
- LinUCB: $\tilde{O}(d\sqrt{T})$ where $d$ is the feature dimension

When $d \ll K$ (few features, many arms), LinUCB can be much more efficient.
LinUCB vs Epsilon-Greedy with Linear Models
You could also use epsilon-greedy with linear regression for each arm. How does it compare?
Epsilon-greedy + Linear:
- Explores randomly with probability $\epsilon$
- Simple to implement
- Doesn’t adapt exploration to the context
LinUCB:
- Explores where uncertainty is high for the current context
- More sample-efficient
- Slightly more complex
The key advantage of LinUCB: it knows where to explore. If we’ve seen many contexts in one region of feature space, epsilon-greedy still explores uniformly. LinUCB focuses exploration on unexplored regions.
Imagine a 2D feature space where:
- Arm 1 is best in the top-left quadrant
- Arm 2 is best everywhere else
- You’ve seen many contexts from the top-left
Epsilon-greedy: Keeps exploring both arms uniformly everywhere, wasting pulls in the well-understood top-left region.
LinUCB: The top-left region has low uncertainty—it exploits there. The other regions have high uncertainty—it explores there.
LinUCB focuses exploration budget where it matters most.
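For comparison, here is a minimal sketch of the epsilon-greedy baseline with per-arm ridge regression (the class name and defaults are assumptions, not from a reference implementation). Note that the exploration step ignores the context entirely:

```python
import numpy as np


class EpsilonGreedyLinear:
    """Per-arm ridge regression with uniform epsilon exploration.
    Unlike LinUCB, exploration does not adapt to the context."""

    def __init__(self, n_arms: int, d: int, epsilon: float = 0.1,
                 lambda_reg: float = 1.0, seed: int = 0):
        self.n_arms = n_arms
        self.epsilon = epsilon
        self.rng = np.random.default_rng(seed)
        self.A_inv = [np.eye(d) / lambda_reg for _ in range(n_arms)]
        self.b = [np.zeros(d) for _ in range(n_arms)]
        self.theta = [np.zeros(d) for _ in range(n_arms)]

    def select_action(self, context: np.ndarray) -> int:
        if self.rng.random() < self.epsilon:
            # Uniform exploration, regardless of where uncertainty lies.
            return int(self.rng.integers(self.n_arms))
        # Greedy exploitation of the current linear estimates.
        preds = [context @ th for th in self.theta]
        return int(np.argmax(preds))

    def update(self, context: np.ndarray, action: int, reward: float):
        # Same Sherman-Morrison ridge update as LinUCB; only the
        # action-selection rule differs.
        A_inv = self.A_inv[action]
        A_inv_x = A_inv @ context
        A_inv -= np.outer(A_inv_x, A_inv_x) / (1.0 + context @ A_inv_x)
        self.b[action] += reward * context
        self.theta[action] = A_inv @ self.b[action]
```

The two agents share identical reward models; the entire difference in sample efficiency comes from replacing the context-dependent exploration bonus with a flat coin flip.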
Summary
LinUCB extends UCB to contextual bandits:
- Core idea: Optimism in context-dependent uncertainty
- Model: Linear relationship between context and reward
- Exploration: Based on prediction variance
- Update: Ridge regression with incremental updates
- Regret: $\tilde{O}(d\sqrt{T})$—near-optimal
LinUCB was introduced in the famous Li et al. (2010) paper on news recommendation at Yahoo!. It demonstrated that contextual bandits with linear models could significantly improve click-through rates in production. Today, variants of LinUCB power many recommendation and ad selection systems.
In the next section, we’ll see LinUCB and contextual bandits in action across real-world applications.