contextual-bandits

#contextual-bandits

Cross-Domain Off-Policy Evaluation and Learning for Contextual Bandits

arXiv cs.LG · 3h ago

This paper introduces cross-domain off-policy evaluation and learning (OPE/L) for contextual bandits, which uses logged data from multiple source domains to improve policy evaluation and learning in target domains under challenging conditions such as few-shot data, deterministic logging policies, and new actions.

Information-Directed Sampling for Causal Bandits

arXiv cs.LG · 2026-07-20

This paper studies contextual causal bandits with non-manipulable variables, proposing causal variants of Thompson Sampling and Information-Directed Sampling (IDS) that exploit shared causal mechanisms to accelerate decision-making. Theoretical regret bounds and experiments on synthetic tasks show that the proposed methods outperform causal and non-causal baselines.

Correlation-Aware Contextual Bandits with Surrogate Rewards for LLM Routing

arXiv cs.LG · 2026-07-13

This paper proposes correlation-aware contextual bandit algorithms that leverage surrogate reward signals from machine learning models for LLM routing, achieving improved accuracy-cost trade-offs and sample efficiency compared to standard baselines.

A Contextual-Bandit Oversight Game with Two-Sided Informational Asymmetry

arXiv cs.AI · 2026-07-02

This paper introduces a contextual-bandit team game with two-sided informational asymmetry for runtime human oversight of AI agents, characterizing gaps between team-optimal and myopic human oversight strategies.

Contextual Slate GLM Bandits with Limited Adaptivity

arXiv cs.LG · 2026-07-01

This paper proposes algorithms for contextual slate bandits with generalized linear rewards under limited adaptivity, achieving regret bounds independent of the non-linearity parameter. The batched and rarely-switching algorithms are computationally efficient and empirically outperform baselines, including on an example-selection task for language models.

Diagnosing and Repairing Factual Errors in RAG under Budget Constraints

arXiv cs.AI · 2026-06-30

This paper proposes D2R-RAG, a model-agnostic and resource-aware framework that diagnoses and repairs factual errors in RAG systems under latency and VRAM constraints, achieving better accuracy-efficiency trade-offs on FEVER and HotpotQA.

Graph Dimensionality Reduction for Contextual Bandits: Structure-Specific Regret Bounds under Approximate Smoothness and Noisy Eigenspaces

arXiv cs.LG · 2026-06-29

This paper proposes GraphDR-LinUCB, a method for contextual bandits with graph-structured arms that projects features onto the low-frequency spectral subspace of the graph. It achieves the first regret bound for spectral-projection-based contextual bandits and demonstrates a 15x regret reduction over full-dimensional LinUCB on real datasets.

Contextual Bandits for Maximizing Stimulated Word-of-Mouth Rewards

arXiv cs.LG · 2026-06-16

This paper presents a contextual multi-armed bandit framework that learns individual spillover probabilities in social networks to optimize stimulated word-of-mouth marketing, achieving higher rewards by targeting connected users.

Policy Regret for Embedding Model Routing: Contextual Bandits with Low-Rank Experts

arXiv cs.LG · 2026-06-16

This paper formalizes embedding model routing as an adversarial contextual linear bandit with low-rank experts, proposing the Hypentropy Policy Gradient (HPG) algorithm that achieves Õ(s√(MT)) policy regret, avoiding the curse of dimensionality.

Online Pandora's Box for Contextual LLM Cascading

arXiv cs.AI · 2026-06-08

This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.

Human-in-the-Loop Contextual Bandits for Short-Term Rental Dynamic Pricing: Structural Equivalence of Historical Warm-Up and Approval-Gated Live Learning

arXiv cs.LG · 2026-06-03

This paper introduces the Human-in-the-Loop Gated Bandit (HITL-GB) for short-term rental dynamic pricing, showing that historical pricing data collected under a prior policy is structurally equivalent to on-policy warm-up data, which reduces the cold-start period from ~150 to ~30 episodes.

Catching a Moving Subspace: Low-Rank Bandits Beyond Stationarity

arXiv cs.LG · 2026-05-21

This paper studies piecewise-stationary low-rank linear contextual bandits, proposes the SPSC algorithm that achieves dynamic regret scaling with the intrinsic rank instead of the ambient dimension, and characterizes the identification boundary for subspace recovery under scalar feedback.
