Online LLM Selection via Constrained Bandits with Time-Varying Demand
Summary
This paper proposes a constrained stochastic bandit algorithm for online selection of large language models under time-varying task demand and heterogeneous accuracy, latency, and cost profiles, with theoretical guarantees on regret and constraint violations.
View Cached Full Text
Cached at: 06/17/26, 05:39 AM
# Online LLM Selection via Constrained Bandits with Time-Varying Demand
Source: [https://arxiv.org/html/2606.17489](https://arxiv.org/html/2606.17489)
Yin Huang Qingsong Liu Jie XuY\. Huang and J\. Xu are with the Department of Electrical and Computer Engineering, University of Florida\. Email: \{yin\.huang, jie\.xu\}@ufl\.edu\. Q\. Liu is with the Manning College of Information and Computer Sciences, University of Massachusetts Amherst\. Email: qingsongliu@umass\.edu\.
###### Abstract
Large Language Models \(LLMs\) are increasingly deployed in edge–cloud inference systems to handle diverse user tasks with heterogeneous accuracy, latency, and cost profiles\. Selecting the appropriate LLM for each incoming task is critical for ensuring service quality and efficient resource utilization\. However, model heterogeneity, stochastic and unknown performance characteristics, and time\-varying task demands make static selection strategies inadequate\. Real\-world deployments often impose hard resource budgets such as monetary expenditure limits, along with soft service\-level requirements such as latency guarantees\. These constraints introduce additional challenges for online decision\-making\. We formulate this problem as a constrained stochastic bandit learning task, where the learner sequentially selects models under both packing\-type \(hard\) and covering\-type \(soft\) constraints, while adapting to time\-varying task demand\. The learner operates without access to the underlying reward, cost, or latency distributions and must rely on partial feedback\. We develop a novel online learning algorithm that leverages confidence\-bound estimates and demand predictions to balance reward maximization with long\-term constraint satisfaction\. We provide theoretical guarantees showing sublinear regret and sublinear covering constraint violations compared to an offline benchmark with full information\. Experimental results on synthetic workloads demonstrate the effectiveness and robustness of our approach in dynamic, resource\-constrained environments\.
###### Index Terms:
Collaborative Edge Inference, Multi\-armed Bandits, LLM Selection, Bandits Constraints\.
## 1Introduction
Large language models \(LLMs\) are increasingly deployed in real\-world systems to support tasks such as generation, summarization, and question answering\. In collaborative edge\-cloud inference systems, a mobile device offloads incoming requests to a cloud server, where it can select from a pool of pre\-deployed LLMs\. A central challenge is choosing the most appropriate model to serve incoming tasks, given the heterogeneity in model capabilities\. Most existing approaches rely on static selection, fixing one LLM throughout deployment\[[35](https://arxiv.org/html/2606.17489#bib.bib6),[36](https://arxiv.org/html/2606.17489#bib.bib7)\]\. However, in new or evolving scenarios, such one\-size\-fits\-all approach often underperform\. Due to generation diversity\[[11](https://arxiv.org/html/2606.17489#bib.bib39)\], smaller models may be better suited for certain prompts, while data drift\[[10](https://arxiv.org/html/2606.17489#bib.bib40)\]may alter task characteristics over time\. These issues call for an online model selection\[[45](https://arxiv.org/html/2606.17489#bib.bib10)\]approach that adapts to the current workload\. Importantly, different LLMs often exhibit varying and uncertain accuracy, latency, and monetary cost\. This makes the online model selection problem a natural fit for a bandit learning framework, where the system must balance exploration and exploitation while adapting to stochastic reward and cost feedback\.
An additional challenge in online LLM selection lies in satisfying multiple resource constraints, which often vary in both form and severity\. These constraints can be broadly classified along two dimensions\. First, a constraint can be hard, meaning it must be satisfied at all times, or else the system halts or soft, meaning occasional violations are allowed as long as performance remains acceptable on average\. Second, constraints can be of packing or covering type: a packing constraint enforces that cumulative resource usage does not exceed a fixed budget \(i\.e\.,≤\\leq\), while a covering constraint requires that cumulative service provision meets a minimum threshold \(i\.e\.,≥\\geq\)\[[27](https://arxiv.org/html/2606.17489#bib.bib20),[42](https://arxiv.org/html/2606.17489#bib.bib26)\]\. Our problem involves both types\. First, LLMs typically incur monetary cost based on the number of tokens they generate this cost is stochastic, as token counts vary across tasks and models\[[44](https://arxiv.org/html/2606.17489#bib.bib38),[21](https://arxiv.org/html/2606.17489#bib.bib43)\]\. In many deployments, users or providers face a fixed budget, and exceeding it may terminate service\. This defines a hard packing constraint\. Second, many applications impose a latency Service\-Level Agreement \(SLA\), which demands that a certain fraction of tasks be completed within a target deadline\[[48](https://arxiv.org/html/2606.17489#bib.bib41)\]\. Occasional violations are tolerable, but too many missed deadlines degrades user experience\. This defines a soft covering constraint\. These bidirectional constraints reflect fundamentally different goals, cost containment vs\. service quality, and demand tailored algorithmic handling\.
A third challenge arises from the dynamic and uncertain nature of task arrival rates across time\. In edge\-cloud inference systems, the number of tasks arriving in each time slot may vary significantly due to user activity, temporal patterns, or external conditions\[[8](https://arxiv.org/html/2606.17489#bib.bib42)\]\. To ensure low overhead and reduce cold\-start delays, a single model is typically selected at the start of each slot to serve all tasks that arrive during that interval\. While this preselection avoids per\-task switching costs, it makes model decisions particularly sensitive to variations in task volume\. Ignoring such fluctuations can lead to degraded performance: lightweight load may tolerate slower or cheaper models, whereas heavy load may require faster, more costly ones to meet latency constraints\. Fortunately, forecast information is often available in practice for example, from historical workload traces or predictive demand models and can be leveraged to anticipate demand sizes before decision\-making\. Integrating such predictions into the learning algorithm allows more informed and robust model selection as conditions change, particularly under budget constraints or time\-sensitive SLAs\.
In this paper, we formulate the online LLM model selection problem for a task offloading scenario in a heterogeneous edge\-cloud inference system, explicitly addressing the aforementioned challenges\. We consider a setting with multiple available LLMs \(e\.g\., different model sizes or providers at edge and cloud\), where tasks arrive continuously over time and must be assigned to appropriate models for inference\. Our formulation captures three key aspects: \(i\) the need to dynamically select among candidate models at each decision point rather than committing to a static allocation, \(ii\) the existence of a hard budget constraint on token\-based inference costs alongside a soft latency SLA that ensures quality of service, and \(iii\) the presence of time\-varying task volumes that influence both cost and responsiveness\. By integrating these elements, the proposed framework generalizes classical offloading and scheduling formulations to the setting of LLM inference, where monetary efficiency and latency guarantees are primary design considerations\.
To solve this online model selection problem, we cast it as a constrained stochastic multi\-armed bandit instance, akin to the bandits with knapsacks framework, where the system faces both a hard budget constraint and a soft latency SLA, corresponding to packing\- and covering\-type constraints, respectively\.We develop a new learning algorithm, COPAC\-UCB, which integrates confidence\-guided estimation, being optimistic \(via upper confidence bounds or UCB\) for reward and covering\-type constraints, and pessimistic \(via lower confidence bounds, or LCB\) for packing\-type constraints with Lagrangian regularization to adaptively balance performance and feasibility under uncertainty\. To improve responsiveness under demand variability, COPAC\-UCB incorporates a forecasting component that estimates the cumulative task load using a black\-box predictor\. These demand estimates help augment decisions to more effectively satisfy constraints\. Meanwhile, the algorithm updates dual variables \(virtual prices\) for each constraint via online gradient descent, effectively discouraging models that risk violating the budget or SLA\. By combining predictive guidance with optimistic exploration and adaptive cost signals, COPAC\-UCB balances reward maximization with safe resource usage over time\. The key contributions of this paper are summarized as follows:
- •We formulate a new online LLM model selection problem in edge\-cloud collaborative inference systems, explicitly incorporating time\-varying task demand, monetary and latency constraints, and model heterogeneity\.
- •We cast this problem as a stochastic bandit with both packing and covering constraints, and propose a novel algorithm, COPAC\-UCB, which combines confidence\-bound estimates with Lagrangian\-based resource balancing\. The algorithm also leverages predictions of future task load to guide cost\-aware model selection under dynamic demand\.
- •We theoretically show that COPAC\-UCB achieves sublinear regret while satisfying budget and latency constraints with high probability, ensuring safe and efficient decision\-making over time\.
- •We evaluate our algorithm through simulations on realistic workload traces, demonstrating that COPAC\-UCB achieves higher utility and better latency compliance under budget than several competitive baselines\.
## 2Related Work
Inference Selection and Offloading OptimizationRecent years have seen growing interest in improving LLM inference efficiency via adaptive selection and offloading\. Traditional methods\[[35](https://arxiv.org/html/2606.17489#bib.bib6),[36](https://arxiv.org/html/2606.17489#bib.bib7),[40](https://arxiv.org/html/2606.17489#bib.bib8),[16](https://arxiv.org/html/2606.17489#bib.bib9)\]often adopt a static model, chosen offline by benchmark scores or perplexity, but ignore input\-dependent variability across LLMs\. More recent work proposes online learning\-based model selection to balance accuracy, latency, and monetary cost under uncertainty\. For example,\[[45](https://arxiv.org/html/2606.17489#bib.bib10)\]models dynamic performance drift via a time\-varying bandit framework\. In parallel, offloading methods decide whether to run tasks locally, at the edge, or in the cloud, optimizing for latency, energy, or budget\[[30](https://arxiv.org/html/2606.17489#bib.bib12),[3](https://arxiv.org/html/2606.17489#bib.bib13),[49](https://arxiv.org/html/2606.17489#bib.bib14),[13](https://arxiv.org/html/2606.17489#bib.bib15),[29](https://arxiv.org/html/2606.17489#bib.bib45),[47](https://arxiv.org/html/2606.17489#bib.bib46),[14](https://arxiv.org/html/2606.17489#bib.bib47),[19](https://arxiv.org/html/2606.17489#bib.bib3),[20](https://arxiv.org/html/2606.17489#bib.bib4)\]\. Among them,\[[13](https://arxiv.org/html/2606.17489#bib.bib15)\]introduces a combinatorial bandit approach for LLM placement under cost constraints\. These lines of work underscore the value of principled online decision\-making for LLM inference in dynamic, resource\-constrained settings\. However, none of these works jointly address both packing and covering constraints or explicitly incorporate time\-varying demand, which are central to our formulation\.
Bandits with Arm Selection ConstraintsBandits with arm selection constraints model online decision\-making where each action yields a reward while consuming resources or satisfying additional constraints\. A canonical example is the Bandits with budget constraint framework\[[7](https://arxiv.org/html/2606.17489#bib.bib16)\], where each arm incurs stochastic rewards and resource consumption, and the goal is to maximize cumulative reward before exhausting a fixed budget\. Subsequent work\[[2](https://arxiv.org/html/2606.17489#bib.bib17)\]developed primal\-dual and UCB\-based algorithms with sublinear regret guarantees under stationary settings\. More recent extensions introduce covering\-type constraints to capture fairness or service\-level guarantees\. For instance,\[[25](https://arxiv.org/html/2606.17489#bib.bib23),[15](https://arxiv.org/html/2606.17489#bib.bib24),[46](https://arxiv.org/html/2606.17489#bib.bib25)\]enforce long\-term proportionality across arms or coverage of service across groups, modeled as soft constraints that require cumulative satisfaction over time\. These settings typically balance reward maximization with constraint violation minimization\. Other lines of work explore adversarial\[[23](https://arxiv.org/html/2606.17489#bib.bib18),[17](https://arxiv.org/html/2606.17489#bib.bib22)\], non\-stationary\[[12](https://arxiv.org/html/2606.17489#bib.bib19)\], and time\-varying settings with oracle predictions\[[33](https://arxiv.org/html/2606.17489#bib.bib5),[18](https://arxiv.org/html/2606.17489#bib.bib21)\], showing that side information can improve performance\. However, existing models rarely address both packing and covering constraints simultaneously, and do not consider time\-varying, stochastic demand\. Our work fills this gap by formulating and analyzing a setting that integrates all three dimensions\.
Learning\-augmented online algorithm designOur work connects to the literature on learning\-augmented online algorithms, which leverage machine\-learned predictions to enhance online decision\-making\[[32](https://arxiv.org/html/2606.17489#bib.bib28),[34](https://arxiv.org/html/2606.17489#bib.bib27)\]\. These methods improve performance by incorporating predictions on latent parameters, while maintaining robustness to adversarial inputs\. Applications span caching\[[32](https://arxiv.org/html/2606.17489#bib.bib28)\], scheduling\[[34](https://arxiv.org/html/2606.17489#bib.bib27),[26](https://arxiv.org/html/2606.17489#bib.bib29)\], rent\-or\-buy\[[37](https://arxiv.org/html/2606.17489#bib.bib30)\], set cover\[[9](https://arxiv.org/html/2606.17489#bib.bib31),[4](https://arxiv.org/html/2606.17489#bib.bib32)\], and matching\[[5](https://arxiv.org/html/2606.17489#bib.bib33)\], often assuming a static prediction known at the start of the horizon\. Recent work explores sequential predictions updated at each round\[[38](https://arxiv.org/html/2606.17489#bib.bib34),[39](https://arxiv.org/html/2606.17489#bib.bib35),[43](https://arxiv.org/html/2606.17489#bib.bib36),[24](https://arxiv.org/html/2606.17489#bib.bib37)\], primarily in full\-feedback or contextual bandit settings\. However, these models do not incorporate resource constraints\. We extend this paradigm by combining sequential predictions with mixed hard and soft constraints, introducing new algorithmic challenges\.
## 3System Model
We consider a mobile edge–cloud inference system where a user device offloads LLM inference tasks to a cloud\-based service\. Due to limited on\-device compute and memory, the device cannot run large\-scale LLMs locally and relies on the cloud for inference\. The system proceeds in discrete rounds indexed byt=1,2,…,Tt=1,2,\\dots,T, each representing a fixed\-duration time slot\. At the start of each roundtt, the system selects a single LLMat∈𝒜a\_\{t\}\\in\\mathcal\{A\}from a pool of pre\-deployed models to handle tasks arriving during the slot\. This preselection avoids frequent model switches and reduces cold\-start overhead\. The task loadqtq\_\{t\}in roundttis unknown at selection time and may fluctuate with user activity or application demand\. Although the learner must adapt without knowingqtq\_\{t\}in advance, such variations often follow structured patterns \(e\.g\., time\-of\-day or seasonal trends\) and can be predicted from historical data or external signals\[[38](https://arxiv.org/html/2606.17489#bib.bib34),[39](https://arxiv.org/html/2606.17489#bib.bib35)\]\. These predictions help guide model selection and resource planning\.
Model Performance and Cost\.Each modela∈𝒜a\\in\\mathcal\{A\}in the cloud has a stochastic cost and performance profile that is only revealed after it is selected\. In each roundtt, the system selects a modelata\_\{t\}to serve tasks arriving during the time slot\. The selected model then yields a random per\-task rewardrt\(at\)∈\[0,1\]r\_\{t\}\(a\_\{t\}\)\\in\[0,1\], capturing task accuracy or success rate, and a latencyℓt\(at\)\\ell\_\{t\}\(a\_\{t\}\), which includes model inference time and system delay\. Recall thatqtq\_\{t\}is the received number of tasks during roundtt, the total reward and latency incurred in this round areqt⋅rt\(at\)q\_\{t\}\\cdot r\_\{t\}\(a\_\{t\}\)andqt⋅ℓt\(at\)q\_\{t\}\\cdot\\ell\_\{t\}\(a\_\{t\}\), respectively\.
Each modelaaincurs a monetary cost based on a fixed per\-token priceρa\>0\\rho\_\{a\}\>0, specified by the service provider\. The number of tokens generated per task varies with prompt complexity and decoding behavior\. Letτti\(a\)\\tau\_\{t\}^\{i\}\(a\)denote the token usage of modelaaon taskiiin roundtt, drawn from a bounded distribution with unknown mean\[[21](https://arxiv.org/html/2606.17489#bib.bib43)\]\. The total token usage in roundttisτt\(a\)=∑i=1qtτti\(a\)\\tau\_\{t\}\(a\)=\\sum\_\{i=1\}^\{q\_\{t\}\}\\tau\_\{t\}^\{i\}\(a\), and the corresponding cost isρat⋅τt\(at\)\\rho\_\{a\_\{t\}\}\\cdot\\tau\_\{t\}\(a\_\{t\}\), revealed only after execution\. The learner must choose models without prior knowledge of realized reward, latency, or token usage, relying on historical feedback\(rt\(at\),ℓt\(at\),τt\(at\)\)\(r\_\{t\}\(a\_\{t\}\),\\ell\_\{t\}\(a\_\{t\}\),\\tau\_\{t\}\(a\_\{t\}\)\)to balance accuracy, latency, and cost\.
Constraints\.The system operates under two global constraints that govern model selection throughout the decision horizon\. First, a hard monetary budget constraint limits the cumulative token\-based cost\. Formally, the total incurred cost must satisfy∑t=1Tρat⋅τt\(at\)≤B\\sum\_\{t=1\}^\{T\}\\rho\_\{a\_\{t\}\}\\cdot\\tau\_\{t\}\(a\_\{t\}\)\\leq B\. This reflects operational or financial limits, such as API usage quotas or cloud usage budgets\. Once the budget is exhausted, meaning the cumulative cost exceedsBB, the system is prohibited from using any LLMs\. It continues to process incoming workloads until roundTT, but can only select a default no\-operation action that produces zero reward and incurs zero cost\. This represents a fallback mode where the system remains functional but consumes no additional resources\.
Second, a soft latency service\-level agreement imposes a covering constraint on user\-perceived responsiveness\[[48](https://arxiv.org/html/2606.17489#bib.bib41)\]\. At least a fractionα∈\(0,1\]\\alpha\\in\(0,1\]of all tasks must complete within a latency deadlined0d\_\{0\}\. Since allqtq\_\{t\}tasks in roundttshare the same model latencyℓt\(at\)\\ell\_\{t\}\(a\_\{t\}\), the SLA constraint is:∑t=1Tqt⋅𝟏\{ℓt\(at\)≤d0\}≥α⋅∑t=1Tqt\.\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot\\mathbf\{1\}\\\{\\ell\_\{t\}\(a\_\{t\}\)\\leq d\_\{0\}\\\}\\geq\\alpha\\cdot\\sum\_\{t=1\}^\{T\}q\_\{t\}\.The indicator is treated as zero when a no\-operation action is selected, ensuring only effective decisions contribute to SLA compliance\. While occasional violations are permitted, persistent delays significantly degrade user experience\. As latency is stochastic and only observed after selection, the learner must balance responsiveness against reward and cost\. See Fig\. 1 for a system overview\.
Objective\.The goal of the system is to maximize the expected cumulative user\-perceived reward over a fixed decision horizon ofTTrounds, while satisfying both a hard monetary budget constraint and a soft latency SLA\. Formally, the learner seeks to solve the following constrained online optimization problem:
maxa1,…,aT\\displaystyle\\max\_\{a\_\{1\},\\dots,a\_\{T\}\}𝔼\[∑t=1Tqt⋅rt\(at\)\]\\displaystyle\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot r\_\{t\}\(a\_\{t\}\)\\right\]subject to∑t=1Tρat⋅τt\(at\)≤B,\(monetary budget\)\\displaystyle\\sum\_\{t=1\}^\{T\}\\rho\_\{a\_\{t\}\}\\cdot\\tau\_\{t\}\(a\_\{t\}\)\\leq B,\\quad\\text\{\(monetary budget\)\}∑t=1Tqt⋅\\displaystyle\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot𝟏\{ℓt\(at\)≤d0\}≥α⋅∑t=1Tqt\.\(latency SLA\)\\displaystyle\\mathbf\{1\}\\\{\\ell\_\{t\}\(a\_\{t\}\)\\leq d\_\{0\}\\\}\\geq\\alpha\\cdot\\sum\_\{t=1\}^\{T\}q\_\{t\}\.\\text\{ \(latency SLA\)\}As described earlier, any no\-operation decision contributes zero reward and is excluded from SLA satisfaction\.
The expectation is taken over both the learner’s internal randomness and the stochasticity of reward, latency, and token generation\. This formulation captures the core trade\-off between reward maximization and constraint satisfaction under uncertainty\. Since the performance and cost profiles of each model are unknown and revealed only through interaction, the problem aligns naturally with a constrained bandit framework\. The learner must explore model behavior, adapt to time\-varying task loads, and make decisions under partial feedback, all while respecting the budget and service constraints\.
Figure 1:Online LLM selection under Budget Constraint
## 4Time Varying Demands Bandit Formulation with General Constraints\.
We formulate the model selection problem as a Time Varying Demands Bandit Formulation with General Constraints\. While this framework is motivated by the above model selection problem, it is more general and can be applied to a broader class of constrained online decision\-making problems with time\-varying demand, making it of independent interest\. In each roundt∈\{1,…,T\}t\\in\\\{1,\\dots,T\\\}, the learner selects an actionat∈𝒜a\_\{t\}\\in\\mathcal\{A\}from a finite set of LLM models to process the incoming tasks\. The learner then observes a stochastic rewardrt\(at\)∈\[0,1\]r\_\{t\}\(a\_\{t\}\)\\in\[0,1\], capturing average task utility such as accuracy or success rate\.
While our system specifically involves inference latency and monetary cost, many practical deployments also face other resource pressures, such as energy usage, memory access, or bandwidth limitations\. To capture this broader class of constraints in a unified manner, we introduce add\-dimensional cost vector𝐳t\(at\)∈\[0,1\]d\\mathbf\{z\}\_\{t\}\(a\_\{t\}\)\\in\[0,1\]^\{d\}, where each coordinate represents the per\-task consumption of a specific resource type\. The total cost in roundttis scaled by the realized task load asqt⋅𝐳t\(at\)q\_\{t\}\\cdot\\mathbf\{z\}\_\{t\}\(a\_\{t\}\)\. This abstraction allows us to handle heterogeneous system costs within a single learning framework and sets the stage for constraint\-aware decision\-making\.
Constraint Structure\.We categorize thedd\-dimensional cost vector𝐳t\(at\)∈\[0,1\]d\\mathbf\{z\}\_\{t\}\(a\_\{t\}\)\\in\[0,1\]^\{d\}into two types of constraints: packing constraintsℐpack=\{1,…,d1\}\\mathcal\{I\}\_\{\\text\{pack\}\}=\\\{1,\\dots,d\_\{1\}\\\}with hard budgetsBi\>0B\_\{i\}\>0; and covering constraintsℐcov=\{d1\+1,…,d\}\\mathcal\{I\}\_\{\\text\{cov\}\}=\\\{d\_\{1\}\+1,\\dots,d\\\}with soft lower thresholds, whered=d1\+d2d=d\_\{1\}\+d\_\{2\}\. For each packing constrainti∈ℐpacki\\in\\mathcal\{I\}\_\{\\text\{pack\}\}, the system halts if the cumulative usage exceeds budgetBiB\_\{i\}at any roundτ\\tau, i\.e\.,∑t=1τqt⋅zt\(i\)\(at\)\>Bi\\sum\_\{t=1\}^\{\\tau\}q\_\{t\}\\cdot z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\>B\_\{i\}\. For each covering constrainti∈ℐcovi\\in\\mathcal\{I\}\_\{\\text\{cov\}\}, the learner must ensure the cumulative usage satisfies∑t=1Tqt⋅zt\(i\)\(at\)≥Bi\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\\geq B\_\{i\}\. For instance, the latency SLA introduced earlier can be represented as a special case wherezt\(i\)\(at\):=𝟏\{ℓt\(at\)≤d0\}z\_\{t\}^\{\(i\)\}\(a\_\{t\}\):=\\mathbf\{1\}\\\{\\ell\_\{t\}\(a\_\{t\}\)\\leq d\_\{0\}\\\}, andBi=α⋅∑t=1TqtB\_\{i\}=\\alpha\\cdot\\sum\_\{t=1\}^\{T\}q\_\{t\}\.
Objective\.The learner does not observe the underlying reward or cost distributions, nor the future demand sequence\{qt\}t=1T\\\{q\_\{t\}\\\}\_\{t=1\}^\{T\}\. It must adaptively select actions based only on past observations\. The goal is to maximize the total cumulative reward over rounds\{1,…,τ−1\}\\\{1,\\dots,\\tau\-1\\\}, whereτ\\taudenotes the first round in which any packing constraint is violated, while ensuring all covering constraints are satisfied over the horizon\.
Benchmark\.To evaluate algorithm performance, we benchmark against a clairvoyant offline policy that knows all latent distributions\{rt\(a\),zt\(i\)\(a\)\}\\\{r\_\{t\}\(a\),z\_\{t\}^\{\(i\)\}\(a\)\\\}and the entire demand sequence\{qt\}t=1T\\\{q\_\{t\}\\\}\_\{t=1\}^\{T\}in advance\. LetOPTdenote the expected total reward achieved by this dynamic oracle policy, which can adaptively choose the best action in each round using full information, including future outcomes\. Given an online learner that selects actionsata\_\{t\}without access to latent reward or cost distributions, we define the regret at horizonTTas the gap between the offline dynamic oracle and the learner’s cumulative expected reward:
Regret\(T\):=OPT−𝔼\[∑t=1τ−1qt⋅rt\(at\)\],\\text\{Regret\}\(T\):=\\text\{OPT\}\-\\mathbb\{E\}\\left\[\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot r\_\{t\}\(a\_\{t\}\)\\right\],\(1\)whereτ\\tauis the first round that violates any packing constraint\.
To facilitate analysis, we upper boundOPTby introducing a linear programming \(LP\) relaxation based on a static randomized policy\. Let𝐮∈Δ\|𝒜\|\\mathbf\{u\}\\in\\Delta\_\{\|\\mathcal\{A\}\|\}be a fixed probability distribution over actions,𝐫∈ℝ\|𝒜\|\\mathbf\{r\}\\in\\mathbb\{R\}^\{\|\\mathcal\{A\}\|\}the vector of expected per\-task rewards \(i\.e\.𝐫=𝔼\[rt\(a\)\]\\mathbf\{r\}=\\mathbb\{E\}\[r\_\{t\}\(a\)\]\), and𝐳\(i\)∈ℝ\|𝒜\|\\mathbf\{z\}^\{\(i\)\}\\in\\mathbb\{R\}^\{\|\\mathcal\{A\}\|\}the expected per\-task consumption along dimensioni∈\[d\]i\\in\[d\]\(i\.e\.,𝐳\(i\)\(a\)=𝔼\[zt\(i\)\(a\)\]\\mathbf\{z\}^\{\(i\)\}\(a\)=\\mathbb\{E\}\[z\_\{t\}^\{\(i\)\}\(a\)\]\)\. LetQ:=∑t=1TqtQ:=\\sum\_\{t=1\}^\{T\}q\_\{t\}be the total task volume\. Then the LP relaxation benchmark is defined as:
OPTLP=\\displaystyle\\text\{OPT\}\_\{\\text\{LP\}\}=max𝐮∈Δ\|𝒜\|Q⋅𝐫⊤𝐮\\displaystyle\\max\_\{\\mathbf\{u\}\\in\\Delta\_\{\|\\mathcal\{A\}\|\}\}\\quad Q\\cdot\\mathbf\{r\}^\{\\top\}\\mathbf\{u\}s\.t\.Q⋅𝐳\(i\)⊤𝐮≤B,∀i∈ℐpack\\displaystyle\\text\{s\.t\.\}\\quad Q\\cdot\\mathbf\{z\}^\{\(i\)\\top\}\\mathbf\{u\}\\leq B,\\quad\\forall i\\in\\mathcal\{I\}\_\{\\text\{pack\}\}Q⋅𝐳\(i\)⊤𝐮≥B,∀i∈ℐcov,\\displaystyle\\qquad\\,Q\\cdot\\mathbf\{z\}^\{\(i\)\\top\}\\mathbf\{u\}\\geq B,\\quad\\forall i\\in\\mathcal\{I\}\_\{\\text\{cov\}\},\(2\)
For simplicity, here we assume all per\-dimension budgets are equal toBB\. Our formulation and results can be extended to heterogeneous budgets with minimal changes\. This LP solution provides an upper bound on the performance of any dynamic policy, as formalized by the following lemma:
###### Lemma 1\.
OPTLP≥OPT\.\\text\{OPT\}\_\{\\text\{LP\}\}\\geq\\text\{OPT\}\.
###### Proof Sketch\.
We begin by considering the relaxed LP form:
OPTLP′:=max𝒙t∈Δ\|𝒜\|,∀t∈\[T\]∑t=1Tqt⋅𝒓⊤𝒙t\\displaystyle\\text\{OPT\}^\{\\prime\}\_\{\\text\{LP\}\}:=\\max\_\{\\boldsymbol\{x\}\_\{t\}\\in\\Delta\_\{\|\\mathcal\{A\}\|\},\\ \\forall t\\in\[T\]\}\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot\\boldsymbol\{r\}^\{\\top\}\\boldsymbol\{x\}\_\{t\}s\.t\.∑t=1Tqt⋅\(𝐳\(i\)\)⊤𝒙t≤B,∀i∈ℐpack,\\displaystyle\\text\{s\.t\.\}\\quad\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot\(\\mathbf\{z\}^\{\(i\)\}\)^\{\\top\}\\boldsymbol\{x\}\_\{t\}\\leq B,\\quad\\forall i\\in\\mathcal\{I\}\_\{\\text\{pack\}\},∑t=1Tqt⋅\(𝐳\(i\)\)⊤𝒙t≥B,∀i∈ℐcov,\\displaystyle\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot\(\\mathbf\{z\}^\{\(i\)\}\)^\{\\top\}\\boldsymbol\{x\}\_\{t\}\\geq B,\\quad\\forall i\\in\\mathcal\{I\}\_\{\\text\{cov\}\},where𝒙t∈Δ\|𝒜\|\\boldsymbol\{x\}\_\{t\}\\in\\Delta\_\{\|\\mathcal\{A\}\|\}is a probability distribution over actions at timett\. Any policyπ\\piachievingOPTinduces a feasible solution toOPTLP′\\text\{OPT\}^\{\\prime\}\_\{\\text\{LP\}\}by settingxt,a:=𝔼at∼π\[𝟏\{at=a\}\]x\_\{t,a\}:=\\mathbb\{E\}\_\{a\_\{t\}\\sim\\pi\}\[\\mathbf\{1\}\\\{a\_\{t\}=a\\\}\], which impliesOPTLP′≥OPT\\text\{OPT\}^\{\\prime\}\_\{\\text\{LP\}\}\\geq\\text\{OPT\}\. Conversely, any feasible solution\{𝒙t\}t=1T\\\{\\boldsymbol\{x\}\_\{t\}\\\}\_\{t=1\}^\{T\}toOPTLP′\\text\{OPT\}^\{\\prime\}\_\{\\text\{LP\}\}yields a static distribution𝒖:=∑t=1Tqt𝒙t∑t=1Tqt\\boldsymbol\{u\}:=\\frac\{\\sum\_\{t=1\}^\{T\}q\_\{t\}\\boldsymbol\{x\}\_\{t\}\}\{\\sum\_\{t=1\}^\{T\}q\_\{t\}\}, which is feasible forOPTLP\\text\{OPT\}\_\{\\text\{LP\}\}and achieves the same value\. Hence,OPTLP′=OPTLP≥OPT\\text\{OPT\}^\{\\prime\}\_\{\\text\{LP\}\}=\\text\{OPT\}\_\{\\text\{LP\}\}\\geq\\text\{OPT\}\. ∎
This benchmark enables tractable regret analysis, while preserving fidelity to the original problem\. Any regret bound relative toOPTLP\\text\{OPT\}\_\{\\text\{LP\}\}directly implies a bound relative toOPT\.
Constraint Violation Metric\.To unify the treatment of covering\-type constraints, we define the cumulative violation for each covering resourcei∈ℐcovi\\in\\mathcal\{I\}\_\{\\text\{cov\}\}as
Vi\(T\):=\[B−∑t=1Tqt⋅zt\(i\)\(at\)\]\+,V\_\{i\}\(T\):=\\left\[B\-\\sum\_\{t=1\}^\{T\}q\_\{t\}\\cdot z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\\right\]\_\{\+\},The learner aims to simultaneously achieve sublinear regretRegret\(T\)\\text\{Regret\}\(T\)and ensure vanishing average violation:Vi\(T\)T=o\(1\),∀i∈ℐcov\.\\frac\{V\_\{i\}\(T\)\}\{T\}=o\(1\),\\forall i\\in\\mathcal\{I\}\_\{\\text\{cov\}\}\.This formulation captures both reward maximization and long\-term covering constraint satisfaction under model uncertainty\.
## 5Algorithm Design
We propose COPAC\-UCB \(Constrained Prediction\-Aware Covering and Packing UCB\), a learning algorithm for our framework\. It selects actions being pessimistic about packing\-type costs while optimistic about reward and covering\-type costs\. In each round, COPAC\-UCB computes a UCB for expected reward, an LCB for packing\-type costs, and a UCB for covering\-type costs\. The action is chosen to maximize a regularized utility objective with the reward of each candidate action scaled by constraint amplification multipliers that boost perceived reward when the corresponding constraints have previously been violated\. To handle task load variability, COPAC\-UCB uses a predicted cumulative demandQ^t\\hat\{Q\}\_\{t\}to approximate the total number of tasks\. This estimate normalizes resource usage by divided by total budgetBB, aiding cost\-aware decision\-making\. Its construction is detailed in subsection[5\.2](https://arxiv.org/html/2606.17489#S5.SS2)\.
### 5\.1Confidence Bound Estimation
LetNt\(a\)N\_\{t\}\(a\)denote the number of times modela∈𝒜a\\in\\mathcal\{A\}has been selected before roundtt\. We define the empirical mean reward and consumption for each actionaaas follows:r^t\(a\):=1Nt\(a\)∑s=1t−1𝟏\{as=a\}⋅rs\(a\),z^t\(i\)\(a\):=1Nt\(a\)∑s=1t−1𝟏\{as=a\}⋅zs\(i\)\(a\),∀i∈\[d\]\.\\hat\{r\}\_\{t\}\(a\):=\\frac\{1\}\{N\_\{t\}\(a\)\}\\sum\_\{s=1\}^\{t\-1\}\\mathbf\{1\}\\\{a\_\{s\}=a\\\}\\cdot r\_\{s\}\(a\),\\quad\\hat\{z\}\_\{t\}^\{\(i\)\}\(a\):=\\frac\{1\}\{N\_\{t\}\(a\)\}\\sum\_\{s=1\}^\{t\-1\}\\mathbf\{1\}\\\{a\_\{s\}=a\\\}\\cdot z\_\{s\}^\{\(i\)\}\(a\),\\quad\\forall i\\in\[d\]\.To facilitate exploration, we adopt the confidence radius function from\[[33](https://arxiv.org/html/2606.17489#bib.bib5)\]:rad\(v,N,δ\):=2vlog\(1/δ\)N\+4log\(1/δ\)N\.\\text\{rad\}\(v,N,\\delta\):=\\sqrt\{\\frac\{2v\\log\(1/\\delta\)\}\{N\}\}\+\\frac\{4\\log\(1/\\delta\)\}\{N\}\.
We then construct the following confidence bounds:
UCBr,t\(a\)\\displaystyle\\text\{UCB\}\_\{r,t\}\(a\):=min\{r^t\(a\)\+rad\(r^t\(a\),Nt\(a\),δ\),1\},\\displaystyle:=\\min\\left\\\{\\hat\{r\}\_\{t\}\(a\)\+\\text\{rad\}\(\\hat\{r\}\_\{t\}\(a\),N\_\{t\}\(a\),\\delta\),\\ 1\\right\\\},\(3\)LCBc,t\(i\)\(a\)\\displaystyle\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\):=max\{z^t\(i\)\(a\)−rad\(z^t\(i\)\(a\),Nt\(a\),δ\),0\},\\displaystyle:=\\max\\left\\\{\\hat\{z\}\_\{t\}^\{\(i\)\}\(a\)\-\\text\{rad\}\(\\hat\{z\}\_\{t\}^\{\(i\)\}\(a\),N\_\{t\}\(a\),\\delta\),\\ 0\\right\\\},fori∈ℐpack,\\displaystyle\\qquad\\qquad\\qquad\\qquad\\text\{ for \}i\\in\\mathcal\{I\}\_\{\\text\{pack\}\},\(4\)UCBc,t\(i\)\(a\)\\displaystyle\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\):=min\{z^t\(i\)\(a\)\+rad\(z^t\(i\)\(a\),Nt\(a\),δ\),1\},\\displaystyle:=\\min\\left\\\{\\hat\{z\}\_\{t\}^\{\(i\)\}\(a\)\+\\text\{rad\}\(\\hat\{z\}\_\{t\}^\{\(i\)\}\(a\),N\_\{t\}\(a\),\\delta\),\\ 1\\right\\\},fori∈ℐcov\.\\displaystyle\\qquad\\qquad\\qquad\\qquad\\text\{ for \}i\\in\\mathcal\{I\}\_\{\\text\{cov\}\}\.\(5\)
These confidence intervals ensure, with high probability, that the true reward and costs lie within the bounds\. The following guarantee formalizes this:
###### Lemma 2\.
With probability at least1−3KTdδ1\-3KTd\\delta, for alla∈𝒜a\\in\\mathcal\{A\}, we haveUCBr,t\(a\)≥r\(a\),\\text\{UCB\}\_\{r,t\}\(a\)\\geq r\(a\),andi∈ℐpack,LCBc,t\(i\)\(a\)≤z\(i\)\(a\),i\\in\\mathcal\{I\}\_\{\\text\{pack\}\},\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\)\\leq z^\{\(i\)\}\(a\),andi∈ℐcov,UCBc,t\(i\)\(a\)≤z\(i\)\(a\)i\\in\\mathcal\{I\}\_\{\\text\{cov\}\},\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\)\\leq z^\{\(i\)\}\(a\)\.
###### Proof Sketch\.
The result follows from the argument of\[[6](https://arxiv.org/html/2606.17489#bib.bib49),[1](https://arxiv.org/html/2606.17489#bib.bib44)\]by applying concentration bounds to the empirical means of rewards and costs\. With high probability, the constructed UCB and LCB terms respectively upper and lower bound the true values, completing the proof\. ∎
### 5\.2Prediction of Task Demands
To accommodate time\-varying task demand, the learner employs a time\-series predictor𝒫\\mathcal\{P\}to estimate the total task volumeQ:=∑t=1TqtQ:=\\sum\_\{t=1\}^\{T\}q\_\{t\}from historical data\. At each roundtt, the learner observes\{qs\}s=1t−1\\\{q\_\{s\}\\\}\_\{s=1\}^\{t\-1\}and produces a forecastQ^t=𝒫t\(q1,…,qt−1\)\\hat\{Q\}\_\{t\}=\\mathcal\{P\}\_\{t\}\(q\_\{1\},\\dots,q\_\{t\-1\}\)\. In practice,𝒫\\mathcal\{P\}can be any black\-box model that supports one\-step or multi\-step prediction\. For example, it may recursively generate forecastsq^t,q^t\+1,…,q^T\\hat\{q\}\_\{t\},\\hat\{q\}\_\{t\+1\},\\dots,\\hat\{q\}\_\{T\}using past and previously predicted values, and defineQ^t:=∑s=1t−1qs\+∑s=tTq^s\\hat\{Q\}\_\{t\}:=\\sum\_\{s=1\}^\{t\-1\}q\_\{s\}\+\\sum\_\{s=t\}^\{T\}\\hat\{q\}\_\{s\}as the estimated total demand\. These predictors can incorporate statistical techniques or deep models widely used in time\-series forecasting\[[41](https://arxiv.org/html/2606.17489#bib.bib50),[22](https://arxiv.org/html/2606.17489#bib.bib51),[28](https://arxiv.org/html/2606.17489#bib.bib53)\]\.
We define the prediction error at roundttasεt:=\|Q−Q^t\|\\varepsilon\_\{t\}:=\|Q\-\\hat\{Q\}\_\{t\}\|, which captures the deviation between true and estimated cumulative demand\. This estimate is used by the learner to adapt cost\-aware decisions to upcoming load variability, enabling dynamic adjustment under uncertainty\. Without loss of generality, we also assume thatqt≤q¯q\_\{t\}\\leq\\bar\{q\}for alltt, whereq¯\\bar\{q\}is a known constant upper bound on task load\. This upper bound enables the learner to gett worst\-case cost scaling and construct confidence intervals that account for demand uncertainty\.
### 5\.3Algorithm Description
Algorithm 1 presents the COPAC\-UCB algorithm\. At each roundt∈\[T\]t\\in\[T\], the learner uses a predicted cumulative task volumeQ^t\\hat\{Q\}\_\{t\}from a forecasting oracle𝒫t\\mathcal\{P\}\_\{t\}, based on historical demands\{qs\}s=1t−1\\\{q\_\{s\}\\\}\_\{s=1\}^\{t\-1\}\. Using this forecast, COPAC\-UCB constructs high\-probability confidence intervals for reward and per\-task resource consumption\.
Specifically, for each actiona∈𝒜a\\in\\mathcal\{A\}, COPAC\-UCB computes a UCBUCBr,t\(a\)\\text\{UCB\}\_\{r,t\}\(a\)on the expected per\-task reward\. For each resource dimensioni∈\[d\]i\\in\[d\], it constructs a one\-sided confidence bound on the unit resource consumptionzt\(i\)\(a\)z\_\{t\}^\{\(i\)\}\(a\): a lower boundLCBc,t\(i\)\(a\)\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\)ifi∈ℐpack:=\[1,d1\]i\\in\\mathcal\{I\}\_\{\\text\{pack\}\}:=\[1,d\_\{1\}\], and an upper boundUCBc,t\(i\)\(a\)\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\)ifi∈ℐcov:=\[d1\+1,d\]i\\in\\mathcal\{I\}\_\{\\text\{cov\}\}:=\[d\_\{1\}\+1,d\]\.
We define a unified confidence vector for each actiona∈𝒜a\\in\\mathcal\{A\}:
\[𝒛tconf\(a\)\]i:=\{LCBc,t\(i\)\(a\),ifi∈ℐpack−UCBc,t\(i\)\(a\),ifi∈ℐcov∈ℝd\.\\left\[\\boldsymbol\{z\}\_\{t\}^\{\\text\{conf\}\}\(a\)\\right\]\_\{i\}:=\\begin\{cases\}\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\),&\\text\{if \}i\\in\\mathcal\{I\}\_\{\\text\{pack\}\}\\\\ \-\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\),&\\text\{if \}i\\in\\mathcal\{I\}\_\{\\text\{cov\}\}\\end\{cases\}\\quad\\in\\mathbb\{R\}^\{d\}\.\(6\)
Based on these estimates, the algorithm selects an actionat∈𝒜a\_\{t\}\\in\\mathcal\{A\}that maximizes:
at=argmaxa∈𝒜\{UCBr,t\(a\)−𝝀t⊤\(Q^tB⋅𝒛tconf\(a\)\)\},a\_\{t\}=\\arg\\max\_\{a\\in\\mathcal\{A\}\}\\left\\\{\\text\{UCB\}\_\{r,t\}\(a\)\-\\boldsymbol\{\\lambda\}\_\{t\}^\{\\top\}\\left\(\\frac\{\\hat\{Q\}\_\{t\}\}\{B\}\\cdot\\boldsymbol\{z\}\_\{t\}^\{\\text\{conf\}\}\(a\)\\right\)\\right\\\},\(7\)where𝝀t∈ℝd\\boldsymbol\{\\lambda\}\_\{t\}\\in\\mathbb\{R\}^\{d\}is the dual price vector \(i\.e\., Lagrange multipliers\) maintained by the algorithm and defined later\.
This Lagrangian\-regularized rule balances reward maximization and constraint satisfaction, penalizing consumption of scarce packing resources via lower bounds, while encouraging usage of underutilized covering resources via upper bounds, approximating the dual form of the static LP in Eq\. \([2](https://arxiv.org/html/2606.17489#S4.E2)\), where dual weights𝝀t∈ℝd\\boldsymbol\{\\lambda\}\_\{t\}\\in\\mathbb\{R\}^\{d\}encode the opportunity cost of constrained resources\. Packing resource usage is penalized via LCBs to avoid overestimation, while covering resource usage is incentivized via UCBs to promote conservative compliance\. The normalization factorQ^t/B\\hat\{Q\}\_\{t\}/Breflects the marginal value of a unit resource under predicted load\.
After choosingAtA\_\{t\}, the system observes the realized demand sizeqtq\_\{t\}, receives rewardqt⋅rt\(at\)q\_\{t\}\\cdot r\_\{t\}\(a\_\{t\}\), and consumesqt⋅zt\(i\)\(at\)q\_\{t\}\\cdot z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)units of each resourcei∈\[d\]i\\in\[d\]\. If any packing resource exceeds its budget, i\.e\., if∑s=1tqs⋅zs\(i\)\(as\)\>Bfor somei∈ℐpack,\\sum\_\{s=1\}^\{t\}q\_\{s\}\\cdot z\_\{s\}^\{\(i\)\}\(a\_\{s\}\)\>B\\text\{ for some \}i\\in\\mathcal\{I\}\_\{\\text\{pack\}\},then the algorithm halts and pulls a null arma0a\_\{0\}for all subsequent rounds\.
To adaptively track the cost of each resource, COPAC\-UCB maintains a dual weight vector𝝀t∈ℝd\\boldsymbol\{\\lambda\}\_\{t\}\\in\\mathbb\{R\}^\{d\}that is updated via Online Gradient Descent\. At each roundtt, it solves the convex minimization problem:
𝝀t\+1=Π𝒮\(𝝀t−ηt⋅∇ft\(𝝀t\)\),\\boldsymbol\{\\lambda\}\_\{t\+1\}=\\Pi\_\{\\mathcal\{S\}\}\\left\(\\boldsymbol\{\\lambda\}\_\{t\}\-\\eta\_\{t\}\\cdot\\nabla f\_\{t\}\(\\boldsymbol\{\\lambda\}\_\{t\}\)\\right\),\(8\)where the projectionΠ𝒮\\Pi\_\{\\mathcal\{S\}\}is onto the feasible set𝒮:=\{𝝀∈ℝd:‖𝝀‖1≤T1/4,𝝀≥0d\}\\mathcal\{S\}:=\\\{\\boldsymbol\{\\lambda\}\\in\\mathbb\{R\}^\{d\}:\\\|\\boldsymbol\{\\lambda\}\\\|\_\{1\}\\leq T^\{1/4\},\\ \\boldsymbol\{\\lambda\}\\geq\\textbf\{0\}\_\{d\}\\\}, and the objective function is defined using the signed confidence vector𝒛tconf\(at\)\\boldsymbol\{z\}^\{\\text\{conf\}\}\_\{t\}\(a\_\{t\}\)as:
ft\(𝝀\)=qtQ^tB⋅\(\(BQ^t⋅𝒔−𝒛tconf\(at\)\)⊤𝝀\),f\_\{t\}\(\\boldsymbol\{\\lambda\}\)=\\frac\{q\_\{t\}\\hat\{Q\}\_\{t\}\}\{B\}\\cdot\\left\(\\left\(\\frac\{B\}\{\\hat\{Q\}\_\{t\}\}\\cdot\\boldsymbol\{s\}\-\\boldsymbol\{z\}^\{\\text\{conf\}\}\_\{t\}\(a\_\{t\}\)\\right\)^\{\\top\}\\boldsymbol\{\\lambda\}\\right\),where𝒔∈ℝd\\boldsymbol\{s\}\\in\\mathbb\{R\}^\{d\}is a direction vector with components
\[𝒔\]i:=\{1,ifi∈ℐpack−1,ifi∈ℐcov\.\[\\boldsymbol\{s\}\]\_\{i\}:=\\begin\{cases\}1,&\\text\{if \}i\\in\\mathcal\{I\}\_\{\\text\{pack\}\}\\\\ \-1,&\\text\{if \}i\\in\\mathcal\{I\}\_\{\\text\{cov\}\}\\end\{cases\}\.
This dual update encourages the algorithm to prioritize constraint satisfaction by dynamically adjusting weights based on observed feedback and demand forecasts\.
This update increases the dual weight𝝀t\(i\)\\boldsymbol\{\\lambda\}\_\{t\}\(i\)for packing resourcesi∈ℐpacki\\in\\mathcal\{I\}\_\{\\text\{pack\}\}when consumption is high \(i\.e\.,LCBc,t\(i\)\(a\)\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\)is large\), discouraging future overuse\. Conversely, it increases𝝀t\(i\)\\boldsymbol\{\\lambda\}\_\{t\}\(i\)for covering resourcesi∈ℐcovi\\in\\mathcal\{I\}\_\{\\text\{cov\}\}when usage is low \(i\.e\.,UCBc,t\(i\)\(a\)\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\)is small\), thereby encouraging future allocation\. This dual feedback mechanism enables the learner to balance reward maximization and long\-term constraint satisfaction\.
Algorithm 1COPAC\-UCB1:Initialize
𝝀1=1d𝟏∈ℝd\\boldsymbol\{\\lambda\}\_\{1\}=\\frac\{1\}\{d\}\\boldsymbol\{1\}\\in\\mathbb\{R\}^\{d\},
ηt=2Mt\\eta\_\{t\}=\\frac\{2\}\{M\\sqrt\{t\}\},
M=\(q¯\+q¯2b\)M=\(\\bar\{q\}\+\\frac\{\\bar\{q\}^\{2\}\}\{b\}\),
𝑩=\(B\)i∈\[d\]\\boldsymbol\{B\}=\(B\)\_\{i\\in\[d\]\}\.
2:for
t=1,2,…,Tt=1,2,\\dots,Tdo
3:Use demand predictior
Q^t=𝒫t\(q1,…,qt−1\)\\hat\{Q\}\_\{t\}=\\mathcal\{P\}\_\{t\}\(q\_\{1\},\\ldots,q\_\{t\-1\}\)\.
4:For all
a∈𝒜a\\in\\mathcal\{A\}, compute
UCBr,t\(a\)\\text\{UCB\}\_\{r,t\}\(a\)using \([3](https://arxiv.org/html/2606.17489#S5.E3)\);
5:for
i∈ℐpacki\\in\\mathcal\{I\}\_\{\\text\{pack\}\}, compute
LCBc,t\(i\)\(a\)\\text\{LCB\}\_\{c,t\}^\{\(i\)\}\(a\)using \([4](https://arxiv.org/html/2606.17489#S5.E4)\);
6:for
i∈ℐcoveri\\in\\mathcal\{I\}\_\{\\text\{cover\}\}, compute
UCBc,t\(i\)\(a\)\\text\{UCB\}\_\{c,t\}^\{\(i\)\}\(a\)using \([5](https://arxiv.org/html/2606.17489#S5.E5)\)\.
7:Select action
ata\_\{t\}according to \([7](https://arxiv.org/html/2606.17489#S5.E7)\)\.
8:if
∃i∈ℐpack\\exists i\\in\\mathcal\{I\}\_\{\\text\{pack\}\}such that
∑s=1tqs⋅zs\(i\)\(as\)\>B\\sum\_\{s=1\}^\{t\}q\_\{s\}\\cdot z\_\{s\}^\{\(i\)\}\(a\_\{s\}\)\>Bthen
9:breakand pull null arm
a0a\_\{0\}all the way\.
10:endif
11:Observe
qtq\_\{t\}, realized reward
qtrt\(at\)q\_\{t\}\{r\}\_\{t\}\(a\_\{t\}\), and cost
qtzt\(i\)\(at\)q\_\{t\}\{z\}\_\{t\}^\{\(i\)\}\(a\_\{t\}\)for all
i∈\[d\]i\\in\[d\]\.
12:Update budgets:
B←B−qt⋅zt\(i\)\(at\)B\\leftarrow B\-q\_\{t\}\\cdot\{z\}\_\{t\}^\{\(i\)\}\(a\_\{t\}\), for
i∈ℐpacki\\in\\mathcal\{I\}\_\{\\text\{pack\}\}\.
13:Update dual vector:
𝝀t\+1\\boldsymbol\{\\lambda\}\_\{t\+1\}using Eq\. \([8](https://arxiv.org/html/2606.17489#S5.E8)\)\.
14:endfor
## 6Main Results
In this section, we give the theoretical analysis of COPAC\-UCB\. Let𝒖∗∈Δ\|𝒜\|\\boldsymbol\{u\}^\{\*\}\\in\\Delta^\{\|\\mathcal\{A\}\|\}be an optimal solution to the LP problem \([2](https://arxiv.org/html/2606.17489#S4.E2)\)\. Before giving an upper bound on the cumulative regret and constraint violation of COPAC\-UCB, we first begin with several auxiliary lemmas\.
###### Lemma 3\(Cumulative Arm Selection Gap with OGD Bound\)\.
Define the residual term:
Rt:=qtQ^tB⋅𝝀t⊤\(BQ^t⋅𝒔−𝒛tconf\(𝒖∗\)\),R\_\{t\}:=\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\cdot\\boldsymbol\{\\lambda\}\_\{t\}^\{\\top\}\\left\(\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\\cdot\\boldsymbol\{s\}\-\\boldsymbol\{z\}\_\{t\}^\{\\text\{conf\}\}\(\\boldsymbol\{u\}^\{\*\}\)\\right\),
Then, by applying Online Gradient Descent with step sizeηt=2Mt\\eta\_\{t\}=\\frac\{2\}\{M\\sqrt\{t\}\}, we have with probability at least1−3KTδ1\-3KT\\delta:
∑t=1τ−1qt⋅𝐔𝐂𝐁𝒓,𝒕⊤𝒖∗−∑t=1τ−1qt⋅UCBr,t\(at\)\+∑t=1τ−1Rt\\displaystyle\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot\\boldsymbol\{\\mathrm\{UCB\}\_\{r,t\}\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot\\mathrm\{UCB\}\_\{r,t\}\(a\_\{t\}\)\+\\sum\_\{t=1\}^\{\\tau\-1\}R\_\{t\}≤∑t=1τ−1ft\(𝝀\)\+O\(MT3/4\),∀𝝀∈S\.\\displaystyle\\leq\\sum\_\{t=1\}^\{\\tau\-1\}f\_\{t\}\(\\boldsymbol\{\\lambda\}\)\+O\\left\(MT^\{3/4\}\\right\),\\quad\\forall\\boldsymbol\{\\lambda\}\\in S\.
###### Proof Sketch\.
We compare the algorithm’s selected actionata\_\{t\}against the fixed benchmark𝒖∗\\boldsymbol\{u\}^\{\*\}, which yields a per\-round reward difference and a cost\-weighted deviation\. This motivates a regret decomposition that includes an arm selection gap and a dual\-weighted cost gap\.
UCBr,t⊤𝒖∗−UCBr,t\(at\)\+Q^tB𝝀t⊤\(BQ^t𝒔−𝒛tconf\(𝒖∗\)\)\\displaystyle\\text\{UCB\}\_\{r,t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}\-\\text\{UCB\}\_\{r,t\}\(a\_\{t\}\)\+\\frac\{\\hat\{Q\}\_\{t\}\}\{B\}\\boldsymbol\{\\lambda\}\_\{t\}^\{\\top\}\\left\(\\frac\{B\}\{\\hat\{Q\}\_\{t\}\}\\boldsymbol\{s\}\-\\boldsymbol\{z\}\_\{t\}^\{\\text\{conf\}\}\(\\boldsymbol\{u\}^\{\*\}\)\\right\)≤Q^tB𝝀t⊤\(BQ^t⋅𝒔−𝒛tconf\(at\)\)\.\\displaystyle\\leq\\frac\{\\hat\{Q\}\_\{t\}\}\{B\}\\boldsymbol\{\\lambda\}\_\{t\}^\{\\top\}\\left\(\\frac\{B\}\{\\hat\{Q\}\_\{t\}\}\\cdot\\boldsymbol\{s\}\-\\boldsymbol\{z\}\_\{t\}^\{\\text\{conf\}\}\(a\_\{t\}\)\\right\)\.
Rewriting and summing overt=1t=1toτ−1\\tau\-1, and multiplying byqtq\_\{t\}\. By defining a sequence of convex functions, i\.e\.ft\(𝝀\)f\_\{t\}\(\\boldsymbol\{\\lambda\}\)encoding the dual\-adjusted budget residual, we frame the second term as a regret bound in online convex optimization\. Applying standard OGD guarantees over the simplex𝒮:=\{𝝀∈ℝd:‖𝝀‖1≤T1/4,𝝀≥0d\}\\mathcal\{S\}:=\\\{\\boldsymbol\{\\lambda\}\\in\\mathbb\{R\}^\{d\}:\\\|\\boldsymbol\{\\lambda\}\\\|\_\{1\}\\leq T^\{1/4\},\\ \\boldsymbol\{\\lambda\}\\geq\\textbf\{0\}\_\{d\}\\\}, we bound the cumulative difference between the algorithm’s decisions and the best fixed dual variable in hindsight\. This yields the desired upper bound\. ∎
###### Lemma 4\(Budget Residual Lower Bound\)\.
At any stopping timeτ≤T\\tau\\leq T, with probability at least1−3KTδ1\-3KT\\delta, it holds that:
∑t=1τ−1Rt≥∑t=1τ−1qt⋅Q^tB\(BQ^t−BQ\)‖𝝀t‖1\.\\sum\_\{t=1\}^\{\\tau\-1\}R\_\{t\}\\geq\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot\\frac\{\\widehat\{Q\}\_\{t\}\}\{B\}\\left\(\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\-\\frac\{B\}\{Q\}\\right\)\\\|\\boldsymbol\{\\lambda\}\_\{t\}\\\|\_\{1\}\.\(9\)
###### Proof Sketch\.
Applying Lemma 2 ensures that with high probability, the lower confidence bound of packing cost vector inRtR\_\{t\}is upper bounded by the true cost vector\. We first decompose the residualRtR\_\{t\}by separatingBQ^t𝟏d1−\(𝐋𝐂𝐁c,t\[1:d1\]\)⊤𝝁∗\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\\mathbf\{1\}\_\{d\_\{1\}\}\-\\left\(\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}^\{\[1:d\_\{1\}\]\}\\right\)^\{\\top\}\\boldsymbol\{\\mu\}^\{\*\}into\(BQ^t−BQ\)𝟏d1\+\(BQ𝟏d−\(𝐋𝐂𝐁c,t\[1:d1\]\)⊤𝝁∗\)\\left\(\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\-\\frac\{B\}\{Q\}\\right\)\\mathbf\{1\}\_\{d\_\{1\}\}\+\\left\(\\frac\{B\}\{Q\}\\mathbf\{1\}\_\{d\}\-\\left\(\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}^\{\[1:d\_\{1\}\]\}\\right\)^\{\\top\}\\boldsymbol\{\\mu\}^\{\*\}\\right\), and bound the latter using the definition of𝒖∗\\boldsymbol\{u\}^\{\*\}and non\-negativity of all terms\. The covering constraint analysis follows similarly by symmetry\. Then by dropping non\-negative slack terms and summing acrosst=1t=1toτ−1\\tau\-1, we obtain the desired lower bound involving the norm‖𝝀t‖1\\\|\\boldsymbol\{\\lambda\}\_\{t\}\\\|\_\{1\}\. ∎
###### Lemma 5\(Budget Violation Decomposition and Upper Bound\)\.
Assume that at some stopping timeτ≤T\\tau\\leq T, there exists a budget dimensionj0∈ℐpackj\_\{0\}\\in\\mathcal\{I\}\_\{\\text\{pack\}\}such that:∑t=1τ−1qtzt\(j0\)\(at\)\>B\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}z\_\{t\}^\{\(j\_\{0\}\)\}\(a\_\{t\}\)\>B\. Define the dual variable as𝝀=α𝒆j0\\boldsymbol\{\\lambda\}=\\alpha\\boldsymbol\{e\}\_\{j\_\{0\}\}, withα∈\[0,T1/4\]\\alpha\\in\[0,T^\{1/4\}\]\. Then, with probability at least1−3KTδ1\-3KT\\delta, it holds that:
∑t=1τ−1ft\(𝝀\)=∑t=1τ−1ft\(α𝒆j0\)≤α\(Qτ−1−Q\+QBq¯\\displaystyle\\sum\_\{t=1\}^\{\\tau\-1\}f\_\{t\}\(\\boldsymbol\{\\lambda\}\)=\\sum\_\{t=1\}^\{\\tau\-1\}f\_\{t\}\(\\alpha\\boldsymbol\{e\}\_\{j\_\{0\}\}\)\\leq\\alpha\\bigg\(Q\_\{\\tau\-1\}\-Q\+\\frac\{Q\}\{B\}\\bar\{q\}\+QB∑t=1τ−1qt\(zt\(j0\)\(at\)−\[𝐋𝐂𝐁c,t\(at\)\]\(j0\)\)\)\\displaystyle\+\\frac\{Q\}\{B\}\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\left\(z\_\{t\}^\{\(j\_\{0\}\)\}\(a\_\{t\}\)\-\\left\[\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}\(a\_\{t\}\)\\right\]^\{\(j\_\{0\}\)\}\\right\)\\bigg\)\+∑t=1τ−1qtα⋅Q^tB\(BQ^t−BQ\)\+O\(α\(1Q\+1B\)∑t=1τ−1qtεt\),\\displaystyle\\quad\+\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\alpha\\cdot\\frac\{\\widehat\{Q\}\_\{t\}\}\{B\}\\left\(\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\-\\frac\{B\}\{Q\}\\right\)\+O\\left\(\\alpha\\left\(\\frac\{1\}\{Q\}\+\\frac\{1\}\{B\}\\right\)\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\varepsilon\_\{t\}\\right\),
###### Proof Sketch\.
We consider a stopping timeτ≤T\\tau\\leq Tsuch that a budget constraint is violated in coordinatej0∈ℐpackj\_\{0\}\\in\\mathcal\{I\}\_\{\\text\{pack\}\}, i\.e\.,∑t=1τ−1qtzt\(j0\)\(at\)\>B\.\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}z\_\{t\}^\{\(j\_\{0\}\)\}\(a\_\{t\}\)\>B\.Letting𝝀=α𝒆j0\\boldsymbol\{\\lambda\}=\\alpha\\boldsymbol\{e\}\_\{j\_\{0\}\}for someα∈\[0,T1/4\]\\alpha\\in\[0,T^\{1/4\}\], we analyze the cumulative dual\-adjusted cost residual\. We decompose the expression into: \(i\) the normalization error fromBQ^t−BQ\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\-\\frac\{B\}\{Q\}, \(ii\) the empirical budget violation∑tqt\(zt\(j0\)\(at\)−\[𝐋𝐂𝐁c,t\(at\)\]\(j0\)\)\\sum\_\{t\}q\_\{t\}\\left\(z\_\{t\}^\{\(j\_\{0\}\)\}\(a\_\{t\}\)\-\\left\[\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}\(a\_\{t\}\)\\right\]^\{\(j\_\{0\}\)\}\\right\), and \(iii\) prediction errorsεt\\varepsilon\_\{t\}\. Using standard concentration bounds and boundedness assumptions, we derive the stated upper bound\. ∎
Then the following theorem provides a high\-probability regret upper bound for Algorithm 1:
###### Theorem 1\(High\-Probability Regret Bound under Time\-Varying Demand\)\.
With probability at least1−3KTδ1\-3KT\\delta, the algorithm satisfies the following regret bound:
OPTLP−∑t=1τ−1qtRt≤O\(log\(1δ\)\(OPTLPq¯KB\\displaystyle\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}R\_\{t\}\\leq O\\bigg\(\\log\\left\(\\frac\{1\}\{\\delta\}\\right\)\\Big\(\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}\\sqrt\{\\frac\{\\bar\{q\}K\}\{B\}\}\+q¯K⋅OPTLP\)\+\(1Q\+1B\)∑t=1τ−1qtϵt\+MT3/4\),\\displaystyle\\quad\+\\sqrt\{\\bar\{q\}K\\cdot\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}\}\\Big\)\+\\left\(\\frac\{1\}\{Q\}\+\\frac\{1\}\{B\}\\right\)\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\epsilon\_\{t\}\+MT^\{3/4\}\\bigg\),whereQ=∑t=1TqtQ=\\sum\_\{t=1\}^\{T\}q\_\{t\},q¯=maxtqt\\bar\{q\}=\\max\_\{t\}q\_\{t\}, andϵt=\|Q^t−Q\|\\epsilon\_\{t\}=\|\\hat\{Q\}\_\{t\}\-Q\|\.
###### Proof Sketch\.
We consider two cases depending on whether any budget constraint is violated before horizonTT\.
Case 1:τ≤T\\tau\\leq T\(some budget is violated before the horizon ends\)\.In this case,
∑t=1τ−1qt⋅𝐔𝐂𝐁r,t⊤𝒖∗≥∑t=1τ−1qt𝒓t⊤𝒖∗=OPTLP⋅Qτ−1Q,\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}\\geq\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{r\}\_\{t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}=\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}\\cdot\\frac\{Q\_\{\\tau\-1\}\}\{Q\},
We begin by decomposing the regret between the LP benchmark and the cumulative reward of the algorithm:
OPTLP⋅Qτ−1Q−∑t=1τ−1qt𝐔𝐂𝐁r,t\(at\)\.\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}\\cdot\\frac\{Q\_\{\\tau\-1\}\}\{Q\}\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}\(a\_\{t\}\)\.Choosing𝝀=OPTLP/Q⋅𝟏≤1\\boldsymbol\{\\lambda\}=\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}/Q\\cdot\\boldsymbol\{1\}\\leq 1, and combining Lemmas 3–5, and rearranging terms, we can derive the results of Theorem 1\.
Case 2:τ=T\+1\\tau=T\+1\(no packing budget is violated and all rounds are used\)\.With probability≥1−3KTδ\\geq 1\-3KT\\delta, we have
∑t=1τ−1qt⋅𝐔𝐂𝐁r,t⊤𝒖∗≥∑t=1τ−1qt𝒓t⊤𝒖∗=OPTLP\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\cdot\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}\\geq\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{r\}\_\{t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}=\\mathrm\{OPT\}\_\{\\mathrm\{LP\}\}
By primal\-dual optimality, we know that
∑t=1τ−1qt𝒓t⊤𝒖∗=OPTand∑t=1τ−1qt𝒛t⊤𝒖∗≤B\.\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{r\}\_\{t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}=\\mathrm\{OPT\}\\quad\\text\{and\}\\quad\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{z\}\_\{t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}\\leq B\.Using Lemma 3 to control the gap between𝐔𝐂𝐁r,t\(at\)\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}\(a\_\{t\}\)and𝐔𝐂𝐁r,t⊤𝒖∗\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}^\{\\top\}\\boldsymbol\{u\}^\{\*\}, and Lemma 4 with𝝁=𝟎\\boldsymbol\{\\mu\}=\\boldsymbol\{0\}, we obtain:
OPT−∑t=1τ−1qt𝐔𝐂𝐁r,t\(at\)≤O\(1Q∑t=1τ−1qtεt\+MT3/4\)\.\\mathrm\{OPT\}\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}\(a\_\{t\}\)\\leq O\\left\(\\frac\{1\}\{Q\}\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\varepsilon\_\{t\}\+MT^\{3/4\}\\right\)\.Further combining two cases, we arrive at the final high\-probability regret bound\. ∎
The following theorem provides a high\-probability constraint violation upper bound for Algorithm 1:
###### Theorem 2\.
Consider the COPAC\-UCB algorithm, that is provided with predictions that satisfy\|Q^t−Q\|≤εt\|\\widehat\{Q\}\_\{t\}\-Q\|\\leq\\varepsilon\_\{t\}for allt∈\[τ−1\]t\\in\[\\tau\-1\]\. With probability≥1−3KTdδ\\geq 1\-3KTd\\delta, for alli∈ℐcoveri\\in\\mathcal\{I\}\_\{\\text\{cover\}\},
max\(0,B−∑t=1τ−1qtzt\(i\)\(at\)\)\\displaystyle\\max\\left\(0,\\ B\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\\right\)≤O\(T−1/4Q\+MT3/4\+1B∑t=1τ−1qtεt\+QBT\)\.\\displaystyle\\leq O\\left\(T^\{\-1/4\}Q\+MT^\{3/4\}\+\\frac\{1\}\{B\}\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\varepsilon\_\{t\}\+\\frac\{Q\}\{B\}\\sqrt\{T\}\\right\)\.
###### Proof Sketch\.
Fixi∈ℐcoveri\\in\\mathcal\{I\}\_\{\\text\{cover\}\}such that\[𝝀\]i=T1/4\[\\boldsymbol\{\\lambda\}\]\_\{i\}=T^\{1/4\}and all other coordinates are zero\. Lemma 3 can then be rewritten as:
∑t=1τ−1\[qtQ^tB𝐔𝐂𝐁c,t\(i\)\(at\)−qtQ^tB⋅BQ^t\]\+O\(MT3/4\)\\displaystyle\\sum\_\{t=1\}^\{\\tau\-1\}\\left\[\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\boldsymbol\{\\mathrm\{UCB\}\}\_\{c,t\}^\{\(i\)\}\(a\_\{t\}\)\-\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\cdot\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\\right\]\+O\(MT^\{3/4\}\)≥∑t=1τ−1\(qt⟨𝝁∗,𝒓t⟩−qt⋅𝐔𝐂𝐁r,t\(at\)\+qtQ^tB⋅λt\\displaystyle\\quad\\geq\\sum\_\{t=1\}^\{\\tau\-1\}\\bigg\(q\_\{t\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{r\}\_\{t\}\\rangle\-q\_\{t\}\\cdot\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}\(a\_\{t\}\)\+\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\cdot\\lambda\_\{t\}−λtqt\+qtQ^tBλt⟨𝝁∗,𝒛t\[ℐcover\]⟩\)\.\\displaystyle\\qquad\\qquad\-\\lambda\_\{t\}q\_\{t\}\+\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\lambda\_\{t\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{z\}\_\{t\}^\{\[\\mathcal\{I\}\_\{\\text\{cover\}\}\]\}\\rangle\\bigg\)\.\(10\)
For the left\-hand side of \([10](https://arxiv.org/html/2606.17489#S6.E10)\), we can upper bound:
∑t=1τ−1qtQ^tB\[𝐔𝐂𝐁c,t\(i\)\(at\)−BQ^t\]\\displaystyle\\sum\_\{t=1\}^\{\\tau\-1\}\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\left\[\\boldsymbol\{\\mathrm\{UCB\}\}\_\{c,t\}^\{\(i\)\}\(a\_\{t\}\)\-\\frac\{B\}\{\\widehat\{Q\}\_\{t\}\}\\right\]≥∑t=1τ−1\(qt−qtQBzt\(i\)\(at\)\+QBO\(∥𝐋𝐂𝐁c,t\(i\)−zt\(i\)∥\)\\displaystyle\\geq\\sum\_\{t=1\}^\{\\tau\-1\}\\Big\(q\_\{t\}\-\\frac\{q\_\{t\}Q\}\{B\}z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\+\\frac\{Q\}\{B\}O\(\\\|\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}^\{\(i\)\}\-z\_\{t\}^\{\(i\)\}\\\|\)\(11\)\+O\(q¯εtB\)\)\\displaystyle\+O\\left\(\\frac\{\\bar\{q\}\\varepsilon\_\{t\}\}\{B\}\\right\)\\Big\)=∑t=1τ−1\(QBO\(‖𝐋𝐂𝐁c,t\(i\)−zt\(i\)‖\)\+O\(q¯εtB\)\)\\displaystyle=\\sum\_\{t=1\}^\{\\tau\-1\}\\left\(\\frac\{Q\}\{B\}O\(\\\|\\boldsymbol\{\\mathrm\{LCB\}\}\_\{c,t\}^\{\(i\)\}\-z\_\{t\}^\{\(i\)\}\\\|\)\+O\\left\(\\frac\{\\bar\{q\}\\varepsilon\_\{t\}\}\{B\}\\right\)\\right\)\+QB\[B−∑t=1τ−1qtzt\(i\)\(at\)\]\.\\displaystyle\+\\frac\{Q\}\{B\}\\left\[B\-\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}z\_\{t\}^\{\(i\)\}\(a\_\{t\}\)\\right\]\.\(12\)
For the right\-hand side, we have:
−T−1/4\(qt⟨𝝁∗,𝒓t⟩−qt⋅𝐔𝐂𝐁r,t\(at\)\)\\displaystyle\-T^\{\-1/4\}\\left\(q\_\{t\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{r\}\_\{t\}\\rangle\-q\_\{t\}\\cdot\\boldsymbol\{\\mathrm\{UCB\}\}\_\{r,t\}\(a\_\{t\}\)\\right\)≥−T−1/4qt𝐔𝐂𝐁c,t\(i\)\(at\)≥−T−1/4qt\.\\displaystyle\\geq\-\{T^\{\-1/4\}\}q\_\{t\}\\boldsymbol\{\\mathrm\{UCB\}\}\_\{c,t\}^\{\(i\)\}\(a\_\{t\}\)\\geq\-T^\{\-1/4\}\{q\_\{t\}\}\.\(13\)
The termλtqt−qtQ^tBλt⟨𝝁∗,𝒛t\[ℐcover\]⟩\\lambda\_\{t\}q\_\{t\}\-\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\lambda\_\{t\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{z\}\_\{t\}^\{\[\\mathcal\{I\}\_\{\\text\{cover\}\}\]\}\\ranglecan be bounded as:
λtqt−qtQ^tBλt⟨𝝁∗,𝒛t\[ℐcover\]⟩\\displaystyle\\lambda\_\{t\}q\_\{t\}\-\\frac\{q\_\{t\}\\widehat\{Q\}\_\{t\}\}\{B\}\\lambda\_\{t\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{z\}\_\{t\}^\{\[\\mathcal\{I\}\_\{\\text\{cover\}\}\]\}\\rangle≥λtqt⋅εtB⟨𝝁∗,𝒛t\[ℐcover\]⟩=O\(1B∑t=1τ−1qtεt\)\.\\displaystyle\\geq\\lambda\_\{t\}q\_\{t\}\\cdot\\frac\{\\varepsilon\_\{t\}\}\{B\}\\langle\\boldsymbol\{\\mu\}^\{\*\},\\boldsymbol\{z\}\_\{t\}^\{\[\\mathcal\{I\}\_\{\\text\{cover\}\}\]\}\\rangle=O\\left\(\\frac\{1\}\{B\}\\sum\_\{t=1\}^\{\\tau\-1\}q\_\{t\}\\varepsilon\_\{t\}\\right\)\.\(14\)
Combining Lemma 2, \([12](https://arxiv.org/html/2606.17489#S6.E12)\), \([13](https://arxiv.org/html/2606.17489#S6.E13)\), and \([14](https://arxiv.org/html/2606.17489#S6.E14)\), we derive the violation bound for alli∈ℐcoveri\\in\\mathcal\{I\}\_\{\\text\{cover\}\}\. ∎
Remark:Theorem 1 \(regret bound\) and Theorem 2 \(constraint violation bound\) both reveal the key factors contributing to learning performance\. In particular, the estimation errorϵt\\epsilon\_\{t\}from demand forecasting directly affects the regret and constraint violation through the predicted task volumeQ^t\\hat\{Q\}\_\{t\}\. The𝒪~\(M3/4\)\\widetilde\{\\mathcal\{O\}\}\(M^\{3/4\}\)term arises from the regret of the OGD used to update the dual variables\. The remaining terms are governed by the confidence bounds for𝒓\\boldsymbol\{r\}and𝒛\\boldsymbol\{z\}\. These components jointly reflect how uncertainty in system dynamics and predictive advice influences online decision\-making\.
## 7Experiment
### 7\.1Experimental Setup
Collaborative inference setting:We evaluate the proposed online constrained offloading algorithm in a collaborative inference system with four candidate LLM families:Gemma2\_2b,Llama3\.2\_1b,Qwen2\.5\_0\.5b, andQwen2\.5\_1\.5b\. At each decision roundtt, the system receives a batch ofqtq\_\{t\}tasks and selects an inference option \(i\.e\., one of the candidate LLMs, or its quantized variant\) to serve the tasks\. Each option exhibits heterogeneous and uncertain trade\-offs in \(i\) task accuracy, \(ii\) inference latency \(in seconds\), \(iii\) cost per 1k tokens \(USD\), and \(iv\) the average response length \(in tokens\), which together induce time\-varying constrained decision making\.
Trace\-driven model profiling from edge measurements:Rather than relying on hand\-crafted or synthetic model profiles, we build a trace\-driven simulator grounded in real measurements reported in\[[21](https://arxiv.org/html/2606.17489#bib.bib43)\]\. The measurement traces provide dataset\-specific accuracy, end\-to\-end inference latency, and output\-length statistics for a diverse set of quantized variants drawn from commonly deployed lightweight LLM families\. We use these traces to instantiate the stochastic reward\-and\-cost model in our evaluation: for each inference option, task correctness is modeled according to the measured accuracy, the per\-request delay is sampled from the corresponding empirical latency distribution, and the token\-level output length follows the measured response\-length statistics\. The monetary cost of each request is then computed under a unified cost\-per\-1k\-tokens accounting rule based on the sampled output length\. This trace\-driven construction preserves realistic accuracy–latency–cost trade\-offs observed in edge deployments, enabling a faithful evaluation of online decisions under uncertainty\.
Constraints:We consider a budget\-constrained and delay\-sensitive deployment scenario reflecting practical service\-level objectives\. The system operates under a packing constraint of a maximum total monetary budget of $8,000, which limits the total cumulative cost of serving all tasks\. Concurrently, we impose a covering constraint on inference latency: at least 80% of all tasks must be served within 180s, enforcing that more than half of the responses meet low\-latency requirements typical for user\-facing applications\.
Dynamic Workloads:We construct a dynamic task workload based on the HumanEval benchmark, which contains 164 code\-generation tasks and is used without subsampling\. At each roundtt, a batch ofqtq\_\{t\}tasks arrives and must be served by the selected inference option\. To capture both stationary and temporally correlated demand variations, we evaluate our algorithm under two arrival patterns: \(i\) i\.i\.d\. arrivals, where\{qt\}\\\{q\_\{t\}\\\}are independently generated across rounds, and \(ii\) autoregressive arrivals following an AR\(1\) process, where the current demand depends on the previous round to emulate bursty or persistent traffic\.
Across all settings, we adopt a unified demand estimation strategy to approximate the total demandQ^t\\hat\{Q\}\_\{t\}\. The estimator combines historical observations\{qs\}s=1t−1\\\{q\_\{s\}\\\}\_\{s=1\}^\{t\-1\}with forecasts of future demands\{q^s\}\\\{\\hat\{q\}\_\{s\}\\\}produced by standard time\-series prediction methods\. To reduce computational overhead, we employ a “power\-of\-two” update rule:Q^t\\hat\{Q\}\_\{t\}is updated only whenttis a power of two \(i\.e\.,t=2kt=2^\{k\}for somek∈ℕ\+k\\in\\mathbb\{N\}^\{\+\}\); otherwise, the previous estimate is reused\. When an update is triggered,Q^t\\hat\{Q\}\_\{t\}is computed by aggregating the realized past demand and the predicted future demand, providing an efficient yet stable approximation that supports online decision making under time\-varying workloads\.
### 7\.2Benchmark Algorithms
We compare our COPAC\-UCB algorithm with three representative baselines, all extended to handle both packing and covering constraints but without demand prediction:
- •AD\-UCB:A UCB\-based algorithm for BwCR that formulates the problem as an online convex program and uses optimistic estimates to balance exploration and exploitation under convex constraints\[[1](https://arxiv.org/html/2606.17489#bib.bib44)\]\. In our setting, AD\-UCB handles both packing and covering constraints as linear constraints but lacks the ability to predict future demand fluctuations\.
- •PD\-BwK:A primal\-dual algorithm for BwK that maintains resource prices via multiplicative updates and selects arms greedily based on estimated reward\-to\-cost ratios\[[7](https://arxiv.org/html/2606.17489#bib.bib16)\]\. Although efficient and competitive in stationary settings, it is conservative and adapts poorly to non\-stationary or covering\-constrained environments\.
- •SW\-UCB:SW\-UCB extends the classical UCB framework to Non\-Stationary Bandits with Knapsacks by employing a sliding\-window estimator that focuses only on recent observations\[[31](https://arxiv.org/html/2606.17489#bib.bib48)\]\. This enables SW\-UCB to adapt to abruptly changing environments, but its short memory can lead to overly cautious resource utilization and suboptimal SLA compliance in predictable, gradually changing scenarios like ours\.
These baselines collectively illustrate the trade\-offs between optimism, conservatism, and short\-term adaptability in constrained bandit learning\.
TABLE I:Collaborative Inference Symstem ParametersLLMGemma2\_2b0\.77281\.920\.005168\.29Llama3\.2\_1b0\.8484\.430\.015124\.67Qwen2\.5\_0\.5b0\.5441\.050\.001209\.98Qwen2\.5\_1\.5b0\.27111\.530\.003130\.84
\(a\)Estimation Error

\(b\)Cumulative Regret

\(c\)Constraint Violation
Figure 2:Performance with i\.i\.d model\.
\(a\)Estimation Error

\(b\)Cumulative Regret

\(c\)Constraint Violation
Figure 3:Performance with AR \(1\) model\.
### 7\.3Performance with i\.i\.d\. Demand
We first evaluate the algorithms under an i\.i\.d\. demand setting, where eachqtq\_\{t\}is independently drawn from a Gaussian distribution with mean 2 and variance 0\.5\. The experimental results are presented in Fig 2, which reports the estimation error ofQQover time \(Fig\. 2a\), cumulative regret \(Fig\. 2b\), and constraint violation \(Fig\. 2c\)\.
In Fig\. 2a, we observe that the estimation error ofQQdecreases rapidly as time progresses, benefiting from the stability of the i\.i\.d\. demand sequence\. The power\-of\-two update policy allows the estimation to converge quickly and maintain low error with negligible computational overhead\.
Fig\. 2b compares the cumulative regret of the four algorithms\. COPAC\-UCB achieves the lowest regret throughout the horizon, highlighting the advantage of incorporating predictive estimation and explicitly optimizing for covering constraints\. AD\-UCB performs second best, leveraging its optimistic convex optimization design, though it lacks foresight and reacts more slowly to constraint trade\-offs\. PD\-BwK exhibits higher regret due to its conservative, primal\-dual updates, which underexploit available resources in a stationary setting\. SW\-UCB performs worst, accumulating the highest regret as its short\-memory sliding\-window mechanism discards useful long\-term information and underutilizes resources unnecessarily\.
Fig\. 2c shows the average constraint violation over time\. COPAC\-UCB consistently achieves the smallest \(and slightly negative\) violation, indicating it not only satisfies the covering constraint but often slightly over\-achieves it\. AD\-UCB maintains moderate violations, while PD\-BwK remains more conservative, with higher violations than COPAC\-UCB\. SW\-UCB again performs the worst, with persistently large violations due to its inability to exploit the stationary nature of the environment and its overly cautious allocation strategy\.
Overall, these results confirm that in an i\.i\.d\. setting, COPAC\-UCB successfully balances exploration, exploitation, and constraint satisfaction through predictive demand estimation and constraint\-aware decision\-making, outperforming the baselines on both regret minimization and SLA compliance\.
#### Performance with AR \(1\) Linear Demand
We evaluate algorithm performance under a time\-correlated AR\(1\) demand model,qt=α\+βqt−1\+εtq\_\{t\}=\\alpha\+\\beta q\_\{t\-1\}\+\\varepsilon\_\{t\}, withα=2\\alpha=2,β=0\.5\\beta=0\.5\. Fig 3 summarizes the results: estimation error \(Fig\. 3a\), cumulative regret \(Fig\. 3b\), and constraint violation \(Fig\. 3c\)\.
As shown in Fig\. 3a, the estimation error of cumulative demandQQremains higher and decays more slowly under AR\(1\) demand compared to the i\.i\.d\. setting \(Fig\. 2a\)\. This is because temporal correlation causes forecasting errors to propagate across rounds, one inaccurate prediction can distort subsequent estimates\. As a result, all algorithms suffer from higher cumulative regret \(Fig\. 3b\) and increased constraint violations \(Fig\. 3c\)\. While COPAC\-UCB continues to outperform the baselines in both cumulative reward and SLA satisfaction, its performance advantage narrows in the AR setting, especially during the early rounds\. This is expected: time\-correlated demand requires longer adaptation before accurate prediction stabilizes, and the dual update mechanism becomes more conservative\. Interestingly, the performance gap between AD\-UCB, PD\-BwK, and SW\-UCB also shrinks, suggesting that none of these baselines effectively exploit temporal structure\. Their respective update rules, whether multiplicative \(PD\-BwK\) or sliding\-window\-based \(SW\-UCB\), lack mechanisms to anticipate future demand shifts, often leading to mismatches between allocation and actual task volume\.
In summary, the i\.i\.d\. scenario allows all algorithms \(except SW\-UCB\) to achieve lower regret, likely with sublinear growth due to stationarity, and smaller constraint violations\. In contrast, the AR scenario highlights the importance of prediction: COPAC\-UCB retains an advantage by leveraging estimation, while the baselines degrade more uniformly\.
## 8Conclusion
We studied the problem of online LLM selection in an offloading system with time\-varying task demand\. Each LLM has unknown, stochastic accuracy, latency, and cost, and decisions must satisfy hard budget and soft service\-level constraints\. We modeled this as a stochastic bandit problem with mixed packing and covering constraints under non\-stationary demand\. Based on this, we proposed a learning\-augmented online algorithm that combines confidence\-based exploration with sequential prediction to balance reward and constraint satisfaction\. We provided theoretical guarantees on regret and constraint violations, and validated our approach through experiments on both synthetic and real\-world workloads\. Future directions include handling richer prediction feedback, more complex constraints, and adversarial demand\.
## References
- \[1\]\(2014\)Bandits with concave rewards and convex knapsacks\.InProceedings of the fifteenth ACM conference on Economics and computation,pp\. 989–1006\.Cited by:[§5\.1](https://arxiv.org/html/2606.17489#S5.SS1.1.p1.1),[1st item](https://arxiv.org/html/2606.17489#S7.I1.i1.p1.1)\.
- \[2\]S\. Agrawal and N\. R\. Devanur\(2019\)Bandits with global convex constraints and objective\.Operations Research67\(5\),pp\. 1486–1502\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[3\]G\. Al\-Atat, P\. Datta, S\. Moharir, and J\. P\. Champati\(2024\)Regret bounds for online learning for hierarchical inference\.InProceedings of the Twenty\-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing,pp\. 281–290\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[4\]M\. Almanza, F\. Chierichetti, S\. Lattanzi, A\. Panconesi, and G\. Re\(2021\)Online facility location with multiple advice\.Advances in neural information processing systems34,pp\. 4661–4673\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[5\]A\. Antoniadis, T\. Gouleakis, P\. Kleer, and P\. Kolev\(2020\)Secretary and online matching problems with machine learned advice\.Advances in Neural Information Processing Systems33,pp\. 7933–7944\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[6\]M\. Babaioff, S\. Dughmi, R\. Kleinberg, and A\. Slivkins\(2015\)Dynamic pricing with limited supply\.ACM New York, NY, USA\.Cited by:[§5\.1](https://arxiv.org/html/2606.17489#S5.SS1.1.p1.1)\.
- \[7\]A\. Badanidiyuru, R\. Kleinberg, and A\. Slivkins\(2018\)Bandits with knapsacks\.Journal of the ACM \(JACM\)65\(3\),pp\. 1–55\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1),[2nd item](https://arxiv.org/html/2606.17489#S7.I1.i2.p1.1)\.
- \[8\]S\. Balseiro, C\. Kroer, and R\. Kumar\(2023\)Online resource allocation under horizon uncertainty\.InAbstract Proceedings of the 2023 ACM SIGMETRICS International Conference on Measurement and Modeling of Computer Systems,pp\. 63–64\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p3.1)\.
- \[9\]E\. Bamas, A\. Maggiori, and O\. Svensson\(2020\)The primal\-dual method for learning augmented algorithms\.Advances in Neural Information Processing Systems33,pp\. 20083–20094\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[10\]R\. Bhardwaj, Z\. Xia, G\. Ananthanarayanan, J\. Jiang, Y\. Shu, N\. Karianakis, K\. Hsieh, P\. Bahl, and I\. Stoica\(2022\)Ekya: continuous learning of video analytics models on edge compute servers\.In19th USENIX Symposium on Networked Systems Design and Implementation \(NSDI 22\),pp\. 119–135\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p1.1)\.
- \[11\]L\. Chen, M\. Zaharia, and J\. Zou\(2023\)Frugalgpt: how to use large language models while reducing cost and improving performance\.arXiv preprint arXiv:2305\.05176\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p1.1)\.
- \[12\]W\. C\. Cheung, D\. Simchi\-Levi, and R\. Zhu\(2019\)Learning to optimize under non\-stationarity\.InThe 22nd International Conference on Artificial Intelligence and Statistics,pp\. 1079–1087\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[13\]X\. Dai, J\. Li, X\. Liu, A\. Yu, and J\. Lui\(2024\)Cost\-effective online multi\-llm selection with versatile reward models\.arXiv preprint arXiv:2405\.16587\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[14\]X\. Dai, Z\. Xiao, H\. Jiang, M\. Lei, G\. Min, J\. Liu, and S\. Dustdar\(2023\)Offloading dependent tasks in edge computing with unknown system\-side information\.IEEE Transactions on Services Computing16\(6\),pp\. 4345–4359\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[15\]S\. Gillen, C\. Jung, M\. Kearns, and A\. Roth\(2018\)Online learning with an unknown fairness metric\.Advances in neural information processing systems31\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[16\]K\. Huang, Y\. Shi, D\. Ding, Y\. Li, Y\. Fei, L\. Lakshmanan, and X\. Xiao\(2025\)Thriftllm: on cost\-effective selection of large language models for classification queries\.arXiv preprint arXiv:2501\.04901\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[17\]Y\. Huang, Q\. Liu, and J\. Xu\(2024\)Adversarial combinatorial bandits with switching cost and arm selection constraints\.InIEEE INFOCOM 2024\-IEEE Conference on Computer Communications,pp\. 371–380\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[18\]Y\. Huang, L\. Wang, and J\. Xu\(2024\)Quantum entanglement path selection and qubit allocation via adversarial group neural bandits\.IEEE Transactions on Networking33\(2\),pp\. 583–594\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[19\]Y\. Huang, L\. Zhang, and J\. Xu\(2023\)Adversarial group linear bandits and its application to collaborative edge inference\.InIEEE INFOCOM 2023\-IEEE Conference on Computer Communications,pp\. 1–10\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[20\]Y\. Huang, L\. Zhang, and J\. Xu\(2025\)Learning the optimal path and dnn partition for collaborative edge inference\.IEEE Transactions on Mobile Computing\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[21\]E\. J\. Husom, A\. Goknil, M\. Astekin, L\. K\. Shar, A\. Kåsen, S\. Sen, B\. A\. Mithassel, and A\. Soylu\(2025\)Sustainable llm inference for edge ai: evaluating quantized llms for energy efficiency, output accuracy, and inference latency\.arXiv preprint arXiv:2504\.03360\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p2.2),[§3](https://arxiv.org/html/2606.17489#S3.p3.10),[§7\.1](https://arxiv.org/html/2606.17489#S7.SS1.p2.1)\.
- \[22\]R\. J\. Hyndman and G\. Athanasopoulos\(2021\)Forecasting: principles and practice\.OTexts\.Cited by:[§5\.2](https://arxiv.org/html/2606.17489#S5.SS2.p1.8)\.
- \[23\]N\. Immorlica, K\. Sankararaman, R\. Schapire, and A\. Slivkins\(2022\)Adversarial bandits with knapsacks\.Journal of the ACM69\(6\),pp\. 1–47\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[24\]A\. Jadbabaie, A\. Rakhlin, S\. Shahrampour, and K\. Sridharan\(2015\)Online optimization: competing with dynamic comparators\.InArtificial Intelligence and Statistics,pp\. 398–406\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[25\]M\. Joseph, M\. Kearns, J\. H\. Morgenstern, and A\. Roth\(2016\)Fairness in learning: classic and contextual bandits\.Advances in neural information processing systems29\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[26\]S\. Lattanzi, T\. Lavastida, B\. Moseley, and S\. Vassilvitskii\(2020\)Online scheduling via learned weights\.InProceedings of the Fourteenth Annual ACM\-SIAM Symposium on Discrete Algorithms,pp\. 1859–1877\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[27\]F\. Li, J\. Liu, and B\. Ji\(2019\)Combinatorial sleeping bandits with fairness constraints\.IEEE Transactions on Network Science and Engineering7\(3\),pp\. 1799–1813\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p2.2)\.
- \[28\]B\. Lim and S\. Zohren\(2021\)Time series forecasting with deep learning: a survey\.InPhilosophical Transactions of the Royal Society A,Cited by:[§5\.2](https://arxiv.org/html/2606.17489#S5.SS2.p1.8)\.
- \[29\]C\. Ling, K\. Peng, S\. Wang, X\. Xu, and V\. C\. Leung\(2024\)A multi\-agent drl\-based computation offloading and resource allocation method with attention mechanism in mec\-enabled iiot\.IEEE Transactions on Services Computing\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[30\]C\. Liu and J\. Zhao\(2024\)Resource allocation for stable llm training in mobile edge computing\.InProceedings of the Twenty\-fifth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing,pp\. 81–90\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[31\]S\. Liu, J\. Jiang, and X\. Li\(2022\)Non\-stationary bandits with knapsacks\.Advances in Neural Information Processing Systems35,pp\. 16522–16532\.Cited by:[3rd item](https://arxiv.org/html/2606.17489#S7.I1.i3.p1.1)\.
- \[32\]T\. Lykouris and S\. Vassilvitskii\(2021\)Competitive caching with machine learned advice\.Journal of the ACM \(JACM\)68\(4\),pp\. 1–25\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[33\]L\. Lyu and W\. C\. Cheung\(2023\)Bandits with knapsacks: advice on time\-varying demands\.InInternational Conference on Machine Learning,pp\. 23212–23238\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1),[§5\.1](https://arxiv.org/html/2606.17489#S5.SS1.p1.6)\.
- \[34\]M\. Mitzenmacher and S\. Vassilvitskii\(2022\)Algorithms with predictions\.Communications of the ACM65\(7\),pp\. 33–35\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[35\]A\. T\. Owodunni and C\. C\. Emezue\(2023\)Koya: a recommender system for large language model selection\.In4th Workshop on African Natural Language Processing,Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p1.1),[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[36\]B\. Peng, M\. Galley, P\. He, H\. Cheng, Y\. Xie, Y\. Hu, Q\. Huang, L\. Liden, Z\. Yu, W\. Chen,et al\.\(2023\)Check your facts and try again: improving large language models with external knowledge and automated feedback\.arXiv preprint arXiv:2302\.12813\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p1.1),[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[37\]M\. Purohit, Z\. Svitkina, and R\. Kumar\(2018\)Improving online algorithms via ml predictions\.Advances in Neural Information Processing Systems31\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[38\]A\. Rakhlin and K\. Sridharan\(2013\)Online learning with predictable sequences\.InConference on Learning Theory,pp\. 993–1019\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1),[§3](https://arxiv.org/html/2606.17489#S3.p1.6)\.
- \[39\]S\. Rakhlin and K\. Sridharan\(2013\)Optimization, learning, and games with predictable sequences\.Advances in Neural Information Processing Systems26\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1),[§3](https://arxiv.org/html/2606.17489#S3.p1.6)\.
- \[40\]J\. Salazar, D\. Liang, T\. Q\. Nguyen, and K\. Kirchhoff\(2019\)Masked language model scoring\.arXiv preprint arXiv:1910\.14659\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[41\]R\. H\. Shumway and D\. S\. Stoffer\(2017\)Time series analysis and its applications\.Springer\.Cited by:[§5\.2](https://arxiv.org/html/2606.17489#S5.SS2.p1.8)\.
- \[42\]A\. Slivkins, K\. A\. Sankararaman, and D\. J\. Foster\(2023\)Contextual bandits with packing and covering constraints: a modular lagrangian approach via regression\.InThe Thirty Sixth Annual Conference on Learning Theory,pp\. 4633–4656\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p2.2)\.
- \[43\]J\. Steinhardt and P\. Liang\(2014\)Adaptivity and optimism: an improved exponentiated gradient algorithm\.InInternational conference on machine learning,pp\. 1593–1601\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p3.1)\.
- \[44\]G\. Sun, Z\. Wang, X\. Zhao, B\. Tian, Z\. Shen, Y\. He, J\. Xing, and A\. Li\(2025\)Invisible tokens, visible bills: the urgent need to audit hidden operations in opaque llm services\.arXiv preprint arXiv:2505\.18471\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p2.2)\.
- \[45\]Y\. Xia, F\. Kong, T\. Yu, L\. Guo, R\. A\. Rossi, S\. Kim, and S\. Li\(2024\)Which llm to play? convergence\-aware online model selection with time\-increasing bandits\.InProceedings of the ACM Web Conference 2024,pp\. 4059–4070\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p1.1),[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[46\]H\. Xu, Y\. Liu, W\. C\. Lau, and R\. Li\(2020\)Combinatorial multi\-armed bandits with concave rewards and fairness constraints\.\.InIJCAI,pp\. 2554–2560\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p2.1)\.
- \[47\]N\. Yang, J\. Wen, M\. Zhang, and M\. Tang\(2025\)Generalizable pareto\-optimal offloading with reinforcement learning in mobile edge computing\.IEEE Transactions on Services Computing\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.
- \[48\]H\. Yu, T\. Taleb, and J\. Zhang\(2022\)Deterministic latency/jitter\-aware service function chaining over beyond 5g edge fabric\.IEEE Transactions on Network and Service Management19\(3\),pp\. 2148–2162\.Cited by:[§1](https://arxiv.org/html/2606.17489#S1.p2.2),[§3](https://arxiv.org/html/2606.17489#S3.p5.6)\.
- \[49\]L\. Yuan, D\. Han, S\. Wang, and C\. G\. Brinton\(2025\)Local\-cloud inference offloading for llms in multi\-modal, multi\-task, multi\-dialogue settings\.arXiv preprint arXiv:2502\.11007\.Cited by:[§2](https://arxiv.org/html/2606.17489#S2.p1.1)\.Similar Articles
Online Pandora's Box for Contextual LLM Cascading
This paper introduces an online contextual Pandora's Box model for adaptively querying and selecting LLM APIs, proposing a learning approach that combines GMM estimation with UCB-style confidence bounds and proving dimension-dependent regret bounds.
Inference-Time Budget Control for LLM Search Agents
This paper introduces a two-stage inference-time budget control method for LLM search agents, using Value-of-Information scores to optimize tool-call and token allocation during multi-hop question answering.
A Theory of Training Profit-Optimal LLMs
This paper develops an economic model combining scaling laws with microeconomic theory to analyze profit-optimal training of large language models, considering trade-offs between model quality, training costs, and hardware efficiency.
Truthful Online Preference Aggregation for LLM Fine-Tuning in Mobile Crowdsourcing
Proposes a truthful online preference aggregation mechanism for LLM fine-tuning in mobile crowdsourcing, addressing strategic worker misreporting and achieving sublinear regret.
Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning
This paper solves a COLT open problem by providing an optimal gap-dependent regret algorithm for private stochastic decision-theoretic online learning, achieving the lower bound of order (log K)/Δ_min + (log K)/ε.