Cross-Epoch Adaptive Rollout Optimization for RL Post-Training

arXiv cs.LG Papers

Summary

This paper presents CERO, a cross-epoch adaptive rollout optimization method for RL post-training of LLMs, which allocates a fixed rollout budget across prompts and epochs using Bayesian posterior variance to maximize sample efficiency, achieving theoretical regret bounds and outperforming GRPO on mathematical reasoning tasks.

arXiv:2606.05606v1 Announce Type: new Abstract: LLM post-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt-level diminishing returns. Our method, CERO, maintains a Beta posterior over each prompt's success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget. Since the resulting objective is temporally nonseparable, we derive a Fenchel-dual reformulation and update both prompt-level and budget-level dual variables via projected online gradient descent. Under fixed prompt utilities, we prove an $O(\sqrt{K})$ regret bound against the offline allocation benchmark. Experiments on mathematical-reasoning problems show that CERO consistently outperforms GRPO across multiple open-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency.
Original Article
View Cached Full Text

Cached at: 06/05/26, 08:12 AM

# 1 Introduction
Source: [https://arxiv.org/html/2606.05606](https://arxiv.org/html/2606.05606)
\\OneAndAHalfSpacedXI\\TheoremsNumberedThrough\\ECRepeatTheorems\\EquationsNumberedThrough

\\RUNAUTHOR

Zong, Wang, Jiang

\\RUNTITLE

Adaptive Rollout Optimization

\\TITLE

Cross\-Epoch Adaptive Rollout Optimization for RL Post\-Training

\\ARTICLEAUTHORS\\AUTHOR

Yiming Zong, Yige Wang, Jiashuo Jiang

\\AFF

Department of Industrial Engineering & Decision Analytics, Hong Kong University of Science and Technology

\\ABSTRACT

LLM post\-training often relies on reinforcement learning methods that sample multiple rollouts per prompt, yet most existing approaches use a fixed rollout budget for every prompt, despite large differences in the training signal different prompts provide\. In this paper, we study adaptive rollout allocation under a fixed global budget and formulate the problem as online resource allocation with prompt\-level diminishing returns\. Our method,CERO, maintains a Beta posterior over each prompt’s success probability and uses the posterior expected Bernoulli variance as a Bayesian estimate of the value of additional rollouts\. We use this estimate to construct a concave, saturating utility over cumulative allocations, yielding an objective in which decisions across prompts and epochs are coupled by the global budget\. Since the resulting objective is temporally nonseparable, we derive a Fenchel\-dual reformulation and update both prompt\-level and budget\-level dual variables via projected online gradient descent\. Under fixed prompt utilities, we prove anO​\(K\)O\(\\sqrt\{K\}\)regret bound against the offline allocation benchmark\. Experiments on mathematical\-reasoning problems show that CERO consistently outperforms GRPO across multiple open\-weight LLMs and benchmarks, demonstrating that adaptive rollout budgeting can improve sample efficiency\.

\\KEYWORDS

Reinforcement Learning, LLM Post\-training, Online Resource Allocation, Rollout Optimization

Reinforcement learning \(RL\) post\-training has become a central ingredient for improving the reasoning ability of large language models \(LLMs\)\. In widely used algorithms such as GRPO\(Shaoet al\.[2024](https://arxiv.org/html/2606.05606#bib.bib227)\), each prompt is typically assigned a fixed number of rollouts, which are grouped to construct relative reward signals for policy optimization\. While simple and scalable, this uniform allocation strategy ignores substantial heterogeneity across prompts\. Some prompts are already solved and yield little reward variation; others are too difficult and provide weak or noisy signals; only a subset of prompts produce useful success–failure variation that can drive meaningful policy updates\. As a result, a large fraction of the rollout budget may be spent on prompts with limited marginal value\.

Recent rollout\-optimization methods address this inefficiency through prompt selection\(Yuet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib230)\), within\-batch reallocation\(Liet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib219)\), or post\-generation filtering and pruning\. These methods improve training efficiency, but they primarily make local decisions within a batch or after rollouts have already been generated\. By contrast, rollout budgeting in RL post\-training naturally spans multiple epochs: decisions made early in training affect the remaining budget, the evidence collected for each prompt, and the opportunities for future updates\. This motivates a global question: under a fixed rollout budget, how should we allocate rollouts across both prompts and epochs to maximize the utility of collected samples?

We formulate this problem as budgeted online optimization with prompt\-level diminishing returns\. An effective allocation rule should favor prompts with high expected learning value, while avoiding excessive sampling of the same prompt as evidence accumulates\. To this end, we proposeCERO, a cross\-epoch online adaptive rollout optimization framework\. To the best of our knowledge,CEROis the first rollout\-allocation framework for LLM post\-training that explicitly optimizes a global rollout budget across epochs rather than only reallocating or filtering rollouts within a batch\.

CEROmaintains a Beta posterior over each prompt’s latent success probability and uses the posterior expected Bernoulli reward variance to estimate the value of additional rollouts\. This estimate induces a concave, saturating utility over cumulative prompt\-level allocations, capturing diminishing returns as more rollouts are assigned to the same prompt\. Since this utility depends on cumulative allocations, the resulting objective is temporally non\-separable under the global budget\. We then derive a Fenchel\-dual reformulation of the budgeted allocation problem and obtain an online primal–dual algorithm with prompt\-level dual variables and a global budget multiplier\. The prompt\-level variables estimate the marginal value of additional rollouts, while the global multiplier acts as a dynamic budget price coordinating rollout consumption across epochs\.

The resulting algorithm is simple to integrate with existing RL post\-training pipelines\. At each epoch,CEROallocates rollouts to a prompt when its estimated marginal value exceeds the current global budget price, updates the prompt posterior using observed rewards, and refreshes the plug\-in value estimate for future epochs\. It acts as a drop\-in adaptive data\-collection layer on top of GRPO\-style policy optimization: it changes which prompts receive rollouts and how many rollouts they receive, but leaves the underlying policy\-gradient objective unchanged\.

Our contributions are threefold\. First, we formulate adaptive rollout allocation for LLM RL post\-training as a cross\-epoch fixed\-budget online optimization problem with prompt\-specific diminishing\-return utilities\. This formulation explicitly couples rollout decisions across the training horizon through a shared global budget, distinguishing it from prior within\-batch selection or post\-generation filtering methods\. Second, we proposeCERO, a Bayesian primal–dual allocation algorithm that uses posterior estimates of prompt\-level rollout value to guide budgeting across epochs\. Third, we provide both theoretical and empirical evidence for the effectiveness ofCERO: under fixed utilities induced by prompt\-level value estimates, we establish anO​\(K\)O\(\\sqrt\{K\}\)regret guarantee against the offline allocation benchmark\. Experiments on mathematical\-reasoning problems demonstrate thatCEROoutperforms vanilla GRPO across multiple open\-weight LLMs and benchmarks\.

### 1\.1Related Works

#### RL Rollout Optimization

A surge of research has recently emerged focusing on RL rollout optimization\. These works predominantly involve modifications to GRPO\(Shaoet al\.[2024](https://arxiv.org/html/2606.05606#bib.bib227)\)and can be broadly categorized into three perspectives\. The first line focuses on online curriculum learning and sample selection, where training examples are prioritized according to their estimated learning value\(Mahrooghiet al\.[2026](https://arxiv.org/html/2606.05606#bib.bib217), Zhanget al\.[2025](https://arxiv.org/html/2606.05606#bib.bib225), Chenet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib226), Huet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib218)\)\. The second line investigates adaptive rollout budget allocation\(Liet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib219), Yaoet al\.[2026](https://arxiv.org/html/2606.05606#bib.bib220)\), which replace uniform rollout assignment with value\-aware resource allocation under a fixed total budget\. The third line emphasizes post\-rollout filtering or pruning, such as AERO\(Zhanget al\.[2026](https://arxiv.org/html/2606.05606#bib.bib221)\), CPPO\(Linet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib224)\), PODS\(Xuet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib223)\), and GFPO\(Shrivastavaet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib222)\), which reduce update cost by rescuing, smoothing, or sub\-selecting rollout groups after generation\. Overall, prior work suggests that rollout efficiency depends critically on identifying high\-value queries and avoiding degenerate all\-correct or all\-incorrect groups\.

#### Online Resource Allocation

Online resource allocation is a central topic in the online optimization literature and has been applied in many operational settings, including ad assignment\(Mehtaet al\.[2007](https://arxiv.org/html/2606.05606#bib.bib138)\), revenue management\(Talluri and Van Ryzin[2006](https://arxiv.org/html/2606.05606#bib.bib199), Jasin and Kumar[2012](https://arxiv.org/html/2606.05606#bib.bib100), Jianget al\.[2025](https://arxiv.org/html/2606.05606#bib.bib13)\), online knapsack problems\(Arlotto and Gurvich[2019](https://arxiv.org/html/2606.05606#bib.bib105), Jiang and Zhang[2020](https://arxiv.org/html/2606.05606#bib.bib135), Liuet al\.[2022](https://arxiv.org/html/2606.05606#bib.bib3)\), and online packing/covering problems\(Buchbinder and Naor[2005](https://arxiv.org/html/2606.05606#bib.bib202),[2006](https://arxiv.org/html/2606.05606#bib.bib201), Feldmanet al\.[2010](https://arxiv.org/html/2606.05606#bib.bib200)\)\. The existing algorithmic literature can be broadly organized according to how it exploits dual information\. A dominant class of methods uses dual prices as signals for making online allocation decisions\. Within this class, some papers adopt a learn\-once strategy: they reserve an initial set of arrivals for estimation, compute a price vector from that sample, and then use the same prices for all future decisions\(Devanur and Hayes[2009](https://arxiv.org/html/2606.05606#bib.bib183), Feldmanet al\.[2010](https://arxiv.org/html/2606.05606#bib.bib200), Molinaro and Ravi[2014](https://arxiv.org/html/2606.05606#bib.bib130), Devanur and Jain[2012](https://arxiv.org/html/2606.05606#bib.bib203), Zong and Jiang[2026](https://arxiv.org/html/2606.05606#bib.bib1)\)\. Other papers advocate adaptive updating, where the pricing rule is revised at multiple points over time through repeated re\-solving, leading to stronger robustness and improved horizon dependence\(Agrawalet al\.[2014](https://arxiv.org/html/2606.05606#bib.bib131), Li and Ye[2022](https://arxiv.org/html/2606.05606#bib.bib76), Chen and Wang[2015](https://arxiv.org/html/2606.05606#bib.bib204)\)\. In parallel, other works depart from this paradigm by developing primal\-oriented algorithms\(Kesselheimet al\.[2014](https://arxiv.org/html/2606.05606#bib.bib139)\)or first\-order schemes that do not require repeated computation of dual prices\(Agrawal and Devanur[2014](https://arxiv.org/html/2606.05606#bib.bib206), Balseiroet al\.[2020](https://arxiv.org/html/2606.05606#bib.bib207)\)\.

## 2Problem Formulation

We focus on the LLM post\-training stage and study the optimization of rollout allocation\. Consider a training procedure withKKepochs and a training dataset ofMMprompts,𝒳=\{x1,…,xM\}\\mathcal\{X\}=\\\{x\_\{1\},\\ldots,x\_\{M\}\\\}\. At each epochkk, each promptxix\_\{i\}becomes available once, and the current policy generates multiple rollouts conditioned onxix\_\{i\}for policy optimization\. A standard approach, such as GRPO, allocates a fixed numberNNof rollouts to each prompt in each epoch, yielding a total rollout budgetB=K​M​NB=KMN\.

In this work, we consider an adaptive rollout allocation strategy\. Let

𝐍𝒳=\(𝑵1,…,𝑵M\)∈ℤ≥0M×K,𝑵i=\(Ni,1,…,Ni,K\)∈ℤ≥0K,\\mathbf\{N\}\_\{\\mathcal\{X\}\}=\(\\bm\{N\}\_\{1\},\\ldots,\\bm\{N\}\_\{M\}\)\\in\\mathbb\{Z\}\_\{\\geq 0\}^\{M\\times K\},\\quad\\bm\{N\}\_\{i\}=\(N\_\{i,1\},\\ldots,N\_\{i,K\}\)\\in\\mathbb\{Z\}\_\{\\geq 0\}^\{K\},denote the allocation over the entire training procedure, whereNi,kN\_\{i,k\}is the number of rollouts assigned to promptxix\_\{i\}at epochkk\. The allocation is subject to the global rollout budget

∑k=1K∑i=1MNi,k≤B\.\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\\leq B\.
For the theoretical formulation, we associate each promptxix\_\{i\}with a fixed scoreqiq\_\{i\}that abstracts the expected learning signal obtained from additional rollouts\. We refer toqiq\_\{i\}as the prompt informativeness score\. The scoreqiq\_\{i\}parameterizes the prompt\-level utility function and determines how valuable it is to allocate additional rollouts toxix\_\{i\}\. In our implementation,qiq\_\{i\}is updated across epochs using a Beta\-posterior plug\-in rule\. Let

Ui​\(n\):=U​\(qi,n\)U\_\{i\}\(n\):=U\(q\_\{i\},n\)denote the utility of assigningnncumulative rollouts to promptxix\_\{i\}\. Sinceqiq\_\{i\}is fixed in the theoretical formulation, we omit it from the notation and viewUiU\_\{i\}as a prompt\-specific utility function\. We will specify informativeness scoreqiq\_\{i\}and utilityUi​\(n\)U\_\{i\}\(n\)in[Section3](https://arxiv.org/html/2606.05606#S3)\. The rollout allocation problem is then formulated as

𝖮𝖯𝖳​\(𝒳\):=max\{Ni,k\}\\displaystyle\\mathsf\{OPT\}\(\\mathcal\{X\}\)=\\max\_\{\\\{N\_\{i,k\}\\\}\}∑i=1MUi​\(∑k=1KNi,k\)\\displaystyle\\sum\_\{i=1\}^\{M\}U\_\{i\}\\\!\\left\(\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\)\(1\)s\.t\.∑i=1M∑k=1KNi,k≤B,\\displaystyle\\sum\_\{i=1\}^\{M\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\leq B,Ni,k∈\{0,1,…,Nmax\},i∈\[M\],k∈\[K\]\.\\displaystyle N\_\{i,k\}\\in\\\{0,1,\\ldots,N\_\{\\max\}\\\},\\quad i\\in\[M\],\\ k\\in\[K\]\.Here∑k=1KNi,k\\sum\_\{k=1\}^\{K\}N\_\{i,k\}is the cumulative number of rollouts assigned to promptxix\_\{i\}andNmaxN\_\{\\max\}denotes the maximum number of rollouts that can be assigned to a single prompt in one epoch\. Compared with uniform allocation, this formulation allows rollout numbers to vary across prompts and epochs, so that more budget can be assigned to prompts with higher utility while respecting the same total budget\.

### 2\.1Problem Reformulation

The allocation problem in[Equation1](https://arxiv.org/html/2606.05606#S2.E1)is coupled across epochs because the utility of promptxix\_\{i\}depends on its cumulative allocation∑k=1KNi,k\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\. This temporal coupling makes it difficult to choose the epoch\-wise allocationsNi,kN\_\{i,k\}in an online manner\. We address this difficulty by using a Fenchel\-dual representation of each utility function, which linearizes the dependence on cumulative allocations and yields an allocation subproblem that decomposes across epochs\.

For each promptxix\_\{i\}, define the concave conjugate ofUiU\_\{i\}as

Ui∗​\(θ\)=infs≥0\{s​θ−Ui​\(s\)\}\.U\_\{i\}^\{\*\}\(\\theta\)=\\inf\_\{s\\geq 0\}\\bigl\\\{s\\theta\-U\_\{i\}\(s\)\\bigr\\\}\.Assuming thatUiU\_\{i\}is closed, concave and nondecreasing onℝ\+\\mathbb\{R\}\_\{\+\}, we have the biconjugate representation

Ui​\(s\)=infθ≥0\{s​θ−Ui∗​\(θ\)\}\.U\_\{i\}\(s\)=\\inf\_\{\\theta\\geq 0\}\\bigl\\\{s\\theta\-U\_\{i\}^\{\*\}\(\\theta\)\\bigr\\\}\.Introducing a Fenchel dual variableθi\\theta\_\{i\}for each promptxix\_\{i\}, together with a Lagrange multiplierμ≥0\\mu\\geq 0for the global rollout budget constraint, we obtain the following dual objective:

L𝒳​\(𝜽,μ\)=B​μ\+max\{Ni,k∈\{0,1,…,Nmax\}\}⁡\{∑i=1M\(θi−μ\)​∑k=1KNi,k\}−∑i=1MUi∗​\(θi\)\.L\_\{\\mathcal\{X\}\}\(\\bm\{\\theta\},\\mu\)=B\\mu\+\\max\_\{\\\{N\_\{i,k\}\\in\\\{0,1,\\ldots,N\_\{\\max\}\\\}\\\}\}\\left\\\{\\sum\_\{i=1\}^\{M\}\(\\theta\_\{i\}\-\\mu\)\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\\\}\-\\sum\_\{i=1\}^\{M\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\.By weak duality, we know that

𝖮𝖯𝖳​\(𝒳\)≤inf𝜽≥0,μ≥0L𝒳​\(𝜽,μ\)\.\\mathsf\{OPT\}\(\\mathcal\{X\}\)\\leq\\inf\_\{\\bm\{\\theta\}\\geq 0,\\mu\\geq 0\}L\_\{\\mathcal\{X\}\}\(\\bm\{\\theta\},\\mu\)\.
The key advantage of this reformulation is that the maximization over allocations now decomposes across epochs\. For fixed\(𝜽,μ\)\(\\bm\{\\theta\},\\mu\), the allocation term can be written asKKidentical per\-epoch subproblems\. Hence,

L𝒳​\(𝜽,μ\)=K​L​\(𝜽,μ\),L\_\{\\mathcal\{X\}\}\(\\bm\{\\theta\},\\mu\)=K\\,L\(\\bm\{\\theta\},\\mu\),where the per\-epoch dual objective is

L​\(𝜽,μ\)=BK​μ\+max\{ni∈\{0,1,…,Nmax\}\}⁡\{∑i=1M\(θi−μ\)​ni\}−1K​∑i=1MUi∗​\(θi\)\.L\(\\bm\{\\theta\},\\mu\)=\\frac\{B\}\{K\}\\mu\+\\max\_\{\\\{n\_\{i\}\\in\\\{0,1,\\ldots,N\_\{\\max\}\\\}\\\}\}\\left\\\{\\sum\_\{i=1\}^\{M\}\(\\theta\_\{i\}\-\\mu\)n\_\{i\}\\right\\\}\-\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{M\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\.Here,nin\_\{i\}can be interpreted as the allocation decision for promptxix\_\{i\}in a representative epoch\. Equivalently, if

N~i:=1K​∑k=1KNi,k,\\tilde\{N\}\_\{i\}:=\\frac\{1\}\{K\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\},thenN~i\\tilde\{N\}\_\{i\}denotes the average number of rollouts assigned to promptxix\_\{i\}across epochs\. Thus, the Fenchel\-dual reformulation turns the temporally coupled cumulative\-utility objective into per\-epoch allocation subproblems coordinated by the dual variables\(𝜽,μ\)\(\\bm\{\\theta\},\\mu\)\.

## 3Informativeness and Utility Function Design

In this section, we specify the prompt informativeness score and the resulting utility function\. We writeqiq\_\{i\}for the fixed score used in the theoretical formulation andqk,iq\_\{k,i\}for the plug\-in posterior score used by the implementation at epochkk\.

Consider each promptxix\_\{i\}, and letℋik\\mathcal\{H\}\_\{i\}^\{k\}denote the rollout history collected for this prompt up to the beginning of epochkk\. We model its latent success probabilitypip\_\{i\}using a Beta posterior:

pi∣ℋik∼Beta​\(ak,i,bk,i\)\.p\_\{i\}\\mid\\mathcal\{H\}\_\{i\}^\{k\}\\sim\\mathrm\{Beta\}\(a\_\{k,i\},b\_\{k,i\}\)\.This posterior serves as a compact Bayesian summary of the current evidence for promptxix\_\{i\}, and provides a principled basis for measuring both the current success probability and the remaining uncertainty\. The posterior mean is

p~k,i:=𝔼​\[pi∣ℋik\]=ak,iak,i\+bk,i,\\tilde\{p\}\_\{k,i\}:=\\mathbb\{E\}\[p\_\{i\}\\mid\\mathcal\{H\}\_\{i\}^\{k\}\]=\\frac\{a\_\{k,i\}\}\{a\_\{k,i\}\+b\_\{k,i\}\},which estimates the current success probability of promptxix\_\{i\}\. The posterior variance is

σk,i2:=Var​\(pi∣ℋik\)=ak,i​bk,i\(ak,i\+bk,i\)2​\(ak,i\+bk,i\+1\)\.\\sigma\_\{k,i\}^\{2\}:=\\mathrm\{Var\}\(p\_\{i\}\\mid\\mathcal\{H\}\_\{i\}^\{k\}\)=\\frac\{a\_\{k,i\}b\_\{k,i\}\}\{\(a\_\{k,i\}\+b\_\{k,i\}\)^\{2\}\(a\_\{k,i\}\+b\_\{k,i\}\+1\)\}\.The posterior mean captures the current performance estimate, while the posterior variance characterizes the remaining uncertainty in that estimate\. Prompts with larger posterior uncertainty may be more informative, but uncertainty alone can overemphasize prompts that are either too easy or too difficult\. To capture learnability more directly, we use the posterior expectation of the Bernoulli variance:

qk,i:=𝔼​\[pi​\(1−pi\)∣ℋik\]\.q\_\{k,i\}:=\\mathbb\{E\}\\left\[p\_\{i\}\(1\-p\_\{i\}\)\\mid\\mathcal\{H\}\_\{i\}^\{k\}\\right\]\.Under the Beta posterior, this informativeness score has the closed\-form expression

qk,i=ak,i​bk,i\(ak,i\+bk,i\)​\(ak,i\+bk,i\+1\)\.q\_\{k,i\}=\\frac\{a\_\{k,i\}b\_\{k,i\}\}\{\(a\_\{k,i\}\+b\_\{k,i\}\)\(a\_\{k,i\}\+b\_\{k,i\}\+1\)\}\.The termpi​\(1−pi\)p\_\{i\}\(1\-p\_\{i\}\)is maximized nearpi=1/2p\_\{i\}=1/2and vanishes in the degenerate regimespi∈\{0,1\}p\_\{i\}\\in\\\{0,1\\\}\. Thus,qk,iq\_\{k,i\}measures the extent to which the outcome of promptiiremains sensitive to additional rollouts\. A larger value ofqk,iq\_\{k,i\}indicates that promptiiis still learnable and that further budget spent on this prompt is likely to produce useful training signals\.

In the regret analysis, we fixqiq\_\{i\}and analyze the corresponding fixed utility\. In the implementation, this fixed score is replaced by the epoch\-wise plug\-in posterior estimateqk,iq\_\{k,i\}\. For the fixed\-utility formulation, define

Ui​\(n\):=U​\(qi,n\)=1−exp⁡\(−η​qi​n\)=1−exp⁡\(−ci​n\),ci:=η​qi,U\_\{i\}\(n\):=U\(q\_\{i\},n\)=1\-\\exp\(\-\\eta q\_\{i\}n\)=1\-\\exp\(\-c\_\{i\}n\),\\quad c\_\{i\}:=\\eta q\_\{i\},wherennis the cumulative number of rollouts assigned to promptii, andη\>0\\eta\>0is a temperature parameter\. Larger values ofη\\etamake the utility saturate more quickly, placing greater emphasis on early allocations, while smaller values lead to a more gradual increase and weaker diminishing returns\.

This utility has two useful properties\. First, it is increasing in both the cumulative allocationnnand the informativeness scoreqiq\_\{i\}\. Second, it is concave innn, which captures diminishing returns: early rollouts are typically more informative, while the marginal value of additional rollouts decreases as more samples are collected\. Consequently, this utility favors prompts that are both informative and under\-explored, while preventing excessive allocation to a single prompt\.

Substituting this utility into the fixed\-budget allocation problem gives

𝖮𝖯𝖳​\(𝒳\):=max\{Ni,k\}\\displaystyle\\mathsf\{OPT\}\(\\mathcal\{X\}\)=\\max\_\{\\\{N\_\{i,k\}\\\}\}∑i=1M\[1−exp⁡\(−η​qi​∑k=1KNi,k\)\]\\displaystyle\\sum\_\{i=1\}^\{M\}\\left\[1\-\\exp\\left\(\-\\eta q\_\{i\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\)\\right\]\(2\)s\.t\.∑i=1M∑k=1KNi,k≤B,\\displaystyle\\sum\_\{i=1\}^\{M\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\leq B,0≤Ni,k≤Nmax,i∈\[M\],k∈\[K\]\.\\displaystyle 0\\leq N\_\{i,k\}\\leq N\_\{\\max\},\\quad i\\in\[M\],\\ k\\in\[K\]\.This formulation yields a simple Bayesian mechanism for adaptive rollout allocation: the implementation uses historical rollouts to update posterior statisticsqk,iq\_\{k,i\}, and then allocates future rollout budget toward prompts whose outcomes are expected to remain informative\.

### 3\.1Theoretical Motivation

Our notion of informativeness is motivated by recent analyses of prompt\-level learnability in reinforcement learning with verifiable rewards\. Under binary rewards, the policy gap to the entropy\-regularized optimum is governed by the variance of the Bernoulli reward distribution\(Baeet al\.[2026](https://arxiv.org/html/2606.05606#bib.bib231)\)\. Letpip\_\{i\}denote the pass probability for promptxix\_\{i\}\. We define

Ii:=pi​\(1−pi\),I\_\{i\}:=p\_\{i\}\(1\-p\_\{i\}\),\(3\)which attains its maximum atpi=1/2p\_\{i\}=1/2and vanishes whenpi∈\{0,1\}p\_\{i\}\\in\\\{0,1\\\}\. This design is also consistent with recent findings in difficulty\-aware RL for LLMs\(Shrivastavaet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib222), Huet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib218)\): prompts that are either too easy or too difficult tend to provide limited learning signal, since the model almost always succeeds or fails on them, whereas medium\-difficulty prompts induce higher outcome variance and thus yield more informative updates\. In our setting,IiI\_\{i\}captures this principle directly: easy and hard prompts, corresponding topip\_\{i\}close to11or0, receive fewer rollouts, while prompts of intermediate difficulty, wherepi≈1/2p\_\{i\}\\approx 1/2, are assigned more rollouts\.

This observation provides a natural justification for using reward variation as a measure of informativeness\. However, in our rollout allocation setting, the latent pass probabilitypip\_\{i\}is unknown and must be estimated online from historical rollouts\. Rather than using a point estimate ofpip\_\{i\}, we maintain the posterior distribution

pi∣ℋik∼Beta​\(ak,i,bk,i\),p\_\{i\}\\mid\\mathcal\{H\}\_\{i\}^\{k\}\\sim\\mathrm\{Beta\}\(a\_\{k,i\},b\_\{k,i\}\),which captures both the current estimate of the prompt difficulty and the remaining uncertainty in that estimate\. We then define the posterior informativeness as the posterior expectation of the ideal learnability:

qk,i:=𝔼​\[Ii∣ℋik\]=𝔼​\[pi​\(1−pi\)∣ℋik\]\.q\_\{k,i\}:=\\mathbb\{E\}\[I\_\{i\}\\mid\\mathcal\{H\}\_\{i\}^\{k\}\]=\\mathbb\{E\}\[p\_\{i\}\(1\-p\_\{i\}\)\\mid\\mathcal\{H\}\_\{i\}^\{k\}\]\.Under the Beta posterior, this yields

qk,i=ak,i​bk,i\(ak,i\+bk,i\)​\(ak,i\+bk,i\+1\)\.q\_\{k,i\}=\\frac\{a\_\{k,i\}b\_\{k,i\}\}\{\(a\_\{k,i\}\+b\_\{k,i\}\)\(a\_\{k,i\}\+b\_\{k,i\}\+1\)\}\.
While prior work uses this variance\-based learnability mainly to filter prompts into retained or discarded subsets, we use it as a continuous Bayesian score for optimizing rollout allocation under a fixed budget\. Thus,qk,iq\_\{k,i\}can be interpreted as a Bayesian proxy for the ideal prompt\-level learnability\. Compared with directly filtering prompts based on empirical pass rates, this posterior formulation is better suited for online rollout allocation: it accounts for uncertainty induced by limited historical samples and provides a smooth, continuously updated estimate of how informative future rollouts from each prompt are expected to be\.

This motivates the plug\-in utility used by the implementation,

U​\(qk,i,n\)=1−exp⁡\(−η​qk,i​n\),U\(q\_\{k,i\},n\)=1\-\\exp\(\-\\eta q\_\{k,i\}n\),where largerqk,iq\_\{k,i\}increases the marginal value of additional rollouts, while concavity innnimposes diminishing returns and prevents excessive allocation to a single prompt\. The theorem below analyzes the fixed\-utility version obtained by holdingqiq\_\{i\}fixed; the posterior scoreqk,iq\_\{k,i\}is used as a practical plug\-in update in CERO\.

## 4Online Rollout Allocation Algorithm

We now presentCERO, an online primal\-dual rollout allocation algorithm for LLM reinforcement learning under a fixed generation budget\. The formal pseudo\-code is provided in[Algorithm1](https://arxiv.org/html/2606.05606#alg1)\.

#### Online rollout allocation\.

Given the current dual variables\(𝜽k,μk\)\(\\bm\{\\theta\}^\{k\},\\mu^\{k\}\),CEROdetermines the number of rollouts to collect for each prompt by solving the prompt\-wise allocation subproblem

Ni,k=arg⁡max0≤n≤Nmax⁡\(θik−μk\)​n,N\_\{i,k\}=\\arg\\max\_\{0\\leq n\\leq N\_\{\\max\}\}\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)n,whereNmaxN\_\{\\max\}is the per\-prompt rollout cap\. Since the objective is linear innn, the solution admits the threshold form

Ni,k=\{Nmax,θik\>μk,0,θik≤μk\.N\_\{i,k\}=\\begin\{cases\}N\_\{\\max\},&\\theta\_\{i\}^\{k\}\>\\mu^\{k\},\\\\ 0,&\\theta\_\{i\}^\{k\}\\leq\\mu^\{k\}\.\\end\{cases\}In implementation, we additionally clip the allocation by the remaining rollout budget to avoid overspending\. The regret analysis below is stated for the primal–dual allocation rule with the box constraint0≤n≤Nmax0\\leq n\\leq N\_\{\\max\}, and controls budget violation through the global dual variable\. Thus, a prompt receives rollouts only when its estimated marginal value exceeds the current global budget price\. This rule provides an online scheduler for LLM rollout generation: prompts that are currently more informative receive more samples, while low\-value prompts are skipped to preserve budget for future epochs\.

#### Prompt\-level feedback and dual update\.

After selectingNi,kN\_\{i,k\}, the current policy samples responses from promptxix\_\{i\},

yi,k\(1\),…,yi,k\(Ni,k\)∼πϕ\(⋅∣xi\),y\_\{i,k\}^\{\(1\)\},\\ldots,y\_\{i,k\}^\{\(N\_\{i,k\}\)\}\\sim\\pi\_\{\\phi\}\(\\cdot\\mid x\_\{i\}\),and the generated responses are evaluated by the reward model or task\-specific reward function\. The resulting prompt–response–reward tuples are added to the rollout buffer𝒟\\mathcal\{D\}and are used to update the informativeness estimate of promptxix\_\{i\}\. Specifically,CEROupdates the Beta posterior associated with promptxix\_\{i\}, computes the next informativeness scoreqk\+1,iq\_\{k\+1,i\}, and recomputes the corresponding utility term\.

Using this updated prompt\-level feedback,CEROperforms a projected online gradient step on the prompt\-level dual variable:

gθik=Ni,k−1K​∇Ui∗​\(θik\),θik\+1=\[θik−ηθ​gθik\]\+\.g\_\{\\theta\_\{i\}\}^\{k\}=N\_\{i,k\}\-\\frac\{1\}\{K\}\\nabla U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\),\\qquad\\theta\_\{i\}^\{k\+1\}=\\left\[\\theta\_\{i\}^\{k\}\-\\eta\_\{\\theta\}g\_\{\\theta\_\{i\}\}^\{k\}\\right\]\_\{\+\}\.This update follows from theθi\\theta\_\{i\}\-subgradient of the per\-epoch dual objective\. Intuitively, it adjusts the future sampling priority of promptxix\_\{i\}by balancing the number of rollouts already allocated to the prompt against the utility implied by its current informativeness estimate\.

#### Cross\-epoch budget control\.

To regulate rollout usage over the entire training horizon,CEROupdates the global multiplier according to the discrepancy between the current epoch’s rollout consumption and the remaining budgetBkB\_\{k\}:

gμk=BkK−∑i=1MNi,k,μk\+1=\[μk−ημ​gμk\]\+\.g\_\{\\mu\}^\{k\}=\\frac\{B\_\{k\}\}\{K\}\-\\sum\_\{i=1\}^\{M\}N\_\{i,k\},\\qquad\\mu^\{k\+1\}=\\left\[\\mu^\{k\}\-\\eta\_\{\\mu\}g\_\{\\mu\}^\{k\}\\right\]\_\{\+\}\.When the algorithm consumes rollouts faster than the target budget rate,μk\\mu^\{k\}increases, making future allocation decisions more selective\. Conversely, when rollout consumption is below the target rate,μk\\mu^\{k\}decreases, allowing more prompts to become eligible for sampling in later epochs\. This global update couples all prompt\-level allocation decisions through the shared generation budget, while avoiding a rigid per\-epoch rollout quota\.

#### Integration with existing LLM RL pipelines\.

CEROcan be used as a drop\-in online rollout scheduler for existing LLM reinforcement\-learning pipelines\. In our implementation, we instantiateCEROon top of the widely used GRPO framework\. Whenever the rollout buffer satisfies\|𝒟\|≥Bbatch\|\\mathcal\{D\}\|\\geq B\_\{\\mathrm\{batch\}\}, the policy is updated using the GRPO objective and the buffer is cleared\. Importantly,CEROleaves the underlying policy\-gradient objective unchanged; it only modifies the data\-collection process\. By adapting rollout allocation to prompt\-level feedback and global budget pressure,CEROimproves rollout efficiency under a fixed sampling budget\.

Algorithm 1Cross\-Epoch Online Adaptive Rollout Optimization \(CERO\)1:Input:Number of prompts

MM, training epochs

KK, average rollout

NN, maximum rollouts

NmaxN\_\{\\max\}, utility functions

Ui​\(⋅\)U\_\{i\}\(\\cdot\), policy

πθ\\pi\_\{\\theta\}, batch size

BbatchB\_\{\\text\{batch\}\}
2:Initialize dual variables

𝜽1\\bm\{\\theta\}^\{1\},

μ1\\mu^\{1\}, total rollout budget

B0=K​M​NB\_\{0\}=KMN, collected batch

𝒟=∅\\mathcal\{D\}=\\emptyset
3:for

k=1k=1to

KKdo

4:foreach prompt

xix\_\{i\}do

5:Compute the step allocation:

Ni,k=arg⁡max0≤Ni,k≤min⁡\(Nmax,B−Bused\)⁡\(θik−μk\)⋅Ni,kN\_\{i,k\}=\\arg\\max\_\{0\\leq N\_\{i,k\}\\leq\\min\(N\_\{\\max\},B\-B\_\{\\text\{used\}\}\)\}\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)\\cdot N\_\{i,k\}\.

6:Allocate

Ni,kN\_\{i,k\}rollouts:

yi\(1\),…,yi\(Ni,k\)∼πθ\(⋅∣xi\)y\_\{i\}^\{\(1\)\},\\dots,y\_\{i\}^\{\(N\_\{i,k\}\)\}\\sim\\pi\_\{\\theta\}\(\\cdot\\mid x\_\{i\}\)\.

7:Add collected rollouts to batch:

𝒟←𝒟∪\{yi\(1\),…,yi\(Ni,k\)\}\\mathcal\{D\}\\leftarrow\\mathcal\{D\}\\cup\\\{y\_\{i\}^\{\(1\)\},\\dots,y\_\{i\}^\{\(N\_\{i,k\}\)\}\\\}
8:Update remaining budget:

Bk←Bk−Ni,kB\_\{k\}\\leftarrow B\_\{k\}\-N\_\{i,k\}
9:Update dual variables via online gradient descent:

gθik=Ni,k−1K​∇Ui∗​\(θik\),θik\+1=\[θik−ηθ​gθik\]\+g\_\{\\theta\_\{i\}\}^\{k\}=\{N\}\_\{i,k\}\-\\frac\{1\}\{K\}\\nabla U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\),\\quad\\theta\_\{i\}^\{k\+1\}=\\left\[\\theta\_\{i\}^\{k\}\-\\eta\_\{\\theta\}g\_\{\\theta\_\{i\}\}^\{k\}\\right\]\_\{\+\}
10:if

\|𝒟\|≥Bbatch\|\\mathcal\{D\}\|\\geq B\_\{\\text\{batch\}\}then

11:Policy update

12:endif

13:endfor

14:if

B<0B<0then

15:Terminate

16:endif

17:Update global dual variable:

gμk=BkK−∑i=1MNi,k,μk\+1=\[μk−ημ​gμk\]\+,Bk\+1=Bkg\_\{\\mu\}^\{k\}=\\frac\{B\_\{k\}\}\{K\}\-\\sum\_\{i=1\}^\{M\}\{N\}\_\{i,k\},\\quad\\mu^\{k\+1\}=\\left\[\\mu^\{k\}\-\\eta\_\{\\mu\}g\_\{\\mu\}^\{k\}\\right\]\_\{\+\},\\quad B\_\{k\+1\}=B\_\{k\}
18:endfor

### 4\.1Theoretical Guarantee

Before stating the regret guarantee, we first define the offline benchmark against which the online rollout allocation algorithm is compared\. For the theoretical analysis, we consider the setting where each promptxix\_\{i\}has a fixed informativeness scoreqiq\_\{i\}\. To simplify notation, we suppress the dependence onqiq\_\{i\}and the utility hyperparameterη\\eta\.

Given a total rollout budgetBB, let𝖮𝖯𝖳\\mathsf\{OPT\}denote the optimal value of the offline allocation problem in Eq\. \([2](https://arxiv.org/html/2606.05606#S3.E2)\)\. ForCEROalgorithm, we define its realized utility as

𝖠𝖫𝖦:=∑i=1MUi​\(∑k=1KNi,k\),\\mathsf\{ALG\}:=\\sum\_\{i=1\}^\{M\}U\_\{i\}\\left\(\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\),whereNi,kN\_\{i,k\}is the number of rollouts allocated to promptxix\_\{i\}at epochkk\. The regret ofCEROis defined as following:

Reg:=𝖮𝖯𝖳−𝖠𝖫𝖦\\operatorname\{Reg\}:=\\mathsf\{OPT\}\-\\mathsf\{ALG\}
###### Theorem 4\.1\(main result\)

In the LLM RL post\-training setting, suppose that each training prompt appears only once in each training epoch, and consider the CERO algorithm in[Algorithm1](https://arxiv.org/html/2606.05606#alg1)\. Suppose the prompt\-level Fenchel dual variables and the global rollout\-budget dual variable are projected onto compact domains,

𝜽k∈Θϵ:=∏i=1M\[ϵ,ci\],μk∈ℳ:=\[0,μ¯\],\\bm\{\\theta\}^\{k\}\\in\\Theta\_\{\\epsilon\}:=\\prod\_\{i=1\}^\{M\}\[\\epsilon,c\_\{i\}\],\\quad\\mu^\{k\}\\in\\mathcal\{M\}:=\[0,\\bar\{\\mu\}\],whereci=η​qic\_\{i\}=\\eta q\_\{i\},ϵ\>0\\epsilon\>0, andμ¯<∞\\bar\{\\mu\}<\\infty\. If the prompt\-level and budget\-level OCO updates incur regretsRegθ⁡\(K\)\\operatorname\{Reg\}\_\{\\theta\}\(K\)andRegμ⁡\(K\)\\operatorname\{Reg\}\_\{\\mu\}\(K\), then

Reg≤Regθ⁡\(K\)\+Regμ⁡\(K\)\.\\operatorname\{Reg\}\\leq\\operatorname\{Reg\}\_\{\\theta\}\(K\)\+\\operatorname\{Reg\}\_\{\\mu\}\(K\)\.Consequently, if projected online gradient ascent is used onΘϵ\\Theta\_\{\\epsilon\}andℳ\\mathcal\{M\}with uniformly bounded supergradients, then

Reg=O​\(K\)\.\\operatorname\{Reg\}=O\(\\sqrt\{K\}\)\.

## 5Numerical Experiments

### 5\.1Experimental Setup

We integrate our proposedCEROwith GRPO and compare it with the vanilla GRPO algorithm\. We evaluate their performance on four open\-weight language models:DeepSeek\-R1\-Distill\-Qwen\-1\.5B\(Guoet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib180)\)\(abbreviated as R1\-Distill\-1\.5B\),Qwen3\-4B\-Base\(Yanget al\.[2025](https://arxiv.org/html/2606.05606#bib.bib228)\),Qwen3\-4B\-Instruct\(Yanget al\.[2025](https://arxiv.org/html/2606.05606#bib.bib228)\), andQwen2\.5\-Math\-7B\(Yanget al\.[2024](https://arxiv.org/html/2606.05606#bib.bib229)\)\. We post\-training these LLMs via verl\(Shenget al\.[2025](https://arxiv.org/html/2606.05606#bib.bib235)\)framework on DAPO\-Math\-17K dataset\(Yuet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib230)\), which consists of1791717917training prompts with ground truth answers for verification\. We initiate the Beta posterior for training prompts asBeta​\(1,1\)\\mathrm\{Beta\}\(1,1\)\. More training details are deferred in Appendix[9](https://arxiv.org/html/2606.05606#S9)\.

We evaluate mathematical reasoning on four competition\-style benchmarks:AIME24,AIME25,AIME26, andAMC23\. The AIME sets test challenging multi\-step high\-school reasoning with exact\-match integer answers, while AMC23 provides broader coverage with more varied difficulty and problem formats\. Together, these benchmarks assess whether improvements generalize across contest years and beyond AIME\-style problems\.

### 5\.2Main results and analysis

Table[1](https://arxiv.org/html/2606.05606#S5.T1)compares CERO with vanilla GRPO across four representative models and four mathematical reasoning benchmarks\. CERO consistently outperforms GRPO for every evaluated model–benchmark pair, indicating that its gains are not tied to a particular architecture, model scale, or post\-training setting\. Averaged across benchmarks, CERO improves over GRPO by\+4\.84\+4\.84points on R1\-Distill\-1\.5B,\+6\.04\+6\.04points on Qwen3\-4B\-Base,\+3\.28\+3\.28points on Qwen3\-4B\-Instruct, and\+1\.53\+1\.53points on Qwen2\.5\-Math\-7B\. Such results demonstrate the effectiveness of our proposed method\. In addition, CERO introduces negligible computational overhead: in our experiments, each allocation update takes less than one second\.

Table 1:Evaluation performance comparison across different models and benchmarks\.We next analyze the source of CERO’s gains from the perspective of training signal availability\. Following Knapsack\-RL\(Liet al\.[2025](https://arxiv.org/html/2606.05606#bib.bib219)\), we define aneffective promptas a prompt whose rollout group contains both successful and failed responses\. Such prompts have non\-zero within\-group reward variance and therefore induce non\-zero GRPO gradients\. We use the*effective prompt ratio*to measure the fraction of prompts in a batch that provide useful learning signals, and report this metric in[Figure1](https://arxiv.org/html/2606.05606#S5.F1)\.

Across all LLM post\-training settings, CERO consistently achieves a higher effective prompt ratio than GRPO, which helps explain its stronger downstream performance\. In particular,[Figure1\(a\)](https://arxiv.org/html/2606.05606#S5.F1.sf1)shows that the effective prompt ratio of GRPO drops sharply after a certain number of training steps\. This trend suggests that, by that stage, DeepSeek\-R1\-Distill\-Qwen\-1\.5B has already learned many of the sampled prompts; as a result, later rollout groups often contain only successful or failure responses, yielding ineffective prompts with zero reward variance\. In contrast, CERO adaptively allocates more rollout budget to informative prompts and stops before the effective prompt ratio collapses, thereby maintaining a higher proportion of useful training signals throughout training\.

![Refer to caption](https://arxiv.org/html/2606.05606v1/x1.png)\(a\)DeepSeek\-R1\-Distill\-Qwen\-1\.5B
![Refer to caption](https://arxiv.org/html/2606.05606v1/x2.png)\(b\)Qwen3\-4B\-base
![Refer to caption](https://arxiv.org/html/2606.05606v1/x3.png)\(c\)Qwen3\-4B\-Instruct
![Refer to caption](https://arxiv.org/html/2606.05606v1/x4.png)\(d\)Qwen2\.5\-Math

Figure 1:Effective prompt ratio comparison between CERO and GRPO\.

## 6Conclusion

We introducedCERO, a cross\-epoch adaptive rollout allocation framework for LLM RL post\-training under a fixed global budget\. By modeling prompt informativeness with a Beta posterior, using a concave saturating utility to capture diminishing returns, and deriving a Fenchel\-dual online optimization formulation,CEROadaptively prioritizes prompts that provide higher expected learning value\. We established anO​\(K\)O\(\\sqrt\{K\}\)regret guarantee and empirically showed consistent improvements over GRPO across multiple models and mathematical reasoning benchmarks\. These results suggest that adaptive rollout budgeting is an effective and principled direction for improving the sample efficiency of RL post\-training\.

#### Limitations\.

Our theoretical regret guarantee is established for fixed prompt\-level utilities, whereas the implemented version uses epoch\-wise plug\-in posterior informativeness scores\. Thus, the current analysis does not fully capture the nonstationarity induced by policy updates and posterior refreshes\. Empirically, we evaluate CERO on mathematical\-reasoning post\-training with four open\-weight backbones and four benchmarks; broader validation on other domains, reward models, and larger\-scale training regimes remains important\.

## References

- Fast algorithms for online stochastic convex programming\.InProceedings of the twenty\-sixth annual ACM\-SIAM symposium on Discrete algorithms,pp\. 1405–1424\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- S\. Agrawal, Z\. Wang, and Y\. Ye \(2014\)A dynamic near\-optimal algorithm for online linear programming\.Operations Research62\(4\),pp\. 876–890\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- A\. Arlotto and I\. Gurvich \(2019\)Uniformly bounded regret in the multisecretary problem\.Stochastic Systems9\(3\),pp\. 231–260\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- S\. Bae, J\. Hong, M\. Y\. Lee, H\. Kim, J\. Nam, and D\. Kwak \(2026\)Online difficulty filtering for reasoning oriented reinforcement learning\.InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 700–719\.Cited by:[§3\.1](https://arxiv.org/html/2606.05606#S3.SS1.p1.2)\.
- S\. Balseiro, H\. Lu, and V\. Mirrokni \(2020\)Dual mirror descent for online allocation problems\.InInternational Conference on Machine Learning,pp\. 613–628\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- N\. Buchbinder and J\. Naor \(2005\)Online primal\-dual algorithms for covering and packing problems\.InEuropean Symposium on Algorithms,pp\. 689–701\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- N\. Buchbinder and J\. Naor \(2006\)Improved bounds for online routing and packing via a primal\-dual approach\.In2006 47th Annual IEEE Symposium on Foundations of Computer Science \(FOCS’06\),pp\. 293–304\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- X\. A\. Chen and Z\. Wang \(2015\)A dynamic learning algorithm for online matching problems with concave returns\.European Journal of Operational Research247\(2\),pp\. 379–388\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- X\. Chen, J\. Lu, M\. Kim, D\. Zhang, J\. Tang, A\. Piché, N\. Gontier, Y\. Bengio, and E\. Kamalloo \(2025\)Self\-evolving curriculum for llm reasoning\.arXiv preprint arXiv:2505\.14970\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- N\. R\. Devanur and T\. P\. Hayes \(2009\)The adwords problem: online keyword matching with budgeted bidders under random permutations\.InProceedings of the 10th ACM conference on Electronic commerce,pp\. 71–78\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- N\. R\. Devanur and K\. Jain \(2012\)Online matching with concave returns\.InProceedings of the forty\-fourth annual ACM symposium on Theory of computing,pp\. 137–144\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- J\. Feldman, M\. Henzinger, N\. Korula, V\. S\. Mirrokni, and C\. Stein \(2010\)Online stochastic packing applied to display ad allocation\.InEuropean Symposium on Algorithms,pp\. 182–194\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§5\.1](https://arxiv.org/html/2606.05606#S5.SS1.p1.2)\.
- Z\. Hu, J\. Qiu, T\. Bai, H\. Yang, B\. Yuan, Q\. Jing, C\. He, and W\. Zhang \(2025\)VADE: variance\-aware dynamic sampling via online sample\-level difficulty estimation for multimodal rl\.arXiv preprint arXiv:2511\.18902\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.05606#S3.SS1.p1.9)\.
- S\. Jasin and S\. Kumar \(2012\)A re\-solving heuristic with bounded revenue loss for network revenue management with customer choice\.Mathematics of Operations Research37\(2\),pp\. 313–345\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- J\. Jiang, W\. Ma, and J\. Zhang \(2025\)Degeneracy is ok: logarithmic regret for network revenue management with indiscrete distributions\.Operations Research73\(6\),pp\. 3405–3420\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- J\. Jiang and J\. Zhang \(2020\)Online resource allocation with stochastic resource consumption\.arXiv preprint arXiv:2012\.07933\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- T\. Kesselheim, A\. Tönnis, K\. Radke, and B\. Vöcking \(2014\)Primal beats dual on online packing lps in the random\-order model\.InProceedings of the forty\-sixth annual ACM symposium on Theory of computing,pp\. 303–312\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- X\. Li and Y\. Ye \(2022\)Online linear programming: dual convergence, new algorithms, and regret bounds\.Operations Research70\(5\),pp\. 2948–2966\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- Z\. Li, C\. Chen, T\. Yang, T\. Ding, R\. Sun, G\. Zhang, W\. Huang, and Z\. Luo \(2025\)Knapsack rl: unlocking exploration of llms via optimizing budget allocation\.arXiv preprint arXiv:2509\.25849\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05606#S1.p2.1),[§5\.2](https://arxiv.org/html/2606.05606#S5.SS2.p2.1)\.
- Z\. Lin, M\. Lin, Y\. Xie, and R\. Ji \(2025\)Cppo: accelerating the training of group relative policy optimization\-based reasoning models\.arXiv preprint arXiv:2503\.22342\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- S\. Liu, J\. Jiang, and X\. Li \(2022\)Non\-stationary bandits with knapsacks\.Advances in Neural Information Processing Systems35,pp\. 16522–16532\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- I\. Mahrooghi, A\. Lotfi, and E\. Abbe \(2026\)Goldilocks rl: tuning task difficulty to escape sparse rewards for reasoning\.arXiv preprint arXiv:2602\.14868\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- A\. Mehta, A\. Saberi, U\. Vazirani, and V\. Vazirani \(2007\)Adwords and generalized online matching\.Journal of the ACM \(JACM\)54\(5\),pp\. 22–es\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- M\. Molinaro and R\. Ravi \(2014\)The geometry of online packing linear programs\.Mathematics of Operations Research39\(1\),pp\. 46–59\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.05606#S1.p1.1)\.
- G\. Sheng, C\. Zhang, Z\. Ye, X\. Wu, W\. Zhang, R\. Zhang, Y\. Peng, H\. Lin, and C\. Wu \(2025\)Hybridflow: a flexible and efficient rlhf framework\.InProceedings of the Twentieth European Conference on Computer Systems,pp\. 1279–1297\.Cited by:[§5\.1](https://arxiv.org/html/2606.05606#S5.SS1.p1.2)\.
- V\. Shrivastava, A\. Awadallah, V\. Balachandran, S\. Garg, H\. Behl, and D\. Papailiopoulos \(2025\)Sample more to think less: group filtered policy optimization for concise reasoning\.arXiv preprint arXiv:2508\.09726\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1),[§3\.1](https://arxiv.org/html/2606.05606#S3.SS1.p1.9)\.
- K\. T\. Talluri and G\. J\. Van Ryzin \(2006\)The theory and practice of revenue management\.Vol\.68,Springer Science & Business Media\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.
- Y\. E\. Xu, Y\. Savani, F\. Fang, and J\. Z\. Kolter \(2025\)Not all rollouts are useful: down\-sampling rollouts in llm reinforcement learning\.arXiv preprint arXiv:2504\.13818\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§5\.1](https://arxiv.org/html/2606.05606#S5.SS1.p1.2)\.
- A\. Yang, B\. Zhang, B\. Hui, B\. Gao, B\. Yu, C\. Li, D\. Liu, J\. Tu, J\. Zhou, J\. Lin,et al\.\(2024\)Qwen2\. 5\-math technical report: toward mathematical expert model via self\-improvement\.arXiv preprint arXiv:2409\.12122\.Cited by:[§5\.1](https://arxiv.org/html/2606.05606#S5.SS1.p1.2)\.
- Z\. Yao, Y\. Zhang, Y\. Chen, Y\. Sun, Z\. Xu, Y\. Yang, T\. Hu, Q\. Gu, H\. Su, and X\. Cai \(2026\)CoBA\-rl: capability\-oriented budget allocation for reinforcement learning in llms\.arXiv preprint arXiv:2602\.03048\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu,et al\.\(2025\)Dapo: an open\-source llm reinforcement learning system at scale\.arXiv preprint arXiv:2503\.14476\.Cited by:[§1](https://arxiv.org/html/2606.05606#S1.p2.1),[§5\.1](https://arxiv.org/html/2606.05606#S5.SS1.p1.2)\.
- R\. Zhang, D\. Arora, S\. Mei, and A\. Zanette \(2025\)Speed\-rl: faster training of reasoning models via online curriculum learning\.arXiv preprint arXiv:2506\.09016\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- Z\. Zhang, Z\. Han, C\. Mavromatis, Q\. Zhu, Y\. Zhang, S\. Guan, D\. Wang, X\. Zhou, S\. Wang, S\. Adeshina,et al\.\(2026\)Train less, learn more: adaptive efficient rollout optimization for group\-based reinforcement learning\.arXiv preprint arXiv:2602\.14338\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px1.p1.1)\.
- Y\. Zong and J\. Jiang \(2026\)Online semi\-infinite linear programming: efficient algorithms via function approximation\.arXiv preprint arXiv:2603\.16200\.Cited by:[§1\.1](https://arxiv.org/html/2606.05606#S1.SS1.SSS0.Px2.p1.1)\.

\{APPENDICES\}

## 7Proof of Theorem[4\.1](https://arxiv.org/html/2606.05606#S4.Thmtheorem1)

#### Step 1: Fenchel representation ofUiU\_\{i\}\.

Recall that the prompt\-level utility is

Ui​\(n\)=1−exp⁡\(−ci​n\),ci:=η​qi,U\_\{i\}\(n\)=1\-\\exp\(\-c\_\{i\}n\),\\quad c\_\{i\}:=\\eta q\_\{i\},whereqiq\_\{i\}is the informativeness score of promptiiandη\>0\\eta\>0is the temperature parameter\. Throughout the proof, we treatqiq\_\{i\}and hencecic\_\{i\}as fixed\. The posterior update used in the implementation can be viewed as a plug\-in procedure for updating these utilities over time\.

Forn≥0n\\geq 0, we have

Ui′​\(n\)=ci​exp⁡\(−ci​n\)\.U\_\{i\}^\{\\prime\}\(n\)=c\_\{i\}\\exp\(\-c\_\{i\}n\)\.Hence

0<Ui′​\(n\)≤ci\.0<U\_\{i\}^\{\\prime\}\(n\)\\leq c\_\{i\}\.Therefore, the natural domain of the Fenchel dual variable is

θi∈\[0,ci\]\.\\theta\_\{i\}\\in\[0,c\_\{i\}\]\.
For a concave utility function, define its concave Fenchel conjugate by

Ui∗​\(θi\):=infn≥0\{θi​n−Ui​\(n\)\}\.U\_\{i\}^\{\*\}\(\\theta\_\{i\}\):=\\inf\_\{n\\geq 0\}\\left\\\{\\theta\_\{i\}n\-U\_\{i\}\(n\)\\right\\\}\.ForUi​\(n\)=1−exp⁡\(−ci​n\)U\_\{i\}\(n\)=1\-\\exp\(\-c\_\{i\}n\), this gives

Ui∗​\(θi\)=infn≥0\{θi​n−1\+exp⁡\(−ci​n\)\}\.U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)=\\inf\_\{n\\geq 0\}\\left\\\{\\theta\_\{i\}n\-1\+\\exp\(\-c\_\{i\}n\)\\right\\\}\.For0<θi≤ci0<\\theta\_\{i\}\\leq c\_\{i\}, the first\-order condition is

θi−ci​exp⁡\(−ci​n\)=0\.\\theta\_\{i\}\-c\_\{i\}\\exp\(\-c\_\{i\}n\)=0\.Thus

exp⁡\(−ci​n\)=θici,n=1ci​log⁡ciθi\.\\exp\(\-c\_\{i\}n\)=\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\},\\quad n=\\frac\{1\}\{c\_\{i\}\}\\log\\frac\{c\_\{i\}\}\{\\theta\_\{i\}\}\.Substituting this optimizer back into the conjugate yields

Ui∗​\(θi\)\\displaystyle U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)=θi⋅1ci​log⁡ciθi−1\+θici\\displaystyle=\\theta\_\{i\}\\cdot\\frac\{1\}\{c\_\{i\}\}\\log\\frac\{c\_\{i\}\}\{\\theta\_\{i\}\}\-1\+\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\}=θici​log⁡ciθi\+θici−1,0<θi≤ci\.\\displaystyle=\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\}\\log\\frac\{c\_\{i\}\}\{\\theta\_\{i\}\}\+\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\}\-1,\\quad 0<\\theta\_\{i\}\\leq c\_\{i\}\.The value atθi=0\\theta\_\{i\}=0is understood by continuity:

Ui∗​\(0\)=−1\.U\_\{i\}^\{\*\}\(0\)=\-1\.Therefore,

Ui∗​\(θi\)=θici​log⁡ciθi\+θici−1,θi∈\[0,ci\],U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)=\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\}\\log\\frac\{c\_\{i\}\}\{\\theta\_\{i\}\}\+\\frac\{\\theta\_\{i\}\}\{c\_\{i\}\}\-1,\\quad\\theta\_\{i\}\\in\[0,c\_\{i\}\],with the convention0​log⁡\(ci/0\)=00\\log\(c\_\{i\}/0\)=0\.

By concave Fenchel duality, for everyn≥0n\\geq 0,

Ui​\(n\)=minθi∈\[0,ci\]⁡\{θi​n−Ui∗​\(θi\)\}\.U\_\{i\}\(n\)=\\min\_\{\\theta\_\{i\}\\in\[0,c\_\{i\}\]\}\\left\\\{\\theta\_\{i\}n\-U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\\right\\\}\.Equivalently,

−Ui​\(n\)=maxθi∈\[0,ci\]⁡\{Ui∗​\(θi\)−θi​n\}\.\-U\_\{i\}\(n\)=\\max\_\{\\theta\_\{i\}\\in\[0,c\_\{i\}\]\}\\left\\\{U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}n\\right\\\}\.
For technical convenience, the algorithm projects onto the compact truncated domain

Θi=\[ϵ,ci\],\\Theta\_\{i\}=\[\\epsilon,c\_\{i\}\],whereϵ\>0\\epsilon\>0is arbitrarily small\. The approximation error from replacing\[0,ci\]\[0,c\_\{i\}\]by\[ϵ,ci\]\[\\epsilon,c\_\{i\}\]can be absorbed into the dual\-regret term\.

#### Step 2: OCO control of the utility term\.

Let

ni:=∑k=1KNi,kn\_\{i\}:=\\sum\_\{k=1\}^\{K\}N\_\{i,k\}denote the total number of rollouts allocated to promptiiby the online algorithm\. The realized utility of the algorithm is

ALG=∑i=1MUi​\(ni\)=∑i=1MUi​\(∑k=1KNi,k\)\.\\mathrm\{ALG\}=\\sum\_\{i=1\}^\{M\}U\_\{i\}\(n\_\{i\}\)=\\sum\_\{i=1\}^\{M\}U\_\{i\}\\\!\\left\(\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\)\.Using the concave Fenchel representation from Step 1, we have, for eachii,

−Ui​\(ni\)=maxθi∈\[0,ci\]⁡\{Ui∗​\(θi\)−θi​ni\}\.\-U\_\{i\}\(n\_\{i\}\)=\\max\_\{\\theta\_\{i\}\\in\[0,c\_\{i\}\]\}\\left\\\{U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}n\_\{i\}\\right\\\}\.Therefore,

−ALG\\displaystyle\-\\mathrm\{ALG\}=∑i=1M\[−Ui​\(∑k=1KNi,k\)\]\\displaystyle=\\sum\_\{i=1\}^\{M\}\\left\[\-U\_\{i\}\\\!\\left\(\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\)\\right\]=maxθ∈Θ​∑i=1M\[Ui∗​\(θi\)−θi​∑k=1KNi,k\],\\displaystyle=\\max\_\{\\theta\\in\\Theta\}\\sum\_\{i=1\}^\{M\}\\left\[U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\],where

Θ:=∏i=1M\[0,ci\]\.\\Theta:=\\prod\_\{i=1\}^\{M\}\[0,c\_\{i\}\]\.Rewriting the right\-hand side as a sum over epochs gives

∑i=1M\[Ui∗​\(θi\)−θi​∑k=1KNi,k\]\\displaystyle\\sum\_\{i=1\}^\{M\}\\left\[U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}\\right\]=∑k=1K∑i=1M\[1K​Ui∗​\(θi\)−θi​Ni,k\]\.\\displaystyle=\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}N\_\{i,k\}\\right\]\.Define the prompt\-level dual reward function

ψk​\(θ\):=∑i=1M\[1K​Ui∗​\(θi\)−θi​Ni,k\]\.\\psi\_\{k\}\(\\theta\):=\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}\)\-\\theta\_\{i\}N\_\{i,k\}\\right\]\.Then

−ALG=maxθ∈Θ​∑k=1Kψk​\(θ\)\.\-\\mathrm\{ALG\}=\\max\_\{\\theta\\in\\Theta\}\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta\)\.
For technical convenience, the algorithm projects onto the truncated compactdomain

Θϵ:=∏i=1M\[ϵ,ci\],\\Theta\_\{\\epsilon\}:=\\prod\_\{i=1\}^\{M\}\[\\epsilon,c\_\{i\}\],whereϵ\>0\\epsilon\>0is arbitrarily small\. The error caused by replacingΘ\\ThetawithΘϵ\\Theta\_\{\\epsilon\}can be absorbed into the prompt\-level dualregret term\.

Assume that the prompt\-level online concave maximization subroutine guarantees

maxθ∈Θϵ​∑k=1Kψk​\(θ\)−∑k=1Kψk​\(θk\)≤Regθ​\(K\)\.\\max\_\{\\theta\\in\\Theta\_\{\\epsilon\}\}\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta\)\-\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta^\{k\}\)\\leq\\mathrm\{Reg\}\_\{\\theta\}\(K\)\.Then

−𝖠𝖫𝖦≤∑k=1K∑i=1M\[1K​Ui∗​\(θik\)−θik​Ni,k\]\+Regθ⁡\(K\)\.\-\\mathsf\{ALG\}\\leq\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}N\_\{i,k\}\\right\]\+\\operatorname\{Reg\}\_\{\\theta\}\(K\)\.\(4\)

#### Step 3: OCO control of the budget violation\.

Letb:=BKb:=\\frac\{B\}\{K\}be the average rollout budget per epoch\. The total budget violation of the online allocation is

\[∑k=1K∑i=1MNi,k−B\]\+\.\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}\.SinceB=K​bB=Kb, we can rewrite the violation as

\[∑k=1K\(∑i=1MNi,k−b\)\]\+\.\\left\[\\sum\_\{k=1\}^\{K\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\\right\]\_\{\+\}\.
Letℳ:=\[0,μ¯\]\\mathcal\{M\}:=\[0,\\bar\{\\mu\}\]be the compact domain for the global budget dual variable, whereμ¯≥1\\bar\{\\mu\}\\geq 1\. For any scalarzz, we have

\[z\]\+≤maxμ∈ℳ⁡μ​z\.\[z\]\_\{\+\}\\leq\\max\_\{\\mu\\in\\mathcal\{M\}\}\\mu z\.Applying this inequality with

z=∑k=1K\(∑i=1MNi,k−b\),z=\\sum\_\{k=1\}^\{K\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\),we obtain

\[∑k=1K∑i=1MNi,k−B\]\+\\displaystyle\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}≤maxμ∈ℳ​∑k=1Kμ​\(∑i=1MNi,k−b\)\.\\displaystyle\\leq\\max\_\{\\mu\\in\\mathcal\{M\}\}\\sum\_\{k=1\}^\{K\}\\mu\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\.
Define the budget dual reward function

qk​\(μ\):=μ​\(∑i=1MNi,k−b\)\.q\_\{k\}\(\\mu\):=\\mu\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\.Assume that the global budget online concave maximization subroutine guarantees

maxμ∈ℳ​∑k=1Kqk​\(μ\)−∑k=1Kqk​\(μk\)≤Regμ​\(K\)\.\\max\_\{\\mu\\in\\mathcal\{M\}\}\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu\)\-\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu^\{k\}\)\\leq\\mathrm\{Reg\}\_\{\\mu\}\(K\)\.Then

\[∑k=1K∑i=1MNi,k−B\]\+\\displaystyle\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}≤∑k=1Kqk​\(μk\)\+Regμ​\(K\)\\displaystyle\\leq\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu^\{k\}\)\+\\mathrm\{Reg\}\_\{\\mu\}\(K\)\(5\)=∑k=1Kμk​\(∑i=1MNi,k−b\)\+Regμ​\(K\)\.\\displaystyle=\\sum\_\{k=1\}^\{K\}\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\+\\mathrm\{Reg\}\_\{\\mu\}\(K\)\.
Adding \([4](https://arxiv.org/html/2606.05606#S7.E4)\) and \([5](https://arxiv.org/html/2606.05606#S7.E5)\), we get

−𝖠𝖫𝖦\+\[∑k=1K∑i=1MNi,k−B\]\+\\displaystyle\-\\mathsf\{ALG\}\+\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}≤∑k=1K\{∑i=1M\[1K​Ui∗​\(θik\)−θik​Ni,k\]\+μk​\(∑i=1MNi,k−b\)\}\+Regθ⁡\(K\)\+Regμ⁡\(K\)\.\\displaystyle\\leq\\sum\_\{k=1\}^\{K\}\\left\\\{\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}N\_\{i,k\}\\right\]\+\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\\right\\\}\+\\operatorname\{Reg\}\_\{\\theta\}\(K\)\+\\operatorname\{Reg\}\_\{\\mu\}\(K\)\.\(6\)

#### Step 4: Comparison with an evenly spread offline optimum\.

LetNi,k⋆N\_\{i,k\}^\{\\star\}be an optimal offline allocation for the benchmark, and define the cumulative offline allocation for promptiiby

ni⋆:=∑k=1KNi,k⋆\.n\_\{i\}^\{\\star\}:=\\sum\_\{k=1\}^\{K\}N\_\{i,k\}^\{\\star\}\.Consider the evenly spread version of this offline allocation:

N¯i,k⋆:=ni⋆K,i∈\[M\],k∈\[K\]\.\\bar\{N\}\_\{i,k\}^\{\\star\}:=\\frac\{n\_\{i\}^\{\\star\}\}\{K\},\\quad i\\in\[M\],\\ k\\in\[K\]\.Since0≤Ni,k⋆≤Nmax0\\leq N\_\{i,k\}^\{\\star\}\\leq N\_\{\\max\}, we have

0≤ni⋆≤K​Nmax,0\\leq n\_\{i\}^\{\\star\}\\leq KN\_\{\\max\},and therefore

0≤N¯i,k⋆≤Nmax\.0\\leq\\bar\{N\}\_\{i,k\}^\{\\star\}\\leq N\_\{\\max\}\.ThusN¯i,k⋆\\bar\{N\}\_\{i,k\}^\{\\star\}is feasible for the per\-round allocation subproblem in the continuous relaxation\.

At epochkk, the allocation rule chooses

Ni,k∈arg⁡max0≤n≤Nmax⁡\(θik−μk\)​n\.N\_\{i,k\}\\in\\arg\\max\_\{0\\leq n\\leq N\_\{\\max\}\}\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)n\.Therefore, for everyiiandkk,

\(θik−μk\)​Ni,k≥\(θik−μk\)​N¯i,k⋆\.\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)N\_\{i,k\}\\geq\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)\\bar\{N\}\_\{i,k\}^\{\\star\}\.Equivalently,

−\(θik−μk\)​Ni,k≤−\(θik−μk\)​N¯i,k⋆\.\-\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)N\_\{i,k\}\\leq\-\(\\theta\_\{i\}^\{k\}\-\\mu^\{k\}\)\\bar\{N\}\_\{i,k\}^\{\\star\}\.Adding1K​Ui∗​\(θik\)\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)and summing overii, we obtain

∑i=1M\[1K​Ui∗​\(θik\)−θik​Ni,k\]\+μk​\(∑i=1MNi,k−b\)\\displaystyle\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}N\_\{i,k\}\\right\]\+\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\(7\)≤∑i=1M\[1K​Ui∗​\(θik\)−θik​N¯i,k⋆\]\+μk​\(∑i=1MN¯i,k⋆−b\)\.\\displaystyle\\leq\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}\\bar\{N\}\_\{i,k\}^\{\\star\}\\right\]\+\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}\\bar\{N\}\_\{i,k\}^\{\\star\}\-b\\right\)\.
UsingN¯i,k⋆=ni⋆/K\\bar\{N\}\_\{i,k\}^\{\\star\}=n\_\{i\}^\{\\star\}/K, the first term on the right\-hand side becomes

∑i=1M\[1K​Ui∗​\(θik\)−θik​ni⋆K\]\\displaystyle\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}\\frac\{n\_\{i\}^\{\\star\}\}\{K\}\\right\]=1K​∑i=1M\[Ui∗​\(θik\)−θik​ni⋆\]\.\\displaystyle=\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{M\}\\left\[U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}n\_\{i\}^\{\\star\}\\right\]\.By the Fenchel inequality from Step 1, for anyθik∈\[0,ci\]\\theta\_\{i\}^\{k\}\\in\[0,c\_\{i\}\],

Ui​\(ni⋆\)≤θik​ni⋆−Ui∗​\(θik\)\.U\_\{i\}\(n\_\{i\}^\{\\star\}\)\\leq\\theta\_\{i\}^\{k\}n\_\{i\}^\{\\star\}\-U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\.Equivalently,

Ui∗​\(θik\)−θik​ni⋆≤−Ui​\(ni⋆\)\.U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}n\_\{i\}^\{\\star\}\\leq\-U\_\{i\}\(n\_\{i\}^\{\\star\}\)\.Therefore,

1K​∑i=1M\[Ui∗​\(θik\)−θik​ni⋆\]≤−1K​∑i=1MUi​\(ni⋆\)=−OPTK\.\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{M\}\\left\[U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}n\_\{i\}^\{\\star\}\\right\]\\leq\-\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{M\}U\_\{i\}\(n\_\{i\}^\{\\star\}\)=\-\\frac\{\\mathrm\{OPT\}\}\{K\}\.
It remains to control the budget term\. Since the offline allocation is feasible,

∑i=1Mni⋆=∑i=1M∑k=1KNi,k⋆≤B\.\\sum\_\{i=1\}^\{M\}n\_\{i\}^\{\\star\}=\\sum\_\{i=1\}^\{M\}\\sum\_\{k=1\}^\{K\}N\_\{i,k\}^\{\\star\}\\leq B\.Hence

∑i=1MN¯i,k⋆=1K​∑i=1Mni⋆≤BK=b\.\\sum\_\{i=1\}^\{M\}\\bar\{N\}\_\{i,k\}^\{\\star\}=\\frac\{1\}\{K\}\\sum\_\{i=1\}^\{M\}n\_\{i\}^\{\\star\}\\leq\\frac\{B\}\{K\}=b\.Sinceμk≥0\\mu^\{k\}\\geq 0, we have

μk​\(∑i=1MN¯i,k⋆−b\)≤0\.\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}\\bar\{N\}\_\{i,k\}^\{\\star\}\-b\\right\)\\leq 0\.Combining the preceding inequalities gives, for every epochkk,

∑i=1M\[1K​Ui∗​\(θik\)−θik​Ni,k\]\+μk​\(∑i=1MNi,k−b\)≤−OPTK\.\\displaystyle\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}N\_\{i,k\}\\right\]\+\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\\leq\-\\frac\{\\mathrm\{OPT\}\}\{K\}\.Summing overk=1,…,Kk=1,\\ldots,K, we obtain

∑k=1K\{∑i=1M\[1K​Ui∗​\(θik\)−θik​Ni,k\]\+μk​\(∑i=1MNi,k−b\)\}≤−𝖮𝖯𝖳\.\\sum\_\{k=1\}^\{K\}\\left\\\{\\sum\_\{i=1\}^\{M\}\\left\[\\frac\{1\}\{K\}U\_\{i\}^\{\*\}\(\\theta\_\{i\}^\{k\}\)\-\\theta\_\{i\}^\{k\}N\_\{i,k\}\\right\]\+\\mu^\{k\}\\left\(\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-b\\right\)\\right\\\}\\leq\-\\mathsf\{OPT\}\.\(8\)

#### Step 5: Combine the inequalities

Substituting \([8](https://arxiv.org/html/2606.05606#S7.E8)\) into \([6](https://arxiv.org/html/2606.05606#S7.E6)\), we get

−𝖠𝖫𝖦\+\[∑k=1K∑i=1MNi,k−B\]\+≤−𝖮𝖯𝖳\+Regθ⁡\(K\)\+Regμ⁡\(K\)\.\-\\mathsf\{ALG\}\+\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}\\leq\-\\mathsf\{OPT\}\+\\operatorname\{Reg\}\_\{\\theta\}\(K\)\+\\operatorname\{Reg\}\_\{\\mu\}\(K\)\.Rearranging,

𝖮𝖯𝖳−𝖠𝖫𝖦\+\[∑k=1K∑i=1MNi,k−B\]\+≤Regθ⁡\(K\)\+Regμ⁡\(K\)\.\\mathsf\{OPT\}\-\\mathsf\{ALG\}\+\\left\[\\sum\_\{k=1\}^\{K\}\\sum\_\{i=1\}^\{M\}N\_\{i,k\}\-B\\right\]\_\{\+\}\\leq\\operatorname\{Reg\}\_\{\\theta\}\(K\)\+\\operatorname\{Reg\}\_\{\\mu\}\(K\)\.Since the budget violation term is nonnegative, we can drop it from the left\-hand side and obtain

𝖮𝖯𝖳−𝖠𝖫𝖦≤Regθ⁡\(K\)\+Regμ⁡\(K\)\.\\mathsf\{OPT\}\-\\mathsf\{ALG\}\\leq\\operatorname\{Reg\}\_\{\\theta\}\(K\)\+\\operatorname\{Reg\}\_\{\\mu\}\(K\)\.
###### Lemma 7\.1\(OCO regret for dual ascent updates\)

Let

Θϵ:=∏i=1M\[ϵ,ci\],ℳ:=\[0,μ¯\],\\Theta\_\{\\epsilon\}:=\\prod\_\{i=1\}^\{M\}\[\\epsilon,c\_\{i\}\],\\quad\\mathcal\{M\}:=\[0,\\bar\{\\mu\}\],whereci=η​qic\_\{i\}=\\eta q\_\{i\},ϵ\>0\\epsilon\>0, andμ¯<∞\\bar\{\\mu\}<\\infty\. Suppose that the prompt\-level dual reward functionsψk:Θϵ→ℝ\\psi\_\{k\}:\\Theta\_\{\\epsilon\}\\to\\mathbb\{R\}are concave and have uniformly bounded supergradients:

‖∇ψk​\(θ\)‖2≤Gθ,∀θ∈Θϵ,k∈\[K\],\\\|\\nabla\\psi\_\{k\}\(\\theta\)\\\|\_\{2\}\\leq G\_\{\\theta\},\\quad\\forall\\theta\\in\\Theta\_\{\\epsilon\},\\ k\\in\[K\],and that the budget dual reward functionsqk:ℳ→ℝq\_\{k\}:\\mathcal\{M\}\\to\\mathbb\{R\}are concave and have uniformly bounded supergradients:

\|∇qk​\(μ\)\|≤Gμ,∀μ∈ℳ,k∈\[K\]\.\|\\nabla q\_\{k\}\(\\mu\)\|\\leq G\_\{\\mu\},\\quad\\forall\\mu\\in\\mathcal\{M\},\\ k\\in\[K\]\.Consider the projected online gradient ascent updates

θk\+1=ΠΘϵ​\(θk\+ηθ​∇ψk​\(θk\)\),\\theta^\{k\+1\}=\\Pi\_\{\\Theta\_\{\\epsilon\}\}\\left\(\\theta^\{k\}\+\\eta\_\{\\theta\}\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\\right\),and

μk\+1=Πℳ​\(μk\+ημ​∇qk​\(μk\)\)\.\\mu^\{k\+1\}=\\Pi\_\{\\mathcal\{M\}\}\\left\(\\mu^\{k\}\+\\eta\_\{\\mu\}\\nabla q\_\{k\}\(\\mu^\{k\}\)\\right\)\.Then, with

ηθ=DθGθ​K,ημ=DμGμ​K,\\eta\_\{\\theta\}=\\frac\{D\_\{\\theta\}\}\{G\_\{\\theta\}\\sqrt\{K\}\},\\quad\\eta\_\{\\mu\}=\\frac\{D\_\{\\mu\}\}\{G\_\{\\mu\}\\sqrt\{K\}\},where

Dθ:=supθ,θ′∈Θϵ‖θ−θ′‖2,Dμ:=supμ,μ′∈ℳ\|μ−μ′\|,D\_\{\\theta\}:=\\sup\_\{\\theta,\\theta^\{\\prime\}\\in\\Theta\_\{\\epsilon\}\}\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\},\\quad D\_\{\\mu\}:=\\sup\_\{\\mu,\\mu^\{\\prime\}\\in\\mathcal\{M\}\}\|\\mu\-\\mu^\{\\prime\}\|,we have

maxθ∈Θϵ​∑k=1Kψk​\(θ\)−∑k=1Kψk​\(θk\)≤Dθ​Gθ​K,\\max\_\{\\theta\\in\\Theta\_\{\\epsilon\}\}\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta\)\-\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta^\{k\}\)\\leq D\_\{\\theta\}G\_\{\\theta\}\\sqrt\{K\},and

maxμ∈ℳ​∑k=1Kqk​\(μ\)−∑k=1Kqk​\(μk\)≤Dμ​Gμ​K\.\\max\_\{\\mu\\in\\mathcal\{M\}\}\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu\)\-\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu^\{k\}\)\\leq D\_\{\\mu\}G\_\{\\mu\}\\sqrt\{K\}\.Consequently,

Regθ​\(K\)=O​\(K\),Regμ​\(K\)=O​\(K\)\.\\mathrm\{Reg\}\_\{\\theta\}\(K\)=O\(\\sqrt\{K\}\),\\quad\\mathrm\{Reg\}\_\{\\mu\}\(K\)=O\(\\sqrt\{K\}\)\.

According to[Lemma7\.1](https://arxiv.org/html/2606.05606#S7.Thmtheorem1), we can immediately obtain:

Reg=𝖮𝖯𝖳−𝖠𝖫𝖦=O​\(K\)\.\\operatorname\{Reg\}=\\mathsf\{OPT\}\-\\mathsf\{ALG\}=O\(\\sqrt\{K\}\)\.This completes the proof\.

## 8Proof of[Lemma7\.1](https://arxiv.org/html/2606.05606#S7.Thmtheorem1)

###### Proof 8\.1

We prove the two regret bounds using the standard analysis of projected online gradient ascent\. The proof is included for completeness\.

First consider the prompt\-level dual variables\. Let

Θϵ=∏i=1M\[ϵ,ci\],\\Theta\_\{\\epsilon\}=\\prod\_\{i=1\}^\{M\}\[\\epsilon,c\_\{i\}\],and define its diameter

Dθ:=supθ,θ′∈Θϵ‖θ−θ′‖2\.D\_\{\\theta\}:=\\sup\_\{\\theta,\\theta^\{\\prime\}\\in\\Theta\_\{\\epsilon\}\}\\\|\\theta\-\\theta^\{\\prime\}\\\|\_\{2\}\.The projected gradient ascent update is

θk\+1=ΠΘϵ​\(θk\+ηθ​∇ψk​\(θk\)\)\.\\theta^\{k\+1\}=\\Pi\_\{\\Theta\_\{\\epsilon\}\}\\left\(\\theta^\{k\}\+\\eta\_\{\\theta\}\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\\right\)\.Fix any comparatorθ∈Θϵ\\theta\\in\\Theta\_\{\\epsilon\}\. By non\-expansiveness of Euclidean projection,

‖θk\+1−θ‖22\\displaystyle\\\|\\theta^\{k\+1\}\-\\theta\\\|\_\{2\}^\{2\}≤‖θk\+ηθ​∇ψk​\(θk\)−θ‖22\\displaystyle\\leq\\left\\\|\\theta^\{k\}\+\\eta\_\{\\theta\}\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\-\\theta\\right\\\|\_\{2\}^\{2\}=‖θk−θ‖22\+2​ηθ​⟨∇ψk​\(θk\),θk−θ⟩\+ηθ2​‖∇ψk​\(θk\)‖22\.\\displaystyle=\\\|\\theta^\{k\}\-\\theta\\\|\_\{2\}^\{2\}\+2\\eta\_\{\\theta\}\\left\\langle\\nabla\\psi\_\{k\}\(\\theta^\{k\}\),\\theta^\{k\}\-\\theta\\right\\rangle\+\\eta\_\{\\theta\}^\{2\}\\\|\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\\\|\_\{2\}^\{2\}\.Rearranging gives

⟨∇ψk​\(θk\),θ−θk⟩≤‖θk−θ‖22−‖θk\+1−θ‖222​ηθ\+ηθ2​‖∇ψk​\(θk\)‖22\.\\left\\langle\\nabla\\psi\_\{k\}\(\\theta^\{k\}\),\\theta\-\\theta^\{k\}\\right\\rangle\\leq\\frac\{\\\|\\theta^\{k\}\-\\theta\\\|\_\{2\}^\{2\}\-\\\|\\theta^\{k\+1\}\-\\theta\\\|\_\{2\}^\{2\}\}\{2\\eta\_\{\\theta\}\}\+\\frac\{\\eta\_\{\\theta\}\}\{2\}\\\|\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\\\|\_\{2\}^\{2\}\.Sinceψk\\psi\_\{k\}is concave, we have

ψk​\(θ\)−ψk​\(θk\)≤⟨∇ψk​\(θk\),θ−θk⟩\.\\psi\_\{k\}\(\\theta\)\-\\psi\_\{k\}\(\\theta^\{k\}\)\\leq\\left\\langle\\nabla\\psi\_\{k\}\(\\theta^\{k\}\),\\theta\-\\theta^\{k\}\\right\\rangle\.Combining the preceding two inequalities and using‖∇ψk​\(θk\)‖2≤Gθ\\\|\\nabla\\psi\_\{k\}\(\\theta^\{k\}\)\\\|\_\{2\}\\leq G\_\{\\theta\}, we obtain

ψk​\(θ\)−ψk​\(θk\)≤‖θk−θ‖22−‖θk\+1−θ‖222​ηθ\+ηθ​Gθ22\.\\psi\_\{k\}\(\\theta\)\-\\psi\_\{k\}\(\\theta^\{k\}\)\\leq\\frac\{\\\|\\theta^\{k\}\-\\theta\\\|\_\{2\}^\{2\}\-\\\|\\theta^\{k\+1\}\-\\theta\\\|\_\{2\}^\{2\}\}\{2\\eta\_\{\\theta\}\}\+\\frac\{\\eta\_\{\\theta\}G\_\{\\theta\}^\{2\}\}\{2\}\.Summing overk=1,…,Kk=1,\\ldots,K, the squared\-distance terms telescope:

∑k=1K\[ψk​\(θ\)−ψk​\(θk\)\]\\displaystyle\\sum\_\{k=1\}^\{K\}\\left\[\\psi\_\{k\}\(\\theta\)\-\\psi\_\{k\}\(\\theta^\{k\}\)\\right\]≤‖θ1−θ‖222​ηθ\+ηθ​Gθ2​K2\\displaystyle\\leq\\frac\{\\\|\\theta^\{1\}\-\\theta\\\|\_\{2\}^\{2\}\}\{2\\eta\_\{\\theta\}\}\+\\frac\{\\eta\_\{\\theta\}G\_\{\\theta\}^\{2\}K\}\{2\}≤Dθ22​ηθ\+ηθ​Gθ2​K2\.\\displaystyle\\leq\\frac\{D\_\{\\theta\}^\{2\}\}\{2\\eta\_\{\\theta\}\}\+\\frac\{\\eta\_\{\\theta\}G\_\{\\theta\}^\{2\}K\}\{2\}\.Choosing

ηθ=DθGθ​K\\eta\_\{\\theta\}=\\frac\{D\_\{\\theta\}\}\{G\_\{\\theta\}\\sqrt\{K\}\}yields

∑k=1K\[ψk​\(θ\)−ψk​\(θk\)\]≤Dθ​Gθ​K\.\\sum\_\{k=1\}^\{K\}\\left\[\\psi\_\{k\}\(\\theta\)\-\\psi\_\{k\}\(\\theta^\{k\}\)\\right\]\\leq D\_\{\\theta\}G\_\{\\theta\}\\sqrt\{K\}\.Since this holds for everyθ∈Θϵ\\theta\\in\\Theta\_\{\\epsilon\}, we have

maxθ∈Θϵ​∑k=1Kψk​\(θ\)−∑k=1Kψk​\(θk\)≤Dθ​Gθ​K\.\\max\_\{\\theta\\in\\Theta\_\{\\epsilon\}\}\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta\)\-\\sum\_\{k=1\}^\{K\}\\psi\_\{k\}\(\\theta^\{k\}\)\\leq D\_\{\\theta\}G\_\{\\theta\}\\sqrt\{K\}\.
We next prove the corresponding bound for the global budget multiplier\. Let

ℳ=\[0,μ¯\],Dμ:=supμ,μ′∈ℳ\|μ−μ′\|=μ¯\.\\mathcal\{M\}=\[0,\\bar\{\\mu\}\],\\quad D\_\{\\mu\}:=\\sup\_\{\\mu,\\mu^\{\\prime\}\\in\\mathcal\{M\}\}\|\\mu\-\\mu^\{\\prime\}\|=\\bar\{\\mu\}\.The projected gradient ascent update is

μk\+1=Πℳ​\(μk\+ημ​∇qk​\(μk\)\)\.\\mu^\{k\+1\}=\\Pi\_\{\\mathcal\{M\}\}\\left\(\\mu^\{k\}\+\\eta\_\{\\mu\}\\nabla q\_\{k\}\(\\mu^\{k\}\)\\right\)\.Fix any comparatorμ∈ℳ\\mu\\in\\mathcal\{M\}\. By non\-expansiveness of projection,

\|μk\+1−μ\|2\\displaystyle\|\\mu^\{k\+1\}\-\\mu\|^\{2\}≤\|μk\+ημ​∇qk​\(μk\)−μ\|2\\displaystyle\\leq\\left\|\\mu^\{k\}\+\\eta\_\{\\mu\}\\nabla q\_\{k\}\(\\mu^\{k\}\)\-\\mu\\right\|^\{2\}=\|μk−μ\|2\+2​ημ​∇qk​\(μk\)​\(μk−μ\)\+ημ2​\|∇qk​\(μk\)\|2\.\\displaystyle=\|\\mu^\{k\}\-\\mu\|^\{2\}\+2\\eta\_\{\\mu\}\\nabla q\_\{k\}\(\\mu^\{k\}\)\(\\mu^\{k\}\-\\mu\)\+\\eta\_\{\\mu\}^\{2\}\|\\nabla q\_\{k\}\(\\mu^\{k\}\)\|^\{2\}\.Rearranging gives

∇qk​\(μk\)​\(μ−μk\)≤\|μk−μ\|2−\|μk\+1−μ\|22​ημ\+ημ2​\|∇qk​\(μk\)\|2\.\\nabla q\_\{k\}\(\\mu^\{k\}\)\(\\mu\-\\mu^\{k\}\)\\leq\\frac\{\|\\mu^\{k\}\-\\mu\|^\{2\}\-\|\\mu^\{k\+1\}\-\\mu\|^\{2\}\}\{2\\eta\_\{\\mu\}\}\+\\frac\{\\eta\_\{\\mu\}\}\{2\}\|\\nabla q\_\{k\}\(\\mu^\{k\}\)\|^\{2\}\.Sinceqkq\_\{k\}is concave,

qk​\(μ\)−qk​\(μk\)≤∇qk​\(μk\)​\(μ−μk\)\.q\_\{k\}\(\\mu\)\-q\_\{k\}\(\\mu^\{k\}\)\\leq\\nabla q\_\{k\}\(\\mu^\{k\}\)\(\\mu\-\\mu^\{k\}\)\.Using the uniform gradient bound

\|∇qk​\(μk\)\|≤Gμ,\|\\nabla q\_\{k\}\(\\mu^\{k\}\)\|\\leq G\_\{\\mu\},we get

qk​\(μ\)−qk​\(μk\)≤\|μk−μ\|2−\|μk\+1−μ\|22​ημ\+ημ​Gμ22\.q\_\{k\}\(\\mu\)\-q\_\{k\}\(\\mu^\{k\}\)\\leq\\frac\{\|\\mu^\{k\}\-\\mu\|^\{2\}\-\|\\mu^\{k\+1\}\-\\mu\|^\{2\}\}\{2\\eta\_\{\\mu\}\}\+\\frac\{\\eta\_\{\\mu\}G\_\{\\mu\}^\{2\}\}\{2\}\.Summing overk=1,…,Kk=1,\\ldots,Kgives

∑k=1K\[qk​\(μ\)−qk​\(μk\)\]\\displaystyle\\sum\_\{k=1\}^\{K\}\\left\[q\_\{k\}\(\\mu\)\-q\_\{k\}\(\\mu^\{k\}\)\\right\]≤\|μ1−μ\|22​ημ\+ημ​Gμ2​K2\\displaystyle\\leq\\frac\{\|\\mu^\{1\}\-\\mu\|^\{2\}\}\{2\\eta\_\{\\mu\}\}\+\\frac\{\\eta\_\{\\mu\}G\_\{\\mu\}^\{2\}K\}\{2\}≤Dμ22​ημ\+ημ​Gμ2​K2\.\\displaystyle\\leq\\frac\{D\_\{\\mu\}^\{2\}\}\{2\\eta\_\{\\mu\}\}\+\\frac\{\\eta\_\{\\mu\}G\_\{\\mu\}^\{2\}K\}\{2\}\.Choosing

ημ=DμGμ​K\\eta\_\{\\mu\}=\\frac\{D\_\{\\mu\}\}\{G\_\{\\mu\}\\sqrt\{K\}\}yields

∑k=1K\[qk​\(μ\)−qk​\(μk\)\]≤Dμ​Gμ​K\.\\sum\_\{k=1\}^\{K\}\\left\[q\_\{k\}\(\\mu\)\-q\_\{k\}\(\\mu^\{k\}\)\\right\]\\leq D\_\{\\mu\}G\_\{\\mu\}\\sqrt\{K\}\.Since this holds for everyμ∈ℳ\\mu\\in\\mathcal\{M\}, we obtain

maxμ∈ℳ​∑k=1Kqk​\(μ\)−∑k=1Kqk​\(μk\)≤Dμ​Gμ​K\.\\max\_\{\\mu\\in\\mathcal\{M\}\}\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu\)\-\\sum\_\{k=1\}^\{K\}q\_\{k\}\(\\mu^\{k\}\)\\leq D\_\{\\mu\}G\_\{\\mu\}\\sqrt\{K\}\.Therefore,

Regθ​\(K\)=O​\(K\),Regμ​\(K\)=O​\(K\)\.\\mathrm\{Reg\}\_\{\\theta\}\(K\)=O\(\\sqrt\{K\}\),\\quad\\mathrm\{Reg\}\_\{\\mu\}\(K\)=O\(\\sqrt\{K\}\)\.This completes the proof\.

## 9Experimental Details

In this section, we provide the training details for all four LLMs used in our experiments\. All models are trained withCEROfor500500optimization steps using the same OGD update rule\. We use H800 GPUs for all runs: the 1\.5B and 4B models are trained on44GPUs, while the 7B model is trained on88GPUs\. R1\-Distill\-1\.5B runs about9090GPU hours, Qwen3\-4B runs about170170GPU hours and Qwen2\.5\-Math\-7B runs about200200GPU hours\.

We report OGD\-related hyperparameters in[Table2](https://arxiv.org/html/2606.05606#S9.T2)\. Across all models, we initializeμ\\muandθ\\thetato10−610^\{\-6\}and10−710^\{\-7\}, respectively\. The OGD hyperparameterη\\etais set to2×10−32\\times 10^\{\-3\}for R1\-Distill\-1\.5B and1×10−31\\times 10^\{\-3\}for the remaining models\. We tune the learning rates forμ\\muandθ\\thetaseparately for each model while keeping the remaining training protocol fixed\.

Table 2:Training details for the four trained LLM\.

Similar Articles

Gradient Extrapolation-Based Policy Optimization

arXiv cs.LG

The article introduces Gradient Extrapolation-Based Policy Optimization (GXPO), a method that approximates multi-step lookahead in RL training for LLMs using only three backward passes. It demonstrates improved reasoning performance on math benchmarks over standard GRPO while maintaining fixed active-phase costs.

CurveRL: Principled Distribution-Aware Context Reweighting for LLM Reasoning

arXiv cs.LG

This paper introduces CurveRL, a principled distribution-aware prompt reweighting approach for reinforcement learning with verifiable rewards (RLVR) that improves LLM reasoning by assigning weights based on the rank and density of pass rates rather than their absolute values, consistently outperforming GRPO and other baselines.