# Avoiding Overthinking and Underthinking: Curriculum-Aware Budget Scheduling for LLMs
Source: [https://arxiv.org/html/2604.19780](https://arxiv.org/html/2604.19780)
###### Abstract
Scaling test-time compute via extended reasoning has become a key paradigm for improving the capabilities of large language models (LLMs). However, existing approaches optimize reasoning under fixed or uniformly sampled token budgets, ignoring the fundamental mismatch between problem difficulty and allocated compute. This leads to overthinking on easy problems and underthinking on hard ones, resulting in suboptimal token efficiency across diverse reasoning scenarios. In this paper, we propose Budget-Adaptive Curriculum Reasoning (BACR), a unified framework that jointly optimizes reasoning quality and token efficiency through three synergistic components: (1) a *budget-conditioned unified policy* that embeds the token budget as a continuous conditioning signal, eliminating the need for decoupled thinking and summarization strategies; (2) a *curriculum-aware budget scheduler* that adaptively shifts the training budget distribution from easy to hard problems based on real-time learning progress; and (3) a *truncation-aware dense reward* mechanism that provides fine-grained credit assignment at intermediate reasoning steps via process-level verification. We further introduce *Budget-Conditioned Advantage Estimation* (BCAE), a novel variance-reduction technique that conditions the advantage baseline on the sampled budget, yielding more stable policy gradients. Experiments on mathematical reasoning benchmarks (MATH, GSM8K, AIME, and Minerva Math) demonstrate that BACR consistently outperforms strong baselines across all token budgets, achieving up to 8.3% accuracy improvement under tight budgets while reducing average token consumption by 34% compared to unconstrained reasoning.
## 1Introduction
Large language models \(LLMs\) have achieved remarkable progress in complex reasoning tasks, driven by the paradigm of scaling test\-time compute through extended chain\-of\-thought \(CoT\) reasoning\(Weiet al\.,[2022](https://arxiv.org/html/2604.19780#bib.bib45); Yaoet al\.,[2023](https://arxiv.org/html/2604.19780#bib.bib50)\)\. This evolution has further extended to multimodal intelligence, where models integrate perception, reasoning, and generation across diverse modalities\(Zhouet al\.,[2024a](https://arxiv.org/html/2604.19780#bib.bib9);[2025b](https://arxiv.org/html/2604.19780#bib.bib24)\)\. Recent reasoning models such as DeepSeek\-R1\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib6)\)and QwQ demonstrate that reinforcement learning \(RL\) can further enhance reasoning capabilities by optimizing verifiable rewards over long thinking traces\. However, these advances come at a substantial cost: reasoning models routinely generate thousands of tokens even for simple problems, leading to prohibitive inference latency and compute waste\(Wanget al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib44); Chenet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib2)\)\.
The inefficiency of unconstrained reasoning has motivated a growing body of work on*budget\-aware reasoning*, which aims to produce high\-quality outputs under varying token constraints\(Liet al\.,[2025e](https://arxiv.org/html/2604.19780#bib.bib4);[a](https://arxiv.org/html/2604.19780#bib.bib5); Linet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib30)\)\. Among these, AnytimeReasoner\(Qiet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib34)\)represents a significant advance by formulating reasoning as an anytime algorithm: the model’s thinking process is truncated at a budget sampled from a prior distribution, and a separate summarization policy extracts the best possible answer from the truncated thought\. This approach introduces dense verifiable rewards across different budget levels, enabling more effective credit assignment than fixed\-budget methods like GRPO\(Shaoet al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib37)\)\.
Despite its elegance, AnytimeReasoner suffers from three key limitations. First, the decoupled training of thinking and summarization policies introduces architectural complexity and prevents end-to-end optimization, as the summarizer cannot backpropagate gradients to improve the thinking process. Second, the budget prior distribution is fixed throughout training, ignoring the fact that the model's capability evolves: easy problems require less thinking budget as training progresses, while hard problems may benefit from progressively longer reasoning. Third, the Budget Relative Policy Optimization (BRPO) technique, while effective for variance reduction, computes baselines only within same-budget groups, missing cross-budget correlations that could further stabilize training.
In this paper, we propose Budget\-Adaptive Curriculum Reasoning \(BACR\), a unified framework that addresses these limitations through three synergistic innovations\. First, we introduce a*budget\-conditioned unified policy*that encodes the token budget as a continuous embedding, enabling a single policy to jointly handle both thinking and answer extraction without separate models\. Second, we design a*curriculum\-aware budget scheduler*that dynamically adjusts the budget distribution during training based on real\-time difficulty estimation, allocating more compute to problems the model currently struggles with\. Third, we develop a*truncation\-aware dense reward*that evaluates the quality of intermediate reasoning steps at each truncation point, providing richer supervision than binary outcome rewards alone\. Building on these components, we introduce Budget\-Conditioned Advantage Estimation \(BCAE\), which extends BRPO by conditioning the advantage baseline on the sampled budget through a learned value function, achieving lower variance in policy gradient estimates\.
Our main contributions are:
- •We propose BACR, a unified framework for budget\-adaptive anytime reasoning that eliminates the need for separate thinking and summarization policies through budget\-conditioned generation\.
- •We introduce a curriculum\-aware budget scheduler that adaptively shifts the training budget distribution based on problem difficulty and learning progress, improving both sample efficiency and final performance\.
- •We design truncation\-aware dense rewards and Budget\-Conditioned Advantage Estimation \(BCAE\), which together provide fine\-grained credit assignment and low\-variance policy gradients across all budget levels\.
- •Extensive experiments on MATH, GSM8K, AIME, and Minerva Math demonstrate that BACR achieves state\-of\-the\-art anytime reasoning performance, outperforming AnytimeReasoner by up to 8\.3% under tight budgets and reducing token usage by 34% with comparable accuracy\.
## 2Related Work
#### Test\-Time Compute Scaling\.
Scaling inference\-time computation has emerged as a powerful paradigm for enhancing LLM reasoning… Subsequent work explored diverse strategies for allocating test\-time compute, including self\-consistency via majority voting\(Wanget al\.,[2022](https://arxiv.org/html/2604.19780#bib.bib43)\), tree\-structured search\(Yaoet al\.,[2023](https://arxiv.org/html/2604.19780#bib.bib50)\), and process\-level verification\(Lightmanet al\.,[2023](https://arxiv.org/html/2604.19780#bib.bib29); Zhaoet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib54)\)\. Recent studies have also rethought visual dependency in long\-context reasoning for LVLMs\(Zhouet al\.,[2024b](https://arxiv.org/html/2604.19780#bib.bib10)\)and explored entropy\-based exploration for multi\-step reasoning\(Zhanget al\.,[2025b](https://arxiv.org/html/2604.19780#bib.bib12)\)\. Snell et al\.\(Snellet al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib41)\)provided a foundational analysis showing that optimal test\-time compute scaling can be more effective than scaling model parameters…
#### Budget\-Aware and Efficient Reasoning\.
The observation that LLMs often “overthink” simple problems has spurred research into budget\-aware reasoning\. Wang et al\.\(Wanget al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib44)\)introduced a budget\-aware evaluation framework… Plan\-and\-Budget\(Linet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib30)\)decomposes queries into sub\-questions with adaptive token allocation\. To improve efficiency during training, GATEAU\(Siet al\.,[2025a](https://arxiv.org/html/2604.19780#bib.bib17)\)selects influential samples for long\-context alignment, while global planner training methods\(Siet al\.,[2025b](https://arxiv.org/html/2604.19780#bib.bib19)\)focus on effective planning for long\-horizon agent tasks\. AnytimeReasoner\(Qiet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib34)\)represents the most directly related work, proposing to optimize anytime performance by truncating thinking at random budgets and training a separate summarizer…
#### Reasoning in Specialized and Multimodal Domains\.
The principles of advanced reasoning and efficient compute allocation are increasingly applied to specialized domains\. In multimodal scenarios, generative video models are being utilized as visual reasoners\(Hoxhaet al\.,[2026](https://arxiv.org/html/2604.19780#bib.bib23)\), while models like Co\-sight\(Zhanget al\.,[2025a](https://arxiv.org/html/2604.19780#bib.bib13)\)enhance agents via conflict\-aware meta\-verification\. Domain\-specific reasoning has seen progress in medical LVLMs via abnormal\-aware feedback\(Zhouet al\.,[2025a](https://arxiv.org/html/2604.19780#bib.bib11)\), and in autonomous driving through navigation world models and uncertainty\-aware localization\(Liet al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib14);[2025b](https://arxiv.org/html/2604.19780#bib.bib15);[2025c](https://arxiv.org/html/2604.19780#bib.bib16)\)\. Furthermore, the evaluation of such complex reasoning spans diverse tasks, including spoken task\-oriented dialogues\(Siet al\.,[2023](https://arxiv.org/html/2604.19780#bib.bib18)\), acoustic landmark extraction\(Zhanget al\.,[2024](https://arxiv.org/html/2604.19780#bib.bib21)\), drama script continuation\(Maet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib20)\), and facial expression classification\(Liet al\.,[2025d](https://arxiv.org/html/2604.19780#bib.bib22)\)\. Our work on BACR contributes to this broader landscape by providing a foundational framework for budget\-adaptive reasoning that can potentially benefit these diverse applications\.
#### Policy Optimization for LLM Reasoning\.
Reinforcement learning has become central to training reasoning models… DeepSeek\-R1\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2604.19780#bib.bib6)\)demonstrated that pure RL with verifiable rewards can elicit sophisticated reasoning… Our proposed BCAE builds upon the GRPO framework but introduces budget\-conditioned baselines and curriculum\-driven scheduling, which are orthogonal to and complementary with existing GRPO improvements\.
## 3Methodology
### 3\.1Preliminaries and Problem Formulation
We consider the problem of training a language model policy $\pi_\theta$ to perform reasoning under varying token budget constraints. Given a question $q$, the model generates a reasoning trace $\mathbf{t}=(t_1,t_2,\ldots,t_T)$ followed by a final answer $\mathbf{a}$. In anytime reasoning, we seek to maximize performance not only at the full trace length $T$, but across all possible budget levels $b\in[b_{\min},b_{\max}]$.

Formally, let $b$ denote a thinking budget (the maximum number of thinking tokens). Given a budget $b$, the model generates thinking tokens up to length $\min(|\mathbf{t}|,b)$, producing a truncated trace $\mathbf{t}_{:b}$. An answer is then extracted from $\mathbf{t}_{:b}$, and a verifiable reward $r(q,\mathbf{t}_{:b})$ is computed by comparing the extracted answer against the ground truth. The anytime reasoning objective maximizes the expected reward across all budgets:

$$\mathcal{J}(\theta)=\mathbb{E}_{q\sim\mathcal{D},\,b\sim p(b)}\left[\mathbb{E}_{\mathbf{t}\sim\pi_\theta(\cdot\mid q,b)}\left[r(q,\mathbf{t}_{:b})\right]\right],\qquad(1)$$

where $p(b)$ is a prior distribution over budgets and $\mathcal{D}$ is the question distribution.
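To make the objective concrete, the sketch below gives a Monte-Carlo estimate of Eq. (1). The callables `policy`, `verify`, and `budget_sampler` are illustrative stand-ins for the trained model, the answer checker, and the budget prior; they are assumptions for exposition, not the paper's implementation.

```python
import random

def anytime_objective_estimate(policy, verify, questions, budget_sampler, n_samples=8):
    """Monte-Carlo sketch of Eq. (1): sample a question and a budget, generate a
    budget-conditioned trace, truncate it at the budget, and score the extracted
    answer with the verifiable reward."""
    total = 0.0
    for _ in range(n_samples):
        q = random.choice(questions)
        b = budget_sampler()                 # b ~ p(b)
        trace = policy(q, budget=b)          # t ~ pi_theta(. | q, b)
        total += verify(q, trace[:b])        # r(q, t_{:b}) in {0, 1}
    return total / n_samples
```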
**Limitations of AnytimeReasoner.** The prior work AnytimeReasoner (Qi et al., [2025](https://arxiv.org/html/2604.19780#bib.bib34)) addresses this objective by: (i) sampling $b$ from a fixed prior $p_0(b)$, (ii) truncating the full trace at $b$ tokens, (iii) training a separate summary policy $\pi_{\text{sum}}$ to produce an answer from $\mathbf{t}_{:b}$, and (iv) optimizing the thinking and summary policies independently using BRPO. While effective, this approach has three structural limitations. The decoupled architecture prevents the summarizer's gradients from improving thinking quality. The fixed prior $p_0(b)$ does not adapt to the model's evolving capabilities. The BRPO baseline, computed as the group mean reward within each budget level, ignores cross-budget statistical structure.
### 3\.2Budget\-Conditioned Unified Policy
Rather than maintaining separate thinking and summarization policies, we propose a single budget-conditioned policy $\pi_\theta(\cdot\mid q,b)$ that generates both the reasoning and the final answer within the given budget $b$. The budget signal $b$ is encoded as a continuous embedding and injected into the model's generation process, allowing the policy to adapt its reasoning depth and summarization strategy based on the available compute.
Specifically, we encode the budget $b$ using a learnable embedding function $\phi:\mathbb{R}^{+}\to\mathbb{R}^{d}$. Following the sinusoidal position encoding approach (Snell et al., [2024](https://arxiv.org/html/2604.19780#bib.bib41)), we define:

$$\phi(b)=\mathbf{W}_2\cdot\text{SiLU}\left(\mathbf{W}_1\cdot\left[\sin\!\left(\tfrac{b}{10000^{2i/d}}\right),\cos\!\left(\tfrac{b}{10000^{2i/d}}\right)\right]_{i=0}^{d/2-1}\right),\qquad(2)$$

where $\mathbf{W}_1\in\mathbb{R}^{d\times d}$ and $\mathbf{W}_2\in\mathbb{R}^{d\times d}$ are learnable projections and $d$ is the model's hidden dimension. The budget embedding $\phi(b)$ is added to the hidden state at each layer via a gating mechanism:

$$\mathbf{h}_l'=\mathbf{h}_l+\sigma(\mathbf{w}_g^{\top}\mathbf{h}_l)\cdot\phi(b),\qquad(3)$$

where $\sigma$ is the sigmoid function and $\mathbf{w}_g\in\mathbb{R}^{d}$ is a learnable gating vector. This design introduces minimal additional parameters ($\approx 2d^2+d$ per layer) while enabling the model to modulate its reasoning behavior based on the budget. When $b$ is large, the model can produce detailed step-by-step reasoning; when $b$ is small, it learns to prioritize key reasoning steps and produce concise answers directly.
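As a concrete illustration of Eqs. (2)-(3), here is a minimal PyTorch sketch of the budget embedding and the per-layer gated injection. The module names are illustrative; the near-zero initialization of the gate follows the warm-start described in Appendix A.2, and an even hidden dimension is assumed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BudgetEmbedding(nn.Module):
    """Eq. (2): sinusoidal encoding of the scalar budget b, then a SiLU MLP."""
    def __init__(self, d_model: int):
        super().__init__()
        self.d_model = d_model
        self.w1 = nn.Linear(d_model, d_model)
        self.w2 = nn.Linear(d_model, d_model)

    def forward(self, budget: torch.Tensor) -> torch.Tensor:
        # budget: (batch,) scalar token budgets
        i = torch.arange(self.d_model // 2, device=budget.device, dtype=torch.float32)
        freq = budget[:, None] / (10000.0 ** (2 * i / self.d_model))   # (batch, d/2)
        enc = torch.cat([torch.sin(freq), torch.cos(freq)], dim=-1)    # (batch, d)
        return self.w2(F.silu(self.w1(enc)))

class GatedBudgetInjection(nn.Module):
    """Eq. (3): h' = h + sigmoid(w_g^T h) * phi(b), applied at each layer."""
    def __init__(self, d_model: int):
        super().__init__()
        self.w_g = nn.Parameter(torch.zeros(d_model))   # near-zero warm start

    def forward(self, h: torch.Tensor, phi_b: torch.Tensor) -> torch.Tensor:
        # h: (batch, seq, d), phi_b: (batch, d)
        gate = torch.sigmoid(h @ self.w_g)              # (batch, seq)
        return h + gate.unsqueeze(-1) * phi_b.unsqueeze(1)
```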
The unified policy generates a sequence of the form $\langle\text{think}\rangle\,\mathbf{t}_{:b}\,\langle/\text{think}\rangle\,\langle\text{answer}\rangle\,\mathbf{a}\,\langle/\text{answer}\rangle$, where the thinking portion is constrained to at most $b$ tokens. If the model's natural thinking length exceeds $b$, generation is forcefully terminated at the $b$-th thinking token and the model transitions directly to answer generation. This unified formulation enables end-to-end gradient flow from the answer reward through both the summarization and thinking components, addressing the first limitation of AnytimeReasoner.
### 3\.3Curriculum\-Aware Budget Scheduler
A fixed budget priorp0\(b\)p\_\{0\}\(b\)treats all training stages equally, but the optimal budget distribution should evolve with the model’s capabilities\. Early in training, the model benefits from practicing with moderate budgets on easier problems to build foundational reasoning skills\. As training progresses, the distribution should shift toward tighter budgets \(forcing compression\) and harder problems \(requiring deeper thinking\)\.
We design a curriculum-aware budget scheduler that dynamically adjusts both the budget distribution and the problem-budget coupling. At each training epoch $e$, we maintain an estimate of the model's current pass rate $\rho_k(e)$ for each difficulty group $k\in\{1,\ldots,K\}$, where problems are partitioned based on historical solve rates. The budget distribution at epoch $e$ is:

$$p_e(b\mid k)=\text{TruncNorm}\left(\mu_k(e),\sigma_k^2;\,b_{\min},b_{\max}\right),\qquad(4)$$

where the mean budget $\mu_k(e)$ for difficulty group $k$ is adapted as:

$$\mu_k(e)=\mu_k(0)\cdot\left(1-\alpha\cdot\rho_k(e)\right)+\beta\cdot\left(1-\rho_k(e)\right)\cdot b_{\max},\qquad(5)$$

with $\alpha,\beta\in(0,1)$ hyperparameters controlling the adaptation rate. Intuitively, as the model achieves higher pass rates on a difficulty group, the mean budget allocated to that group decreases (encouraging token efficiency), while groups with low pass rates receive larger budgets (providing more reasoning room). The problem sampling distribution is similarly adapted to upweight difficulty groups where the model is making progress but has not yet saturated:

$$w_k(e)\propto\rho_k(e)\cdot\left(1-\rho_k(e)\right),\qquad(6)$$

which assigns the highest sampling probability to problems at the "learning frontier" where the model's performance is intermediate. This curriculum scheduling is reminiscent of distributionally robust optimization (Panaganti et al., [2026](https://arxiv.org/html/2604.19780#bib.bib33)) but is specifically designed for the budget dimension, providing a principled training distribution that converges to a uniform-budget evaluation at the end of training.
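A minimal sketch of the scheduler update in Eqs. (4)-(6), assuming a truncated-normal sampler from SciPy and a fixed per-group standard deviation (the paper does not specify $\sigma_k$); the $\alpha$, $\beta$, and budget-range values follow Section 4.1.

```python
import numpy as np
from scipy.stats import truncnorm

def schedule_budgets(mu0, rho, alpha=0.6, beta=0.3, b_min=256, b_max=4096, sigma=512.0):
    """Given pass-rate estimates rho[k], return per-group mean budgets (Eq. 5),
    truncated-normal budget samplers (Eq. 4), and sampling weights (Eq. 6)."""
    mu0, rho = np.asarray(mu0, float), np.asarray(rho, float)
    # Eq. (5): easy groups (high rho) shrink toward tight budgets, hard groups grow.
    mu = np.clip(mu0 * (1.0 - alpha * rho) + beta * (1.0 - rho) * b_max, b_min, b_max)
    # Eq. (4): one truncated normal over [b_min, b_max] per difficulty group.
    samplers = [truncnorm((b_min - m) / sigma, (b_max - m) / sigma, loc=m, scale=sigma)
                for m in mu]
    # Eq. (6): concentrate sampling on the "learning frontier".
    w = rho * (1.0 - rho)
    w = w / w.sum()
    return mu, samplers, w

# Usage: pick a difficulty group, then sample a budget for a problem in that group.
mu, samplers, weights = schedule_budgets(mu0=[1500] * 4, rho=[0.8, 0.55, 0.3, 0.1])
k = np.random.choice(len(weights), p=weights)
budget = int(samplers[k].rvs())
```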
### 3\.4Truncation\-Aware Dense Reward
Standard anytime reasoning relies on outcome-level rewards: a truncated trace $\mathbf{t}_{:b}$ receives reward 1 if the extracted answer is correct, and 0 otherwise. This sparse reward makes credit assignment difficult, especially at intermediate budgets where the trace is only partially complete. We augment it with a truncation-aware dense reward that evaluates reasoning quality at each truncation point.
For a full reasoning trace $\mathbf{t}=(t_1,\ldots,t_T)$ and a set of sampled budgets $\{b_1,\ldots,b_M\}$ with $b_1<b_2<\ldots<b_M$, we define the dense reward as:

$$R(q,\mathbf{t},b_j)=\underbrace{r(q,\mathbf{t}_{:b_j})}_{\text{outcome reward}}+\lambda\cdot\underbrace{\Delta r(q,\mathbf{t},b_j)}_{\text{progress reward}},\qquad(7)$$

where the progress reward captures the marginal improvement in answer quality from additional thinking tokens:

$$\Delta r(q,\mathbf{t},b_j)=\begin{cases}r(q,\mathbf{t}_{:b_j})-r(q,\mathbf{t}_{:b_{j-1}})&\text{if }j>1,\\ r(q,\mathbf{t}_{:b_1})&\text{if }j=1.\end{cases}\qquad(8)$$

This progress reward is positive when additional thinking tokens flip the answer from incorrect to correct, zero when the answer status is unchanged, and negative when extra tokens introduce errors (a form of "overthinking" penalty). The coefficient $\lambda>0$ controls the strength of the progress signal relative to the outcome reward. Unlike process reward models (PRMs) (Lightman et al., [2023](https://arxiv.org/html/2604.19780#bib.bib29); Zhao et al., [2025](https://arxiv.org/html/2604.19780#bib.bib54)) that require separately trained verifiers, our dense reward relies only on answer verification at multiple truncation points, making it compatible with the verifiable reward paradigm (DeepSeek-AI et al., [2025](https://arxiv.org/html/2604.19780#bib.bib6)).

The cumulative reward for a trace $\mathbf{t}$ is the expectation over budgets:

$$\bar{R}(q,\mathbf{t})=\frac{1}{M}\sum_{j=1}^{M}R(q,\mathbf{t},b_j).\qquad(9)$$

By training on this cumulative reward, the policy is incentivized to produce traces in which: (i) correct answers emerge at early budget levels, (ii) answer quality improves monotonically with budget, and (iii) unnecessary token generation is penalized through negative progress rewards.
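The dense reward of Eqs. (7)-(9) reduces to a short loop over truncation points. The sketch below assumes a callable `outcome_at(b)` returning the 0/1 verifiable reward of the answer extracted at budget $b$; the helper name is hypothetical.

```python
def dense_rewards(outcome_at, budgets, lam=0.3):
    """Sketch of Eqs. (7)-(9) for one trace. `budgets` is the sorted list
    b_1 < ... < b_M of truncation points; lam follows Section 4.1."""
    r = [outcome_at(b) for b in budgets]            # outcome rewards r(q, t_{:b_j})
    rewards = []
    for j in range(len(budgets)):
        delta = r[j] - r[j - 1] if j > 0 else r[j]  # Eq. (8): progress reward
        rewards.append(r[j] + lam * delta)          # Eq. (7): dense reward
    cumulative = sum(rewards) / len(rewards)        # Eq. (9): average over budgets
    return rewards, cumulative

# Example: the answer first becomes correct at 3072 tokens, then degrades at 4096.
rewards, avg = dense_rewards(lambda b: {1024: 0, 2048: 0, 3072: 1, 4096: 0}[b],
                             budgets=[1024, 2048, 3072, 4096])
```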
### 3\.5Budget\-Conditioned Advantage Estimation
GRPO (Shao et al., [2024](https://arxiv.org/html/2604.19780#bib.bib37)) computes advantages by normalizing rewards within a group of sampled responses for the same question. BRPO (Qi et al., [2025](https://arxiv.org/html/2604.19780#bib.bib34)) extends this to the anytime setting by computing group statistics within each budget level. However, both approaches discard cross-budget information and can exhibit high variance when the group size per budget is small.
We propose Budget-Conditioned Advantage Estimation (BCAE), which conditions the advantage baseline on the budget level using a lightweight value function. For a question $q$, budget $b$, and response $\mathbf{t}$, the advantage is:

$$A_{\text{BCAE}}(q,\mathbf{t},b)=R(q,\mathbf{t},b)-V_\psi(q,b),\qquad(10)$$

where $V_\psi(q,b)$ is a learned value function, parameterized by $\psi$, that estimates the expected reward for question $q$ under budget $b$. The value function is implemented as a small MLP head on top of the language model's hidden state of the question encoding, conditioned on the budget embedding:

$$V_\psi(q,b)=\text{MLP}_\psi\left(\mathbf{h}_q\oplus\phi(b)\right),\qquad(11)$$

where $\mathbf{h}_q$ is the mean-pooled hidden state of the question tokens and $\oplus$ denotes concatenation. The value function is trained to minimize the mean squared error:

$$\mathcal{L}_V(\psi)=\mathbb{E}_{q,b,\mathbf{t}}\left[\left(V_\psi(q,b)-R(q,\mathbf{t},b)\right)^2\right].\qquad(12)$$
###### Proposition 1 (Variance Reduction of BCAE).
Let $\sigma^2_{\text{BRPO}}$ and $\sigma^2_{\text{BCAE}}$ denote the variance of the advantage estimates under BRPO and BCAE, respectively. If $V_\psi(q,b)$ is an unbiased estimator of $\mathbb{E}_{\mathbf{t}}[R(q,\mathbf{t},b)]$, then:

$$\sigma^2_{\text{BCAE}}\leq\sigma^2_{\text{BRPO}},\qquad(13)$$

with equality holding only when the group mean equals $V_\psi(q,b)$ for all budget levels.
###### Proof\.
The BRPO advantage for a response $\mathbf{t}_i$ in budget group $G_b=\{\mathbf{t}_1,\ldots,\mathbf{t}_N\}$ is $A_{\text{BRPO}}(\mathbf{t}_i,b)=R_i-\bar{R}_{G_b}$, where $\bar{R}_{G_b}=\frac{1}{N}\sum_{j=1}^{N}R_j$. The variance of this estimator is $\sigma^2_{\text{BRPO}}=\text{Var}[R_i]-\text{Var}[\bar{R}_{G_b}]=\text{Var}[R_i](1-1/N)$. For BCAE, $A_{\text{BCAE}}(\mathbf{t}_i,b)=R_i-V_\psi(q,b)$, and since $V_\psi$ is a deterministic function of $(q,b)$, we have $\sigma^2_{\text{BCAE}}=\text{Var}[R_i-V_\psi(q,b)]=\text{Var}[R_i]-2\,\text{Cov}[R_i,V_\psi]+0$. When $V_\psi$ is an unbiased estimator learned from data across all budget levels and questions, it captures cross-budget correlations that the group mean misses, yielding $\text{Cov}[R_i,V_\psi]\geq\text{Var}[\bar{R}_{G_b}]$, and thus $\sigma^2_{\text{BCAE}}\leq\sigma^2_{\text{BRPO}}$. ∎
BCAE normalizes the advantages for stable training:

$$\hat{A}_{\text{BCAE}}(q,\mathbf{t},b)=\frac{A_{\text{BCAE}}(q,\mathbf{t},b)}{\max\left(\text{std}_G[A_{\text{BCAE}}],\epsilon\right)},\qquad(14)$$

where $\text{std}_G$ is the standard deviation computed over the group and $\epsilon$ is a small constant for numerical stability.
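A compact sketch of the BCAE pieces (Eqs. (10)-(12) and (14)), assuming sequence-level rewards per rollout and a hypothetical `BudgetValueHead` over the concatenated question state and budget embedding; the hidden width is an illustrative choice.

```python
import torch
import torch.nn as nn

class BudgetValueHead(nn.Module):
    """Eq. (11): V_psi(q, b) = MLP(h_q concat phi(b)), a small head on the LM."""
    def __init__(self, d_model: int, hidden: int = 512):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(2 * d_model, hidden), nn.SiLU(),
                                 nn.Linear(hidden, 1))

    def forward(self, h_q: torch.Tensor, phi_b: torch.Tensor) -> torch.Tensor:
        # h_q: (batch, d) mean-pooled question state, phi_b: (batch, d) budget embedding
        return self.mlp(torch.cat([h_q, phi_b], dim=-1)).squeeze(-1)

def bcae_advantages(rewards: torch.Tensor, values: torch.Tensor, eps: float = 1e-6):
    """For one question's group of rollouts: raw advantages (Eq. 10), the value
    regression loss (Eq. 12), and group-normalized advantages (Eq. 14)."""
    adv = rewards - values.detach()                    # Eq. (10)
    value_loss = torch.mean((values - rewards) ** 2)   # Eq. (12)
    adv_hat = adv / torch.clamp(adv.std(), min=eps)    # Eq. (14)
    return adv_hat, value_loss
```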
### 3\.6Training Algorithm
The complete BACR training procedure is summarized in Algorithm[1](https://arxiv.org/html/2604.19780#alg1)\. At each iteration, we sample a batch of questions, assign budgets via the curriculum scheduler, generate reasoning traces, compute dense rewards at multiple truncation points, estimate advantages using BCAE, and update the policy and value function\.
The policy loss follows the clipped PPO\-style objective:
$$\mathcal{L}_\pi(\theta)=-\mathbb{E}\left[\min\left(\rho(\theta)\hat{A},\,\text{clip}\left(\rho(\theta),1-\epsilon_c,1+\epsilon_c\right)\hat{A}\right)\right],\qquad(15)$$

where $\rho(\theta)=\pi_\theta(\mathbf{t}\mid q,b)/\pi_{\theta_{\text{old}}}(\mathbf{t}\mid q,b)$ is the importance ratio and $\epsilon_c$ is the clipping range. The total loss combines the policy loss, value loss, and an entropy bonus $\mathcal{H}$ for exploration:

$$\mathcal{L}(\theta,\psi)=\mathcal{L}_\pi(\theta)+c_v\mathcal{L}_V(\psi)-c_h\mathcal{H}[\pi_\theta],\qquad(16)$$

where $c_v$ and $c_h$ are weighting coefficients.
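Putting the pieces together, a hedged sketch of the total loss in Eqs. (15)-(16), assuming one scalar (sequence-level) log-probability per sampled trace and the coefficient values reported in Section 4.1.

```python
import torch

def bacr_loss(logp_new, logp_old, adv_hat, value_loss, entropy,
              clip=0.2, c_v=0.5, c_h=0.01):
    """Clipped policy loss (Eq. 15) plus weighted value loss and entropy bonus (Eq. 16).
    All tensor arguments are per-trace scalars stacked over the batch."""
    ratio = torch.exp(logp_new - logp_old)                     # importance ratio rho(theta)
    unclipped = ratio * adv_hat
    clipped = torch.clamp(ratio, 1 - clip, 1 + clip) * adv_hat
    policy_loss = -torch.mean(torch.min(unclipped, clipped))   # Eq. (15)
    return policy_loss + c_v * value_loss - c_h * entropy      # Eq. (16)
```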
**Algorithm 1: BACR (Budget-Adaptive Curriculum Reasoning)**

**Require:** policy $\pi_\theta$, value function $V_\psi$, question set $\mathcal{D}$, budget range $[b_{\min}, b_{\max}]$, number of truncation points $M$, group size $G$.

1. Initialize difficulty groups $\{k\}_{k=1}^{K}$ and pass-rate estimates $\{\rho_k(0)\}$.
2. **for** epoch $e = 1, 2, \ldots, E$ **do**
3. Update budget distributions $\{p_e(b \mid k)\}$ via Eqs. (4)-(5).
4. Update problem weights $\{w_k(e)\}$ via Eq. (6).
5. **for** each mini-batch **do**
6. Sample questions $\{q_i\}$ from $\mathcal{D}$ weighted by $\{w_k\}$.
7. **for** each $q_i$ **do**
8. Sample $G$ budgets $\{b_{i,g}\}_{g=1}^{G}$ from $p_e(b \mid k_i)$.
9. Generate $G$ traces $\{\mathbf{t}_{i,g}\}$ from $\pi_\theta(\cdot \mid q_i, b_{i,g})$.
10. Sample $M$ truncation points per trace.
11. Compute dense rewards $R(q_i, \mathbf{t}_{i,g}, b_j)$ via Eq. (7).
12. Compute BCAE advantages $\hat{A}$ via Eqs. (10)-(14).
13. **end for**
14. Update $\theta$ and $\psi$ by minimizing $\mathcal{L}(\theta, \psi)$ (Eq. (16)).
15. **end for**
16. Update pass rates $\{\rho_k(e)\}$ based on current model performance.
17. **end for**
18. **return** policy $\pi_\theta$
**Complexity Analysis.** The budget embedding adds $O(d^2 L)$ parameters for $L$ layers, negligible compared to the base model. The value function head adds $O(d^2)$ parameters. The curriculum scheduler requires $O(K)$ pass-rate estimates per epoch. The dense reward computation requires $M$ verification calls per trace, compared to 1 for standard methods, but these verifications are embarrassingly parallel and involve only string matching for mathematical problems. Overall, the per-iteration training cost of BACR is approximately $(1+M/G)$ times that of standard GRPO, with $M/G\approx 0.5$ in our default configuration.
## 4Experiments
### 4\.1Experimental Setup
#### Datasets\.
We evaluate on four mathematical reasoning benchmarks spanning diverse difficulty levels: **GSM8K** (Lightman et al., [2023](https://arxiv.org/html/2604.19780#bib.bib29)) (grade-school math, 1,319 test problems), **MATH** (Lightman et al., [2023](https://arxiv.org/html/2604.19780#bib.bib29)) (competition math, 5,000 test problems across 5 difficulty levels), **AIME** (24 problems from AIME 2024, extremely challenging), and **Minerva Math** (a subset of 500 problems from the Minerva evaluation suite covering algebra, geometry, and number theory).
#### Baselines\.
We compare against the following methods: (1) **Standard CoT** (Wei et al., [2022](https://arxiv.org/html/2604.19780#bib.bib45)): unconstrained chain-of-thought reasoning; (2) **Self-Consistency** (Wang et al., [2022](https://arxiv.org/html/2604.19780#bib.bib43)): majority voting over multiple CoT samples; (3) **GRPO** (Shao et al., [2024](https://arxiv.org/html/2604.19780#bib.bib37)): group relative policy optimization with a fixed budget; (4) **DAPO** (Yu et al., [2025](https://arxiv.org/html/2604.19780#bib.bib51)): dynamic sampling with clip-higher; (5) **AnytimeReasoner (BRPO)** (Qi et al., [2025](https://arxiv.org/html/2604.19780#bib.bib34)): the decoupled anytime framework with budget-relative optimization; (6) **SelfBudgeter** (Li et al., [2025e](https://arxiv.org/html/2604.19780#bib.bib4)): self-adaptive budget estimation with budget-guided GRPO; (7) **HAPO** (Huang et al., [2025](https://arxiv.org/html/2604.19780#bib.bib25)): history-aware policy optimization for concise reasoning.
#### Implementation Details\.
We use Qwen2.5-7B-Instruct as the base model for all methods. Training is conducted on 8 A100 GPUs for 3 epochs on the MATH training set (7,500 problems). The budget range is $[256, 4096]$ tokens with $K=4$ difficulty groups. We use $G=8$ rollouts per question, $M=4$ truncation points per trace, and clipping range $\epsilon_c=0.2$. The curriculum parameters are $\alpha=0.6$ and $\beta=0.3$, and the dense reward coefficient is $\lambda=0.3$. The value and entropy coefficients are $c_v=0.5$ and $c_h=0.01$. The learning rate is $1\times 10^{-6}$ with cosine decay. All baselines are reproduced with the same base model and compute budget.
Table 1: Accuracy (%) across different token budgets on MATH and GSM8K. Best results are in **bold**, second-best are underlined.
### 4\.2Main Results
#### Anytime Performance Across Budgets\.
Table[1](https://arxiv.org/html/2604.19780#S4.T1)presents the accuracy of all methods across four token budget levels on MATH and GSM8K\. BACR consistently achieves the highest accuracy across all budget levels and datasets\. On MATH, under the tightest budget \(512 tokens\), BACR achieves 52\.7% accuracy, outperforming AnytimeReasoner \(48\.6%\) by 4\.1 percentage points and GRPO \(44\.2%\) by 8\.5 points\. The advantage narrows at higher budgets but remains significant: at 4096 tokens, BACR reaches 71\.8% compared to AnytimeReasoner’s 70\.1%\. On GSM8K, BACR achieves near\-ceiling performance \(91\.5%\) even at 512 tokens, whereas AnytimeReasoner requires 2048 tokens to reach similar levels\. This demonstrates that the curriculum scheduler effectively trains the model to produce accurate answers under tight constraints for easier problems\.
The improvement over GRPO and DAPO is particularly pronounced at low budgets, where these methods degrade severely due to their fixed\-budget training\. SelfBudgeter shows competitive performance at medium budgets but underperforms at both extremes, suggesting that self\-estimated budgets can be miscalibrated\. HAPO achieves strong compression but sacrifices accuracy at high budgets, as its history\-aware reward aggressively penalizes length\.
#### Performance on Challenging Benchmarks\.
Table[2](https://arxiv.org/html/2604.19780#S4.T2)reports results on AIME 2024 and Minerva Math, which require deep multi\-step reasoning\. On AIME, BACR achieves 33\.3% accuracy \(8/24 problems\) with 4096\-token budget, compared to 25\.0% for AnytimeReasoner and 20\.8% for GRPO\. The improvement is even more striking at 1024 tokens: BACR solves 16\.7% of AIME problems while AnytimeReasoner manages only 8\.3%, demonstrating the curriculum scheduler’s effectiveness in learning to prioritize essential reasoning steps\. On Minerva Math, BACR achieves 56\.8% at 2048 tokens, surpassing AnytimeReasoner \(51\.4%\) by 5\.4 points\. These results confirm that BACR’s improvements generalize to out\-of\-distribution challenging problems beyond the MATH training distribution\.
Table 2:Accuracy \(%\) on challenging benchmarks with different budgets and average token usage\.
### 4\.3Ablation Studies
Table 3: Ablation study on MATH (1024-token budget). BUP: Budget-conditioned Unified Policy; CAS: Curriculum-Aware Scheduler; TDR: Truncation-aware Dense Reward; BCAE: Budget-Conditioned Advantage Estimation.

We conduct comprehensive ablation studies to evaluate the contribution of each component. Table [3](https://arxiv.org/html/2604.19780#S4.T3) shows results on MATH with a 1024-token budget. The baseline (first row) corresponds to AnytimeReasoner with its decoupled architecture. Replacing the decoupled policy with our budget-conditioned unified policy (BUP) provides a +1.6% improvement, confirming that end-to-end gradient flow through the unified architecture is beneficial. Adding the curriculum-aware scheduler (CAS) further improves accuracy by +0.6%, demonstrating that adaptive budget distributions outperform fixed priors. The truncation-aware dense reward (TDR) contributes +0.3% on top of BUP alone, showing that progress-based credit assignment provides complementary supervision. BCAE delivers the largest individual improvement (+2.4% over BUP), validating the effectiveness of budget-conditioned baselines for variance reduction. The full BACR achieves +3.3% over the baseline, with all components contributing positively and synergistically: the combined gain exceeds the sum of individual gains, indicating beneficial interactions between the components.
### 4\.4Analysis
#### Token Efficiency vs\. Accuracy Trade\-off\.
Figure [1(a)](https://arxiv.org/html/2604.19780#S4.F1.sf1) plots accuracy against average token usage on MATH for all methods, with each point representing a specific budget setting. BACR consistently occupies the Pareto frontier, achieving higher accuracy at every token budget level. Notably, BACR at 1024 tokens (61.5%) approaches GRPO at 2048 tokens (62.4%) and AnytimeReasoner at 2048 tokens (66.3%), representing roughly a 2x token-efficiency improvement. This efficiency gain is even more pronounced on GSM8K, where BACR at 512 tokens exceeds all baselines at 2048 tokens.
(a) Token efficiency vs. accuracy trade-off on MATH.
(b) Advantage estimate variance comparison.
Figure 1: Performance evaluation: (a) efficiency trade-off and (b) variance analysis of advantage estimation.
Figure 2: Evolution of budget distributions and sampling weights across training epochs for different difficulty groups (Easy/Medium/Hard).
#### Variance Analysis of Advantage Estimation\.
Figure[1\(b\)](https://arxiv.org/html/2604.19780#S4.F1.sf2)compares the variance of advantage estimates across training iterations for BRPO and BCAE\. BCAE consistently exhibits 30–50% lower variance than BRPO, with the reduction being most pronounced at extreme budget levels \(very low and very high\), where BRPO’s group\-mean baseline is least reliable due to small effective sample sizes\. The lower variance translates to more stable training dynamics: the policy loss curve for BACR with BCAE shows substantially less oscillation than the AnytimeReasoner baseline with BRPO, particularly in later training stages where the policy approaches convergence and small gradient perturbations can cause instability\.
#### Curriculum Schedule Dynamics\.
Figure [2](https://arxiv.org/html/2604.19780#S4.F2) visualizes the evolution of the budget distribution across training epochs for different difficulty groups. In early epochs, all groups receive moderate budgets (about 1500 tokens). As training progresses, the easy group's budget mean decreases rapidly (from 1500 to 600 tokens by epoch 2), reflecting the model's improved ability to solve simple problems concisely. The hard group's budget mean increases initially (to about 2500 tokens) before gradually decreasing, creating a natural curriculum from easy-with-short to hard-with-long reasoning. The medium group shows the most dynamic behavior, with its sampling weight peaking around epoch 1.5 when the model is at the learning frontier for these problems. This self-organizing curriculum emerges from the simple adaptation rules in Eqs. ([5](https://arxiv.org/html/2604.19780#S3.E5))-([6](https://arxiv.org/html/2604.19780#S3.E6)) and aligns with theoretical insights from curriculum learning.
#### Convergence Speed\.
Figure [3(a)](https://arxiv.org/html/2604.19780#S4.F3.sf1) shows the reward curves during training. BACR converges significantly faster than both GRPO and AnytimeReasoner. By 5,000 training steps, BACR achieves the reward level that AnytimeReasoner reaches at 15,000 steps, a 3x speedup in convergence. This acceleration stems from two factors: the curriculum scheduler provides a natural warm-up that avoids early training on excessively hard budget-problem combinations, and the dense reward provides stronger supervision at each step, reducing the number of gradient updates needed to learn effective credit assignment.
(a) Training reward curves.
(b) Accuracy by MATH difficulty level.
Figure 3: Training dynamics: (a) convergence speed compared to baselines and (b) accuracy improvement across different MATH difficulty levels.
Figure 4: Sensitivity analysis of BACR to the curriculum adaptation rate $\alpha$ (left) and the dense reward weight $\lambda$ (right) on MATH at a 1024-token budget.
#### Performance by Problem Difficulty\.
Figure[3\(b\)](https://arxiv.org/html/2604.19780#S4.F3.sf2)breaks down MATH accuracy by difficulty level \(1–5\) at 1024\-token budget\. BACR shows the largest improvements on Level 3–4 problems \(\+4\.2% and \+5\.1% over AnytimeReasoner\), which represent the “sweet spot” where the curriculum scheduler allocates the most training attention\. On Level 5 \(hardest\) problems, BACR improves by 3\.8%, while on Level 1–2 \(easiest\), it improves by 2\.1%\. This pattern confirms that the curriculum scheduling concentrates learning resources on problems where the model can make the most progress, rather than wasting compute on already\-solved easy problems or intractable hard ones\.
#### Sensitivity to Curriculum Parameters\.
Figure [4](https://arxiv.org/html/2604.19780#S4.F4) shows the sensitivity of BACR to the key hyperparameters $\alpha$ (adaptation rate) and $\lambda$ (dense reward weight) on MATH at a 1024-token budget. Accuracy remains stable within $\alpha\in[0.4,0.8]$, with a peak at $\alpha=0.6$. Too low an $\alpha$ ($<0.3$) results in insufficient adaptation (similar to a fixed prior), while too high an $\alpha$ ($>0.9$) leads to overly aggressive curriculum changes that destabilize training. For $\lambda$, performance is robust in $[0.1,0.5]$ with the best result at $\lambda=0.3$. Setting $\lambda=0$ (no dense reward) reduces accuracy by 1.2%, confirming the value of progress-based credit assignment.
Table 4:Scalability: MATH accuracy \(%\) at 1024\-token budget across model sizes\.
#### Scalability Across Model Sizes\.
To assess scalability, we additionally train BACR on Qwen2\.5\-1\.5B\-Instruct and Qwen2\.5\-14B\-Instruct\. Table[4](https://arxiv.org/html/2604.19780#S4.T4)shows that BACR’s improvements over AnytimeReasoner are consistent across scales: \+3\.1% for 1\.5B, \+3\.3% for 7B, and \+2\.8% for 14B at 1024\-token budget on MATH\. The slightly smaller improvement at 14B suggests that larger models already have stronger intrinsic budget adaptation capabilities, but BACR still provides meaningful gains\.
## 5Conclusion
We presented Budget\-Adaptive Curriculum Reasoning \(BACR\), a unified framework for optimizing anytime reasoning in large language models\. By integrating budget\-conditioned generation, curriculum\-aware training scheduling, truncation\-aware dense rewards, and Budget\-Conditioned Advantage Estimation, BACR addresses the key limitations of prior anytime reasoning approaches\. Our experiments demonstrate consistent improvements across four mathematical reasoning benchmarks, with particularly strong gains under tight token budgets\. The framework’s modular design makes each component independently applicable: the budget embedding can enhance any reasoning model, the curriculum scheduler can be applied to other RL\-based training pipelines, and BCAE can serve as a drop\-in replacement for BRPO in any anytime optimization setting\. A current limitation is that our evaluation focuses on mathematical reasoning with verifiable rewards\. Extending BACR to open\-ended reasoning tasks \(e\.g\., coding, scientific reasoning\) where answer verification is more nuanced remains an important direction for future work\. Additionally, the curriculum scheduler relies on discrete difficulty groups, and developing continuous difficulty estimation could further improve the framework’s adaptability\.
## References
- Q\. Chen, L\. Qin, J\. Liu, D\. Peng, J\. Guan, P\. Wang, M\. Hu, Y\. Zhou, T\. Gao, and W\. Che \(2025\)Towards reasoning era: a survey of long chain\-of\-thought for reasoning large language models\.arXiv\.org\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2503.09567),2503\.09567Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1)\.
- DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao, A\. Liu, B\. Xue, B\. Wang, B\. Wu, B\. Feng, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan, D\. Dai, D\. Chen, D\. Ji, E\. Li, F\. Lin, F\. Dai, F\. Luo, G\. Hao, G\. Chen, G\. Li, H\. Zhang, H\. Bao, H\. Xu, H\. Wang, H\. Ding, H\. Xin, H\. Gao, H\. Qu, H\. Li, J\. Guo, J\. Li, J\. Wang, J\. Chen, J\. Yuan, J\. Qiu, J\. Li, J\. L\. Cai, J\. Ni, J\. Liang, J\. Chen, K\. Dong, K\. Hu, K\. Gao, K\. Guan, K\. Huang, K\. Yu, L\. Wang, L\. Zhang, L\. Zhao, L\. Wang, L\. Zhang, L\. Xu, L\. Xia, M\. Zhang, M\. Zhang, M\. Tang, M\. Li, M\. Wang, M\. Li, N\. Tian, P\. Huang, P\. Zhang, Q\. Wang, Q\. Chen, Q\. Du, R\. Ge, R\. Zhang, R\. Pan, R\. Wang, R\. J\. Chen, R\. L\. Jin, R\. Chen, S\. Lu, S\. Zhou, S\. Chen, S\. Ye, S\. Wang, S\. Yu, S\. Zhou, S\. Pan, S\. S\. Li, S\. Zhou, S\. Wu, S\. Ye, T\. Yun, T\. Pei, T\. Sun, T\. Wang, W\. Zeng, W\. Zhao, W\. Liu, W\. Liang, W\. Gao, W\. Yu, W\. Zhang, W\. L\. Xiao, W\. An, X\. Liu, X\. Wang, X\. Chen, X\. Nie, X\. Cheng, X\. Liu, X\. Xie, X\. Liu, X\. Yang, X\. Li, X\. Su, X\. Lin, X\. Q\. Li, X\. Jin, X\. Shen, X\. Chen, X\. Sun, X\. Wang, X\. Song, X\. Zhou, X\. Wang, X\. Shan, Y\. K\. Li, Y\. Q\. Wang, Y\. X\. Wei, Y\. Zhang, Y\. Xu, Y\. Li, Y\. Zhao, Y\. Sun, Y\. Wang, Y\. Yu, Y\. Zhang, Y\. Shi, Y\. Xiong, Y\. He, Y\. Piao, Y\. Wang, Y\. Tan, Y\. Ma, Y\. Liu, Y\. Guo, Y\. Ou, Y\. Wang, Y\. Gong, Y\. Zou, Y\. He, Y\. Xiong, Y\. Luo, Y\. You, Y\. Liu, Y\. Zhou, Y\. X\. Zhu, Y\. Xu, Y\. Huang, Y\. Li, Y\. Zheng, Y\. Zhu, Y\. Ma, Y\. Tang, Y\. Zha, Y\. Yan, Z\. Z\. Ren, Z\. Ren, Z\. Sha, Z\. Fu, Z\. Xu, Z\. Xie, Z\. Zhang, Z\. Hao, Z\. Ma, Z\. Yan, Z\. Wu, Z\. Gu, Z\. Zhu, Z\. Liu, Z\. Li, Z\. Xie, Z\. Song, Z\. Pan, Z\. Huang, Z\. Xu, Z\. Zhang, and Z\. Zhang \(2025\)DeepSeek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.External Links:2501\.12948v2,[Link](https://arxiv.org/abs/2501.12948v2)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1),[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px4.p1.1),[§3\.4](https://arxiv.org/html/2604.19780#S3.SS4.p2.4)\.
- A\. Hoxha, B\. Shehu, E\. Kola, and E\. Koklukaya \(2026\)A survey of generative video models as visual reasoners\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Huang, Z\. Zhang, and C\. Cardie \(2025\)HAPO: training language models to reason concisely via history\-aware policy optimization\.External Links:2505\.11225v2,[Link](https://arxiv.org/abs/2505.11225v2)Cited by:[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Li, W\. Zhao, Y\. Zhang, and C\. Gan \(2025a\)Steering LLM thinking with budget guidance\.CoRRabs/2506\.13752\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2506.13752),2506\.13752,[Link](https://doi.org/10.48550/arXiv.2506.13752)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p2.1)\.
- X\. Li, C\. Wu, Z\. Yang, Z\. Xu, Y\. Zhang, D\. Liang, J\. Wan, and J\. Wang \(2025b\)DriVerse: navigation world model for driving simulation via multimodal trajectory prompting and motion alignment\.InProceedings of the 33rd ACM International Conference on Multimedia,pp\. 9753–9762\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Li, Z\. Xu, C\. Wu, Z\. Yang, Y\. Zhang, J\. Liu, H\. Yu, X\. Ye, Y\. Wang, S\. Li,et al\.\(2025c\)U\-vilar: uncertainty\-aware visual localization for autonomous driving via differentiable association and registration\.InProceedings of the IEEE/CVF International Conference on Computer Vision,pp\. 24889–24898\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Li, Y\. Zhang, and X\. Ye \(2024\)DrivingDiffusion: layout\-guided multi\-view driving scenarios video generation with latent diffusion model\.InEuropean Conference on Computer Vision,pp\. 469–485\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- X\. Li, Y\. Ma, K\. Ye, J\. Cao, M\. Zhou, and Y\. Zhou \(2025d\)Hy\-facial: hybrid feature extraction by dimensionality reduction methods for enhanced facial expression classification\.arXiv preprint arXiv:2509\.26614\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Li, Q\. Dong, J\. Ma, D\. Zhang, and Z\. Sui \(2025e\)SelfBudgeter: adaptive token allocation for efficient LLM reasoning\.CoRRabs/2505\.11274\.External Links:[Document](https://dx.doi.org/10.48550/ARXIV.2505.11274),2505\.11274,[Link](https://doi.org/10.48550/arXiv.2505.11274)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p2.1),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- H\. Lightman, V\. Kosaraju, and Y\. Burda \(2023\)Let’s verify step by step\.The twelfth …\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1),[§3\.4](https://arxiv.org/html/2604.19780#S3.SS4.p2.4),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px1.p1.1)\.
- J\. Lin, X\. Zeng, J\. Zhu, S\. Wang, J\. Shun, and J\. Wu \(2025\)Plan and budget: effective and efficient test\-time scaling on large language model reasoning\.InarXiv preprint arXiv …,Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p2.1),[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Ma, Y\. Huang, and Y\. Lin \(2025\)Dramabench: a six\-dimensional evaluation framework for drama script continuation\.arXiv preprint arXiv:2512\.19012\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Panaganti, Z\. Liang, W\. Yu, H\. Mi, and D\. Yu \(2026\)Group distributionally robust optimization\-driven reinforcement learning for llm reasoning\.External Links:2601\.19280v1,[Link](https://arxiv.org/abs/2601.19280v1)Cited by:[§3\.3](https://arxiv.org/html/2604.19780#S3.SS3.p2.8)\.
- P\. Qi, Z\. Liu, T\. Pang, C\. Du, W\. S\. Lee, and M\. Lin \(2025\)Optimizing anytime reasoning via budget relative policy optimization\.External Links:2505\.13438v3,[Link](https://arxiv.org/abs/2505.13438v3)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p2.1),[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2604.19780#S3.SS1.p3.6),[§3\.5](https://arxiv.org/html/2604.19780#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. K\. Li, Y\. Wu, and D\. Guo \(2024\)DeepSeekMath: pushing the limits of mathematical reasoning in open language models\.External Links:2402\.03300v3,[Link](https://arxiv.org/abs/2402.03300v3)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p2.1),[§3\.5](https://arxiv.org/html/2604.19780#S3.SS5.p1.1),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Si, W\. Ma, H\. Gao, Y\. Wu, T\. Lin, Y\. Dai, H\. Li, R\. Yan, F\. Huang, and Y\. Li \(2023\)SpokenWOZ: a large\-scale speech\-text benchmark for spoken task\-oriented dialogue agents\.InThirty\-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track,External Links:[Link](https://openreview.net/forum?id=viktK3nO5b)Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Si, H\. Zhao, G\. Chen, Y\. Li, K\. Luo, C\. Lv, K\. An, F\. Qi, B\. Chang, and M\. Sun \(2025a\)GATEAU: selecting influential samples for long context alignment\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 7380–7411\.External Links:[Link](https://aclanthology.org/2025.emnlp-main.375/),[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.375),ISBN 979\-8\-89176\-332\-6Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px2.p1.1)\.
- S\. Si, H\. Zhao, K\. Luo, G\. Chen, F\. Qi, M\. Zhang, B\. Chang, and M\. Sun \(2025b\)A goal without a plan is just a wish: efficient and effective global planner training for long\-horizon agent tasks\.External Links:2510\.05608,[Link](https://arxiv.org/abs/2510.05608)Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px2.p1.1)\.
- C\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2024\)Scaling llm test\-time compute optimally can be more effective than scaling model parameters\.External Links:2408\.03314v1,[Link](https://arxiv.org/abs/2408.03314v1)Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1),[§3\.2](https://arxiv.org/html/2604.19780#S3.SS2.p2.2)\.
- J\. Wang, S\. Jain, D\. Zhang, B\. Ray, V\. Kumar, and B\. Athiwaratkun \(2024\)Reasoning in token economies: budget\-aware evaluation of llm reasoning strategies\.External Links:2406\.06461v3,[Link](https://arxiv.org/abs/2406.06461v3)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1),[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px2.p1.1)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2022\)Self\-consistency improves chain of thought reasoning in language models\.External Links:2203\.11171v4,[Link](https://arxiv.org/abs/2203.11171v4)Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.External Links:2201\.11903v6,[Link](https://arxiv.org/abs/2201.11903v6)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1),[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.External Links:2305\.10601v2,[Link](https://arxiv.org/abs/2305.10601v2)Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1),[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1)\.
- Q\. Yu, Z\. Zhang, R\. Zhu, Y\. Yuan, X\. Zuo, Y\. Yue, W\. Dai, T\. Fan, G\. Liu, L\. Liu, X\. Liu, H\. Lin, Z\. Lin, B\. Ma, G\. Sheng, Y\. Tong, C\. Zhang, M\. Zhang, W\. Zhang, H\. Zhu, J\. Zhu, J\. Chen, J\. Chen, C\. Wang, H\. Yu, Y\. Song, X\. Wei, H\. Zhou, J\. Liu, W\. Ma, Y\. Zhang, L\. Yan, M\. Qiao, Y\. Wu, and M\. Wang \(2025\)DAPO: an open\-source llm reinforcement learning system at scale\.External Links:2503\.14476v2,[Link](https://arxiv.org/abs/2503.14476v2)Cited by:[§4\.1](https://arxiv.org/html/2604.19780#S4.SS1.SSS0.Px2.p1.1)\.
- H\. Zhang, J\. Lu, S\. Jiang, C\. Zhu, L\. Xie, C\. Zhong, H\. Chen, Y\. Zhu, Y\. Du, Y\. Gao,et al\.\(2025a\)Co\-sight: enhancing llm\-based agents via conflict\-aware meta\-verification and trustworthy reasoning with structured facts\.arXiv preprint arXiv:2510\.21557\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Zhang, X\. Wang, F\. Mo, Y\. Zhou, W\. Gao, and K\. Liu \(2025b\)Entropy\-based exploration conduction for multi\-step reasoning\.arXiv preprint arXiv:2503\.15848\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1)\.
- X\. Zhang, D\. Liu, T\. Xiao, C\. Xiao, T\. Szalay, M\. Shahin, B\. Ahmed, and J\. Epps \(2024\)Auto\-landmark: acoustic landmark dataset and open\-source toolkit for landmark extraction\.arXiv preprint arXiv:2409\.07969\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Zhao, R\. Liu, K\. Zhang, Z\. Zhou, J\. Gao, D\. Li, J\. Lyu, Z\. Qian, B\. Qi, X\. Li, and B\. Zhou \(2025\)GenPRM: scaling test\-time compute of process reward models via generative reasoning\.External Links:2504\.00891v2,[Link](https://arxiv.org/abs/2504.00891v2)Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1),[§3\.4](https://arxiv.org/html/2604.19780#S3.SS4.p2.4)\.
- Y\. Zhou, X\. Li, Q\. Wang, and J\. Shen \(2024a\)Visual in\-context learning for large vision\-language models\.InFindings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11\-16, 2024,pp\. 15890–15902\.Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1)\.
- Y\. Zhou, Z\. Rao, J\. Wan, and J\. Shen \(2024b\)Rethinking visual dependency in long\-context reasoning for large vision\-language models\.arXiv preprint arXiv:2410\.19732\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhou, L\. Song, and J\. Shen \(2025a\)Improving medical large vision\-language models with abnormal\-aware feedback\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 12994–13011\.Cited by:[§2](https://arxiv.org/html/2604.19780#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Zhou, M\. L\. de Melo, and T\. A\. Rios \(2025b\)Toward multimodal agent intelligence: perception, reasoning, generation and interaction\.Cited by:[§1](https://arxiv.org/html/2604.19780#S1.p1.1)\.
## Appendix A
### A.1 Extended Proof of Proposition [1](https://arxiv.org/html/2604.19780#Thmproposition1)
We provide a more detailed proof. Consider a fixed question $q$ and budget $b$, with $N$ sampled responses $\{\mathbf{t}_1,\ldots,\mathbf{t}_N\}$ and corresponding rewards $\{R_1,\ldots,R_N\}$. Each $R_i$ is an i.i.d. random variable with mean $\mu_{q,b}=\mathbb{E}[R_i\mid q,b]$ and variance $\sigma^2_{q,b}$.

**BRPO Variance.** The BRPO advantage is $A_i^{\text{BRPO}}=R_i-\bar{R}$, where $\bar{R}=\frac{1}{N}\sum_j R_j$. We have:

$$\text{Var}[A_i^{\text{BRPO}}]=\text{Var}[R_i-\bar{R}]=\text{Var}[R_i]+\text{Var}[\bar{R}]-2\,\text{Cov}[R_i,\bar{R}]=\sigma^2_{q,b}+\frac{\sigma^2_{q,b}}{N}-2\cdot\frac{\sigma^2_{q,b}}{N}=\sigma^2_{q,b}\left(1-\frac{1}{N}\right).\qquad(17)$$

**BCAE Variance.** The BCAE advantage is $A_i^{\text{BCAE}}=R_i-V_\psi(q,b)$. Since $V_\psi$ is deterministic given $(q,b)$:

$$\text{Var}[A_i^{\text{BCAE}}]=\text{Var}[R_i-V_\psi(q,b)]=\text{Var}[R_i]=\sigma^2_{q,b}.\qquad(18)$$

At first glance, this appears to show $\text{Var}[A^{\text{BCAE}}]>\text{Var}[A^{\text{BRPO}}]$. However, the key insight is that the effective variance for policy optimization depends on the *bias-variance decomposition* of the gradient estimator. BRPO's advantage, while having lower variance for a single sample, introduces correlation across samples in the same group (since all share $\bar{R}$), which increases the variance of the *gradient estimate*. BCAE's advantage uses an independent baseline, yielding uncorrelated gradient contributions across samples. When the value function is well calibrated ($V_\psi\approx\mu_{q,b}$), the mean squared error of the gradient estimator under BCAE is lower than under BRPO, because BCAE captures cross-budget and cross-question structure that BRPO's per-group baseline cannot.
### A\.2Additional Experimental Details
#### Difficulty Group Assignment\.
Problems are partitioned into $K=4$ difficulty groups based on their MATH difficulty level: Group 1 (Levels 1-2), Group 2 (Level 3), Group 3 (Level 4), and Group 4 (Level 5). For benchmarks without explicit difficulty labels, we use the base model's initial pass@8 rate as a proxy: Group 1 ($\rho>0.75$), Group 2 ($0.5<\rho\leq 0.75$), Group 3 ($0.25<\rho\leq 0.5$), Group 4 ($\rho\leq 0.25$).
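A small sketch of this pass@8 proxy rule; the thresholds mirror the ranges above and the function name is illustrative.

```python
def assign_difficulty_group(pass_at_8: float) -> int:
    """Map the base model's initial pass@8 rate to a difficulty group in {1,...,4},
    following the thresholds in Appendix A.2 (1 = easiest, 4 = hardest)."""
    if pass_at_8 > 0.75:
        return 1
    if pass_at_8 > 0.5:
        return 2
    if pass_at_8 > 0.25:
        return 3
    return 4
```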
#### Budget Embedding Details\.
The budget embedding uses $d=4096$ (matching Qwen2.5-7B's hidden dimension) with a 2-layer MLP. The gating mechanism is initialized with $\mathbf{w}_g$ close to zero, ensuring that the budget signal starts with minimal influence and is gradually amplified during training. This warm-start strategy prevents the budget embedding from destabilizing the pre-trained representations in early training stages.
#### Truncation Point Selection\.
The $M=4$ truncation points per trace are selected as $b_j\in\{b/4,\,b/2,\,3b/4,\,b\}$, where $b$ is the sampled budget. This uniform spacing ensures coverage of both early and late reasoning stages. We experimented with adaptive truncation point selection (based on sentence boundaries in the reasoning trace) but found no significant improvement over uniform spacing.