SuCo: Sufficiency-guided Continuous Adaptive Reasoning

arXiv cs.CL Papers

Summary

Introduces SuCo, a two-stage training framework for Large Reasoning Models that uses the concept of Minimal Sufficient CoT to reduce reasoning tokens while improving accuracy across math, code, and science benchmarks.

arXiv:2606.17687v1 Announce Type: new Abstract: Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce Minimal Sufficient CoT (MSC), defined as the shortest prefix of a CoT trajectory which is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage 1, MSC-Aligned Fine-Tuning (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage 2, Sufficiency-Aware Policy Optimization (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.

Cached at: 06/17/26, 05:41 AM

# SuCo: Sufficiency-guided Continuous Adaptive Reasoning
Source: [https://arxiv.org/html/2606.17687](https://arxiv.org/html/2606.17687)
Bingyu Liang, Chenhao Hu, Longhui Zhang, Xuebo Liu, Min Zhang, Jing Li, Xuelong Li

###### Abstract

Despite remarkable performance on complex tasks, Large Reasoning Models (LRMs) often generate excessively long Chain-of-Thoughts (CoT), inflating computational costs even for simple queries. Existing efforts to mitigate this inefficiency typically rely on discrete reasoning modes or fixed budget tiers, lacking a principled criterion of when reasoning is sufficient. In this work, we introduce *Minimal Sufficient CoT* (MSC), defined as the shortest prefix of a CoT trajectory that is adequate for producing the correct answer. We empirically show that MSC not only reduces reasoning tokens, but also improves accuracy across difficulty levels. Building on MSC, we propose *Sufficiency-guided Continuous Adaptive Reasoning* (SuCo), a two-stage training framework for autonomous reasoning control along a continuous spectrum. In stage I, *MSC-Aligned Fine-Tuning* (MFT) constructs MSC data using problem-adaptive sufficiency thresholds that naturally scale with question difficulty, then fine-tunes the model to internalize concise yet sufficient reasoning patterns. In stage II, *Sufficiency-Aware Policy Optimization* (SAPO) further optimizes the model through reinforcement learning with dynamic complexity tracking and sufficiency-aware rewards that penalize both over- and under-thinking. Extensive experiments across mathematics, code, and science benchmarks show that SuCo consistently achieves improvements in both accuracy and reasoning efficiency.

Machine Learning, ICML

## 1 Introduction

Large Language Models (LLMs) have demonstrated impressive capabilities across a wide range of tasks (Zhao et al., [2023](https://arxiv.org/html/2606.17687#bib.bib48); Wang et al., [2025a](https://arxiv.org/html/2606.17687#bib.bib37); Zhang et al., [2025c](https://arxiv.org/html/2606.17687#bib.bib45)), yet continue to struggle with complex problems requiring multi-step reasoning (Cobbe et al., [2021](https://arxiv.org/html/2606.17687#bib.bib7)). To address this limitation, recent work has introduced *Large Reasoning Models* (LRMs), which explicitly generate intermediate reasoning steps via Chain-of-Thoughts (CoT) (Wei et al., [2022](https://arxiv.org/html/2606.17687#bib.bib39)). By performing step-by-step logical thinking before arriving at final answers, LRMs such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.17687#bib.bib12)) and OpenAI o1 (Jaech et al., [2024](https://arxiv.org/html/2606.17687#bib.bib19)) achieve substantial gains over standard LLMs on challenging benchmarks (Hou et al., [2025b](https://arxiv.org/html/2606.17687#bib.bib17); Xu et al., [2025](https://arxiv.org/html/2606.17687#bib.bib41)).

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/msc.png)

Figure 1: MSC vs. Full CoT on Qwen3-8B across MATH difficulty levels. Left axis ($\downarrow$): reasoning tokens. Right axis ($\uparrow$): accuracy. At each difficulty level, MSC achieves higher accuracy with significantly fewer tokens.

Despite these advances, current LRMs suffer from *redundant reasoning* (Sui et al., [2025](https://arxiv.org/html/2606.17687#bib.bib35)). Even for simple queries, they tend to generate exhaustive reasoning chains, incurring substantial computational costs and inference latency (Aggarwal & Welleck, [2025](https://arxiv.org/html/2606.17687#bib.bib2)). Such inefficiency limits practical deployment in real-time applications (e.g., online coding assistants (Jimenez et al., [2024](https://arxiv.org/html/2606.17687#bib.bib22))) and resource-constrained environments (e.g., edge devices (Zhang et al., [2024](https://arxiv.org/html/2606.17687#bib.bib46))).

To mitigate redundancy, recent studies have developed *Adaptive Large Reasoning Models* (ALRMs), which aim to adjust reasoning effort according to problem complexity (Sui et al., [2025](https://arxiv.org/html/2606.17687#bib.bib35); Wu et al., [2025](https://arxiv.org/html/2606.17687#bib.bib40)). These approaches can be broadly categorized into two paradigms. *User-controlled methods* require explicit prompts to select reasoning behaviors. For example, Qwen3 (Qwen Team, [2025](https://arxiv.org/html/2606.17687#bib.bib31)) enables manual on/off switching, while GPT-OSS (OpenAI, [2025](https://arxiv.org/html/2606.17687#bib.bib30)) provides multiple predefined reasoning strategies. In contrast, *model-driven methods* allow autonomous reasoning decisions. AdaCoT (Lou et al., [2025](https://arxiv.org/html/2606.17687#bib.bib27)) employs external assessors, whereas LHRMs (Jiang et al., [2025](https://arxiv.org/html/2606.17687#bib.bib21)) assigns reasoning status based on domain labels. Despite their differences, existing ALRMs fundamentally rely on discrete mode selection. Reasoning effort is adjusted by switching among a finite set of manually specified options, rather than being calibrated in a continuous manner.

We posit that an ideal ALRM requires: (1) reasoning length that scales with problem difficulty, (2) autonomous resource allocation without intervention, and (3) optimal performance with minimal reasoning. However, this raises a counterintuitive question. According to the test-time scaling laws (Snell et al., [2025](https://arxiv.org/html/2606.17687#bib.bib34); Brown et al., [2024](https://arxiv.org/html/2606.17687#bib.bib6)), performance typically improves with more reasoning. Can models actually perform better with less reasoning?

We provide an affirmative answer by introducing Minimal Sufficient CoT (MSC), the shortest reasoning prefix of a CoT trajectory that is sufficient to yield the correct answer. As illustrated in Figure [1](https://arxiv.org/html/2606.17687#S1.F1), across all five difficulty levels of the MATH benchmark (Hendrycks et al., [2021b](https://arxiv.org/html/2606.17687#bib.bib15)), MSC dramatically reduces reasoning tokens while consistently outperforming full CoT in accuracy. This reveals that rather than blindly scaling reasoning resources, test-time adaptation offers a more efficient solution.

Building on this insight, we propose Sufficiency-guided Continuous Adaptive Reasoning (SuCo), a two-stage training framework enabling continuous reasoning control. Unlike prior discrete approaches that depend on external classifiers or predefined budget tiers, SuCo introduces problem-adaptive sufficiency thresholds that naturally adjust to question difficulty. In Stage I, *MSC-Aligned Fine-Tuning* (MFT) constructs an MSC dataset from full CoT trajectories, then performs supervised fine-tuning (SFT) to internalize concise yet sufficient reasoning patterns. In Stage II, *Sufficiency-Aware Policy Optimization* (SAPO) further trains the model to dynamically allocate reasoning effort through reinforcement learning (RL). Critically, SAPO maintains a dynamic complexity pool to track evolving reasoning distributions during training, and employs sufficiency-aware rewards that penalize both insufficient and excessive reasoning.

Extensive experiments are conducted across mathematics, code, and science domains at both 1\.5B and 7B model scales\. Results demonstrate that SuCo achieves superior accuracy with substantially fewer reasoning tokens, outperforming full CoT and ALRM baselines\.

Our key contributions are summarized as follows:

- We formalize MSC, providing a principled sufficiency criterion revealing that models can achieve stronger performance with less reasoning.
- We propose SuCo, a two-stage training paradigm for continuous and autonomous reasoning control without discrete modes or external intervention.
- Comprehensive experiments spanning diverse domains demonstrate the effectiveness of SuCo.

## 2 Related Work

#### Large Reasoning Models\.

Large Reasoning Models (LRMs) extend Large Language Models (LLMs) by explicitly generating intermediate reasoning steps via Chain-of-Thoughts (CoT), which has been shown to substantially improve performance on challenging multi-step tasks (Wei et al., [2022](https://arxiv.org/html/2606.17687#bib.bib39); Kojima et al., [2022](https://arxiv.org/html/2606.17687#bib.bib23)). Building on this paradigm, recent LRMs such as DeepSeek-R1 (Guo et al., [2025](https://arxiv.org/html/2606.17687#bib.bib12)), OpenAI o1 (Jaech et al., [2024](https://arxiv.org/html/2606.17687#bib.bib19)), and Qwen3 (Qwen Team, [2025](https://arxiv.org/html/2606.17687#bib.bib31)) further strengthen reasoning capabilities through large-scale supervised fine-tuning (SFT) on high-quality CoT data, often combined with reinforcement learning (RL) with curated rewards. Despite these advances, current LRMs frequently produce unnecessarily verbose reasoning even for trivial queries, incurring significant inference overhead and motivating the need for more efficient reasoning control.

#### Adaptive Large Reasoning Models\.

To mitigate reasoning redundancy, recent efforts have explored Adaptive Large Reasoning Models (ALRMs) that modulate reasoning length based on problem difficulty. Early approaches primarily focus on *binary triggering* of reasoning. AdaCoT (Lou et al., [2025](https://arxiv.org/html/2606.17687#bib.bib27)) employs an external model to decide whether to activate CoT; AdaptThink (Zhang et al., [2025b](https://arxiv.org/html/2606.17687#bib.bib44)) formulates reasoning activation as a constrained optimization problem; LHRMs (Jiang et al., [2025](https://arxiv.org/html/2606.17687#bib.bib21)) assigns reasoning behaviors using coarse domain-level labels (e.g., math vs. chat). Beyond binary control, subsequent methods investigate *multi-mode reasoning*. SABER (Zhao et al., [2025](https://arxiv.org/html/2606.17687#bib.bib47)) and ThinkDial (He et al., [2025](https://arxiv.org/html/2606.17687#bib.bib13)) introduce multiple predefined reasoning strategies or budget tiers, selected via system prompts. Additional recent efforts have explored more fine-grained control. ThinkPrune (Hou et al., [2025a](https://arxiv.org/html/2606.17687#bib.bib16)) applies reinforcement learning to prune long reasoning chains, while CyclicReflex (Fan et al., [2026](https://arxiv.org/html/2606.17687#bib.bib9)) schedules reflection tokens cyclically to balance depth and efficiency. AlphaOne (Zhang et al., [2025a](https://arxiv.org/html/2606.17687#bib.bib43)) explores dual-speed reasoning at test time, enabling models to adaptively think slow or fast. Complementary analyses have also highlighted the phenomena of *underthinking* (Wang et al., [2025b](https://arxiv.org/html/2606.17687#bib.bib38)) and the mirage of test-time scaling (Ghosal et al., [2026](https://arxiv.org/html/2606.17687#bib.bib11)), which further motivate principled control over reasoning effort.

Despite their progress, existing ALRMs share a fundamental limitation: reasoning effort is regulated through discrete specified modes, supported by coarse supervision signals such as external estimators, predefined data categories, or heuristic length constraints. Such discrete control overlooks the internal logical sufficiency of reasoning trajectories and lacks the flexibility to finely calibrate reasoning depth in a problem-specific manner.

In contrast, our work proposes *continuous adaptive reasoning* grounded in the concept of *Minimal Sufficient CoT* (MSC). By introducing a principled sufficiency criterion, we enable fine-grained assessment of whether a reasoning prefix is adequate to support a confident answer. Unlike discrete modes or fixed truncation rules, our sufficiency-aware training empowers the model to autonomously calibrate its reasoning effort along a continuous spectrum.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/SuCo.png)

Figure 2: Illustration of Minimal Sufficient CoT (MSC). For a given question, the sufficiency score (geometric mean over ground-truth answer tokens) is computed at each generation position. The MSC is the shortest prefix exceeding the adaptive threshold $\delta$. As shown, once the sufficiency threshold is reached, extended *waiting* or self-verification steps lead to a rapid decline in sufficiency, indicating that additional reasoning contributes little benefit and may even degrade confidence.

## 3 Methodology

### 3.1 Problem Formulation

#### Notation\.

Consider a dataset $\mathcal{D}$ of question-answer pairs $(x, y^{*})$, where $x$ denotes an input question and $y^{*}$ the ground-truth answer. Given $x$, a reasoning model $\pi_{\theta}$ generates a CoT trajectory $z = (z_{1}, z_{2}, \ldots, z_{L_{z}})$, with $L_{z}$ sentences and a total of $\|z\|$ tokens. Conditioned on $x$ and $z$, the model generates the final answer: $y \sim \pi_{\theta}(\cdot \mid x, z)$.

### 3.2 MSC: Minimal Sufficient CoT

Figure [2](https://arxiv.org/html/2606.17687#S2.F2) provides an intuitive illustration of MSC.

#### Reasoning Sufficiency\.

To quantify how well a reasoning trajectory supports the ground\-truth, we define the*reasoning sufficiency*:

$$\mathcal{S}_{\theta}(z \mid x, y^{*}) := \left( \prod_{i=1}^{\|y^{*}\|} \pi_{\theta}(y^{*}_{i} \mid x, z, y^{*}_{<i}) \right)^{1/\|y^{*}\|} \tag{1}$$

The most natural signal is the joint probability $\prod_{i} \pi_{\theta}(y^{*}_{i} \mid x, z, y^{*}_{<i})$. However, it decays exponentially with answer length, making it fragile for long sequences. To address this, we employ the geometric mean, which normalizes the joint probability into a per-token average. We empirically validate this choice in Appendix [A.4](https://arxiv.org/html/2606.17687#A1.SS4).
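The contrast between the joint probability and its geometric mean in Eq. 1 can be illustrated with a minimal Python sketch. The function name `sufficiency` and the toy probabilities are assumptions made for illustration; in practice the per-token probabilities $\pi_{\theta}(y^{*}_{i} \mid x, z, y^{*}_{<i})$ would come from a forward pass of the model, which is not shown here.

```python
import math


def sufficiency(token_probs):
    """Geometric mean of the per-token probabilities of the ground-truth answer (Eq. 1).

    token_probs is a hypothetical list holding one probability per answer token.
    The product is accumulated in log space to avoid floating-point underflow.
    """
    if not token_probs:
        raise ValueError("the answer must contain at least one token")
    if any(p <= 0.0 for p in token_probs):
        return 0.0  # a zero-probability token makes the whole product zero
    mean_log_prob = sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_log_prob)


# Toy example: the same per-token confidence for a short and a long answer.
short_answer = [0.9] * 2
long_answer = [0.9] * 20

joint_long = math.prod(long_answer)    # shrinks as the answer gets longer
score_short = sufficiency(short_answer)
score_long = sufficiency(long_answer)  # stays at the per-token level
```

With these toy inputs the joint probability of the long answer falls to roughly 0.12, while the geometric mean stays near 0.9 for both answers, which is the length-robustness property the paragraph above argues for.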

#### Sufficient CoT\.

Then we can determine whether reasoning is adequate by introducing a confidence threshold $\delta \in [0, 1]$. A trajectory $z$ is termed *$\delta$-sufficient* if $\mathcal{S}_{\theta}(z \mid x, y^{*}) \geq \delta$.

#### MSC Definition\.

We further define the MSC as the shortest reasoning prefix satisfying sufficiency. We identify MSC at the sentence level, as sentence boundaries naturally correspond to atomic reasoning steps, and avoid fragmentary truncation that may distort logical structure. We say the prefix $z_{<t^{*}}$ is a $\delta$-MSC if and only if:

$$\begin{cases}\mathcal{S}_{\theta}(z_{<t^{*}} \mid x, y^{*}) \geq \delta & \text{(Sufficiency)} \\ \mathcal{S}_{\theta}(z_{<t} \mid x, y^{*}) < \delta, \quad \forall t < t^{*} & \text{(Minimality)}\end{cases} \tag{2}$$
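The two conditions of Eq. 2 amount to a left-to-right scan over sentence-level prefixes. The sketch below assumes the prefix sufficiency scores have already been computed; the name `msc_index` and the convention that `scores[t]` is the sufficiency of the prefix containing the first `t` sentences are illustrative choices, not notation from the paper.

```python
from typing import Optional, Sequence


def msc_index(scores: Sequence[float], delta: float) -> Optional[int]:
    """Return the smallest t whose prefix score reaches delta, or None.

    scores[t] is assumed to hold the sufficiency of the prefix made of the
    first t sentences, so scores[0] corresponds to answering with no reasoning.
    Returning the first hit guarantees minimality: every earlier prefix
    scored strictly below delta.
    """
    for t, score in enumerate(scores):
        if score >= delta:
            return t
    return None


# Hypothetical score curve: sufficiency rises, peaks, then declines again.
curve = [0.05, 0.20, 0.55, 0.80, 0.60, 0.40]
t_star = msc_index(curve, delta=0.7)      # first prefix at or above 0.7
no_prefix = msc_index(curve, delta=0.9)   # threshold never reached
```

Because the scan stops at the first sufficient prefix, a later decline in the score curve does not affect the result.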

#### Problem\-Adaptive Threshold\.

A fixed threshold $\delta$ applies the same confidence bar uniformly across all problems, regardless of their inherent difficulty. However, for simple problems, a high $\delta$ retains unnecessary reasoning, while for hard problems, a low $\delta$ may truncate critical reasoning steps prematurely. We therefore introduce a problem-adaptive threshold:

$$\delta(x) = \delta_{0} + \alpha \cdot \mathcal{C}(x) \tag{3}$$

where $\delta_{0}$ is the base value, $\alpha$ controls sensitivity to complexity, and $\mathcal{C}(x) \in [0, 1]$ denotes problem complexity. This produces a more discriminative MSC distribution across difficulty levels, providing a stronger adaptive prior for subsequent training.

#### Percentile\-Based Complexity Estimation\.

We estimate the complexity of a problem as the percentile rank of its reasoning length in the dataset. Reasoning length serves as a practical proxy for problem complexity, as empirically supported in Figure [1](https://arxiv.org/html/2606.17687#S1.F1). Formally, given a dataset $\mathcal{D} = \{(x_{i}, y_{i}^{*}, z_{i})\}_{i=1}^{N}$, we define:

$$\mathcal{C}(x_{i}) = \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\left[\|z_{j}\| \leq \|z_{i}\|\right] \tag{4}$$

This percentile-based measure is robust to outliers and yields values uniformly distributed in $[0, 1]$, ensuring stable threshold scaling across problems.
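Eq. 3 and Eq. 4 together map a list of trajectory lengths to per-problem thresholds. The following sketch is one possible implementation: the function names are hypothetical, the toy lengths are invented for illustration, and the defaults $\delta_{0} = 0.5$ and $\alpha = 0.4$ are the values reported in Section 4.1.

```python
from bisect import bisect_right


def complexities(lengths):
    """Percentile rank of each reasoning length within the dataset (Eq. 4)."""
    ordered = sorted(lengths)
    n = len(ordered)
    # bisect_right counts how many lengths are <= the current one.
    return [bisect_right(ordered, length) / n for length in lengths]


def adaptive_thresholds(lengths, delta0=0.5, alpha=0.4):
    """Problem-adaptive threshold delta(x) = delta0 + alpha * C(x) (Eq. 3)."""
    return [delta0 + alpha * c for c in complexities(lengths)]


# Hypothetical token counts of four full CoT trajectories.
toy_lengths = [120, 800, 4500, 800]
toy_complexity = complexities(toy_lengths)
toy_thresholds = adaptive_thresholds(toy_lengths)
```

Two details follow directly from the indicator in Eq. 4: tied lengths receive the same complexity, and with a finite dataset the smallest attainable value is $1/N$ rather than exactly 0.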

### 3.3 Stage I: MSC-Aligned Fine-Tuning

The first stage, termed MSC\-Aligned Fine\-Tuning \(MFT\), aligns the model to produce concise yet sufficient reasoning through SFT on a curated MSC dataset\. This stage consists of two steps: \(1\) constructing MSC data from full CoT trajectories, and \(2\) fine\-tuning the model to internalize adaptive reasoning patterns\.

#### MSC Data Construction\.

From source dataset $\mathcal{D}_{\text{src}} = \{(x_{i}, y_{i}^{*})\}_{i=1}^{N}$, a strong reasoning model $\mathcal{M}_{\text{LRM}}$ generates full CoT and answers: $(\hat{z}_{i}, \hat{y}_{i}) \sim \mathcal{M}_{\text{LRM}}(x_{i})$. We then extract MSC from each trajectory via the following procedure:

▶ (i) Compute adaptive thresholds. With access to all trajectory lengths $\{\|\hat{z}_{i}\|\}_{i=1}^{N}$, we derive per-sample complexity $\mathcal{C}(x_{i})$ and threshold $\delta(x_{i})$ using Eq. [4](https://arxiv.org/html/2606.17687#S3.E4) and Eq. [3](https://arxiv.org/html/2606.17687#S3.E3).

**Algorithm 1** MSC Dataset Construction

**Require:** Source dataset $\mathcal{D}_{\text{src}}$; models $\mathcal{M}_{\text{LRM}}$, $\mathcal{M}_{\text{refine}}$, $\pi_{\theta}$; hyperparameters $\delta_{0}$, $\alpha$, $L_{\min}$

1. **for** each $(x_{i}, y_{i}^{*}) \in \mathcal{D}_{\text{src}}$ **do**
2. $(\hat{z}_{i}, \hat{y}_{i}) \sim \mathcal{M}_{\text{LRM}}(x_{i})$
3. **end for**
4. $\mathcal{D}_{\text{full}} \leftarrow \{(x_{i}, y_{i}^{*}, \hat{z}_{i}, \hat{y}_{i})\}_{i=1}^{N}$
5. **for** each $i \in [1, N]$ **do**
6. $\mathcal{C}(x_{i}) \leftarrow \frac{1}{N} \sum_{j=1}^{N} \mathbb{1}\left[\|\hat{z}_{j}\| \leq \|\hat{z}_{i}\|\right]$
7. $\delta(x_{i}) \leftarrow \delta_{0} + \alpha \cdot \mathcal{C}(x_{i})$
8. $t_{i}^{*} \leftarrow \operatorname{argmin}_{t \in [0, L_{\hat{z}_{i}}]} \mathcal{S}_{\theta}(\hat{z}_{i,<t} \mid x_{i}, y_{i}^{*}) \geq \delta(x_{i})$
9. **if** no such $t$ exists **then**
10. $t_{i}^{*} \leftarrow \operatorname{argmax}_{t \in [0, L_{\hat{z}_{i}}]} \mathcal{S}_{\theta}(\hat{z}_{i,<t} \mid x_{i}, y_{i}^{*})$
11. **end if**
12. $z_{i}^{\text{raw}} \leftarrow \hat{z}_{i,<t_{i}^{*}}$
13. **if** $L_{z_{i}^{\text{raw}}} \leq L_{\min}$ **then**
14. $z_{i}^{\text{MSC}} \leftarrow \varnothing$
15. **else**
16. $z_{i}^{\text{MSC}} \leftarrow \mathcal{M}_{\text{refine}}(x_{i}, z_{i}^{\text{raw}}, \hat{y}_{i})$
17. **end if**
18. **end for**
19. **return** $\mathcal{D}_{\text{MSC}} = \{(x_{i}, z_{i}^{\text{MSC}}, \hat{y}_{i})\}_{i=1}^{N}$

▶ (ii) Identify raw MSC prefixes. For each sample, we scan sentence-level prefixes to find the minimal sufficient one:

$$t_{i}^{*} = \operatorname{argmin}_{t \in [0, L_{\hat{z}_{i}}]} \mathcal{S}_{\theta}(\hat{z}_{i,<t} \mid x_{i}, y_{i}^{*}) \geq \delta(x_{i}) \tag{5}$$

If no prefix reaches $\delta(x_{i})$, we select the most sufficient one:

$$t_{i}^{*} = \operatorname{argmax}_{t \in [0, L_{\hat{z}_{i}}]} \mathcal{S}_{\theta}(\hat{z}_{i,<t} \mid x_{i}, y_{i}^{*}). \tag{6}$$

This yields a raw candidate: $z_{i}^{\text{raw}} = \hat{z}_{i,<t_{i}^{*}}$.

To avoid trivial fragments, we replace $z_{i}^{\text{raw}}$ with the empty string if $L_{z_{i}^{\text{raw}}} \leq L_{\min}$, indicating that the question requires no explicit reasoning.

▶ (iii) Refine MSC for coherence. Raw truncation may leave logical gaps. We use $\mathcal{M}_{\text{refine}}$ to polish each nonempty MSC with the following objectives: (1) naturally derive the answer, (2) eliminate redundancy, and (3) preserve stylistic consistency. This produces the final refined $z_{i}^{\text{MSC}}$.

The final dataset is formatted as:

$$\mathcal{D}_{\text{MSC}} = \left\{\left(x_{i}, \texttt{<think>}\, z_{i}^{\text{MSC}}\, \texttt{</think>}\, \hat{y}_{i}\right)\right\}_{i=1}^{N} \tag{7}$$

where $z_{i}^{\text{MSC}}$ can be empty for questions requiring no reasoning. The complete procedure is detailed in Algorithm [1](https://arxiv.org/html/2606.17687#alg1).
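Steps (ii) and the length filter of the construction procedure can be sketched for a single sample as follows. This is a simplified illustration under stated assumptions: `extract_raw_msc` is a hypothetical name, `scores[t]` is assumed to be the precomputed sufficiency of the first `t` sentences, and the refinement by $\mathcal{M}_{\text{refine}}$ in step (iii) is left out because it requires an LLM call.

```python
def extract_raw_msc(sentences, scores, delta, l_min=5):
    """Select the raw MSC prefix for one trajectory.

    sentences: the full CoT split into sentences.
    scores:    len(sentences) + 1 values; scores[t] is the sufficiency of the
               first t sentences (an indexing convention assumed here).
    delta:     the problem-adaptive threshold for this sample.
    l_min:     minimum length in sentences; shorter prefixes become empty.
    """
    if len(scores) != len(sentences) + 1:
        raise ValueError("expected one score per prefix, including the empty one")

    # Eq. 5: earliest prefix whose sufficiency reaches the threshold.
    t_star = next((t for t, s in enumerate(scores) if s >= delta), None)
    if t_star is None:
        # Eq. 6: fall back to the most sufficient prefix (earliest on ties).
        t_star = max(range(len(scores)), key=scores.__getitem__)

    raw = sentences[:t_star]
    # Trivial fragments are dropped: the question is treated as needing no reasoning.
    return [] if len(raw) <= l_min else raw


toy_sentences = [f"step {i}" for i in range(1, 9)]
toy_scores = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.75, 0.70, 0.60]

reached = extract_raw_msc(toy_sentences, toy_scores, delta=0.7)
fallback = extract_raw_msc(toy_sentences, toy_scores, delta=0.95)
trivial = extract_raw_msc(toy_sentences, toy_scores, delta=0.3)
```

In the toy data the threshold of 0.7 and the fallback both select the six-sentence prefix, while the low threshold of 0.3 yields a two-sentence prefix that the $L_{\min}$ filter replaces with an empty reasoning trace.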

#### Supervised Fine\-Tuning\.

We fine-tune the base model by minimizing the negative log-likelihood over $\mathcal{D}_{\text{MSC}}$:

$$\mathcal{L}_{\text{MFT}}(\theta) = -\mathbb{E}_{(x_{i}, z_{i}^{\text{MSC}}, \hat{y}_{i}) \sim \mathcal{D}_{\text{MSC}}} \Big[ \log \pi_{\theta}(z_{i}^{\text{MSC}} \mid x_{i}) + \log \pi_{\theta}(\hat{y}_{i} \mid x_{i}, z_{i}^{\text{MSC}}) \Big] \tag{8}$$

### 3.4 Stage II: Sufficiency-Aware Policy Optimization

In the second stage, named Sufficiency-Aware Policy Optimization (SAPO), we train the model to allocate reasoning steps during inference through RL with a dynamic complexity pool and sufficiency-aware rewards. We build upon Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.17687#bib.bib33)), which samples multiple trajectories per question to enable robust group-wise advantage estimation.
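For readers unfamiliar with GRPO, the group-wise advantage it relies on is commonly computed by standardizing the rewards of the rollouts sampled for one question. The sketch below shows that standard form only; the section above does not restate the formula, so the function name, the use of the population standard deviation, and the stabilizing constant `eps` are assumptions rather than details of SAPO.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-6):
    """Standardize rewards within one group of rollouts for the same question.

    Each rollout's advantage is its reward minus the group mean, divided by
    the group standard deviation. eps (an assumed constant) avoids division
    by zero when every rollout in the group receives the same reward.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical rewards for four rollouts of one question.
mixed_group = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
uniform_group = group_relative_advantages([0.5, 0.5, 0.5, 0.5])
```

A group in which all rollouts score the same yields zero advantages, so such a question contributes no gradient signal in that step.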

#### Dynamic Complexity Pool\.

A critical challenge in integrating MSC into online RL is that the reasoning length distribution shifts as the policy evolves. The offline complexity estimates from the MFT stage therefore become obsolete, and recomputing them over the entire dataset after each gradient step is computationally prohibitive.

Instead, we maintain an online dynamic complexity pool $\mathcal{P} = \{\|z_{i}^{\text{avg}}\|\}_{i=1}^{N}$ that tracks the evolving reasoning length for each question $x_{i}$. The pool is initialized from $\pi_{\text{MFT}}$ on the RL training data, i.e., $\|z_{i}^{\text{avg}}\| \leftarrow \mathbb{E}_{z \sim \pi_{\text{MFT}}(\cdot \mid x_{i})}[\|z\|]$. For each training batch, we update the pool via exponential moving average (EMA):

$$\|z_{i}^{\text{avg}}\| \leftarrow (1 - \eta) \cdot \|z_{i}^{\text{avg}}\| + \eta \cdot \frac{1}{K} \sum_{k=1}^{K} \|z_{i}^{(k)}\|, \tag{9}$$

where $\eta \in [0, 1]$ controls the update rate, and $\{z_{i}^{(k)}\}_{k=1}^{K}$ are the $K$ rollout trajectories for $x_{i}$ in the current batch.

From $\mathcal{P}$, we recompute complexity scores $\mathcal{C}(x_{i})$ and thresholds $\delta(x_{i})$ via Eq. [4](https://arxiv.org/html/2606.17687#S3.E4) and Eq. [3](https://arxiv.org/html/2606.17687#S3.E3). This mechanism ensures sufficiency targets aligned with the policy's current behavior, providing stable reward signals at negligible extra cost.
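The pool update of Eq. 9 followed by the threshold recomputation can be sketched as below. The dictionary-based pool, the question identifiers, and the function names are hypothetical; the defaults $\eta = 0.1$, $\delta_{0} = 0.5$, and $\alpha = 0.4$ are the values reported in Section 4.1.

```python
from bisect import bisect_right


def ema_update(pool, rollout_lengths, eta=0.1):
    """Apply the EMA update of Eq. 9 to every question seen in the batch.

    pool maps a question id to its running average reasoning length;
    rollout_lengths maps a question id to the token lengths of its K rollouts.
    Questions absent from the batch keep their previous value.
    """
    for qid, lengths in rollout_lengths.items():
        batch_mean = sum(lengths) / len(lengths)  # mean over the K rollouts
        pool[qid] = (1.0 - eta) * pool[qid] + eta * batch_mean
    return pool


def thresholds_from_pool(pool, delta0=0.5, alpha=0.4):
    """Recompute percentile complexity (Eq. 4) and thresholds (Eq. 3) from the pool."""
    ordered = sorted(pool.values())
    n = len(ordered)
    return {
        qid: delta0 + alpha * bisect_right(ordered, length) / n
        for qid, length in pool.items()
    }


pool = {"q1": 1000.0, "q2": 200.0}    # hypothetical averages from the MFT model
ema_update(pool, {"q1": [600, 800]})  # two rollouts for q1 in this batch
thresholds = thresholds_from_pool(pool)
```

Only the entries touched by the batch change, and the percentile ranks are recomputed from a sorted copy of the pool, which is consistent with the claim that the mechanism avoids re-running the policy over the whole dataset.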

#### Sufficiency\-Aware Reward Shaping\.

The total reward is defined as:

$$\mathcal{R}(z, y \mid x, y^{*}) = \mathcal{R}_{\text{cor}}(y) + \mathcal{R}_{\text{format}}(z, y) + \beta \cdot \mathcal{R}_{\text{suff}}(z \mid x, y^{*}) \tag{10}$$

where $\mathcal{R}_{\text{cor}}$ rewards correct answers, and $\mathcal{R}_{\text{format}}$ ensures proper use of `<think>...</think>` delimiters.

The sufficiency reward $\mathcal{R}_{\text{suff}}$ uses the current adaptive threshold $\delta(x)$ from the dynamic pool. For each trajectory $z$, we identify the earliest sufficient prefix $z_{<t^{*}}$ using Eq. [5](https://arxiv.org/html/2606.17687#S3.E5). If no prefix satisfies the threshold, we set $t^{*} = \infty$. The reward penalizes both over-thinking and under-thinking:

$$\mathcal{R}_{\text{suff}}(x, z, y) = \underbrace{-\lambda_{\text{over}} \cdot \mathbb{1}\left[L_{z} > t^{*} + \epsilon\right]}_{\text{over-thinking}} - \underbrace{\mathbb{1}\left[y \neq y^{*}\right] \cdot \lambda_{\text{under}} \cdot \mathbb{1}\left[L_{z} < t^{*}\right]}_{\text{under-thinking}} \tag{11}$$

The tolerance $\epsilon$ allows minor deviations beyond sufficiency, and the under-thinking penalty applies only to incorrect generations.
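Eq. 11 reduces to two indicator checks per rollout, as the following sketch shows. The function name and the toy inputs are hypothetical; the defaults $\lambda_{\text{over}} = \lambda_{\text{under}} = 0.5$ and $\epsilon = 2$ sentences are the values reported in Section 4.1, and `math.inf` stands in for $t^{*} = \infty$.

```python
import math


def sufficiency_reward(num_sentences, t_star, is_correct,
                       lam_over=0.5, lam_under=0.5, eps=2):
    """Sufficiency reward of Eq. 11 for a single rollout.

    num_sentences: L_z, the number of sentences in the generated reasoning.
    t_star:        index of the earliest sufficient prefix, or math.inf if
                   no prefix reaches the adaptive threshold.
    is_correct:    whether the final answer matches the ground truth.
    """
    # Over-thinking: reasoning continues more than eps sentences past t_star.
    over = lam_over if num_sentences > t_star + eps else 0.0
    # Under-thinking: wrong answer and reasoning stopped before sufficiency.
    under = lam_under if (not is_correct and num_sentences < t_star) else 0.0
    return 0.0 - over - under


overthinking = sufficiency_reward(12, 6, True)                # far past t_star
within_tolerance = sufficiency_reward(7, 6, True)             # inside the eps margin
underthinking = sufficiency_reward(4, math.inf, False)        # never sufficient, wrong
short_but_correct = sufficiency_reward(4, math.inf, True)     # never sufficient, right
```

Under the indexing assumed here, a finite $t^{*}$ is a prefix position of the same rollout and so cannot exceed $L_{z}$; the under-thinking term therefore appears to fire only when no prefix is sufficient and the answer is wrong.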

## 4 Experiments

### 4.1 Experimental Settings

#### Training datasets\.

We train SuCo on reasoning datasets spanning mathematics, code, and science. The data are drawn from five sources: Llama-Nemotron Post-Training Dataset (Bercovich et al., [2025](https://arxiv.org/html/2606.17687#bib.bib5)), Mixture-of-Thoughts (Hugging Face, [2025](https://arxiv.org/html/2606.17687#bib.bib18)), OpenR1-Math-220k (Lozhkov et al., [2025](https://arxiv.org/html/2606.17687#bib.bib28)), OpenCodeReasoning (Ahmad et al., [2025](https://arxiv.org/html/2606.17687#bib.bib3)), and s1K-1.1 (Muennighoff et al., [2025](https://arxiv.org/html/2606.17687#bib.bib29)). These datasets contain reasoning chains distilled from state-of-the-art LRMs. After filtering and deduplication, we construct the corresponding MSC for each sample following Algorithm [1](https://arxiv.org/html/2606.17687#alg1). We further remove low-quality MSC samples using LLM-based quality assessment. Both MSC refinement and quality assessment are performed with Qwen3-Next-80B-A3B-Instruct (Qwen Team, [2025](https://arxiv.org/html/2606.17687#bib.bib31)). This process yields 270,011 high-quality training samples. Detailed construction procedures and statistics are provided in Appendix [B](https://arxiv.org/html/2606.17687#A2). Figure [3](https://arxiv.org/html/2606.17687#S4.F3) compares the token length distributions of full CoT and MSC across the training corpus. We use the full MSC dataset for Stage I, and sample a subset of the data for RL in Stage II.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/dataset_token.png)

Figure 3: Token length distribution comparison between full CoT and MSC across training datasets.

#### Implementation details\.

All training runs are performed on 8× NVIDIA H100 80GB GPUs.

MFT Stage. We set the base threshold $\delta_{0} = 0.5$ and the sensitivity coefficient $\alpha = 0.4$, resulting in problem-adaptive thresholds $\delta(x) \in [0.5, 0.9]$. The minimum reasoning length is fixed to $L_{\min} = 5$ sentences to filter trivial fragments. We train for 3 epochs with a learning rate of $1 \times 10^{-4}$.

SAPO Stage. The dynamic complexity pool is initialized using predictions from the MFT model and updated during training with an EMA rate $\eta = 0.1$. For each training instance, we sample $K = 8$ rollout trajectories. The sufficiency reward weight is set to $\beta = 1.0$, with over- and under-thinking penalties $\lambda_{\text{over}} = \lambda_{\text{under}} = 0.5$ and a tolerance margin $\epsilon = 2$ sentences. We train using Group Relative Policy Optimization (GRPO) (Shao et al., [2024](https://arxiv.org/html/2606.17687#bib.bib33)) with a learning rate of $1 \times 10^{-6}$, a batch size of 128, and a micro batch size of 8.

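Table 1: Main results on mathematics (GSM8K, MATH-500, AMC23, AIME25), code (MBPP, LiveCodeBench-V6), and science (MMLU-STEM, GPQA-Diamond) benchmarks. Best results in each section are bolded, second best are underlined. Columns are grouped as Math (GSM8K, MATH500, AMC23, AIME25), Code (MBPP, Live-V6), and Science (MMLU-S, GPQA-D), followed by the average.

(I) Reasoning Correctness Evaluation: Response Accuracy (%) ↑

| Methods | GSM8K | MATH500 | AMC23 | AIME25 | MBPP | Live-V6 | MMLU-S | GPQA-D | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *Qwen2.5-1.5B* | | | | | | | | | |
| Math-Base | 40.1 | 22.6 | 23.9 | 3.3 | 4.0 | 0.6 | 14.5 | 4.0 | 14.1 |
| Math-Instruct | 79.0 | 72.4 | 43.5 | 6.7 | 6.1 | 2.3 | 30.2 | 22.2 | 32.8 |
| DeepSeek-R1-Distill | 80.3 | 80.6 | 56.5 | 26.7 | 41.0 | 17.1 | 33.8 | 25.3 | 45.2 |
| AdaCoT | 82.7 | 83.2 | 62.5 | 27.3 | 44.3 | 17.7 | 33.9 | 26.3 | 47.2 |
| AdaptThink | 83.2 | 83.8 | 65.8 | 28.3 | 42.1 | 18.3 | 34.4 | 26.8 | 47.8 |
| S-GRPO | 83.4 | 84.2 | 69.5 | 31.0 | 42.9 | 19.4 | 34.8 | 28.3 | 49.2 |
| LHRMs | 85.7 | 84.4 | 70.0 | 34.0 | 43.1 | 20.6 | 35.7 | 30.3 | 50.5 |
| SuCo (Ours) | 87.7 | 86.8 | 73.8 | 33.7 | 48.5 | 22.3 | 38.6 | 33.3 | 53.1 |
| *Qwen2.5-7B* | | | | | | | | | |
| Math-Base | 61.8 | 54.2 | 36.2 | 7.0 | 13.1 | 1.1 | 26.4 | 13.1 | 26.6 |
| Math-Instruct | 87.0 | 72.4 | 53.5 | 12.3 | 26.4 | 8.0 | 51.3 | 32.3 | 42.9 |
| DeepSeek-R1-Distill | 89.3 | 89.0 | 75.5 | 49.7 | 57.6 | 31.4 | 67.6 | 45.5 | 63.2 |
| AdaCoT | 91.4 | 91.8 | 81.5 | 55.0 | 61.7 | 32.0 | 69.0 | 47.0 | 66.2 |
| AdaptThink | 92.9 | 92.8 | 82.0 | 54.3 | 62.0 | 32.6 | 68.8 | 47.0 | 66.6 |
| S-GRPO | 92.8 | 92.2 | 90.5 | 58.3 | 62.3 | 35.4 | 72.2 | 51.5 | 69.4 |
| LHRMs | 92.4 | 93.0 | 87.3 | 57.7 | 62.4 | 35.4 | 71.4 | 49.5 | 68.6 |
| SuCo (Ours) | 93.9 | 93.6 | 90.3 | 61.7 | 65.7 | 38.9 | 75.8 | 56.6 | 72.1 |

(II) Reasoning Efficiency Evaluation: Response Length (Tokens) ↓

| Methods | GSM8K | MATH500 | AMC23 | AIME25 | MBPP | Live-V6 | MMLU-S | GPQA-D | Avg. |
|---|---|---|---|---|---|---|---|---|---|
| *Qwen2.5-1.5B* | | | | | | | | | |
| DeepSeek-R1-Distill | 501 | 4,260 | 6,768 | 11,239 | 3,511 | 11,073 | 1,956 | 6,582 | 5,736 |
| AdaCoT | 443 | 1,479 | 2,936 | 6,271 | 1,720 | 6,455 | 1,029 | 3,279 | 2,952 |
| AdaptThink | 337 | 1,564 | 2,740 | 6,513 | 1,422 | 6,689 | 995 | 2,914 | 2,897 |
| S-GRPO | 297 | 1,377 | 3,081 | 6,640 | 1,481 | 5,293 | 812 | 2,828 | 2,726 |
| LHRMs | 242 | 1,252 | 2,477 | 3,257 | 1,158 | 4,716 | 955 | 2,381 | 2,055 |
| SuCo (Ours) | 304 | 538 | 1,687 | 3,484 | 930 | 2,629 | 745 | 1,550 | 1,483 |
| *Qwen2.5-7B* | | | | | | | | | |
| DeepSeek-R1-Distill | 465 | 3,126 | 5,466 | 10,833 | 3,182 | 10,407 | 1,572 | 6,858 | 5,239 |
| AdaCoT | 422 | 1,387 | 3,115 | 7,648 | 1,423 | 8,153 | 1,214 | 3,993 | 3,419 |
| AdaptThink | 247 | 1,424 | 2,834 | 7,842 | 1,242 | 8,911 | 1,023 | 3,677 | 3,400 |
| S-GRPO | 267 | 988 | 2,247 | 5,259 | 1,356 | 6,534 | 717 | 2,453 | 2,478 |
| LHRMs | 274 | 658 | 1,525 | 4,217 | 1,487 | 4,294 | 628 | 2,042 | 1,891 |
| SuCo (Ours) | 243 | 429 | 935 | 2,679 | 1,149 | 2,809 | 505 | 1,389 | 1,267 |
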
#### Benchmarks and Metrics\.

We conduct comprehensive evaluations across mathematics, code, and science domains, covering a broad range of problem difficulties. For **mathematics**, we evaluate on GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2606.17687#bib.bib7)), MATH-500 (Lightman et al., [2024](https://arxiv.org/html/2606.17687#bib.bib26)), AMC 2023, and AIME 2025. Due to the limited size of AMC 2023 (40 problems) and AIME 2025 (30 problems), each evaluation is repeated 10 times and results are averaged to reduce variance and improve statistical reliability. For **code**, we use MBPP (Austin et al., [2021](https://arxiv.org/html/2606.17687#bib.bib4)) and LiveCodeBench v6 (Jain et al., [2025](https://arxiv.org/html/2606.17687#bib.bib20)). For **science**, we test on MMLU-STEM (Hendrycks et al., [2021a](https://arxiv.org/html/2606.17687#bib.bib14)) and GPQA-Diamond (Rein et al., [2024](https://arxiv.org/html/2606.17687#bib.bib32)). Across all benchmarks, we report both accuracy and response length.

#### Baselines\.

We implement SuCo on Qwen2.5-Math-1.5B/7B-Base (Yang et al., [2024](https://arxiv.org/html/2606.17687#bib.bib42)) and compare against the following baselines at matched model scales.

**Standard Models.** We evaluate Qwen2.5-Math-Base and Qwen2.5-Math-Instruct, along with DeepSeek-R1-Distill-Qwen (Guo et al., [2025](https://arxiv.org/html/2606.17687#bib.bib12)) as the full CoT reasoning baseline.

**Adaptive Large Reasoning Models (ALRMs).** We compare with four representative ALRMs: (1) AdaCoT (Lou et al., [2025](https://arxiv.org/html/2606.17687#bib.bib27)) employs an external complexity assessor and PPO with Pareto optimization. (2) AdaptThink (Zhang et al., [2025b](https://arxiv.org/html/2606.17687#bib.bib44)) uses constrained RL for binary mode selection. (3) S-GRPO (Dai et al., [2025](https://arxiv.org/html/2606.17687#bib.bib8)) samples multiple early-exit positions with decaying rewards during RL training. (4) LHRMs (Jiang et al., [2025](https://arxiv.org/html/2606.17687#bib.bib21)) performs hybrid fine-tuning on categorized data followed by group policy optimization. For fair comparison, AdaCoT and LHRMs are initialized from Qwen2.5-Math-Base and trained on the same source data as SuCo, while AdaptThink and S-GRPO follow their original implementations using DeepSeek-R1-Distill-Qwen as the base model.

Table 2: Ablation results of MFT components. The results evaluate the effectiveness of MFT against the base model and full CoT training. We further analyze the sensitivity of sufficiency thresholds, complexity estimation strategies, and the impact of MSC refinement.

| Component | Method | Math Acc ↑ | Math Tokens ↓ | Code Acc ↑ | Code Tokens ↓ | Science Acc ↑ | Science Tokens ↓ | Avg. Acc ↑ | Avg. Tokens ↓ |
|---|---|---|---|---|---|---|---|---|---|
| Overall | Base | 22.5 | - | 2.3 | - | 9.3 | - | 11.4 | - |
| | Full | 59.1 | 5,380 | 28.5 | 6,345 | 27.7 | 3,522 | 38.4 | 5,082 |
| | MFT | 69.1 | 1,359 | 33.2 | 1,706 | 34.8 | 966 | 45.7 | 1,344 |
| Threshold | $\delta = 0.9$ | 66.8 | 1,545 | 30.6 | 2,128 | 32.0 | 1,308 | 43.1 | 1,660 |
| | $\delta = 0.8$ | 66.1 | 1,420 | 30.1 | 1,853 | 31.0 | 1,072 | 42.4 | 1,448 |
| | $\delta = 0.7$ | 67.2 | 1,246 | 31.0 | 1,769 | 33.8 | 1,024 | 44.0 | 1,346 |
| | $\delta = 0.6$ | 60.9 | 1,072 | 26.6 | 1,645 | 27.5 | 910 | 38.3 | 1,209 |
| | $\delta = 0.5$ | 61.9 | 889 | 25.7 | 1,305 | 25.8 | 771 | 37.8 | 988 |
| Complexity | Min–Max | 61.7 | 1,130 | 26.1 | 1,506 | 27.9 | 843 | 38.6 | 1,160 |
| | Log-Scaled | 68.4 | 1,506 | 31.5 | 1,642 | 33.5 | 1,049 | 44.5 | 1,399 |
| Refinement | w/o refine | 65.9 | 2,220 | 31.8 | 2,741 | 32.7 | 1,607 | 43.5 | 2,189 |

### 4\.2 Main Results

#### Reasoning Correctness Evaluation\.

As shown in Table [1](https://arxiv.org/html/2606.17687#S4.T1) (I), SuCo achieves the highest or near-highest accuracy across all model scales and domains. At the 1.5B scale, SuCo attains an average accuracy of 53.1%, a relative improvement of 5.1% over the strongest adaptive baseline, LHRMs, and of 17.5% over DeepSeek-R1-Distill-Qwen. At the 7B scale, SuCo reaches 72.1% average accuracy, a relative improvement of 5.1% over LHRMs and of 14.1% over DeepSeek-R1-Distill-Qwen. SuCo shows particularly strong gains on challenging benchmarks. For example, on AIME25, SuCo attains 33.7% accuracy at the 1.5B scale and 61.7% at the 7B scale, corresponding to relative improvements of 26.2% and 24.1% over DeepSeek-R1-Distill-Qwen, respectively.

#### Reasoning Efficiency Evaluation\.

Table [1](https://arxiv.org/html/2606.17687#S4.T1) (II) reports reasoning efficiency measured by average response length. In addition to attaining higher accuracy, SuCo significantly reduces token consumption compared to DeepSeek-R1-Distill-Qwen. Averaged over all benchmarks, SuCo reduces response tokens by 74.1% at the 1.5B scale and by 75.8% at the 7B scale, yielding substantial inference cost savings. On AIME25 at the 7B scale, SuCo achieves 24.1% higher relative accuracy while using 75.3% fewer tokens. SuCo also has the lowest average response length among the adaptive reasoning methods at both scales, indicating that sufficiency-aware training eliminates redundant reasoning without sacrificing decision quality.
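The percentages quoted in this subsection are relative changes computed from the averages in Table 1. The short sketch below recomputes them; the helper names `relative_gain` and `token_reduction` are illustrative and not part of the paper.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, in percent."""
    return (new - old) / old * 100.0


def token_reduction(new: float, old: float) -> float:
    """Share of tokens saved by `new` relative to `old`, in percent."""
    return (old - new) / old * 100.0


# Average accuracy (%) and average response length (tokens) from Table 1.
accuracy = {
    "1.5B": {"SuCo": 53.1, "LHRMs": 50.5, "R1-Distill": 45.2},
    "7B": {"SuCo": 72.1, "LHRMs": 68.6, "R1-Distill": 63.2},
}
tokens = {
    "1.5B": {"SuCo": 1483, "R1-Distill": 5736},
    "7B": {"SuCo": 1267, "R1-Distill": 5239},
}

for scale in ("1.5B", "7B"):
    acc, tok = accuracy[scale], tokens[scale]
    print(
        scale,
        f"vs LHRMs: +{relative_gain(acc['SuCo'], acc['LHRMs']):.1f}%",
        f"vs R1-Distill: +{relative_gain(acc['SuCo'], acc['R1-Distill']):.1f}%",
        f"tokens saved: {token_reduction(tok['SuCo'], tok['R1-Distill']):.1f}%",
    )
```

Note that these are relative figures: the absolute accuracy gap to DeepSeek-R1-Distill-Qwen at the 1.5B scale is 7.9 points (53.1 vs. 45.2).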

### 4\.3 Ablation Study

We analyze the contribution of each component in SuCo on Qwen2.5-Math-1.5B. Additional ablation studies on hyperparameters ($L_{\min}$, $\epsilon$, $\eta$) are provided in Appendix [A](https://arxiv.org/html/2606.17687#A1).

#### MFT Ablations\.

Results are summarized in Table [2](https://arxiv.org/html/2606.17687#S4.T2), with training CoT length distributions illustrated in Figure [4](https://arxiv.org/html/2606.17687#S4.F4).

▶ **Overall Effectiveness.** While full CoT training improves the base model from 11.4% to 38.4% accuracy, it generates verbose reasoning. In contrast, MFT achieves higher accuracy at 45.7% while consuming only 26.4% of full CoT's reasoning overhead. This confirms that MSC is not merely compressed reasoning but a more effective form that filters noise and streamlines logical flow, enabling better performance with significantly reduced computational cost.

![Refer to caption](https://arxiv.org/html/2606.17687v1/x1.png)

Figure 4: Distribution of reasoning lengths in training data constructed by different MSC variants.

▶ **Problem-Adaptive Threshold.** Static thresholds $\delta \in [0.5, 0.9]$ exhibit clear accuracy-efficiency trade-offs. High thresholds retain excessive reasoning, while low thresholds sacrifice critical reasoning steps, leading to noticeable performance degradation. Among static settings, $\delta = 0.7$ achieves the best balance (44.0% average accuracy with 1,346 tokens). Nevertheless, problem-adaptive thresholds naturally align with problem demands, surpassing all static configurations at comparable token usage (45.7% with 1,344 tokens).

▶ **Percentile-Based Complexity Estimation.** We compare against two alternatives: Min-Max estimation

$$\mathcal{C}(x_i) = \frac{\|z_i\| - \min_j \|z_j\|}{\max_j \|z_j\| - \min_j \|z_j\|}$$

and Log-Scaled normalization

$$\mathcal{C}(x_i) = \frac{\log(1 + \|z_i\|) - \log(1 + \min_j \|z_j\|)}{\log(1 + \max_j \|z_j\|) - \log(1 + \min_j \|z_j\|)}.$$

Min-Max estimation is highly sensitive to outliers: a single extremely long trajectory compresses all other samples into a narrow range, resulting in poor complexity discrimination. Log-Scaled normalization partially alleviates this issue but still results in skewed scaling. In contrast, the percentile-based method produces a uniform complexity distribution, ensuring stable threshold scaling across diverse problems.
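The outlier sensitivity described above can be seen on a toy example. The sketch below implements the Min-Max and Log-Scaled formulas as stated, and a percentile-rank estimator whose exact form is an assumption (rank of each length among all lengths, scaled to $[0, 1]$); the paper's own percentile definition lies outside this excerpt. The trajectory lengths are made up for illustration.

```python
import math
from bisect import bisect_left


def min_max(lengths: list[float]) -> list[float]:
    """Min-Max estimation over trajectory lengths ||z_i||."""
    lo, hi = min(lengths), max(lengths)
    if hi == lo:
        return [0.0] * len(lengths)
    return [(n - lo) / (hi - lo) for n in lengths]


def log_scaled(lengths: list[float]) -> list[float]:
    """Log-Scaled normalization using log(1 + ||z_i||)."""
    lo, hi = math.log1p(min(lengths)), math.log1p(max(lengths))
    if hi == lo:
        return [0.0] * len(lengths)
    return [(math.log1p(n) - lo) / (hi - lo) for n in lengths]


def percentile_rank(lengths: list[float]) -> list[float]:
    """Assumed percentile estimator: share of samples strictly shorter."""
    ordered = sorted(lengths)
    denom = max(len(lengths) - 1, 1)
    return [bisect_left(ordered, n) / denom for n in lengths]


# Hypothetical lengths: four ordinary trajectories and one extreme outlier.
lengths = [100, 200, 300, 400, 10000]

for name, fn in (("min-max", min_max), ("log", log_scaled), ("percentile", percentile_rank)):
    print(f"{name:>10}:", [round(c, 3) for c in fn(lengths)])
```

With the outlier present, Min-Max squeezes the four ordinary samples below 0.05, the log variant spreads them somewhat but still leaves a large gap to the outlier, and the rank-based score spaces all five evenly.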

▶ **MSC Refinement.** Without refinement, directly truncated CoT trajectories often contain abrupt or incomplete logical transitions. The refinement process bridges these logical gaps while simultaneously eliminating redundancy, producing more coherent and concise reasoning chains. Consequently, refinement reduces reasoning length by 38.6% (2,189 to 1,344 tokens) and boosts accuracy by a relative 5.1% (43.5% to 45.7%). Prompts along with concrete examples are provided in Appendix [C](https://arxiv.org/html/2606.17687#A3).

Table 3: Ablation study of SAPO components: Dynamic Complexity Pool (DCP) and Sufficiency-Aware Reward Shaping ($R_{\text{suff}}$).

| Method | Accuracy (%) ↑ | Response Tokens ↓ |
|---|---|---|
| MFT | 51.5 | 1,347 |
| SAPO | 53.1 | 1,483 |
| w/o DCP | 52.9 | 1,642 |
| w/o $R_{\text{suff}}$ | 52.7 | 2,053 |

#### SAPO Ablations\.

We summarize ablation results in Table [3](https://arxiv.org/html/2606.17687#S4.T3) and visualize the per-benchmark accuracy-efficiency trade-offs in Figure [5](https://arxiv.org/html/2606.17687#S4.F5).

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/sapo_comparision.png)

Figure 5: Per-benchmark accuracy and response length comparison. SAPO adaptively reduces reasoning on easier benchmarks while allocating more resources to challenging ones.

▶ **Overall Effectiveness.** Although SAPO slightly increases the response length, it improves accuracy across all benchmarks. Crucially, this response increase does not reflect redundant reasoning. As shown in Figure [5](https://arxiv.org/html/2606.17687#S4.F5), on simple benchmarks where MFT already achieves high accuracy, SAPO successfully reduces reasoning. Conversely, on challenging benchmarks, SAPO allocates additional reasoning budget. This behavior indicates that SAPO learns to calibrate reasoning effort based on problem demands.

▶ **Dynamic Complexity Pool.** In this ablation (w/o DCP), the complexity pool is initialized using MFT predictions but remains fixed during RL training. Without online EMA updates, the estimated complexity gradually drifts away from the evolving policy. This misalignment results in stale thresholds that fail to provide accurate sufficiency targets.
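A minimal sketch of the EMA update helps illustrate why a frozen pool goes stale. It assumes the standard convention in which $\eta$ weights the new observation, which matches the ablation labels ($\eta = 0$ is the static pool, $\eta = 1$ a full update). The function name `ema_update` and the numbers are hypothetical.

```python
def ema_update(pool_value: float, observed: float, eta: float = 0.1) -> float:
    """Exponential moving average: keep (1 - eta) of the old estimate."""
    return (1.0 - eta) * pool_value + eta * observed


# Hypothetical scenario: the pool starts from an MFT-based estimate of 1200,
# while the evolving policy now consistently produces an estimate of 800.
initial, current_policy = 1200.0, 800.0

for eta in (0.0, 0.1, 1.0):
    value = initial
    for _ in range(20):  # 20 training updates on the same problem
        value = ema_update(value, current_policy, eta)
    print(f"eta={eta}: pool estimate after 20 updates = {value:.1f}")
```

With $\eta = 0$ the estimate never leaves its initial value, with $\eta = 1$ it simply copies the latest (possibly noisy) observation, and with $\eta = 0.1$ it moves toward the current policy gradually.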

▶ **Sufficiency-Aware Reward Shaping.** When the sufficiency reward is removed (w/o $R_{\text{suff}}$), SAPO degenerates to vanilla GRPO that optimizes only correctness and format, collapsing to verbose, full-CoT-style reasoning patterns. The sufficiency reward provides fine-grained feedback on both over-thinking and under-thinking, encouraging concise yet reliable reasoning.
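For intuition only, the sketch below shows one way a two-sided length penalty with a tolerance margin could be written. The exact definition of $R_{\text{suff}}$ is given in the method section, outside this excerpt, so the functional form here (a capped, length-normalized linear penalty) is an assumption. Only the hyperparameter values $\beta = 1.0$, $\lambda_{\text{over}} = \lambda_{\text{under}} = 0.5$, and $\epsilon = 2$ come from the text, and all function names are hypothetical.

```python
def sufficiency_penalty(
    length: int,
    target: int,
    eps: int = 2,
    lam_over: float = 0.5,
    lam_under: float = 0.5,
) -> float:
    """Assumed form: zero inside the margin, capped linear penalty outside.

    `length` and `target` are counted in sentences; `target` stands in for
    the sufficiency target derived from the complexity pool.
    """
    gap = length - target
    if abs(gap) <= eps:
        return 0.0
    excess = min(1.0, (abs(gap) - eps) / max(target, 1))
    lam = lam_over if gap > 0 else lam_under
    return -lam * excess


def total_reward(correct: bool, length: int, target: int, beta: float = 1.0) -> float:
    """Correctness reward plus the weighted sufficiency term (assumed additive)."""
    return float(correct) + beta * sufficiency_penalty(length, target)


# A correct answer is rewarded most when its length is near the target.
for n_sentences in (3, 10, 12, 20):
    print(n_sentences, round(total_reward(True, n_sentences, target=10), 3))
```

Under this assumed form, a correct but heavily over-long trajectory scores lower than a correct trajectory near the target, which is the kind of signal that would discourage collapse to full-CoT-style outputs.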

### 4\.4 Analysis

#### Difficulty\-conditioned reasoning length\.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/token_density_by_level_compare.png)

Figure 6: Response length distribution across MATH difficulty levels for SuCo-1.5B (top) and base LRM DeepSeek-R1-Distill-1.5B (bottom). SuCo continuously adapts reasoning effort to problem complexity with significantly higher efficiency.

We compare response length distributions across MATH (Hendrycks et al., [2021b](https://arxiv.org/html/2606.17687#bib.bib15)) difficulty levels between SuCo-1.5B and DeepSeek-R1-Distill-1.5B. As shown in Figure [6](https://arxiv.org/html/2606.17687#S4.F6), both models shift rightward as difficulty increases, but SuCo exhibits a much higher difficulty-sensitivity ratio: the Level 5/Level 1 mean token ratio is $\approx 5.5\times$ for SuCo versus $\approx 3.1\times$ for the base LRM, indicating more discriminative resource allocation.

Moreover, SuCo operates in a fundamentally more efficient regime\. On Level 1 problems, it uses 89% fewer tokens than the base LRM while maintaining accuracy\. The base LRM’s length variation reflects an inability to truncate unnecessary reasoning even for trivial queries, whereas SuCo’s variation reflects genuine difficulty\-conditioned allocation learned through sufficiency\-aware training\.

#### Out\-of\-Domain Generalization\.

To assess whether SuCo's adaptive reasoning capability generalizes beyond the training domains, we conduct out-of-domain (OOD) evaluations on StrategyQA (Geva et al., [2021](https://arxiv.org/html/2606.17687#bib.bib10)), CommonsenseQA (Talmor et al., [2019](https://arxiv.org/html/2606.17687#bib.bib36)), and AlpacaEval 2.0 (Li et al., [2023](https://arxiv.org/html/2606.17687#bib.bib25)). These tasks differ from the training distribution.

Table 4: Out-of-domain generalization results. SuCo demonstrates strong transfer of adaptive reasoning to unseen task types.

| Method | StrategyQA (ACC / Tok) | CSQA (ACC / Tok) | AlpacaEval (LC-WR / Tok) |
|---|---|---|---|
| DeepSeek-R1-Distill | 53.3 / 483 | 45.0 / 743 | 1.05 / 596 |
| Full CoT SFT | 22.6 / 742 | 19.4 / 1,061 | 0.3 / 743 |
| MFT | 28.0 / 213 | 26.6 / 342 | 0.67 / 314 |
| SuCo | 55.7 / 442 | 49.3 / 369 | 2.4 / 288 |

As shown in Table [4](https://arxiv.org/html/2606.17687#S4.T4), SuCo substantially outperforms all baselines on OOD tasks. Notably, while MFT alone overfits to training domain patterns and degrades on OOD tasks, the SAPO stage enables SuCo to learn a generalizable policy for calibrating reasoning effort.

#### Cross\-Model Robustness of MSC\.

To verify that MSC boundaries are robust across model families, we construct MSC data using different calibrator models \(Qwen3\-4B, Qwen3\-14B, DeepSeek\-R1\-Distill\-Qwen\-7B\) and train different target models\. As shown in Table[5](https://arxiv.org/html/2606.17687#S4.T5), MSC supervision from all calibrators consistently outperforms full\-CoT training across target models, confirming that the constructed datasets transfer well across model families\.

Table 5: Cross-model robustness. Different calibrator models produce MSC data that consistently improves over Full CoT SFT across target model families. Format: Accuracy (Tokens).

| Calibrator | Qwen2.5-1.5B | Llama-3.2-3B |
|---|---|---|
| Full CoT SFT | 37.4 (5,177) | 38.1 (5,084) |
| Qwen3-4B | 44.7 (1,394) | 44.2 (1,524) |
| Qwen3-14B | 44.2 (1,521) | 43.2 (1,821) |
| DS-R1-Distill-7B | 44.5 (1,491) | 43.7 (1,691) |

#### Empty CoT Analysis\.

![Refer to caption](https://arxiv.org/html/2606.17687v1/x2.png)

Figure 7: Empty CoT analysis of SuCo-1.5B and SuCo-7B across problem types and difficulties. Higher model capacity (7B vs. 1.5B) leads to increased empty CoT rates, while harder problems trigger more explicit reasoning.

SuCo learns to skip explicit reasoning on trivial problems and output the answer directly. Figure [7](https://arxiv.org/html/2606.17687#S4.F7) (a) reveals that empty CoT rates decrease monotonically with increasing difficulty, indicating that the model increasingly engages in explicit reasoning for harder problems. The 7B model consistently exhibits higher empty rates than the 1.5B model across all levels, reflecting its stronger capabilities that reduce reliance on intermediate reasoning steps.

Across domains (Figure [7](https://arxiv.org/html/2606.17687#S4.F7) (b)), empty CoT rates remain relatively stable at 30–37%, suggesting that the decision to omit explicit reasoning is largely task-agnostic. Math problems show a slightly higher proportion of empty responses, likely due to formula-based questions requiring minimal explicit derivation. Despite a substantial fraction of empty CoT outputs, SuCo maintains strong overall accuracy (Table [1](https://arxiv.org/html/2606.17687#S4.T1)), suggesting that explicit reasoning is not always necessary and that selectively omitting CoT can improve efficiency without sacrificing accuracy.

## 5 Conclusion

In this work, we formalize *Minimal Sufficient CoT* (MSC) as the shortest reasoning prefix adequate for correct answers, revealing that models can perform better with less reasoning. Building on this insight, we propose *Sufficiency-guided Continuous Adaptive Reasoning* (SuCo), a two-stage framework enabling continuous and autonomous reasoning adaptation. Through *MSC-Aligned Fine-Tuning* (MFT) and *Sufficiency-Aware Policy Optimization* (SAPO), SuCo learns to calibrate its reasoning effort according to problem demands without relying on discrete modes or external controllers. Extensive experiments across mathematics, code, and science benchmarks demonstrate that SuCo consistently achieves higher accuracy with significantly fewer reasoning tokens.

#### Limitations\.

We acknowledge several limitations\. First, MSC construction relies on ground\-truth answers to compute sufficiency scores, which limits direct application to open\-ended generation tasks\. However, once trained, the model internalizes adaptive reasoning as a general capability\. Second, the MFT stage depends on data distilled from strong LRMs\. While removing the 80B refinement model still yields results superior to all baselines, reducing this dependency remains desirable\.

#### Future Work\.

Extending sufficiency estimation to open-ended settings is a promising avenue. Extending SuCo to agentic tasks is another compelling direction, since in such settings over-thinking incurs redundant API costs and under-thinking leads to task failure.

## Impact Statement

This work aims to advance the field of machine learning by proposing a more efficient and adaptive training framework for reasoning models\. Our method focuses on technical efficiency improvements and does not alter the fundamental capabilities or safety properties of underlying models\. We do not foresee any ethical concerns or societal consequences beyond those commonly associated with research on large language models\.

## Acknowledgements

This work was supported in part by National Natural Science Foundation of China \(62476070\), Shenzhen Science and Technology Program \(JCYJ20241202123503005, GXWD20231128103232001, ZDSYS20230626091203008, KQTD20240729102154066\), Department of Science and Technology of Guangdong \(2024A1515011540\) and National Key R&D Program of China \(SQ2024YFE0200592\)\.

## References

- Adler et al\. \(2024\)Adler, B\., Agarwal, N\., Aithal, A\., Anh, D\. H\., Bhattacharya, P\., Brundyn, A\., Casper, J\., Catanzaro, B\., Clay, S\., Cohen, J\., et al\.Nemotron\-4 340b technical report\.*arXiv preprint arXiv:2406\.11704*, 2024\.
- Aggarwal & Welleck \(2025\)Aggarwal, P\. and Welleck, S\.L1: Controlling how long a reasoning model thinks with reinforcement learning\.In*Second Conference on Language Modeling*, 2025\.URL[https://openreview\.net/forum?id=4jdIxXBNve](https://openreview.net/forum?id=4jdIxXBNve)\.
- Ahmad et al\. \(2025\)Ahmad, W\. U\., Narenthiran, S\., Majumdar, S\., Ficek, A\., Jain, S\., Huang, J\., Noroozi, V\., and Ginsburg, B\.Opencodereasoning: Advancing data distillation for competitive coding\.*arXiv preprint arXiv:2504\.01943*, 2025\.
- Austin et al\. \(2021\)Austin, J\., Odena, A\., Nye, M\., Bosma, M\., Michalewski, H\., Dohan, D\., Jiang, E\., Cai, C\., Terry, M\., Le, Q\., et al\.Program synthesis with large language models\.*arXiv preprint arXiv:2108\.07732*, 2021\.
- Bercovich et al\. \(2025\)Bercovich, A\., Levy, I\., Golan, I\., Dabbah, M\., El\-Yaniv, R\., Puny, O\., Galil, I\., Moshe, Z\., Ronen, T\., Nabwani, N\., Shahaf, I\., Tropp, O\., Karpas, E\., Zilberstein, R\., Zeng, J\., Singhal, S\., Bukharin, A\., Zhang, Y\., Konuk, T\., Shen, G\., Mahabaleshwarkar, A\. S\., Kartal, B\., Suhara, Y\., Delalleau, O\., Chen, Z\., Wang, Z\., Mosallanezhad, D\., Renduchintala, A\., Qian, H\., Rekesh, D\., Jia, F\., Majumdar, S\., Noroozi, V\., Ahmad, W\. U\., Narenthiran, S\., Ficek, A\., Samadi, M\., Huang, J\., Jain, S\., Gitman, I\., Moshkov, I\., Du, W\., Toshniwal, S\., Armstrong, G\., Kisacanin, B\., Novikov, M\., Gitman, D\., Bakhturina, E\., Scowcroft, J\. P\., Kamalu, J\., Su, D\., Kong, K\., Kliegl, M\., Karimi, R\., Lin, Y\., Satheesh, S\., Parmar, J\., Gundecha, P\., Norick, B\., Jennings, J\., Prabhumoye, S\., Akter, S\. N\., Patwary, M\., Khattar, A\., Narayanan, D\., Waleffe, R\., Zhang, J\., Su, B\.\-Y\., Huang, G\., Kong, T\., Chadha, P\., Jain, S\., Harvey, C\., Segal, E\., Huang, J\., Kashirsky, S\., McQueen, R\., Putterman, I\., Lam, G\., Venkatesan, A\., Wu, S\., Nguyen, V\., Kilaru, M\., Wang, A\., Warno, A\., Somasamudramath, A\., Bhaskar, S\., Dong, M\., Assaf, N\., Mor, S\., Argov, O\. U\., Junkin, S\., Romanenko, O\., Larroy, P\., Katariya, M\., Rovinelli, M\., Balas, V\., Edelman, N\., Bhiwandiwalla, A\., Subramaniam, M\., Ithape, S\., Ramamoorthy, K\., Wu, Y\., Velury, S\. V\., Almog, O\., Daw, J\., Fridman, D\., Galinkin, E\., Evans, M\., Luna, K\., Derczynski, L\., Pope, N\., Long, E\., Schneider, S\., Siman, G\., Grzegorzek, T\., Ribalta, P\., Katariya, M\., Conway, J\., Saar, T\., Guan, A\., Pawelec, K\., Prayaga, S\., Kuchaiev, O\., Ginsburg, B\., Olabiyi, O\., Briski, K\., Cohen, J\., Catanzaro, B\., Alben, J\., Geifman, Y\., Chung, E\., and Alexiuk, C\.Llama\-nemotron: Efficient reasoning models, 2025\.URL[https://arxiv\.org/abs/2505\.00949](https://arxiv.org/abs/2505.00949)\.
- Brown et al\. \(2024\)Brown, B\., Juravsky, J\., Ehrlich, R\., Clark, R\., Le, Q\. V\., Ré, C\., and Mirhoseini, A\.Large language monkeys: Scaling inference compute with repeated sampling\.*arXiv preprint arXiv:2407\.21787*, 2024\.
- Cobbe et al\. \(2021\)Cobbe, K\., Kosaraju, V\., Bavarian, M\., Chen, M\., Jun, H\., Kaiser, L\., Plappert, M\., Tworek, J\., Hilton, J\., Nakano, R\., Hesse, C\., and Schulman, J\.Training verifiers to solve math word problems\.*arXiv preprint arXiv:2110\.14168*, 2021\.
- Dai et al\. \(2025\)Dai, M\., Yang, C\., and Si, Q\.S\-GRPO: Early exit via reinforcement learning in reasoning models\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=wNMK5o0Vfg](https://openreview.net/forum?id=wNMK5o0Vfg)\.
- Fan et al\. \(2026\)Fan, C\., Zhang, Y\., Jia, J\., Hero, A\. O\., and Liu, S\.Cyclicreflex: Improving reasoning models via cyclical reflection token scheduling\.In*The Fourteenth International Conference on Learning Representations*, 2026\.
- Geva et al\. \(2021\)Geva, M\., Khashabi, D\., Segal, E\., Khot, T\., Roth, D\., and Berant, J\.Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies\.*Transactions of the Association for Computational Linguistics \(TACL\)*, 2021\.
- Ghosal et al\. \(2026\)Ghosal, S\. S\., Chakraborty, S\., Reddy, A\., Lu, Y\., Wang, M\., Manocha, D\., Huang, F\., Ghavamzadeh, M\., and Bedi, A\. S\.Does thinking more always help? mirage of test\-time scaling in reasoning models\.*Advances in Neural Information Processing Systems*, 38:172664–172691, 2026\.
- Guo et al\. \(2025\)Guo, D\., Yang, D\., Zhang, H\., Song, J\., Zhang, R\., Xu, R\., Zhu, Q\., Ma, S\., Wang, P\., Bi, X\., et al\.Deepseek\-r1: Incentivizing reasoning capability in llms via reinforcement learning\.*arXiv preprint arXiv:2501\.12948*, 2025\.
- He et al\. \(2025\)He, Q\., Yuan, S\., Li, X\., Wang, M\., and Chen, J\.Thinkdial: An open recipe for controlling reasoning effort in large language models\.*arXiv preprint arXiv:2508\.18773*, 2025\.
- Hendrycks et al\. \(2021a\)Hendrycks, D\., Burns, C\., Basart, S\., Zou, A\., Mazeika, M\., Song, D\., and Steinhardt, J\.Measuring massive multitask language understanding\.*Proceedings of the International Conference on Learning Representations \(ICLR\)*, 2021a\.
- Hendrycks et al\. \(2021b\)Hendrycks, D\., Burns, C\., Kadavath, S\., Arora, A\., Basart, S\., Tang, E\., Song, D\., and Steinhardt, J\.Measuring mathematical problem solving with the math dataset\.*NeurIPS*, 2021b\.
- Hou et al\. \(2025a\)Hou, B\., Zhang, Y\., Ji, J\., Liu, Y\., Qian, K\., Andreas, J\., and Chang, S\.Thinkprune: Pruning long chain\-of\-thought of llms via reinforcement learning\.*arXiv preprint arXiv:2504\.01296*, 2025a\.
- Hou et al\. \(2025b\)Hou, Z\., Lv, X\., Lu, R\., Zhang, J\., Li, Y\., Yao, Z\., Li, J\., Tang, J\., and Dong, Y\.T1: Advancing language model reasoning through reinforcement learning and inference scaling\.In*Forty\-second International Conference on Machine Learning*, 2025b\.URL[https://openreview\.net/forum?id=tnxONP8zTE](https://openreview.net/forum?id=tnxONP8zTE)\.
- Hugging Face \(2025\)Hugging Face\.Open r1: A fully open reproduction of deepseek\-r1, January 2025\.URL[https://github\.com/huggingface/open\-r1](https://github.com/huggingface/open-r1)\.
- Jaech et al\. \(2024\)Jaech, A\., Kalai, A\., Lerer, A\., Richardson, A\., El\-Kishky, A\., Low, A\., Helyar, A\., Madry, A\., Beutel, A\., Carney, A\., et al\.Openai o1 system card\.*arXiv preprint arXiv:2412\.16720*, 2024\.
- Jain et al\. \(2025\)Jain, N\., Han, K\., Gu, A\., Li, W\.\-D\., Yan, F\., Zhang, T\., Wang, S\., Solar\-Lezama, A\., Sen, K\., and Stoica, I\.Livecodebench: Holistic and contamination free evaluation of large language models for code\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=chfJJYC3iL](https://openreview.net/forum?id=chfJJYC3iL)\.
- Jiang et al\. \(2025\)Jiang, L\., Wu, X\., Huang, S\., Dong, Q\., Chi, Z\., Dong, L\., Zhang, X\., Lv, T\., Cui, L\., and Wei, F\.Think only when you need with large hybrid\-reasoning models\.In*The Thirty\-ninth Annual Conference on Neural Information Processing Systems*, 2025\.URL[https://openreview\.net/forum?id=fDjDVE4qdj](https://openreview.net/forum?id=fDjDVE4qdj)\.
- Jimenez et al\. \(2024\)Jimenez, C\. E\., Yang, J\., Wettig, A\., Yao, S\., Pei, K\., Press, O\., and Narasimhan, K\. R\.SWE\-bench: Can language models resolve real\-world github issues?In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=VTF8yNQM66](https://openreview.net/forum?id=VTF8yNQM66)\.
- Kojima et al\. \(2022\)Kojima, T\., Gu, S\. S\., Reid, M\., Matsuo, Y\., and Iwasawa, Y\.Large language models are zero\-shot reasoners\.*Advances in neural information processing systems*, 35:22199–22213, 2022\.
- Lee et al\. \(2022\)Lee, K\., Ippolito, D\., Nystrom, A\., Zhang, C\., Eck, D\., Callison\-Burch, C\., and Carlini, N\.Deduplicating training data makes language models better\.In*Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 8424–8445, 2022\.
- Li et al\. \(2023\)Li, X\., Zhang, T\., Dubois, Y\., Taori, R\., Gulrajani, I\., Guestrin, C\., Liang, P\., and Hashimoto, T\. B\.Alpacaeval: An automatic evaluator of instruction\-following models\.[https://github\.com/tatsu\-lab/alpaca\_eval](https://github.com/tatsu-lab/alpaca_eval), 5 2023\.
- Lightman et al\. \(2024\)Lightman, H\., Kosaraju, V\., Burda, Y\., Edwards, H\., Baker, B\., Lee, T\., Leike, J\., Schulman, J\., Sutskever, I\., and Cobbe, K\.Let’s verify step by step\.In*The Twelfth International Conference on Learning Representations*, 2024\.URL[https://openreview\.net/forum?id=v8L0pN6EOi](https://openreview.net/forum?id=v8L0pN6EOi)\.
- Lou et al\. \(2025\)Lou, C\., Sun, Z\., Liang, X\., Qu, M\., Shen, W\., Wang, W\., Li, Y\., Yang, Q\., and Wu, S\.Adacot: Pareto\-optimal adaptive chain\-of\-thought triggering via reinforcement learning\.*arXiv preprint arXiv:2505\.11896*, 2025\.
- Lozhkov et al\. \(2025\)Lozhkov, A\., Kydlíček, H\., Allal, L\. B\., Penedo, G\., Beeching, E\., Gallouédec, Q\., Habib, N\., Tunstall, L\., and von Werra, L\.Openr1\-math\-220k\.[https://huggingface\.co/datasets/open\-r1/OpenR1\-Math\-220k](https://huggingface.co/datasets/open-r1/OpenR1-Math-220k), 2025\.
- Muennighoff et al\. \(2025\)Muennighoff, N\., Yang, Z\., Shi, W\., Li, X\. L\., Fei\-Fei, L\., Hajishirzi, H\., Zettlemoyer, L\., Liang, P\., Candès, E\., and Hashimoto, T\.s1: Simple test\-time scaling, 2025\.URL[https://arxiv\.org/abs/2501\.19393](https://arxiv.org/abs/2501.19393)\.
- OpenAI \(2025\)OpenAI\.gpt\-oss\-120b & gpt\-oss\-20b model card, 2025\.URL[https://arxiv\.org/abs/2508\.10925](https://arxiv.org/abs/2508.10925)\.
- Qwen Team \(2025\)Qwen Team\.Qwen3 technical report, 2025\.URL[https://arxiv\.org/abs/2505\.09388](https://arxiv.org/abs/2505.09388)\.
- Rein et al\. \(2024\)Rein, D\., Hou, B\. L\., Stickland, A\. C\., Petty, J\., Pang, R\. Y\., Dirani, J\., Michael, J\., and Bowman, S\. R\.GPQA: A graduate\-level google\-proof q&a benchmark\.In*First Conference on Language Modeling*, 2024\.URL[https://openreview\.net/forum?id=Ti67584b98](https://openreview.net/forum?id=Ti67584b98)\.
- Shao et al\. \(2024\)Shao, Z\., Wang, P\., Zhu, Q\., Xu, R\., Song, J\., Bi, X\., Zhang, H\., Zhang, M\., Li, Y\., et al\.Deepseekmath: Pushing the limits of mathematical reasoning in open language models\.*arXiv preprint arXiv:2402\.03300*, 2024\.
- Snell et al\. \(2025\)Snell, C\. V\., Lee, J\., Xu, K\., and Kumar, A\.Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.In*The Thirteenth International Conference on Learning Representations*, 2025\.URL[https://openreview\.net/forum?id=4FWAwZtd2n](https://openreview.net/forum?id=4FWAwZtd2n)\.
- Sui et al\. \(2025\)Sui, Y\., Chuang, Y\.\-N\., Wang, G\., Zhang, J\., Zhang, T\., Yuan, J\., Liu, H\., Wen, A\., Zhong, S\., Zou, N\., et al\.Stop overthinking: A survey on efficient reasoning for large language models\.*arXiv preprint arXiv:2503\.16419*, 2025\.
- Talmor et al\. \(2019\)Talmor, A\., Herzig, J\., Lourie, N\., and Berant, J\.Commonsenseqa: A question answering challenge targeting commonsense knowledge\.In*Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\)*, pp\. 4149–4158, 2019\.
- Wang et al\. \(2025a\)Wang, J\., Liu, R\., Zhang, L\., and Li, J\.System report for CCL25\-eval task 10: SRAG\-MAV for fine\-grained Chinese hate speech recognition\.In Lin, H\., Li, B\., and Tan, H\. \(eds\.\),*Proceedings of the 24th China National Conference on Computational Linguistics \(CCL 2025\)*, pp\. 395–402, Jinan, China, August 2025a\. Chinese Information Processing Society of China\.URL[https://aclanthology\.org/2025\.ccl\-2\.47/](https://aclanthology.org/2025.ccl-2.47/)\.
- Wang et al\. \(2025b\)Wang, Y\., Liu, Q\., Xu, J\., Liang, T\., Chen, X\., He, Z\., Song, L\., Yu, D\., Li, J\., Zhang, Z\., et al\.Thoughts are all over the place: On the underthinking of o1\-like llms\.*arXiv preprint arXiv:2501\.18585*, 2025b\.
- Wei et al\. \(2022\)Wei, J\., Wang, X\., Schuurmans, D\., Bosma, M\., Xia, F\., Chi, E\., Le, Q\. V\., Zhou, D\., et al\.Chain\-of\-thought prompting elicits reasoning in large language models\.*Advances in neural information processing systems*, 35:24824–24837, 2022\.
- Wu et al\. \(2025\)Wu, C\., Li, B\., Gao, M\., and Wang, Z\.From efficiency to adaptivity: A deeper look at adaptive reasoning in large language models\.*arXiv preprint arXiv:2511\.10788*, 2025\.
- Xu et al\. \(2025\)Xu, F\., Hao, Q\., Zong, Z\., Wang, J\., Zhang, Y\., Wang, J\., Lan, X\., Gong, J\., Ouyang, T\., Meng, F\., et al\.Towards large reasoning models: A survey of reinforced reasoning with large language models\.*arXiv preprint arXiv:2501\.09686*, 2025\.
- Yang et al\. \(2024\)Yang, A\., Zhang, B\., Hui, B\., Gao, B\., Yu, B\., Li, C\., Liu, D\., Tu, J\., Zhou, J\., Lin, J\., Lu, K\., Xue, M\., Lin, R\., Liu, T\., Ren, X\., and Zhang, Z\.Qwen2\.5\-math technical report: Toward mathematical expert model via self\-improvement\.*arXiv preprint arXiv:2409\.12122*, 2024\.
- Zhang et al\. \(2025a\)Zhang, J\., Dong, R\., Wang, H\., Ning, X\., Geng, H\., Li, P\., He, X\., Bai, Y\., Malik, J\., Gupta, S\., et al\.Alphaone: Reasoning models thinking slow and fast at test time\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, pp\. 11340–11365, 2025a\.
- Zhang et al\. \(2025b\)Zhang, J\., Lin, N\., Hou, L\., Feng, L\., and Li, J\.AdaptThink: Reasoning models can learn when to think\.In*Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing*, 2025b\.URL[https://aclanthology\.org/2025\.emnlp\-main\.184/](https://aclanthology.org/2025.emnlp-main.184/)\.
- Zhang et al\. \(2025c\)Zhang, L\., Wang, J\., Zhang, M\., Cao, G\., Shi, E\., Ma, Y\., Yu, J\., Liu, H\., Li, J\., and Zhang, M\.Speed up your code: Progressive code acceleration through bidirectional tree editing\.In Che, W\., Nabende, J\., Shutova, E\., and Pilehvar, M\. T\. \(eds\.\),*Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\)*, pp\. 28563–28576, Vienna, Austria, July 2025c\. Association for Computational Linguistics\.ISBN 979\-8\-89176\-251\-0\.doi:10\.18653/v1/2025\.acl\-long\.1387\.URL[https://aclanthology\.org/2025\.acl\-long\.1387/](https://aclanthology.org/2025.acl-long.1387/)\.
- Zhang et al\. \(2024\)Zhang, P\., Zeng, G\., Wang, T\., and Lu, W\.Tinyllama: An open\-source small language model\.*arXiv preprint arXiv:2401\.02385*, 2024\.
- Zhao et al\. \(2025\)Zhao, K\., Zhao, Y\., Song, J\., He, S\., Zhang, L\., Zhang, Q\., and Li, T\.Saber: Switchable and balanced training for efficient llm reasoning\.*arXiv preprint arXiv:2508\.10026*, 2025\.
- Zhao et al\. \(2023\)Zhao, W\. X\., Zhou, K\., Li, J\., Tang, T\., Wang, X\., Hou, Y\., Min, Y\., Zhang, B\., Zhang, J\., Dong, Z\., et al\.A survey of large language models\.*arXiv preprint arXiv:2303\.18223*, 1\(2\), 2023\.

## Appendix A Additional Ablation Studies

### A\.1 Minimum Reasoning Threshold $L_{\min}$

During MSC construction, if the raw MSC prefix contains at most $L_{\min}$ sentences, we set it to an empty string, indicating that the model should generate the answer directly, without intermediate reasoning steps\.
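The rule above can be sketched in a few lines of Python\. This is a minimal illustration, not the authors' implementation: the regex sentence splitter and the names `split_sentences` and `apply_min_threshold` are assumptions, since this section does not specify how sentences are segmented\.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Naive splitter on terminal punctuation (an assumption, not the paper's)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def apply_min_threshold(raw_msc: str, l_min: int = 5) -> str:
    """Return an empty MSC when the raw prefix has at most l_min sentences."""
    if len(split_sentences(raw_msc)) <= l_min:
        return ""  # empty MSC: the model should answer directly
    return raw_msc
```

With `l_min=0`, only an already empty prefix maps to the empty string, which corresponds to the no\-threshold setting in the ablation\.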

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/lmin_ablation.png)

Figure 8: Effect of the minimum reasoning length $L_{\min}$\.

Figure [8](https://arxiv.org/html/2606.17687#A1.F8)\(a\) shows the number of affected training samples at different thresholds\. As $L_{\min}$ increases from 1 to 10, the proportion of non\-thinking samples grows from 3\.0% to 9\.8%\.

As shown in Figure [8](https://arxiv.org/html/2606.17687#A1.F8)\(b\), without the threshold \($L_{\min}=0$\), trivial CoT fragments introduce noise, resulting in 51\.1% accuracy with 1,402 tokens\. Overly aggressive filtering \($L_{\min}=10$\) suppresses necessary reasoning, degrading accuracy to 50\.9%\. $L_{\min}=5$ achieves the best balance at 51\.5% accuracy with 1,347 tokens, indicating that filtering very short CoT fragments \(affecting 6\.7% of samples\) removes noise while preserving meaningful reasoning signals\.

### A\.2 EMA Rate $\eta$

Table 6: Effect of EMA rate $\eta$\.

| EMA Rate \($\eta$\) | Accuracy ↑ | Tokens ↓ |
| --- | --- | --- |
| 0\.0 \(w/o DCP\) | 52\.9 | 1,642 |
| 0\.1 | 53\.1 | 1,483 |
| 0\.3 | 52\.9 | 1,442 |
| 0\.5 | 52\.6 | 1,427 |
| 1\.0 \(Full Update\) | 52\.1 | 1,369 |

We analyze the impact of the EMA rate $\eta$ on the dynamic complexity pool update\. As shown in Table [6](https://arxiv.org/html/2606.17687#A1.T6), $\eta=0.1$ achieves the best accuracy\-efficiency balance at 53\.1% accuracy with 1,483 tokens\. A static pool \($\eta=0$\) retains more redundant reasoning \(1,642 tokens\) while achieving comparable accuracy \(52\.9%\)\. Overly aggressive updates \($\eta\geq 0.5$\) reduce token usage but degrade accuracy due to unstable threshold estimation\. This indicates that moderate EMA rates balance tracking policy evolution against maintaining stable training signals\.
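For intuition, a standard exponential moving average is sketched below\. This appendix does not reproduce the exact update rule, so the form $(1-\eta)\,c + \eta\,\hat{c}$, the function name `ema_update`, and the use of a scalar complexity estimate per problem are assumptions\. Under this form, $\eta=0$ leaves the stored value untouched \(static pool\) and $\eta=1$ overwrites it with the latest observation \(full update\), matching the two endpoints of the ablation\.

```python
def ema_update(pooled: float, observed: float, eta: float) -> float:
    """Blend a stored complexity estimate with a newly observed one.

    pooled   -- value currently stored in the (hypothetical) complexity pool
    observed -- estimate derived from the current policy's rollouts
    eta      -- EMA rate in [0, 1]
    """
    if not 0.0 <= eta <= 1.0:
        raise ValueError("eta must lie in [0, 1]")
    return (1.0 - eta) * pooled + eta * observed


# Hypothetical trajectory: the policy's sufficient length shrinks over training.
estimate = 20.0
for new_obs in (16.0, 14.0, 12.0):
    estimate = ema_update(estimate, new_obs, eta=0.1)
```

A small $\eta$ moves the estimate slowly toward recent observations, which is one way to read the stability argument in the paragraph above\.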

### A\.3 Over\-thinking Tolerance $\epsilon$

The tolerance parameter $\epsilon$ in Eq\. [11](https://arxiv.org/html/2606.17687#S3.E11) controls the strictness of over\-thinking penalties by allowing minor deviations beyond the minimal sufficient prefix\.

Table 7: Effect of over\-thinking tolerance $\epsilon$ on SAPO performance\. Results are averaged across all benchmarks on Qwen2\.5\-Math\-1\.5B\.

| Tolerance \($\epsilon$\) | Accuracy \(%\) ↑ | Tokens ↓ |
| --- | --- | --- |
| 0 \(Strict\) | 52\.4 | 1,391 |
| 1 | 52\.8 | 1,456 |
| 2 | 53\.1 | 1,483 |
| 3 | 53\.0 | 1,527 |
| 5 | 52\.7 | 1,658 |

As shown in Table [7](https://arxiv.org/html/2606.17687#A1.T7), setting $\epsilon=0$ applies strict penalties for any reasoning beyond the minimal sufficient prefix, resulting in overly aggressive truncation that reduces tokens to 1,391 but harms accuracy \(52\.4%\)\. This strict constraint prevents the model from following a natural reasoning flow and from exploring slightly longer but potentially more robust reasoning paths\.

With moderate tolerance \($\epsilon=2$\), the model achieves the best accuracy at 53\.1% while generating 1,483 tokens\. This tolerance allows the model to extend reasoning by 1\-2 sentences beyond the minimal sufficient point when beneficial, accommodating natural variations in reasoning style without sacrificing efficiency\.

As $\epsilon$ increases further \(3, 5\), accuracy plateaus or slightly declines while token usage grows substantially\. At $\epsilon=5$, the sufficiency constraint becomes too loose, allowing the model to generate verbose reasoning \(1,658 tokens\) that approaches the behavior without sufficiency\-aware rewards\. This indicates that $\epsilon=2$ provides an appropriate balance: it avoids overly rigid constraints that harm reasoning quality while maintaining effective control over redundant thinking\.
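The role of $\epsilon$ can be illustrated with a hinge\-style count of excess sentences\. Eq\. 11 is not reproduced in this appendix, so the function below is only a hypothetical stand\-in and not the paper's reward: `overthinking_excess`, its arguments, and the sentence\-level granularity are assumptions\.

```python
def overthinking_excess(num_sentences: int, msc_len: int, epsilon: int = 2) -> int:
    """Sentences generated beyond the tolerated budget msc_len + epsilon.

    Hypothetical helper: a reward could penalize this excess, while any
    response within the tolerance band incurs no over-thinking penalty.
    """
    return max(0, num_sentences - (msc_len + epsilon))


# A 12-sentence response to a problem whose minimal sufficient prefix has 10.
strict = overthinking_excess(12, 10, epsilon=0)    # penalized
tolerant = overthinking_excess(12, 10, epsilon=2)  # within tolerance
```

Under this reading, $\epsilon=0$ penalizes every sentence past the minimal sufficient prefix, while larger values widen the band in which extra reasoning is free\.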

### A\.4 Sufficiency Metric Ablation

We compare our geometric mean sufficiency formulation against alternative definitions to justify the design choice\. All variants use the same MSC construction pipeline with Qwen2\.5\-Math\-1\.5B as the target model\.

Table 8: Comparison of sufficiency metric formulations\. Geometric mean provides the best balance between accuracy and efficiency\.

| Sufficiency Metric | Train CoT Tokens | Accuracy \(%\) | Inference Tokens |
| --- | --- | --- | --- |
| Full CoT SFT \(no truncation\) | 3,781 | 38\.4 | 5,082 |
| Joint Probability | 2,149 | 40\.1 | 2,573 |
| Arithmetic Mean | 1,341 | 43\.1 | 1,578 |
| Geometric Mean \(Ours\) | 1,138 | 45\.7 | 1,344 |

Joint Probability \($\prod_{i}\pi_{\theta}(y^{*}_{i}\mid\cdot)$\) decays exponentially with answer length, causing the threshold to be satisfied too late for short\-answer problems and too early for long\-answer problems\. This results in inconsistent truncation quality\.

Arithmetic Mean \($\frac{1}{\|y^{*}\|}\sum_{i}\pi_{\theta}(y^{*}_{i}\mid\cdot)$\) is dominated by a few high\-confidence tokens, making it less sensitive to tokens that genuinely require reasoning support\.

Geometric Mean \(Eq\. [1](https://arxiv.org/html/2606.17687#S3.E1)\) normalizes joint probability into a per\-token average log\-probability, which is stable across varying answer lengths and equally sensitive to all answer tokens\. It achieves the highest accuracy with the most aggressive token reduction, confirming its effectiveness as a sufficiency signal\.
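To make the comparison concrete, the three aggregations can be computed from the per\-token probabilities of the reference answer\. The sketch below is illustrative: the function names and the example probabilities are assumptions, and inputs are assumed to be strictly positive so that the logarithm is defined\.

```python
import math


def joint_probability(probs: list[float]) -> float:
    """Product of per-token probabilities; shrinks as the answer gets longer."""
    return math.prod(probs)


def arithmetic_mean(probs: list[float]) -> float:
    """Plain average; a single low-probability token is easily masked."""
    return sum(probs) / len(probs)


def geometric_mean(probs: list[float]) -> float:
    """exp of the average log-probability, i.e. the length-normalized joint."""
    return math.exp(sum(math.log(p) for p in probs) / len(probs))


short_answer = [0.9] * 2       # hypothetical 2-token answer
long_answer = [0.9] * 20       # same per-token confidence, 20 tokens
one_weak_token = [0.99, 0.99, 0.99, 0.2]
```

On these toy inputs the joint probability of `long_answer` is far below that of `short_answer` even though per\-token confidence is identical, whereas the geometric mean is the same for both; on `one_weak_token` the geometric mean falls below the arithmetic mean, reflecting its greater sensitivity to the uncertain token\.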

### A\.5 Cross\-Domain vs\. Intra\-Domain Percentile

In our default setting, complexity percentiles are computed globally across all training domains\. However, different domains exhibit different baseline reasoning lengths\. For instance, code problems typically require longer traces than math problems\. This raises the question of whether a code problem might be assigned an artificially high complexity score simply because code traces are longer on average, rather than because the problem itself is harder\. To investigate this, we compare the default cross\-domain percentile with an intra\-domain variant that computes percentiles separately within each domain \(math, code, science\)\.

Table 9: Cross\-domain vs\. intra\-domain percentile estimation \(Qwen2\.5\-Math\-1\.5B MFT\)\. Acc is accuracy \(%\) and Tok is the average number of tokens\.

| Method | Math Acc | Math Tok | Code Acc | Code Tok | Science Acc | Science Tok | Avg\. Acc | Avg\. Tok |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Full CoT SFT | 59\.1 | 5,380 | 28\.5 | 6,345 | 27\.7 | 3,522 | 38\.4 | 5,082 |
| Cross\-domain | 69\.1 | 1,359 | 33\.2 | 1,706 | 34\.8 | 966 | 45\.7 | 1,344 |
| Intra\-domain | 69\.2 | 1,376 | 33\.2 | 1,692 | 35\.1 | 1,021 | 45\.8 | 1,363 |

Intra\-domain percentile yields nearly identical performance to the cross\-domain setting, indicating that the global percentile preserves monotonicity within each domain and remains a robust measure of reasoning difficulty\. This robustness arises because percentile ranks maintain relative ordering within domains, regardless of absolute length differences across domains\.
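The difference between the two estimators can be sketched as follows\. The empirical\-CDF definition of a percentile rank, the function names, and the toy trace lengths are assumptions made for illustration; the paper's exact percentile computation is not given in this section\.

```python
from bisect import bisect_right
from collections import defaultdict


def percentile_ranks(values: list[float]) -> list[float]:
    """Fraction of values that are less than or equal to each value."""
    ordered = sorted(values)
    return [bisect_right(ordered, v) / len(ordered) for v in values]


def intra_domain_ranks(values: list[float], domains: list[str]) -> list[float]:
    """Percentile ranks computed separately inside each domain."""
    groups = defaultdict(list)
    for idx, domain in enumerate(domains):
        groups[domain].append(idx)
    ranks = [0.0] * len(values)
    for idxs in groups.values():
        for i, r in zip(idxs, percentile_ranks([values[i] for i in idxs])):
            ranks[i] = r
    return ranks


# Hypothetical trace lengths: code traces are longer than math traces.
lengths = [100, 300, 500, 900, 1500, 2100]
domains = ["math", "math", "math", "code", "code", "code"]
cross = percentile_ranks(lengths)
intra = intra_domain_ranks(lengths, domains)
```

In this toy case the shortest code trace receives a higher rank under the cross\-domain estimate than under the intra\-domain one, which is the concern raised above, yet the ordering of problems inside each domain is the same under both estimates\.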

## Appendix B MSC Dataset Construction

### B\.1 Dataset Statistics

Table [10](https://arxiv.org/html/2606.17687#A2.T10) summarizes the statistics of the final dataset\. Across all samples, the full CoT traces average 3,781 tokens, while MSC reduces this to 1,138 tokens\. Notably, 109,882 samples \(40\.7%\) yield empty MSCs, indicating that the model can solve these problems without explicit reasoning\.

For Stage I, we train on all samples to learn minimal sufficient reasoning patterns\. For Stage II, we sample 50,000 instances for RL to balance training efficiency and diversity\.

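Table 10: Training dataset statistics\. We report the number of samples, average token counts for full CoT and MSC, and the number of samples with empty MSC \(requiring no explicit reasoning\)\.

| Domain | Source | Samples | Full CoT | Raw MSC | Refine MSC | Empty |
| --- | --- | --- | --- | --- | --- | --- |
| Math | Llama\-Nemotron | 39,377 | 2,924 | 1,238 | 1,067 | 10,683 |
| Math | Mixture\-of\-Thoughts | 51,089 | 5,414 | 2,195 | 1,424 | 14,684 |
| Math | OpenR1\-Math\-220k | 37,899 | 4,482 | 1,934 | 1,217 | 13,997 |
| Math | s1K\-1\.1 | 330 | 7,996 | 3,410 | 1,681 | 75 |
| Code | Llama\-Nemotron | 60,000 | 2,263 | 1,329 | 537 | 24,837 |
| Code | Mixture\-of\-Thoughts | 12,671 | 11,106 | 2,764 | 2,379 | 2,914 |
| Code | OpenCodeReasoning | 15,704 | 6,076 | 2,491 | 1,599 | 3,424 |
| Science | Mixture\-of\-Thoughts | 52,876 | 1,590 | 1,597 | 623 | 39,242 |
| Science | s1K\-1\.1 | 65 | 8,644 | 1,179 | 2,178 | 26 |
| Total | | 270,011 | – | – | – | 109,882 |
| Average | | – | 3,781 | 1,803 | 1,138 | – |

### B\.2 Data Sources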

We curate our training data from the following five publicly available reasoning datasets, all containing CoT trajectories distilled from advanced LRMs:

- Mixture\-of\-Thoughts \(Hugging Face, [2025](https://arxiv.org/html/2606.17687#bib.bib18)\): 350K samples \(93K math, 83K code, 173K science\) generated by DeepSeek\-R1 with correctness filtering on final answers\.
- OpenR1\-Math\-220k \(Lozhkov et al\., [2025](https://arxiv.org/html/2606.17687#bib.bib28)\): 220K math reasoning trajectories distilled from 800K DeepSeek\-R1 generated solutions\.
- Llama\-Nemotron Post\-Training Dataset \(Bercovich et al\., [2025](https://arxiv.org/html/2606.17687#bib.bib5)\): 3\.9M samples covering math, code, science, chat, and safety\. All samples include explicit reasoning trajectories produced by DeepSeek\-R1 and refined using Nemotron\-340B \(Adler et al\., [2024](https://arxiv.org/html/2606.17687#bib.bib1)\)\.
- OpenCodeReasoning \(Ahmad et al\., [2025](https://arxiv.org/html/2606.17687#bib.bib3)\): 735K Codeforces/LeetCode problems paired with CoT and executable Python solutions, including full test cases\.
- s1K\-1\.1 \(Muennighoff et al\., [2025](https://arxiv.org/html/2606.17687#bib.bib29)\): 1,000 carefully curated high\-difficulty examples selected for difficulty, diversity, and quality, with an accompanying budget\-constrained inference technique\.

### B\.3 Data Preprocessing

We apply a rigorous preprocessing pipeline to ensure data quality:

Filtering\. We remove samples with: \(1\) incorrect or missing answers, \(2\) incomplete reasoning traces, \(3\) overlap with our evaluation benchmarks, \(4\) embedded non\-textual elements \(e\.g\., images, URLs\), or \(5\) non\-English content\.

Deduplication\. We apply MinHash LSH \(Lee et al\., [2022](https://arxiv.org/html/2606.17687#bib.bib24)\) to remove near\-duplicate samples\.

Cleaning\. Questions are normalized by removing source identifiers and numbering to reduce stylistic noise\.
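The near\-duplicate detection step can be illustrated with a small MinHash sketch built on the Python standard library\. This is not the authors' pipeline: the word\-shingle size, the 64 seeded hash functions, and all function names are assumptions, and the LSH banding step that makes the search scale to large corpora is omitted\.

```python
import hashlib


def shingles(text: str, k: int = 3) -> set[str]:
    """Set of k-word shingles of a lowercased text."""
    words = text.lower().split()
    if len(words) < k:
        return {" ".join(words)}
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}


def _hash(item: str, seed: int) -> int:
    """64-bit hash of item; the seed (as salt) selects the hash function."""
    salt = seed.to_bytes(8, "little")
    digest = hashlib.blake2b(item.encode("utf-8"), digest_size=8, salt=salt).digest()
    return int.from_bytes(digest, "big")


def minhash_signature(items: set[str], num_perm: int = 64) -> list[int]:
    """Keep the minimum hash value under each of num_perm hash functions."""
    return [min(_hash(s, seed) for s in items) for seed in range(num_perm)]


def estimated_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

Two questions whose estimated similarity exceeds a chosen cutoff would be treated as near\-duplicates; the cutoff itself is not stated in this section\.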

### B\.4 MSC Construction

For each sample, we derive its MSC following Algorithm [1](https://arxiv.org/html/2606.17687#alg1)\. Additionally, we employ an LLM\-based evaluation to score each MSC along three dimensions: \(1\) correctness, \(2\) sufficiency and support for the final answer, and \(3\) fluency and logical coherence\. Low\-quality MSC samples are filtered out\. Both MSC refinement and quality assessment are performed with Qwen3\-Next\-80B\-A3B\-Instruct \(Qwen Team, [2025](https://arxiv.org/html/2606.17687#bib.bib31)\)\.

## Appendix C MSC Refinement

### C\.1 Refinement Prompt

Figure [9](https://arxiv.org/html/2606.17687#A3.F9) presents the complete prompt used for MSC refinement\. The prompt guides the model to polish the raw MSC prefix along three dimensions: Logical Completeness, Conciseness, and Stylistic Consistency\. The refinement process focuses on improving coherence and readability of the existing MSC without modifying its underlying reasoning content\.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/msc_refine_prompt.png)

Figure 9: Complete prompt for MSC refinement\. The prompt guides the refinement model to enhance logical completeness and conciseness while maintaining stylistic consistency with the original reasoning trajectory\.

### C\.2 Refinement Examples

Figures [10](https://arxiv.org/html/2606.17687#A3.F10) and [11](https://arxiv.org/html/2606.17687#A3.F11) illustrate concrete examples comparing raw MSC and refined MSC\.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/msc_refine_case1.png)

Figure 10: Refinement example demonstrating logical completion\. Raw MSC stops mid\-reasoning; refined MSC completes the derivation while preserving the original flow\.

![Refer to caption](https://arxiv.org/html/2606.17687v1/figures/msc_refine_case2.png)

Figure 11: Refinement example demonstrating reasoning optimization\. Raw MSC contains exploratory backtracking; refined MSC eliminates redundancy while maintaining the core logic\.
