When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models

arXiv cs.AI 07/01/26, 04:00 AM Papers
Summary
This paper introduces LearnStop, a lightweight checkpoint stopper for reasoning models that predicts prefix correctness from online features, and finds that learned stopping provides value over scalar rules only when many questions become correct early without a single reliable scalar signal.
arXiv:2606.30852v1 Announce Type: new Abstract: Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds. We study this question with LearnStop, a hidden-state-free checkpoint stopper for reasoning language models. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking-marker density. Across 18 task-model settings spanning GSM8K, MATH-500, MMLU-Pro, AIME-90, GPQA, Qwen3, and DeepSeek-R1 distillations, the answer is task-dependent. On free-form math, learned multi-feature stopping improves the fixed-budget frontier and often beats scalar exits: on GSM8K with Qwen3-32B, the empirical frontier reaches a post-hoc peak adapt gain of +0.157, validation-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is +0.028. On multiple-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure. We further provide validation-selected operating points, paired bootstrap tests, finite-grid lost-correct risk calibration, cost accounting under KV-fork, prefix-cache, and black-box regimes, H100 serving profiles, checkpoint-schedule sweeps, transfer analyses, and robustness checks. The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem.
# When Does Learning to Stop Help? A Cost-Aware Study of Early Exits in Reasoning Models
Source: [https://arxiv.org/html/2606.30852](https://arxiv.org/html/2606.30852)
Zhe Dong \(University of Maine at Presque Isle; zhe\.dong@maine\.edu, dongzhe181@gmail\.com\), Fang Qin \(Stanford University; fangq@stanford\.edu\), Manish Shah \(Independent Researcher; shahmh@ieee\.org\)

###### Abstract

Reasoning models spend different amounts of useful computation across instances, but it remains unclear when a learned stopping rule improves over simple confidence or convergence thresholds\. We study this question with *LearnStop*, a hidden\-state\-free checkpoint stopper for reasoning language models\. At fixed budget checkpoints, LearnStop probes a short answer from the current reasoning prefix and predicts prefix correctness from online features such as answer confidence, entropy, prefix vote share, answer stability, and backtracking\-marker density\. Across 18 task–model settings spanning GSM8K, MATH\-500, MMLU\-Pro, AIME\-90, GPQA, Qwen3, and DeepSeek\-R1 distillations, the answer is task\-dependent\. On free\-form math, learned multi\-feature stopping improves the fixed\-budget frontier and often beats scalar exits: on GSM8K with Qwen3\-32B, the empirical frontier reaches a post\-hoc peak adapt gain of \+0\.157, validation\-selected operating points preserve positive gains, and the paired gain over the strongest scalar baseline is \+0\.028\. On multiple\-choice and very hard settings, scalar confidence, entropy, or stability rules are competitive or stronger\. We therefore frame learned stopping not as a universal replacement for scalar exits, but as a tool whose value depends on trajectory structure\. We further provide validation\-selected operating points, paired bootstrap tests, finite\-grid lost\-correct risk calibration, cost accounting under KV\-fork, prefix\-cache, and black\-box regimes, H100 serving profiles, checkpoint\-schedule sweeps, transfer analyses, and robustness checks\. The main practical finding is that learned stopping is useful when many questions become correct before full budget but do not exhibit a single reliable scalar stopping signal; its benefits largely disappear when confidence or answer convergence already solves the stopping problem\.

## 1 Introduction

Large reasoning models improve by spending more test\-time compute, often generating long chains of intermediate reasoning before a final answer\[[4](https://arxiv.org/html/2606.30852#bib.bib2),[14](https://arxiv.org/html/2606.30852#bib.bib1)\]\. The same scaling creates a serving problem: a fixed token budget wastes computation on easy questions, while aggressive shortening can cut off hard questions before they self\-correct\. Recent systems therefore attempt to adapt the reasoning budget per instance using confidence, entropy, answer convergence, or calibrated risk constraints\[[21](https://arxiv.org/html/2606.30852#bib.bib12),[17](https://arxiv.org/html/2606.30852#bib.bib14),[5](https://arxiv.org/html/2606.30852#bib.bib13),[18](https://arxiv.org/html/2606.30852#bib.bib21)\]\.

This paper asks a narrower question: *when does learning to stop help beyond strong scalar stopping rules?* The question matters because a learned stopper is not free\. It requires training data, threshold selection, probe overhead, and deployment support for pausing and probing a generation\. It is only worthwhile if it captures information that a single confidence, entropy, or stability signal misses\.

We study this question through LearnStop, a lightweight logistic stopper over prefix\-observable features\. At each checkpoint, we force a short answer from the current reasoning prefix, compute output\-level confidence and trace features, and stop when the predicted prefix correctness exceeds a threshold\. The method deliberately avoids hidden states, making it easier to deploy across model families than hidden\-state probes, but it still requires an interactive decoder that can pause, branch a probe answer, and resume the original reasoning state\.

Our empirical conclusion is deliberately qualified\. Learned stopping helps most on free\-form math tasks, especially GSM8K and MATH\-500 with Qwen3 models\. It helps less on MMLU\-Pro and GPQA, where multiple\-choice confidence is already a strong stopping signal, and it is unstable on very hard AIME\-90 where many trajectories never become correct\. This task dependence is not a weakness to hide; it is the central finding\. A practical system should choose between a learned policy and simpler scalar exits based on the task’s trajectory structure and deployment cost model\.

We make four contributions\.

1. We provide a controlled empirical study of learned checkpoint stopping across 18 task–model settings, including validation\-selected operating points and paired bootstrap comparisons against confidence, entropy, confidence\-leap, and answer\-stability exits\.
2. We give an overhead\-aware deployment analysis: decode\-only KV\-fork accounting, prefix\-cache accounting, pure black\-box API accounting, H100 wall\-clock profiles, and checkpoint\-schedule sweeps\. These experiments show that probe overhead can erase savings unless prefix reuse is available\.
3. We calibrate thresholds with a finite\-grid upper confidence bound on *lost\-correct risk*, the probability that early stopping flips a full\-thinking correct answer to an incorrect one\. This risk upper\-bounds accuracy loss relative to full thinking\.
4. We analyze mechanisms and failure modes: trajectory decomposition, ablations, model\-class comparisons, prompt and temperature robustness, transfer protocols, and cross\-family results\. These show why learning helps on some math settings but not universally\.

## 2 Related Work

#### Budget control for reasoning models\.

Budget forcing and test\-time scaling studies show that allocating more reasoning tokens can improve performance, but that the right budget varies across instances\[[11](https://arxiv.org/html/2606.30852#bib.bib10),[6](https://arxiv.org/html/2606.30852#bib.bib11)\]\. Qwen3 exposes a thinking\-budget mechanism in the model family itself\[[14](https://arxiv.org/html/2606.30852#bib.bib1)\]\. These approaches control length directly, while our work studies online exit decisions conditioned on partial reasoning evidence\.

#### Training\-free early exit\.

DEER induces trial answers at transition points and exits when confidence is high\[[21](https://arxiv.org/html/2606.30852#bib.bib12)\]\. EAT monitors entropy after an appended stop\-thinking marker\[[17](https://arxiv.org/html/2606.30852#bib.bib14)\]\. Confidence Leaps detects sudden jumps in answer probability\[[16](https://arxiv.org/html/2606.30852#bib.bib15)\]\. Certaindex/Dynasor uses answer certainty and stabilization to allocate serving compute\[[5](https://arxiv.org/html/2606.30852#bib.bib13)\]\. PUMA detects semantic convergence of reasoning traces rather than relying only on answer\-level proxies\[[10](https://arxiv.org/html/2606.30852#bib.bib16)\]\. We implement faithful output\-level proxies for these signal families where possible and compare them under a common checkpoint protocol\.

#### Learned stopping and risk control\.

Thought Calibration trains hidden\-state probes and calibrates dynamic termination with a Learn\-then\-Test perspective\[[20](https://arxiv.org/html/2606.30852#bib.bib17)\]\. TERMINATOR learns exit points from first\-answer positions\[[12](https://arxiv.org/html/2606.30852#bib.bib18)\]\. Conformal Risk Control, Conformal Language Modeling, and Conformal Thinking motivate distribution\-free risk calibration for language\-model outputs and reasoning budgets\[[2](https://arxiv.org/html/2606.30852#bib.bib19),[13](https://arxiv.org/html/2606.30852#bib.bib20),[18](https://arxiv.org/html/2606.30852#bib.bib21)\]\. Our distinction is not priority over calibrated stopping\. Instead, we ask when a hidden\-state\-free, multi\-feature learned stopper adds value over scalar exits, and how that value changes under realistic overhead assumptions\.

## 3 Method

### 3\.1 Checkpoint Probing

For each question $i$, let $B_0 < \cdots < B_{m-1}$ be a budget grid\. At checkpoint $j$, we use the reasoning prefix up to $B_j$ tokens, append a stop\-thinking marker, and greedily decode a short answer of at most $A$ tokens\. Let $a_{i,j}$ be the forced answer and $y_i$ the gold answer\. The label for training is $c_{i,j} = \mathbb{1}\{a_{i,j} = y_i\}$\.

At inference, the gold answer is unavailable\. LearnStop computes a feature vector $x_{i,j}$ from the current prefix and previously probed checkpoints, estimates $p_{i,j} = \Pr(c_{i,j} = 1 \mid x_{i,j})$, and stops at the first checkpoint with $p_{i,j} \geq \tau$\. If no checkpoint fires, it uses the maximum budget\.
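The stopping rule can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `first_stop_index` and the probability values are made up for the example, and only the budget grid is taken from Section 4.1.

```python
from typing import Sequence


def first_stop_index(probs: Sequence[float], tau: float) -> int:
    """Return the first checkpoint j with p_{i,j} >= tau.

    If no checkpoint fires, fall back to the last checkpoint,
    which corresponds to the maximum budget.
    """
    for j, p in enumerate(probs):
        if p >= tau:
            return j
    return len(probs) - 1


# Main budget grid from Section 4.1.
budgets = [0, 128, 192, 256, 384, 512, 640, 768, 1024, 1536]
# Hypothetical predicted prefix-correctness values for one question.
probs = [0.05, 0.20, 0.41, 0.66, 0.83, 0.90, 0.92, 0.95, 0.97, 0.98]

j = first_stop_index(probs, tau=0.8)
print(j, budgets[j])  # prints "4 384"
```

Raising `tau` moves the stop later along the grid, so the threshold acts as the single knob that trades accuracy against thinking tokens.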

### 3\.2 Prefix\-Observable Features

The feature set contains normalized budget, normalized checkpoint index, mean answer log probability, answer\-token entropy, whether the current answer matches the previous checkpoint, run length of the current answer, prefix vote share, and backtracking\-marker density\. The prefix vote share is computed only over checkpoints $0, \ldots, j$; it never uses future checkpoints\. The full experimental classifier also records whether the model has naturally emitted a thinking\-end marker by the current checkpoint and the observed thinking length\. Because these two features can invite concerns about length leakage, we report an eight\-feature ablation that removes them; the main conclusions are unchanged\. We therefore recommend the eight\-feature variant for deployment and treat the ten\-feature classifier as a diagnostic upper bound\.
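The answer-history features can be illustrated with a short sketch. The exact definitions are not spelled out in the text, so the ones below are assumptions: run length is taken as the number of consecutive checkpoints ending at $j$ that share the current answer, and prefix vote share as the fraction of checkpoints $0, \ldots, j$ whose probed answer equals the current one. The function and key names are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def answer_history_features(answers: List[str], j: int) -> Dict[str, float]:
    """Compute answer-history features at checkpoint j.

    Only probed answers from checkpoints 0..j are used, so nothing
    leaks from future checkpoints.
    """
    prefix = answers[: j + 1]
    current = prefix[-1]

    # 1.0 if the current answer equals the previous checkpoint's answer.
    matches_prev = float(j > 0 and prefix[-2] == current)

    # Count consecutive checkpoints, ending at j, with the same answer.
    run_length = 0
    for a in reversed(prefix):
        if a != current:
            break
        run_length += 1

    # Assumed definition: share of checkpoints 0..j voting for the current answer.
    vote_share = Counter(prefix)[current] / len(prefix)

    return {
        "matches_prev": matches_prev,
        "run_length": float(run_length),
        "prefix_vote_share": vote_share,
    }


# Made-up probed answers for one question across six checkpoints.
answers = ["12", "18", "18", "12", "18", "18"]
feats = answer_history_features(answers, j=4)
print(feats)
```

At checkpoint 4 the answer has just flipped back to "18", so the run length resets to 1 while the vote share (3 of 5) still reflects earlier agreement. This kind of disagreement between signals is what a multi-feature stopper can exploit.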

### 3\.3 Training and Metrics

For in\-distribution frontiers, we train logistic regression with grouped five\-fold cross\-validation, keeping all checkpoints of the same question in the same fold\. We compare to a fixed\-budget frontier\. At threshold $\tau$, the adapt gain is

$$G(\tau) = \mathrm{Acc}_{\mathrm{adapt}}(\tau) - \mathrm{Acc}_{\mathrm{fixed}}\bigl(C_{\mathrm{adapt}}(\tau)\bigr), \tag{1}$$

where the fixed\-budget accuracy is linearly interpolated at the same mean thinking\-token cost\. Peak gain summarizes the best point on the empirical frontier; validation\-selected gains use a held\-out validation split to choose the threshold before test evaluation\.
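As a rough sketch of Eq. (1), the snippet below interpolates a fixed-budget frontier at the adaptive policy's mean cost and subtracts. The frontier values are invented for illustration, and clamping outside the measured cost range is an assumption rather than something the text specifies.

```python
from typing import Sequence


def interp_fixed_accuracy(cost: float,
                          fixed_costs: Sequence[float],
                          fixed_accs: Sequence[float]) -> float:
    """Linearly interpolate fixed-budget accuracy at a given mean cost.

    fixed_costs must be strictly increasing. Costs outside the measured
    range are clamped to the nearest endpoint (an assumption).
    """
    if cost <= fixed_costs[0]:
        return fixed_accs[0]
    if cost >= fixed_costs[-1]:
        return fixed_accs[-1]
    for k in range(1, len(fixed_costs)):
        if cost <= fixed_costs[k]:
            span = fixed_costs[k] - fixed_costs[k - 1]
            w = (cost - fixed_costs[k - 1]) / span
            return fixed_accs[k - 1] + w * (fixed_accs[k] - fixed_accs[k - 1])
    return fixed_accs[-1]


def adapt_gain(acc_adapt: float, cost_adapt: float,
               fixed_costs: Sequence[float],
               fixed_accs: Sequence[float]) -> float:
    """Eq. (1): adaptive accuracy minus fixed-budget accuracy at equal cost."""
    return acc_adapt - interp_fixed_accuracy(cost_adapt, fixed_costs, fixed_accs)


# Hypothetical fixed-budget frontier: (mean thinking tokens, accuracy).
fixed_costs = [0.0, 100.0, 200.0]
fixed_accs = [0.2, 0.5, 0.7]

# An adaptive policy with accuracy 0.65 at mean cost 150 is compared
# against the interpolated fixed-budget accuracy of about 0.60.
gain = adapt_gain(0.65, 150.0, fixed_costs, fixed_accs)
print(round(gain, 3))  # prints "0.05"
```

Peak gain is then the maximum of this quantity over the threshold grid, which is why it is a post-hoc summary rather than a deployable operating point.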

Probe answers add cost\. Our default total\-token accounting for a question stopped after checkpoint $j$ is

$$C_i^{\mathrm{total}} = C_i^{\mathrm{think}} + (j+1)A, \tag{2}$$

with $A = 48$ unless otherwise stated\. This decode\-token cost assumes an inference engine can reuse the prefix or fork the KV cache\. We separately evaluate prefix\-cache and pure black\-box API regimes because repeated prefilling can dominate savings\.
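Eq. (2) is simple enough to check by hand; the sketch below just makes the arithmetic explicit. The function name is hypothetical, and each probe is charged the full cap $A$, as in the formula.

```python
def total_tokens(think_tokens: int, stop_checkpoint: int, answer_cap: int = 48) -> int:
    """Eq. (2): thinking tokens plus one probe answer per checkpoint 0..j.

    Each of the j + 1 probes is charged the full answer cap A.
    """
    return think_tokens + (stop_checkpoint + 1) * answer_cap


# A question stopped at checkpoint j = 4 after 384 thinking tokens.
print(total_tokens(384, 4))  # prints "624"

# Maximum probe-answer overhead when every checkpoint is probed:
print(total_tokens(0, 9))  # ten checkpoints, prints "480"
print(total_tokens(0, 5))  # six checkpoints, prints "288"
```

The last two values match the 480-token and 288-token maximum probe overheads quoted for the ten-checkpoint and six-checkpoint schedules in Section 4.5.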

### 3\.4 Lost\-Correct Risk Calibration

Let $F_i$ be full\-thinking correctness and $S_i(\tau)$ stopped\-answer correctness\. We define lost\-correct risk as

$$L_i(\tau) = \mathbb{1}\{F_i = 1,\ S_i(\tau) = 0\}. \tag{3}$$

This is not equal to accuracy drop; rather,

$$\begin{aligned} \mathrm{Acc}_{\mathrm{full}} - \mathrm{Acc}_{\tau} &= \mathbb{E}[L_i(\tau)] - \Pr(F_i = 0,\ S_i(\tau) = 1) && \text{(4)} \\ &\leq \mathbb{E}[L_i(\tau)]. && \text{(5)} \end{aligned}$$

Thus controlling $\mathbb{E}[L_i(\tau)]$ controls an upper bound on loss relative to full thinking\. For a finite threshold grid $\mathcal{T}$, calibration set size $n$, and confidence $1 - \delta$, we use the simultaneous Hoeffding upper bound

$$U(\tau) = \widehat{R}_{\mathrm{cal}}(\tau) + \sqrt{\frac{\log(|\mathcal{T}|/\delta)}{2n}}. \tag{6}$$

We select the most aggressive threshold with $U(\tau) \leq \alpha$ and evaluate it on a disjoint test split\. This is a finite\-grid risk\-control procedure under exchangeability of calibration and test examples\.
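The selection step in Eq. (6) can be sketched as follows. The calibration risks are invented numbers, the function names are hypothetical, and "most aggressive" is interpreted here as the lowest qualifying $\tau$ (the one that stops earliest), which is an assumption about the intended ordering.

```python
import math
from typing import Dict, Optional


def hoeffding_ucb(emp_risk: float, n: int, grid_size: int, delta: float) -> float:
    """Eq. (6): empirical risk plus a union-bounded Hoeffding slack."""
    return emp_risk + math.sqrt(math.log(grid_size / delta) / (2 * n))


def select_threshold(cal_risk: Dict[float, float], n: int,
                     alpha: float, delta: float) -> Optional[float]:
    """Return the lowest tau whose upper bound is at most alpha.

    cal_risk maps each tau on the grid to its empirical lost-correct
    risk on the calibration set. Returns None if no tau qualifies,
    in which case the caller should fall back to full thinking.
    """
    grid_size = len(cal_risk)
    feasible = [tau for tau, risk in cal_risk.items()
                if hoeffding_ucb(risk, n, grid_size, delta) <= alpha]
    return min(feasible) if feasible else None


# Hypothetical empirical lost-correct risks on n = 1000 calibration questions.
cal_risk = {0.5: 0.12, 0.7: 0.10, 0.9: 0.03}
chosen = select_threshold(cal_risk, n=1000, alpha=0.15, delta=0.05)
print(chosen)  # prints "0.7"
```

In this example the slack term is about 0.045, so $\tau = 0.5$ is rejected (bound of about 0.165) even though its empirical risk of 0.12 is below the target. The bound is conservative on small calibration sets, which is one reason the held-out test risks in Section 4.4 sit well below $\alpha$.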

## 4 Experiments

### 4\.1 Setup

The primary models are Qwen3\-8B and Qwen3\-32B\. Cross\-family checks use DeepSeek\-R1\-Distill\-Qwen\-7B and DeepSeek\-R1\-Distill\-Llama\-8B\. We evaluate GSM8K, MATH\-500, MMLU\-Pro, GPQA\-Diamond, and AIME\-90\[[3](https://arxiv.org/html/2606.30852#bib.bib3),[7](https://arxiv.org/html/2606.30852#bib.bib4),[8](https://arxiv.org/html/2606.30852#bib.bib5),[19](https://arxiv.org/html/2606.30852#bib.bib6),[15](https://arxiv.org/html/2606.30852#bib.bib7),[9](https://arxiv.org/html/2606.30852#bib.bib8),[1](https://arxiv.org/html/2606.30852#bib.bib9)\]\. MATH\-500 is the 500\-problem MATH evaluation subset popularized by process\-supervision work\. AIME\-90 is the Hugging Face AI\-MO/aimo\-validation\-aime subset: 2022, 2023, and 2024 AIME I/II, problems 1–15 from each exam, for 90 integer\-answer questions scored against the public answer keys\. We do not deduplicate or filter this subset\. Because AIME\-90 and GPQA\-Diamond are small and difficult, we treat them as stress tests and do not base the main positive claim on them\. The main budget grid is $[0, 128, 192, 256, 384, 512, 640, 768, 1024, 1536]$; AIME uses a longer grid up to 6144 tokens\. Baselines include confidence exit, entropy exit, confidence\-leap exit, run\-stability exit, DEER\-style transition exit, EAT\-style entropy stability, a PUMA\-style convergence proxy, and TERMINATOR\-light\. Unless noted otherwise, CIs are question\-level bootstrap intervals\.

![Refer to caption](https://arxiv.org/html/2606.30852v1/x1.png)

Figure 1: Accuracy–cost frontier on GSM8K with Qwen3\-32B\. The learned stopper improves over fixed budgets across a broad cost range\. The circled point is the peak adapt gain\.

### 4\.2 When Does Learning Help?

Figure [1](https://arxiv.org/html/2606.30852#S4.F1) shows the clearest case for learning: GSM8K with Qwen3\-32B\. The learned frontier dominates the fixed\-budget frontier over a broad token range, with peak adapt gain \+0\.157\.

Table [1](https://arxiv.org/html/2606.30852#S4.T1) shows the broader picture\. The learned stopper is strongest on free\-form math\. Against the strongest scalar baseline in each row, paired bootstrap intervals show a clear advantage on GSM8K\-32B \(\+0\.028 over entropy exit\) and MATH\-500\-8B \(\+0\.023 over entropy exit\), and a positive but not significant difference on MATH\-500\-32B\. It does not beat the strongest scalar baseline on MMLU\-Pro, AIME\-90, or GPQA\. Validation\-selected gains preserve the same qualitative pattern: positive on GSM8K and MATH\-500, weak or unstable on the hardest and multiple\-choice settings\.
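For readers unfamiliar with paired bootstrap intervals, the sketch below shows a generic question-level percentile bootstrap on per-question score differences. It is an illustration under assumptions, not the paper's exact procedure: the paper compares cost-matched gains, whereas this example resamples plain per-question correctness, and the data, function name, and resample count are all made up.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(a: List[float], b: List[float],
                        n_boot: int = 2000, level: float = 0.95,
                        seed: int = 0) -> Tuple[float, float, float]:
    """Percentile bootstrap CI for mean(a) - mean(b), resampling questions.

    a[i] and b[i] must be scores of two policies on the same question i,
    so each resample keeps the pairing intact.
    """
    if len(a) != len(b):
        raise ValueError("a and b must be paired per question")
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    point = sum(diffs) / n

    stats = []
    for _ in range(n_boot):
        # Resample question indices with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(sample) / n)
    stats.sort()

    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = min(n_boot - 1, int((1 + level) / 2 * n_boot))
    return point, stats[lo_idx], stats[hi_idx]


# Hypothetical per-question correctness for two stopping policies.
learned = [1.0, 1.0, 0.0, 1.0, 1.0, 0.0, 1.0, 1.0]
scalar = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 1.0]
point, lo, hi = paired_bootstrap_ci(learned, scalar)
print(point, lo, hi)
```

A difference is usually called significant at the chosen level when the interval excludes zero. With only eight made-up questions the interval here is wide, which mirrors why the small AIME-90 and GPQA sets are treated as stress tests.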

Table 1: Primary Qwen3 results\. Peak gain is post\-hoc over thresholds; Val\.\-Sel\. is a validation\-selected test gain\. The paired CI compares learned stopping to the strongest scalar baseline in the row\. AIME\-90 and GPQA\-Diamond are small, difficult stress tests, so their rows should be read as evidence against aggressive stopping rather than as precise method rankings\. Learned stopping helps most on free\-form math and is not uniformly better than scalar exits\.

![Refer to caption](https://arxiv.org/html/2606.30852v1/x2.png)

Figure 2: Why the result is task\-dependent\. Left: paired difference between learned stopping and the strongest scalar baseline\. Right: trajectory decomposition for Qwen3\-32B\. GSM8K has many early\-solved and oscillating trajectories, creating room for multi\-feature stopping; AIME and GPQA contain many unsolved or harmful trajectories, where aggressive learned stopping is harder to justify\.

Figure [2](https://arxiv.org/html/2606.30852#S4.F2) explains the pattern\. GSM8K has many early\-solved questions and nontrivial answer oscillation, so confidence, stability, and trace features complement one another\. MMLU\-Pro has many early\-solved multiple\-choice questions, making confidence and confidence leaps strong\. AIME\-90 and GPQA have large unsolved fractions; in these settings, the safest policy is often to spend the budget or use a conservative scalar exit\.

### 4\.3 Ablations and Model Classes

Ablations support two claims\. First, the potentially ambiguous length/end features are not responsible for the gains: removing them changes peak gain by less than 0\.007 across the primary settings, so the eight\-feature model is our recommended deployment variant\. Second, no single feature family explains all math gains\. On GSM8K\-32B, confidence\-only gives \+0\.088, entropy\-only \+0\.136, stability\-only \+0\.093, and the full feature set \+0\.157\. Gradient boosting and MLP slightly improve GSM8K\-32B \(\+0\.174 and \+0\.176\), but logistic regression is best on MATH\-500 and remains competitive elsewhere\. We therefore use logistic regression as the default because it is simpler and easier to calibrate\.

### 4\.4 Risk\-Controlled Operating Points

Table [2](https://arxiv.org/html/2606.30852#S4.T2) reports finite\-grid lost\-correct risk calibration at $\alpha = 0.15$, $\delta = 0.05$\. All selected thresholds satisfy the calibration upper bound and all held\-out test risks are below the target\. The savings are substantial but smaller after probe accounting\. For GSM8K\-32B, the $\alpha = 0.15$ operating point has test risk 0\.040, accuracy 0\.915, 54\.4% think\-token saving, and 32\.5% total\-token saving\. This is the right deployment interpretation: risk calibration can provide a user\-facing knob, but probe overhead changes the practical savings\.

Table 2: Lost\-correct risk control at $\alpha = 0.15$, $\delta = 0.05$\. $U$ is the finite\-grid calibration upper bound\. Think and Total are percent savings relative to full thinking; Total includes probe answers\.

### 4\.5 Deployment Cost and Checkpoint Schedules

LearnStop is hidden\-state\-free, but not cost\-free\. A pure black\-box API that requires resending the entire prefix at every checkpoint can eliminate or reverse savings\. The favorable regime is an inference engine with KV\-cache fork/resume or prefix\-cache reuse\. To quantify this, we profiled Qwen3 models on an H100 PCIe\. For Qwen3\-32B with batch size one and an answer cap of 48 tokens, schedules with 4, 7, and 10 checkpoints take 3\.59, 6\.51, and 10\.20 seconds per question, respectively\. Checkpoint density therefore matters\.

Figure [3](https://arxiv.org/html/2606.30852#S4.F3) shows a practical schedule sweep on GSM8K\-32B\. A six\-checkpoint linear schedule recovers 96% of the ten\-checkpoint gain while reducing maximum probe\-answer overhead from 480 to 288 tokens\. A denser 14\-checkpoint schedule improves peak gain to \+0\.171, but at much higher probing cost\. This motivates a deployment recipe: begin with a small linear checkpoint grid, use the eight\-feature stopper, and calibrate the threshold to a user\-specified lost\-correct risk\.

![Refer to caption](https://arxiv.org/html/2606.30852v1/x3.png)

Figure 3: Deployment sensitivity\. Left: checkpoint schedule sweep on GSM8K\-32B\. Six linear checkpoints retain 96% of the ten\-checkpoint gain\. Right: H100 latency grows roughly with the number of probes for Qwen3\-32B at answer cap 48; latency includes checkpoint probe answer generation under the same answer cap\.

### 4\.6 Transfer, Robustness, and Cross\-Family Checks

Transfer experiments distinguish source\-threshold zero\-shot transfer, target\-calibrated transfer, and target\-trained upper bounds\. The strongest asymmetry appears from harder math to easier math: MATH\-500 to GSM8K with Qwen3\-32B obtains target\-calibrated gain \+0\.179, close to or above target\-trained performance in the same split\. Easy\-to\-hard transfer is weaker, and transfer to AIME is unreliable\. Cross\-model transfer is also asymmetric: Qwen3\-8B to Qwen3\-32B on GSM8K benefits from target calibration, while transfer from Qwen3 to DeepSeek\-R1 distillations is weak\. These results support target\-calibrated reuse within related model families, not universal zero\-shot portability\.

Robustness checks are consistent with the main finding\. On GSM8K\-32B, sampling at temperature 0\.6 over three seeds gives peak gains 0\.148–0\.179, close to the greedy gain 0\.157\. Probe\-template choice matters: a terse answer template and a no\-reasoning template both preserve gains, but a brittle “the answer is” template collapses full accuracy to 0\.230\. Concise prompting is a useful diagnostic, not a replacement: it improves GSM8K and MATH token cost, but damages MMLU\-Pro, AIME\-90, and GPQA\.

## 5 Discussion and Limitations

The strongest conclusion is not that learned stopping dominates existing early\-exit methods\. It does not\. The useful conclusion is conditional: learning helps when prefix correctness depends on multiple partially independent signals\. Free\-form math often has this property; multiple\-choice tasks often do not\. This reframes learned stopping as a task\-aware component in a serving stack, not a universal early\-exit rule\.

There are several limitations\. First, peak gain is a frontier summary, not a deployable number; validation\-selected and risk\-calibrated points are more conservative\. Second, our method requires an interactive inference interface with prefix reuse\. Under pure black\-box repeated\-prefill accounting, probe overhead can exceed savings\. Third, AIME\-90 and GPQA have small or difficult test sets, so negative results there should be interpreted as evidence against aggressive stopping rather than as a precise ranking of methods\. Fourth, our PUMA and Thought Calibration comparisons are constrained by available artifacts: PUMA is approximated by an output\-level convergence proxy, and Thought Calibration requires hidden states not collected by our hidden\-state\-free pipeline\. Finally, the AIME\-90 subset is useful as a hard stress test, but conclusions about AIME should be treated as preliminary until the exact subset definition is standardized across runs\.

## 6 Responsible Use

Early stopping changes the amount of reasoning a model performs and can therefore change which errors are exposed or hidden\. We do not recommend deploying aggressive stopping policies in high\-stakes settings without task\-specific validation, lost\-correct risk targets, and a full\-thinking fallback\. Risk tolerance should be chosen by the application owner rather than tuned only for average accuracy or cost\. Systems should also monitor whether premature\-stop risk is uneven across domains, question formats, or user groups, especially when the task distribution differs from the calibration set\.

## 7 Conclusion

We presented LearnStop, a prefix\-observable learned stopper for reasoning models, and used it to study when learning to stop is worthwhile\. Across 18 task–model settings and multiple baselines, the answer is not universal\. Learned stopping is most useful on free\-form math, where confidence, stability, and trace features provide complementary evidence\. It is less useful on multiple\-choice or very hard settings, where scalar exits or conservative budgets are competitive\. Risk calibration and cost accounting are essential: user\-facing stopping policies should control lost\-correct risk and report savings under the deployment regime they actually assume\. This perspective turns early stopping from a single\-method race into a design question: choose the simplest stopping signal that captures the trajectory structure of the target task under the real serving cost model\.

## Appendix Overview

This appendix contains implementation details, complete frontier tables, validation\-selected operating points, extended baseline proxies, ablations, calibration diagnostics, serving profiles, transfer analyses, and robustness checks\. It is integrated into the arXiv version for completeness\.

## Appendix A Implementation Summary

The experiments use a unified checkpoint protocol\. At each budget checkpoint, the system branches a short answer probe from the current reasoning prefix, extracts output\-level features, and either stops or continues generation\. The default answer cap is 48 tokens; all latency profiles that vary checkpoint count include checkpoint probe\-answer generation under this cap\. Reported confidence intervals are question\-level bootstrap intervals unless stated otherwise\. Peak gains summarize empirical frontiers; validation\-selected gains and risk\-controlled operating points are the deployment\-oriented quantities\.

#### Cost regimes\.

KV\-fork assumes that an inference engine can branch a probe answer and then resume the unmodified reasoning state\. Prefix\-cache accounting assumes prefix reuse but additional serving overhead\. Pure black\-box API accounting assumes repeated prefilling and can make probing uneconomical\.
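To see why repeated prefilling can reverse savings, the sketch below contrasts two of these regimes with a deliberately crude token count. Everything beyond Eq. (2) is an assumption made for illustration: the black-box model charges each re-sent prefix token the same as a decoded token, ignores the question prompt, and assumes the thinking length at stop equals the checkpoint budget. Real prefill is typically cheaper per token than decode, so this overstates the black-box cost.

```python
from typing import Sequence


def kv_fork_tokens(budgets: Sequence[int], j: int, answer_cap: int = 48) -> int:
    """Eq. (2) with thinking length equal to the budget at checkpoint j."""
    return budgets[j] + (j + 1) * answer_cap


def black_box_tokens(budgets: Sequence[int], j: int, answer_cap: int = 48) -> int:
    """Assumed black-box accounting: every probe re-sends its full prefix.

    Re-sent prefix tokens are weighted the same as decoded tokens,
    which is a simplifying assumption, not the paper's cost model.
    """
    reprefill = sum(budgets[: j + 1])  # prefix re-sent at checkpoints 0..j
    return budgets[j] + reprefill + (j + 1) * answer_cap


# Main budget grid from Section 4.1.
budgets = [0, 128, 192, 256, 384, 512, 640, 768, 1024, 1536]

# Stopping at checkpoint 4 (384 thinking tokens):
print(kv_fork_tokens(budgets, 4))    # prints "624"
print(black_box_tokens(budgets, 4))  # prints "1584"
```

Under these assumptions, stopping at 384 thinking tokens costs 1584 token-equivalents in the black-box regime, more than simply running the full 1536-token budget with no probes, while the KV-fork regime still saves more than half.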

## Appendix B Dataset Subsets

MATH\-500 denotes the 500\-problem MATH evaluation subset used by process\-supervision evaluations\. AIME\-90 is drawn from the Hugging Face AI\-MO/aimo\-validation\-aime dataset and contains 2022, 2023, and 2024 AIME I and AIME II problems 1–15, for 90 total integer\-answer problems\. We score against the provided public answer keys and do not deduplicate or filter the subset\. We report AIME\-90 as a hard stress test rather than as the primary source of positive claims\.

## Appendix C Full In\-Distribution Frontier Summary

| Task | Model | Full Acc. | Mean Think | Peak Gain | 95% CI |
| --- | --- | --- | --- | --- | --- |
| GSM8K | Qwen3-8B | 0.888 | 1443 | 0.047 | [0.029, 0.063] |
| GSM8K | Qwen3-32B | 0.929 | 1060 | 0.157 | [0.132, 0.183] |
| MATH-500 | Qwen3-8B | 0.582 | 2819 | 0.047 | [0.024, 0.073] |
| MATH-500 | Qwen3-32B | 0.604 | 2517 | 0.086 | [0.056, 0.117] |
| MMLU-Pro | Qwen3-8B | 0.645 | 1508 | 0.013 | [-0.002, 0.026] |
| MMLU-Pro | Qwen3-32B | 0.724 | 1182 | 0.035 | [0.014, 0.058] |
| AIME-90 | Qwen3-8B | 0.311 | 7488 | 0.032 | [0.004, 0.061] |
| AIME-90 | Qwen3-32B | 0.322 | 7283 | 0.016 | [-0.006, 0.039] |
| GSM8K | DS-R1-Qwen-7B | 0.824 | 207 | 0.098 | [0.062, 0.134] |
| GSM8K | DS-R1-Llama-8B | 0.420 | 165 | 0.000 | [0.000, 0.000] |
| MATH-500 | DS-R1-Qwen-7B | 0.592 | 1789 | 0.060 | [0.027, 0.092] |
| MATH-500 | DS-R1-Llama-8B | 0.332 | 1381 | 0.015 | [0.003, 0.028] |
| MMLU-Pro | DS-R1-Qwen-7B | 0.502 | 1666 | 0.024 | [0.004, 0.043] |
| MMLU-Pro | DS-R1-Llama-8B | 0.401 | 1632 | 0.001 | [-0.008, 0.010] |
| GPQA | Qwen3-8B | 0.485 | 2877 | 0.018 | [-0.011, 0.042] |
| GPQA | Qwen3-32B | 0.530 | 2535 | 0.048 | [-0.002, 0.100] |
| GPQA | DS-R1-Qwen-7B | 0.348 | 2730 | 0.020 | [-0.031, 0.081] |
| GPQA | DS-R1-Llama-8B | 0.313 | 2778 | 0.000 | [-0.015, 0.015] |

## Appendix D Validation\-Selected Operating Points

| Task | Model | Policy | $\tau_{\mathrm{val}}$ | Test Gain | Test Acc. | Test Tok. |
| --- | --- | --- | --- | --- | --- | --- |
| GSM8K | Qwen3-8B | LearnStop | 0.800 | 0.025 | 0.843 | 608 |
| GSM8K | Qwen3-8B | Confidence | 0.983 | -0.036 | 0.848 | 918 |
| GSM8K | Qwen3-8B | Entropy | 0.252 | 0.006 | 0.357 | 157 |
| GSM8K | Qwen3-8B | Run-stability | 2.000 | 0.066 | 0.527 | 249 |
| GSM8K | Qwen3-32B | LearnStop | 0.600 | 0.136 | 0.733 | 274 |
| GSM8K | Qwen3-32B | Confidence | 0.902 | 0.084 | 0.678 | 270 |
| GSM8K | Qwen3-32B | Entropy | 0.239 | 0.119 | 0.713 | 270 |
| GSM8K | Qwen3-32B | Run-stability | 2.000 | 0.082 | 0.638 | 235 |
| MATH-500 | Qwen3-8B | LearnStop | 0.550 | 0.043 | 0.573 | 995 |
| MATH-500 | Qwen3-8B | Confidence | 0.983 | 0.017 | 0.573 | 1202 |
| MATH-500 | Qwen3-8B | Entropy | 0.074 | 0.022 | 0.570 | 1134 |
| MATH-500 | Qwen3-8B | Run-stability | 7.000 | 0.004 | 0.560 | 1198 |
| MATH-500 | Qwen3-32B | LearnStop | 0.300 | 0.077 | 0.417 | 300 |
| MATH-500 | Qwen3-32B | Confidence | 0.875 | 0.019 | 0.337 | 154 |
| MATH-500 | Qwen3-32B | Entropy | 0.285 | 0.043 | 0.357 | 195 |
| MATH-500 | Qwen3-32B | Run-stability | 4.000 | 0.051 | 0.520 | 679 |
| MMLU-Pro | Qwen3-8B | LearnStop | 0.500 | 0.004 | 0.490 | 135 |
| MMLU-Pro | Qwen3-8B | Confidence | 0.905 | -0.003 | 0.483 | 109 |
| MMLU-Pro | Qwen3-8B | Entropy | 0.041 | 0.006 | 0.627 | 781 |
| MMLU-Pro | Qwen3-8B | Run-stability | 3.000 | 0.006 | 0.494 | 160 |
| MMLU-Pro | Qwen3-32B | LearnStop | 0.650 | 0.026 | 0.615 | 297 |
| MMLU-Pro | Qwen3-32B | Confidence | 0.952 | 0.042 | 0.656 | 411 |
| MMLU-Pro | Qwen3-32B | Entropy | 0.245 | 0.033 | 0.621 | 289 |
| MMLU-Pro | Qwen3-32B | Run-stability | 4.000 | 0.009 | 0.594 | 266 |
| AIME-90 | Qwen3-8B | LearnStop | 0.300 | 0.044 | 0.315 | 4936 |
| AIME-90 | Qwen3-8B | Confidence | 0.778 | 0.018 | 0.018 | 66 |
| AIME-90 | Qwen3-8B | Entropy | 0.471 | 0.018 | 0.018 | 199 |
| AIME-90 | Qwen3-8B | Run-stability | 2.000 | 0.037 | 0.056 | 806 |
| AIME-90 | Qwen3-32B | LearnStop | 0.350 | -0.009 | 0.259 | 4852 |
| AIME-90 | Qwen3-32B | Confidence | 0.673 | 0.006 | 0.018 | 90 |
| AIME-90 | Qwen3-32B | Entropy | 0.716 | 0.011 | 0.018 | 152 |
| AIME-90 | Qwen3-32B | Run-stability | 6.000 | 0.033 | 0.296 | 4758 |
| GPQA | Qwen3-8B | LearnStop | 0.400 | 0.017 | 0.445 | 334 |
| GPQA | Qwen3-8B | Confidence | 0.855 | -0.004 | 0.420 | 97 |
| GPQA | Qwen3-8B | Entropy | 0.316 | -0.007 | 0.420 | 119 |
| GPQA | Qwen3-8B | Run-stability | 9.000 | 0.008 | 0.521 | 1108 |
| GPQA | Qwen3-32B | LearnStop | 0.400 | -0.022 | 0.361 | 334 |
| GPQA | Qwen3-32B | Confidence | 0.855 | 0.005 | 0.420 | 492 |
| GPQA | Qwen3-32B | Entropy | 0.657 | -0.021 | 0.370 | 142 |
| GPQA | Qwen3-32B | Run-stability | 2.000 | -0.005 | 0.387 | 141 |

## Appendix E Extended Baselines

The main text compares against scalar confidence, entropy, confidence\-leap, and stability exits\. The table below gives additional implemented proxies for recent methods: DEER\-style transition confidence, EAT\-style entropy stability, PUMA\-style convergence, and TERMINATOR\-light\. These are output\-level proxies under our common checkpoint protocol, not exact end\-to\-end reproductions of all original systems; their purpose is to compare signal families under identical probes, schedules, and cost accounting\.

| Task | Model | DEER | EAT | PUMA | TERM-light |
| --- | --- | --- | --- | --- | --- |
| AIME-90 | Qwen3-32B | 0.022 | 0.001 | 0.031 | 0.004 |
| AIME-90 | Qwen3-8B | 0.046 | 0.048 | 0.050 | 0.032 |
| GPQA | Qwen3-32B | 0.049 | 0.035 | 0.050 | 0.005 |
| GPQA | Qwen3-8B | 0.039 | 0.007 | 0.017 | 0.009 |
| GSM8K | Qwen3-32B | 0.092 | 0.015 | 0.093 | 0.161 |
| GSM8K | Qwen3-8B | 0.007 | 0.001 | 0.065 | 0.062 |
| MATH-500 | Qwen3-32B | 0.062 | 0.020 | 0.048 | 0.072 |
| MATH-500 | Qwen3-8B | 0.021 | 0.016 | 0.021 | 0.052 |
| MMLU-Pro | Qwen3-32B | 0.045 | 0.024 | 0.026 | 0.025 |
| MMLU-Pro | Qwen3-8B | 0.017 | 0.001 | 0.037 | 0.005 |

## Appendix FFeature Ablations

| Task | Model | Conf. | Ent. | Conf+Ent | Stability | 8 feat. | 10 feat. |
|---|---|---|---|---|---|---|---|
| AIME-90 | Qwen3-32B | 0.006 | 0.008 | 0.015 | 0.010 | 0.009 | 0.016 |
| AIME-90 | Qwen3-8B | 0.019 | 0.018 | 0.017 | 0.007 | 0.034 | 0.032 |
| GPQA | Qwen3-32B | 0.050 | 0.025 | 0.026 | 0.000 | 0.041 | 0.048 |
| GPQA | Qwen3-8B | 0.012 | 0.012 | 0.017 | 0.000 | 0.018 | 0.018 |
| GSM8K | Qwen3-32B | 0.088 | 0.136 | 0.154 | 0.093 | 0.157 | 0.157 |
| GSM8K | Qwen3-8B | 0.007 | 0.019 | 0.018 | 0.071 | 0.046 | 0.047 |
| MATH-500 | Qwen3-32B | 0.063 | 0.072 | 0.065 | 0.055 | 0.088 | 0.086 |
| MATH-500 | Qwen3-8B | 0.017 | 0.031 | 0.030 | 0.025 | 0.049 | 0.047 |
| MMLU-Pro | Qwen3-32B | 0.044 | 0.044 | 0.043 | 0.022 | 0.039 | 0.035 |
| MMLU-Pro | Qwen3-8B | 0.013 | 0.011 | 0.008 | 0.011 | 0.013 | 0.013 |
## Appendix G Classifier Class Comparison

| Task | Model | Logistic | RF | GBT | MLP |
|---|---|---|---|---|---|
| AIME-90 | Qwen3-32B | 0.009 | 0.012 | 0.019 | 0.007 |
| AIME-90 | Qwen3-8B | 0.034 | 0.035 | 0.018 | 0.028 |
| GPQA | Qwen3-32B | 0.041 | 0.051 | 0.039 | 0.045 |
| GPQA | Qwen3-8B | 0.018 | 0.018 | 0.013 | 0.008 |
| GSM8K | Qwen3-32B | 0.157 | 0.170 | 0.174 | 0.176 |
| GSM8K | Qwen3-8B | 0.046 | 0.066 | 0.058 | 0.062 |
| MATH-500 | Qwen3-32B | 0.088 | 0.078 | 0.071 | 0.073 |
| MATH-500 | Qwen3-8B | 0.049 | 0.045 | 0.035 | 0.042 |
| MMLU-Pro | Qwen3-32B | 0.039 | 0.042 | 0.050 | 0.041 |
| MMLU-Pro | Qwen3-8B | 0.013 | 0.027 | 0.035 | 0.031 |
## Appendix H Risk-Controlled Operating Points

The table below reports held-out risk-control selections from the finite-grid lost-correct risk procedure. Savings are percentages relative to full thinking.

| Task | Model | $\alpha$ | $U$ | Risk | Acc. | Think Save (%) | Total Save (%) |
|---|---|---|---|---|---|---|---|
| GSM8K | Qwen3-32B | 0.10 | 0.094 | 0.000 | 0.943 | 29.1 | -0.7 |
| GSM8K | Qwen3-32B | 0.15 | 0.139 | 0.040 | 0.915 | 54.4 | 32.5 |
| GSM8K | Qwen3-32B | 0.20 | 0.191 | 0.082 | 0.878 | 60.8 | 41.5 |
| GSM8K | Qwen3-8B | 0.10 | 0.096 | 0.013 | 0.893 | 43.1 | 20.7 |
| GSM8K | Qwen3-8B | 0.15 | 0.131 | 0.040 | 0.878 | 54.4 | 34.8 |
| GSM8K | Qwen3-8B | 0.20 | 0.196 | 0.113 | 0.815 | 63.3 | 46.0 |
| MATH-500 | Qwen3-32B | 0.15 | 0.149 | 0.040 | 0.577 | 60.4 | 46.0 |
| MATH-500 | Qwen3-32B | 0.20 | 0.199 | 0.093 | 0.533 | 75.1 | 64.5 |
| MATH-500 | Qwen3-8B | 0.15 | 0.149 | 0.043 | 0.573 | 67.7 | 55.5 |
| MATH-500 | Qwen3-8B | 0.20 | 0.194 | 0.083 | 0.537 | 71.1 | 59.8 |
| MMLU-Pro | Qwen3-32B | 0.15 | 0.149 | 0.035 | 0.694 | 54.3 | 31.6 |
| MMLU-Pro | Qwen3-32B | 0.20 | 0.199 | 0.058 | 0.673 | 64.4 | 45.7 |
| MMLU-Pro | Qwen3-8B | 0.15 | 0.124 | 0.023 | 0.617 | 47.2 | 22.1 |
| MMLU-Pro | Qwen3-8B | 0.20 | 0.162 | 0.056 | 0.590 | 58.8 | 37.5 |

![Refer to caption](https://arxiv.org/html/2606.30852v1/x4.png)

Figure 4: Risk-control summary for Qwen3-32B. Left: savings at $\alpha = 0.15$. Right: held-out test risk versus target risk.
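As a rough illustration of what a finite-grid risk-control selection can look like, the sketch below picks, from a fixed grid of candidate thresholds, the one with the largest savings whose upper bound on lost-correct risk stays at or below the target $\alpha$. Several pieces are assumptions rather than details taken from this appendix: a question is counted as lost when full thinking is correct but the early-exit answer is not, the bound is a one-sided Hoeffding bound with a Bonferroni split over the grid, and the names `select_threshold`, `hoeffding_upper_bound`, and `delta` are hypothetical.

```python
import math
from typing import Optional, Sequence, Tuple


def hoeffding_upper_bound(mean: float, n: int, delta: float) -> float:
    """One-sided Hoeffding upper bound on a [0, 1] mean at level delta."""
    return min(1.0, mean + math.sqrt(math.log(1.0 / delta) / (2.0 * n)))


def select_threshold(
    lost_correct: Sequence[Sequence[int]],  # lost_correct[k][i] in {0, 1}
    savings: Sequence[float],               # mean savings at grid point k
    alpha: float,
    delta: float = 0.1,
) -> Optional[Tuple[int, float]]:
    """Pick the grid point with the largest savings whose bound is <= alpha.

    Returns (grid index, upper bound), or None when no grid point qualifies
    and the caller should fall back to full thinking.
    """
    grid_size = len(lost_correct)
    best: Optional[Tuple[int, float]] = None
    for k, losses in enumerate(lost_correct):
        n = len(losses)
        risk = sum(losses) / n
        # Bonferroni correction (assumed): split delta across the finite grid.
        bound = hoeffding_upper_bound(risk, n, delta / grid_size)
        if bound <= alpha and (best is None or savings[k] > savings[best[0]]):
            best = (k, bound)
    return best


# Toy calibration data (made up): three candidate thresholds, 1000 questions.
calibration = [
    [0] * 1000,             # conservative threshold: nothing lost
    [1] * 20 + [0] * 980,   # moderate threshold: 2% lost
    [1] * 150 + [0] * 850,  # aggressive threshold: 15% lost
]
mean_savings = [10.0, 40.0, 70.0]
choice = select_threshold(calibration, mean_savings, alpha=0.15)
print(choice)
```

Under this reading, the selected bound plays the role of the $U$ column and the held-out loss rate plays the role of the Risk column, which would explain why $U$ never exceeds $\alpha$ in the table; that correspondence is an interpretation, not something the appendix states.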
## Appendix I Cost and Serving Profiles

| Task | Model | KV-fork | Prefix-cache | Black-box API |
|---|---|---|---|---|
| AIME-90 | Qwen3-32B | 24.0 | -44.3 | -198.9 |
| AIME-90 | Qwen3-8B | 26.5 | -38.9 | -187.2 |
| GPQA | Qwen3-32B | 53.9 | 21.5 | -64.8 |
| GPQA | Qwen3-8B | 82.2 | 72.8 | 45.9 |
| GSM8K | Qwen3-32B | 63.9 | 53.1 | 10.7 |
| GSM8K | Qwen3-8B | 32.2 | -4.3 | -120.9 |
| MATH-500 | Qwen3-32B | 82.2 | 73.7 | 46.9 |
| MATH-500 | Qwen3-8B | 52.5 | 21.3 | -67.5 |
| MMLU-Pro | Qwen3-32B | 61.6 | 41.7 | -18.4 |
| MMLU-Pro | Qwen3-8B | 81.7 | 76.7 | 59.8 |

### I.1 H100 Serving Profile

| Model | Checkpoints | Mean Latency (s) | Std. (s) | Peak Mem. (GB) |
|---|---|---|---|---|
| Qwen3-32B | 4 | 3.590 | 1.992 | 61.78 |
| Qwen3-32B | 7 | 6.514 | 3.301 | 61.78 |
| Qwen3-32B | 10 | 10.202 | 5.223 | 61.78 |
| Qwen3-8B | 4 | 3.049 | 0.755 | 15.69 |
| Qwen3-8B | 7 | 4.855 | 1.361 | 15.69 |
| Qwen3-8B | 10 | 6.624 | 1.938 | 15.69 |
### I.2 Checkpoint Schedule Sweep

| Schedule | N | Budgets | Peak Gain | Retention | Max Overhead |
|---|---|---|---|---|---|
| 4-linear | 4 | [0, 256, 512, 1536] | 0.114 | 72.5 | 192 |
| 6-linear | 6 | [0, 192, 384, 512, 768, 1536] | 0.151 | 96.3 | 288 |
| 6-log | 6 | [0, 64, 192, 512, 1024, 1536] | 0.122 | 77.8 | 288 |
| 8-hybrid | 8 | [0, 64, 128, 256, 384, 640, 1024, 1536] | 0.143 | 90.7 | 384 |
| 8-linear | 8 | [0, 128, 256, 384, 512, 768, 1024, 1536] | 0.157 | 100.1 | 384 |
| 14-dense | 14 | [0, 64, 128, 192, 256, 320, 384, 448, 512, 640, 768, 896, 1024, 1536] | 0.171 | 108.7 | 672 |
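The Retention and Max Overhead columns are not defined in this appendix. The short check below tests one plausible reading against the tabulated values: that Max Overhead grows by a fixed amount per checkpoint, and that Retention is the peak gain expressed as a percentage of the 8-linear peak gain. Both readings are assumptions inferred from the numbers, and the small residuals are what one would expect if the table was computed from unrounded gains.

```python
# Rows copied from the schedule sweep table:
# (schedule, N, peak gain, reported retention, max overhead).
SWEEP = [
    ("4-linear", 4, 0.114, 72.5, 192),
    ("6-linear", 6, 0.151, 96.3, 288),
    ("6-log", 6, 0.122, 77.8, 288),
    ("8-hybrid", 8, 0.143, 90.7, 384),
    ("8-linear", 8, 0.157, 100.1, 384),
    ("14-dense", 14, 0.171, 108.7, 672),
]
REFERENCE = "8-linear"  # assumed reference schedule for retention

ref_gain = next(gain for name, _, gain, _, _ in SWEEP if name == REFERENCE)

for name, n, gain, reported, overhead in SWEEP:
    per_checkpoint = overhead / n          # overhead attributed to each checkpoint
    implied = 100.0 * gain / ref_gain      # retention implied by the rounded gains
    print(
        f"{name:9s} overhead/N={per_checkpoint:5.1f}  "
        f"implied retention={implied:6.1f}  reported={reported:6.1f}"
    )
```

If this reading holds, adding checkpoints buys gain at a constant marginal overhead, so the dense schedule's extra gain has to be weighed against an overhead that scales linearly with the number of checkpoints.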

## Appendix J Transfer Protocols

Zero-shot transfer uses the source-trained classifier and the source threshold. Target-calibrated transfer uses the source-trained classifier but calibrates a threshold on target calibration data. Target-trained follows the target-calibrated protocol but replaces the source-trained classifier with a target-trained one.

| Label | Type | Source In-dist | Zero-shot | Target-cal. | Target-trained |
|---|---|---|---|---|---|
| GSM8K→MATH500 8B | cross_task | 0.046 | 0.028 | 0.028 | 0.029 |
| GSM8K→MATH500 32B | cross_task | 0.157 | 0.065 | 0.081 | 0.080 |
| GSM8K→MMLU-Pro 8B | cross_task | 0.046 | 0.005 | 0.006 | 0.008 |
| GSM8K→MMLU-Pro 32B | cross_task | 0.157 | 0.013 | 0.021 | 0.025 |
| MATH500→GSM8K 8B | cross_task | 0.049 | 0.028 | 0.045 | 0.017 |
| MATH500→GSM8K 32B | cross_task | 0.088 | 0.159 | 0.179 | 0.169 |
| MMLU-Pro→GSM8K 8B | cross_task | 0.013 | 0.052 | 0.043 | 0.017 |
| MMLU-Pro→GSM8K 32B | cross_task | 0.039 | 0.116 | 0.115 | 0.169 |
| MATH500→MMLU-Pro 8B | cross_task | 0.049 | -0.017 | 0.002 | 0.008 |
| MATH500→MMLU-Pro 32B | cross_task | 0.088 | 0.028 | 0.011 | 0.025 |
| MMLU-Pro→MATH500 8B | cross_task | 0.013 | 0.021 | 0.032 | 0.029 |
| MMLU-Pro→MATH500 32B | cross_task | 0.039 | 0.069 | 0.106 | 0.080 |
| AIME→GSM8K 8B | cross_task | 0.034 | -0.004 | 0.003 | 0.017 |
| AIME→GSM8K 32B | cross_task | 0.009 | 0.014 | 0.023 | 0.169 |
| GSM8K→AIME 8B | cross_task | 0.046 | 0.000 | -0.011 | 0.032 |
| GSM8K→AIME 32B | cross_task | 0.157 | -0.033 | -0.037 | 0.004 |
| Qwen3-8B→32B GSM8K | cross_model | 0.046 | 0.066 | 0.184 | 0.169 |
| Qwen3-32B→8B GSM8K | cross_model | 0.157 | 0.046 | 0.068 | 0.017 |
| Qwen3-8B→DSR1-7B GSM8K | cross_model | 0.046 | -0.001 | 0.004 | 0.085 |
| Qwen3-8B→DSR1-Llama GSM8K | cross_model | 0.046 | 0.000 | -0.009 | -0.014 |
| Qwen3-32B→DSR1-7B GSM8K | cross_model | 0.157 | -0.003 | 0.071 | 0.085 |

![Refer to caption](https://arxiv.org/html/2606.30852v1/x5.png)

Figure 5: Transfer protocols across selected source-target pairs.
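The three protocols defined at the start of this appendix differ only in where the classifier and the threshold come from. The sketch below spells out that pairing. It is a minimal illustration under assumptions: `Stopper`, `calibrate_threshold`, and `build_stoppers` are hypothetical names, the scorers are toy functions rather than trained classifiers, and threshold calibration uses a placeholder objective (stop/continue accuracy on a small grid) in place of whatever selection criterion the experiments actually use.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple

Features = Sequence[float]
Scorer = Callable[[Features], float]  # estimated P(prefix answer is correct)
Dataset = List[Tuple[Features, int]]  # (features, 1 if prefix answer is correct)


@dataclass
class Stopper:
    scorer: Scorer
    threshold: float

    def should_stop(self, x: Features) -> bool:
        return self.scorer(x) >= self.threshold


def calibrate_threshold(scorer: Scorer, data: Dataset, grid: Sequence[float]) -> float:
    """Pick the grid threshold with the best stop/continue accuracy (placeholder)."""
    def accuracy(t: float) -> float:
        return sum((scorer(x) >= t) == bool(y) for x, y in data) / len(data)
    return max(grid, key=accuracy)  # ties resolve to the earliest grid value


def build_stoppers(
    source_scorer: Scorer,
    target_scorer: Scorer,
    source_cal: Dataset,
    target_cal: Dataset,
    grid: Sequence[float],
) -> Dict[str, Stopper]:
    """Pair each protocol with its classifier and threshold source."""
    return {
        # Source classifier, source threshold.
        "zero_shot": Stopper(
            source_scorer, calibrate_threshold(source_scorer, source_cal, grid)),
        # Source classifier, threshold recalibrated on target data.
        "target_calibrated": Stopper(
            source_scorer, calibrate_threshold(source_scorer, target_cal, grid)),
        # Target classifier, threshold calibrated on target data.
        "target_trained": Stopper(
            target_scorer, calibrate_threshold(target_scorer, target_cal, grid)),
    }


# Toy stand-ins for trained classifiers: each reads one feature as its score.
source_scorer: Scorer = lambda x: x[0]
target_scorer: Scorer = lambda x: x[1]
source_cal: Dataset = [([0.9, 0.0], 1), ([0.7, 0.0], 0), ([0.85, 0.0], 1), ([0.6, 0.0], 0)]
target_cal: Dataset = [([0.6, 0.9], 1), ([0.4, 0.2], 0), ([0.7, 0.8], 1), ([0.3, 0.1], 0)]
stoppers = build_stoppers(source_scorer, target_scorer, source_cal, target_cal,
                          grid=[0.5, 0.65, 0.8])
for name, stopper in stoppers.items():
    print(name, stopper.threshold)
```

In this toy setup the source classifier's scores sit lower on the target data, so the zero-shot threshold is too strict there and recalibration moves it down without retraining anything.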
## Appendix K Prompt and Decoding Robustness

| Task | Model | Template | Full Acc. | Peak Gain | Mean Think |
|---|---|---|---|---|---|
| GSM8K | Qwen3-32B | terse | 0.924 | 0.177 | 1052 |
| GSM8K | Qwen3-32B | no_reasoning | 0.947 | 0.124 | 1064 |
| GSM8K | Qwen3-32B | the_answer_is | 0.230 | 0.092 | 1069 |

### K.1 Temperature Sweep

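| Temp. | Seed | N | Full Acc. | Peak Gain | Note |
|---|---|---|---|---|---|
| 0.0 | 0 | 1000 | 0.929 | 0.157 | reference (greedy, main run) |
| 0.6 | 42 | 500 | 0.920 | 0.148 | sampled |
| 0.6 | 123 | 500 | 0.928 | 0.157 | sampled |
| 0.6 | 999 | 500 | 0.938 | 0.179 | sampled |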

## Appendix L Probability Calibration

| Task | Model | ECE | Brier |
|---|---|---|---|
| GSM8K | Qwen3-8B | 0.023 | 0.131 |
| GSM8K | Qwen3-32B | 0.021 | 0.120 |
| MATH-500 | Qwen3-8B | 0.026 | 0.187 |
| MATH-500 | Qwen3-32B | 0.018 | 0.199 |
| MMLU-Pro | Qwen3-8B | 0.016 | 0.236 |
| MMLU-Pro | Qwen3-32B | 0.026 | 0.194 |
| AIME-90 | Qwen3-8B | 0.019 | 0.051 |
| AIME-90 | Qwen3-32B | 0.029 | 0.047 |
| GSM8K | DS-R1-Qwen-7B | 0.032 | 0.114 |
| GSM8K | DS-R1-Llama-8B | 0.028 | 0.192 |
| MATH-500 | DS-R1-Qwen-7B | 0.032 | 0.175 |
| MATH-500 | DS-R1-Llama-8B | 0.036 | 0.145 |
| MMLU-Pro | DS-R1-Qwen-7B | 0.023 | 0.207 |
| MMLU-Pro | DS-R1-Llama-8B | 0.009 | 0.179 |
| GPQA | Qwen3-8B | 0.039 | 0.239 |
| GPQA | Qwen3-32B | 0.033 | 0.230 |
| GPQA | DS-R1-Qwen-7B | 0.072 | 0.218 |
| GPQA | DS-R1-Llama-8B | 0.036 | 0.196 |
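For readers who want to reproduce the two metrics on their own predictions, the sketch below computes the Brier score and an expected calibration error (ECE) from predicted prefix-correctness probabilities and binary labels. The binning is an assumption: this appendix does not state how ECE was binned, so the code uses ten equal-width bins, a common default. The function names and the toy inputs are illustrative only.

```python
from typing import Sequence


def brier_score(probs: Sequence[float], labels: Sequence[int]) -> float:
    """Mean squared difference between predicted probability and outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, labels)) / len(probs)


def expected_calibration_error(
    probs: Sequence[float], labels: Sequence[int], n_bins: int = 10
) -> float:
    """ECE with equal-width bins (assumed): sum_b (n_b / n) * |acc_b - conf_b|."""
    sum_prob = [0.0] * n_bins
    sum_label = [0.0] * n_bins
    for p, y in zip(probs, labels):
        b = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        sum_prob[b] += p
        sum_label[b] += y
    # (n_b / n) * |acc_b - conf_b| equals |sum of labels - sum of probs| / n
    # within each bin, so empty bins contribute zero.
    return sum(abs(sl - sp) for sl, sp in zip(sum_label, sum_prob)) / len(probs)


# Toy predictions (made up) for four checkpoints.
probs = [0.95, 0.85, 0.35, 0.25]
labels = [1, 1, 0, 0]
print(brier_score(probs, labels), expected_calibration_error(probs, labels))
```

The two metrics answer different questions: ECE only measures whether predicted probabilities match observed frequencies, while the Brier score also penalizes uninformative predictions, so a setting with low ECE can still have a comparatively high Brier score.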
## References

- [1] AI-MO (2024). AI-MO/aimo-validation-aime. Note: Hugging Face dataset containing AIME 2022–2024 problems. External Links: [Link](https://huggingface.co/datasets/AI-MO/aimo-validation-aime). Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [2] A. N. Angelopoulos, S. Bates, A. Fisch, L. Lei, and T. Schuster (2024). Conformal risk control. In International Conference on Learning Representations. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px3.p1.1).
- [3] K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021). Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [4] DeepSeek-AI (2025). DeepSeek-R1: incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948. Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1).
- [5] Y. Fu, J. Chen, S. Zhu, Z. Fu, Z. Dai, Y. Zhuang, Y. Ma, A. Qiao, T. S. Rosing, I. Stoica, and H. Zhang (2025). Efficiently scaling LLM reasoning programs with Certaindex. In Advances in Neural Information Processing Systems. External Links: [Link](https://papers.nips.cc/paper_files/paper/2025/hash/d037fd021c9aace128b8ce25001cdb6c-Abstract-Conference.html). Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1), [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px2.p1.1).
- [6] T. Han, Z. Wang, C. Fang, S. Zhao, S. Ma, and Z. Chen (2025). Token-budget-aware LLM reasoning. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 24842–24855. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1274). Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px1.p1.1).
- [7] D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt (2021). Measuring mathematical problem solving with the MATH dataset. In Advances in Neural Information Processing Systems. Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [8] H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2024). Let’s verify step by step. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=v8L0pN6EOi). Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [9] Mathematical Association of America (2026). MAA invitational competitions: American Invitational Mathematics Examination. External Links: [Link](https://maa.org/maa-invitational-competitions/). Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [10] D. Min, G. Vaccarino, H. Chen, Y. Wu, G. Yona, and L. Cheng (2026). Stop when reasoning converges: semantic-preserving early exit for reasoning models. arXiv:2605.17672. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px2.p1.1).
- [11] N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candès, and T. Hashimoto (2025). S1: simple test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pp. 20275–20321. External Links: [Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1025), [Link](https://aclanthology.org/2025.emnlp-main.1025/). Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px1.p1.1).
- [12] A. Nagle, J. Saydaliev, D. Garbaya, M. Gastpar, A. V. Makkuva, and H. Kim (2026). TERMINATOR: learning optimal exit points for early stopping in chain-of-thought reasoning. arXiv:2603.12529. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px3.p1.1).
- [13] V. Quach, A. Fisch, T. Schuster, A. Yala, J. H. Sohn, T. Jaakkola, and R. Barzilay (2024). Conformal language modeling. In International Conference on Learning Representations. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px3.p1.1).
- [14] Qwen Team (2025). Qwen3 technical report. arXiv:2505.09388. Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1), [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px1.p1.1).
- [15] D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman (2023). GPQA: a graduate-level Google-proof Q&A benchmark. arXiv:2311.12022. Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [16] P. Tikhonov, I. V. Oseledets, and E. Tutubalina (2026). Confidence leaps in LLM reasoning: early stopping and cross-model transfer. In Proceedings of the 2026 Conference of the European Chapter of the Association for Computational Linguistics: Short Papers. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px2.p1.1).
- [17] X. Wang, J. McInerney, L. Wang, and N. Kallus (2025). Entropy after `</Think>` for reasoning model early exiting. arXiv:2509.26522. Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1), [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px2.p1.1).
- [18] X. Wang, A. Suresh, A. Zhang, R. More, W. Jurayj, B. Van Durme, M. Farajtabar, D. Khashabi, and E. Nalisnick (2026). Conformal thinking: risk control for reasoning on a compute budget. arXiv:2602.03814. Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1), [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px3.p1.1).
- [19] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen (2024). MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems Datasets and Benchmarks Track. Cited by: [§4.1](https://arxiv.org/html/2606.30852#S4.SS1.p1.1).
- [20] M. Wu, C. Zhou, S. Bates, and T. Jaakkola (2025). Thought calibration: efficient and confident test-time scaling. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. Cited by: [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px3.p1.1).
- [21] C. Yang, Q. Si, Y. Duan, Z. Zhu, C. Zhu, Q. Li, M. Chen, Z. Lin, and W. Wang (2026). Dynamic early exit in reasoning models. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=NpU7ZXafRi). Cited by: [§1](https://arxiv.org/html/2606.30852#S1.p1.1), [§2](https://arxiv.org/html/2606.30852#S2.SS0.SSS0.Px2.p1.1).