MARS: Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
Summary
This paper introduces MARS, a stopping rule for parallel LLM test-time scaling that probes partial traces to stop early without sacrificing accuracy, saving 25–47% of tokens across reasoning models on competition math benchmarks.
View Cached Full Text
Cached at: 06/12/26, 08:54 AM
# Margin-Adversarial Risk-controlled Stopping for Parallel LLM Test-time Scaling
Source: [https://arxiv.org/html/2606.12935](https://arxiv.org/html/2606.12935)
Wenbo Chen11footnotemark:1,♠\{\}^\{\\;\\,,\\spadesuit\}Puheng Li11footnotemark:1,★\{\}^\{\\;\\,,\\bigstar\}Mengyang Liu,♠\{\}^\{\\;\\,,\\spadesuit\}Weijie Su◇Tianpei Xie♠ ♠Amazon★Stanford University◇University of PennsylvaniaEqual contribution\. Correspondence towenbochen111@outlook\.com\.Work conducted outside of the authors’ roles at Amazon\.
###### Abstract
Parallel test\-time scaling samples many reasoning traces and majority\-votes their answers, improving LLM accuracy but requiring traces to run to completion, incurring substantial computational overhead\. We observe that probing partial traces at intermediate checkpoints can extract current answers without disrupting generation, revealing an evolving aggregate vote\. Based on this observation, we introduce*MARS*, a margin\-adversarial stopping rule that estimates which active traces are likely to change their answers and stops once the leader remains safe under a conservative bound on future vote movement\. The rule separates two sources of uncertainty\. It learns the trace\-level switch probabilities that determine how much of the current margin is likely to be retained, while handling the harder question of where switching traces land through an adversarial bound calibrated from warmup traces\. With true switch probabilities, MARS guarantees with high probability that the early\-stopped answer matches the full\-budget vote\. In practice, a five\-feature logistic model closely matches oracle switching behavior\. Across three reasoning models and three competition\-math benchmarks, MARS saves 25–47% of self\-consistency tokens and 14–29% on top of DeepConf Online, a strong confidence\-weighted baseline that already filters and truncates weak traces, while matching the accuracy of the corresponding full\-budget baselines\.
Figure 1:Left:Token savings achieved by MARS across three models \(DeepSeek\-R1\-8B, Qwen3\-32B, Qwen3\-next\-80B\) and three competition math benchmarks, under both self\-consistency \(SC\) and DeepConf Online\(Fuet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib2)\)voting, while matching the accuracy\. Error bars show 95% CI across questions\.Right:MARS in action on HMMT Q22 \(DeepSeek\-R1\-8B\)\. Top: vote share evolution across probes\. Bottom: minimum slack \(binding challenger margin minus adversarial threshold\); MARS stops when slack crosses zero, keeping the correct answer and saving 59% of tokens\.## 1Introduction
Parallel test\-time scaling, sampling many reasoning traces and aggregating their answers by majority vote, is among the most reliable ways to improve LLM accuracy on hard problems\(Wanget al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib3); Brownet al\.,[2024](https://arxiv.org/html/2606.12935#bib.bib4); Snellet al\.,[2024](https://arxiv.org/html/2606.12935#bib.bib5)\)\. But reliability comes at a cost: a 512\-trace run on a single competition\-math problem generates millions of tokens\. On most questions, the winning answer is decided well before all traces finish\. The remaining computation is pure waste\.
A key enabler is that modern reasoning models expose long thinking traces\(Yanget al\.,[2025a](https://arxiv.org/html/2606.12935#bib.bib1)\), and recent probing work shows that intermediate answers can be elicited from partial traces during parallel thinking\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\)\. This makes*probing*possible: extracting intermediate answers from partial traces mid\-generation without disrupting the ongoing reasoning process\. Probing transforms parallel decoding from opaque to observable\. At each checkpoint we know who is voting for what, how the plurality has shifted, and how many traces remain undecided\. Despite its power, probing for early stopping remains largely unexplored, with only one concurrent method\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\)exploiting it in the parallel\-vote setting\.
Figure 2:Consensus stopping fails on hard questions\. On HMMT Q6 \(DeepSeek\-8B\), 89% of traces initially vote for the wrong answer and the wrong answer leads till the middle of generation\. Consensus stopping fires early, locking in the error\. MARS waits until the correct answer overtakes and certifies it\.But observability is not the same as a stopping criterion\. Given a live view of the vote, when is it safe to stop? The naive answer, stop when the majority answer has been stable for several checkpoints, is a consensus heuristic\. And consensus heuristics fail precisely when they matter most: on hard problems\. Fig\.[2](https://arxiv.org/html/2606.12935#S1.F2)illustrates a typical failure mode\. At an early checkpoint the majority of traces converge on an*incorrect*preliminary answer\. A consensus\-based rule sees stability and stops, locking in the wrong answer\. With more computation, a smaller group of traces would have switched to the correct answer and eventually overtaken the early leader\. The concurrent Parallel\-Probe method\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\)instantiates exactly this failure: on HMMT with DeepSeek\-8B, consensus stopping drops accuracy from 70% to 35%, cutting performance in half while ostensibly “saving tokens\.”
The failure is not accidental\. Hard questions are precisely those where the correct answer emerges late, after many traces revise their initial reasoning\. Any rule that equates*stability*with*safety*will systematically stop too early on the questions where early stopping is most dangerous\. This reveals the core research question:
*Can we stop parallel generation early, provably preserving the final vote outcome, while still saving a substantial fraction of tokens?*
The key insight is that the right object is not whether the vote has changed, but whether it*can*still change\. A lead of 300 votes is safe if the remaining uncertain traces cannot plausibly coordinate 150 switches toward a single challenger\. The same lead is fragile if most leader\-supporting traces are still mid\-reasoning and likely to revise\. Safety is a*margin*question: the current lead, minus the worst\-case damage from future switches, must remain positive for every challenger simultaneously\.
We introduce MARS \(Margin\-AdversarialRisk\-controlledStopping\), a principled early\-stopping rule for parallel LLM decoding\. At each checkpoint, MARS \(i\) probes every active trace for its current answer, \(ii\) estimates the probability that each trace will switch before the full\-budget endpoint, and \(iii\) computes the maximum margin loss that future switches could adversarially cause against each challenger\. Generation stops only when every challenger is certified: the leader’s margin exceeds the adversarial switch damage\. With true switch probabilities, this yields a high\-probability guarantee that the early\-stopped answer matches the full\-budget vote\.
The modeling problem separates cleanly into two parts\.*Whether*a trace will switch is learnable from its own history \(checkpoint position, probe confidence, answer\-flip count, stability streak\) using a lightweight logistic model trained on a handful of warmup traces per question\.*Where*a switching trace will land is fundamentally harder: the destination is an open\-ended answer string that depends on future reasoning\. Rather than learning a full destination distribution, MARS treats destinations adversarially and calibrates a per\-question contraction parameter from warmup evidence to capture how much uncertain switch mass actually behaves adversarially in practice\.
On AIME 2025, HMMT, and BRUMO 2025, across DeepSeek\-R1\-8B, Qwen3\-32B, and Qwen3\-next, MARS saves 25–47% of self\-consistency tokens and an additional 14–29% on top of DeepConf Online\(Fuet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib2)\), while matching the full\-budget baseline across all 18 settings \(within 0\.6 percentage points where accuracy changes, and improving it by up to 0\.8 points in several settings; Fig\.[1](https://arxiv.org/html/2606.12935#S0.F1)\)\. In direct comparison, Parallel\-Probe\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\)achieves comparable savings only by sacrificing 9–35 percentage points of accuracy; when tuned to preserve accuracy, its savings collapse to≤4%\{\\leq\}4\\%\. The results confirm that principled margin certification, not consensus detection, is the right foundation for safe early stopping in parallel reasoning\.
### 1\.1Problem setup
We formalize the stopping problem for a single prompt\. The system launchesNNreasoning traces in parallel\. Each trace would ordinarily run to a full\-budget endpointTT\. We observe the traces at checkpointst1<⋯<tP=Tt\_\{1\}<\\cdots<t\_\{P\}=T\.
#### Trace states\.
At checkpointtt, the base sampler reports a status for each trace:
sj\(t\)∈\{running,finished,discarded\}\.s\_\{j\}\(t\)\\in\\\{\\mathrm\{running\},\\mathrm\{finished\},\\mathrm\{discarded\}\\\}\.A*running*trace is still generating and may change its future answer\. A*finished*trace has produced a terminal answer, so its contribution is fixed\. A*discarded*trace has been deprecated by the base pipeline \(e\.g\., filtered for low confidence\); it contributes zero weight\. The active set is𝒜t=\{j:sj\(t\)=running\}\\mathcal\{A\}\_\{t\}=\\\{j:s\_\{j\}\(t\)=\\mathrm\{running\}\\\}, andℱt\\mathcal\{F\}\_\{t\}denotes all information available at checkpointtt\.
#### Probing\.
For a running trace, we extract its current answer by appending a short stop\-thinking instruction to the partial generation, collecting the answer, and discarding the probe suffix\. The original trace continues unmodified from the checkpoint\. This matches the behavior of thinking\-mode models that can produce an answer when interrupted\(Yanget al\.,[2025a](https://arxiv.org/html/2606.12935#bib.bib1)\)\. We denote the extracted answeraj\(t\)a\_\{j\}\(t\); for finished traces this is the terminal answer, and for discarded tracesaj\(t\)=∅a\_\{j\}\(t\)=\\varnothing\.
#### Voting\.
Each trace carries an effective nonneg weightwj∈\[0,wmax\]w\_\{j\}\\in\[0,w\_\{\\max\}\], withwj=0w\_\{j\}=0for discarded traces\. The vote for answera≠∅a\\neq\\varnothingat checkpointttis
Va\(t\)=∑j=1Nwj1\{aj\(t\)=a\},V\_\{a\}\(t\)=\\sum\_\{j=1\}^\{N\}w\_\{j\}\\,\\mathbf\{1\}\\\{a\_\{j\}\(t\)=a\\\},and the current leader isL\(t\)=argmaxaVa\(t\)L\(t\)=\\arg\\max\_\{a\}V\_\{a\}\(t\)with deterministic tie\-breaking\. Uniform weights recover self\-consistency; confidence\-derived weights recover schemes like DeepConf\(Fuet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib2)\)\.
#### Stopping\.
We seek a stopping timeτ≤T\\tau\\leq Tthat saves generation cost while preserving the full\-budget answer with high probability\. For a user\-specified risk levelδ∈\(0,1\)\\delta\\in\(0,1\):
P\(L\(τ\)≠L\(T\)\)≤δ\.P\\bigl\(L\(\\tau\)\\neq L\(T\)\\bigr\)\\leq\\delta\.This is a guarantee relative to the full\-budget vote, not directly to the unknown ground truth\. Benchmark accuracy is evaluated separately in §[3](https://arxiv.org/html/2606.12935#S3)\.
## 2Margin\-Adversarial Risk\-controlled Stopping \(MARS\)
Figure 3:Illustration of MARS on a single question\. At each checkpoint, active traces are probed for current answers\. The margin test checks whether the leader’s advantage over every challenger exceeds the expected adversarial switch damage\. Generation stops only when all challengers are certified\.The key idea behind MARS is simple: the current leader is safe to output when its margin over every challenger exceeds the worst\-case damage that future answer switches could inflict \(Fig\.[3](https://arxiv.org/html/2606.12935#S2.F3)\)\. We first define the margin and present the stopping rule with its safety guarantee \(§[2\.1](https://arxiv.org/html/2606.12935#S2.SS1)–§[2\.2](https://arxiv.org/html/2606.12935#S2.SS2)\), introduce a calibrated relaxation \(§[2\.3](https://arxiv.org/html/2606.12935#S2.SS3)\), and describe the practical algorithm \(§[2\.4](https://arxiv.org/html/2606.12935#S2.SS4)\)\.
### 2\.1The margin as stopping criterion
Consider 512 parallel traces, each producing an intermediate answer at a checkpoint\. Suppose the current leader has 300 votes and the nearest challenger has 150\. Is it safe to stop? The answer depends on how many traces might still change their minds\. If only 20 traces are still mid\-reasoning, the lead is unassailable\. If 400 traces are volatile, the lead is fragile\. The relevant quantity is not the margin alone, but the margin relative to remaining uncertainty\.
We formalize this intuition\. At checkpointtt, define the margin of the leaderL=L\(t\)L=L\(t\)over each challengerkkas
Mk\(t\)=VL\(t\)−Vk\(t\)\.M\_\{k\}\(t\)=V\_\{L\}\(t\)\-V\_\{k\}\(t\)\.IfMk\(T\)\>0M\_\{k\}\(T\)\>0for every challengerkkat the full\-budget endpointTT, then the checkpoint leader survives\. Conversely, the full\-budget winner differs only if some challenger achievesMk\(T\)≤0M\_\{k\}\(T\)\\leq 0\. The stopping problem reduces to certifying, for every challenger simultaneously, that its margin cannot be closed by future vote changes\.
The checked challenger set𝒦\(t\)\\mathcal\{K\}\(t\)contains every nonleader answer with positive current vote, plus a synthetic unseen challenger⊥\\botwithV⊥\(t\)=0V\_\{\\bot\}\(t\)=0\. The⊥\\botchallenger covers answer strings that are absent at checkpointttbut could appear later: it pessimistically allows all undecided mass to coordinate on a novel answer\.
### 2\.2The stopping rule
Between checkpointttand the full budgetTT, some active traces will change their answers\. LetXj=𝟏\{aj\(T\)≠aj\(t\)\}X\_\{j\}=\\mathbf\{1\}\\\{a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\\}indicate whether tracejjswitches, and letqj\(t\)=P\(Xj=1∣ℱt\)q\_\{j\}\(t\)=P\(X\_\{j\}=1\\mid\\mathcal\{F\}\_\{t\}\)be its conditional switch probability, estimated from observable features of the trace \(position, confidence, flip history, streak length\)\. Finished and discarded traces haveqj=0q\_\{j\}=0\.
When tracejjswitches, the damage it inflicts on the margin against challengerkkdepends on its current vote\. We bound this damage adversarially by considering the worst\-case destination:
###### Definition 2\.1\(Adversarial switch cost\)\.
cjk=\{2wj,aj\(t\)=L,−wj,aj\(t\)=k,wj,aj\(t\)∉\{L,k\}\.c\_\{j\}^\{k\}=\\begin\{cases\}2w\_\{j\},&a\_\{j\}\(t\)=L,\\\\ \-w\_\{j\},&a\_\{j\}\(t\)=k,\\\\ w\_\{j\},&a\_\{j\}\(t\)\\notin\\\{L,k\\\}\.\\end\{cases\}\(1\)
This is the worst case scenario, since: 1\) a leader\-voter switching away removeswjw\_\{j\}from the leader and at worst addswjw\_\{j\}to challengerkk, for total damage2wj2w\_\{j\}; 2\) a challenger\-voter leaving*helps*the leader regardless of destination, giving−wj\-w\_\{j\}; 3\) a neutral voter at worst joinskk, costingwjw\_\{j\}\. The intuition is simple: a challenger with volatile supporters is less threatening than its vote count suggests\.
The expected adversarial damage against challengerkkis
Γk\(t\)=∑j∈𝒜tqjcjk\.\\Gamma\_\{k\}\(t\)=\\sum\_\{j\\in\\mathcal\{A\}\_\{t\}\}q\_\{j\}\\,c\_\{j\}^\{k\}\.\(2\)MARS stops when the current margin exceeds this expected damage plus a concentration correction for every challenger:
Mk\(t\)≥Γk\(t\)\+ϵ\(N,δ\)for allk∈𝒦\(t\)\.M\_\{k\}\(t\)\\geq\\Gamma\_\{k\}\(t\)\+\\epsilon\(N,\\delta\)\\qquad\\text\{for all \}k\\in\\mathcal\{K\}\(t\)\.\(3\)
###### Assumption 2\.2\(Bounded weights\)\.
For all traces,0≤wj≤wmax0\\leq w\_\{j\}\\leq w\_\{\\max\}\.
###### Assumption 2\.3\(Conditional independence\)\.
Conditional onℱt\\mathcal\{F\}\_\{t\}, the switch indicators\{Xj\}j∈𝒜t\\\{X\_\{j\}\\\}\_\{j\\in\\mathcal\{A\}\_\{t\}\}are independent\.
###### Theorem 2\.4\(Per\-challenger safety\)\.
Under Assumptions[2\.2](https://arxiv.org/html/2606.12935#S2.ThmTheorem2)–[2\.3](https://arxiv.org/html/2606.12935#S2.ThmTheorem3), using true switch probabilitiesqj\(t\)q\_\{j\}\(t\),
P\(Mk\(T\)≤0∣ℱt\)≤exp\(−\(Mk\(t\)−Γk\(t\)\)\+22wmax2Nactive\),P\\\!\\left\(M\_\{k\}\(T\)\\leq 0\\mid\\mathcal\{F\}\_\{t\}\\right\)\\leq\\exp\\\!\\left\(\-\\frac\{\\left\(M\_\{k\}\(t\)\-\\Gamma\_\{k\}\(t\)\\right\)\_\{\+\}^\{2\}\}\{2w\_\{\\max\}^\{2\}N\_\{\\mathrm\{active\}\}\}\\right\),\(4\)whereNactive=max\{1,\|𝒜t\|\}N\_\{\\mathrm\{active\}\}=\\max\\\{1,\|\\mathcal\{A\}\_\{t\}\|\\\}\.
###### Corollary 2\.5\(Risk\-controlled stop\)\.
Set
ϵ\(N,δ\)=wmax2NactivelogNδ\.\\epsilon\(N,\\delta\)=w\_\{\\max\}\\sqrt\{2N\_\{\\mathrm\{active\}\}\\log\\frac\{N\}\{\\delta\}\}\.\(5\)If Eq\.[3](https://arxiv.org/html/2606.12935#S2.E3)holds for everyk∈𝒦\(t\)k\\in\\mathcal\{K\}\(t\), thenP\(L\(T\)≠L\(t\)∣ℱt\)≤δP\\bigl\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\\bigr\)\\leq\\delta\.
The proof applies Hoeffding’s inequality to∑jXjcjk\\sum\_\{j\}X\_\{j\}c\_\{j\}^\{k\}and takes a union bound over challengers; details are in Appendix[A](https://arxiv.org/html/2606.12935#A1)\. The lightweight logistic model that estimatesqjq\_\{j\}from warmup traces is described in §[3](https://arxiv.org/html/2606.12935#S3)\.
### 2\.3Calibrated conservatism viaγ\\gamma
The fully adversarial costcjkc\_\{j\}^\{k\}assumes that*all*switching mass coordinates against the same challenger\. In practice, switching traces disperse across many answers, and some even join the leader\. This makes the rule overly cautious\.
We relax the adversarial costs with a contraction parameterγ∈\[1/2,1\]\\gamma\\in\[1/2,1\]:
cjk\(γ\)=\{2γwj,aj\(t\)=L,−wj,aj\(t\)=k,γwj,aj\(t\)∉\{L,k\}\.c\_\{j\}^\{k\}\(\\gamma\)=\\begin\{cases\}2\\gamma w\_\{j\},&a\_\{j\}\(t\)=L,\\\\ \-w\_\{j\},&a\_\{j\}\(t\)=k,\\\\ \\gamma w\_\{j\},&a\_\{j\}\(t\)\\notin\\\{L,k\\\}\.\\end\{cases\}\(6\)The lower boundγ≥1/2\\gamma\\geq 1/2preserves the irreducible cost: when a leader\-voter switches, it loses at leastwjw\_\{j\}from the leader regardless of destination\. Only the “which challenger benefits” portion is contracted\. The calibrated stopping criterion becomes
Mk\(t\)≥∑j∈𝒜tqjcjk\(γ\)for allk∈𝒦\(t\)\.M\_\{k\}\(t\)\\geq\\sum\_\{j\\in\\mathcal\{A\}\_\{t\}\}q\_\{j\}\\,c\_\{j\}^\{k\}\(\\gamma\)\\qquad\\text\{for all \}k\\in\\mathcal\{K\}\(t\)\.\(7\)Eq\.[7](https://arxiv.org/html/2606.12935#S2.E7)withγ<1\\gamma<1or estimatedq^j\\hat\{q\}\_\{j\}is an empirical relaxation; Corollary[2\.5](https://arxiv.org/html/2606.12935#S2.ThmTheorem5)applies only to the certified variant withγ=1\\gamma=1and true switch probabilities\.
We calibrateγ\\gammaper question from warmup traces run to completion\. Letγwarmup\\gamma\_\{\\mathrm\{warmup\}\}be the smallestγ\\gammathat would have stopped correctly on the warmup set \(i\.e\., without changing the warmup full\-budget winner\)\. We add an upper\-confidence correction:
γout=min\(1,γwarmup\+znelig\),\\gamma\_\{\\mathrm\{out\}\}=\\min\\\!\\left\(1,\\;\\gamma\_\{\\mathrm\{warmup\}\}\+\\frac\{z\}\{\\sqrt\{n\_\{\\mathrm\{elig\}\}\}\}\\right\),\(8\)wherenelign\_\{\\mathrm\{elig\}\}is the number of eligible warmup checkpoints andz=1\.0z=1\.0in all experiments\.
### 2\.4Practical implementation
The stopping rule \(Eq\.[7](https://arxiv.org/html/2606.12935#S2.E7)\) requires two learned quantities per question: the switch probabilitiesqj\(t\)q\_\{j\}\(t\)and the contractionγ\\gamma\. Both are estimated from a small set of 16 warmup traces run to completion before the main sample is stopped\. We fit a per\-question logistic regression on \(trace, checkpoint\) pairs, labeled by whether the trace’s final answer differs from its checkpoint answer, using fiveℱt\\mathcal\{F\}\_\{t\}\-measurable features: checkpoint position, probe confidence, answer\-flip count, streak length, and confidence trend\. This givesq^j\(t\)\\hat\{q\}\_\{j\}\(t\)for every active trace at every checkpoint \(details in Appendix[C](https://arxiv.org/html/2606.12935#A3)\)\. The same warmup traces calibrateγ\\gammavia Eq\.[8](https://arxiv.org/html/2606.12935#S2.E8)\.
At each checkpoint,𝒦\(t\)\\mathcal\{K\}\(t\)includes every nonleader answer with positive votes, plus an unseen challenger⊥\\botwith zero current votes\. For⊥\\bot, no trace is akk\-voter, so there is no−wj\-w\_\{j\}benefit: the cost iscj⊥=2γwjc\_\{j\}^\{\\bot\}=2\\gamma w\_\{j\}for leader\-voters andγwj\\gamma w\_\{j\}for all others\. This guards against novel answers emerging after checkpointtt\. The complete procedure is given in Algorithm[1](https://arxiv.org/html/2606.12935#alg1)\.
Algorithm 1MARS: Margin\-Adversarial Early Stopping1:Input:trace budget
NN, checkpoints
\{t1,…,tP\}\\\{t\_\{1\},\\ldots,t\_\{P\}\\\}, warmup traces, calibration strength
zz\.
2:Run warmup traces to completion;
3:Fit logistic regression
q^\\hat\{q\}on warmup \(trace, checkpoint\) pairs\.
4:Calibrate
γ\\gammafrom warmup via Eq\.[8](https://arxiv.org/html/2606.12935#S2.E8)\.
5:Launch the
NNmain traces in parallel\.
6:forcheckpoint
t∈\{t1,…,tP\}t\\in\\\{t\_\{1\},\\ldots,t\_\{P\}\\\}do
7:Update trace statuses and active set
𝒜t\\mathcal\{A\}\_\{t\}\.
8:Probe running traces for current answers
aj\(t\)a\_\{j\}\(t\)\.
9:Compute votes, leader
L\(t\)L\(t\), margins
Mk\(t\)M\_\{k\}\(t\)\.
10:Estimate
q^j\(t\)\\hat\{q\}\_\{j\}\(t\)for
j∈𝒜tj\\in\\mathcal\{A\}\_\{t\}; set
q^j\(t\)=0\\hat\{q\}\_\{j\}\(t\)=0for inactive traces\.
11:Build
𝒦\(t\)\\mathcal\{K\}\(t\): observed challengers
∪\\cupunseen challenger
⊥\\bot\.
12:if
Mk\(t\)≥∑j∈𝒜tq^jcjk\(γ\)M\_\{k\}\(t\)\\geq\\sum\_\{j\\in\\mathcal\{A\}\_\{t\}\}\\hat\{q\}\_\{j\}\\,c\_\{j\}^\{k\}\(\\gamma\)for all
k∈𝒦\(t\)k\\in\\mathcal\{K\}\(t\)then return
L\(t\)L\(t\)\.
13:endfor
14:returnthe full\-budget leader
L\(T\)L\(T\)\.
## 3Experiments
### 3\.1Setup
#### Datasets and models\.
We evaluate on three competition mathematics benchmarks, a standard stress test for LLM reasoning\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.12935#bib.bib38)\): AIME 2025, HMMT, and BRUMO 2025, with 30 problems each\. We test three reasoning models: DeepSeek\-R1\-Distill\-Qwen\-8B\(DeepSeek\-AIet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib23)\), Qwen3\-32B, and Qwen3\-next\(Yanget al\.,[2025a](https://arxiv.org/html/2606.12935#bib.bib1)\)\. For each model–dataset pair, we generate a pool of 4,096 complete traces per problem and probe intermediate answers every 2,048 tokens\. This gives 9 model–dataset settings\.
#### Evaluation protocol\.
We simulate 512\-trace parallel runs by bootstrap sampling from the 4,096\-trace pool, using 64 iterations per question\. In each bootstrap run, 16 traces are designated as warmup traces for switch\-model fitting,γ\\gammacalibration, and DeepConf threshold calibration\. The remaining traces form the main parallel sample on which stopping decisions are evaluated\. Token savings are reported relative to the corresponding full\-budget voting baseline:
savings=meanquestions\(1−tokensMARStokensbaseline\)\.\\mathrm\{savings\}=\\mathrm\{mean\}\_\{\\text\{questions\}\}\\left\(1\-\\frac\{\\mathrm\{tokens\}\_\{MARS\}\}\{\\mathrm\{tokens\}\_\{\\mathrm\{baseline\}\}\}\\right\)\.Accuracy is the benchmark accuracy of the answer returned by the stopped or full\-budget vote\. Token savings include the 16 warmup traces in the denominator; per\-checkpoint probe completions \(up to 50 tokens each\) are excluded from the count as they are small relative to full trace lengths\.
#### Voting pipelines\.
We evaluate MARS as an early\-stopping layer on top of two voting pipelines\. Self\-consistency \(SC\) uses uniform weightswj=1w\_\{j\}=1and majority vote\(Wanget al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib3)\)\. DeepConf Online \(DCO\)\(Fuet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib2)\)weights and filters traces by confidence; traces below a warmup\-calibrated threshold are discarded\. We remove DeepConf’s adaptive sampling component, as it introduces sequential dependencies after the warm\-up phase and therefore breaks full parallelism during inference\. MARS operates on the resulting effective vote identically in both cases, testing whether aggregate\-level margin stopping provides savings beyond per\-trace confidence stopping\.
#### Compared methods\.
For each pipeline, we report the fully conservative variant \(γ=1\\gamma=1\), the calibrated variant \(§[2\.4](https://arxiv.org/html/2606.12935#S2.SS4)\), and an oracle\-qqdiagnostic that replaces the learned switch model with the retrospective indicatorqj∗=𝟏\{aj\(t\)≠aj\(T\)\}q\_\{j\}^\{\*\}=\\mathbf\{1\}\\\{a\_\{j\}\(t\)\\neq a\_\{j\}\(T\)\\\}\. Full per\-method numbers are in Appendix Tables[5](https://arxiv.org/html/2606.12935#A5.T5),[6](https://arxiv.org/html/2606.12935#A5.T6), and[7](https://arxiv.org/html/2606.12935#A5.T7)\.
### 3\.2Main results
Table 1:Accuracy \(%\) and token savings for calibrated MARS\. “Baseline” is the full\-budget voting accuracy; “\+MARS” is the early\-stopped accuracy\. Across all 18 settings MARS preserves accuracy within 0\.6pp where it changes and improves it by up to 0\.8pp in several settings, while saving 25–47% \(SC\) and 14–29% \(DeepConf\) of tokens\.#### Token savings\.
Table[1](https://arxiv.org/html/2606.12935#S3.T1)and Fig\.[1](https://arxiv.org/html/2606.12935#S0.F1)show that MARS saves 25–47% of tokens under SC and an additional 14–29% on top of DeepConf\. DeepConf starts from a stronger and cheaper baseline because it already filters and truncates low\-confidence traces, so the additional savings from MARS are smaller but still consistent\. Savings are largest in settings where the model reaches consensus early, such as Qwen3\-next on BRUMO\.
#### Accuracy\.
Across all 18 settings, early\-stopped accuracy is preserved: where it changes it stays within 0\.6 percentage points of the full\-budget baseline, and in several settings it improves by up to 0\.8 points \(e\.g\., Qwen3\-next on AIME under SC, 86\.9%→\\rightarrow87\.7%, and Qwen3\-32B on BRUMO under DeepConf, 92\.4%→\\rightarrow93\.0%\)\. Most settings match exactly; the largest drop is Qwen3\-32B on BRUMO under SC \(90\.9%→\\rightarrow90\.3%\)\.
### 3\.3Ablations
Figure 4:Effect of destination calibration \(γ\\gamma\) and switch probability estimation\.a:Per\-questionγ\\gammacalibration adds 4–7pp savings under SC and 2–5pp under DeepConf beyond the fully conservativeγ=1\\gamma\{=\}1variant, confirming that warmup traces contain useful information about destination behavior\.b:The learned 5\-feature logistic model closely matches oracle\-qq, with a gap of only 1–4pp, indicating that the switch\-event signal is well captured by trace\-intrinsic features\.Fig\.[4](https://arxiv.org/html/2606.12935#S3.F4)summarizes two key ablations across all model–dataset settings\.
#### Destination calibration\.
Even the fully conservative rule \(γ=1\\gamma=1\) achieves large savings, but per\-questionγ\\gammacalibration adds 4–7 points under SC and 2–5 points on top of DeepConf\. This shows that warmup traces contain useful information about how much switch mass actually behaves adversarially\. The calibrated contraction captures part of the destination effect without fitting a full destination model\.
#### Switch probability estimation\.
The learned switch model closely matches oracle\-qqacross all settings, with a consistent gap of only 1–4 percentage points\. This confirms that the five trace\-intrinsic features capture the switch\-event signal well\. Together with theγ\\gammaablation, this supports the two\-part design: learn which traces are likely to switch, then use per\-question contraction to approximate aggregate destination behavior\.
### 3\.4Comparison with Parallel\-Probe
We compare against Parallel\-Probe\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\), the closest concurrent method in our setting\. Parallel\-Probe uses two heuristics: \(i\)*consensus early stopping*, halt when the majority answer is unchanged foruuconsecutive probes, and \(ii\)*deviation\-based pruning*, remove any trace disagreeing with consensus forkkconsecutive probes\. Since the original method probes every 500 tokens while our data probes every 2,048 tokens, we run three configurations: a*token\-equivalent*mapping \(u=3u\{=\}3,k=2k\{=\}2, warmup=4\{=\}4probes, matching their∼7,000\{\\sim\}7\{,\}000\-token stability window\), a conservative variant \(u=4u\{=\}4\), and their*literal*parameters \(u=14u\{=\}14,k=7k\{=\}7, warmup=15\{=\}15\)\. All Parallel\-Probe runs use uniform weighting and no filtering, matching their published setup\.
Table 2:Comparison with Parallel\-Probe \(PP\)\. Accuracy \(%\) and per\-question token savings \(%\)\. PP achieves comparable savings only by sacrificing a lot of accuracy\. Under its literal parameters \(u=14u\{=\}14\), PP preserves accuracy but saves≤4%\{\\leq\}4\\%tokens \(not shown\)\. MARS achieves 25–47% savings without accuracy loss\.#### The accuracy–savings dilemma\.
Table[2](https://arxiv.org/html/2606.12935#S3.T2)reveals that Parallel\-Probe faces a fundamental tradeoff\. With token\-equivalent parameters \(u=3u\{=\}3\), it achieves 20–51% savings but loses 6–45pp accuracy, severely on harder benchmarks \(HMMT: 34\.6% vs\. the 70\.0% baseline on DeepSeek\-8B\)\. With their literal parameters \(u=14u\{=\}14\), accuracy is preserved but savings collapse to≤4%\{\\leq\}4\\%: most traces have fewer than 14 probe positions, so the stopping criterion never fires\. In contrast, MARS achieves 25–47% savings while preserving accuracy in all 9 settings\.
#### Why mode stability fails\.
The root cause is that PP’s consensus stability condition conflates*having been stable*with*being safe to stop*\. Fig\.[2](https://arxiv.org/html/2606.12935#S1.F2)illustrates this on a concrete example: at early checkpoints, 89% of traces vote for the wrong answer and the majority appears stable, so consensus stopping fires\. The correct answer only overtakes much later\. PP’s pruning exacerbates the problem: branches disagreeing with a \(potentially incorrect\) early consensus are eliminated, destroying the diversity needed for self\-correction\. MARS’s margin\-based criterion directly quantifies the remaining adversarial risk and fires only when that risk is provably insufficient to overturn the leader\.
## 4Related work
#### Self\-consistency and parallel sampling\.
Chain\-of\-thought and related decomposition prompts improve multi\-step reasoning\(Weiet al\.,[2022](https://arxiv.org/html/2606.12935#bib.bib18); Kojimaet al\.,[2022](https://arxiv.org/html/2606.12935#bib.bib26); Zhouet al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib27)\)\. Self\-consistency\(Wanget al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib3)\)builds on this by sampling multiple reasoning paths and taking a majority vote, extended to universal self\-consistency for free\-form answers\(Chenet al\.,[2023b](https://arxiv.org/html/2606.12935#bib.bib7)\)and scaled to large budgets\(Brownet al\.,[2024](https://arxiv.org/html/2606.12935#bib.bib4)\)\. Related test\-time search methods organize reasoning paths more explicitly, for example through tree\-structured exploration\(Yaoet al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib28)\)\. DeepConf\(Fuet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib2)\)uses model\-internal confidence for confidence\-weighted voting, filtering, and online truncation, substantially improving accuracy on hard questions\. These methods all decide*how to vote*but not*when to stop*, as they run all traces to completion\. Our framework is orthogonal: it determines when the vote outcome has stabilized, and applies identically on top of any voting scheme, including both uniform and confidence\-weighted voting\. This contrasts with other test\-time scaling approaches that train explicit verifiers or process reward models to evaluate candidate solutions or reasoning steps\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.12935#bib.bib20); Uesatoet al\.,[2022](https://arxiv.org/html/2606.12935#bib.bib29); Lightmanet al\.,[2024](https://arxiv.org/html/2606.12935#bib.bib19); Zhanget al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib30)\), as MARS achieves efficiency purely by observing the generation dynamics of the base model\.
#### Adaptive sample budgets\.
A related line of work reduces cost by drawing fewer*complete*samples, connecting LLM self\-consistency to classical sequential testing and time\-uniform inference\(Wald,[1945](https://arxiv.org/html/2606.12935#bib.bib8); Howardet al\.,[2021](https://arxiv.org/html/2606.12935#bib.bib9)\)\. Adaptive\-Consistency\(Aggarwalet al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib10)\)generates samples sequentially and stops drawing more once a Dirichlet\-based criterion indicates vote stability; ESC\(Liet al\.,[2024](https://arxiv.org/html/2606.12935#bib.bib11)\)optimizes a performance\-cost tradeoff; CGES\(Aghazadehet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib13)\), Reliability\-Aware Adaptive Self\-Consistency\(Kimet al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib31)\), and Optimal Bayesian Stopping\(Huanget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib15)\)use confidence, reliability, or Bayesian posterior estimates; Certified Self\-Consistency\(Cordero\-Encinar and Duncan,[2025](https://arxiv.org/html/2606.12935#bib.bib14)\)and CITE\(Otaet al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib32)\)derive anytime\-valid guarantees; and self\-calibration methods use model confidence to allocate test\-time compute\(Huanget al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib33)\)\. These methods decide*how many*complete samples to generate, but every sample runs to completion\. In the parallel deployment standard for latency\-sensitive applications where allNNtraces are launched simultaneously, these methods cannot reduce wall\-clock latency, since they require sequential generation to decide whether to draw the next sample\. Our setting is different: all traces run in parallel, and we stop them*mid\-generation*by probing intermediate answers at checkpoints, directly reducing both total tokens and wall\-clock latency\.
#### Dynamic generation\-time efficiency\.
A separate line of research targets inference efficiency by dynamically halting or accelerating computation during the generation process itself\. At the token and layer levels, techniques such as speculative decoding\(Leviathanet al\.,[2023](https://arxiv.org/html/2606.12935#bib.bib21); Chenet al\.,[2023a](https://arxiv.org/html/2606.12935#bib.bib24)\)and confident early\-exit mechanisms\(Schusteret al\.,[2022](https://arxiv.org/html/2606.12935#bib.bib25)\)reduce wall\-clock latency during generation\. For reasoning models, recent methods stop or prune individual chains based on intermediate confidence, answer entropy, or learned policies\(Yanget al\.,[2025b](https://arxiv.org/html/2606.12935#bib.bib34); Laaouach,[2025](https://arxiv.org/html/2606.12935#bib.bib35); Houet al\.,[2025](https://arxiv.org/html/2606.12935#bib.bib36)\)\. More recently, this philosophy has been elevated to parallel test\-time scaling\. The concurrent Parallel\-Probe method\(Zhenget al\.,[2026](https://arxiv.org/html/2606.12935#bib.bib12)\)implements this by extracting intermediate answers and halting generation when the aggregate majority vote remains stable for a fixed number of probes\. However, relying on consensus stability heuristics risks premature convergence, as a historically stable leader may still possess a fragile margin\. MARS addresses this by shifting the focus from historical stability to future risk, providing a rigorous margin\-based safety guarantee\.
## 5Conclusion
We introduce MARS, an early\-stopping layer for parallel LLM test\-time scaling\. MARS probes partial traces, estimates which traces are still likely to switch answers, and stops when the current vote margin is large enough to absorb adversarial future switches\. With true switch probabilities and full destination conservatism, the rule gives a high\-probability guarantee of matching the full\-budget vote\. Empirically, a lightweight learned switch model closely tracks oracle switch behavior, and per\-question contraction calibration captures additional destination structure beyond the conservative worst case\. Together these components save 25–47% of self\-consistency tokens and 14–29% on top of DeepConf while matching the accuracy of full\-budget baselines across all tested settings\. These results suggest that aggregate margin stability is a distinct and useful source of test\-time efficiency\.
There are some natural directions for future work\. Currently, the probing interval is fixed \(every 2,048 tokens\)\. An adaptive scheme that probes more frequently when the margin is tight and less when it is large could reduce probing overhead while maintaining responsiveness\. In this work, we only consider mathematical reasoning as the evaluation task; applying MARS to other domains \(code generation, agentic tasks\) where trace switch dynamics may differ is an important next step\. Another interesting future direction is to explore other methods to accurately estimate where traces will switch in the end\. If the distribution of challengers has good statistical properties, it’ll be possible to design a more principled estimator than using the calibration coefficientγ\\gammaas a surrogate\.
## References
- \[1\]\(2023\)Let’s sample step by step: adaptive\-consistency for efficient reasoning and coding with LLMs\.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,pp\. 12375–12396\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[2\]E\. Aghazadeh, A\. Ghasemi, H\. Beyhaghi, and H\. Pishro\-Nik\(2025\)CGES: confidence\-guided early stopping for efficient and accurate self\-consistency\.arXiv preprint arXiv:2511\.02603\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[3\]B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini\(2024\)Large language monkeys: scaling inference compute with repeated sampling\.arXiv preprint arXiv:2407\.21787\.Cited by:[§1](https://arxiv.org/html/2606.12935#S1.p1.1),[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[4\]C\. Chen, S\. Borgeaud, G\. Irving, J\. Lespiau, L\. Sifre, and J\. Jumper\(2023\)Accelerating large language model decoding with speculative sampling\.arXiv preprint arXiv:2302\.01318\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[5\]X\. Chen, R\. Aksitov, U\. Alon, J\. Ren, K\. Xiao, P\. Yin, S\. Prakash, C\. Sutton, X\. Wang, and D\. Zhou\(2023\)Universal self\-consistency for large language model generation\.arXiv preprint arXiv:2311\.17311\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[6\]K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano,et al\.\(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[7\]P\. Cordero\-Encinar and A\. B\. Duncan\(2025\)Certified self\-consistency: statistical guarantees and test\-time training for reliable reasoning in LLMs\.arXiv preprint arXiv:2510\.17472\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[8\]DeepSeek\-AI, D\. Guo, D\. Yang, H\. Zhang, J\. Song, R\. Zhang, R\. Xu, Q\. Zhu, S\. Ma, P\. Wang,et al\.\(2025\)DeepSeek\-R1: incentivizing reasoning capability in LLMs via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§3\.1](https://arxiv.org/html/2606.12935#S3.SS1.SSS0.Px1.p1.1)\.
- \[9\]Y\. Fu, X\. Wang, Y\. Tian, and J\. Zhao\(2025\)Deep think with confidence\.arXiv preprint arXiv:2508\.15260\.Cited by:[Figure 1](https://arxiv.org/html/2606.12935#S0.F1),[§1\.1](https://arxiv.org/html/2606.12935#S1.SS1.SSS0.Px3.p1.5),[§1](https://arxiv.org/html/2606.12935#S1.p9.1),[§3\.1](https://arxiv.org/html/2606.12935#S3.SS1.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[10\]D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt\(2021\)Measuring mathematical problem solving with the MATH dataset\.InAdvances in Neural Information Processing Systems,Vol\.34,pp\. 15941–15952\.Cited by:[§3\.1](https://arxiv.org/html/2606.12935#S3.SS1.SSS0.Px1.p1.1)\.
- \[11\]B\. Hou, Y\. Zhang, J\. Ji, Y\. Liu, K\. Qian, J\. Andreas, and S\. Chang\(2025\)ThinkPrune: pruning long chain\-of\-thought of LLMs via reinforcement learning\.arXiv preprint arXiv:2504\.01296\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[12\]S\. R\. Howard, A\. Ramdas, J\. McAuliffe, and J\. Sekhon\(2021\)Time\-uniform, nonparametric, nonasymptotic confidence sequences\.The Annals of Statistics49\(2\),pp\. 1055–1080\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[13\]C\. Huang, L\. Huang, J\. Leng, J\. Liu, and J\. Huang\(2025\)Efficient test\-time scaling via self\-calibration\.arXiv preprint arXiv:2503\.00031\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[14\]J\. Huang, W\. Ma, and Z\. Zhou\(2026\)Optimal Bayesian stopping for efficient inference of consistent LLM answers\.arXiv preprint arXiv:2602\.05395\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[15\]S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson,et al\.\(2022\)Language models \(mostly\) know what they know\.arXiv preprint arXiv:2207\.05221\.Cited by:[§C\.1](https://arxiv.org/html/2606.12935#A3.SS1.SSS0.Px1.p1.1)\.
- \[16\]J\. Kim, N\. Yang, K\. Min, and K\. Jung\(2026\)Reliability\-aware adaptive self\-consistency for efficient sampling in LLM reasoning\.arXiv preprint arXiv:2601\.02970\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[17\]T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa\(2022\)Large language models are zero\-shot reasoners\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 22199–22213\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[18\]Y\. Laaouach\(2025\)HALT\-CoT: model\-agnostic early stopping for chain\-of\-thought reasoning via answer entropy\.In4th Muslims in ML Workshop co\-located with ICML 2025,External Links:[Link](https://openreview.net/forum?id=CX5c7C1CZa)Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[19\]Y\. Leviathan, M\. Kalman, and Y\. Matias\(2023\)Fast inference from transformers via speculative decoding\.InInternational Conference on Machine Learning,pp\. 19274–19286\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[20\]Y\. Li, P\. Yuan, S\. Feng, B\. Pan, X\. Wang, B\. Sun, H\. Wang, and K\. Li\(2024\)Escape sky\-high cost: early\-stopping self\-consistency for multi\-step reasoning\.InInternational Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[21\]H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe\(2024\)Let’s verify step by step\.InInternational Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[22\]S\. Lin, J\. Hilton, and O\. Evans\(2022\)Teaching models to express their uncertainty in words\.Transactions on Machine Learning Research\.Cited by:[§C\.1](https://arxiv.org/html/2606.12935#A3.SS1.SSS0.Px1.p1.1)\.
- \[23\]H\. Ota, N\. Iwase, Y\. Ichihara, J\. Komiyama, and M\. Imaizumi\(2026\)CITE: anytime\-valid statistical inference in LLM self\-consistency\.arXiv preprint arXiv:2605\.05873\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[24\]J\. C\. Platt\(1999\)Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods\.InAdvances in Large Margin Classifiers,pp\. 61–74\.Cited by:[§C\.3](https://arxiv.org/html/2606.12935#A3.SS3.p1.1)\.
- \[25\]T\. Schuster, A\. Fisch, J\. Gupta, M\. Dehghani, D\. Bahri, V\. Tran, Y\. Tay, and D\. Metzler\(2022\)Confident adaptive language modeling\.Advances in Neural Information Processing Systems35,pp\. 17456–17472\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[26\]C\. Snell, J\. Lee, K\. Xu, and A\. Kumar\(2024\)Scaling LLM test\-time compute optimally can be more effective than scaling model parameters\.arXiv preprint arXiv:2408\.03314\.Cited by:[§1](https://arxiv.org/html/2606.12935#S1.p1.1)\.
- \[27\]J\. Uesato, N\. Kushman, R\. Kumar, F\. Song, N\. Siegel, L\. Wang, A\. Creswell, G\. Irving, and I\. Higgins\(2022\)Solving math word problems with process\- and outcome\-based feedback\.arXiv preprint arXiv:2211\.14275\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[28\]A\. Wald\(1945\)Sequential tests of statistical hypotheses\.The Annals of Mathematical Statistics16\(2\),pp\. 117–186\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px2.p1.1)\.
- \[29\]X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou\(2023\)Self\-consistency improves chain of thought reasoning in language models\.InInternational Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.12935#S1.p1.1),[§3\.1](https://arxiv.org/html/2606.12935#S3.SS1.SSS0.Px3.p1.1),[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[30\]J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, F\. Xia, E\. Chi, Q\. V\. Le, D\. Zhou,et al\.\(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.Advances in neural information processing systems35,pp\. 24824–24837\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[31\]A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv, C\. Zheng, D\. Liu, F\. Zhou, F\. Huang, F\. Hu, H\. Ge, H\. Wei, H\. Lin, J\. Tang, J\. Yang, J\. Tu, J\. Zhang, J\. Yang, J\. Yang, J\. Zhou, J\. Zhou, J\. Lin, K\. Dang, K\. Bao, K\. Yang, L\. Yu, L\. Deng, M\. Li, M\. Xue, M\. Li, P\. Zhang, P\. Wang, Q\. Zhu, R\. Men, R\. Gao, S\. Liu, S\. Luo, T\. Li, T\. Tang, W\. Yin, X\. Ren, X\. Wang, X\. Zhang, X\. Ren, Y\. Fan, Y\. Su, Y\. Zhang, Y\. Zhang, Y\. Wan, Y\. Liu, Z\. Wang, Z\. Cui, Z\. Zhang, Z\. Zhou, and Z\. Qiu\(2025\)Qwen3 technical report\.External Links:2505\.09388Cited by:[§1\.1](https://arxiv.org/html/2606.12935#S1.SS1.SSS0.Px2.p1.2),[§1](https://arxiv.org/html/2606.12935#S1.p2.1),[§3\.1](https://arxiv.org/html/2606.12935#S3.SS1.SSS0.Px1.p1.1)\.
- \[32\]C\. Yang, Q\. Si, Y\. Duan, Z\. Zhu, C\. Zhu, Q\. Li, M\. Chen, Z\. Lin, and W\. Wang\(2025\)Dynamic early exit in reasoning models\.arXiv preprint arXiv:2504\.15895\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[33\]S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan\(2023\)Tree of thoughts: deliberate problem solving with large language models\.Advances in neural information processing systems36,pp\. 11809–11822\.Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[34\]L\. Zhang, A\. Hosseini, H\. Bansal, M\. Kazemi, A\. Kumar, and R\. Agarwal\(2025\)Generative verifiers: reward modeling as next\-token prediction\.InInternational Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
- \[35\]L\. Zheng, L\. Yin, Z\. Xie, C\. Sun, J\. Huang, C\. H\. Yu, S\. Cao, C\. Kozyrakis, I\. Stoica, J\. E\. Gonzalez, C\. Barrett, and Y\. Sheng\(2024\)SGLang: efficient execution of structured language model programs\.InAdvances in Neural Information Processing Systems,A\. Globerson, L\. Mackey, D\. Belgrave, A\. Fan, U\. Paquet, J\. Tomczak, and C\. Zhang \(Eds\.\),Vol\.37,pp\. 62557–62583\.External Links:[Document](https://dx.doi.org/10.52202/079017-2000),[Link](https://proceedings.neurips.cc/paper_files/paper/2024/file/724be4472168f31ba1c9ac630f15dec8-Paper-Conference.pdf)Cited by:[Appendix D](https://arxiv.org/html/2606.12935#A4.SS0.SSS0.Px1.p1.1)\.
- \[36\]T\. Zheng, C\. Huang, R\. Dai,et al\.\(2026\)Parallel\-probe: towards efficient parallel thinking via 2D probing\.arXiv preprint arXiv:2602\.03845\.Cited by:[§1](https://arxiv.org/html/2606.12935#S1.p2.1),[§1](https://arxiv.org/html/2606.12935#S1.p3.1),[§1](https://arxiv.org/html/2606.12935#S1.p9.1),[§3\.4](https://arxiv.org/html/2606.12935#S3.SS4.p1.10),[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px3.p1.1)\.
- \[37\]D\. Zhou, N\. Schärli, L\. Hou, J\. Wei, N\. Scales, X\. Wang, D\. Schuurmans, C\. Cui, O\. Bousquet, Q\. Le, and E\. Chi\(2023\)Least\-to\-most prompting enables complex reasoning in large language models\.InInternational Conference on Learning Representations,Cited by:[§4](https://arxiv.org/html/2606.12935#S4.SS0.SSS0.Px1.p1.1)\.
## Appendix AProofs
### A\.1Margin decomposition
###### Proposition A\.1\(Margin decomposition\)\.
Fix checkpointttand challengerk≠L\(t\)k\\neq L\(t\)\. The full\-budget margin splits as
Mk\(T\)=Rk\(t\)\+Φk\(t\),M\_\{k\}\(T\)=R\_\{k\}\(t\)\+\\Phi\_\{k\}\(t\),\(9\)whereRk\(t\)=∑j:aj\(t\)=L,aj\(T\)=aj\(t\)wj−∑j:aj\(t\)=k,aj\(T\)=aj\(t\)wjR\_\{k\}\(t\)=\\sum\_\{\\begin\{subarray\}\{c\}j:a\_\{j\}\(t\)=L,\\,a\_\{j\}\(T\)=a\_\{j\}\(t\)\\end\{subarray\}\}w\_\{j\}\-\\sum\_\{\\begin\{subarray\}\{c\}j:a\_\{j\}\(t\)=k,\\,a\_\{j\}\(T\)=a\_\{j\}\(t\)\\end\{subarray\}\}w\_\{j\}is the retained margin, andΦk\(t\)=∑j:aj\(T\)=L,aj\(t\)≠Lwj−∑j:aj\(T\)=k,aj\(t\)≠kwj\\Phi\_\{k\}\(t\)=\\sum\_\{\\begin\{subarray\}\{c\}j:a\_\{j\}\(T\)=L,\\,a\_\{j\}\(t\)\\neq L\\end\{subarray\}\}w\_\{j\}\-\\sum\_\{\\begin\{subarray\}\{c\}j:a\_\{j\}\(T\)=k,\\,a\_\{j\}\(t\)\\neq k\\end\{subarray\}\}w\_\{j\}is the switch flow\.
###### Proof\.
Partition the final vote forLLinto traces that already voted forLLat checkpointttand stayed there, plus traces that switched intoLL\. Do the same forkk\. Subtracting the two final vote totals gives Eq\.[9](https://arxiv.org/html/2606.12935#A1.E9)\. ∎
Using switch indicatorsXj=𝟏\{aj\(T\)≠aj\(t\)\}X\_\{j\}=\\mathbf\{1\}\\\{a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\\}, the retained margin becomesRk=Mk\(t\)−∑j:aj\(t\)=LXjwj\+∑j:aj\(t\)=kXjwjR\_\{k\}=M\_\{k\}\(t\)\-\\sum\_\{j:a\_\{j\}\(t\)=L\}X\_\{j\}w\_\{j\}\+\\sum\_\{j:a\_\{j\}\(t\)=k\}X\_\{j\}w\_\{j\}\. BoundingΦk\\Phi\_\{k\}adversarially \(every departing leader/neutral voter joinskk\):Φk≥−∑j:aj\(t\)=LXjwj−∑j:aj\(t\)∉\{L,k\}Xjwj\\Phi\_\{k\}\\geq\-\\sum\_\{j:a\_\{j\}\(t\)=L\}X\_\{j\}w\_\{j\}\-\\sum\_\{j:a\_\{j\}\(t\)\\notin\\\{L,k\\\}\}X\_\{j\}w\_\{j\}\. SummingRk\+ΦkR\_\{k\}\+\\Phi\_\{k\}and collecting by current vote yieldsMk\(T\)≥Mk\(t\)−∑j∈𝒜tXj⋅cjkM\_\{k\}\(T\)\\geq M\_\{k\}\(t\)\-\\sum\_\{j\\in\\mathcal\{A\}\_\{t\}\}X\_\{j\}\\cdot c\_\{j\}^\{k\}, recovering Definition[2\.1](https://arxiv.org/html/2606.12935#S2.ThmTheorem1)\.
### A\.2Proof of Theorem[2\.4](https://arxiv.org/html/2606.12935#S2.ThmTheorem4)
###### Proof\.
First consider an observed challengerk≠L\(t\)k\\neq L\(t\)\. Let
Yj=wj\(𝟏\{aj\(t\)=L\}−𝟏\{aj\(t\)=k\}\)−wj\(𝟏\{aj\(T\)=L\}−𝟏\{aj\(T\)=k\}\)Y\_\{j\}=w\_\{j\}\\\!\\left\(\\mathbf\{1\}\\\{a\_\{j\}\(t\)=L\\\}\-\\mathbf\{1\}\\\{a\_\{j\}\(t\)=k\\\}\\right\)\-w\_\{j\}\\\!\\left\(\\mathbf\{1\}\\\{a\_\{j\}\(T\)=L\\\}\-\\mathbf\{1\}\\\{a\_\{j\}\(T\)=k\\\}\\right\)be tracejj’s realized decrease in the leader\-versus\-kkmargin between checkpointttand the full\-budget endpointTT\. Conditional onℱt\\mathcal\{F\}\_\{t\}, the variablesYjY\_\{j\}are independent by Assumption[2\.3](https://arxiv.org/html/2606.12935#S2.ThmTheorem3)\. Each has range width at most2wmax2w\_\{\\max\}, since one trace can change the leader\-versus\-kkmargin by at most2wj2w\_\{j\}\. Ifqj\(t\)=0q\_\{j\}\(t\)=0, thenaj\(T\)=aj\(t\)a\_\{j\}\(T\)=a\_\{j\}\(t\)almost surely andYj=0Y\_\{j\}=0almost surely, so only theNactiveN\_\{\\mathrm\{active\}\}traces contribute nonzero range\.
Definition[2\.1](https://arxiv.org/html/2606.12935#S2.ThmTheorem1)upper\-bounds the conditional expectation of each trace’s adverse contribution\. Ifaj\(t\)=La\_\{j\}\(t\)=L, a switch can decrease the margin by at most2wj2w\_\{j\}, so𝔼\[Yj∣ℱt\]≤2qj\(t\)wj\\mathbb\{E\}\[Y\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq 2q\_\{j\}\(t\)w\_\{j\}\. Ifaj\(t\)=ka\_\{j\}\(t\)=k, any switch away fromkkincreases the margin by at leastwjw\_\{j\}, so𝔼\[Yj∣ℱt\]≤−qj\(t\)wj\\mathbb\{E\}\[Y\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq\-q\_\{j\}\(t\)w\_\{j\}\. Ifaj\(t\)∉\{L,k\}a\_\{j\}\(t\)\\notin\\\{L,k\\\}, a switch can decrease the margin by at mostwjw\_\{j\}, so𝔼\[Yj∣ℱt\]≤qj\(t\)wj\\mathbb\{E\}\[Y\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\]\\leq q\_\{j\}\(t\)w\_\{j\}\. Therefore
μ=𝔼\[∑jYj∣ℱt\]≤∑jqj\(t\)cjk=Γk\(t;1\)\.\\mu=\\mathbb\{E\}\\\!\\left\[\\sum\_\{j\}Y\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\\right\]\\leq\\sum\_\{j\}q\_\{j\}\(t\)c\_\{j\}^\{k\}=\\Gamma\_\{k\}\(t;1\)\.Challengerkkovertakes the leader only if∑jYj≥Mk\(t\)\\sum\_\{j\}Y\_\{j\}\\geq M\_\{k\}\(t\)\. ForMk\(t\)\>Γk\(t;1\)M\_\{k\}\(t\)\>\\Gamma\_\{k\}\(t;1\), the boundμ≤Γk\(t;1\)\\mu\\leq\\Gamma\_\{k\}\(t;1\)implies that this event requires∑j\(Yj−𝔼\[Yj∣ℱt\]\)≥Mk\(t\)−Γk\(t;1\)\\sum\_\{j\}\(Y\_\{j\}\-\\mathbb\{E\}\[Y\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\]\)\\geq M\_\{k\}\(t\)\-\\Gamma\_\{k\}\(t;1\)\. Hoeffding’s inequality gives
P\(∑jYj≥Mk\(t\)∣ℱt\)≤exp\(−\(Mk\(t\)−Γk\(t;1\)\)22wmax2Nactive\)\.P\\\!\\left\(\\sum\_\{j\}Y\_\{j\}\\geq M\_\{k\}\(t\)\\mid\\mathcal\{F\}\_\{t\}\\right\)\\leq\\exp\\\!\\left\(\-\\frac\{\(M\_\{k\}\(t\)\-\\Gamma\_\{k\}\(t;1\)\)^\{2\}\}\{2w\_\{\\max\}^\{2\}N\_\{\\mathrm\{active\}\}\}\\right\)\.WhenMk\(t\)≤Γk\(t;1\)M\_\{k\}\(t\)\\leq\\Gamma\_\{k\}\(t;1\), the same bound with the positive part is vacuous but valid\.
For the synthetic unseen challenger⊥\\bot, define
Y~j=2wj𝟏\{aj\(t\)=L,aj\(T\)≠aj\(t\)\}\+wj𝟏\{aj\(t\)≠L,aj\(T\)≠aj\(t\)\},\\widetilde\{Y\}\_\{j\}=2w\_\{j\}\\mathbf\{1\}\\\{a\_\{j\}\(t\)=L,\\ a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\\}\+w\_\{j\}\\mathbf\{1\}\\\{a\_\{j\}\(t\)\\neq L,\\ a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\\},and setM⊥\(T\)=M⊥\(t\)−∑jY~jM\_\{\\bot\}\(T\)=M\_\{\\bot\}\(t\)\-\\sum\_\{j\}\\widetilde\{Y\}\_\{j\}withM⊥\(t\)=VL\(t\)M\_\{\\bot\}\(t\)=V\_\{L\}\(t\)\. This synthetic endpoint margin is pessimistic for every answer stringbbabsent at checkpointtt: the actual decrease in the leader\-versus\-bbmargin is at most∑jY~j\\sum\_\{j\}\\widetilde\{Y\}\_\{j\}\. Hence, if any previously unseen answer overtakes the leader, thenM⊥\(T\)≤0M\_\{\\bot\}\(T\)\\leq 0\.
The variablesY~j\\widetilde\{Y\}\_\{j\}are conditionally independent, have range width at most2wmax2w\_\{\\max\}, and are almost surely zero whenqj\(t\)=0q\_\{j\}\(t\)=0\. They satisfy
𝔼\[Y~j∣ℱt\]=\{2qj\(t\)wj,aj\(t\)=L,qj\(t\)wj,aj\(t\)≠L,=qj\(t\)cj⊥\.\\mathbb\{E\}\[\\widetilde\{Y\}\_\{j\}\\mid\\mathcal\{F\}\_\{t\}\]=\\begin\{cases\}2q\_\{j\}\(t\)w\_\{j\},&a\_\{j\}\(t\)=L,\\\\ q\_\{j\}\(t\)w\_\{j\},&a\_\{j\}\(t\)\\neq L,\\end\{cases\}=q\_\{j\}\(t\)c\_\{j\}^\{\\bot\}\.The same Hoeffding argument therefore applies to⊥\\botand gives Eq\.[4](https://arxiv.org/html/2606.12935#S2.E4)\. ∎
### A\.3Proof of Corollary[2\.5](https://arxiv.org/html/2606.12935#S2.ThmTheorem5)
###### Proof\.
At checkpointtt, the augmented challenger set𝒦\(t\)\\mathcal\{K\}\(t\)contains at mostNNchallengers: at mostN−1N\-1observed nonleader answers, plus the synthetic unseen challenger⊥\\bot\. Eq\.[5](https://arxiv.org/html/2606.12935#S2.E5)makes the failure probability for each checked challenger at mostδ/N\\delta/Nby Theorem[2\.4](https://arxiv.org/html/2606.12935#S2.ThmTheorem4)\. A union bound over𝒦\(t\)\\mathcal\{K\}\(t\)gives
P\(L\(T\)≠L\(t\)∣ℱt\)≤δ\.P\\bigl\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\\bigr\)\\leq\\delta\.The unseen challenger covers all answer labels absent at checkpointtt, so the union bound includes both observed and newly appearing challengers\.
For the stopping\-time statement, let the finite set of checkpoints be\{t1,…,tP\}\\\{t\_\{1\},\\ldots,t\_\{P\}\\\}and letEi=\{τ=ti\}E\_\{i\}=\\\{\\tau=t\_\{i\}\\\}\. SinceEi∈ℱtiE\_\{i\}\\in\\mathcal\{F\}\_\{t\_\{i\}\}and the conditional failure probability is at mostδ\\deltawhenever the rule stops,
P\(L\(T\)≠L\(τ\)\)=∑i=1P𝔼\[𝟏EiP\(L\(T\)≠L\(ti\)∣ℱti\)\]≤∑i=1PδP\(Ei\)≤δ\.P\(L\(T\)\\neq L\(\\tau\)\)=\\sum\_\{i=1\}^\{P\}\\mathbb\{E\}\\\!\\left\[\\mathbf\{1\}\_\{E\_\{i\}\}P\\\!\\left\(L\(T\)\\neq L\(t\_\{i\}\)\\mid\\mathcal\{F\}\_\{t\_\{i\}\}\\right\)\\right\]\\leq\\sum\_\{i=1\}^\{P\}\\delta P\(E\_\{i\}\)\\leq\\delta\.∎
## Appendix BFrom optimal stopping to MARS: derivation and modeling choices
This appendix develops the full derivation connecting MARS to the Bayes\-optimal stopping rule, showing that MARS is a strict conservative relaxation\.
### B\.1From optimal stopping to margin certification
The Bayes\-optimal early\-stopping rule halts at the earliest checkpointttsatisfying
P\(L\(T\)≠L\(t\)∣ℱt\)≤δ\.P\\bigl\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\\bigr\)\\leq\\delta\.This is intractable: it requires the joint future distribution of all trace answers\. We reduce it to a per\-challenger margin condition\. Since the leader can only be overturned if some challengerkkachievesMk\(T\)≤0M\_\{k\}\(T\)\\leq 0, a union bound gives
P\(L\(T\)≠L\(t\)∣ℱt\)≤∑k∈𝒦\(t\)P\(Mk\(T\)≤0∣ℱt\)\.P\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\)\\leq\\sum\_\{k\\in\\mathcal\{K\}\(t\)\}P\(M\_\{k\}\(T\)\\leq 0\\mid\\mathcal\{F\}\_\{t\}\)\.It therefore suffices to certifyP\(Mk\(T\)≤0∣ℱt\)≤δ/\|𝒦\(t\)\|P\(M\_\{k\}\(T\)\\leq 0\\mid\\mathcal\{F\}\_\{t\}\)\\leq\\delta/\|\\mathcal\{K\}\(t\)\|for each challenger\. The problem becomes: bound the probability that the margin against each challenger drops to zero\.
### B\.2Margin decomposition
The full\-budget margin decomposes exhaustively \(Proposition[A\.1](https://arxiv.org/html/2606.12935#A1.ThmTheorem1)\):
Mk\(T\)=Rk\(t\)\+Φk\(t\),M\_\{k\}\(T\)=R\_\{k\}\(t\)\+\\Phi\_\{k\}\(t\),whereRkR\_\{k\}is the retained margin from traces keeping their answer, andΦk\\Phi\_\{k\}is the switch flow from traces that change\. Using switch indicatorsXj=𝟏\{aj\(T\)≠aj\(t\)\}X\_\{j\}=\\mathbf\{1\}\\\{a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\\}:
Rk=Mk\(t\)−∑j∈𝒜t:aj\(t\)=LXjwj\+∑j∈𝒜t:aj\(t\)=kXjwj\.R\_\{k\}=M\_\{k\}\(t\)\-\\sum\_\{\\begin\{subarray\}\{c\}j\\in\\mathcal\{A\}\_\{t\}:\\\\ a\_\{j\}\(t\)=L\\end\{subarray\}\}X\_\{j\}\\,w\_\{j\}\+\\sum\_\{\\begin\{subarray\}\{c\}j\\in\\mathcal\{A\}\_\{t\}:\\\\ a\_\{j\}\(t\)=k\\end\{subarray\}\}X\_\{j\}\\,w\_\{j\}\.This is the current marginMk\(t\)M\_\{k\}\(t\), minus leader\-voters who depart, plus challenger\-voters who depart\.
### B\.3Retained margin: learnable via logistic regression
The retained termRkR\_\{k\}depends only on*whether*each trace switches \(XjX\_\{j\}\)\. The conditional switch probabilityqj\(t\)=P\(Xj=1∣ℱt\)q\_\{j\}\(t\)=P\(X\_\{j\}=1\\mid\\mathcal\{F\}\_\{t\}\)is a per\-trace quantity depending on observable features: checkpoint position, probe confidence, answer\-flip count, stability streak\. These are allℱt\\mathcal\{F\}\_\{t\}\-measurable, makingqjq\_\{j\}a standard binary prediction problem\. A lightweight logistic regression trained on warmup traces estimatesqjq\_\{j\}well \(Appendix[C](https://arxiv.org/html/2606.12935#A3)\)\. This component of the margin can therefore be modeled tightly\.
### B\.4Switch flow: unpredictable, therefore adversarial
The switch\-flow termΦk\\Phi\_\{k\}depends on*where*switching traces land\. The destination is an open\-ended answer string determined by future reasoning, which is not learnable fromℱt\\mathcal\{F\}\_\{t\}alone\. We therefore boundΦk\\Phi\_\{k\}adversarially\. In the worst case, every switcher leaving the leader or a neutral position joins challengerkk:
Φk≥−∑j∈𝒜t:aj\(t\)=LXjwj−∑j∈𝒜t:aj\(t\)∉\{L,k\}Xjwj\.\\Phi\_\{k\}\\geq\-\\sum\_\{\\begin\{subarray\}\{c\}j\\in\\mathcal\{A\}\_\{t\}:\\\\ a\_\{j\}\(t\)=L\\end\{subarray\}\}X\_\{j\}\\,w\_\{j\}\-\\sum\_\{\\begin\{subarray\}\{c\}j\\in\\mathcal\{A\}\_\{t\}:\\\\ a\_\{j\}\(t\)\\notin\\\{L,k\\\}\\end\{subarray\}\}X\_\{j\}\\,w\_\{j\}\.SummingRk\+ΦkR\_\{k\}\+\\Phi\_\{k\}and collecting terms by current vote:
Mk\(T\)≥Mk\(t\)−∑j∈𝒜tXj⋅cjk,M\_\{k\}\(T\)\\geq M\_\{k\}\(t\)\-\\sum\_\{j\\in\\mathcal\{A\}\_\{t\}\}X\_\{j\}\\cdot c\_\{j\}^\{k\},wherecjkc\_\{j\}^\{k\}is the adversarial switch cost \(Definition[2\.1](https://arxiv.org/html/2606.12935#S2.ThmTheorem1)\)\. This is a one\-sided bound: the trueMk\(T\)M\_\{k\}\(T\)is at least as large as this expression\.
### B\.5Conservative relaxation: formal statement
Applying Hoeffding’s inequality to∑Xjcjk\\sum X\_\{j\}c\_\{j\}^\{k\}\(Theorem[2\.4](https://arxiv.org/html/2606.12935#S2.ThmTheorem4)\) and a union bound over challengers \(Corollary[2\.5](https://arxiv.org/html/2606.12935#S2.ThmTheorem5)\) yields the stopping rule:Mk\(t\)≥Γk\(t\)\+ϵ\(N,δ\)M\_\{k\}\(t\)\\geq\\Gamma\_\{k\}\(t\)\+\\epsilon\(N,\\delta\)for allkk\. This is a*sufficient*condition forP\(L\(T\)≠L\(t\)∣ℱt\)≤δP\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\)\\leq\\delta\.
###### Proposition B\.1\(Conservative relaxation\)\.
Letτ∗\\tau^\{\*\}be the Bayes\-optimal stopping time andτMARS\\tau\_\{\\mathrm\{MARS\}\}the stopping time of the certified MARS rule \(Eq\.[3](https://arxiv.org/html/2606.12935#S2.E3)withγ=1\\gamma=1, trueqjq\_\{j\}\)\. Under Assumptions[2\.2](https://arxiv.org/html/2606.12935#S2.ThmTheorem2)–[2\.3](https://arxiv.org/html/2606.12935#S2.ThmTheorem3),τMARS≥τ∗\\tau\_\{\\mathrm\{MARS\}\}\\geq\\tau^\{\*\}almost surely\.
###### Proof\.
By Corollary[2\.5](https://arxiv.org/html/2606.12935#S2.ThmTheorem5), whenever Eq\.[3](https://arxiv.org/html/2606.12935#S2.E3)holds for allk∈𝒦\(t\)k\\in\\mathcal\{K\}\(t\), we haveP\(L\(T\)≠L\(t\)∣ℱt\)≤δP\(L\(T\)\\neq L\(t\)\\mid\\mathcal\{F\}\_\{t\}\)\\leq\\delta\. The checkpoints where MARS fires are therefore a subset of those where the Bayes\-optimal condition holds\. Since both rules stop at the earliest qualifying checkpoint,τMARS≥τ∗\\tau\_\{\\mathrm\{MARS\}\}\\geq\\tau^\{\*\}\. ∎
## Appendix CSwitch probability model details
This appendix details the logistic regression model used to estimate per\-trace switch probabilitiesqj\(t\)=P\(aj\(T\)≠aj\(t\)∣ℱt\)q\_\{j\}\(t\)=P\(a\_\{j\}\(T\)\\neq a\_\{j\}\(t\)\\mid\\mathcal\{F\}\_\{t\}\)\.
### C\.1Features
The model uses 5 features, allℱt\\mathcal\{F\}\_\{t\}\-measurable \(observable at checkpointttwithout future information\):
Table 3:Features for the switch probability model\. All features are trace\-intrinsic: they describe the trace’s own history, not its relationship to the vote distribution\.#### Feature intuition\.
positioncaptures the baseline switch rate, which decreases as traces approach completion\.confidencereflects the model’s certainty in its current answer; prior work shows that model confidence and expressed uncertainty can carry information about answer reliability\[[15](https://arxiv.org/html/2606.12935#bib.bib22),[22](https://arxiv.org/html/2606.12935#bib.bib37)\], and in our data high\-confidence probes switch less often\.flipsmeasures cumulative instability; traces that have changed answers many times are more likely to change again\.streakmeasures recent stability; a trace holding the same answer for many consecutive probes is unlikely to switch\.conf\_trendcaptures momentum: rising confidence suggests convergence, falling confidence suggests the trace may switch\.
### C\.2Training
#### Data\.
For each question, the model is trained on 16 warmup traces that are run to completion, providing both intermediate answers at each checkpoint and the final answeraj\(T\)a\_\{j\}\(T\)\. Each \(trace, position\) pair yields one training example, giving approximately16×27≈43016\\times 27\\approx 430examples per question\. The binary label isyj,t=𝟏\{aj\(t\)≠aj\(T\)\}y\_\{j,t\}=\\mathbf\{1\}\\\{a\_\{j\}\(t\)\\neq a\_\{j\}\(T\)\\\}\.
#### Standardization\.
Features are standardized to zero mean and unit variance before fitting\. The standardization parameters \(mean, std per feature\) are stored and applied at prediction time\.
#### Logistic regression\.
We fit:
P\(y=1∣𝐱\)=σ\(β0\+𝜷⊤𝐱std\)P\(y=1\\mid\\mathbf\{x\}\)=\\sigma\(\\beta\_\{0\}\+\\bm\{\\beta\}^\{\\top\}\\mathbf\{x\}\_\{\\text\{std\}\}\)whereσ\(z\)=1/\(1\+e−z\)\\sigma\(z\)=1/\(1\+e^\{\-z\}\)and𝐱std\\mathbf\{x\}\_\{\\text\{std\}\}is the standardized feature vector\. The parametersβ0∈ℝ\\beta\_\{0\}\\in\\mathbb\{R\}and𝜷∈ℝ5\\bm\{\\beta\}\\in\\mathbb\{R\}^\{5\}\(one coefficient per feature\) are fit by minimizing the regularized negative log\-likelihood:
ℒ\(β0,𝜷\)=−∑i\[yilogp^i\+\(1−yi\)log\(1−p^i\)\]\+λ‖𝜷‖22\\mathcal\{L\}\(\\beta\_\{0\},\\bm\{\\beta\}\)=\-\\sum\_\{i\}\\bigl\[y\_\{i\}\\log\\hat\{p\}\_\{i\}\+\(1\-y\_\{i\}\)\\log\(1\-\\hat\{p\}\_\{i\}\)\\bigr\]\+\\lambda\\\|\\bm\{\\beta\}\\\|\_\{2\}^\{2\}withλ=0\.01\\lambda=0\.01\(mild L2 regularization on coefficients, not intercept\)\. Optimization uses L\-BFGS\-B with a maximum of 200 iterations, initialized at𝟎\\bm\{0\}\.
### C\.3Platt calibration
After fitting the logistic model, we apply Platt scaling\[[24](https://arxiv.org/html/2606.12935#bib.bib16)\]to correct any systematic miscalibration\. Given raw predictionsq^j\\hat\{q\}\_\{j\}on the training set:
qjcal=σ\(a⋅logit\(q^j\)\+b\)q\_\{j\}^\{\\text\{cal\}\}=\\sigma\\\!\\left\(a\\cdot\\text\{logit\}\(\\hat\{q\}\_\{j\}\)\+b\\right\)wherelogit\(p\)=log\(p/\(1−p\)\)\\text\{logit\}\(p\)=\\log\(p/\(1\-p\)\)and\(a,b\)\(a,b\)are fit by minimizing cross\-entropy on the training predictions versus true labels\. With a well\-specified logistic model, this is approximately the identity \(a≈1,b≈0a\\approx 1,b\\approx 0\)\.
#### Why Platt calibration\.
The logistic model is trained with L2 regularization, which shrinks predictions toward 0\.5\. Platt scaling can undo this shrinkage\. More importantly, if the 5 features are insufficient to capture all switch\-relevant variation \(which they are—they miss vote\-context information\), the raw predictions may be systematically miscalibrated\. Platt scaling provides a lightweight correction\.
### C\.4Prediction
At prediction time, the fitted model producesqj\(t\)q\_\{j\}\(t\)for every tracejjat every checkpointtt\. The predictions are precomputed as a\[Ntraces×P\]\[N\_\{\\text\{traces\}\}\\times P\]matrix and indexed into bootstrap samples\. No per\-iteration computation is needed—the model is fit once per question during the warmup phase\.
#### Computational cost\.
Fitting the model \(L\-BFGS on∼\\sim430 examples with 6 parameters\) takes<<1ms per question\. Prediction \(matrix multiply \+ sigmoid\) for all traces and positions takes<<1ms\. Theqq\-model adds negligible overhead to the simulation\.
## Appendix DTrace generation and answer probing
#### Trace generation\.
We generate traces using SGLang\[[35](https://arxiv.org/html/2606.12935#bib.bib17)\]\. Table[4](https://arxiv.org/html/2606.12935#A4.T4)lists the inference parameters for each model\. We sample 4,096 traces per problem, all launched in parallel\.
Table 4:Inference parameters for trace generation\.
#### Answer probing at checkpoints\.
Checkpoints are spaced every 2,048 tokens\. At checkpointtt, we truncate the reasoning trace at positionttand append:
> Considering the limited time by the user, I have to give the solution based on the thinking directly now\.\\n</think\>\\n\\boxed\{
This closes the thinking block and forces the model to commit to an answer\. We allow up to 50 new tokens for this completion; temperature, top\-pp, and top\-kkstay the same as in trace generation\. The answer is parsed from the resulting`\\boxed\{\}`output\.
## Appendix EFull experimental results
Tables[5](https://arxiv.org/html/2606.12935#A5.T5),[6](https://arxiv.org/html/2606.12935#A5.T6), and[7](https://arxiv.org/html/2606.12935#A5.T7)report the complete results for all methods across all model–dataset combinations \(3 models×\\times3 datasets\)\. Each table shows MARS atγ=1\\gamma=1\(fully conservative, no calibration\) and with per\-questionγ\\gamma\-calibration, for both the learnedqq\-model and oracleqq, on SC and DCO\. Savings are mean per\-question token reductions relative to each family’s own baseline \(SC or DCO\)\.
Table 5:Full results for DeepSeek\-R1\-Distill\-Qwen\-8B\.Table 6:Full results for Qwen3\-32B\.Table 7:Full results for Qwen3\-next\.
## Appendix FLimitations & Broader impacts
#### Limitations\.
The margin decomposition assumes constant per\-trace weightswjw\_\{j\}, which holds exactly for uniform\-weight schemes \(self\-consistency\) but only approximately for methods with position\-dependent weighting such as DCO\. In practice, this approximation is benign: DCO’s own filtering and truncation absorb most of the weight variation, and our method preserves DCO baseline accuracy across all 9 settings tested\. Our evaluation covers mathematical reasoning; the framework’s applicability to other domains \(e\.g\., coding\) remains to be validated\.
#### Broader impacts\.
The main positive impact is reducing the compute cost, latency, and energy use of parallel LLM inference\. The same efficiency gain could also make high\-volume LLM deployment cheaper, including deployments whose outputs are low\-quality or harmful\. MARS does not introduce a new model or new generation capability, but it should be deployed with the same output\-quality and safety controls as the underlying LLM system\.Similar Articles
LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling
This paper introduces AutoTTS, an environment-driven framework that automates the discovery of test-time scaling strategies for LLMs by formulating it as controller synthesis. It demonstrates improved accuracy-cost tradeoffs on mathematical reasoning benchmarks with minimal computational overhead.
Fast Unlearning at Scale via Margin Self-Correction
Introduces MASC (Margin Self-Correction), an efficient unlearning method for LLMs that uses an online stopping rule to achieve competitive forget–retain trade-offs at reduced computational cost, validated on TOFU and MUSE benchmarks.
LLMs Know When They Know, but Do Not Act on It: A Metacognitive Harness for Test-time Scaling
This paper proposes a metacognitive harness that separates monitoring from reasoning in LLMs, using pre-solve feeling-of-knowing and post-solve judgment-of-learning signals to control when to trust, retry, or aggregate answers, improving accuracy on text, code, and multimodal benchmarks without parameter updates.
Margin-Adaptive Confidence Ranking for Reliable LLM Judgement
This paper introduces a margin-based confidence ranking method for LLM-as-a-judge systems, learning a dedicated estimator to ensure monotonicity between confidence and human-disagreement risk, with generalization guarantees and improved ranking accuracy across datasets.
CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs
CRMA introduces a spectrally-bounded residual adapter that enables continual fine-tuning of LLMs without catastrophic forgetting by enforcing a doubly-stochastic mixing matrix via Sinkhorn normalization. Experimental results on Mistral-7B and Gemma-2-9B show improved backward transfer and reduced forgetting compared to frozen-substrate baselines.