ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling

arXiv cs.LG 05/27/26, 04:00 AM Papers
reasoning test-time-sampling majority-vote language-models hidden-states consensus
Summary
This paper identifies that language model reasoning trajectories during test-time sampling cluster into 'reasoning basins', causing majority vote failures when the dominant basin is incorrect. It introduces ARBITER, a model-agnostic method that uses conservative additive evidence from the model's own outputs and hidden states to improve accuracy without external data.
arXiv:2605.26172v1 Announce Type: new Abstract: When language models use test-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or reasoning basins, each defined by a normalized final answer and the solutions that reach it. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates wrong-majority failures where the correct answer is present but outvoted. We introduce ARBITER, a model-agnostic approach that models interactions between basins using only the base model's own sampled outputs, hidden states, and derived evidence. Most direct correction strategies fail; ARBITER instead uses conservative additive evidence on top of consensus. In its simplest parameter-free form, ARBITER-{\Delta} adds same-model evidence to the majority prior, while ARBITER-Enc augments this with bounded residual signals from hidden states over complete solutions. On GSM8K with Qwen3-4B, consensus over K=24 samples achieves around the mid-94% range, while a same-pool top-2 oracle reaches around the mid-96% range. ARBITER recovers a subset of these cases using zero external information. Across three model families and three math benchmarks, it yields consistent gains with no net-negative cases; for example, on Llama-3.1-8B MMLU-HS-Math, it improves accuracy from the mid-78% range to the mid-82% range, recovering about 22% of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself.
Cached at: 05/27/26, 09:04 AM
# ARBITER: Reasoning Trajectory Basins and Majority Vote Failures in Test-Time Sampling
Source: [https://arxiv.org/html/2605.26172](https://arxiv.org/html/2605.26172)
Meng Cai, Lars Kulik, Farhana Choudhury. School of Computing and Information Systems, University of Melbourne. meng\.cai1@student\.unimelb\.edu\.au, lkulik@unimelb\.edu\.au, farhana\.choudhury@unimelb\.edu\.au

###### Abstract

When language models use test\-time sampling, they generate multiple reasoning trajectories and select an answer by majority vote\. We show that these trajectories are not independent: for a given question, they concentrate into a small number of clusters, or *reasoning basins*, each defined by a normalized final answer and the solutions that reach it\. A majority vote therefore selects the most stable basin rather than the most accurate one, which creates *wrong\-majority failures* where the correct answer is present but outvoted\. We introduce Arbiter, a model\-agnostic approach that models interactions between basins using only the base model’s own sampled outputs, hidden states, and derived evidence\. Most direct correction strategies fail; Arbiter instead uses conservative additive evidence on top of consensus\. In its simplest parameter\-free form, Arbiter\-$\Delta$ adds same\-model evidence to the majority prior, while Arbiter\-Enc augments this with bounded residual signals from hidden states over complete solutions\. On GSM8K with Qwen3\-4B, consensus over $K=24$ samples achieves around the mid\-94% range, while a same\-pool top\-2 oracle reaches around the mid\-96% range\. Arbiter recovers a subset of these cases using zero external information\. Across three model families and three math benchmarks, it yields consistent gains with no net\-negative cases; for example, on Llama\-3\.1\-8B MMLU\-HS\-Math, it improves accuracy from the mid\-78% range to the mid\-82% range, recovering about $22\%$ of the available oracle headroom, indicating that this headroom can be partially recovered from the sample pool itself\.

## 1 Introduction

A standard way to improve language models at inference time is to sample multiple reasoning trajectories and aggregate their final answers by majority vote\. This baseline is already strong: across models and benchmarks, it usually improves over greedy decoding and often places the correct answer somewhere in the sampled pool\. The remaining challenge is therefore not generation alone, but post\-consensus recovery: can a system reliably identify when the majority answer is wrong without degrading the many cases where consensus is already correct?

We find that sampled trajectories concentrate into a small number of coherent answer basins: groups of solutions whose extracted final answers match after task\-specific normalization\. Majority vote therefore acts as a dominant\-basin selector\. This usually works well, but an important failure case remains: the dominant basin can be wrong while a correct challenger basin is already present in the sampled pool\.

These wrong\-majority cases reveal substantial recoverable headroom within the sampled pool itself\. In many examples, the correct answer is already present among the observed challenger basins but loses to a larger wrong basin\. However, our experiments show that recovering these cases reliably is surprisingly difficult\. Broad self\-review, hidden\-state reranking, trajectory coherence scoring, graph routing, framing\-first replacement, and direct basin\-selection methods often degrade a strong consensus baseline even when they reveal real structure in the reasoning process\. Hidden\-state structure and trajectory coherence are therefore not reliable indicators of correctness\.

This empirical pattern motivates a more conservative principle: consensus should remain the prior\. A challenger basin should override the dominant basin only when additional same\-model evidence accumulates in its favor\. Reliable recovery comes not from replacing consensus, but from sparse additive evidence layered on top of it\.

Existing approaches fall short because they evaluate trajectories in isolation\. Sample\-and\-aggregate methods rely on agreement but do not model relationships between basins\. Hidden\-state scorers, self\-verification methods, and entropy\-based approaches assess individual trajectories\. They do not directly model comparative evidence between alternative basins\. As a result, they struggle to determine when a minority basin should override a stable majority\.

We introduce Arbiter, a basin\-structured framework for post\-consensus selection\. The key principle is simple: *consensus remains the prior*\. It is overridden only when additional same\-model evidence supports a challenger basin\. For each question, Arbiter groups sampled trajectories into basins and constructs compact representations of their structure\. It then collects additional same\-model evidence by asking the model to reinterpret, compare, and solve again under competing basin hypotheses, and accumulates this evidence relative to the dominant basin\.

Our approach operates in a strict zero\-external\-information setting\. It uses only the model’s own sampled outputs and internal representations\. It does not rely on external verifiers or additional training signals\. This isolates the question of whether the model already contains enough internal evidence to recover from wrong\-majority failures\.

The central empirical lesson is that post\-consensus recovery must be conservative\. Many consensus errors are recoverable in principle, but broad correction strategies often harm more correct cases than they fix\. Signals such as trajectory coherence and hidden\-state structure reveal real organization in the sampled pool, yet they do not reliably identify correctness\. Reliable gains instead come from sparse, high\-precision overrides supported by additive basin\-level evidence\.

Our contributions are as follows\.

1. We identify *wrong\-majority failure* as a key failure case of consensus decoding\. Sampled trajectories concentrate into a small number of reasoning basins, and majority vote selects the most stable basin rather than the most accurate one\.
2. We introduce Arbiter, a basin\-structured framework that performs post\-consensus selection by accumulating same\-model evidence for challenger basins relative to the dominant basin, while treating consensus as a prior\.
3. We show that post\-consensus recovery is inherently selective\. Across a broad range of self\-review, hidden\-state, graph\-routing, and framing\-based interventions, most direct correction strategies degrade a strong consensus baseline\. Reliable recovery instead comes from sparse, high\-precision additive evidence\.

## 2 Related work

Test\-time sampling and answer aggregation\. Chain\-of\-thought prompting and self\-consistency established the standard recipe of sampling multiple reasoning traces and aggregating final answers \(Wei et al\., [2022](https://arxiv.org/html/2605.26172#bib.bib28); Wang et al\., [2023](https://arxiv.org/html/2605.26172#bib.bib27)\)\. Universal Self\-Consistency extends this idea by using the model itself to select among candidate solutions beyond exact\-answer majority voting \(Chen et al\., [2023](https://arxiv.org/html/2605.26172#bib.bib3)\)\. More recent test\-time scaling work studies how to allocate inference compute across problems rather than spend it uniformly \(Snell et al\., [2025](https://arxiv.org/html/2605.26172#bib.bib25)\)\. These methods motivate our baseline: raw consensus is a strong dominant\-basin estimator\. Arbiter keeps that estimator as the prior and studies when same\-model evidence justifies selecting another observed basin\.

Hidden\-state and trajectory signals\. Prior work uses hidden states, token uncertainty, step\-level pruning, latent actions, or proactive refinement to evaluate or control reasoning \(Liang et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib17); Ghasemabadi and Niu, [2025](https://arxiv.org/html/2605.26172#bib.bib5); Chen et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib2); Li et al\., [2025](https://arxiv.org/html/2605.26172#bib.bib15); Han et al\., [2025](https://arxiv.org/html/2605.26172#bib.bib7); Shi et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib23)\)\. Recent work also studies semantic or latent structure across sampled reasoning trajectories, including semantic consistency, latent majority\-set selection, and hidden\-state clustering approaches \(Knappe et al\., [2024](https://arxiv.org/html/2605.26172#bib.bib11); Oh and Lee, [2025](https://arxiv.org/html/2605.26172#bib.bib21); Liang et al\., [2025](https://arxiv.org/html/2605.26172#bib.bib16)\)\. Recent trajectory\-level views further support treating a complete solution as a path through latent computation rather than as a bag of isolated token states \(Liang et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib17); Shi et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib23)\)\. This literature supports the idea that model\-internal computation contains useful structure\. Our results sharpen an important limitation: structure is not truth\. Coherence, stability, and graph reconstruction often detect commitment or risk, not correctness\. We therefore treat trajectory encoders and basin graphs as residual or diagnostic components, not as standalone selectors\.

Self\-correction limitations\. Iterative self\-feedback and reflection frameworks show that model\-generated feedback can improve outputs in some settings \(Madaan et al\., [2023](https://arxiv.org/html/2605.26172#bib.bib19); Shinn et al\., [2023](https://arxiv.org/html/2605.26172#bib.bib24)\)\. A parallel line of work shows that unguided self\-correction can be weak or harmful without a reliable verifier \(Huang et al\., [2024](https://arxiv.org/html/2605.26172#bib.bib10); Zhang et al\., [2024](https://arxiv.org/html/2605.26172#bib.bib31); Vasudev et al\., [2026](https://arxiv.org/html/2605.26172#bib.bib26)\)\. This matches our broader experimental findings: broad self\-review, cluster judging, and direct replacement policies often disrupt already\-correct consensus answers\. Arbiter responds by making correction sparse, additive, and auditable through recovered/degraded counts\.

Framing and semantic decomposition\. Math\-reasoning benchmarks and perturbation studies show that wording, entities, units, and symbolic form can strongly affect model behavior \(Cobbe et al\., [2021](https://arxiv.org/html/2605.26172#bib.bib4); Hendrycks et al\., [2021a](https://arxiv.org/html/2605.26172#bib.bib8), [b](https://arxiv.org/html/2605.26172#bib.bib9); Li et al\., [2024](https://arxiv.org/html/2605.26172#bib.bib14); Mirzadeh et al\., [2024](https://arxiv.org/html/2605.26172#bib.bib20)\)\. We use same\-model semantic descriptions to expose competing interpretations of observed answer basins\. Unlike framing\-first replacement, Arbiter\-$\Delta$ uses these descriptions only to collect additive evidence on top of the raw consensus prior\.

## 3 Problem setup

We study *post\-consensus recovery* for a frozen autoregressive language model $M$ under *zero external information*: the selector uses only the model’s own sampled outputs, internal states, and evidence derived from them\. Gold labels are used only after prediction for evaluation\. For each question $q$, the raw baseline is ordinary sampled generation, final\-answer clustering, and majority selection\.

The following notation defines the objects used throughout the paper\. A *candidate solution* is one complete generated completion\. A *trajectory* is used only when hidden states are involved: it is the layer\-by\-layer, token\-by\-token hidden\-state sequence recorded while the frozen model generates that complete solution\. An *answer basin* is an observed final\-answer cluster together with the generated solutions and their hidden\-state trajectories\. We define basins by final\-answer agreement, not by requiring all solutions in a basin to share the same reasoning path\. The dominant basin is the largest answer basin; challenger basins are all other observed basins\. Eqs\. \(D1\)–\(D10\) give the formal objects\. For reference, Appendix [B](https://arxiv.org/html/2605.26172#A2) lists every symbol used in these definitions and in the method score\.

$$
\begin{aligned}
\mathcal{S}(q) &= \{s_{1},\ldots,s_{K}\},\quad s_{i}=(y_{i,1},\ldots,y_{i,T_{i}}) && \text{raw candidate pool} && \text{(D1)}\\
H_{i} &= \bigl(h_{i,1}^{(1:L)},\ldots,h_{i,T_{i}}^{(1:L)}\bigr),\quad h_{i,t}^{(\ell)}\in\mathbb{R}^{d_{\mathrm{model}}} && \text{hidden-state trajectory} && \text{(D2)}\\
a_{i} &= \mathrm{Ans}(s_{i}) && \text{task-normalized final answer} && \text{(D3)}\\
C_{r}(q) &= \{\,i:a_{i}=\alpha_{r}\,\},\quad |C_{1}|\geq|C_{2}|\geq\cdots\geq|C_{m(q)}| && \text{ranked answer clusters} && \text{(D4)}\\
B_{r}(q) &= \bigl(\alpha_{r},C_{r},\{H_{i}:i\in C_{r}\}\bigr) && \text{observed answer basin} && \text{(D5)}\\
\hat{y}_{\mathrm{cons}}(q) &= \alpha_{1} && \text{raw consensus prediction} && \text{(D6)}
\end{aligned}
$$

Equations \(D1\)–\(D6\) separate the *answer* selected by consensus from the *solutions* that produced it\. Ties in Eq\. \(D4\) are broken deterministically by the earliest sampled index in the cluster and then by the canonical answer string\. A question belongs to the disagreement slice when $m(q)\geq 2$\. Only such questions admit within\-pool arbitration, because no alternative observed basin exists when all sampled solutions collapse to one answer\.
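
The clustering and ranking in Eqs. (D3)–(D6) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: `normalize_answer` is a hypothetical stand-in for the task-specific $\mathrm{Ans}(\cdot)$, `rank_basins` is a name introduced here, and hidden-state trajectories are omitted.

```python
from collections import defaultdict

def normalize_answer(raw):
    """Hypothetical stand-in for Ans(.): canonicalize a numeric final answer."""
    text = raw.strip().replace(",", "")
    try:
        value = float(text)
    except ValueError:
        return text.lower()
    return str(int(value)) if value.is_integer() else repr(value)

def rank_basins(final_answers):
    """Group sample indices by normalized answer and rank clusters (Eq. D4).

    Ties in size are broken by the earliest sampled index in the cluster,
    then by the canonical answer string. Returns (answer, indices) pairs
    with the dominant basin first.
    """
    clusters = defaultdict(list)
    for i, raw in enumerate(final_answers):
        clusters[normalize_answer(raw)].append(i)
    return sorted(
        clusters.items(),
        key=lambda item: (-len(item[1]), item[1][0], item[0]),
    )

pool = ["18", "18.0", "24", " 18 ", "24", "7"]   # toy final answers
basins = rank_basins(pool)
consensus = basins[0][0]                  # raw consensus prediction (Eq. D6)
in_disagreement_slice = len(basins) >= 2  # m(q) >= 2
```

In this toy pool the answer `18` forms the dominant basin and `24` is the leading challenger, so the question falls in the disagreement slice.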

Let $y^{\star}(q)$ be the gold answer\. Gold is unavailable to the selector\. It is used only to compute accuracy, diagnostic oracle ceilings, and recovered/degraded counts:

$$
\begin{aligned}
\mathrm{Acc}_{\mathrm{cons}} &= \mathbb{E}_{q}\bigl[\mathbf{1}\{\alpha_{1}(q)=y^{\star}(q)\}\bigr] && \text{raw consensus accuracy} && \text{(D7)}\\
\mathrm{Oracle@}k(q) &= \mathbf{1}\{\exists r\leq k:\alpha_{r}(q)=y^{\star}(q)\} && \text{diagnostic same-pool ceiling} && \text{(D8)}\\
\mathrm{WM}(q) &= \mathbf{1}\{\alpha_{1}(q)\neq y^{\star}(q),\;\exists r>1:\alpha_{r}(q)=y^{\star}(q)\} && \text{wrong-majority indicator} && \text{(D9)}\\
\Delta\mathrm{Acc}(\pi) &= \mathbb{E}_{q}\!\left[\mathbf{1}\{\alpha_{\pi(q)}(q)=y^{\star}(q)\}-\mathbf{1}\{\alpha_{1}(q)=y^{\star}(q)\}\right] && \text{net recovery of policy } \pi && \text{(D10)}
\end{aligned}
$$

Oracle@$k$ is not a deployable method: it is a diagnostic ceiling that shows how often the correct answer already appears among the top\-$k$ observed basins\. Eq\. \(D10\) equals the probability of wrong\-to\-right recoveries minus the probability of right\-to\-wrong degradations, which explains why high\-accuracy consensus is hard to improve: when consensus is already correct on most questions, few cases can be recovered and many can be degraded\. A useful policy must therefore treat consensus as the default and override it only when sufficient same\-model evidence accumulates in favor of a challenger basin\.
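
As a rough illustration of Eqs. (D8)–(D10), the sketch below computes the diagnostics on a toy set of questions. The function names, the toy data, and the `always_second` policy are assumptions made for this example; gold answers are used only for evaluation, as in the text.

```python
def oracle_at_k(ranked_answers, gold, k):
    """Eq. (D8): is the gold answer among the k largest observed basins?"""
    return gold in ranked_answers[:k]

def wrong_majority(ranked_answers, gold):
    """Eq. (D9): dominant basin is wrong, but some challenger basin is right."""
    return ranked_answers[0] != gold and gold in ranked_answers[1:]

def delta_acc(questions, policy):
    """Eq. (D10): mean of (policy correct) minus (consensus correct).

    `questions` is a list of (ranked_answers, gold) pairs; `policy` maps
    ranked_answers to the 0-based index of the selected basin.
    """
    total = 0
    for ranked, gold in questions:
        chosen = ranked[policy(ranked)]
        total += int(chosen == gold) - int(ranked[0] == gold)
    return total / len(questions)

# Toy evaluation set (hypothetical): answers ranked by basin size, plus gold.
questions = [
    (["18", "24"], "18"),      # consensus correct
    (["18", "24"], "24"),      # wrong-majority case
    (["7"], "7"),              # no disagreement
    (["3", "5", "9"], "11"),   # gold absent from the pool
]

def always_second(ranked):
    """A deliberately naive policy: always pick the leading challenger."""
    return 1 if len(ranked) >= 2 else 0

net = delta_acc(questions, always_second)
```

On this toy set the naive policy recovers one question and degrades one, so its net recovery is zero, which mirrors the point that indiscriminate overriding does not help.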

## 4 Method

### 4\.1 Inference pipeline

Arbiter is a post\-consensus arbitration method\. For each question, it first samples a raw pool of ordinary solutions, clusters them by final answer, and treats the largest basin as the consensus prior\. It then asks the same frozen model to produce compact interpretations of competing basins and to generate auxiliary evidence streams\. Each auxiliary output is parsed back into one of the observed answer basins\. Finally, Arbiter\-$\Delta$ adds the evidence as log ratios and keeps consensus unless a challenger has positive accumulated evidence over the dominant basin\.

This pipeline is intentionally simple:*sample, cluster, describe, collect evidence, add evidence*\. It uses no external verifier, no tool, no human feedback, and no gold label during selection\.

### 4\.2 Evidence sources and notation

A *frame* is a one\-sentence semantic interpretation of a basin: what quantity is being asked for, which entities and units are involved, and what operation pattern connects them\. A *framed\-pool solve* asks the model to state such an interpretation before solving\. A *panel trial* shows the model two basin interpretations side by side and then asks it to solve fresh\. A *guided re\-solve* gives one basin interpretation as a hypothesis and asks the model to re\-derive the answer\.

For a challenger basin $B_{r}$ and the dominant basin $B_{1}$, every evidence source produces counts for the two basins\. A count increases when the parsed final answer from that source matches the basin answer\. Invalid outputs or answers outside the compared pair are not added to either count, but they reduce the source reliability term below\.

For reference, Appendix [B](https://arxiv.org/html/2605.26172#A2) gives the count\-level symbol table for these sources\.

The rule has no learned or tuned source scaling constants\. The smoothing value $\alpha=1$ is a fixed Laplace count used only to avoid zero\-count log ratios; it is not a learned prior or a source\-weight parameter\. For the main top\-2 policy, framed and guided reliability are the top\-2 masses:

$$
r_{f}(q)=\frac{f_{1}+f_{2}}{N_{F}^{\mathrm{att}}},\qquad r_{g}(q)=\frac{g_{1}+g_{2}}{N_{G}^{\mathrm{att}}}, \tag{M1}
$$

where $N_{F}^{\mathrm{att}}$ and $N_{G}^{\mathrm{att}}$ count all attempted framed and guided outputs, including invalid outputs and answers outside the compared pair\. The mass shrinkage means that if many framed or guided trials land on answers other than $B_{1}$ or $B_{2}$, that source is unreliable for this question and contributes less\. Panel evidence is not part of the main framed\+guided rule, but is retained for source\-set ablations\. For those ablations, let $p_{j}^{+}$ and $p_{j}^{-}$ be counts for basin $B_{j}$ when the two frames are shown in the original and swapped order, so $p_{j}=p_{j}^{+}+p_{j}^{-}$\. Let $N_{P}^{+,\mathrm{att}}$ and $N_{P}^{-,\mathrm{att}}$ be the attempted counts in the two orders\. We use top\-pair mass times order symmetry:

$$
m_{P,r}=\frac{p_{1}^{+}+p_{r}^{+}+p_{1}^{-}+p_{r}^{-}}{N_{P}^{+,\mathrm{att}}+N_{P}^{-,\mathrm{att}}},\quad \nu_{r}^{\pm}=\frac{p_{r}^{\pm}+\alpha}{p_{1}^{\pm}+p_{r}^{\pm}+2\alpha},\quad \rho_{P,r}=m_{P,r}\bigl(1-|\nu_{r}^{+}-\nu_{r}^{-}|\bigr). \tag{M1p}
$$

Thus $\rho_{P,r}$ is small when panel outputs go off\-pair or when swapping the order of the two shown frames changes the top\-pair distribution substantially\. These factors use only model outputs and parse statistics, never gold labels\.
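
The reliability terms in Eqs. (M1) and (M1p) are simple count ratios. The sketch below is one way to compute them; the function names, the zero-attempt fallback, and the example counts are assumptions for illustration rather than details given in the paper.

```python
def top2_mass(count_1, count_2, attempted):
    """Eq. (M1): fraction of attempted outputs that land on the compared pair.

    Returning 0.0 when nothing was attempted is an assumption of this sketch.
    """
    return (count_1 + count_2) / attempted if attempted else 0.0

def panel_reliability(p1_fwd, pr_fwd, n_fwd, p1_swap, pr_swap, n_swap, alpha=1.0):
    """Eq. (M1p): top-pair mass times order symmetry.

    *_fwd counts come from the original frame order, *_swap from the swapped
    order; n_fwd and n_swap are the attempted counts in each order.
    """
    attempted = n_fwd + n_swap
    mass = (p1_fwd + pr_fwd + p1_swap + pr_swap) / attempted if attempted else 0.0
    nu_fwd = (pr_fwd + alpha) / (p1_fwd + pr_fwd + 2 * alpha)
    nu_swap = (pr_swap + alpha) / (p1_swap + pr_swap + 2 * alpha)
    return mass * (1.0 - abs(nu_fwd - nu_swap))

# Hypothetical counts: 21 of 24 framed solves land on the top-2 pair.
r_f = top2_mass(8, 13, 24)

# Order-stable panel: the challenger wins 5 to 1 in both orders.
rho_stable = panel_reliability(1, 5, 6, 1, 5, 6)

# Order-sensitive panel: the winner flips when the frames are swapped.
rho_flipped = panel_reliability(5, 1, 6, 1, 5, 6)
```

With these toy counts the order-stable panel keeps full reliability, while the order-sensitive panel is cut in half even though every output stays on the top pair.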

### 4\.3 The Delta rule

For each question $q$ with at least two observed basins, let $B_{1}$ be the majority basin and $B_{2}$ the leading challenger\. The main parameter\-free Arbiter\-$\Delta$ policy uses only raw, framed, and guided counts for this top\-2 pair:

$$
\Delta_{2}(q)=\log\frac{b_{2}+\alpha}{b_{1}+\alpha}+r_{f}(q)\log\frac{f_{2}+\alpha}{f_{1}+\alpha}+r_{g}(q)\log\frac{g_{2}+\alpha}{g_{1}+\alpha}. \tag{M2}
$$

The first term is the raw majority prior: the challenger starts behind when it has fewer raw votes\. The second and third terms are same\-model framed\-pool and guided re\-solve evidence\. The fixed pseudo\-count is $\alpha=1.0$\. There is no panel term in the main policy; panel evidence is used only in source\-set ablations\. The rule selects $B_{2}$ iff $\Delta_{2}(q)>0$ and otherwise keeps $B_{1}$:

$$
\hat{y}(q)=\begin{cases}\alpha_{2}, & \Delta_{2}(q)>0,\\[2pt] \alpha_{1}, & \text{otherwise.}\end{cases} \tag{M3}
$$

For notational compatibility with appendix ablations, we write $\Delta_{r}$ for the analogous score comparing a challenger $B_{r}$ against $B_{1}$; the main reported policy uses $r=2$\. Thus the decision boundary is simply the sign of the accumulated evidence\. The framed\+guided source set was fixed before the $3\times 3$ evaluation matrix was reported; panel and all\-source variants are retained only as source\-set ablations\. The same sign rule and the same fixed main source set are applied across the evaluation matrix\. In practice, almost all accepted moves are to the leading challenger: weaker basins rarely accumulate enough evidence to overcome the raw\-consensus prior, and their oracle contribution is usually small\. In the Qwen3\-4B GSM8K run, raw consensus is $94.54\%$, the top\-2 oracle is $97.27\%$, the top\-3 oracle is $97.88\%$, and top\-5 adds only limited additional headroom\. Appendix [M](https://arxiv.org/html/2605.26172#A13) gives a complete count\-level calculation of Eq\. \(M2\)\.
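
A minimal sketch of Eqs. (M2) and (M3) follows. The counts are those quoted in the Figure 1 caption, read here as dominant-versus-challenger pairs $b=(18,4)$, $f=(8,13)$, $g=(0,4)$. The figure does not state attempted-output counts, so the reliability values passed in below are assumptions chosen only to show how shrinkage changes the decision.

```python
import math

def log_ratio(challenger, dominant, alpha=1.0):
    """Laplace-smoothed log ratio of challenger to dominant counts."""
    return math.log((challenger + alpha) / (dominant + alpha))

def arbiter_delta(b, f, g, r_f, r_g, alpha=1.0):
    """Eq. (M2). Each of b, f, g is a (dominant, challenger) count pair."""
    return (
        log_ratio(b[1], b[0], alpha)
        + r_f * log_ratio(f[1], f[0], alpha)
        + r_g * log_ratio(g[1], g[0], alpha)
    )

def select(delta):
    """Eq. (M3): return 2 to flip to challenger B_2, 1 to keep consensus B_1."""
    return 2 if delta > 0 else 1

b, f, g = (18, 4), (8, 13), (0, 4)   # counts from the Figure 1 caption

# Assumed: every framed and guided output lands on the top-2 pair.
full_trust = arbiter_delta(b, f, g, r_f=1.0, r_g=1.0)

# Assumed: only half of the auxiliary outputs land on the top-2 pair.
half_trust = arbiter_delta(b, f, g, r_f=0.5, r_g=0.5)

# Raw prior alone, with no auxiliary evidence.
prior_only = arbiter_delta(b, f, g, r_f=0.0, r_g=0.0)
```

Under the full-reliability assumption the accumulated evidence is positive and the rule flips to the challenger; with halved reliabilities or no auxiliary evidence the score stays negative and consensus is kept.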

### 4\.4 Log\-linear source\-pooling view

Equation \(M2\) is a Bayes\-motivated log\-linear pooling score, not a calibrated Bayesian posterior over all auxiliary samples\. For the top\-2 pair, define unnormalized pooled support for $j\in\{1,2\}$ as

$$
\widetilde{w}_{j}(q)=(b_{j}+\alpha)\bigl(f_{j}+\alpha\bigr)^{r_{f}(q)}\bigl(g_{j}+\alpha\bigr)^{r_{g}(q)}.
$$

Taking the log ratio gives the implemented score exactly:

$$
\log\frac{\widetilde{w}_{2}(q)}{\widetilde{w}_{1}(q)}=\log\frac{b_{2}+\alpha}{b_{1}+\alpha}+r_{f}(q)\log\frac{f_{2}+\alpha}{f_{1}+\alpha}+r_{g}(q)\log\frac{g_{2}+\alpha}{g_{1}+\alpha}=\Delta_{2}(q). \tag{M4}
$$

Thus Eq\. \(M3\) selects the challenger exactly when its pooled support exceeds the dominant basin’s pooled support\. The form keeps the Bayesian intuition that log evidence adds, but it deliberately treats each evidence source as one bounded empirical opinion\. A full Bayesian update would generally scale with the number of auxiliary trials; Eq\. \(M2\) avoids that scaling because same\-model evidence streams can be correlated\. Appendix [C](https://arxiv.org/html/2605.26172#A3) gives the exact algebraic derivation of this pooling identity\.
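
The pooling identity in Eq. (M4) can be checked numerically. The snippet below is a small sanity check under assumed counts and reliabilities (all values hypothetical); it compares the log ratio of pooled supports with the sum of weighted log ratios in Eq. (M2).

```python
import math

def pooled_support(b_j, f_j, g_j, r_f, r_g, alpha=1.0):
    """Unnormalized pooled support for one basin, as defined before Eq. (M4)."""
    return (b_j + alpha) * (f_j + alpha) ** r_f * (g_j + alpha) ** r_g

def smoothed_log_ratio(pair, alpha=1.0):
    """Log ratio of challenger to dominant for a (dominant, challenger) pair."""
    return math.log((pair[1] + alpha) / (pair[0] + alpha))

def delta_2(b, f, g, r_f, r_g, alpha=1.0):
    """Eq. (M2) written out term by term."""
    return (
        smoothed_log_ratio(b, alpha)
        + r_f * smoothed_log_ratio(f, alpha)
        + r_g * smoothed_log_ratio(g, alpha)
    )

# Hypothetical (dominant, challenger) counts and reliabilities.
b, f, g = (18, 4), (8, 13), (0, 4)
r_f, r_g = 0.875, 0.5

w_1 = pooled_support(b[0], f[0], g[0], r_f, r_g)
w_2 = pooled_support(b[1], f[1], g[1], r_f, r_g)

lhs = math.log(w_2 / w_1)          # log ratio of pooled supports
rhs = delta_2(b, f, g, r_f, r_g)   # additive score from Eq. (M2)
challenger_wins = w_2 > w_1        # same decision as the sign of rhs
```

Because the reliabilities enter as exponents, each source contributes at most one weighted log ratio regardless of how many auxiliary trials were run.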

![Refer to caption](https://arxiv.org/html/2605.26172v1/x1.png) Figure 1: One wrong\-majority case for Arbiter\-$\Delta$\. \(a\) Basin Story Graph: dominant basin $B_{1}=A$ \($n=18$\), challenger $B_{2}=B$ \($n=4$\), three singleton basins\. Edges from each evidence source \(§[4\.2](https://arxiv.org/html/2605.26172#S4.SS2)\) to the top\-2 are weighted by the per\-basin counts $(b,f,p,g)$; the dashed red edge marks the conflicting frames between $B_{1}$ and $B_{2}$\. The sampled pool favors $A$ \($b_{1}=18$, $b_{r}=4$\), but the compact framed\+guided main policy already favors $B$ \($f$: 13 vs\. 8, $g$: 4 vs\. 0\), and the panel stream gives additional support in full\-source ablations \($p$: 9 vs\. 2\); $\Delta_{r}>0$ in Eq\. \(M2\), so the rule flips selection to $B$\. \(b\) Hidden\-state divergence visualization for the same example; the larger $K=120$ visualization is illustrative only, not a quantitative reference\.
### 4\.5 Optional encoder residual

Arbiter\-Enc is a diagnostic upper\-bound variant, not the main claim: it indicates how much additional headroom is reachable when a bounded hidden\-state residual is added on top of Eq\. \(M2\)\. It is not trained to say which answer is correct\. Its objectives are to predict held\-out same\-model evidence, to estimate whether the residual is reliable enough to use, and to keep the correction small enough that it cannot overwhelm the hand\-specified Delta rule\.

For basin $B_{r}$, the encoder aggregates the hidden\-state trajectories of the solutions in that basin into a representation $z_{r}=A_{\theta}(\{H_{i}:i\in C_{r}\})$, where $A_{\theta}$ is a permutation\-invariant trajectory aggregator\. It outputs a signed residual $e_{\theta,r}(q)$ and, when enabled, a permission score $\gamma_{\theta,r}(q)\in[0,1]$:

$$
\Delta^{\mathrm{Enc}}_{r}(q)=\Delta_{r}(q)+\lambda\,\gamma_{\theta,r}(q)\,\operatorname{clip}\!\left(e_{\theta,r}(q),-c_{\mathrm{clip}},c_{\mathrm{clip}}\right). \tag{M5}
$$

The clipped form makes the encoder residual conservative by construction\. Here $\lambda\geq 0$ is the residual scale and $c_{\mathrm{clip}}>0$ is the clipping bound; both are fixed before evaluation in the fixed setting and may be tuned only on labeled validation data in the calibrated setting\. Runs that do not use a permission head set $\gamma_{\theta,r}(q)\equiv 1$\. We report calibrated residuals separately because they are a stronger but different regime\.
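
The bounded correction in Eq. (M5) can be sketched as follows. The encoder itself is not modeled here: `residual` and `permission` stand in for its outputs $e_{\theta,r}(q)$ and $\gamma_{\theta,r}(q)$, and the values of `lam` and `c_clip` are hypothetical, since the text does not give concrete settings.

```python
def clip(value, low, high):
    """Clamp value to the closed interval [low, high]."""
    return max(low, min(high, value))

def delta_enc(delta_r, residual, lam, c_clip, permission=1.0):
    """Eq. (M5): hand Delta score plus a scaled, gated, clipped residual.

    permission defaults to 1.0, matching runs without a permission head.
    """
    return delta_r + lam * permission * clip(residual, -c_clip, c_clip)

lam, c_clip = 0.5, 1.0   # hypothetical residual scale and clipping bound

# A very large residual can shift the score by at most lam * c_clip.
near_boundary = delta_enc(-0.3, residual=5.0, lam=lam, c_clip=c_clip)
far_from_boundary = delta_enc(-1.0, residual=5.0, lam=lam, c_clip=c_clip)

# A permission score of 0 switches the residual off entirely.
gated_off = delta_enc(-0.3, residual=5.0, lam=lam, c_clip=c_clip, permission=0.0)
```

With these assumed settings the residual can flip a decision only when the hand score is within `lam * c_clip` of zero, which is the sense in which the correction cannot overwhelm the Delta rule.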

Training uses held\-out evidence rather than gold labels\. For example, the encoder may see raw and framed\-pool features and learn to predict the panel/guided evidence delta\. The basic target is:

$$
\delta_{o,r}(q)=\log\frac{n_{o,r}+\alpha}{n_{o,1}+\alpha},\qquad \mathcal{L}_{\mathrm{res}}=\sum_{q,r>1}\omega_{q,r}\,\mathrm{Huber}\!\left(e_{\theta,r}(q)-\delta_{o,r}(q)\right), \tag{M6}
$$

where $o$ is the held\-out evidence source or source group, $n_{o,j}$ is the number of held\-out outputs from $o$ assigned to basin $B_{j}$, and $\omega_{q,r}\in[0,1]$ is a prespecified weight, typically the held\-out top\-pair mass $(n_{o,1}+n_{o,r})/N_{o}^{\mathrm{att}}$\. We use the standard Huber loss, which is quadratic for small residuals and linear for large residuals, so that a small number of noisy held\-out counts cannot dominate training\. If a permission head is learned, it is trained only from held\-out evidence reliability and is used to attenuate this residual, not to predict gold correctness\. In words, the encoder learns whether independent same\-model evidence would support a challenger; it does not learn a direct correctness label\.
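
To make the training target in Eq. (M6) concrete, here is a small plain-Python sketch of the held-out log-ratio target, the Huber loss, and the weighted sum. The Huber threshold of 1.0, the tuple layout of `examples`, and the zero-attempt fallback are assumptions of this sketch; a real training loop would use an autodiff framework and predicted residuals from the encoder.

```python
import math

def held_out_target(n_r, n_1, alpha=1.0):
    """Target in Eq. (M6): smoothed log ratio of held-out counts."""
    return math.log((n_r + alpha) / (n_1 + alpha))

def huber(error, threshold=1.0):
    """Standard Huber loss: quadratic below the threshold, linear above it."""
    magnitude = abs(error)
    if magnitude <= threshold:
        return 0.5 * magnitude * magnitude
    return threshold * (magnitude - 0.5 * threshold)

def residual_loss(examples, alpha=1.0):
    """Eq. (M6) over (predicted residual, n_r, n_1, attempted) tuples.

    The weight is the held-out top-pair mass (n_1 + n_r) / attempted.
    """
    total = 0.0
    for predicted, n_r, n_1, attempted in examples:
        weight = (n_1 + n_r) / attempted if attempted else 0.0
        total += weight * huber(predicted - held_out_target(n_r, n_1, alpha))
    return total

# Hypothetical example: the encoder predicts 0.0, but held-out evidence
# favors the challenger 9 to 1 with all 10 attempts on the top pair.
loss = residual_loss([(0.0, 9, 1, 10)])
```

Note that no gold label appears anywhere in this computation; the supervision comes entirely from held-out same-model counts.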

## 5 Experiments

We evaluate post\-consensus recovery under zero external information\. The section is organized around the empirical path of the project: establish raw consensus and oracle headroom, summarize the negative correction attempts that motivated the framing pivot, and report the main positive results for Arbiter\-$\Delta$ and Arbiter\-Enc\. Full per\-experiment details, historical runs, and reliability tiers are in Appendices [D](https://arxiv.org/html/2605.26172#A4)–[R](https://arxiv.org/html/2605.26172#A18)\.

Methods: The baseline throughout is *raw consensus*: a $K=24$ ordinary sampled solution pool, answer clustering, and majority selection\. Reported $K$ counts the sampled generations used in the consensus vote; a greedy anchor is recorded separately for greedy accuracy and compute accounting and is not included in the raw\-consensus vote\. We use $K=24$ as the default raw pool size\. A GSM8K raw\-only diagnostic at $K=56$ did not outperform $K=24$ for the three model families \(Appendix [D](https://arxiv.org/html/2605.26172#A4), Table [8](https://arxiv.org/html/2605.26172#A4.T8)\), but we do not treat it as a strict compute\-matched full\-matrix baseline\. This baseline uses no framing prompt, no frame panel, no guided re\-solving, and no encoder\-based selection\. Because different scripts use different seeds, prompt families, and run configurations, every method is compared only to the raw consensus baseline from the same run family\. The methods are summarized in Table [1](https://arxiv.org/html/2605.26172#S5.T1)\.

Metrics: For every selector, we report accuracy and the correction decomposition: *overrides*, *recovered*, *degraded*, and *net = recovered − degraded*\. Recovered cases are wrong raw\-consensus predictions changed to correct; degraded cases are correct raw\-consensus predictions changed to wrong\. This decomposition is essential because raw consensus is already strong, and a selector can appear plausible while hurting more correct cases than it fixes\. At high consensus accuracies, even modest positive gains are difficult to obtain without introducing larger numbers of right\-to\-wrong degradations\. The benchmark size $N$ is not reduced by model parse failures; invalid generated outputs are treated as invalid predictions/evidence and remain in attempted\-output denominators\. We also report the evidence budget: raw\-pool size, number of framed solves, number of panel trials, number of guided re\-solves, and whether an encoder residual is used\. Across all tested seeds and run families, Arbiter\-$\Delta$ exceeds raw consensus on this decomposition\.
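
The correction decomposition described above can be tallied directly from per-question records. The sketch below is illustrative only: the record layout and the toy data are assumptions, and overrides that move from one wrong answer to another count as overrides but as neither recovered nor degraded.

```python
def correction_decomposition(records):
    """Tally overrides, recovered, degraded, and net for one selector.

    `records` holds (consensus_answer, selected_answer, gold_answer) tuples,
    one per question; gold is used only for this after-the-fact evaluation.
    """
    overrides = recovered = degraded = 0
    for consensus, selected, gold in records:
        if selected == consensus:
            continue                      # selector kept the consensus answer
        overrides += 1
        if consensus == gold:
            degraded += 1                 # right-to-wrong
        elif selected == gold:
            recovered += 1                # wrong-to-right
    return {
        "overrides": overrides,
        "recovered": recovered,
        "degraded": degraded,
        "net": recovered - degraded,
    }

# Toy records (hypothetical).
records = [
    ("18", "18", "18"),   # kept, correct
    ("18", "24", "24"),   # recovered
    ("18", "24", "18"),   # degraded
    ("3", "5", "11"),     # override, wrong to wrong
    ("7", "9", "9"),      # recovered
]
summary = correction_decomposition(records)
```

Dividing `net` by the number of questions gives the net recovery of Eq. (D10), so a selector improves on raw consensus only when recovered cases outnumber degraded ones.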

Table 1: Evidence budget and status of the main methods. Framed-pool solves, panel trials, and guided trials are defined in Section [4.2](https://arxiv.org/html/2605.26172#S4.SS2); all auxiliary evidence is generated by the same frozen base model.

Datasets and language models: We evaluate frozen instruction-tuned language models across math reasoning benchmarks. The main multi-model results use (i) Qwen3-4B-Instruct-2507 (Yang et al., [2025](https://arxiv.org/html/2605.26172#bib.bib30)), (ii) Llama-3.1-8B-Instruct (Grattafiori et al., [2024](https://arxiv.org/html/2605.26172#bib.bib6)), and (iii) Phi-4 (Abdin et al., [2024](https://arxiv.org/html/2605.26172#bib.bib1)) over the datasets of (i) GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.26172#bib.bib4)) ($n{=}1319$), (ii) MMLU-HS-Math (Hendrycks et al., [2021a](https://arxiv.org/html/2605.26172#bib.bib8)) ($n{=}270$), and (iii) MATH-500 (Hendrycks et al., [2021b](https://arxiv.org/html/2605.26172#bib.bib9); Lightman et al., [2024](https://arxiv.org/html/2605.26172#bib.bib18)) ($n{=}500$). GSM8K answers are normalized numeric strings, MMLU-HS-Math answers are canonical multiple-choice labels, and MATH-500 answers are normalized boxed expressions with symbolic equivalence when available.
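To make the answer-normalization step concrete, the sketch below shows one plausible way to map raw answer strings to basin keys for the numeric and multiple-choice cases. The exact rules used in the experiments are not given in this section, so everything here is an assumption: the function names, the characters stripped, and the A to D label range are all hypothetical. Symbolic equivalence for MATH-500 is not sketched.

```python
import re
from decimal import Decimal, InvalidOperation
from typing import Optional


def normalize_numeric(text: str) -> Optional[str]:
    """Map strings like '$1,200.00' or '18.' to a canonical numeric string.
    Returns None when the text does not parse as a finite number."""
    cleaned = text.strip().replace(",", "").replace("$", "").rstrip(".")
    try:
        value = Decimal(cleaned)
    except InvalidOperation:
        return None
    if not value.is_finite():
        return None
    # normalize() drops trailing zeros; the 'f' format avoids exponent notation.
    return format(value.normalize(), "f")


def normalize_choice(text: str) -> Optional[str]:
    """Map strings like '(b)' or 'B.' to a canonical label in A-D."""
    match = re.fullmatch(r"\(?([A-D])\)?[.:]?", text.strip().upper())
    return match.group(1) if match else None


print(normalize_numeric("$1,200.00"), normalize_numeric("0.50"), normalize_choice("(b)"))
```

Two samples land in the same basin exactly when their normalized keys are equal, which is why extraction and equivalence rules can shift reported accuracy.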

### 5.1 Experiment results

Raw consensus and oracle headroom: The full consensus/oracle table is in Appendix [D](https://arxiv.org/html/2605.26172#A4) (Table [7](https://arxiv.org/html/2605.26172#A4.T7)). The results show that raw consensus leaves visible same-pool headroom. In brief, consensus improves over greedy in seven of nine model–dataset cells, matches it in one cell, and remains the in-family baseline for all selector comparisons; the single below-greedy cell is reported explicitly in the appendix. In that Qwen3-4B MATH-500 cell, the same-pool oracle remains higher than both greedy and consensus, indicating that the 4B model often samples a correct answer but does not form a stable dominant correct basin. The top-3 oracle reveals substantial recoverable headroom, especially for the models that have lower accuracy and harder benchmarks. A GSM8K raw-only diagnostic at $K{=}56$ did not improve over $K{=}24$ (Table [8](https://arxiv.org/html/2605.26172#A4.T8)), so the gains below should not be interpreted as merely drawing more ordinary samples.
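A same-pool top-k oracle of the kind referred to above can be sketched as follows. This is a hedged illustration: `top_k_oracle_hit` and `oracle_accuracy` are hypothetical names, basins are ranked by raw support, and the oracle is credited whenever the gold answer is among the k largest basins. It is a diagnostic ceiling that requires gold labels, not a deployable selector.

```python
from collections import Counter
from typing import Optional, Sequence


def top_k_oracle_hit(answers: Sequence[Optional[str]], gold: str, k: int) -> bool:
    """True if the gold answer is one of the k best-supported basins."""
    counts = Counter(a for a in answers if a is not None)
    top = [answer for answer, _ in counts.most_common(k)]
    return gold in top


def oracle_accuracy(pools: Sequence[Sequence[Optional[str]]],
                    golds: Sequence[str], k: int) -> float:
    """Fraction of questions where a top-k oracle could pick the gold answer."""
    hits = sum(top_k_oracle_hit(p, g, k) for p, g in zip(pools, golds))
    return hits / len(golds)


# Three toy questions: correct majority, correct challenger, gold never sampled.
pools = [["1", "1", "2"], ["3", "3", "4"], ["5", "6", "6"]]
golds = ["1", "4", "7"]
print(oracle_accuracy(pools, golds, k=1), oracle_accuracy(pools, golds, k=2))
```

With k = 1 this reduces to raw-consensus accuracy (up to tie-breaking); the gap between k = 1 and k = 2 is the wrong-majority headroom that arbitration tries to recover.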

Negative ladder: most global correction fails: The full negative ladder is reported in Appendix [D](https://arxiv.org/html/2605.26172#A4). Broad self-review, top-up vote merging, raw-trace review, answer-memo review, basin-principle judging, our original trajectory encoder, the cluster-GNN/router, framing-first replacement, and direct panel/guided replacement all fail as global selectors. These results are not incidental: they show that stability, coherence, and graph structure often measure self-consistency rather than correctness. This motivates the final design choice: keep raw consensus as the prior and add auxiliary evidence only through a conservative arbitration score.

Main result: Arbiter-$\Delta$: Table [2](https://arxiv.org/html/2605.26172#S5.T2) reports the main clean result using the fixed framed+guided parameter-free policy. Each row compares Arbiter-$\Delta$ against the raw consensus baseline from the same run family. Count-level net change is positive in eight of nine cells and neutral in one; accuracy is positive in eight cells and unchanged in one. Net counts are recovered minus degraded examples.

Table 2: Arbiter-$\Delta$ across three models and three benchmarks using the fixed framed+guided parameter-free policy. Raw consensus is the in-family no-framing majority baseline. “Overrides / Rec. / Deg.” means arbitration moves, wrong-to-right recoveries, and right-to-wrong degradations. Counts are reported for every benchmark row.

The largest gains occur where consensus has more headroom: Llama-3.1-8B averages $+1.77$ points across datasets, with the strongest single-cell gain of $+3.00$ on MATH-500. Near-ceiling cells show small or neutral changes, consistent with the oracle analysis. The Qwen3-4B GSM8K policy illustrates the operating regime: 16 overrides yield 9 recoveries and 5 degradations. Across the evaluation matrix, the method makes 168 overrides, 78 recoveries, and 35 degradations, for a net gain of 43 examples. The method improves by making a small number of higher-precision non-consensus moves, not by broadly rewriting predictions.
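As a worked check of these counts, using only the numbers reported above and $n{=}1319$ for GSM8K:

$$\text{net}_{\text{Qwen3-4B, GSM8K}} = 9 - 5 = 4, \qquad \frac{4}{1319} \approx 0.30 \text{ accuracy points},$$

$$\text{net}_{\text{matrix}} = 78 - 35 = 43, \qquad 168 - 78 - 35 = 55.$$

Assuming each override is counted exactly once as recovered, degraded, or neither, the remaining 2 of 16 overrides in the GSM8K cell and 55 of 168 across the matrix change the selected answer without changing correctness, which is consistent with reading them as wrong-to-wrong moves.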

Arbiter-Enc: encoder residual gains: The trajectory encoder is a bounded residual on top of Arbiter-$\Delta$; it is not a standalone verifier and is not expected to improve the fixed raw-majority vote itself. Appendix [K](https://arxiv.org/html/2605.26172#A11) reports the full fixed/no-tune and calibrated/tuned gain ranges. The fixed encoder's gain is near-zero to modest in the strict label-free setting and comes mainly from reducing degraded flips and identifying noisy framing evidence. The calibrated encoder shows a stronger range when labeled validation data is available, so we report it separately from the strict zero-external-information setting.

Experimental results overview: The full empirical-program table is in Appendix [A](https://arxiv.org/html/2605.26172#A1) (Table [3](https://arxiv.org/html/2605.26172#A1.T3)). The overview of the results is: raw consensus is the stable sampled-vote baseline, most global correction mechanisms are negative or fragile, and the clean fixed full-matrix selector with non-negative count-level net outcomes is Arbiter-$\Delta$. Unless stated otherwise, representative GSM8K numbers in the appendix use Qwen3-4B on the GSM8K test split.

Summary: Five findings follow from the experiments. (1) Raw consensus is the correct in-family baseline: it improves or matches greedy in most evaluated cells and exposes substantial recoverable headroom. (2) Most global correction methods fail or are fragile. (3) Framing helps only as additive evidence; framing-as-replacement is negative. (4) Arbiter-$\Delta$ has non-negative point-estimate count-level net change in all 9 cells and positive accuracy gain in eight cells. (5) Arbiter-Enc is best interpreted as an optional residual: fixed gains are near-zero to modest, while calibrated variants are stronger but use validation labels.

## 6 Analysis

Limits of global correction. Our experiments show that many internal signals are descriptive rather than selective. Stability, coherence, hidden-state structure, and graph reconstruction reveal that basins exist, but they do not determine which basin is correct. A stable wrong frame can dominate the pool, while a correct challenger can remain small. This is why broad replacement policies and global rerankers often degrade raw consensus.

Additive framing evidence. Framing evidence helps when it is treated as a side channel rather than a new baseline. A frame can expose a target-quantity, entity-binding, unit, or reasoning-structure disagreement between basins. However, a frame can also be coherent and wrong. Arbiter-$\Delta$ therefore uses frames to collect additional evidence and adds that evidence to the raw support prior. The method changes the consensus decision only when the auxiliary evidence overcomes the dominant-basin prior.

Residual role of the encoder. The encoder’s successful role is not to verify correctness directly. Our earlier basin encoder ranked risk and abstention well, but review-based correction was negative. The revised role is narrower: identify when framed, panel, or guided evidence is noisy; reduce degraded flips; and provide a clipped residual in the same log-evidence space as Arbiter-$\Delta$. This keeps hidden-state structure useful without allowing it to override consensus alone.
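The idea of a clipped residual in log-evidence space can be sketched as below. This is an assumption-laden illustration, not the Arbiter-Enc implementation: the function names are hypothetical, the encoder output is treated as a single scalar, and the clipping bound of 0.5 is an arbitrary placeholder.

```python
def clipped_residual_score(delta: float, residual: float, bound: float = 0.5) -> float:
    """Add an encoder residual to the additive score after clipping it to
    [-bound, +bound], so the encoder can only nudge the decision."""
    if bound < 0:
        raise ValueError("bound must be non-negative")
    clipped = max(-bound, min(bound, residual))
    return delta + clipped


def select_challenger(delta: float, residual: float, bound: float = 0.5) -> bool:
    """Override consensus only if the combined log-evidence favors the challenger."""
    return clipped_residual_score(delta, residual, bound) > 0.0


# A strongly negative delta cannot be flipped by any residual, however large.
print(select_challenger(delta=-1.0, residual=10.0))
# A marginal override can be vetoed, which is how degraded flips are reduced.
print(select_challenger(delta=0.2, residual=-10.0))
```

Under this sketch, whenever the additive score is below the negative bound the residual cannot change the outcome, which is one way to read the statement that hidden-state structure does not override consensus alone.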

Diagnostic role of basin graphs. The graph branch remains valuable as an explanatory object even though the router was negative as a selector. A Basin Story Graph can visualize how sampled trajectories split into basins, which frames conflict, and which evidence sources support each basin. We treat this as diagnostic structure rather than as the main selection mechanism.

## 7 Limitations

Arbiter-$\Delta$ uses additional evidence beyond the raw consensus pool. We therefore report the evidence budget and avoid matched-compute claims against raw consensus. All auxiliary evidence comes from the same frozen base model but still requires additional generation. Baseline values differ across run families because prompts, seeds, dataset adapters, and scripts differ, and even nominally identical runs can vary due to nondeterminism in batched inference (e.g., vLLM scheduling). We therefore compare each method only to the raw consensus baseline from the same run family. Oracle numbers are diagnostic ceilings, not achievable results without gold labels. The log-linear source-pooling view of Eq. (M2) is not a calibrated Bayesian posterior over all auxiliary samples. It treats each evidence source as a bounded empirical opinion; this avoids letting larger same-model auxiliary budgets automatically dominate the raw consensus term. However, framed-pool, panel, and guided evidence all come from the same frozen model and can share correlated mistakes, which can lead to double-counting. The current method mitigates this with reliability shrinkage and a conservative sign rule, but direct dependence modeling remains future work. The calibrated encoder variant uses labeled validation data to tune the residual rule, so it does not satisfy the strict zero-external-information setting. The fixed encoder variant remains compatible with zero external information but produces smaller and sometimes near-zero gains. The experiments cover three frozen instruction-tuned model families and three math benchmarks. The design is model-size agnostic, but frontier-scale models, non-math tasks, and tasks with harder answer equivalence remain future work. Results can also depend on answer extraction and equivalence rules, especially for symbolic math. Appendix [O](https://arxiv.org/html/2605.26172#A15) reports standard deviations over three random seeds for a closely related Qwen3-4B GSM8K source-set diagnostic; broader seed replications for the fixed main policy remain future work.

## 8 Conclusion

Raw consensus is a strong dominant-basin estimator but leaves a wrong-majority failure case: the correct answer can appear in the pool yet lose to a larger wrong basin. Arbiter addresses this as basin arbitration under zero external information. Arbiter-$\Delta$ keeps consensus as the prior and overrides it only when same-model evidence outweighs the dominant-basin prior. Across the $3\times 3$ matrix, its count-level net change is non-negative in every cell, with accuracy gains in eight of the nine cells and no change in the remaining one. The broader lesson: structure is not truth. Reliable recovery requires conservative additive evidence and explicit accounting of recovered and degraded cases.

## References

- Abdin et al. [2024] Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J. Hewett, Mojan Javaheripi, Piero Kauffmann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gustavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Dingli Yu, Cyril Zhang, and Yi Zhang. Phi-4 technical report. *arXiv preprint arXiv:2412.08905*, 2024. URL [https://arxiv.org/abs/2412.08905](https://arxiv.org/abs/2412.08905).
- Chen et al. [2026] Jinkun Chen, Fengxiang Cheng, Sijia Han, and Vlado Keselj. “I may not have articulated myself clearly”: Diagnosing dynamic instability in LLM reasoning at inference time. *arXiv preprint arXiv:2602.02863*, 2026. URL [https://arxiv.org/abs/2602.02863](https://arxiv.org/abs/2602.02863).
- Chen et al. [2023] Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation. *arXiv preprint arXiv:2311.17311*, 2023. URL [https://arxiv.org/abs/2311.17311](https://arxiv.org/abs/2311.17311).
- Cobbe et al. [2021] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. *arXiv preprint arXiv:2110.14168*, 2021. URL [https://arxiv.org/abs/2110.14168](https://arxiv.org/abs/2110.14168).
- Ghasemabadi and Niu [2025] Amirhosein Ghasemabadi and Di Niu. Can LLMs predict their own failures? self-awareness via internal circuits. *arXiv preprint arXiv:2512.20578*, 2025. URL [https://arxiv.org/abs/2512.20578](https://arxiv.org/abs/2512.20578).
- Grattafiori et al. [2024] Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, et al. The Llama 3 herd of models. *arXiv preprint arXiv:2407.21783*, 2024. URL [https://arxiv.org/abs/2407.21783](https://arxiv.org/abs/2407.21783).
- Han et al. [2025] Jinyi Han, Xinyi Wang, Haiquan Zhao, Tingyun Li, Zishang Jiang, Sihang Jiang, Jiaqing Liang, Xin Lin, Weikang Zhou, Zeye Sun, Fei Yu, and Yanghua Xiao. A stitch in time saves nine: Proactive self-refinement for language models. *arXiv preprint arXiv:2508.12903*, 2025. URL [https://arxiv.org/abs/2508.12903](https://arxiv.org/abs/2508.12903).
- Hendrycks et al. [2021a] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In *International Conference on Learning Representations*, 2021a. URL [https://arxiv.org/abs/2009.03300](https://arxiv.org/abs/2009.03300).
- Hendrycks et al. [2021b] Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the MATH dataset. In *Advances in Neural Information Processing Systems*, 2021b. URL [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874).
- Huang et al. [2024] Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In *International Conference on Learning Representations*, 2024. URL [https://arxiv.org/abs/2310.01798](https://arxiv.org/abs/2310.01798).
- Knappe et al. [2024] Tim Knappe, Ryan Li, Ayush Chauhan, Kaylee Chhua, Kevin Zhu, and Sean O’Brien. Semantic self-consistency: Enhancing language model reasoning via semantic weighting, 2024. URL [https://arxiv.org/abs/2410.07839](https://arxiv.org/abs/2410.07839).
- Kwon et al. [2023] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with PagedAttention. In *Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles*, 2023. URL [https://arxiv.org/abs/2309.06180](https://arxiv.org/abs/2309.06180).
- Lhoest et al. [2021] Quentin Lhoest, Albert Villanova del Moral, Yacine Jernite, Abhishek Thakur, Patrick von Platen, Suraj Patil, Julien Chaumond, Mariama Drame, Julien Plu, Lewis Tunstall, Joe Davison, Mario Šaško, Gunjan Chhablani, Bhavitvya Malik, Simon Brandeis, Teven Le Scao, Victor Sanh, Canwen Xu, Nicolas Patry, Angelina McMillan-Major, Philipp Schmid, Sylvain Gugger, Clément Delangue, Théo Matussière, Lysandre Debut, Stas Bekman, Pierric Cistac, Thibault Goehringer, Victor Mustar, François Lagunas, Alexander M. Rush, and Thomas Wolf. Datasets: A community library for natural language processing. In *Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 175–184, Online and Punta Cana, Dominican Republic, 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.emnlp-demo.21. URL [https://aclanthology.org/2021.emnlp-demo.21/](https://aclanthology.org/2021.emnlp-demo.21/).
- Li et al. [2024] Qintong Li, Leyang Cui, Xueliang Zhao, Lingpeng Kong, and Wei Bi. GSM-plus: A comprehensive benchmark for evaluating the robustness of LLMs as mathematical problem solvers. In *Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)*, pages 2961–2984, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.acl-long.163. URL [https://aclanthology.org/2024.acl-long.163/](https://aclanthology.org/2024.acl-long.163/).
- Li et al. [2025] Xianzhi Li, Ethan Callanan, Abdellah Ghassel, and Xiaodan Zhu. Entropy-gated branching for efficient test-time reasoning. *arXiv preprint arXiv:2503.21961*, 2025. URL [https://arxiv.org/abs/2503.21961](https://arxiv.org/abs/2503.21961).
- Liang et al. [2025] Zhenwen Liang, Ruosen Li, Yujun Zhou, Linfeng Song, Dian Yu, Xinya Du, Haitao Mi, and Dong Yu. CLUE: Non-parametric verification from experience via hidden-state clustering, 2025. URL [https://arxiv.org/abs/2510.01591](https://arxiv.org/abs/2510.01591).
- Liang et al. [2026] Zhixiang Liang, Beichen Huang, Zheng Wang, and Minjia Zhang. Hidden states as early signals: Step-level trace evaluation and pruning for efficient test-time scaling. *arXiv preprint arXiv:2601.09093*, 2026. URL [https://arxiv.org/abs/2601.09093](https://arxiv.org/abs/2601.09093).
- Lightman et al. [2024] Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In *International Conference on Learning Representations*, 2024. URL [https://arxiv.org/abs/2305.20050](https://arxiv.org/abs/2305.20050).
- Madaan et al. [2023] Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In *Advances in Neural Information Processing Systems*, 2023. URL [https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html](https://proceedings.neurips.cc/paper_files/paper/2023/hash/91edff07232fb1b55a505a9e9f6c0ff3-Abstract-Conference.html).
- Mirzadeh et al. [2024] Iman Mirzadeh, Keivan Alizadeh, Hooman Shahrokhi, Oncel Tuzel, Samy Bengio, and Mehrdad Farajtabar. GSM-Symbolic: Understanding the limitations of mathematical reasoning in large language models. *arXiv preprint arXiv:2410.05229*, 2024. URL [https://arxiv.org/abs/2410.05229](https://arxiv.org/abs/2410.05229).
- Oh and Lee [2025] Jungsuk Oh and Jay-Yoon Lee. Latent self-consistency for reliable majority-set selection in short- and long-answer reasoning, 2025. URL [https://arxiv.org/abs/2508.18395](https://arxiv.org/abs/2508.18395).
- Paszke et al. [2019] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Köpf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An imperative style, high-performance deep learning library. In *Advances in Neural Information Processing Systems*, 2019. URL [https://arxiv.org/abs/1912.01703](https://arxiv.org/abs/1912.01703).
- Shi et al. [2026] Zhenning Shi, Yijia Zhu, Junhan Shi, Xun Zhang, Lei Wang, and Congcong Miao. Internalizing LLM reasoning via discovery and replay of latent actions. *arXiv preprint arXiv:2602.04925*, 2026. URL [https://arxiv.org/abs/2602.04925](https://arxiv.org/abs/2602.04925).
- Shinn et al. [2023] Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik R. Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In *Advances in Neural Information Processing Systems*, 2023. URL [https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html](https://proceedings.neurips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html).
- Snell et al. [2025] Charlie Victor Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling parameters for reasoning. In *International Conference on Learning Representations*, 2025. URL [https://openreview.net/forum?id=4FWAwZtd2n](https://openreview.net/forum?id=4FWAwZtd2n).
- Vasudev et al. [2026] Rakshith Vasudev, Melisa Russak, Dan Bikel, and Waseem Alshikh. Accurate failure prediction in agents does not imply effective failure prevention. *arXiv preprint arXiv:2602.03338*, 2026. URL [https://arxiv.org/abs/2602.03338](https://arxiv.org/abs/2602.03338).
- Wang et al. [2023] Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V. Le, Ed H. Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In *International Conference on Learning Representations*, 2023. URL [https://arxiv.org/abs/2203.11171](https://arxiv.org/abs/2203.11171).
- Wei et al. [2022] Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. Chain-of-thought prompting elicits reasoning in large language models. In *Advances in Neural Information Processing Systems*, 2022. URL [https://arxiv.org/abs/2201.11903](https://arxiv.org/abs/2201.11903).
- Wolf et al. [2020] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In *Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations*, pages 38–45, Online, 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-demos.6. URL [https://aclanthology.org/2020.emnlp-demos.6/](https://aclanthology.org/2020.emnlp-demos.6/).
- Yang et al. [2025] An Yang, Anfeng Li, Baosong Yang, et al. Qwen3 technical report. *arXiv preprint arXiv:2505.09388*, 2025. URL [https://arxiv.org/abs/2505.09388](https://arxiv.org/abs/2505.09388).
- Zhang et al. [2024] Yunxiang Zhang, Muhammad Khalifa, Lajanugen Logeswaran, Jaekyeom Kim, Moontae Lee, Honglak Lee, and Lu Wang. Small language models need strong verifiers to self-correct reasoning. In *Findings of the Association for Computational Linguistics: ACL 2024*, pages 15637–15653, Bangkok, Thailand, 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-acl.924. URL [https://aclanthology.org/2024.findings-acl.924/](https://aclanthology.org/2024.findings-acl.924/).

## Appendix A Experimental program overview table

Table 3: Empirical program summary across all major experiment families. “Acc. gain range” is the observed range of accuracy change over the in-family raw consensus baseline (or over greedy for the consensus row). Unless a row states otherwise, representative numbers are from Qwen3-4B on the GSM8K test split. Most correction routes fail; the positive results concentrate in additive basin arbitration.

## Appendix B Symbol reference for setup and method

Table 4: Symbols used in the problem setup.

Table 5: Method evidence symbols used by Arbiter-$\Delta$. Each count is computed per question by parsing same-model outputs and assigning the final answer to an observed basin. “Dominant” means the raw-consensus basin $B_1$; “challenger” means an alternative observed basin $B_r$.

## Appendix C Derivation of the additive Delta score

This appendix records the algebra behind the additive score in Eq. (M2). The proposition is only a score-form identity; it is not a theorem that Arbiter-$\Delta$ strictly improves accuracy. Accuracy improvement is an empirical claim reported in Section [5.1](https://arxiv.org/html/2605.26172#S5.SS1).

###### Proposition C.1 (Pairwise log-linear source pooling)

Fix a question $q$, the dominant basin $B_1$, the leading challenger $B_2$, raw supports $b_1, b_2$, framed counts $f_1, f_2$, guided counts $g_1, g_2$, a pseudo-count $\alpha > 0$, and reliability factors $r_f(q), r_g(q) \in [0,1]$. For $j \in \{1,2\}$ define

$$\widetilde{w}_j(q) = (b_j+\alpha)\,(f_j+\alpha)^{r_f(q)}\,(g_j+\alpha)^{r_g(q)}.$$

Then

$$\log\frac{\widetilde{w}_2(q)}{\widetilde{w}_1(q)} = \log\frac{b_2+\alpha}{b_1+\alpha} + r_f(q)\log\frac{f_2+\alpha}{f_1+\alpha} + r_g(q)\log\frac{g_2+\alpha}{g_1+\alpha} = \Delta_2(q).$$

Therefore Eq. (M3) selects the challenger exactly when its pooled support exceeds the dominant basin’s pooled support.

#### Proof\.

Taking the ratio of the two definitions gives

$$\frac{\widetilde{w}_2(q)}{\widetilde{w}_1(q)} = \frac{b_2+\alpha}{b_1+\alpha}\left(\frac{f_2+\alpha}{f_1+\alpha}\right)^{r_f(q)}\left(\frac{g_2+\alpha}{g_1+\alpha}\right)^{r_g(q)}.$$

Taking logarithms converts the product into the additive score in Eq. (M2). The sign rule in Eq. (M3) is therefore equivalent to comparing $\widetilde{w}_2(q)$ with $\widetilde{w}_1(q)$. $\square$
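The identity can be checked numerically with a short sketch. The function names `pooled_weight` and `delta_score` are hypothetical, the counts are invented for illustration, and $\alpha = 1$ is an assumed value (the proposition only requires $\alpha > 0$).

```python
import math


def pooled_weight(b: int, f: int, g: int, alpha: float, r_f: float, r_g: float) -> float:
    """Pooled support (b + alpha) * (f + alpha)^r_f * (g + alpha)^r_g for one basin."""
    return (b + alpha) * (f + alpha) ** r_f * (g + alpha) ** r_g


def delta_score(raw: tuple[int, int], framed: tuple[int, int], guided: tuple[int, int],
                alpha: float = 1.0, r_f: float = 1.0, r_g: float = 1.0) -> float:
    """Additive score for the challenger; each tuple is (dominant, challenger)."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    if not (0.0 <= r_f <= 1.0 and 0.0 <= r_g <= 1.0):
        raise ValueError("reliability factors must lie in [0, 1]")
    (b1, b2), (f1, f2), (g1, g2) = raw, framed, guided
    return (math.log((b2 + alpha) / (b1 + alpha))
            + r_f * math.log((f2 + alpha) / (f1 + alpha))
            + r_g * math.log((g2 + alpha) / (g1 + alpha)))


# Hypothetical question: the challenger loses the raw vote 9 to 13
# but wins the framed (6 vs 2) and guided (5 vs 1) evidence.
raw, framed, guided = (13, 9), (2, 6), (1, 5)
delta = delta_score(raw, framed, guided, alpha=1.0, r_f=0.5, r_g=0.25)
w1 = pooled_weight(13, 2, 1, 1.0, 0.5, 0.25)
w2 = pooled_weight(9, 6, 5, 1.0, 0.5, 0.25)
print(delta, math.log(w2 / w1), delta > 0)
```

The two printed numbers agree up to floating-point error, and the sign of the score matches the comparison of the two pooled weights, which is the content of the proposition.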

This identity is Bayes\-motivated because likelihood ratios also add in log space, but Eq\. \(M2\) should not be read as a calibrated Bayesian posterior over all auxiliary samples\. It is a log\-linear pooling rule over source\-level empirical opinions\. A posterior update over independent individual auxiliary trials would generally scale with the number of trials in each source; the implemented score deliberately avoids that scaling because all sources come from the same frozen model and can share correlated mistakes\.
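The difference between source-level pooling and a per-trial update can be illustrated with a toy comparison. Both functions below are hypothetical: `source_opinion` is the pooled log-ratio term from the proposition, while `per_trial_update` is a deliberately naive stand-in for an independent-trials posterior in which every auxiliary trial adds or subtracts a fixed log-likelihood ratio (an assumed model, not one used in the paper).

```python
import math


def source_opinion(c1: int, c2: int, alpha: float = 1.0, reliability: float = 1.0) -> float:
    """Source-level term: reliability * log((c2 + alpha) / (c1 + alpha))."""
    return reliability * math.log((c2 + alpha) / (c1 + alpha))


def per_trial_update(c1: int, c2: int, llr_per_trial: float) -> float:
    """Naive independent-trials update: each trial contributes +/- llr_per_trial."""
    return (c2 - c1) * llr_per_trial


small = (2, 6)    # hypothetical auxiliary counts (dominant, challenger)
large = (20, 60)  # same 1:3 split with a ten times larger budget
for c1, c2 in (small, large):
    print(c1, c2, round(source_opinion(c1, c2), 3), per_trial_update(c1, c2, 0.5))
```

With the split held at 1:3, the pooled term stays below $\log 3$ no matter how many auxiliary samples are drawn, whereas the per-trial sum grows tenfold with the tenfold budget. That is the scaling the score is designed to avoid when the trials may share correlated mistakes.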

## Appendix D Full empirical ledger

Table 6 records the broader project history. The main paper focuses on the cleanest comparisons, while this appendix preserves the full set of positive, negative, calibrated, hard-slice, and exploratory results.

Table 6: Full empirical ledger. “Clean” means gold labels are used only after prediction for evaluation. “Gold-tuned” means a configuration was selected using target labels and is therefore reported only as an exploratory upper bound.

| Phase | Method | Recorded result | Interpretation | Status |
| --- | --- | --- | --- | --- |
| Early selector | Consensus × robustness / resolve | Robustness could zero out correct clusters despite high support. | Misaligned confidence signal. | Negative |
| Raw consensus | Raw $K{=}24$ answer clustering | Typical GSM8K full-test accuracy ≈ 94.3–94.7%; gold-in-pool about 98%. | Stable backbone. | Clean |
| Generation-time control | Probe-delta + conservative generation-time control | About 91.96% greedy-path accuracy. | Best generation-time branch, still below consensus. | Clean |
| Hidden-state trajectory rerank | Hidden-state coherence rerank | 94.92%. | Small guarded gain; fragile. | Exploratory |
| CGAC / token-1 FFN | Token-1 FFN activation bias and layer interventions | Real bias effects resembled or underperformed random controls. | Token-1 FFN not reliable. | Negative |
| Shared-prefix branch | Branch-window hidden-state detection and patching | Branch divergence detectable; forcing 1–5 branch tokens and one-cell patching weak. | Detectability ≠ controllability. | Negative |
| Continuation steering | Coherent continuation donor insertion | Clear dose-response steering between basins. | Strong mechanistic evidence. | Mechanistic |
| Steerability selector | Convert steerability into a selector score | AUC ≈ 0.69–0.73; best selector about +1/162 qids. | Steering signal too weak for deployment. | Weak positive |
| Local divergence scoring | Local divergence contrastive scoring | About 0.7284 on 162-qid slice; about +2 qids over consensus with ∼52% coverage. | Slice signal only. | Slice-only |
| Self-review sessions | Solve → review → final consensus | Stable reruns were negative: raw 94.69% → reviewed 93.10%. An earlier self-review run showed 94.77% → 95.00% but is treated as non-replicated. | Broad self-review is not a stable correction method. | Negative / unstable |
| Historical guarded sweep | Post-hoc guarded rerank | 95.22%; 7 overrides, 6 recovered, 0 degraded. | Historical upper bound selected by gold-evaluated grid. | Gold-tuned |
| Top-up merge | Generate extra sessions for 593 ambiguous qids and merge votes | Historical raw/final/guarded changed from 94.77/95.00/95.22 to 94.24/94.54/94.31. | Extra votes diluted correct majorities. | Negative |
| Raw-trace review | Sequential review over raw traces | Best 32k solve-at-end known-else-c1 70.87% vs 76.96% consensus. | Raw-trace review underperforms. | Negative |
| Answer-memo review | Mutual memos with answers | Pair/triad reviews ≈ 63–66.5% vs 74.71% consensus. | Summaries lose decisive evidence. | Negative |
| Answer-only review | Question + candidate answers, two-turn solve/review | Harder slice: 64.71% → 70.59%. | Good hard-slice solver. | Slice-only positive |
| Local branch packets | Branch-local packetization and answer review | 5% packetized review: slice 61.7% → 66.0%, full-test net +1. Swapped-only tiny stratum reached 100% on 3 packets. | Useful stratum detection, low coverage. | Slice-only |
| Basin tournament | Basin principle cards and pairwise judging | Full 217 multi-basin qids: 10 overrides, 1 recovered, 6 degraded, net −5. | Direct basin judging unreliable. | Negative |
| LLM basin baseline | Basin summaries without encoder | Without support guard: 42 overrides, 7 recovered, 29 degraded, net −22. | Unsafe selector. | Negative |
| Basin encoder Task A | Trajectory encoder for difficulty / abstention | 50% coverage → 98.79% kept-set accuracy. | Strong risk ranking. | Clean, not full coverage |
| Basin encoder Task B | Encoder-filtered review correction | Only 5/362 review candidates rescued consensus errors; accept-any-review net −46. | Review source is poor. | Negative |
| Cluster GNN/router | Graph router over basin structure | Qwen3-4B GSM8K consensus 94.54%; top-2 oracle 97.27%; router overrides net negative. | Structure learned, correctness not learned. | Negative |
| Framing-first | Framing-prompt generation as replacement | Flat framed replacement 94.84%; style-balanced 94.31%; style+temp 93.71%. | Replacement framing is fragile rather than the main deployable selector. | Fragile |
| Framing additive | Basin frames as side evidence | 94.54% → 94.84%; 16 overrides, 9 recovered, 5 degraded. | Framing works additively. | Clean positive |
| Arbiter-$\Delta$ | Additive log-linear evidence over raw plus the fixed framed+guided source set | 3×3 matrix: 8 accuracy-positive, 1 neutral; count-level net non-negative in all cells. | Main clean method. | Clean positive |
| Arbiter-Enc | Framing-aware hidden-state residual | Fixed: −0.1 to +0.5 mean extra gain. Calibrated: +0.8 to +1.5. | Optional residual variant. | Fixed / calibrated |

Table [7](https://arxiv.org/html/2605.26172#A4.T7) reports the consensus and oracle headroom table for the main evaluation. Method comparisons use these same in-family raw-consensus baselines for the evaluation matrix.

Table 7: Consensus and oracle headroom for the main evaluation matrix. O@k is a diagnostic same-pool oracle over the top k answer basins; count columns report additional correct examples over raw consensus.

Table 8: GSM8K raw-only sampling-budget diagnostic. The $K=56$ rows use ordinary raw samples only, without framed, guided, panel, or encoder evidence. This diagnostic is included to show that the GSM8K gains are not explained by simply drawing more ordinary samples, but it is not a strict compute-matched full-matrix baseline.
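As a hedged illustration of the O@k diagnostic named in the Table 7 caption, the sketch below checks whether the gold answer lies among the k most supported answer basins of one sample pool, and counts the examples an oracle would add over raw consensus. The function names, the toy pool, and the tie-breaking rule (first-seen order) are assumptions made for illustration, not the paper's released code.

```python
from collections import Counter

def oracle_at_k(answers, gold, k):
    """True if the gold answer is among the k largest basins of this pool."""
    top_k = [ans for ans, _ in Counter(answers).most_common(k)]
    return gold in top_k

def extra_correct_over_consensus(pools, golds, k):
    """Count questions that O@k gets right but raw consensus (k=1) misses."""
    return sum(
        oracle_at_k(pool, gold, k) and not oracle_at_k(pool, gold, 1)
        for pool, gold in zip(pools, golds)
    )

# Toy pool of 24 normalized answers (illustrative values only).
pool = ["18"] * 18 + ["20"] * 4 + ["7"] * 2
print(oracle_at_k(pool, "20", 1))  # False: the gold basin is outvoted
print(oracle_at_k(pool, "20", 2))  # True: the gold basin is the runner-up
```

Because the oracle reads gold labels, it is a headroom measurement only and cannot be used as a selector.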
## Appendix E Self-review and top-up results

Table 9: Self-review audit. The stable conclusion is negative: broad review reduces accuracy and increases fragmentation. The historical self-review improvement is listed for completeness but is not used as the main stable evidence.

Table 10: Earlier self-review session run details on GSM8K. This run is kept as an audit record, not as the stable self-review claim.

Table 11: Historical post-hoc guarded sweep. This is gold-tuned unless the selected configuration is frozen on a separate validation split and retested.

Table 12: Targeted top-up merge. Extra sessions were generated for 593 ambiguous qids and merged into the vote.
## Appendix F Cluster review and answer-only review details

Table 13: Cluster review and answer-review results. These experiments show that explicit cluster judging and abstract memos usually underperform consensus, while answer-only review can help on selected hard slices.
## Appendix G Local-branch packetization

Table 14: Local-branch packetized review. Results are useful mainly as stratum detection, not as a full-coverage method.

Table 15: Swapped-only local-branch stratum. This tiny stratum carried concentrated rescue signal.
## Appendix H Basin encoder diagnostics

Our original basin encoder was useful for risk ranking but not for correction. This motivates the revised Arbiter-Enc design, where the encoder is used only as a bounded trajectory residual.

Table 16: Basin encoder training and data summary.

Table 17: Basin encoder Task A: difficulty / abstention. Kept-set accuracy is reported after abstaining on the riskiest examples; the old ambiguous “Lift” column is omitted.

Table 18: Basin encoder Task B: review filtering. The generated review candidates are poor rescue sources.

The key ceiling result is that only 5 of 362 review candidates produced the correct answer for a consensus error. Therefore, our old encoder is not a truth selector. Its useful role is to rank risk and, in the new method, provide a bounded residual over framing evidence.
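The kept-set accuracy reported for Task A can be read as a coverage/accuracy trade-off: rank questions by predicted risk, abstain on the riskiest, and score consensus on the rest. The sketch below is a minimal version of that computation under stated assumptions; the function name, the rounding rule for the kept-set size, and the toy risk scores are hypothetical and do not come from the paper's code.

```python
def kept_set_accuracy(risk, correct, coverage):
    """Accuracy on the lowest-risk `coverage` fraction of questions.

    risk: per-question risk scores (higher means riskier).
    correct: per-question booleans, True if consensus was right.
    coverage: fraction of questions kept, in (0, 1].
    """
    n_keep = int(round(coverage * len(risk)))
    if n_keep == 0:
        raise ValueError("coverage too small: no questions kept")
    # Sort question indices from safest to riskiest and keep the safest ones.
    order = sorted(range(len(risk)), key=lambda i: risk[i])
    kept = order[:n_keep]
    return sum(correct[i] for i in kept) / n_keep

# Toy data (illustrative only): the two riskiest questions hold the one error.
risk = [0.1, 0.9, 0.2, 0.8]
correct = [True, False, True, True]
print(kept_set_accuracy(risk, correct, 0.5))  # 1.0
print(kept_set_accuracy(risk, correct, 1.0))  # 0.75
```

A good risk ranker raises kept-set accuracy as coverage falls, which is a ranking property and does not by itself identify the correct answer for the abstained questions.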

## Appendix I Cluster-GNN and router diagnostics

The cluster\-GNN/router branch is reported as a negative result\. The graph model learned structural objectives, but its score did not correlate with correction quality\.

Table 19: Cluster-GNN run summary on Qwen3-4B GSM8K.

Table 20: Cluster-GNN diagnostic feature separation. Recoverable means consensus is wrong and the challenger is correct. Degradable means consensus is correct and the challenger is wrong.

Simple support geometry separated recoverable and degradable cases better than the learned GNN score. This is why the graph/router branch motivates the framing pivot rather than serving as the main positive selector.
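The override counts quoted throughout the appendices (overrides, recovered, degraded, net) follow directly from the recoverable/degradable definitions above. The sketch below shows one way to tally them; the `Decision` record and `audit` function are hypothetical names introduced for illustration, and the four toy decisions are not data from the paper.

```python
from dataclasses import dataclass

@dataclass
class Decision:
    consensus: str  # answer chosen by raw majority vote
    selected: str   # answer chosen by the selector under audit
    gold: str       # reference answer, used only for evaluation

def audit(decisions):
    """Tally overrides and their effect relative to raw consensus."""
    overrides = [d for d in decisions if d.selected != d.consensus]
    # Recovered: consensus was wrong and the override is correct.
    recovered = sum(d.consensus != d.gold and d.selected == d.gold for d in overrides)
    # Degraded: consensus was correct and the override is wrong.
    degraded = sum(d.consensus == d.gold and d.selected != d.gold for d in overrides)
    return {
        "overrides": len(overrides),
        "recovered": recovered,
        "degraded": degraded,
        "net": recovered - degraded,
    }

toy = [
    Decision("18", "18", "18"),  # no override
    Decision("18", "20", "20"),  # recovered
    Decision("18", "20", "18"),  # degraded
    Decision("18", "20", "7"),   # override, wrong before and after
]
print(audit(toy))
```

Overrides that move between two wrong answers count toward the override total but not toward the net, which is why reported override counts can exceed recovered plus degraded.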

## Appendix J Detailed framing and delta results

### J.1 Replacement versus additive framing

Table 21: Replacement versus additive framing evidence on Qwen3-4B GSM8K. Framing information is safest when accumulated as side evidence; direct replacement can be positive in a narrow flat variant but is fragile under style/temperature changes and unsafe for panel/guided replacement.

### J.2 Qwen3-4B GSM8K additive delta policies

Table 22: Qwen3-4B GSM8K additive delta source-set policies. Baseline raw consensus is $94.54\%$. The fixed main policy is framed+guided with top-2-mass reliability; panel/order variants are ablations and are not used for the headline matrix.

### J.3 Answer-visible frame panel ablation

Table 23: Answer-visible ablation on Qwen3-4B GSM8K. Showing candidate answers to the panel improves the best policies in this run.

### J.4 Full delta matrix

Table 24: Full Arbiter-$\Delta$ matrix for the fixed framed+guided parameter-free policy.

## Appendix K Arbiter-Enc gain ranges

Table 25: Mean extra gain range of Arbiter-Enc over Arbiter-$\Delta$. The endpoints summarize saved fixed/no-tune and calibrated residual variants, not random-seed confidence intervals.

Table 26: Cell-level extra gain ranges for Arbiter-Enc over Arbiter-$\Delta$. The endpoints summarize saved residual variants, not random-seed confidence intervals.
## Appendix L Reliability tiers

Table 27: Reliability tiers for reported results. This prevents clean zero-external-information results from being confused with calibrated, hard-slice, or gold-tuned exploratory results.
## Appendix M Worked Arbiter-$\Delta$ calculation

This appendix gives a complete arithmetic example for Eq. (M2). Consider the wrong-majority case illustrated in Figure [1](https://arxiv.org/html/2605.26172#S4.F1). The dominant basin has raw support $b_{1}=18$ and the leading challenger has raw support $b_{2}=4$. The auxiliary evidence counts are $f_{2}=13, f_{1}=8$ for framed-pool solves, $p_{2}=9, p_{1}=2$ for panel trials, and $g_{2}=4, g_{1}=0$ for guided re-solves. The main policy uses the framed+guided sources $F$ and $G$. With the fixed Laplace smoothing constant $\alpha=1$ and unit reliability for this illustrative calculation,

$$\Delta_{2}=\log\frac{4+1}{18+1}+\log\frac{13+1}{8+1}+\log\frac{4+1}{0+1}. \tag{A1}$$

The three terms are $-1.335$, $0.442$, and $1.609$, so $\Delta_{2}=0.716>0$. The raw prior favors the dominant basin, but the framed and guided same-model evidence streams more than compensate for that prior, so Eq. (M3) selects the challenger. A full-source ablation would add the panel term $\log((9+1)/(2+1))=1.204$. In actual main-policy runs, the same calculation includes $r_{f}(q)$ and $r_{g}(q)$ from Eq. (M1); source-set ablations with panels also use $\rho_{P,r}$ from Eq. (M1p). These factors reduce the contribution of fragmented or order-sensitive sources.
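The arithmetic above can be checked with a few lines of Python. This is a minimal sketch of the unit-reliability case only: the reliability factors $r_{f}(q)$, $r_{g}(q)$, and $\rho_{P,r}$ are not modeled because their definitions are not reproduced in this appendix, and the helper names `smoothed_log_ratio` and `arbiter_delta` are hypothetical.

```python
import math

ALPHA = 1.0  # fixed Laplace smoothing constant from the worked example

def smoothed_log_ratio(challenger, dominant, alpha=ALPHA):
    """Log ratio of smoothed challenger support to smoothed dominant support."""
    return math.log((challenger + alpha) / (dominant + alpha))

def arbiter_delta(sources, alpha=ALPHA):
    """Sum the smoothed log ratios over (challenger, dominant) count pairs."""
    return sum(smoothed_log_ratio(c, d, alpha) for c, d in sources)

raw = (4, 18)     # b_2, b_1
framed = (13, 8)  # f_2, f_1
guided = (4, 0)   # g_2, g_1
panel = (9, 2)    # p_2, p_1 (ablation only)

delta_main = arbiter_delta([raw, framed, guided])
print(round(delta_main, 3))                   # 0.716
print(round(smoothed_log_ratio(*panel), 3))   # 1.204
print("challenger" if delta_main > 0 else "dominant")  # challenger
```

The sign of the sum decides the selection in this sketch: a positive value means the combined same-model evidence outweighs the raw majority prior.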

## Appendix N Prompt-template summary

Full prompt strings are included in the supplementary code package ZIP. This appendix summarizes the prompt families so that the evidence budget is understandable without opening the code. The GSM8K raw-consensus script uses four ordinary solve templates: step-by-step solving with a required `#### <number>` final line, an equation-first variant, a constraints/backwards-reasoning variant, and a solve-then-arithmetic-self-check variant. The script samples across these templates and temperatures, then clusters by the extracted final answer.
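To make the clustering step concrete, the sketch below groups sampled generations into answer basins keyed by a normalized final answer and returns the raw-consensus answer. It is an illustration under assumptions, not the supplementary script: the regular expression, the normalization through `Fraction`, the handling of unparseable samples, and the first-seen tie-breaking are all choices made here.

```python
import re
from collections import Counter
from fractions import Fraction

# Matches the numeric part of a "#### <number>" final line.
FINAL_LINE = re.compile(r"####\s*(-?[\d,]*\.?\d+)")

def extract_answer(text):
    """Return a normalized answer string, or None if no final line is found."""
    matches = FINAL_LINE.findall(text)
    if not matches:
        return None
    raw = matches[-1].replace(",", "")  # drop thousands separators
    return str(Fraction(raw))           # "18.0" and "18" normalize to "18"

def basins(samples):
    """Count how many samples land in each normalized-answer basin."""
    counts = Counter()
    for text in samples:
        answer = extract_answer(text)
        if answer is not None:
            counts[answer] += 1
    return counts

def consensus(samples):
    """Majority-vote answer; ties resolve to the basin seen first."""
    counts = basins(samples)
    return counts.most_common(1)[0][0] if counts else None

samples = ["... so 18.\n#### 18", "check: 9*2\n#### 18.0", "#### 1,200", "no final line"]
print(basins(samples))
print(consensus(samples))  # 18
```

Normalizing before counting matters because two surface forms of the same number would otherwise split one basin into two and weaken its vote.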

The framing\-basin runs add three families\. First, framed\-pool prompts ask the model to identify the target quantity, entities, units, and operation pattern before solving\. Second, panel prompts compare two basin interpretations side by side, then ask the same frozen model to solve fresh\. Third, guided prompts give one basin interpretation as a hypothesis and ask the same model to verify or reject it by re\-solving\. For MMLU\-HS\-Math, prompts require a final A/B/C/D answer; for MATH\-500, prompts require a boxed mathematical answer\. The supplementary ZIP contains the exact prompt files and scripts used to instantiate these families\.

## Appendix O Three-seed replication

Table [28](https://arxiv.org/html/2605.26172#A15.T28) reports a related Qwen3-4B GSM8K Arbiter-$\Delta$ source-set diagnostic over three full random seeds. The varying factor is the sampling/evidence-generation seed; the model, dataset split, parser, and policy family are held fixed within this diagnostic. The main matrix in Table [2](https://arxiv.org/html/2605.26172#S5.T2) uses the fixed framed+guided policy, so this table is a seed-variability reference rather than a formal confidence interval for all main-policy cells.

Table 28: Three-seed replication table for a related Qwen3-4B GSM8K Arbiter-$\Delta$ source-set diagnostic. Error bars in the final row are 1-sigma sample standard deviations over the three full runs.
## Appendix P Reproducibility, compute, assets, and broader impacts

Reproducibility. All main experiments use frozen public instruction-tuned models, public benchmark test splits, deterministic answer extraction/normalization rules, and explicit evidence budgets. The key reproducibility parameters are the model identifier, dataset split, raw-pool size $K$, prompt family, sampling temperatures, parser/equivalence rule, random seed, and the number of framed, panel, and guided evidence trials. Gold answers are used only after prediction for evaluation. The supplementary code package includes scripts, configs, prompt files, summary artifacts, a README with exact commands, and a package-version lock file.

Seed variability. Table [28](https://arxiv.org/html/2605.26172#A15.T28) reports the three-seed source-set diagnostic for Qwen3-4B GSM8K. The reported error bars are 1-sigma sample standard deviations over three full runs with different sampling/evidence seeds. We use this as a seed-variability check rather than as a formal hypothesis test for the main matrix.

Compute environment. Experiments were run on RunPod H100 GPU instances using the container `runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404` (RunPod PyTorch 2.8.0, CUDA 12.8, Ubuntu 24.04). We used vLLM-style batched inference [Kwon et al., [2023](https://arxiv.org/html/2605.26172#bib.bib12)], PyTorch [Paszke et al., [2019](https://arxiv.org/html/2605.26172#bib.bib22)], Hugging Face Transformers [Wolf et al., [2020](https://arxiv.org/html/2605.26172#bib.bib29)], and Hugging Face Datasets [Lhoest et al., [2021](https://arxiv.org/html/2605.26172#bib.bib13)]. Package versions were not updated beyond the April 2026 environment used for the reported runs; the exact installed versions should be provided in the supplementary `requirements_lock_pip_freeze.txt`.

Compute budget. A single model–dataset consensus cell records one greedy pass plus $K=24$ sampled generations per question; $K$ counts sampled generations, and the run manifest records `greedy_generated` and `greedy_in_consensus`. The main Arbiter-$\Delta$ policy additionally uses about 24 framed solves and 4 guided re-solves per question; source-set ablations with panels use about 12 additional panel trials. Non-encoder Arbiter-$\Delta$ experiments can usually be completed within hours on H100-class hardware; a full GSM8K cell typically takes roughly 8–12+ hours depending on model, decoding budget, and batching. Hidden-state encoder experiments are substantially more expensive because teacher-forced hidden-state extraction dominates runtime; end-to-end encoder runs can require several days. Preliminary negative experiments consumed more total compute than the final selector because they include self-review, top-up, graph-router, trajectory-encoder, and review-ablation branches.

Existing assets and licenses. Table [29](https://arxiv.org/html/2605.26172#A16.T29) lists the main existing assets. We do not redistribute model weights. The supplementary ZIP includes an `ASSET_LICENSES.md` file with the same information and any additional package licenses.

Table 29: Existing assets, license/terms, and use in the paper.

Broader impacts. The work improves selection among a model’s own sampled reasoning outputs. Potential positive impacts include more reliable mathematical reasoning and better auditing of why consensus fails. Potential negative impacts follow from improving the reliability of automated reasoning systems: users may over-trust generated answers, or the same selection machinery could improve harmful automated problem solving in domains outside the math benchmarks studied here. We mitigate these risks by reporting recovered/degraded counts, stating that the method is not a correctness verifier, and limiting claims to frozen-model inference on evaluated reasoning benchmarks.

## Appendix Q Additional trajectory-graph visualization

![Refer to caption](https://arxiv.org/html/2605.26172v1/x2.png)

Figure 2: Token-wise hidden-state kNN graph for the same example. Each panel shows a $k=3$ nearest-neighbor graph over hidden states of 24 sampled generations plus the separately recorded greedy anchor, projected to two dimensions. At early tokens all generations occupy one tight cloud; long-tail basins peel off first; the top two basins become distinguishable but remain partially overlapping. This supports the diagnostic role of trajectory graphs while explaining why hidden-state similarity alone is too graded to serve as a direct correctness selector.
## Appendix R Recorded data artifacts

Table [30](https://arxiv.org/html/2605.26172#A18.T30) lists the main artifact families recorded across the project. We include this table to make the empirical evidence auditable and reproducible.

Table 30: Artifact map.

## Appendix S Framing and delta artifact paths

Table 31: Framing-basin and delta artifacts produced per run or per cell.