When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference

arXiv cs.AI Papers

Summary

The paper proposes a delegation-based aggregator called Propagational Proxy Voting (PPV) that uses letter entropy and reasoning geometry to improve over majority voting for multi-sample LLM inference, achieving gains on MMLU-Pro without requiring gold labels or auxiliary training.

arXiv:2606.08098v1 Announce Type: new Abstract: Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV) yields an unsupervised consensus rule that beats majority on MMLU-Pro by +1.5 pp overall and +2.24 pp on the non-trivial subset (paired McNemar p ~ 1.0e-14, n = 8,099). Majority discards two free signals every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes two per-voter levers that consume exactly these signals: WHEN (how much weight a voter keeps on its own pick) and WHOM (how it splits the remainder across peers). We drive WHEN with letter entropy and WHOM with per-question-centered embedding cosine. The method needs no gold labels and no auxiliary training: per question, we partition 128 sampled generations into 16 groups, compute each group's letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10-6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine -0.02) while the 6-voter minority is tight (+0.26), so propagated delegation mass concentrates on the minority's answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation: no within-question ensemble of confidence modes closes the oracle gap.

Cached at: 06/09/26, 08:54 AM

# When Does Delegation Beat Majority? A Delegation-Based Aggregator for Multi-Sample LLM Inference
Source: [https://arxiv.org/html/2606.08098](https://arxiv.org/html/2606.08098)
Allen Song, MIT Media Lab, Cambridge, MA; Kent Larson, MIT Media Lab, Cambridge, MA. {yasushis, allen017, kll}@media.mit.edu

###### Abstract

Majority voting over sampled answers is the dominant unsupervised aggregator for multi-sample LLM inference. We show that piping the signals every sample carries into a delegation-based aggregator (Propagational Proxy Voting, PPV; Sakai et al., [2025](https://arxiv.org/html/2606.08098#bib.bib36)) yields an unsupervised consensus rule that beats majority on MMLU-Pro by $+1.5$ pp overall and $+2.24$ pp on the non-trivial subset (paired McNemar $p \approx 1.0 \times 10^{-14}$, $n = 8{,}099$). Majority discards two free signals that every sample carries: within-group letter entropy and between-group reasoning geometry. PPV exposes two per-voter levers that consume exactly these signals: When (how much weight a voter keeps on its own pick) and Whom (how it splits the remainder across peers). We drive When with letter entropy and Whom with per-question-centered embedding cosine. Our method needs no gold labels and no auxiliary training: per question, we partition 128 sampled generations into 16 groups, compute each group’s letter-level semantic entropy and reasoning embedding centroid, and feed both into a stochastic delegation matrix whose stationary distribution selects the consensus answer. We walk through an example in which PPV overturns a clear 10–6 majority for the wrong letter: the 10-voter majority cluster is geometrically incoherent (mean within-cluster cosine $-0.02$) while the 6-voter minority is tight ($+0.26$), so propagated delegation mass concentrates on the minority’s answer even though entropy alone would keep the majority ahead. We further report delegation strategies with negative results that constrain the design space for unsupervised LLM aggregation. No within-question ensemble of confidence modes closes the oracle gap.

## 1 Introduction

Sampling an LLM many times and aggregating the answers, known as *self-consistency* (Wang et al., [2023](https://arxiv.org/html/2606.08098#bib.bib1)), is now standard practice for reasoning tasks. The aggregator is almost always majority vote over the parsed answers: simple, model-agnostic, label-free, and a strong baseline. Yet each sample carries two signals beyond its parsed answer that majority discards:

1. Letter-level uncertainty. A group of $k$ samples that all reach the same letter is more informative than $k$ samples split $\lceil k/2 \rceil$ to $\lfloor k/2 \rfloor$. Letter entropy over a small group is a free per-group confidence signal, the MCQ-degenerate case of semantic entropy (Kuhn et al., [2023](https://arxiv.org/html/2606.08098#bib.bib15); Farquhar et al., [2024](https://arxiv.org/html/2606.08098#bib.bib16)).
2. Reasoning geometry. Embedding each sample’s reasoning text gives a high-dimensional position in semantic space. Two groups that pick the same letter via similar reasoning sit close in that space; two groups that pick the same letter via unrelated reasoning sit far apart. Majority does not take this into account.

![Refer to caption](https://arxiv.org/html/2606.08098v1/figures/network.png)

Figure 1: Simplified network of direct voting and delegation from problem `philosophy_314`. Each bot is a voter (a group of LLM samples). It keeps some weight on its own answer letter and routes the rest to peers whose reasoning embedding aligns. The full graph has 16 voters and majority picks the wrong letter 10 to 6; we draw 5 (3 majority + 2 minority) for legibility. A coherent minority cluster plus a defecting majority voter ($v_9$) absorb enough redirected mass to flip the consensus. The green letter I stands in for the gold answer, which PPV picked; the red D is the letter the majority collectively mispicked.

#### Approach.

We treat each group of samples as a *delegate* in a voting graph and aggregate via Propagational Proxy Voting (PPV; Sakai et al., [2025](https://arxiv.org/html/2606.08098#bib.bib36)), a liquid-democracy mechanism that propagates voting mass through an absorbing Markov chain. PPV exposes two levers per delegate, one for each of the discarded signals above:

- When: how much weight a delegate keeps on its own pick. We drive it with the group’s letter entropy: low entropy $\Rightarrow$ high self-weight.
- Whom: how the remaining weight is split across peer delegates. We drive it with per-question-centered embedding cosine: delegate to peers whose reasoning aligns.

The result is a parameter\-free, label\-free aggregator, once embeddings are precomputed\.

#### Findings\.

- On the full MMLU-Pro test split ($12{,}032$ questions $\times$ 128 Qwen3-1.7B samples), PPV with $\alpha = 1 - \hat{H}$ achieves $42.2\%$ versus majority’s $40.7\%$. On the $8{,}099$ non-trivial questions, PPV scores $30.2\%$ versus $28.0\%$: $+2.24$ pp, McNemar $p \approx 10^{-14}$.
- A When/Whom decomposition shows the entire gain comes from When: explicit peer-quality multipliers in Whom provide no lift, and in 11 of 25 tested configurations actively hurt. PPV’s multi-hop propagation does implicit *quality laundering*.
- Per-question centering of embeddings before computing cosines is essential: raw same-question cosines sit in $[+0.88, +0.99]$. Centering exposes a discriminative geometry with cosines spanning $[-0.68, +0.64]$.
- Three negative results: $P(\text{True})$ (Kadavath et al., [2022](https://arxiv.org/html/2606.08098#bib.bib22)) has area under the ROC curve (AUROC) $0.47$, i.e. it is anti-correlated with correctness; CoCoA-style products (Vashurin et al., [2025](https://arxiv.org/html/2606.08098#bib.bib9)) are dragged down by it; and no unsupervised ensemble of confidence modes closes the oracle gap, suggesting that bridging it requires a supervised polarity classifier rather than within-question features alone.

Our contributions are: (i) a concrete parameterization of PPV from per-LLM-sample signals, giving a recipe for using delegation networks as an unsupervised aggregator over sampled generations; (ii) a large-scale empirical study on MMLU-Pro at the full $12{,}032$-question scale, with paired McNemar significance; (iii) a When/Whom decomposition that isolates which lever delivers the gain, with a mechanistic worked example; and (iv) three negative results that constrain the design space for future unsupervised LLM aggregators.

## 2 Related Work

Our work sits at the intersection of six threads of prior work\. We survey each in turn and situate our contribution\.

### 2.1 Self-consistency and multi-sample aggregation

Wang et al. ([2023](https://arxiv.org/html/2606.08098#bib.bib1)) established sample-and-vote as the default unsupervised aggregator for chain-of-thought reasoning. The aggregator is plain plurality: generate many responses, extract the answer string from each, and return the most frequent one. Subsequent work has explored alternatives within this regime. Verifier-reranked voting (Cobbe et al., [2021](https://arxiv.org/html/2606.08098#bib.bib3)) scores candidates with a trained reward model before selecting. Generative approaches such as Universal Self-Consistency (USC; Chen et al. [2023](https://arxiv.org/html/2606.08098#bib.bib2)) prompt the LLM itself to nominate the most coherent candidate. Ranked voting methods such as instant-runoff voting, Borda count, and mean reciprocal rank have recently been applied to LLM self-consistency, yielding modest gains over plurality (Wang et al., [2025](https://arxiv.org/html/2606.08098#bib.bib4)).

The most directly concurrent work to ours is Pan et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib5)), who replace majority with aggregators that exploit first- and second-order correlations among model responses, evaluating on MMLU and UltraFeedback; we differ in using a single model’s repeated samples (not a panel of models), and in grounding the aggregator in a formal delegation mechanism with an explicit per-voter confidence parameterization.

Adaptive sampling methods reduce the sample budget without sacrificing accuracy: RASC (Wan et al., [2025](https://arxiv.org/html/2606.08098#bib.bib7)) trains a CoT-quality scoring function for early stopping and score-weighted voting; Aggarwal et al. ([2023](https://arxiv.org/html/2606.08098#bib.bib8)) explore heuristic stopping rules on the sample set. Cordero-Encinar and Duncan ([2025](https://arxiv.org/html/2606.08098#bib.bib6)) provide a theoretical foundation, deriving finite-sample concentration bounds that quantify how reliably majority vote recovers the mode of the model’s distribution, and introduce the Martingale Majority Certificate as a sequential stopping rule. These results characterize the regime our aggregator operates in but do not address *which* aggregation function to use once a sample budget is fixed.

### 2.2 Test-time compute scaling

Test-time compute (TTC) allocates additional inference compute to improve output quality. Snell et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib10)) survey the space, classifying methods into parallel (sample and aggregate) and sequential (iterative refinement, tree search). Best-of-N with a verifier (Cobbe et al., [2021](https://arxiv.org/html/2606.08098#bib.bib3); Lightman et al., [2023](https://arxiv.org/html/2606.08098#bib.bib14)) is the standard parallel baseline. Beam search and Monte Carlo Tree Search (Yao et al., [2023](https://arxiv.org/html/2606.08098#bib.bib13)) explore the sequential branch. Our work is parallel and unsupervised: we draw 128 samples and apply a richer aggregation function, with no additional rollouts and no verifier.

Muennighoff et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib11)) show that a simple budget-forcing approach (“wait” tokens) transfers reasoning compute into longer chains. Sharma and Chopra ([2025](https://arxiv.org/html/2606.08098#bib.bib12)) argue that inverse-entropy-weighted voting over sequentially refined outputs outperforms parallel majority at matched compute, connecting entropy-based weighting to the sequential paradigm. Our finding that $\alpha = 1 - \hat{H}$ (inverse entropy as confidence) delivers the full gain in the parallel regime is broadly consistent with the entropy-weighting intuition, while demonstrating it within the PPV delegation framework.

### 2.3 Semantic Entropy (SE) and uncertainty quantification

Kuhn et al. ([2023](https://arxiv.org/html/2606.08098#bib.bib15)) introduced semantic entropy: cluster generations by meaning equivalence (via NLI), then take the entropy of the cluster distribution. Farquhar et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib16)) extended the method for hallucination detection at scale (Nature 2024), showing that semantic entropy is a reliable unsupervised signal for factual reliability. For multiple-choice questions the NLI clustering degenerates to grouping by the extracted letter, which is the form we use; we additionally apply the Miller–Madow bias correction (Miller, [1955](https://arxiv.org/html/2606.08098#bib.bib21)).

Several recent papers extend or approximate semantic entropy. Kossen et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib17)) propose Semantic Entropy Probes (SEPs), which approximate semantic entropy from single-forward-pass hidden states, reducing the 5–10$\times$ inference overhead. Kernel Language Entropy (KLE; Nikitin et al. [2024](https://arxiv.org/html/2606.08098#bib.bib18)) generalizes SE to a kernel-based uncertainty measure that captures both intra-cluster spread and inter-cluster distance without hard partitions; Nguyen et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib19)) (ACL 2025) independently make a similar argument and propose SNNE, a nearest-neighbour entropy estimator with provable generalization over SE. We use letter-level semantic entropy, a low-overhead instantiation of this family tailored to multiple-choice; the centering transformation we apply to embeddings before computing cosines is similar in spirit to KLE’s inter-cluster term. A statistically consistent estimator of semantic uncertainty for open-ended generation is studied by Liu ([2025](https://arxiv.org/html/2606.08098#bib.bib20)), with formal guarantees on convergence.

### 2.4 Confidence calibration and self-verification

Kadavath et al. ([2022](https://arxiv.org/html/2606.08098#bib.bib22)) showed that LLMs can be prompted to estimate $P(\text{True})$ for their own outputs. CoCoA (Vashurin et al., [2025](https://arxiv.org/html/2606.08098#bib.bib9)) combines $P(\text{True})$ with semantic entropy multiplicatively. We evaluate $P(\text{True})$ on our setup and find it anti-correlated with correctness (AUROC $0.47$); CoCoA-style products inherit the anti-correlation and underperform $\hat{H}$ alone. We read this as model- and temperature-dependent: high-temperature generation from a 1.7B-parameter model produces confident-sounding-but-wrong outputs, and the same-model verifier rewards surface fluency rather than factual accuracy.

The broader calibration literature supports this view. Guo et al. ([2017](https://arxiv.org/html/2606.08098#bib.bib23)) document that standard training produces systematically over-confident models; temperature scaling corrects marginal calibration but not group-level calibration. Zhou et al. ([2026](https://arxiv.org/html/2606.08098#bib.bib24)) show that RL fine-tuning (DPO, PPO, GRPO) degrades calibration by exploiting reward advantage weighting; post-RL SFT restores it. The Qwen3-1.7B model we use is trained with RL, which offers a mechanistic explanation for why its $P(\text{True})$ is anti-correlated with correctness in our setting. Graph-based confidence calibration (Li et al., [2024](https://arxiv.org/html/2606.08098#bib.bib25)) uses similarity graphs over multiple responses and learns to correct miscalibration, a supervised approach that would require gold labels unavailable in our unsupervised regime. Lin et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib26)) study verbalized confidence and its calibration properties. The overall lesson across this literature is that reliable confidence signals for small RL-trained models at high temperature cannot be taken for granted; our negative result on $P(\text{True})$ is consistent with this.

### 2.5 Liquid democracy and propagational proxy voting

Liquid democracy (Ford, [2002](https://arxiv.org/html/2606.08098#bib.bib27)) is a voting framework in which agents may delegate their votes transitively. Its properties have been studied theoretically (Christoff and Grossi, [2017](https://arxiv.org/html/2606.08098#bib.bib28); Brill, [2018](https://arxiv.org/html/2606.08098#bib.bib29)), including the Condorcet-jury analysis of when delegation helps versus hurts relative to direct voting (Kahng et al., [2018](https://arxiv.org/html/2606.08098#bib.bib30)), the algorithmic perspective on optimal delegation (Brill et al., [2022](https://arxiv.org/html/2606.08098#bib.bib32)), and the game-theoretic analysis of rational delegation (Bloembergen et al., [2019](https://arxiv.org/html/2606.08098#bib.bib31)). Recent computational social-choice work continues to refine the picture: Alouf-Heffetz et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib33)) (IJCAI 2024) study manipulation of delegation graphs; Alouf-Heffetz et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib34)) (AAAI 2025) analyze the welfare cost of liquid mechanisms; and Bersetche ([2025](https://arxiv.org/html/2606.08098#bib.bib35)) (IJGT 2025) generalizes liquid democracy to multi-agent settings with equilibrium analysis.

Sakai et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib36)) introduce PPV, an absorbing-Markov-chain formulation that admits *split* delegation (each voter can distribute its budget across multiple peers and one policy simultaneously). To our knowledge the present work is the first application of liquid democracy, or of any delegation-graph mechanism, to multi-sample LLM aggregation. The conceptual mapping is clean: each group of samples is a voter, the delegation budget encodes trust in peers calibrated by embedding cosine, and the stationary distribution of the chain resolves the consensus.

### 2.6 Multi-agent debate and ensemble consensus

A parallel line of work, multi-agent debate (MAD), aggregates outputs by having multiple LLM *instances* debate iteratively (Du et al., [2023](https://arxiv.org/html/2606.08098#bib.bib37); Liang et al., [2024](https://arxiv.org/html/2606.08098#bib.bib38)). Khan et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib39)) show that debate between models can surface factual errors that a single model would propagate. The key differences from our setting are: (i) MAD involves multiple distinct models or independently prompted instances that communicate across rounds, whereas our setting draws repeated samples from a fixed temperature distribution of one model; (ii) MAD aggregation is sequential (each round conditions on prior outputs), whereas ours is parallel; (iii) MAD is compute-intensive and can exhibit sycophancy, with agents converging to a wrong consensus under social pressure (Pitre et al., [2025](https://arxiv.org/html/2606.08098#bib.bib40)).

Mixture-of-agents approaches (Wang et al., [2024a](https://arxiv.org/html/2606.08098#bib.bib42)) use heterogeneous models as an ensemble; Zhao et al. ([2024](https://arxiv.org/html/2606.08098#bib.bib43)) run a council of LLMs that collectively author, take, and grade a benchmark, ranking each other in a democratic fashion. These methods lie in a complementary regime (multiple models, iterative, often supervised in some component) to our single-model, single-pass, fully unsupervised approach. Recent work applying social choice formalisms to multi-agent debate (Wang et al., [2026](https://arxiv.org/html/2606.08098#bib.bib41)) asks when it is *safe* to commit to a debate outcome. That is a question about confidence rather than aggregation function, but it connects to our concern about polarity mismatch.

Our work also differs from Pan et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib5)) and concurrent LLM-panel aggregation methods in that we treat the *reasoning text* as a first-class signal (via embedding geometry) rather than only the final answer distribution. This reasoning-geometry lever is invisible to methods that pool only extracted answers.

## 3 Preliminaries: Propagational Proxy Voting

We summarize the PPV machinery used; for proofs and the general formulation see Sakai et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib36)).

#### Setup\.

Fix a set of voters $N = \{d_1, \dots, d_n\}$ and a set of policies (possible answers) $P = \{p_1, \dots, p_k\}$. PPV operates on a column-stochastic *voting matrix*:

$$
V \;=\; \begin{bmatrix} V_{d \leftarrow d} & \mathbf{0} \\ V_{p \leftarrow d} & I_k \end{bmatrix}, \quad \sum_i V_{i,j} = 1 \;\; \forall j, \tag{1}
$$

where $V_{d \leftarrow d} \in \mathbb{R}^{n \times n}$ is the delegate-to-delegate sub-block (with zero diagonal), $V_{p \leftarrow d} \in \mathbb{R}^{k \times n}$ is the delegate-to-policy sub-block, and $I_k$ makes each policy an absorbing state. Column $j$ encodes the outgoing distribution of delegate $d_j$: how its unit of voting mass splits between its own pick (a policy) and its peers (other delegates).

#### Consensus via the limit matrix\.

Sakai et al. ([2025](https://arxiv.org/html/2606.08098#bib.bib36), Theorem IV.3) prove that the limit $W^* = \lim_{x \to \infty} V^x$ exists and is computable by repeated squaring (Sakai et al., [2025](https://arxiv.org/html/2606.08098#bib.bib36), Corollary IV.4). The columns of $W^*$ live entirely on the policy block: each delegate’s mass is fully absorbed. The PPV consensus winner is

$$
\hat{p} \;=\; \arg\max_{p \in P} \; \sum_{j \in N} W^*_{p,j}. \tag{2}
$$

PPV reduces to several familiar mechanisms in limits: if every column has $V_{\pi_j, j} = 1$ (each delegate keeps everything), PPV collapses to majority vote weighted by the column distribution; if some column has full mass on a single peer, that peer absorbs the delegator’s vote entirely (classical proxy voting).
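As a rough illustration of Equations 1 and 2, the sketch below approximates the limit matrix by repeated squaring and reads off the consensus winner. It is a minimal NumPy sketch, not the authors' implementation: the function names `ppv_limit` and `consensus_winner`, the convergence tolerance, and the three-delegate toy matrix are all hypothetical, and the row layout (delegates first, then policies) is an assumption taken from the block structure of Equation 1.

```python
import numpy as np

def ppv_limit(V, tol=1e-12, max_squarings=64):
    """Approximate W* = lim V^x by repeatedly squaring V (V, V^2, V^4, ...)."""
    W = np.asarray(V, dtype=float)
    for _ in range(max_squarings):
        W_next = W @ W
        if np.max(np.abs(W_next - W)) < tol:  # powers have stopped changing
            return W_next
        W = W_next
    return W

def consensus_winner(W_star, n, k):
    """Sum absorbed mass per policy over the n delegate columns (Equation 2)."""
    policy_mass = W_star[n:n + k, :n].sum(axis=1)
    return int(np.argmax(policy_mass)), policy_mass

# Hypothetical toy instance: n = 3 delegates, k = 2 policies.
# Rows/columns 0..2 are delegates, 3..4 are policies p0 and p1.
n, k = 3, 2
V = np.zeros((n + k, n + k))
V[3, 0] = 1.0                    # d0 keeps everything on p0
V[4, 1], V[0, 1] = 0.5, 0.5      # d1: half on p1, half delegated to d0
V[4, 2], V[0, 2] = 0.2, 0.8      # d2: 0.2 on p1, 0.8 delegated to d0
V[3, 3] = V[4, 4] = 1.0          # policies are absorbing states

W_star = ppv_limit(V)
winner, policy_mass = consensus_winner(W_star, n, k)
```

In this toy instance two of the three delegates pick `p1`, so a plain vote over picks would choose `p1`, while the propagated mass favors `p0` because both `p1` pickers route most of their weight to `d0`.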

## 4 Method

We instantiate PPV for unsupervised LLM aggregation in four steps: sampling, signal extraction, matrix construction, and propagation\.

### 4.1 Sampling and partitioning

For each question $q$, the solver generates $S = 128$ chain-of-thought samples, each terminating in an extracted answer letter $\ell_c \in \{A, \dots, J, \varnothing\}$ ($\varnothing$ for parsing failure). We partition the 128 samples deterministically into $n = 16$ groups of $g = 8$. Each group becomes a *voter* (delegate) in PPV’s terminology; its pick $\pi_j$ is the majority letter among its 8 samples ($\pi_j = \texttt{Z}$ if the majority is $\varnothing$). The set of policies $P$ is the set of letters actually picked. The partition is fixed (not learned); splitting 128 samples into 16 groups gives each group enough samples for a stable entropy estimate while leaving enough voters for meaningful delegation.
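The partitioning step can be sketched as follows. This is an illustrative sketch only: the text says the partition is deterministic but does not specify the rule, so contiguous blocks by sample index are an assumption here, as are the helper name `partition_and_pick`, the use of `None` for a parsing failure, and the tie-breaking behaviour (first-encountered letter, inherited from `Counter.most_common`).

```python
from collections import Counter

def partition_and_pick(letters, n_groups=16):
    """Split parsed letters into equal contiguous groups and pick each group's top letter.

    `letters` holds one entry per sample; None marks a parsing failure.
    """
    g = len(letters) // n_groups                      # samples per voter (8 for 128/16)
    groups = [letters[i * g:(i + 1) * g] for i in range(n_groups)]
    picks = []
    for group in groups:
        top, _count = Counter(group).most_common(1)[0]
        picks.append("Z" if top is None else top)     # Z stands in for a failed parse
    return groups, picks

# Hypothetical parsed letters for one question: 128 samples.
letters = ["A"] * 8 + ["B"] * 5 + ["C"] * 3 + [None] * 8 + ["D"] * 104
groups, picks = partition_and_pick(letters)
policies = sorted(set(picks))                          # letters actually picked
```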

### 4.2 Per-voter signals

#### Letter\-level semantic entropy\.

For voter $j$ with letter counts $\{c_\ell\}_{\ell \in \mathcal{L}_j}$ over its 8 samples ($\mathcal{L}_j$ = letters observed, $K_j = |\mathcal{L}_j|$), the Miller–Madow corrected and $\log g$-normalized entropy is

$$
\hat{H}_j \;=\; \frac{-\sum_{\ell \in \mathcal{L}_j} \frac{c_\ell}{g} \log \frac{c_\ell}{g} \;+\; \frac{K_j - 1}{2g}}{\log g}. \tag{3}
$$

$\hat{H}_j = 0$ when all 8 samples agree on one letter; values approach 1 when the 8 samples spread evenly across letters. The Miller–Madow correction $\frac{K_j - 1}{2g}$ (Miller, [1955](https://arxiv.org/html/2606.08098#bib.bib21)) compensates for the downward bias of plug-in entropy at $g = 8$; for fully spread distributions the correction can push the normalized value mildly above 1, in which case we clip downstream confidence values to $[0, 1]$.
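Equation 3 can be checked numerically with a few lines of standard-library Python. The sketch below is an illustration under stated assumptions: the function names are hypothetical, every symbol passed in is counted as a letter (how parsing failures enter $\mathcal{L}_j$ is not specified in the text), and `confidence` implements only the $\alpha = 1 - \hat{H}$ mode with clipping to $[0, 1]$.

```python
import math
from collections import Counter

def letter_entropy(letters):
    """Miller-Madow corrected, log(g)-normalized letter entropy (Equation 3)."""
    g = len(letters)
    counts = Counter(letters)
    K = len(counts)                                   # number of distinct letters
    plug_in = -sum((c / g) * math.log(c / g) for c in counts.values())
    return (plug_in + (K - 1) / (2 * g)) / math.log(g)

def confidence(letters):
    """alpha = 1 - H_hat, clipped to [0, 1]."""
    return min(1.0, max(0.0, 1.0 - letter_entropy(letters)))

h_unanimous = letter_entropy(["A"] * 8)               # all 8 samples agree
h_split = letter_entropy(["A"] * 4 + ["B"] * 4)       # 4-4 split
h_spread = letter_entropy(list("ABCDEFGH"))           # 8 distinct letters
alpha_spread = confidence(list("ABCDEFGH"))           # clipped confidence
```

The fully spread case exceeds 1 before clipping, which is the situation the clipping rule in the text is meant to handle.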

#### Reasoning embeddings and per\-question centering\.

Each sample’s reasoning text is embedded by Qwen3-Embedding-8B into a 4096-dimensional unit vector $e_c$. Let $\bar{e}^{(q)} = \frac{1}{S} \sum_{c=1}^{S} e_c$ denote the per-question centroid over all $S = 128$ samples. We center and renormalize:

$$
\tilde{e}_c \;=\; \frac{e_c - \bar{e}^{(q)}}{\| e_c - \bar{e}^{(q)} \|_2}, \quad \bar{e}_j \;=\; \frac{1}{g} \sum_{c \in \text{group}_j} \tilde{e}_c, \tag{4}
$$

and define the inter-voter cosine matrix $\cos_{ij} = \langle \bar{e}_i / \| \bar{e}_i \|, \bar{e}_j / \| \bar{e}_j \| \rangle$.

Centering is the difference between an informative geometry and noise. With uncentered embeddings, off-diagonal cosines on a single question lie in $[+0.88, +0.99]$ with standard deviation $\approx 0.025$: every voter’s reasoning embedding is dominated by the shared question content. Subtracting the per-question centroid removes that common component and yields cosines in $[-0.68, +0.64]$ with standard deviation $\approx 0.32$.
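The effect of centering can be reproduced qualitatively on synthetic data. The sketch below follows Equation 4 and the cosine definition, but everything around it is an assumption for illustration: the function name `centered_voter_cosines`, the contiguous grouping of rows into voters, the 64-dimensional random vectors standing in for real 4096-dimensional embeddings, and the strength of the shared component. The numbers it produces are not the paper's.

```python
import numpy as np

def centered_voter_cosines(E, n_groups=16):
    """Inter-voter cosine matrix from per-question-centered sample embeddings."""
    E = np.asarray(E, dtype=np.float64)               # shape (S, dim), unit rows
    g = E.shape[0] // n_groups                        # samples per voter
    centered = E - E.mean(axis=0, keepdims=True)      # subtract per-question centroid
    centered /= np.linalg.norm(centered, axis=1, keepdims=True)
    # Assumption: voter j owns the contiguous rows j*g .. (j+1)*g - 1.
    centroids = centered.reshape(n_groups, g, -1).mean(axis=1)
    centroids /= np.linalg.norm(centroids, axis=1, keepdims=True)
    return centroids @ centroids.T

# Synthetic stand-in: a strong shared "question" direction plus per-sample noise.
rng = np.random.default_rng(0)
E = 5.0 * np.ones(64) + rng.normal(size=(128, 64))
E /= np.linalg.norm(E, axis=1, keepdims=True)

raw_off = (E @ E.T)[~np.eye(128, dtype=bool)]         # raw sample-level cosines
C = centered_voter_cosines(E)
centered_off = C[~np.eye(16, dtype=bool)]             # centered voter-level cosines
```

On this synthetic input the raw cosines cluster near 1 because of the shared direction, while the centered voter cosines spread around zero and take both signs.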

#### Intra\-group reasoning diversity\.

Within voter $j$’s 8 samples, the average pairwise centered cosine quantifies how varied its internal reasoning is:

$$
D_{\mathrm{intra},j} \;=\; \frac{1}{2} \left( 1 - \frac{1}{g(g-1)} \sum_{\substack{a, b \in \text{group}_j \\ a \neq b}} \langle \tilde{e}_a, \tilde{e}_b \rangle \right) \in [0, 1]. \tag{5}
$$

A voter with $D_{\mathrm{intra},j}$ near $0$ has 8 near-duplicate samples (templated reasoning); a voter near $0.5$ has 8 mutually orthogonal samples (independent reasoning paths).
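The two reference points quoted for Equation 5 (0 for duplicates, 0.5 for orthogonal samples) are easy to confirm. This is a minimal sketch with a hypothetical function name, assuming the rows passed in are already centered and unit-normalized as in Equation 4.

```python
import numpy as np

def intra_group_diversity(centered_group):
    """D_intra from Equation 5 for one voter's centered unit embeddings, shape (g, dim)."""
    X = np.asarray(centered_group, dtype=float)
    g = X.shape[0]
    gram = X @ X.T                                    # pairwise inner products
    mean_offdiag = (gram.sum() - np.trace(gram)) / (g * (g - 1))
    return 0.5 * (1.0 - mean_offdiag)

# Eight identical unit vectors: templated reasoning.
duplicates = np.tile(np.array([1.0, 0.0, 0.0]), (8, 1))
# Eight mutually orthogonal unit vectors: independent reasoning paths.
orthogonal = np.eye(8)

d_duplicates = intra_group_diversity(duplicates)
d_orthogonal = intra_group_diversity(orthogonal)
```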

### 4.3 Constructing the voting matrix

For each voter $j$ we choose a *confidence* $\alpha_j = f(\hat{H}_j, D_{\mathrm{intra},j}) \in [0, 1]$ from one of five modes (Table [1](https://arxiv.org/html/2606.08098#S4.T1)). Column $j$ of $V$ is then

$$
\begin{aligned}
V_{\pi_j, j} &= \alpha_j, \\
V_{i,j} &= (1 - \alpha_j) \cdot \frac{\max(\cos_{ij}, 0)}{\sum_{\ell \neq j} \max(\cos_{\ell j}, 0)} \;\; \forall\, i \neq j,
\end{aligned} \tag{6}
$$

with $V_{j,j} = 0$ and $V_{p,j} = 0$ for $p \neq \pi_j$. The When mass $\alpha_j$ goes into the policy block ($V_{p \leftarrow d}$); the Whom mass $1 - \alpha_j$ goes into the delegate block ($V_{d \leftarrow d}$). Clipping negative cosines to zero encodes “do not delegate to peers whose reasoning is anti-aligned with mine.” If every $\max(\cos_{\ell j}, 0) = 0$ (no positively aligned peer), the peer budget is split uniformly over the other $n - 1$ voters, keeping the column stochastic.
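A possible way to assemble the matrix from Equation 6 is sketched below. The function name `build_voting_matrix`, the ordering of policies (sorted letters), and the three-voter inputs are hypothetical; the row layout (delegates first, then policies) is assumed from Equation 1, and at least two voters are assumed so that the uniform fallback is well defined.

```python
import numpy as np

def build_voting_matrix(alpha, cos, picks):
    """Column-stochastic V from confidences, inter-voter cosines, and picks (Equation 6)."""
    alpha = np.clip(np.asarray(alpha, dtype=float), 0.0, 1.0)
    cos = np.asarray(cos, dtype=float)
    n = len(alpha)
    policies = sorted(set(picks))                     # assumed policy ordering
    k = len(policies)
    V = np.zeros((n + k, n + k))
    V[n:, n:] = np.eye(k)                             # policies are absorbing
    for j in range(n):
        V[n + policies.index(picks[j]), j] = alpha[j] # When: own-pick weight
        w = np.maximum(cos[:, j], 0.0)                # clip anti-aligned peers
        w[j] = 0.0                                    # no self-delegation
        if w.sum() > 0:
            w = w / w.sum()
        else:                                         # no positively aligned peer
            w = np.full(n, 1.0 / (n - 1))
            w[j] = 0.0
        V[:n, j] = (1.0 - alpha[j]) * w               # Whom: peer split
    return V, policies

# Hypothetical three-voter instance.
alpha = [0.5, 0.2, 0.4]
picks = ["D", "D", "I"]
cos = [[1.0, 0.5, -0.3],
       [0.5, 1.0, -0.1],
       [-0.3, -0.1, 1.0]]
V, policies = build_voting_matrix(alpha, cos, picks)
```

Voter 2 has no positively aligned peer in this example, so its peer budget of 0.6 is split evenly between voters 0 and 1 by the uniform fallback.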

Table 1: Confidence modes. $s = \hat{H}$, $d = D_{\mathrm{intra}}$. All clipped to $[0, 1]$.

### 4.4 Propagation and decision

We compute $W^*$ by repeated squaring of $V$ and read off the consensus winner via Equation [2](https://arxiv.org/html/2606.08098#S3.E2).

## 5 Experiments

#### Setup\.

Solver: Qwen3-1.7B at temperature $0.7$, 128 chain-of-thought samples per question. Benchmark: MMLU-Pro test split (Wang et al., [2024b](https://arxiv.org/html/2606.08098#bib.bib44)), all 14 subjects, $12{,}032$ questions. Embeddings: Qwen3-Embedding-8B (Zhang et al., [2025](https://arxiv.org/html/2606.08098#bib.bib45)), 4096-dim, L2-normalized, stored as fp16 memmap. In total this amounts to $\approx 1.54$M generated reasoning trajectories.

#### Baselines\.

- Majority: top letter across the 16 voter picks, $\varnothing$ excluded; ties counted incorrect.
- Best dictator: the single voter index with highest accuracy across the dataset, applied to every question.
- Oracle (pass@16): correct iff any of the 16 voters picked gold. Ceiling for any 16-voter aggregator.
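Two of the baselines above can be written down directly from their definitions. The sketch is illustrative: the function names are hypothetical, and it assumes a failed parse is encoded as the pick `"Z"`, following the convention in Section 4.1. The best-dictator baseline is omitted because it needs dataset-level accuracy per voter index.

```python
from collections import Counter

def majority_correct(picks, gold):
    """Majority baseline: top letter over voter picks, Z excluded, ties count as wrong."""
    votes = Counter(p for p in picks if p != "Z")
    if not votes:
        return False
    ranked = votes.most_common()                      # sorted by count, descending
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return False                                  # tie for first place
    return ranked[0][0] == gold

def oracle_correct(picks, gold):
    """Oracle (pass@16): correct iff any voter picked the gold letter."""
    return gold in picks

# Hypothetical picks mirroring a 10-6 split with gold answer I.
picks_10_6 = ["D"] * 10 + ["I"] * 6
picks_tie = ["A"] * 8 + ["B"] * 8
picks_with_failures = ["A"] * 9 + ["Z"] * 7
```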

#### Main results\.

Table [2](https://arxiv.org/html/2606.08098#S5.T2) shows that PPV with $\alpha = 1 - \hat{H}$ beats majority by $+1.50$ pp overall and $+2.24$ pp on the non-trivial subset. The paired McNemar test on the non-trivial subset (Table [3](https://arxiv.org/html/2606.08098#S5.T3)) reports 366 PPV wins versus 185 majority wins out of $8{,}099$ non-trivial questions, $p \approx 1.0 \times 10^{-14}$.

Table 2: Accuracy on MMLU-Pro test (Qwen3-1.7B, 128 samples/question). Non-trivial = top letter receives $< 12/16$ votes ($n = 8{,}099$ of $12{,}032$). A question is *trivial* when the top letter receives $\geq 12/16$ votes; on such questions all confidence modes agree with majority by construction. We report results on both the full set and the non-trivial subset, where the differentiating signal lives.

![Refer to caption](https://arxiv.org/html/2606.08098v1/figures/Results.png)

Figure 2: Accuracy on the $8{,}099$ non-trivial MMLU-Pro questions. Bars zoomed to highlight the $\sim$6 pp gap between majority and oracle. PPV (`confidence`) closes $38\%$ of that gap unsupervised.

Table 3: Paired McNemar test, `confidence` vs. majority on the $8{,}099$ non-trivial questions.

| Outcome | Count |
| --- | --- |
| PPV correct, majority wrong | 366 |
| Majority correct, PPV wrong | 185 |
| Net | $+181$ |
| Two-sided exact binomial $p$ | $\approx 1.0 \times 10^{-14}$ |
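The exact binomial form of the McNemar test in Table 3 depends only on the two discordant counts, so it can be recomputed from the reported 366 and 185. The sketch below is a standard-library illustration with a hypothetical function name; it assumes the usual two-sided convention of doubling the smaller tail under a fair-coin null, which may differ in detail from the authors' tooling.

```python
from math import comb

def mcnemar_exact_two_sided(b, c):
    """Two-sided exact binomial p-value for discordant counts b and c under p = 0.5."""
    n = b + c
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1))      # P(X <= k) * 2^n, as an exact integer
    return min(1.0, 2 * tail / 2 ** n)

ppv_wins, majority_wins = 366, 185                    # discordant pairs from Table 3
net = ppv_wins - majority_wins
p_value = mcnemar_exact_two_sided(ppv_wins, majority_wins)
```

Because the sum is carried out in exact integer arithmetic before the final division, the tiny tail probability is not lost to floating-point underflow; the result should be on the order of the reported $10^{-14}$.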
#### Per\-disagreement precision\.

On the non-trivial subset, `div` wins $187/(187+39) = 82.7\%$ of disagreements, `inverted_x_div` wins $68.9\%$, and `confidence` wins $66.4\%$ (Figure [3](https://arxiv.org/html/2606.08098#S5.F3)). The aggregate-accuracy winner (`confidence`) and the per-disagreement-precision winner (`div`) are different modes: the former disagrees with majority more often and is right less often per disagreement; the latter disagrees less often but more reliably.
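The per-disagreement precision is simply wins divided by total disagreements, which makes the trade-off between the two modes easy to see side by side. The short sketch below recomputes it from counts stated in the text (187 and 39 for `div`, 366 and 185 for `confidence`); the function name is hypothetical, and `inverted_x_div` is left out because its raw counts are not given here.

```python
def disagreement_precision(wins, losses):
    """Fraction of disagreements with majority that the mode gets right."""
    return wins / (wins + losses)

div_precision = disagreement_precision(187, 39)
confidence_precision = disagreement_precision(366, 185)

# Net questions rescued relative to majority (wins minus losses).
div_net = 187 - 39
confidence_net = 366 - 185
```

`div` disagrees with majority on 226 questions and `confidence` on 551, so the higher-precision mode rescues fewer questions in absolute terms.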

![Refer to caption](https://arxiv.org/html/2606.08098v1/figures/disagreement.png)

Figure 3: Where the disagreements land on the non-trivial subset. Left (pink) = majority correct, PPV wrong; right (green) = PPV correct, majority wrong. `div` fires rarely but is right $82.7\%$ of the time; `confidence` fires more and rescues the most absolute questions ($+181$ net).

#### No single mode dominates\.

The five modes form a Pareto front, not a ranking\.confidencemaximizes overall accuracy;divmaximizes per\-disagreement precision;inverted\_x\_divmaximizes accuracy on the inverted\-polarity “confidently wrong” family\. The modes rescue overlapping but distinct sets of questions\.

## 6Mechanism

We unpack*how*PPV delivers the gain through a worked example, the mirrored failure case, and aWhen/Whomablation\.

### 6.1 A clear majority overturned: `philosophy_314`

Question `philosophy_314` (gold = I) is a 10–6 majority for the wrong letter: 10 voters pick D, 6 pick I. This is not a tiebreak: D wins majority with a 4-vote margin over the runner-up. PPV with `confidence` nonetheless resolves the question to I. The flip is driven by Whom, not When.

#### Own\-pick weight \(When\) is nearly a wash\.

D-pickers have mean $\hat{H} = 0.713$, I-pickers $0.673$, a gap of only $0.04$. Under $\alpha = 1 - \hat{H}$, the first-iteration self-mass favors *D*: $\sum_{j \in D} \alpha_j = 2.87$ vs $\sum_{j \in I} \alpha_j = 1.96$. Entropy alone would keep D ahead.

#### Reasoning geometry \(Whom\) is decisive\.

In centered embedding space, the two clusters look qualitatively different (Figure [4](https://arxiv.org/html/2606.08098#S6.F4)). The 10 D-voters reach D via *unrelated* reasoning paths: mean within-D cosine $-0.021$, a near-orthogonal cloud. The 6 I-voters reach I via tightly similar reasoning: mean within-I cosine $+0.257$. Cross-cluster cosine is $-0.170$: D and I sit on opposite sides of reasoning space. The clipped-cosine peer weights inherit this asymmetry: within-I clipped average is $+0.257$ vs within-D $+0.053$. Even though D-pickers route on average only 15.8% of their peer budget toward the I-cluster, the *absolute* mass arriving at any I-voter is large because the alternative D-side weights are tiny. The $16 \times 16$ initial voting matrix $V$ (Figure [5](https://arxiv.org/html/2606.08098#S6.F5)) makes this visible: the I-cluster’s columns ($d_3, d_5, d_6, d_{10}, d_{11}, d_{14}$) channel mass into a coherent block, while D-cluster columns spray weakly across the rest.

Table 4: Per-voter signals on `philosophy_314`. $\alpha = 1 - \hat{H}$ is the When own-pick weight; $1 - \alpha$ is the peer budget split via clipped centered cosine. D-pickers (left of separator) vs. I-pickers (right).

![Refer to caption](https://arxiv.org/html/2606.08098v1/figures/philosophy_314_centroids.png)

Figure 4: PCA-2D of the 16 per-voter centroids on `philosophy_314` after per-question centering. The 10 D-pickers form a scattered cloud (mean within-cluster cosine $-0.021$); the 6 I-pickers form a tight cluster ($+0.257$). The geometric incoherence of the majority is what PPV exploits.

![Refer to caption](https://arxiv.org/html/2606.08098v1/figures/philosophy_314_matrix.png)

Figure 5: Initial column-stochastic voting matrix $V$ for `philosophy_314`. Diagonal = When own-pick weight $\alpha_j = 1 - \hat{H}_j$; column off-diagonals = Whom peer split by clipped centered cosine. I-cluster columns concentrate mass on the I block; D-cluster columns spread weakly.
#### Propagation\.

After repeated squaring of $V$, the stationary distribution lands at

$$\textstyle\sum_{j} W_{I,j} = 8.94, \quad \sum_{j} W_{D,j} = 7.06 \;\Rightarrow\; \hat{p} = \texttt{I} \;\; \text{(gold)}.$$

A clear 10–6 majority overturned because the minority’s reasoning is coherent and the majority’s is not. Whom did the work that When alone could not.
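
The pipeline described above (own-pick weight on the diagonal, clipped centered cosine for the peer split, repeated squaring, per-letter mass) can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: Eq. 6 is not reproduced in this section, so the exact column normalization, the fallback for a voter with no positive-cosine peer, and reading letter masses as row sums of the converged matrix are our guesses. All function names and the toy data are invented.

```python
import numpy as np

def delegation_matrix(entropy, embeddings):
    """Build a column-stochastic V from per-voter entropy and reasoning embeddings."""
    alpha = 1.0 - np.asarray(entropy, dtype=float)        # When: own-pick weight
    emb = np.asarray(embeddings, dtype=float)
    centered = emb - emb.mean(axis=0)                     # per-question centering
    norms = np.linalg.norm(centered, axis=1, keepdims=True)
    unit = centered / np.maximum(norms, 1e-12)
    peer = np.clip(unit @ unit.T, 0.0, None)              # Whom: max(cos_ij, 0)
    np.fill_diagonal(peer, 0.0)                           # peers exclude the voter itself
    totals = peer.sum(axis=0)
    V = np.zeros_like(peer)
    for j in range(len(alpha)):
        if totals[j] > 0:
            V[:, j] = (1.0 - alpha[j]) * peer[:, j] / totals[j]
            V[j, j] = alpha[j]
        else:
            V[j, j] = 1.0  # assumption: no positive-cosine peer, so keep all mass
    return V

def propagate(V, max_squarings=60, tol=1e-12):
    """Square V repeatedly until it stops changing."""
    W = V.copy()
    for _ in range(max_squarings):
        nxt = W @ W
        nxt /= nxt.sum(axis=0, keepdims=True)   # guard against floating-point drift
        if np.abs(nxt - W).max() < tol:
            return nxt
        W = nxt
    return W

def letter_mass(W, picks):
    """Sum the mass that ends up on the voters holding each letter."""
    mass = {}
    for i, letter in enumerate(picks):
        mass[letter] = mass.get(letter, 0.0) + float(W[i].sum())
    return mass

# Toy question with invented picks, entropies, and random embeddings.
rng = np.random.default_rng(0)
picks = ["D", "D", "D", "D", "I", "I"]
entropy = [0.71, 0.72, 0.70, 0.73, 0.66, 0.68]
embeddings = rng.normal(size=(6, 8))

V = delegation_matrix(entropy, embeddings)
mass = letter_mass(propagate(V), picks)
print(mass, "->", max(mass, key=mass.get))
```

Because every column of the converged matrix sums to one, the letter masses add up to the number of voters, matching the $8.94 + 7.06 = 16$ bookkeeping in the worked example.
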

#### When entropy carries the day: `economics_124`.

Not all PPV wins look like `philosophy_314`. On `economics_124` (gold = J, picks split 8/8 between I and J), the entropy gap is the visible lever: J-pickers have mean $\hat{H} = 0.536$ vs I-pickers’ $0.623$, and the cosine geometry is only mildly assortative (within-J $+0.276$, within-I $+0.233$, cross $+0.265$). The two signals each do a little work and compound to $\sum_{j} W_{J,j} = 8.69$ vs $7.31$, flipping a tiebreak. Across the 8,099 non-trivial questions, both regimes occur: sometimes When dominates, sometimes Whom.

### 6.2 When the signal hurts: `engineering_330`

Not every question favors the normal polarity. On `engineering_330` (gold = B, voter picks split $7\times$A, $7\times$B, $1\times$C, $1\times$Z), the B-pickers have *higher* mean $\hat{H}$ ($0.608$) than the A-pickers ($0.517$): the solver is more internally consistent on the wrong answer. Under $\alpha = 1 - \hat{H}$, A-pickers retain more own-pick mass and A wins. Switching to inverted polarity ($\alpha = \hat{H}$) recovers B. This is a “confidently wrong” failure of letter-entropy as a per-question confidence signal, and is the population that `inverted_x_div` targets. The two modes thus rescue disjoint families of questions; on MMLU-Pro the normal-polarity family is the larger of the two.
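
To make the polarity flip concrete, the own-pick mass of the two seven-voter clusters can be compared under both polarities. The figures below are a back-of-the-envelope sketch: they use the cluster mean entropies quoted above, ignore the single C and Z voters, and ignore propagation, so they show the direction of the effect rather than PPV's final masses.

$$
\begin{aligned}
\alpha = 1 - \hat{H}: &\quad \textstyle\sum_{j \in A} \alpha_j \approx 7 \times (1 - 0.517) = 3.38, \quad \sum_{j \in B} \alpha_j \approx 7 \times (1 - 0.608) = 2.74 \quad (\text{A ahead}) \\
\alpha = \hat{H}: &\quad \textstyle\sum_{j \in A} \alpha_j \approx 7 \times 0.517 = 3.62, \quad \sum_{j \in B} \alpha_j \approx 7 \times 0.608 = 4.26 \quad (\text{B ahead})
\end{aligned}
$$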

### 6.3 The When/Whom decomposition

PPV’s column construction (Eq. [6](https://arxiv.org/html/2606.08098#S4.E6)) has two levers, When (how much to keep on own pick) and Whom (how to split the rest). To isolate which lever delivers the gain, we ablate each independently:

- Fix Whom, vary When. Holding peer weights at clipped centered cosine, varying $\alpha_j$ over the five modes (Table [1](https://arxiv.org/html/2606.08098#S4.T1)) produces the full spread of accuracies in Table [2](https://arxiv.org/html/2606.08098#S5.T2).
- Fix When, vary Whom. Holding $\alpha = 1 - \hat{H}$, multiplying $\max(\cos_{ij}, 0)$ by each of four peer-quality scores shifts accuracy by at most $\pm 0.1$ pp, inside the noise.

When is the load-bearing lever. Whom as plain clipped cosine is essentially optimal in this design space.

#### Why Whom doesn’t need an explicit quality gate.

PPV’s repeated-squaring propagation handles peer-side quality implicitly. A low-quality voter that receives mass routes it back out through *its own column*, which again uses cosine. After two to three hops, mass settles in neighborhoods of mutually high-cosine voters which, empirically, are also high-quality neighborhoods.

### 6.4 The centering ablation

Skipping per-question centering (using raw $e_c$ in place of $\tilde{e}_c$) is the single largest ablation. Off-diagonal cosines become near-constant (std $0.025$, range $[+0.88, +0.99]$), the Whom block degenerates to near-uniform peer weights, and PPV reduces to a soft averaging that brings no gain over majority. The centering trick is the difference between PPV-as-aggregator and PPV-as-soft-majority.
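
The collapse of raw cosines can be reproduced on synthetic data. The sketch below assumes that every voter's embedding is a large shared component plus small voter-specific noise, which is our stand-in for real reasoning embeddings; the dimensions, noise scale, and function name `offdiag_cosines` are invented.

```python
import numpy as np

def offdiag_cosines(x):
    """Return the off-diagonal entries of the pairwise cosine matrix of the rows of x."""
    unit = x / np.linalg.norm(x, axis=1, keepdims=True)
    cos = unit @ unit.T
    return cos[~np.eye(len(x), dtype=bool)]

rng = np.random.default_rng(0)
shared = np.ones(64)                              # component common to every voter
raw = shared + 0.1 * rng.normal(size=(16, 64))    # synthetic stand-in for 16 centroids
centered = raw - raw.mean(axis=0)                 # per-question centering

raw_cos = offdiag_cosines(raw)
cen_cos = offdiag_cosines(centered)
print(f"raw:      std={raw_cos.std():.3f}  range=[{raw_cos.min():+.2f}, {raw_cos.max():+.2f}]")
print(f"centered: std={cen_cos.std():.3f}  range=[{cen_cos.min():+.2f}, {cen_cos.max():+.2f}]")
```

With the shared component in place, all raw cosines sit close to $+1$ and carry almost no routing information; subtracting the per-question mean removes that component and spreads the cosines over both signs, which is what clipping at zero needs in order to separate peers.
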

## 7 Negative Results

### 7.1 $P(\text{True})$ is anti-correlated with correctness

The same-model $P(\text{True})$ signal of Kadavath et al. ([2022](https://arxiv.org/html/2606.08098#bib.bib22)) is a standard auxiliary confidence. We computed $P(\text{True})$ per voter over all $n \cdot |Q| = 192{,}512$ (voter, question) rows on this setup (Table [5](https://arxiv.org/html/2606.08098#S7.T5)). Its AUROC for predicting correctness is $0.47$, below chance. The top 25% of voters by $P(\text{True})$ are correct 32.1% of the time, below the base rate of 35.4%. CoCoA-style products with $\hat{H}$ inherit the anti-correlation and underperform $\hat{H}$ alone. We read this as model- and setting-specific: high-temperature generation from a 1.7B-parameter solver produces confident-sounding-but-wrong outputs and its same-model verifier rewards them. For this setup the consequence is clear: $P(\text{True})$ should not enter the aggregator.

Table 5: Per-voter confidence calibration on MMLU-Pro (Qwen3-1.7B, 192,512 voter-question rows). Base rate $= 0.354$.
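
For readers less familiar with the metric, AUROC here is the probability that a randomly chosen correct row receives a higher confidence than a randomly chosen incorrect row, so a value below $0.5$ means the signal ranks wrong answers above right ones more often than not. The pairwise sketch below uses invented toy rows and a hypothetical helper `auroc`; it is quadratic in the number of rows, so a rank-based formulation would be needed at the 192,512-row scale reported above.

```python
def auroc(scores, correct):
    """P(random correct row outscores random incorrect row); ties count one half."""
    pos = [s for s, c in zip(scores, correct) if c]
    neg = [s for s, c in zip(scores, correct) if not c]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect row")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Invented toy rows: the confidence signal is mostly higher on the wrong answers.
p_true = [0.95, 0.90, 0.80, 0.60, 0.40, 0.30]
correct = [0, 0, 1, 0, 1, 1]
print(f"AUROC = {auroc(p_true, correct):.3f}")
```
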

## 8 Discussion

#### The signal: letter entropy is load\-bearing, geometry is the medium\.

Across all our experiments the dominant ablation is the choice of When. Geometry (Whom) is necessary for the propagation to flow non-trivially, but its functional form is essentially fixed once centered cosine is in place. This pattern of having a strong per-voter scalar signal channeled through a mild geometric backbone suggests a general design principle for delegation-based aggregators: invest in the scalar, use the embedding only to route.

#### Why propagation, not weighting?

A simpler approach would weight each voter’s pick by $\alpha_j$ and sum: an entropy-weighted majority. PPV’s propagation differs in that low-confidence voters do not just contribute less; they *redistribute* their voting mass to peers. On `philosophy_314` (§[6.1](https://arxiv.org/html/2606.08098#S6.SS1)) this redistribution is what overturns the majority: the D-cluster’s leaked budget, routed by clipped cosine, concentrates absolute mass on the geometrically coherent I-cluster even though most D-pickers nominally delegate to fellow D-pickers; the within-D weights are too small to matter once propagation runs.
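
For contrast, the entropy-weighted majority baseline is a few lines. The sketch below applies it to `philosophy_314`, with the simplifying assumption that every voter carries its cluster's mean entropy (the per-voter values of Table 4 are not reproduced here); the function name is illustrative.

```python
from collections import defaultdict

def entropy_weighted_majority(picks, entropy):
    """Weight each voter's pick by alpha = 1 - H and sum per letter (no delegation)."""
    mass = defaultdict(float)
    for letter, h in zip(picks, entropy):
        mass[letter] += 1.0 - h
    return max(mass, key=mass.get), dict(mass)

# philosophy_314 with every voter set to its cluster's mean entropy (a simplification).
picks = ["D"] * 10 + ["I"] * 6
entropy = [0.713] * 10 + [0.673] * 6
winner, mass = entropy_weighted_majority(picks, entropy)
print(winner, {letter: round(m, 2) for letter, m in mass.items()})
```

The baseline keeps D ahead, roughly 2.87 against 1.96, because the unused $1 - \alpha_j$ of each voter is simply discarded. PPV instead routes that remainder to peers, which is where the reported 8.94 against 7.06 in favor of I comes from.
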

#### Polarity is question\-dependent\.

The cleanest open problem we leave is the polarity question: on most MMLU\-Pro questions, lower\-entropy voters are more reliable, but on a structured minority the relationship inverts\. Detecting this per question without gold labels is, on the features we tried, unreliable\. A supervised polarity classifier may be the next step, but it changes the regime: aggregation becomes “unsupervised at inference time but supervised at design time\.”

## 9 Conclusion

Self-consistency leaves two free signals on the table. By feeding letter-level semantic entropy and centered embedding cosine into PPV (Sakai et al., [2025](https://arxiv.org/html/2606.08098#bib.bib36)), we obtain an unsupervised aggregator that beats majority by $+2.24$ pp on the non-trivial subset of MMLU-Pro at scale, with paired McNemar significance $p \approx 10^{-14}$. The When/Whom decomposition shows the gain is delivered by per-voter entropy as confidence; the geometric Whom side stays cosine-only because PPV’s propagation implicitly launders peer-side quality. Three negative results constrain the design space for future unsupervised LLM aggregators and isolate the open problem (per-question polarity selection) where supervision plausibly helps.

## References

- P. Aggarwal, A. Madaan, Y. Yang, and Mausam (2023) Let’s sample step by step: adaptive-consistency for efficient reasoning and coding with LLMs. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 8333–8347.
- S. Alouf-Heffetz, T. Inamdar, P. Jain, Y. More, and N. Talmon (2024) Controlling delegations in liquid democracy. In Proceedings of the 23rd International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- S. Alouf-Heffetz, Ł. Janeczko, G. Lisowski, and G. Papasotiropoulos (2025) The cost perspective of liquid democracy: feasibility and control. In Proceedings of the AAAI Conference on Artificial Intelligence.
- F. Bersetche (2025) Generalizing liquid democracy to multi-agent delegation: a voting weight measure and equilibrium analysis. International Journal of Game Theory. External Links: [Document](https://dx.doi.org/10.1007/s00182-025-00962-2).
- D. Bloembergen, D. Grossi, and M. Lackner (2019) On rational delegations in liquid democracy. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 33, pp. 1797–1804.
- M. Brill, T. Delemazure, A. George, M. Lackner, and U. Schmidt-Kraepelin (2022) Liquid democracy with ranked delegations. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 36, pp. 4884–4891.
- M. Brill (2018) Interactive democracy. In Proceedings of the 17th International Conference on Autonomous Agents and Multiagent Systems (AAMAS).
- X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, and D. Zhou (2023) Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311. External Links: [Link](https://arxiv.org/abs/2311.17311).
- Z. Christoff and D. Grossi (2017) Binary voting with delegable proxy: an analysis of liquid democracy. In Proceedings of the 16th Conference on Theoretical Aspects of Rationality and Knowledge, TARK 2017, pp. 137–151.
- K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman (2021) Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. External Links: [Link](https://arxiv.org/abs/2110.14168).
- P. Cordero-Encinar and A. B. Duncan (2025) Certified self-consistency: statistical guarantees and test-time training for reliable reasoning in LLMs. arXiv preprint arXiv:2510.17472. External Links: [Link](https://arxiv.org/abs/2510.17472).
- Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch (2023) Improving factuality and reasoning in language models through multiagent debate. arXiv preprint arXiv:2305.14325. External Links: [Link](https://arxiv.org/abs/2305.14325).
- S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630 (8017), pp. 625–630.
- B. Ford (2002) Delegative democracy. Note: Unpublished manuscript. External Links: [Link](https://bford.info/deleg/deleg.pdf).
- C. Guo, G. Pleiss, Y. Sun, and K. Q. Weinberger (2017) On calibration of modern neural networks. In International Conference on Machine Learning (ICML), D. Precup and Y. W. Teh (Eds.), Proceedings of Machine Learning Research, Vol. 70, pp. 1321–1330.
- S. Kadavath, T. Conerly, A. Askell, T. Henighan, D. Drain, E. Perez, N. Schiefer, Z. Hatfield-Dodds, N. DasSarma, E. Tran-Johnson, S. Johnston, S. E. Showk, A. Jones, N. Elhage, T. Hume, A. Chen, Y. Bai, S. Bowman, S. Fort, D. Ganguli, D. Hernandez, J. Jacobson, J. Kernion, S. Kravec, L. Lovitt, K. Ndousse, C. Olsson, S. Ringer, D. Amodei, T. Brown, J. Clark, N. Joseph, B. Mann, S. McCandlish, C. Olah, and J. Kaplan (2022) Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221. External Links: [Link](https://arxiv.org/abs/2207.05221).
- A. Kahng, S. MacKenzie, and A. D. Procaccia (2018) Liquid democracy: an algorithmic perspective. In Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 32.
- A. Khan, J. Hughes, D. Valentine, L. Ruis, K. Sachan, A. Radhakrishnan, E. Grefenstette, S. R. Bowman, T. Rocktäschel, and E. Perez (2024) Debating with more persuasive LLMs leads to more truthful answers. arXiv preprint arXiv:2402.06782. External Links: [Link](https://arxiv.org/abs/2402.06782).
- J. Kossen, J. Han, M. Razzak, L. Schut, S. Malik, and Y. Gal (2024) Semantic entropy probes: robust and cheap hallucination detection in LLMs. arXiv preprint arXiv:2406.15927. External Links: [Link](https://arxiv.org/abs/2406.15927).
- L. Kuhn, Y. Gal, and S. Farquhar (2023) Semantic uncertainty: linguistic invariances for uncertainty estimation in natural language generation. In International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=VD-AYtP0dve).
- Y. Li, S. Wang, L. Huang, and L. Liu (2024) Graph-based confidence calibration for large language models. arXiv preprint arXiv:2411.02454. External Links: [Link](https://arxiv.org/abs/2411.02454).
- T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu (2024) Encouraging divergent thinking in large language models through multi-agent debate. arXiv preprint arXiv:2305.19118. External Links: [Link](https://arxiv.org/abs/2305.19118).
- H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe (2023) Let’s verify step by step. arXiv preprint arXiv:2305.20050. External Links: [Link](https://arxiv.org/abs/2305.20050).
- Z. Lin, S. Trivedi, and J. Sun (2024) Generating with confidence: uncertainty quantification for black-box large language models. Transactions on Machine Learning Research (TMLR). External Links: ISSN 2835-8856, [Link](https://arxiv.org/abs/2305.19187).
- Y. Liu (2025) A statistically consistent measure of semantic uncertainty using language models. arXiv preprint arXiv:2502.00507. External Links: [Link](https://arxiv.org/abs/2502.00507).
- G. A. Miller (1955) Note on the bias of information estimates. Information theory in psychology: Problems and methods, pp. 95–100.
- N. Muennighoff, Z. Yang, W. Shi, X. L. Li, L. Fei-Fei, H. Hajishirzi, L. Zettlemoyer, P. Liang, E. Candes, and T. Hashimoto (2025) S1: simple test-time scaling. arXiv preprint arXiv:2501.19393. External Links: [Link](https://arxiv.org/abs/2501.19393).
- D. Nguyen, A. Payani, and B. Mirzasoleiman (2025) Beyond semantic entropy: boosting LLM uncertainty quantification with pairwise semantic similarity. In Findings of the Association for Computational Linguistics: ACL 2025. External Links: [Link](https://aclanthology.org/2025.findings-acl.234).
- A. Nikitin, J. Kossen, Y. Gal, and P. Marttinen (2024) Kernel language entropy: fine-grained uncertainty quantification for LLMs from semantic similarities. In Advances in Neural Information Processing Systems (NeurIPS). External Links: [Link](https://arxiv.org/abs/2405.20003).
- Y. Pan, Z. Kuang, Z. Wang, H. Wang, H. Liu, and M. Yang (2025) Beyond majority voting: LLM aggregation by leveraging higher-order information. arXiv preprint arXiv:2510.01499. External Links: [Link](https://arxiv.org/abs/2510.01499).
- P. Pitre, N. Ramakrishnan, and X. Wang (2025) CONSENSAGENT: towards efficient and effective consensus in multi-agent LLM interactions through sycophancy mitigation. In Findings of the Association for Computational Linguistics: ACL 2025, pp. 22112–22133. External Links: [Link](https://aclanthology.org/2025.findings-acl.1141/).
- Y. Sakai, P. Atchade-Adelomou, R. Jiang, L. Alonso, K. Larson, and K. Suzuki (2025) Propagational proxy voting. arXiv preprint arXiv:2504.13641. External Links: [Link](https://arxiv.org/abs/2504.13641).
- A. Sharma and P. Chopra (2025) The sequential edge: inverse-entropy voting beats parallel self-consistency at matched compute. arXiv preprint arXiv:2511.02309. External Links: [Link](https://arxiv.org/abs/2511.02309).
- C. Snell, J. Lee, K. Xu, and A. Kumar (2024) Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314. External Links: [Link](https://arxiv.org/abs/2408.03314).
- R. Vashurin, M. Goloburda, P. Nakov, A. Shelmanov, and M. Panov (2025) CoCoA: a generalized approach to uncertainty quantification by integrating confidence and consistency of LLM outputs. arXiv preprint arXiv:2502.04964. External Links: [Link](https://arxiv.org/abs/2502.04964).
- G. Wan, Y. Wu, J. Chen, and S. Li (2025) Reasoning aware self-consistency: leveraging reasoning paths for efficient LLM sampling. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (NAACL). External Links: [Link](https://arxiv.org/abs/2408.17017).
- J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou (2024a) Mixture-of-agents enhances large language model capabilities. arXiv preprint arXiv:2406.04692. External Links: [Link](https://arxiv.org/abs/2406.04692).
- M. F. Wang, H. Xie, G. Wang, A. Gao, G. Yang, Z. Li, Q. W. Qiu, F. Han, H. Qiu, Y. Huang, B. Zhu, and J. O. Woo (2026) From debate to decision: conformal social choice for safe multi-agent deliberation. arXiv preprint arXiv:2604.07667. External Links: [Link](https://arxiv.org/abs/2604.07667).
- W. Wang, Y. Wang, and H. Huang (2025) Ranked voting based self-consistency of large language models. In Findings of the Association for Computational Linguistics: ACL 2025. External Links: [Link](https://arxiv.org/abs/2505.10772).
- X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou (2023) Self-consistency improves chain of thought reasoning in language models. In International Conference on Learning Representations (ICLR). External Links: [Link](https://openreview.net/forum?id=1PL1NIMMrw).
- Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Xia, and W. Chen (2024b) MMLU-Pro: a more robust and challenging multi-task language understanding benchmark. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 37.
- S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan (2023) Tree of thoughts: deliberate problem solving with large language models. In Advances in Neural Information Processing Systems (NeurIPS), Vol. 36, pp. 24458–24483.
- Y. Zhang, M. Li, D. Long, X. Zhang, H. Lin, B. Yang, P. Xie, A. Yang, D. Liu, J. Lin, F. Huang, and J. Zhou (2025) Qwen3 embedding: advancing text embedding and reranking through foundation models. arXiv preprint arXiv:2506.05176. External Links: [Link](https://arxiv.org/abs/2506.05176).
- J. Zhao, F. M. Plaza-del-Arco, B. Genchel, and A. C. Curry (2024) Language model council: democratically benchmarking foundation models on highly subjective tasks. arXiv preprint arXiv:2406.08598. External Links: [Link](https://arxiv.org/abs/2406.08598).
- Y. Zhou, Y. Liu, Y. Sun, and J. Leskovec (2026) Know when you’re wrong: aligning confidence with correctness for LLM error detection. arXiv preprint arXiv:2603.06604. External Links: [Link](https://arxiv.org/abs/2603.06604).
