Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces
Summary
This paper introduces EP-HUBO, a quantum-inspired method that treats evidence selection in chain-of-thought reasoning as a combinatorial optimization problem, significantly improving performance on legal reasoning benchmarks like MMLU-Pro law and LEXam by allowing minority-but-correct hypotheses to override noisy majorities.
# Quantum-Inspired Trace-Augmented Evidence Selection for Reasoning over Structured Hypothesis Spaces
Source: [https://arxiv.org/html/2606.06941](https://arxiv.org/html/2606.06941)
Laura Wynter, Nirvik Sahoo, Paul Griffin. School of Computing and Information Systems, Singapore Management University, Singapore
###### Abstract
Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. The most common aggregator over sampled chain-of-thought (CoT) traces, majority vote, returns the most popular answer regardless of whether its evidence is actually strongest. We propose to treat the selection of CoT reasoning fragments into a set of evidence as an explicit combinatorial optimisation problem, allowing well-supported but minority hypotheses to override noisy majorities, and to evaluate the approach on legal-reasoning benchmarks that are particularly sensitive to evidence quality. We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which (i) generates multiple CoT traces with a small local model, (ii) parses fragments into per-hypothesis evidence pools, (iii) solves a higher-order unconstrained binary optimisation per pool with quality-derived weights (relevance, specificity, distinctiveness), and (iv) delegates a single adjudication call per question to a frontier model. We evaluate EP-HUBO on two evidence-intensive legal benchmarks (MMLU-Pro law and LEXam) using both simulated annealing on classical hardware and the Dirac-3 photonic entropy-quantum machine from Quantum Computing Inc. EP-HUBO beats majority vote by +12.6 pp on MMLU-Pro law and by up to +23.2 pp on LEXam with a strong frontier adjudicator, and beats zero-shot frontier adjudication by +1.5 pp on MMLU-Pro law and up to +5.1 pp on LEXam.
On LEXam, zero-shot Claude Sonnet 4.6 exhibits a severe position bias, choosing option “E” on 87.7% of questions; EP-HUBO-selected evidence reduces the bias and yields up to +20.2 pp over zero-shot Sonnet at HUBO precision 92.0%. Quantum solutions on Dirac-3 are competitive with classical simulated annealing. HUBO-style optimisation gives a principled way to aggregate reasoning fragments while preserving minority-but-correct hypotheses, and is most valuable in low-contamination domains where frontier models have not already absorbed the benchmark material.
## 1 Introduction
Large language models (LLMs) now solve a wide range of expert-level exams at or above human level, yet they remain brittle on specialised, evidence-intensive domains such as law. On these tasks, errors arise not only from gaps in world knowledge, but also from subtle distinctions between pieces of evidence and inconsistent use of supporting evidence. A common approach is to sample many LLM responses and aggregate the answers. The aggregator most commonly used, majority vote [[2](https://arxiv.org/html/2606.06941#bib.bib2)], returns the most frequent answer regardless of whether its evidence is actually strongest.
We introduce EP-HUBO (Evidence Pool Higher-Order Binary Optimisation), which treats evidence selection in Chain-of-Thought (CoT) reasoning traces as a structured combinatorial problem. Unlike self-consistency and majority-vote schemes, EP-HUBO does not reward popularity: reasoning fragment weights are derived from relevance, specificity, and distinctiveness, allowing well-supported but minority hypotheses to override noisy majorities. The optimisation problem identifies the strongest pieces of CoT evidence from per-hypothesis evidence pools over the structured hypothesis space. EP-HUBO is a quantum-inspired formulation, so the optimisation problem can be solved on either a classical or a quantum computer.
EP-HUBO is characterised by: (i) local open-weight trace generation, (ii) per-hypothesis evidence pools, (iii) HUBO with quality-derived rather than popularity-derived weights, and (iv) a single frontier adjudication call per question. The approach has four phases. First, for cost efficiency, we use a smaller local model to generate multiple CoT traces per question. Second, again using the smaller local model along with deterministic rules, fragments from the traces are parsed into answer-specific evidence pools. This step relies on the hypothesis space being well-structured, so that the answers can be partitioned into a discrete set of candidate hypotheses. Third, we formulate and solve a higher-order unconstrained binary optimisation (HUBO) problem for each pool, which selects a subset of fragments that strongly supports that pool's answer. Lastly, a single call to a larger, frontier model adjudicates over the options using the optimised evidence.
The legal domain presents an excellent area of application for EP-HUBO. Legal decisions rely on compiling separate, often independent, pieces of evidence to support a conclusion. Contrary to mathematical proofs, the pieces of evidence need not follow a specific order. As such, they may be drawn from multiple, different CoT reasoning traces. In addition, if a reasoning trace misses key pieces of evidence, the legal reasoning may collapse, which motivates generating multiple CoT traces. We thus evaluate EP-HUBO on two challenging legal reasoning benchmarks: the law subset of MMLU-Pro [[18](https://arxiv.org/html/2606.06941#bib.bib18)] and the LEXam benchmark [[17](https://arxiv.org/html/2606.06941#bib.bib17)] of Swiss and international-law questions. We evaluate EP-HUBO on both classical computers and a quantum computer; using classical computers we solve the optimisation problem via simulated annealing (SA), and we solve the quantum formulation on the Dirac-3 photonic entropy quantum machine from Quantum Computing Inc. (QCi) [[3](https://arxiv.org/html/2606.06941#bib.bib3)]. Our results show that EP-HUBO substantially improves over majority vote on both benchmarks. Beyond accuracy, EP-HUBO exposes and mitigates adjudicator bias. On LEXam, one of the frontier LLMs exhibits a severe bias, choosing one of the answer options on 87.7% of all questions. However, using HUBO-selected evidence reduces the bias and yields a +11.4 pp gain over the zero-shot LLM. Using Dirac-3 to solve the HUBO optimisation gives comparable results to simulated annealing on classical computers.
This paper thus makes the following contributions\.
1. Our evidence-pool HUBO framework. We define the trace-augmented problem formulation with evidence pools, and a higher-order binary optimisation problem for evidence selection, which is amenable to both classical and quantum computers.
2. A theoretical analysis of EP-HUBO.
3. An empirical study using both classical computers and a photonic quantum computer on the MMLU-Pro law and LEXam Swiss and international-law benchmarks, with two smaller open-weight LLMs as trace generators.
The paper is organised as follows. Section [2](https://arxiv.org/html/2606.06941#S2) discusses related work including self-consistency, combinatorial optimisation for NLP, quantum-inspired AI, and benchmark contamination. Section [3](https://arxiv.org/html/2606.06941#S3) introduces our framework and algorithm. Section [4](https://arxiv.org/html/2606.06941#S4) provides our theoretical analysis. Section [5](https://arxiv.org/html/2606.06941#S5) describes our experimental setup and reports our results. Section [6](https://arxiv.org/html/2606.06941#S6) presents an ablation study that isolates the contribution of EP-HUBO’s method components and the effect of adjudicator model strength. Section [7](https://arxiv.org/html/2606.06941#S7) concludes along with a discussion of the implications and limitations of our work. All scripts and traces are released alongside this paper.
## 2 Related Work
First, we discuss prior work on CoT reasoning traces as well as other combinatorial and quantum approaches\.
#### Self\-Consistency and Majority Vote
Self-consistency [[2](https://arxiv.org/html/2606.06941#bib.bib2)] generates $N$ independent CoT traces for a single task. It is generally used with majority vote, which selects the most common final answer. While it improves over greedy decoding, it is limited by the majority signal: if the correct answer is in the minority of traces, self-consistency cannot recover it. Recent theoretical analysis by Feng et al. [[4](https://arxiv.org/html/2606.06941#bib.bib4)] shows that its accuracy follows a power law in the number of samples; they introduce an adaptive variant (Blend-ASC) that achieves the same accuracy with 6.8× fewer samples by detecting when additional traces yield diminishing returns. Kang et al. [[5](https://arxiv.org/html/2606.06941#bib.bib5)] propose *self-certainty*, a reward-model-free metric derived from the model’s own output distribution, enabling sample-efficient best-of-$N$ selection without external verifiers.
#### Reasoning Verification and Generative Verifiers
Beyond self-consistency, Yao et al. [[20](https://arxiv.org/html/2606.06941#bib.bib20)] introduce Tree-of-Thought search over partial solutions; Saunders et al. [[21](https://arxiv.org/html/2606.06941#bib.bib21)] show that language models can be fine-tuned to critique their own intermediate steps; Welleck et al. [[22](https://arxiv.org/html/2606.06941#bib.bib22)] train a separate generative verifier whose score is used to re-rank candidate answers. These methods improve over self-consistency by introducing a verification stage, but the verification scope is typically the *full reasoning trace*, not the individual evidence fragments within it. EP-HUBO can be seen as taking the verification idea one level finer: rather than ranking complete traces, we score and select fragments and let the frontier model verify the resulting evidence ensemble.
#### Combinatorial Structure in NLP Decoding
Beam search is itself a discrete optimisation: each step selects from a candidate set under a log-probability objective. More structured combinatorial decoding has been studied at least since Roth and Yih [[23](https://arxiv.org/html/2606.06941#bib.bib23)], who formulated semantic-role labelling as an integer linear program, and Riedel and Clarke [[24](https://arxiv.org/html/2606.06941#bib.bib24)], who used ILP for dependency parsing. Modern variants use SAT solvers for constrained generation [[25](https://arxiv.org/html/2606.06941#bib.bib25)] or differentiable relaxations of structured prediction objectives [[26](https://arxiv.org/html/2606.06941#bib.bib26)]. EP-HUBO continues this tradition by treating evidence-fragment selection as a higher-order pseudo-Boolean optimisation, with the distinguishing feature that the objective coefficients are produced by a learned model rather than designed by hand.
#### Scaling Test\-Time Compute and Reinforcement\-Learned Reasoning
A parallel line of research improves reasoning by scaling inference-time computation. DeepSeek-R1 [[6](https://arxiv.org/html/2606.06941#bib.bib6)] demonstrated that reinforcement learning without supervised chain-of-thought annotation induces emergent behaviours (self-reflection, verification, and strategy adaptation), achieving 84.0% on MMLU-Pro. Yang et al. [[7](https://arxiv.org/html/2606.06941#bib.bib7)] show that naïvely extending CoT length can *hurt* accuracy on certain problem types and propose a *Thinking-Optimal Scaling* strategy that lets models self-select minimal reasoning length per problem. These findings motivate our use of a bounded number of structured traces rather than unbounded chain-of-thought.
#### Multi\-LLM Aggregation and Mixture of Agents
Wang et al. [[8](https://arxiv.org/html/2606.06941#bib.bib8)] propose Mixture-of-Agents (MoA), a layered architecture in which each agent refines outputs from all prior agents, achieving substantial gains over single-model baselines on open-ended generation tasks. A follow-up by Li et al. [[9](https://arxiv.org/html/2606.06941#bib.bib9)] shows that *Self-MoA* (aggregating multiple outputs from the single strongest model) outperforms diverse multi-model mixtures on AlpacaEval 2.0, suggesting that answer diversity from one capable model is more valuable than architectural diversity across weaker models. Ashiga et al. [[10](https://arxiv.org/html/2606.06941#bib.bib10)] survey seven ensemble paradigms for LLMs (weight merging, mixture-of-experts, output ensembling, routing, cascading) and find that output-level ensembles with voting are competitive with more complex fusion strategies when the individual models are sufficiently capable. EP-HUBO, by contrast, aggregates *evidence fragments* from multiple traces of a *single model*, then delegates the final synthesis to a stronger frontier model. EP-HUBO could also be extended to select from traces across multiple LLMs.
#### Combinatorial Optimisation of Reasoning Fragments
The idea of casting reasoning-fragment selection as a combinatorial optimisation problem was introduced by Esencan et al. [[11](https://arxiv.org/html/2606.06941#bib.bib11)], who map LLM-generated candidate reasons onto a quadratic QUBO problem to select an optimal evidence subset for chain-of-thought prompting, benchmarked against majority voting. Zhang et al. [[12](https://arxiv.org/html/2606.06941#bib.bib12)] extend this paradigm to an end-to-end framework (LLM-QUBO) in which an LLM automatically parses natural-language problem descriptions and generates QUBO formulations, integrated with a hybrid quantum-classical Benders’ decomposition. Relative to Esencan et al. and LLM-QUBO, our EP-HUBO (i) is per-hypothesis, (ii) uses small local models for both trace generation and scoring, and (iii) is evaluated systematically on the legal domain, which presents an ideal application for optimally selecting and combining CoT reasoning traces.
#### QCR\-LLM
QCR-LLM [[1](https://arxiv.org/html/2606.06941#bib.bib1)] extends the formulation from a quadratic QUBO to HUBO, with third-order interactions, and applies it to LLM reasoning traces. They extract atomic reasoning fragments from $N=20$ frontier-model CoT traces per question and encode each fragment as a binary variable $x_{i}\in\{0,1\}$. The HUBO energy function is defined as:
$$H(\mathbf{x})=\sum_{i}w_{i}x_{i}+\sum_{i<j}w_{ij}x_{i}x_{j}+\sum_{i<j<k}w_{ijk}x_{i}x_{j}x_{k} \tag{1}$$
where 1-body coefficients encode fragment *popularity* (fraction of traces containing the fragment):
$$w_{i}=-\mu p_{i}+\alpha\cdot p_{i}(1-p_{i}) \tag{2}$$
and pairwise/triplet coefficients encode *statistical co-occurrence* corrected for a semantic similarity penalty. The low-energy subset (bottom 25th percentile) is selected and passed to a frontier model for final inference. Their results on BIG-Bench Extra Hard (BBEH) show gains of up to 9 pp over GPT-4o, DeepSeek R1, and o3-high, with approximately 5× better energy efficiency than o3-high.
QCR-LLM is an important first step toward HUBO-based combinatorial reasoning in LLMs, but several aspects of the formulation merit further scrutiny. (1) Trace generation cost: QCR-LLM uses frontier models (GPT-4o, DeepSeek) to generate each of its 20 traces for each question. For large-scale deployment over a full benchmark (e.g. 12,032 MMLU-Pro questions) this incurs substantial API cost and largely defeats the purpose of the subsequent optimisation step. (2) MV-correlated weights: QCR-LLM selection is based on fragment popularity $p_{i}$ (Eq. [2](https://arxiv.org/html/2606.06941#S2.E2)), i.e. the fraction of traces endorsing that fragment. Since traces endorse fragments that support the majority-vote answer, the QCR-LLM 1-body term implicitly replicates the MV signal. (3) Missing baseline: QCR-LLM reports gains against frontier baselines (GPT-4o, o3-high) but does not report majority vote across its own frontier-LLM-generated 20-trace ensemble.
While QCR\-LLM applies HUBO to a single fragment set, our EP\-HUBO selects separately from each candidate\-answer pool then provides selected fragments from all pools to a frontier model for adjudication\. Our fragments are generated by local, open\-weight models, and EP\-HUBO uses a frontier model only once per question for adjudication across HUBO\-selected evidences\. Our EP\-HUBO optimiser selects the best evidence*for each hypothesis*independently, rather than selecting a consensus that may suppress minority\-but\-correct reasoning\.
#### Quantum and Classical Annealing for Combinatorial AI Tasks
Pomeroy et al. [[13](https://arxiv.org/html/2606.06941#bib.bib13)] reformulate feature selection, instance selection, and clustering as QUBO problems and show that D-Wave quantum annealing matches or exceeds classical SA on standard ML benchmarks, even on current noisy hardware. Nausheen et al. [[14](https://arxiv.org/html/2606.06941#bib.bib14)] survey quantum natural language processing approaches, from quantum encoding of word embeddings to task-specific quantum circuits, identifying QUBO-based combinatorial reasoning as one of the most practically mature quantum-NLP interfaces available today. The standard HUBO-to-QUBO reduction proceeds by quadratization of cubic terms; on D-Wave quantum hardware this induces a polynomial blowup in the number of qubits. Hardware-aware HUBO solvers such as the BF-DCQO procedure used by QCR-LLM on IBM gate-model devices operate directly on the higher-order formulation. In our work, Phase 3 is solved using both classical simulated annealing and the Dirac-3 photonic quantum machine. Dirac-3 allows us to work directly with the same higher-order HUBO-level formulation, avoiding the auxiliary-variable quadratization that would typically be required for direct implementation on D-Wave quantum annealers, and reducing the reformulation or circuit-overhead burden that often arises in gate-based QAOA approaches.
## 3 The EP-HUBO Framework
### 3.1 Problem Formulation
We formalise the trace-augmented question-answering setting that EP-HUBO addresses. Table [1](https://arxiv.org/html/2606.06941#S3.T1) summarises the notation used throughout the paper.
Table 1: Notation used throughout the paper.
###### Definition 3.1 (Trace-augmented instance).
A *trace-augmented instance* is a tuple $\mathcal{I}=(q,\mathcal{A},\mathcal{M}_{L},\mathcal{M}_{F},N)$ where $q$ is a natural-language question, $\mathcal{A}=\{a_{1},\dots,a_{L}\}$ is a finite set of answer options, $\mathcal{M}_{L}$ is a local language model used for trace generation and weight scoring, $\mathcal{M}_{F}$ is a frontier model used for adjudication, and $N$ is the number of independent traces generated by $\mathcal{M}_{L}$ per question. The gold answer is denoted $a^{*}(q)$.
###### Definition 3.2 (Trace, fragment, answer label).
A *trace* for question $q$ is a pair $\tau=(r,\hat{a})$ where $r$ is a chain-of-thought text and $\hat{a}\in\mathcal{A}$ is the trace’s extracted final answer. An *extraction operator* $\mathrm{frag}:r\mapsto\{f_{1},\dots,f_{m}\}$ produces a set of candidate evidence fragments from the reasoning text. We write $\mathcal{T}(q)=\{\tau_{1},\dots,\tau_{N}\}$ for the trace set generated for $q$.
###### Definition 3.3 (Evidence pool).
For each answer option $a_{\ell}\in\mathcal{A}$, let $\mathcal{T}_{\ell}(q)=\{\tau\in\mathcal{T}(q):\hat{a}(\tau)=a_{\ell}\}$ be the traces that concluded with answer $a_{\ell}$. The *evidence pool* $P_{\ell}(q)$ is the multiset of fragments extracted from these traces:
$$P_{\ell}(q)\;=\;\bigcup_{\tau\in\mathcal{T}(q):\,\hat{a}(\tau)=a_{\ell}}\mathrm{frag}(r(\tau)).$$
A pool is *populated* if $|\mathcal{T}_{\ell}(q)|\geq\tau_{\min}$ for a minimum trace count $\tau_{\min}$; we let $\mathcal{L}(q)\subseteq\mathcal{A}$ denote the set of options with populated pools. We set $\tau_{\min}=3$ in our experiments.
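As a rough illustration of Definition 3.3, the following Python sketch groups traces by their extracted answer and keeps only populated pools. The `Trace` container and the `build_pools` helper are hypothetical names introduced here, not taken from the released scripts, and fragments are assumed to have been extracted already.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Trace:
    """Hypothetical container for one CoT trace (Definition 3.2)."""
    answer: str                                    # extracted final answer label
    fragments: list = field(default_factory=list)  # assumed output of frag(r)


def build_pools(traces, tau_min=3):
    """Return {answer label: list of fragments} for populated pools only."""
    by_answer = defaultdict(list)
    for trace in traces:
        by_answer[trace.answer].append(trace)
    pools = {}
    for label, group in by_answer.items():
        if len(group) >= tau_min:  # populated: at least tau_min traces
            # Duplicates are kept because the pool is a multiset of fragments.
            pools[label] = [f for t in group for f in t.fragments]
    return pools
```

With the default `tau_min=3`, an answer supported by only two traces yields no pool, so its fragments never reach the optimiser or the adjudicator.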
###### Definition 3.4 (HUBO selection).
For each evidence pool $P_{\ell}(q)$ with $|P_{\ell}|=n_{\ell}$ fragments, the EP-HUBO optimisation problem determines the optimal binary vector $\mathbf{x}_{\ell}\in\{0,1\}^{n_{\ell}}$ that minimises the following higher-order energy objective function, subject to $\|\mathbf{x}_{\ell}\|_{1}=K_{\ell}$, where $K_{\ell}$ is a cardinality hyperparameter chosen based on fragment diversity.
$$H_{\ell}(\mathbf{x}_{\ell})\;=\;\sum_{i\in P_{\ell}}w_{i}^{(1)}x_{i}\;+\;\sum_{i<j\in P_{\ell}}w_{ij}^{(2)}x_{i}x_{j}\;+\;\sum_{i<j<k\in P_{\ell}}w_{ijk}^{(3)}x_{i}x_{j}x_{k}, \tag{3}$$
where the weights $w^{(1)},w^{(2)},w^{(3)}$ are scored by $\mathcal{M}_{L}$: 1-body weights on relevance, specificity, and distinctiveness; pairwise weights on support versus contradiction; and triplet weights on coherence versus undermining of the answer. Let $S_{\ell}=\{i\in P_{\ell}:x_{i}=1\}$ denote the selected fragment subset of pool $\ell$.
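To make Eq. 3 concrete, the sketch below evaluates the energy of a selected subset and, for very small pools, finds the cardinality-constrained minimiser by exhaustive search. The dictionary-based weight layout and the function names are assumptions made for illustration, and exhaustive search is used only as a reference check, not as the solver used in the paper.

```python
from itertools import combinations


def hubo_energy(selected, w1, w2, w3):
    """Evaluate Eq. 3 for a collection of selected fragment indices.

    w1: {i: weight}, w2: {(i, j): weight} with i < j,
    w3: {(i, j, k): weight} with i < j < k. Missing keys count as zero.
    """
    s = sorted(selected)
    energy = sum(w1.get(i, 0.0) for i in s)
    energy += sum(w2.get(pair, 0.0) for pair in combinations(s, 2))
    energy += sum(w3.get(trip, 0.0) for trip in combinations(s, 3))
    return energy


def brute_force_select(n, k, w1, w2, w3):
    """Exhaustive minimiser over all size-k subsets (tiny pools only)."""
    best = min(combinations(range(n), k),
               key=lambda s: hubo_energy(s, w1, w2, w3))
    return set(best)
```

In a three-fragment pool where fragments 0 and 1 contradict each other (a positive pairwise weight), the minimiser for `k=2` skips that pair even though both have strong 1-body scores.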
###### Definition 3.5 (Adjudication function).
The adjudicator is a map $\mathcal{A}_{F}:(q,\mathcal{A},\{S_{\ell}\}_{\ell\in\mathcal{L}(q)})\mapsto\hat{a}\in\mathcal{A}$ implemented as a single call to the frontier LLM $\mathcal{M}_{F}$ that receives $q$, all options, and the per-pool selected fragments $\{S_{\ell}\}$, each labelled with its candidate answer. We denote the EP-HUBO output by $\hat{a}_{H}(q)$.
###### Definition 3.6 (Baselines).
We use two baselines: the *zero-shot* prediction $\hat{a}_{ZS}(q)=\mathcal{A}_{F}(q,\mathcal{A},\emptyset)$, which calls the frontier LLM without any evidence pools, and the *majority-vote* prediction $\hat{a}_{MV}(q)=\arg\max_{\ell}|\mathcal{T}_{\ell}(q)|$, which uses the local LLM traces and selects the majority answer from them.
###### Definition 3.7 (HUBO precision).
For a question distribution $\mathcal{D}$ and a fixed adjudicator $\mathcal{M}_{F}$, define the win and hurt events $W(q)=\mathbf{1}[\hat{a}_{H}(q)=a^{*}(q)\wedge\hat{a}_{ZS}(q)\neq a^{*}(q)]$ and $L(q)=\mathbf{1}[\hat{a}_{H}(q)\neq a^{*}(q)\wedge\hat{a}_{ZS}(q)=a^{*}(q)]$. The *HUBO precision* over zero-shot is
$$\mathrm{prec}_{H}\;=\;\frac{\Pr_{\mathcal{D}}[W(q)=1]}{\Pr_{\mathcal{D}}[W(q)=1]+\Pr_{\mathcal{D}}[L(q)=1]}\;=\;\frac{|W|}{|W|+|L|} \tag{4}$$
where the second expression is the estimate from a finite sample, with $|W|$ and $|L|$ the numbers of win and hurt questions. A value of $\mathrm{prec}_{H}>1/2$ indicates that, on the questions where exactly one of EP-HUBO and ZS is correct, EP-HUBO is more often the correct one.
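The finite-sample form of Eq. 4 can be sketched in a few lines of Python. The function name and the list-based inputs are illustrative assumptions, and returning `None` when there are no win or hurt questions is a convention chosen here rather than one stated in the paper.

```python
def hubo_precision(gold, hubo_pred, zs_pred):
    """Finite-sample estimate of Eq. 4: |W| / (|W| + |L|)."""
    triples = list(zip(gold, hubo_pred, zs_pred))
    # Win: EP-HUBO correct while zero-shot is wrong.
    wins = sum(h == g and z != g for g, h, z in triples)
    # Hurt: EP-HUBO wrong while zero-shot is correct.
    hurts = sum(h != g and z == g for g, h, z in triples)
    if wins + hurts == 0:
        return None  # undefined: no question separates the two systems
    return wins / (wins + hurts)
```

Questions that both systems answer correctly, or both answer incorrectly, do not enter the ratio.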
### 3.2 Pipeline Overview
EP-HUBO proceeds in four phases as described below. See Figure [1](https://arxiv.org/html/2606.06941#S3.F1).
Figure 1: EP-HUBO four-phase pipeline: Phase 1 (local LLM, $N=20$ CoT traces), Phase 2 (local LLM, HUBO weights), Phase 3 (SA solver, $K$ fragments per pool), Phase 4 (frontier LLM, final answer); connector labels: evidence pools, cardinality-preserving. Phases 1–3 run locally (free); Phases 1–2 call a local LLM while Phase 3 uses Python code. Phase 4 requires a single frontier API call per question. Green = local; blue = frontier LLM API.
#### Phase 1: Local LLM Trace Generation
We use a local open-weight LLM for reasoning trace generation. Each trace is prompted with a standardised template requesting 3–5 key facts followed by `FINAL ANSWER: [label]`.
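A minimal sketch of how the final answer might be read back from a generated trace is shown below. The regular expression, the assumed label range A to J, and the choice to take the last match are illustrative assumptions rather than the parsing rules of the released scripts.

```python
import re

# Assumed format: a single option letter A-J, optionally wrapped in () or [].
# The trailing \b rejects ordinary words such as "Because".
_FINAL_RE = re.compile(r"FINAL ANSWER:\s*[\(\[]?\s*([A-J])\b", re.IGNORECASE)


def extract_final_answer(trace_text):
    """Return the last stated answer label in upper case, or None."""
    matches = _FINAL_RE.findall(trace_text)
    return matches[-1].upper() if matches else None
```

Traces for which this returns `None` would need to be discarded or re-sampled before pooling; which of the two the authors do is not stated here.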
#### Phase 2: Answer\-Evidence Pools
Traces are grouped by their extracted answer. For each pool with answer candidate $\ell$, fragments are extracted from the raw `raw_text` field using several quality filters (e.g. a minimum of 10 words and at least 2 capitalised entity words). This ensures that the HUBO operates within a hypothesis-specific evidence space: fragments in pool $\ell$ are those produced by traces that concluded the answer is $\ell$.
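The two filters named above can be approximated as follows. This is only a sketch: the paper lists these filters as examples of a larger set, and treating any capitalised token after the first word as an "entity word" is a crude proxy assumed here.

```python
def passes_quality_filters(fragment, min_words=10, min_capitalised=2):
    """Apply the two example filters: word count and capitalised words."""
    words = fragment.split()
    if len(words) < min_words:
        return False
    # Assumed proxy for "capitalised entity words": capitalised tokens other
    # than the sentence-initial word.
    capitalised = [w for w in words[1:] if w[0].isupper()]
    return len(capitalised) >= min_capitalised
```

A short generic statement such as "Robbery requires force" is rejected on length alone, while a longer sentence naming a statute passes both checks.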
Each fragment $f_{i}\in\text{pool}_{\ell}$ is then scored by the local LLM on three dimensions:
$$w_{i}^{(1)}=-\bigl(\alpha\cdot r_{i}+\beta\cdot s_{i}+\gamma\cdot d_{i}\bigr) \tag{5}$$
where $r_{i}$ is *relevance* to the question, $s_{i}$ is *specificity* (concrete fact, statute, or named entity), and $d_{i}$ is *distinctiveness*, i.e., the logical contribution not implied by other fragments. All scores lie in $[0,1]$ and are elicited via a structured JSON prompt. No popularity or vote-frequency signal enters Eq. [5](https://arxiv.org/html/2606.06941#S3.E5), so the weights reflect fragment quality, not majority vote.
Pairwise $(i,j)$ interactions are scored on *support* and *contradict*; triplet $(i,j,k)$ interactions on *coherent* and *undermining*. Top pairs are selected by composite 1-body score and top triplets by 2-body score.
$$w_{ij}^{(2)}=-\lambda_{\text{supp}}\cdot\text{support}_{ij}+\lambda_{\text{contra}}\cdot\text{contradict}_{ij} \tag{6}$$
$$w_{ijk}^{(3)}=-\lambda_{\text{coh}}\cdot\text{coherent}_{ijk}+\lambda_{\text{und}}\cdot\text{undermine}_{ijk} \tag{7}$$
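Eqs. 5 to 7 translate directly into code. In the sketch below the function names are hypothetical and every coefficient defaults to 1.0 purely as a placeholder; the category-specific values used in the paper are not reproduced here.

```python
def one_body_weight(r, s, d, alpha=1.0, beta=1.0, gamma=1.0):
    """Eq. 5: higher quality scores give a more negative (better) weight."""
    return -(alpha * r + beta * s + gamma * d)


def pair_weight(support, contradict, lam_supp=1.0, lam_contra=1.0):
    """Eq. 6: support lowers the energy, contradiction raises it."""
    return -lam_supp * support + lam_contra * contradict


def triplet_weight(coherent, undermine, lam_coh=1.0, lam_und=1.0):
    """Eq. 7: coherence lowers the energy, undermining raises it."""
    return -lam_coh * coherent + lam_und * undermine
```

Because the HUBO is a minimisation, a strongly contradicting pair receives a positive weight and is penalised whenever both fragments are selected together.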
#### Phase 3: Simulated Annealing Selector
The per-pool HUBO of Eq. [3](https://arxiv.org/html/2606.06941#S3.E3) is minimised by simulated annealing (SA) implemented in Python (see Algorithm [1](https://arxiv.org/html/2606.06941#alg1)). Each step proposes dropping one selected fragment and adding one unselected fragment, preserving $\|\mathbf{x}\|_{1}=K$. The quadratic $\Delta E$ is computed in $O(K)$ per step via a pre-built adjacency list keyed by selected fragment indices; the cubic correction uses a triplet lookup table cached at weight-construction time. By using swap moves that preserve $\|\mathbf{x}\|_{1}=K$, SA explores only fixed-cardinality subsets, which regularises evidence-set size.
Algorithm 1: Cardinality-Preserving Swap-Based Simulated Annealing for EP-HUBO
Input: pool $P_{\ell}$, weights $w^{(1)},w^{(2)},w^{(3)}$, cardinality $K$, initial temperature $T_{0}$, cooling rate $\eta\in(0,1)$, steps $M$.
Output: selected fragment subset $S_{\ell}\subseteq P_{\ell}$ with $|S_{\ell}|=K$.
1. $S\leftarrow$ greedy top-$K$ by 1-body score (warm start)
2. $E\leftarrow H_{\ell}(\mathbf{1}_{S})$ (evaluate Eq. [3](https://arxiv.org/html/2606.06941#S3.E3))
3. $T\leftarrow T_{0}$
4. for $t=1,\dots,M$ do
   1. sample $i\in S,\ j\in P_{\ell}\setminus S$ uniformly
   2. $S'\leftarrow(S\setminus\{i\})\cup\{j\}$
   3. $\Delta E\leftarrow\Delta_{\text{1-body}}(i,j)+\Delta_{\text{quad}}(i,j;S)+\Delta_{\text{cubic}}(i,j;S)$
   4. if $\Delta E<0$ or $u\sim\mathrm{Unif}(0,1)<\exp(-\Delta E/T)$ then $S\leftarrow S'$; $E\leftarrow E+\Delta E$
   5. $T\leftarrow\eta T$
5. return $S$
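The swap-based annealer of Algorithm 1 can be sketched in Python as follows. This is an illustrative version under stated assumptions, not the released implementation: it recomputes the full energy for each proposal instead of using the incremental $O(K)$ update described above, and the defaults for `t0`, `eta`, `steps` and `seed` are placeholders.

```python
import math
import random
from itertools import combinations


def energy(selected, w1, w2, w3):
    """Eq. 3 restricted to the selected indices (missing weights are zero)."""
    s = sorted(selected)
    return (sum(w1.get(i, 0.0) for i in s)
            + sum(w2.get(p, 0.0) for p in combinations(s, 2))
            + sum(w3.get(t, 0.0) for t in combinations(s, 3)))


def swap_sa(n, w1, w2, w3, k, t0=1.0, eta=0.995, steps=2000, seed=0):
    """Cardinality-preserving swap SA over fragment indices 0..n-1."""
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if k == n:
        return set(range(n))  # nothing to swap: every fragment is selected
    rng = random.Random(seed)
    # Warm start: greedy top-k by 1-body score (most negative weight first).
    order = sorted(range(n), key=lambda i: w1.get(i, 0.0))
    selected = set(order[:k])
    e = energy(selected, w1, w2, w3)
    temp = t0
    for _ in range(steps):
        i = rng.choice(sorted(selected))                      # fragment to drop
        j = rng.choice(sorted(set(range(n)) - selected))      # fragment to add
        candidate = (selected - {i}) | {j}
        delta = energy(candidate, w1, w2, w3) - e
        # Metropolis rule: always accept improvements, sometimes accept worse.
        if delta < 0 or rng.random() < math.exp(-delta / max(temp, 1e-12)):
            selected, e = candidate, e + delta
        temp *= eta                                           # geometric cooling
    return selected
```

As in Algorithm 1, the function returns the final state rather than the best state visited; tracking the best-so-far subset would be a small extension.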
$K$ is chosen per pool based on intra-pool fragment diversity:
$$K=\begin{cases}\max(2,K_{\text{base}}-1)&\text{if }\bar{d}_{\text{Jaccard}}<0.30\\ K_{\text{base}}&\text{if }0.30\leq\bar{d}_{\text{Jaccard}}\leq 0.60\\ K_{\text{base}}+1&\text{if }\bar{d}_{\text{Jaccard}}>0.60\end{cases} \tag{8}$$
where $\bar{d}_{\text{Jaccard}}$ is the mean pairwise Jaccard distance between fragment word sets. In our experiments fragment diversity averaged 0.91, consistently triggering $K=K_{\text{base}}+1=4$.
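The rule in Eq. 8 can be sketched as below. Lower-casing and whitespace tokenisation are assumptions about how the fragment word sets are built, and `k_base=3` is inferred from the statement that $K_{\text{base}}+1=4$.

```python
from itertools import combinations


def mean_jaccard_distance(fragments):
    """Mean pairwise Jaccard distance between fragment word sets."""
    word_sets = [set(f.lower().split()) for f in fragments]
    dists = []
    for a, b in combinations(word_sets, 2):
        union = a | b
        dists.append(1.0 - len(a & b) / len(union) if union else 0.0)
    return sum(dists) / len(dists) if dists else 0.0


def choose_k(fragments, k_base=3):
    """Eq. 8: shrink K for redundant pools, grow it for diverse ones."""
    d = mean_jaccard_distance(fragments)
    if d < 0.30:
        return max(2, k_base - 1)
    if d <= 0.60:
        return k_base
    return k_base + 1
```

A pool of near-duplicate fragments gets a smaller budget, since selecting more of them adds little new evidence.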
#### Phase 4: Frontier Adjudication
A single frontier LLM API call receives the question, all options, and the HUBO\-selected fragments for each evidence pool, labelled by the pool answer\. For example:
```
=== Evidence supporting (D) ===
1. Under common law, force used in escape immediately after
taking converts larceny to robbery.
...
=== Evidence supporting (E) ===
1. Robbery requires force used *to* obtain the property...
```
The model is instructed to reply with the answer option most strongly supported by the evidence\. All frontier calls are submitted as a single Batch API job\.
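The evidence layout shown above could be assembled along the following lines. The function names, the alphabetical ordering of pools, and the instruction sentence are assumptions for illustration; the exact prompt sent to the frontier model is not reproduced here.

```python
def format_evidence(selected_by_pool):
    """Render per-pool selected fragments, one labelled block per pool."""
    blocks = []
    for label in sorted(selected_by_pool):
        lines = [f"=== Evidence supporting ({label}) ==="]
        lines += [f"{n}. {frag}"
                  for n, frag in enumerate(selected_by_pool[label], 1)]
        blocks.append("\n".join(lines))
    return "\n".join(blocks)


def build_adjudication_prompt(question, options, selected_by_pool):
    """Combine question, options and evidence into one prompt string."""
    option_lines = [f"({label}) {text}"
                    for label, text in sorted(options.items())]
    return "\n".join([
        question,
        *option_lines,
        format_evidence(selected_by_pool),
        # Assumed instruction wording, paraphrasing the description above.
        "Reply with the answer option most strongly supported by the evidence.",
    ])
```

Only populated pools appear in the prompt, so the adjudicator still sees every answer option but evidence for a subset of them.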
We apply domain-specific $\lambda$ presets based on reasoning style. Additionally, $\lambda_{\text{contra}}$ is scaled up by up to $2\times$ when cross-pool vocabulary overlap exceeds 0.50, signalling that pools are making competing claims on the same domain concepts.
Table 2: Category-specific HUBO hyperparameters. Law questions emphasise specificity and penalise contradiction heavily (legal facts are binary).
## 4 Theoretical Analysis
This section establishes properties of EP\-HUBO that motivate the empirical findings\.
A central design choice of EP\-HUBO is to score fragments by per\-fragment*quality dimensions*\(relevance, specificity, distinctiveness\) rather than by frequency as is used for majority vote\.
###### Proposition 4.1 (MV-decoupling of the HUBO weights).
Assume that the per-fragment quality scores $r_{i},s_{i},d_{i}\in[0,1]$ produced by $\mathcal{M}_{L}$ are computed from the fragment text $f_{i}$ and the question $q$ alone, without access to the pool $P_{\ell}$ or to the number of traces $|\mathcal{T}_{\ell}(q)|$. Then for any question $q$ and any pool $\ell$, the 1-body weight $w_{i}^{(1)}=-(\alpha r_{i}+\beta s_{i}+\gamma d_{i})$ is independent of $|P_{\ell}|$ and $|\mathcal{T}_{\ell}(q)|$ in the sense that $\mathrm{Cov}(w_{i}^{(1)},|P_{\ell}|)=0$ over any sampling of traces that preserves the fragment-text marginals.
###### Proof.
By construction the LLM scoring prompt receives only $(q,f_{i})$ as input; pool size enters neither the prompt nor the post-processing of the score. Hence $w_{i}^{(1)}$ is conditionally independent of $|P_{\ell}|$ given $f_{i}$, and covariance zero follows. ∎
Empirically, this is reflected in the large gains over MV shown in the Results section, where fragment selection clearly departs from popularity-based aggregation. This contrasts with the QCR-LLM formulation of [[1](https://arxiv.org/html/2606.06941#bib.bib1)], in which the 1-body coefficients $w_{i}^{(1)}=-\mu p_{i}+\alpha p_{i}(1-p_{i})$ are explicit functions of the empirical fragment frequency $p_{i}$.
EP-HUBO solves $|\mathcal{L}(q)|$ independent HUBO problems, one per populated pool, rather than a single cross-pool optimisation problem. We next show that this decomposition has no loss relative to joint optimisation.
###### Proposition 4.2 (Per-pool decomposability).
Suppose the adjudication function $\mathcal{A}_F$ is invariant under permutations of the labelled-pool inputs, i.e. $\mathcal{A}_F(q, \mathcal{A}, \{S_{\sigma(\ell)}\}_\ell) = \mathcal{A}_F(q, \mathcal{A}, \{S_\ell\}_\ell)$ for every permutation $\sigma$ of $\mathcal{L}(q)$. Then the joint HUBO problem $\min_{\mathbf{x}_1, \dots, \mathbf{x}_L} \sum_\ell H_\ell(\mathbf{x}_\ell)$ decomposes exactly into the $|\mathcal{L}(q)|$ independent per-pool optimisations solved by EP-HUBO in Phase 3.
###### Proof\.
The energy objective $\sum_\ell H_\ell(\mathbf{x}_\ell)$ is additively separable across pools because $H_\ell$ depends only on $\mathbf{x}_\ell$. Therefore any joint minimiser is a concatenation of per-pool minimisers, and conversely. Adjudicator permutation invariance ensures that re-labelling pool order does not change the downstream decision. ∎
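The separability argument can be checked numerically on a toy instance. The sketch below is an illustration only: it builds two small pools with random third-order weights (the pool sizes, weight ranges and helper names are assumptions), then compares the brute-force joint minimum with the sum of the per-pool minima.

```python
import itertools
import random


def energy(x, w1, w2, w3):
    """Third-order HUBO energy of a binary assignment x (tuple of 0/1)."""
    e = sum(w1[i] * x[i] for i in range(len(x)))
    e += sum(w * x[i] * x[j] for (i, j), w in w2.items())
    e += sum(w * x[i] * x[j] * x[k] for (i, j, k), w in w3.items())
    return e


def random_pool(n, rng):
    """Toy pool with random 1-, 2- and 3-body weights (illustrative only)."""
    w1 = [rng.uniform(-1.0, 0.0) for _ in range(n)]
    w2 = {p: rng.uniform(-0.5, 0.5) for p in itertools.combinations(range(n), 2)}
    w3 = {t: rng.uniform(-0.5, 0.5) for t in itertools.combinations(range(n), 3)}
    return w1, w2, w3


def pool_minimum(pool, n):
    """Exhaustive minimum of one pool's energy over all 2^n assignments."""
    return min(energy(x, *pool) for x in itertools.product((0, 1), repeat=n))


rng = random.Random(0)
sizes = [3, 4]
pools = [random_pool(n, rng) for n in sizes]

# Sum of independent per-pool minima.
per_pool = sum(pool_minimum(pool, n) for pool, n in zip(pools, sizes))

# Joint minimum over the product of both assignment spaces.
joint = min(
    energy(xa, *pools[0]) + energy(xb, *pools[1])
    for xa in itertools.product((0, 1), repeat=sizes[0])
    for xb in itertools.product((0, 1), repeat=sizes[1])
)

print(abs(per_pool - joint) < 1e-12)
```

Because no term couples variables from different pools, the two quantities agree for any choice of weights, not only for this seed.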
While permutation invariance should hold in principle, our LEXam results reveal a label prior in one of the zero-shot frontier models; we regard this as atypical behaviour. Our next result shows that EP-HUBO is at least as expressive as majority vote: there exists a weight assignment for which the pipeline recovers the MV prediction.
###### Proposition 4.3 (MV recoverability).
Let the 1-body weights be set as $w_i^{(1)} = -|\mathcal{T}_\ell(q)|/N$ for every $i \in P_\ell$, and let all pairwise and triplet weights vanish. Then for any $K \geq 1$ the per-pool HUBO objective is minimised by selecting any $K$ fragments from the largest pool, and the resulting adjudicator input maximally weights the majority answer.
###### Proof\.
With pairwise and triplet terms zero, $H_\ell(\mathbf{x}) = \sum_i w_i^{(1)} x_i = -K|\mathcal{T}_\ell(q)|/N$ for every cardinality-$K$ selection in pool $\ell$. Hence the minimum energy over all $(\ell, S)$ pairs is achieved at the pool with the largest trace count, recovering MV. ∎
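The weight assignment of Proposition 4.3 can be sketched in a few lines. The trace labels below are a toy 20-trace sample, and the helper names and the choice $K = 3$ are assumptions made for illustration.

```python
from collections import Counter


def mv_recovering_energy(pool_trace_count, n_traces, k):
    """Energy of any K-fragment selection when w_i = -|T_l|/N and
    all pairwise and triplet weights are zero."""
    return -k * pool_trace_count / n_traces


def lowest_energy_pool(trace_labels, k=3):
    """Return the pool label with minimum energy, plus all pool energies."""
    counts = Counter(trace_labels)
    n = len(trace_labels)
    energies = {label: mv_recovering_energy(c, n, k) for label, c in counts.items()}
    return min(energies, key=energies.get), energies


# Toy sample: 11 traces answer "B", 6 answer "D", 3 answer "A".
labels = ["B"] * 11 + ["D"] * 6 + ["A"] * 3
best, energies = lowest_energy_pool(labels)
majority = Counter(labels).most_common(1)[0][0]
print(best, majority)
```

With these weights the lowest-energy pool is always the pool with the most traces, so the selection coincides with the majority-vote label.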
###### Theorem 4.4 (Sample complexity of HUBO precision).
Let $n = |W| + |L|$ be the number of observed HUBO/ZS disagreements on a held-out sample, and let $\hat{p}_H = |W|/n$ be the empirical HUBO precision estimator. Then for any $\delta \in (0,1)$, with probability at least $1 - \delta$,

$$\bigl|\hat{p}_H - \mathrm{prec}_H\bigr| \;\leq\; \sqrt{\frac{\log(2/\delta)}{2n}}. \tag{9}$$
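To give a sense of scale for the bound in Eq. (9), the sketch below evaluates its right-hand side for one pair of win/hurt counts. The counts 129 and 85 are those reported for Sonnet with Qwen-35B traces on MMLU-Pro law; treating them as a held-out sample and choosing $\delta = 0.05$ are assumptions made for illustration, and the function name is hypothetical.

```python
import math


def hubo_precision_interval(wins, hurts, delta=0.05):
    """Empirical HUBO precision and the half-width of the bound in Eq. (9)."""
    n = wins + hurts
    p_hat = wins / n
    half_width = math.sqrt(math.log(2.0 / delta) / (2.0 * n))
    return p_hat, half_width


p_hat, eps = hubo_precision_interval(129, 85)
print(f"{p_hat:.3f} +/- {eps:.3f}")
```

With $n = 214$ disagreements the half-width is roughly 0.09, so a few hundred disagreements are needed before an estimate near 0.60 can be separated from the break-even value of 0.5 with this bound.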
The next result shows that when the trace distribution collapses to a single pool, EP\-HUBO cannot differ from MV\. This explains why HUBO gain over MV is structurally limited when the local trace generator produces low pool diversity\.
###### Theorem 4.5 (Pool-collapse implies MV-equivalence).
Fix a question $q$ and suppose $|\mathcal{L}(q)| = 1$, i.e. exactly one pool is populated (all other pools have fewer than $\tau_{\min}$ traces). Then $\hat{a}_H(q) = \hat{a}_{MV}(q)$ deterministically, independent of any HUBO weights or the SA solver's stochasticity.
###### Proof\.
With only one populated pool $P_{\ell^*}$, Phase 3 solves a single HUBO on $P_{\ell^*}$ and Phase 4 receives evidence labelled only with label $\ell^*$. Adjudicators conditioned on a single labelled-pool input default to that label; MV with $|\mathcal{L}(q)| = 1$ also returns $\ell^*$. Hence the outputs coincide. ∎
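The pool-collapse condition is straightforward to test per question. The sketch below counts populated pools from the answer labels of the sampled traces; the helper names are hypothetical, and the default $\tau_{\min} = 3$ is an assumption based on the three-trace threshold used for the pool statistics in Table 3.

```python
from collections import Counter


def populated_pools(trace_labels, tau_min=3):
    """Answer labels supported by at least tau_min traces, with their counts."""
    counts = Counter(trace_labels)
    return {label: c for label, c in counts.items() if c >= tau_min}


def can_differ_from_mv(trace_labels, tau_min=3):
    """EP-HUBO can depart from MV only if at least two pools are populated."""
    return len(populated_pools(trace_labels, tau_min)) >= 2


collapsed = ["C"] * 18 + ["A"] * 2   # toy sample: only pool "C" is populated
diverse = ["C"] * 12 + ["A"] * 8     # toy sample: two populated pools

print(can_differ_from_mv(collapsed), can_differ_from_mv(diverse))
```

Averaging the number of populated pools over all questions gives a pools-per-question statistic of the kind reported for the two trace generators.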
The mean number of populated pools per question (1.43 and 1.77 for the two local trace generators on MMLU-Pro law) quantifies how often EP-HUBO can differ from MV; limited pool diversity structurally caps the feasible gains. Lastly, we make precise the sense in which EP-HUBO mitigates adjudicator biases. The proposition was motivated by an experimental observation discussed in the Results section: on LEXam, zero-shot Sonnet has a strong prior to select option 'E'.
###### Proposition 4.6 (Bias mitigation lower bound).
Suppose the adjudicator $\mathcal{M}_F$ has a fixed-label prior in the absence of evidence: there exist $\ell^\dagger \in \mathcal{A}$ and $\pi \geq 1/2$ such that $\Pr[\hat{a}_{ZS}(q) = \ell^\dagger] \geq \pi$ for all $q$. Let $\mathcal{B} = \{q : a^*(q) \neq \ell^\dagger\}$ be the set of questions where the prior is incorrect. If on the subset $\mathcal{B}$ the EP-HUBO pipeline correctly populates a gold pool $P_{a^*}(q)$ (i.e. at least $\tau_{\min}$ traces support $a^*$) and the adjudicator selects $a^*$ when provided with that pool, then $\Pr[\hat{a}_H = a^*] \geq \Pr[a^* \neq \ell^\dagger] \cdot p_{\text{recover}}$, where $p_{\text{recover}}$ is the conditional probability that the gold pool is populated and adjudicated correctly.
###### Proof\.
For $q \in \mathcal{B}$ the ZS prior is wrong, so any HUBO success on $\mathcal{B}$ is a strict gain (a win event). Conditioning on the populated-gold-pool event and applying the adjudicator success probability gives the stated bound. ∎
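The conditioning step can be written out explicitly. The derivation below is a sketch under one reading of the statement, namely that $p_{\text{recover}}$ denotes $\Pr[\hat{a}_H = a^* \mid q \in \mathcal{B}]$; the first line simply discards the questions outside $\mathcal{B}$.

```latex
\begin{aligned}
\Pr[\hat{a}_H = a^*]
  &\geq \Pr[\hat{a}_H = a^*,\; q \in \mathcal{B}] \\
  &= \Pr[q \in \mathcal{B}] \cdot \Pr[\hat{a}_H = a^* \mid q \in \mathcal{B}] \\
  &= \Pr[a^* \neq \ell^{\dagger}] \cdot p_{\text{recover}} .
\end{aligned}
```

The bound is therefore informative mainly when the prior label is rarely correct, so that $\Pr[a^* \neq \ell^{\dagger}]$ is close to one.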
## 5 Results
We evaluate on two legal-reasoning benchmarks. MMLU-Pro \[[18](https://arxiv.org/html/2606.06941#bib.bib18)\] is a harder successor to MMLU with 10-option multiple-choice questions. We restrict our evaluation to the complete law test split ($n = 1{,}101$ questions). LEXam \[[17](https://arxiv.org/html/2606.06941#bib.bib17)\] is a 569-question benchmark of Swiss and international-law multiple-choice questions with 8 options (A–H), covering Interdisciplinary ($n = 494$), Private law ($n = 63$), and Public law ($n = 12$) areas. We select LEXam to test out-of-domain generalisation relative to the more commonly used MMLU-Pro: LEXam uses a different option count, draws on civil law rather than the common law of the MMLU-Pro questions, and is substantially harder for local models (MV baseline 48.3% versus 63.1% for MMLU-Pro law).
For the trace generator (Phase 1), we use the smaller open-weight models Qwen3.5-35B and OSS-20B, run on a local server at temperature 0.8. We set $N = 20$ traces per question as a balance between marginal accuracy gain and trace-generation cost; early experiments with larger $N$, up to 100, showed diminishing returns beyond 20. Statistics on the reasoning traces produced by the two local LLMs on MMLU-Pro are provided in Table [3](https://arxiv.org/html/2606.06941#S5.T3). OSS-20B produces more diverse reasoning paths, as reflected in mean pool count and intra-pool fragment diversity. The HUBO weight scorer (Phase 2) uses the same local models as trace generation, at temperature 0.0. For frontier LLM adjudication (Phase 4) we use two Anthropic frontier models, Claude Sonnet 4.6 and Claude Opus 4.6, Opus being the larger of the two.
Table 3: Trace statistics for the MMLU-Pro law test questions. Both trace generators produced 20 traces per question. Parse rate is the fraction of traces from which a final answer label could be extracted; both generators reach 99.9%. Mean acc. is the per-trace accuracy averaged across the corpus; MV acc. is the majority-vote accuracy aggregated per question, restricted to the law subset. Pools/q, the mean number of distinct answer-label pools with $\geq 3$ traces, which is relevant to HUBO's effectiveness, is 1.43 and 1.77 for the two local trace generator models.

We compare EP-HUBO against two baselines:
- Majority Vote (MV): The most common answer across all local traces per question. This is the natural self-consistency baseline, and it answers the question: does HUBO selection beat the trivial trace aggregator?
- Zero-shot frontier model (ZS): Direct frontier adjudication of the question and its options, with no traces and no HUBO. The two frontier LLMs are prompted exactly as in EP-HUBO Phase 4, but without the EP-HUBO evidence. This baseline asks whether the EP-HUBO pipeline adds value over simply calling the frontier model directly.
For each configuration we report three quantities: accuracy, net gain over a baseline, and HUBO precision. *Accuracy* on a question set $\mathcal{Q}$ of size $|\mathcal{Q}|$ is the fraction of questions for which the prediction matches the gold answer, $|\mathcal{Q}|^{-1} \sum_{q \in \mathcal{Q}} \mathbf{1}[\hat{a}(q) = a^*(q)]$. *Net gain over majority vote* is $\Delta_{MV} = \mathrm{acc}(\hat{a}_H) - \mathrm{acc}(\hat{a}_{MV})$, and *net gain over zero-shot* is $\Delta_{ZS} = \mathrm{acc}(\hat{a}_H) - \mathrm{acc}(\hat{a}_{ZS})$. Both are absolute differences in accuracy (percentage points). *HUBO precision* is the per-question paired statistic of Definition [3.7](https://arxiv.org/html/2606.06941#S3.Thmdefinition7), repeated here for convenience: $\mathrm{prec}_H = |W|/(|W| + |L|)$, where $W$ counts questions on which EP-HUBO is correct and zero-shot is not, and $L$ counts the reverse. The four possible outcomes are: both correct, both wrong, trace-driven win ($W$), or HUBO hurt ($L$).
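The three reported quantities can be computed from per-question predictions as in the following sketch. The function names and the six-question toy data are illustrative assumptions, not part of the evaluation code.

```python
def accuracy(pred, gold):
    """Fraction of questions whose prediction matches the gold answer."""
    return sum(p == g for p, g in zip(pred, gold)) / len(gold)


def paired_outcomes(hubo, zs, gold):
    """Count trace-driven wins (W) and HUBO hurts (L) against zero-shot."""
    wins = sum(h == g and z != g for h, z, g in zip(hubo, zs, gold))
    hurts = sum(h != g and z == g for h, z, g in zip(hubo, zs, gold))
    return wins, hurts


def hubo_precision(wins, hurts):
    """prec_H = |W| / (|W| + |L|); undefined (NaN) when there are no disagreements."""
    n = wins + hurts
    return wins / n if n else float("nan")


# Toy data: six questions with gold labels and two sets of predictions.
gold = ["A", "B", "C", "D", "E", "F"]
zs = ["A", "E", "E", "D", "E", "E"]
hubo = ["A", "B", "C", "A", "E", "E"]

wins, hurts = paired_outcomes(hubo, zs, gold)
delta_zs = accuracy(hubo, gold) - accuracy(zs, gold)
print(wins, hurts, hubo_precision(wins, hurts), delta_zs)
```

Since questions on which both systems agree in correctness cancel out, the net gain satisfies $\Delta_{ZS} = (|W| - |L|)/|\mathcal{Q}|$; HUBO precision above $1/2$ is therefore equivalent to a positive net gain over zero-shot.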
### 5.1 MMLU-Pro Results
Table [4](https://arxiv.org/html/2606.06941#S5.T4) presents the main MMLU-Pro law results across the two trace generators $\times$ two adjudicators, alongside the majority-vote and zero-shot baselines. This section summarises the results using simulated annealing (SA) on a classical computer to solve the HUBO optimisation problem. HUBO+Opus delivers a +12.6 percentage point (pp) improvement over MV with Qwen-35B traces (+139 questions on the 1,101-question law set) and +27.9 pp with OSS-20B traces (+307 questions). HUBO+Sonnet gives +4.0 pp and +17.6 pp respectively. These gains are substantial and confirm that HUBO selection adds value over trivial trace aggregation. Recall that EP-HUBO achieves this because it does not assign trace weights based on majority-vote-style popularity.
Zero-shot (ZS) Opus achieves 74.2% on MMLU-Pro law without HUBO-optimised traces; HUBO+Opus with Qwen-35B traces reaches 75.7% (+1.5 pp over ZS Opus), and HUBO precision with Opus is 56.7% for Qwen3.5-35B and 49.5% for OSS-20B. With Sonnet as adjudicator, HUBO precision is 60.3% and 52.3% respectively. EP-HUBO with Qwen-35B traces thus measurably outperforms a direct frontier call on MMLU-Pro law, with both adjudicators exceeding the $\mathrm{prec}_H > 1/2$ threshold at which HUBO wins exceed HUBO hurts in expectation.
| Method | Adj. | MMLU-Pro Law (n=1,101) |
|---|---|---|
| ZS (no traces/HUBO) | Sonnet 4.6 | 63.1% |
| ZS (no traces/HUBO) | Opus 4.6 | 74.2% |
| *Trace generator: Qwen3.5-35B, 20 traces/question* | | |
| MV (Qwen-35B) | – | 63.1% |
| **EP-HUBO (Qwen-35B)** | **Sonnet 4.6** | **67.1%** |
| **EP-HUBO (Qwen-35B)** | **Opus 4.6** | **75.7%** |
| *Trace generator: OSS-20B, 20 traces/question* | | |
| MV (OSS-20B) | – | 46.2% |
| EP-HUBO (OSS-20B) | Sonnet 4.6 | 63.9% |
| EP-HUBO (OSS-20B) | Opus 4.6 | 74.1% |
| *Net gain vs majority vote (MV) and zero-shot (ZS) baselines (Qwen-35B traces)* | | |
| Net Sonnet vs MV (Qwen-35B) | | +4.0 pp |
| Net Sonnet vs ZS Son (Qwen-35B) | | +4.0 pp |
| Net Opus vs MV (Qwen-35B) | | +12.6 pp |
| Net Opus vs ZS Opus (Qwen-35B) | | +1.5 pp |
| *Net gain vs majority vote (MV) and zero-shot (ZS) baselines (OSS-20B traces)* | | |
| Net Sonnet vs MV (OSS-20B) | | +17.6 pp |
| Net Sonnet vs ZS Son (OSS-20B) | | +0.7 pp |
| Net Opus vs MV (OSS-20B) | | +27.9 pp |
| Net Opus vs ZS Opus (OSS-20B) | | −0.1 pp |
| HUBO prec (Sonnet, Qwen-35B) | 129 W / 85 H | 60.3% |
| HUBO prec (Opus, Qwen-35B) | 72 W / 55 H | 56.7% |
| HUBO prec (Sonnet, OSS-20B) | 91 W / 83 H | 52.3% |
| HUBO prec (Opus, OSS-20B) | 55 W / 56 H | 49.5% |

Table 4: MMLU-Pro law results ($n = 1{,}101$). HUBO optimisation significantly outperforms majority vote (MV) trace aggregation and also provides measurable gains over zero-shot (ZS) frontier adjudication on this benchmark. W refers to trace-driven wins, and H refers to HUBO hurting the outcome, as compared to ZS. HUBO precision is defined in Eq. [4](https://arxiv.org/html/2606.06941#S3.E4). Precision > 0.5 means that whenever HUBO disagrees with the ZS frontier model, HUBO improves accuracy in expectation. The bolded rows, EP-HUBO with the Qwen3.5-35B local model traces, are the best configuration in terms of accuracy for each frontier model, beating both MV and ZS.
### 5.2 LEXam Results
We also evaluate EP-HUBO on LEXam \[[17](https://arxiv.org/html/2606.06941#bib.bib17)\], a 569-question benchmark of Swiss and international law with 8-option multiple-choice questions (A–H) across three legal domains: Interdisciplinary ($n = 494$), Private law ($n = 63$), and Public law ($n = 12$). LEXam was released in 2024 and is sourced from a civil-law jurisdiction (Switzerland) with limited English-language online presence relative to common-law sources. Its 8-option format and European legal context are atypical of the benchmarks on which frontier models are routinely evaluated. We therefore treat LEXam as a low-contamination benchmark that complements MMLU-Pro law. On LEXam, zero-shot Opus achieves only 66.8%, compared with 74.2% on MMLU-Pro law, which leaves more headroom for the EP-HUBO optimisation.
On LEXam, we evaluate EP-HUBO using two trace generators, Qwen3.5-35B and OSS-20B, with 20 traces per question each. This section summarises the LEXam results using simulated annealing (SA) on a classical computer to solve the HUBO optimisation problem. With Opus as adjudicator, EP-HUBO reaches 71.9% with Qwen-35B traces and 71.5% with OSS-20B traces.
The two trace generators behave differently on LEXam. Qwen-35B's traces have low answer entropy: its MV baseline is already 69.1%, only 2.8 pp below HUBO+Opus. OSS-20B's traces have higher answer entropy: its MV baseline is only 48.3%, with HUBO+Opus delivering +23.2 pp. The two trace generators thus illustrate two regimes, one where MV is strong and HUBO finds a small margin, and the other where MV scatters across wrong answers and HUBO recovers a large margin.
Against the zero-shot frontier baseline, both Opus configurations exceed ZS Opus's 66.8%: Qwen-35B+Opus by +5.1 pp and OSS-20B+Opus by +4.7 pp, with HUBO precisions of 68.8% and 64.5% respectively.
The most striking finding on LEXam concerns the zero-shot baseline performance of Sonnet 4.6. We observed that zero-shot Sonnet returns option “E” on 499 of the 569 LEXam questions, whereas the true answers are distributed near-uniformly across the eight options (see Table [6](https://arxiv.org/html/2606.06941#S5.T6)). In particular, only 13% of the gold answers are “E”. This position bias leads to a zero-shot accuracy for Sonnet on LEXam of only 22.3%.
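A position bias of this kind can be surfaced by comparing the share of each predicted label with its share among the gold answers. The sketch below is illustrative only: the helper names are hypothetical, and since only the count of “E” answers is quoted above, the remaining 70 predictions are filled with a placeholder label.

```python
from collections import Counter


def label_shares(labels):
    """Share of each label in a list of answers."""
    counts = Counter(labels)
    total = len(labels)
    return {label: c / total for label, c in counts.items()}


def dominant_label(labels):
    """Most frequent label and its share."""
    shares = label_shares(labels)
    label = max(shares, key=shares.get)
    return label, shares[label]


# 499 "E" answers out of 569 questions; the other 70 predictions use a
# placeholder label because their breakdown is not given in the text.
predicted = ["E"] * 499 + ["other"] * 70
label, share = dominant_label(predicted)
print(label, round(share, 3))
```

A dominant-label share of about 0.88 against a gold share of 0.13 for the same option is the signature of the collapse described above.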
Examining the Sonnet bias by LEXam sub-area (Table [7](https://arxiv.org/html/2606.06941#S5.T7)) shows that the collapse is essentially confined to the Interdisciplinary subset, which is entirely Swiss-law content, where Sonnet picks “E” on 95.7% of questions and achieves 14.8% zero-shot accuracy. On the Private subset (Chinese and US business law, $n = 63$), Sonnet's answer distribution is more evenly spread and zero-shot accuracy is 76.2%. Opus 4.6 exhibits a smaller version of the same pattern, over-picking “E” by +10 pp on Interdisciplinary. This indicates that the trigger of the LLM position bias is the Swiss-law corpus content rather than the multiple-choice format itself, and that Opus's broader pre-training largely absorbs it.
EP-HUBO is able to recover a large share of these position-bias errors: with Qwen-35B traces, HUBO+Sonnet wins 126 questions against ZS Sonnet while introducing only 11 hurts, for a HUBO precision of 92.0%, the highest value of any configuration in this study. With OSS-20B traces, HUBO+Sonnet attains 80.4% precision (86 wins, 21 hurts). HUBO+Sonnet accuracy (42.5% with Qwen-35B traces, 33.7% with OSS-20B) nevertheless remains below the MV baseline, because MV draws on the underlying trace ensembles, which are not E-biased.
Table 5: EP-HUBO results on LEXam's Swiss/international law questions. Wins (W) and hurts (H) are computed against the zero-shot frontier LLM. HUBO precision is defined in Eq. [4](https://arxiv.org/html/2606.06941#S3.E4). HUBO+Opus delivers a +5.1 pp gain over the already-strong ZS Opus baseline with Qwen-35B traces (+4.7 pp with OSS-20B). HUBO+Sonnet improves dramatically over zero-shot Sonnet, whose accuracy is very low (22.3%) due to an observed position bias. Qwen-35B + Sonnet attains the highest HUBO precision in this study at 92.0% (126 wins for only 11 hurts versus ZS Sonnet). The bolded row is the best configuration in terms of overall accuracy.

Table 6: Distribution of answers across the eight options on the full LEXam set, rounded to integers. The distribution is near-uniform whereas Sonnet has a very strong 'E' bias.

Table 7: Zero-shot “E” rate and accuracy by LEXam sub-area for Sonnet 4.6 and Opus 4.6. The E-bias is concentrated in the Interdisciplinary (Swiss-law) subset, where Sonnet's output is essentially degenerate.
### 5.3 Quantum Results on MMLU-Pro and LEXam
Table [8](https://arxiv.org/html/2606.06941#S5.T8) summarises the solver configuration used for all Dirac-3 experiments reported in this section. To validate these choices, we swept the relaxation schedule over values $\{1,2,3,4\}$ and increased the number of samples to 10 on a held-out subset of 50 randomly selected questions per benchmark; accuracy and HUBO precision were invariant across these variations, so we retained the configuration in Table [8](https://arxiv.org/html/2606.06941#S5.T8) for the full evaluation.
Table 8: Dirac-3 machine configuration used for all quantum experiments. Other relaxation schedules ($\{1,3,4\}$) and sample counts up to 10 were evaluated on a 50-question held-out subset; results were invariant, so this configuration was retained for the full evaluation.

The HUBO formulation in Eq. [3](https://arxiv.org/html/2606.06941#S3.E3) is a higher-order binary optimisation problem that is hardware-agnostic. While the results reported in Sections [5](https://arxiv.org/html/2606.06941#S5).1–[5](https://arxiv.org/html/2606.06941#S5).2 use classical simulated annealing (Algorithm [1](https://arxiv.org/html/2606.06941#alg1)) as the Phase 3 solver, in this section we report the results when the Phase 3 optimisation is solved on the Dirac-3 photonic quantum machine. Table [9](https://arxiv.org/html/2606.06941#S5.T9) summarises the results.
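For readers unfamiliar with the classical solver, the following is a generic single-bit-flip simulated annealing sketch for a third-order binary objective. It is not a reproduction of Algorithm 1: the geometric cooling schedule, the step count, the all-zeros starting point and the toy weights are all assumptions made for illustration, and no cardinality handling is included.

```python
import math
import random


def hubo_energy(x, w1, w2, w3):
    """Third-order HUBO energy; w1, w2, w3 map index tuples to weights."""
    e = sum(w * x[i] for i, w in w1.items())
    e += sum(w * x[i] * x[j] for (i, j), w in w2.items())
    e += sum(w * x[i] * x[j] * x[k] for (i, j, k), w in w3.items())
    return e


def simulated_annealing(n, w1, w2, w3, steps=5000, t_start=1.0, t_end=0.01, seed=0):
    """Generic SA with single-bit flips and geometric cooling (illustrative)."""
    rng = random.Random(seed)
    x = [0] * n
    e = hubo_energy(x, w1, w2, w3)
    best_x, best_e = x[:], e
    for step in range(steps):
        # Geometric interpolation from t_start down to t_end.
        t = t_start * (t_end / t_start) ** (step / max(steps - 1, 1))
        i = rng.randrange(n)
        x[i] ^= 1  # propose flipping one bit
        e_new = hubo_energy(x, w1, w2, w3)
        if e_new <= e or rng.random() < math.exp((e - e_new) / t):
            e = e_new  # accept the move
            if e < best_e:
                best_x, best_e = x[:], e
        else:
            x[i] ^= 1  # reject: undo the flip
    return best_x, best_e


# Toy instance with four fragments (weights are made up for illustration).
w1 = {0: -0.9, 1: -0.8, 2: -0.7, 3: -0.2}
w2 = {(0, 1): 0.5, (1, 2): -0.3}
w3 = {(0, 1, 2): 0.4}
best_x, best_e = simulated_annealing(4, w1, w2, w3)
print(best_x, best_e)
```

The sketch recomputes the full energy after every flip for clarity; a practical solver would update only the terms that involve the flipped variable.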
Table 9: Results when the HUBO optimisation of Phase 3 is solved using the Dirac-3 photonic quantum machine, rather than simulated annealing. $\Delta$(Quant − SA) is the accuracy difference, quantum minus simulated annealing on a classical computer. Bold Dirac-3 accuracy: the only quantum configuration whose accuracy exceeds the classical-SA configuration. Bold HUBO precision: the highest precision in the table, which is so high because zero-shot Sonnet is biased on LEXam and itself has very low ZS accuracy.

On LEXam the quantum solver is comparable to classical simulated annealing across both adjudicators: with Opus the differences with respect to SA are +1.0 pp using Qwen-35B and −0.3 pp using OSS-20B as trace generator. With Sonnet as frontier adjudicator the quantum results are slightly lower than the classical solver, at −0.5 pp with Qwen-35B and −1.5 pp with OSS-20B. With Sonnet adjudication on LEXam, the quantum solver also recovers a large share of the ZS Sonnet option-position bias: Qwen-35B with Dirac-3 and Sonnet reaches 42.0% accuracy with HUBO precision 91.8%, comparable to the corresponding classical SA result.
On MMLU-Pro law, by contrast, the quantum solver underperforms classical SA across both adjudicators by 5–8 percentage points and the Dirac-3 configurations all fall below their respective ZS frontier baselines. HUBO precisions using the quantum solver are all below the $\mathrm{prec}_H = 50\%$ break-even threshold. The Dirac-3 runs still substantially exceed majority vote, so the quantum solver is finding signal in the HUBO landscape, but it is selecting fragment subsets the adjudicator follows less reliably than the classical solver's selections on MMLU-Pro.
Dirac-3 imposes a hard limit of 135 binary variables for third-order polynomial objectives. When the total number of fragments across all answer-option pools for a given question exceeds this limit, we truncate to the 135 highest-scoring fragments before submitting to the hardware. Specifically, fragments are ranked by their diagonal HUBO weight $h_i = \alpha \cdot r_i + \beta \cdot s_i + \gamma \cdot d_i$ (the 1-body relevance–specificity–distinctiveness composite from Eq. [3](https://arxiv.org/html/2606.06941#S3.E3)), and the top 135 are retained; all 2-body and 3-body interaction terms involving discarded fragments are dropped, and variable indices are remapped accordingly. The classical SA solver operates on the full, untruncated fragment pool. This asymmetry is a plausible contributor to the accuracy gap: questions in the benchmark datasets tend to generate larger and more diverse trace pools, so truncation is more frequent and removes a greater fraction of the interaction structure from the HUBO landscape before the quantum solver sees it.
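The truncation step described above can be sketched as follows. The function name, the dictionary-based representation of the interaction terms and the tie-breaking by original index are assumptions made for illustration; only the ranking by $h_i$, the dropping of terms that touch discarded fragments and the index remapping follow the description in the text.

```python
def truncate_hubo(h, w2, w3, limit=135):
    """Keep the `limit` fragments with the largest h_i and drop every
    interaction term that involves a discarded fragment.

    h  : list of 1-body composite scores h_i (higher is better)
    w2 : dict mapping (i, j) to a pairwise weight
    w3 : dict mapping (i, j, k) to a triplet weight
    """
    # Rank by h_i, descending; ties keep their original order (stable sort).
    keep = sorted(range(len(h)), key=lambda i: h[i], reverse=True)[:limit]
    keep.sort()  # preserve original relative order when remapping
    remap = {old: new for new, old in enumerate(keep)}

    h_t = [h[old] for old in keep]
    w2_t = {tuple(remap[i] for i in idx): w
            for idx, w in w2.items() if all(i in remap for i in idx)}
    w3_t = {tuple(remap[i] for i in idx): w
            for idx, w in w3.items() if all(i in remap for i in idx)}
    return h_t, w2_t, w3_t, remap


# Toy example with a limit of 3 instead of 135.
h = [0.9, 0.1, 0.8, 0.5, 0.2]
w2 = {(0, 2): 1.0, (1, 2): 2.0, (3, 4): 3.0}
w3 = {(0, 2, 3): 0.5, (0, 1, 2): 0.7}
h_t, w2_t, w3_t, remap = truncate_hubo(h, w2, w3, limit=3)
print(h_t, w2_t, w3_t, remap)
```

In the toy example two of the three pairwise terms and one of the two triplet terms are lost, which illustrates how truncation can remove a large fraction of the interaction structure even when most of the high-scoring fragments are kept.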
## 6 Ablation Study
To isolate the contribution of the five EP-HUBO method components, and to assess sensitivity to adjudicator strength, we report a small-scale ablation study on two parallel 25-question subsets: the first 25 law questions of MMLU-Pro (Table [11](https://arxiv.org/html/2606.06941#S6.T11)) and the first 25 English-language questions of LEXam (Table [12](https://arxiv.org/html/2606.06941#S6.T12)). For both, traces are generated using a small Qwen3.5-9B model run locally with Ollama, at the denser rate of 100 traces per question, as opposed to the 20 traces per question used in our main results tables.
Our ablation contrasts four EP-HUBO configurations against the zero-shot (ZS) and majority-vote (MV) baselines. EP-HUBO-base is a minimal version of EP-HUBO without the five full-method components (fragment re-extraction, category-specific $\lambda$ presets, dynamic $K$, distinctiveness scoring, and adaptive contradiction penalty; see Table [10](https://arxiv.org/html/2606.06941#S6.T10)). The last three configurations differ only in the Phase 4 frontier adjudicator (Haiku 4.5, Sonnet 4.6, and Opus 4.6, in order from smallest to largest model). Table [11](https://arxiv.org/html/2606.06941#S6.T11) reports the MMLU-Pro results, while the LEXam ablation results are reported in Table [12](https://arxiv.org/html/2606.06941#S6.T12).
Table 10: The five method components in the full EP-HUBO that distinguish it from its base configuration in Table [11](https://arxiv.org/html/2606.06941#S6.T11).

Table 11: Ablation study on the 25-question MMLU-Pro law set, using as trace generator a Qwen3.5-9B model run locally with Ollama, with 100 traces per question. The EP-HUBO-base row isolates the contribution of the full EP-HUBO method from its base version. The last three rows isolate the effect of adjudicator strength (LLM capacity). ZS are the zero-shot baselines, computed on the same 25 questions. W and H are HUBO trace-driven wins and HUBO hurts versus the corresponding ZS baseline. The bolded row is the best configuration.

Three observations can be drawn from Table [11](https://arxiv.org/html/2606.06941#S6.T11). First, the method-components ablation: switching on the five components of EP-HUBO with Sonnet held fixed raises HUBO precision from 66.7% to 75.0%. Second, adjudicator strength: with the EP-HUBO method components included, replacing Sonnet with Opus pushes HUBO precision to 80.0%, the highest of any configuration in this ablation. Third, the cheap-adjudicator configuration using Haiku reaches only 64% accuracy, below both the Sonnet and Opus configurations and below the ZS Sonnet and ZS Opus baselines; nevertheless, EP-HUBO with Haiku gains +4 pp over its own ZS baseline at HUBO precision 60.0%. Thus, adjudicator pre-training capacity sets a ceiling on what HUBO evidence can recover. All four EP-HUBO configurations clear the $\mathrm{prec}_H > 1/2$ threshold at which HUBO wins exceed HUBO hurts in expectation. The ablation study suggests that HUBO evidence is more useful when adjudicator models are stronger, which mirrors the stronger gains over ZS we see on the full benchmarks with the larger frontier model Opus.
#### LEXam companion ablation\.
Table[12](https://arxiv.org/html/2606.06941#S6.T12)repeats the same ablation grid on the first 25 questions of LEXam\-mcq\-8\-en, using the identical Qwen3\.5\-9B trace generator and 100 traces per question\.
Table 12: Ablation study on the first 25 questions of LEXam-mcq-8-en (English), using as trace generator the same locally run Qwen3.5-9B with 100 traces per question. Rows, columns and conventions mirror Table [11](https://arxiv.org/html/2606.06941#S6.T11). The bolded row is the best configuration.

Table 13: Answer distribution on the 25-question LEXam ablation subset, rounded to integers. Sonnet's very strong 'E' bias is not supported by the ground truth, in which 12% of questions have 'E' as the true answer.

The LEXam ablation reproduces findings similar to those on MMLU-Pro, with the exception of the 'E' position bias in the LEXam baseline LLM results. The gold-answer distribution on the LEXam 25-question subset is shown in Table [13](https://arxiv.org/html/2606.06941#S6.T13). Adjudicator LLM strength is again the dominant factor: Opus reaches 72.0% accuracy, +8 pp above its ZS baseline (64.0%) and +36 pp above the MV baseline; Haiku reaches 44.0%, +4 pp above ZS Haiku; Sonnet is at 28.0%, well below ZS Opus and below MV, again due to the 'E' answer position bias. HUBO precision with Opus is 75.0%; Opus is the most accurate configuration in the table, but this precision is below the 80.0% that Opus achieves on MMLU-Pro, mirroring the smaller headline gain over ZS (+8 pp on LEXam vs. +12 pp on MMLU-Pro). The method-components contrast is muted under Sonnet: EP-HUBO-base and EP-HUBO (full) both yield 28.0%, because zero-shot Sonnet's answer distribution on this subset is dominated by a position bias (20 of its 25 answers are choice 'E'), so the full method's refinements do not flip its selection on the same questions. EP-HUBO with Sonnet does, however, recover three HUBO wins over ZS Sonnet at HUBO precision 75%, indicating that the evidence pools carry signal even when overall accuracy is constrained by the adjudicator's idiosyncrasies.
All four EP-HUBO configurations clear the $\mathrm{prec}_H > 1/2$ threshold on LEXam as well.
## 7 Conclusion and Discussion
We introduced EP-HUBO, an evidence-pooling higher-order binary optimisation approach for reasoning trace selection in evidence-intensive legal domains. EP-HUBO uses a local smaller model to generate multiple chain-of-thought traces, organises fragments into answer-specific evidence pools, and applies a HUBO optimisation procedure to select a compact, coherent, and distinctive subset of evidence for each option before delegating final adjudication to a frontier model. By deriving fragment weights from relevance, specificity, and distinctiveness instead of co-occurrence frequency, the method decouples evidence selection from majority-vote signal, enabling minority but well-supported hypotheses to override noisy majorities. Beyond accuracy, EP-HUBO exposes and partially mitigates frontier idiosyncrasies. On LEXam, zero-shot Claude Sonnet 4.6 exhibits a position bias, selecting choice 'E' on 87.7% of questions; EP-HUBO-selected evidence reduces this bias and yields up to +20.2 pp gain over zero-shot Sonnet at HUBO precision 92.0% using Qwen-35B reasoning traces.
Against the small-model baseline of majority vote (MV), EP-HUBO is very effective: +12.6 pp on MMLU-Pro law and up to +23.2 pp on LEXam with Opus as adjudicator. Against the frontier baseline (ZS), EP-HUBO delivers measurable gains on both legal benchmarks: +1.5 pp on MMLU-Pro law and up to +5.1 pp on LEXam with Opus (HUBO precision 56.7% on MMLU-Pro law with Qwen-35B traces; on LEXam, 68.8% with Qwen-35B traces and 64.5% with OSS-20B traces). The gains are larger on LEXam, where the frontier model does not already saturate accuracy; the contamination differential between MMLU-Pro and LEXam (ZS Opus 74.2% vs. 66.8%) bounds the improvement available to any trace-augmented pipeline. Quantum-combinatorial optimisation thus appears most valuable in settings where frontier models have not already absorbed the benchmark material, namely low-contamination domains and specialised legal or regulatory subfields.
We also report results in which Phase 3 is solved on the Dirac-3 photonic quantum machine instead of by simulated annealing on classical computers. On LEXam, the quantum solver matches or modestly exceeds classical SA: the strongest quantum configuration in absolute accuracy is Qwen-35B + Opus at 72.9%, +1.0 pp above the matched classical run and +6.2 pp above the ZS Opus baseline. With Sonnet adjudication, Qwen-35B + Dirac-3 on LEXam attains a HUBO precision of 91.8%, the highest precision of any quantum configuration we evaluate and essentially tied with the classical-SA result on the same configuration. On MMLU-Pro law, by contrast, the quantum solver underperforms classical SA by 5–8 percentage points across both adjudicators and the Dirac-3 configurations fall below their respective ZS baselines, although they still substantially exceed majority vote. A likely contributing factor is Dirac-3's 135-variable limit for third-order objectives: the dataset pools frequently exceed this threshold, requiring truncation of the fragment interaction graph prior to hardware submission, whereas the classical SA solver operates on the full untruncated HUBO.
One surprising result from this work was the frontier LLM's position bias on LEXam, a reminder that even frontier models can exhibit priors that produce catastrophic failures on out-of-distribution datasets and tasks. EP-HUBO partially mitigates such priors but does not eliminate them. In regulated-sector deployments, therefore, transparency and audit logs of evidence, as well as human-in-the-loop review, remain important safeguards.
Several directions remain open\. On the evidence side, our implementation uses single\-LLM trace pools; extending EP\-HUBO to multi\-model trace mixtures is an interesting avenue for further study\. Another direction of interest is to test EP\-HUBO on free\-form question answering, code generation, and mathematical reasoning, domains which also admit finite hypothesis spaces by clustering the model’s outputs into canonical equivalence classes\. Regarding the reasoning fragment scoring, stepwise interactive selection could improve the feature space for the HUBO optimisation problem\. Finally, the current formulation uses up to third\-order interactions among reasoning fragments; Dirac\-3’s native support for higher\-order polynomial objectives makes it a natural platform for exploring fourth\- and fifth\-order HUBO terms, which could capture more complex fragment interdependencies and potentially yield further accuracy gains\.
## Acknowledgments
The authors would like to thank Quantum Computing Inc\. for their in\-kind contributions\.
## References
- \[1\] C\. Flores\-Garrigos, G\. Dev, M\. Falkenthal, A\. Gomez Cadavid, A\. Simen, S\. Kumar, E\. Solano, and N\. N\. Hegade, “Quantum Combinatorial Reasoning for Large Language Models,” arXiv preprint arXiv:2510\.24509, 2025\.
- \[2\] X\. Wang, J\. Wei, D\. Schuurmans, Q\. Le, E\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou, “Self\-Consistency Improves Chain of Thought Reasoning in Language Models,” arXiv preprint arXiv:2203\.11171, 2022\.
- \[3\] L\. Nguyen, M\.\-A\. Miri, R\. J\. Rupert, W\. Dyk, S\. Wu, N\. Vrahoretis, I\. Huang, M\. Begliarbekov, N\. Chancellor, U\. Chukwu, P\. Mahamuni, C\. Martinez\-Delgado, D\. Haycraft, C\. Spear, J\. R\. Huffman, Y\. M\. Sua, and Y\.\-P\. Huang, “Entropy computing, a paradigm for optimization in open photonic systems,” Communications Physics, vol\. 8, article 411, 2025\. doi: 10\.1038/s42005\-025\-02324\-6\.
- \[4\] A\. Feng, M\. Alonso, and A\. Odonnat, “Optimal Self\-Consistency for Efficient Reasoning with Large Language Models,” arXiv preprint arXiv:2511\.12309, 2025\.
- \[5\] Z\. Kang, X\. Zhao, and D\. Song, “Scalable Best\-of\-N Selection for Large Language Models via Self\-Certainty,” arXiv preprint arXiv:2502\.18581, 2025\.
- \[6\] DeepSeek\-AI, “DeepSeek\-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning,” arXiv preprint arXiv:2501\.12948, 2025\.
- \[7\] W\. Yang, S\. Ma, Y\. Lin, and F\. Wei, “Towards Thinking\-Optimal Scaling of Test\-Time Compute for LLM Reasoning,” arXiv preprint arXiv:2502\.18080, 2025\.
- \[8\] J\. Wang, J\. Wang, B\. Athiwaratkun, C\. Zhang, and J\. Zou, “Mixture\-of\-Agents Enhances Large Language Model Capabilities,” arXiv preprint arXiv:2406\.04692, 2024\.
- \[9\] W\. Li, Y\. Lin, M\. Xia, and C\. Jin, “Rethinking Mixture\-of\-Agents: Is Mixing Different Large Language Models Beneficial?” arXiv preprint arXiv:2502\.00674, 2025\.
- \[10\] M\. Ashiga, W\. Jie, F\. Wu, V\. Voskanyan, F\. Dinmohammadi, P\. Brookes, J\. Gong, and Z\. Wang, “Ensemble Learning for Large Language Models in Text and Code Generation: A Survey,” arXiv preprint arXiv:2503\.13505, 2025\.
- \[11\] M\. Esencan, T\. A\. Kumar, A\. A\. Asanjan, P\. A\. Lott, M\. Mohseni, C\. Unlu, D\. Venturelli, and A\. Ho, “Combinatorial Reasoning: Selecting Reasons in Generative AI Pipelines via Combinatorial Optimization,” arXiv preprint arXiv:2407\.00071, 2024\.
- \[12\] H\. Zhang, M\. Emu, and S\. Choudhury, “LLM\-QUBO: An End\-to\-End Framework for Automated QUBO Transformation from Natural Language Problem Descriptions,” arXiv preprint arXiv:2509\.00099, 2025\.
- \[13\] C\. Pomeroy, A\. Pramov, K\. Thakrar, and L\. Yendapalli, “Quantum Annealing for Machine Learning: Applications in Feature Selection, Instance Selection, and Clustering,” arXiv preprint arXiv:2507\.15063, 2025\.
- \[14\] F\. Nausheen, K\. Ahmed, M\. I\. Khan, and F\. Riaz, “Quantum Natural Language Processing: A Comprehensive Review of Models, Methods, and Applications,” arXiv preprint arXiv:2504\.09909, 2025\.
- \[15\] G\. Srivastava, S\. Cao, and X\. Wang, “Towards Reasoning Ability of Small Language Models,” arXiv preprint arXiv:2502\.11569, 2025\.
- \[16\] L\. Stuhlmann, M\. F\. Argerich, and J\. Fürst, “Bench360: Benchmarking Local LLM Inference from 360 Degrees,” arXiv preprint arXiv:2511\.16682, 2025\.
- \[17\] Y\. Fan, J\. Ni, J\. Merane, Y\. Tian, Y\. Hermstrüwer, et al\., “LEXam: Benchmarking Legal Reasoning on 340 Law Exams,” arXiv preprint arXiv:2505\.12864, 2025\. Dataset: [https://huggingface\.co/datasets/LEXam\-Benchmark/LEXam](https://huggingface.co/datasets/LEXam-Benchmark/LEXam)\.
- \[18\] Y\. Wang, X\. Ma, G\. Zhang, Y\. Ni, A\. Chandra, S\. Guo, A\. Ren, A\. Arulraj, X\. He, Z\. Jiang, T\. Li, M\. Ku, K\. Wang, A\. Zhuang, R\. Fan, X\. Yue, and W\. Chen, “MMLU\-Pro: A More Robust and Challenging Multi\-Task Language Understanding Benchmark,” arXiv preprint arXiv:2406\.01574, 2024\.
- \[19\] J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou, “Chain\-of\-Thought Prompting Elicits Reasoning in Large Language Models,” arXiv preprint arXiv:2201\.11903, 2022\.
- \[20\] S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. L\. Griffiths, Y\. Cao, and K\. Narasimhan, “Tree of Thoughts: Deliberate Problem Solving with Large Language Models,” Advances in Neural Information Processing Systems \(NeurIPS\), 2023\.
- \[21\] W\. Saunders, C\. Yeh, J\. Wu, S\. Bills, L\. Ouyang, J\. Ward, and J\. Leike, “Self\-critiquing models for assisting human evaluators,” arXiv preprint arXiv:2206\.05802, 2022\.
- \[22\] S\. Welleck, X\. Lu, P\. West, F\. Brahman, T\. Shen, D\. Khashabi, and Y\. Choi, “Generating Sequences by Learning to Self\-Correct,” International Conference on Learning Representations \(ICLR\), 2023\.
- \[23\] D\. Roth and W\. Yih, “A Linear Programming Formulation for Global Inference in Natural Language Tasks,” Proc\. CoNLL, 2004\.
- \[24\] S\. Riedel and J\. Clarke, “Incremental Integer Linear Programming for Non\-projective Dependency Parsing,” Proc\. EMNLP, 2006\.
- \[25\] G\. Poesia, A\. Polozov, V\. Le, A\. Tiwari, G\. Soares, C\. Meek, and S\. Gulwani, “Synchromesh: Reliable Code Generation from Pre\-trained Language Models,” International Conference on Learning Representations \(ICLR\), 2022\.
- \[26\] V\. Niculae, A\. F\. T\. Martins, M\. Blondel, and C\. Cardie, “SparseMAP: Differentiable Sparse Structured Inference,” International Conference on Machine Learning \(ICML\), 2018\.
- \[27\] E\. Boros and A\. Gruber, “On Quadratization of Pseudo\-Boolean Functions,” International Symposium on Artificial Intelligence and Mathematics, 2014\.
- \[28\] O\. Sainz, J\. A\. Campos, I\. García\-Ferrero, J\. Etxaniz, O\. Lopez de Lacalle, and E\. Agirre, “NLP Evaluation in Trouble: On the Need to Measure LLM Data Contamination for each Benchmark,” Findings of EMNLP, 2023\.
- \[29\] I\. Magar and R\. Schwartz, “Data Contamination: From Memorization to Exploitation,” Proc\. ACL, 2022\.
- \[30\] N\. Carlini, D\. Ippolito, M\. Jagielski, K\. Lee, F\. Tramer, and C\. Zhang, “Quantifying Memorization Across Neural Language Models,” International Conference on Learning Representations \(ICLR\), 2023\.
- \[31\] P\. Liang, R\. Bommasani, T\. Lee, et al\., “Holistic Evaluation of Language Models,” Transactions on Machine Learning Research, 2023\. arXiv preprint arXiv:2211\.09110\.
- \[32\] Y\. Perlitz, E\. Bandel, A\. Gera, O\. Arviv, L\. Ein\-Dor, E\. Shnarch, N\. Slonim, M\. Shmueli\-Scheuer, and L\. Choshen, “Efficient Benchmarking of Language Models,” arXiv preprint arXiv:2308\.11696, 2023\.
- \[33\] R\. Dror, G\. Baumer, S\. Shlomov, and R\. Reichart, “The Hitchhiker’s Guide to Testing Statistical Significance in Natural Language Processing,” Proc\. ACL, 2018\.
- \[34\] B\. Hajek, “Cooling Schedules for Optimal Annealing,” Mathematics of Operations Research, vol\. 13, no\. 2, pp\. 311–329, 1988\.
- \[35\] S\. Geman and D\. Geman, “Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol\. 6, no\. 6, pp\. 721–741, 1984\.
## Appendix A Prompt Templates
We reproduce the prompt templates used in each phase\. Variable placeholders are written in <angle\_brackets\>\. All prompts are released verbatim with the code repository; the templates here are abbreviated for space\.
#### Phase 1 \(trace generation\)\.
Each trace is generated with the following user\-message template, preceded by a system message that pins the local model’s role\.
```
SYSTEM:
You are a careful reasoner for multiple-choice questions.
For each question, list 3-5 key facts, then state your final answer
on a line of the form: FINAL ANSWER: <letter>.
USER:
Question: <q.text>
Options:
A) <q.choices[0]>
...
L) <q.choices[L-1]>
Reason carefully, then output FINAL ANSWER: <letter>.
```
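As an illustration of how the `FINAL ANSWER: <letter>` line in this template could be consumed downstream, the sketch below extracts the answer letter from each trace and computes the majority\-vote baseline\. The regular expression and the function names are assumptions made for this example, not the parser from the released code\.

```python
import re
from collections import Counter
from typing import List, Optional

# Accepts "FINAL ANSWER: B" and "FINAL ANSWER: (b)"; the trailing \b avoids
# matching the first letter of a word such as "Because".
FINAL_RE = re.compile(r"FINAL ANSWER:\s*\(?([A-Z])\b", re.IGNORECASE)


def extract_answer(trace: str) -> Optional[str]:
    """Return the last FINAL ANSWER letter in a trace, or None if absent."""
    matches = FINAL_RE.findall(trace)
    return matches[-1].upper() if matches else None


def majority_vote(traces: List[str]) -> Optional[str]:
    """Return the most frequent extracted answer, ignoring unparsable traces."""
    votes = Counter(
        answer for answer in map(extract_answer, traces) if answer is not None
    )
    if not votes:
        return None
    # Counter.most_common keeps first-seen order among tied answers.
    return votes.most_common(1)[0][0]
```

Taking the last match rather than the first is a design choice: a trace may restate the instruction before committing to an answer\.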
#### Phase 2 \(1\-body fragment scoring\)\.
The local model is queried with the question and a single candidate fragment; the output is a JSON object with three real\-valued scores in $[0,1]$\.
```
SYSTEM:
You score a single candidate evidence fragment along three
dimensions. Output strict JSON with keys "relevance",
"specificity", and "distinctiveness" each in [0,1].
USER:
Question: <q.text>
Candidate evidence: <f.text>
Rate relevance, specificity, and distinctiveness.
```
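A reply to this prompt has to be validated before its scores can enter the optimisation\. The sketch below shows one way to do that, assuming the reply may contain stray text around the JSON object and that out\-of\-range values are clamped to $[0,1]$; both behaviours, and the name `parse_scores`, are assumptions for illustration\.

```python
import json
from typing import Dict

SCORE_KEYS = ("relevance", "specificity", "distinctiveness")


def parse_scores(raw: str) -> Dict[str, float]:
    """Parse the scorer's reply; raise ValueError if it is malformed."""
    # Tolerate text before or after the JSON object (assumed behaviour).
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in reply")
    # json.JSONDecodeError is a subclass of ValueError.
    obj = json.loads(raw[start : end + 1])
    scores: Dict[str, float] = {}
    for key in SCORE_KEYS:
        if key not in obj:
            raise ValueError(f"missing key: {key}")
        try:
            value = float(obj[key])
        except (TypeError, ValueError) as exc:
            raise ValueError(f"non-numeric score for {key}") from exc
        # Clamp into [0, 1] rather than rejecting (assumed policy).
        scores[key] = min(1.0, max(0.0, value))
    return scores
```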
#### Phase 2 \(pairwise and triplet scoring\)\.
Pairwise prompts request `support` and `contradict` scores for two fragments simultaneously; triplet prompts request `coherent` and `undermine` scores for three\. Both follow the same JSON\-output pattern\.
#### Phase 4 \(frontier adjudication\)\.
The frontier model receives all selected evidence labelled by its candidate\-label pool\.
```
SYSTEM:
You are an adjudicator for multiple-choice questions in <domain>.
You will be given the question, the answer options, and a small set
of evidence fragments grouped by which option each fragment supports.
Choose the single option most strongly supported by the evidence.
Reply with the answer letter only, in parentheses, e.g. (A).
USER:
Question: <q.text>
Options:
A) <q.choices[0]>
...
=== Evidence supporting (A) ===
1. <fragment_A_1>
...
=== Evidence supporting (B) ===
...
Which answer is most strongly supported?
```
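The user message in this template can be assembled mechanically from the selected evidence pools\. The sketch below is a minimal example under stated assumptions: options with an empty pool get no evidence block, and the reply is parsed by looking for a parenthesised letter\. The function names are hypothetical\.

```python
import re
from typing import Dict, List, Optional


def build_user_message(
    question: str, choices: List[str], evidence: Dict[str, List[str]]
) -> str:
    """Render the Phase 4 user message from per-option evidence pools."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    lines = [f"Question: {question}", "Options:"]
    lines += [f"{letter}) {text}" for letter, text in zip(letters, choices)]
    for letter in letters:
        fragments = evidence.get(letter, [])
        if not fragments:
            continue  # assumed: no block for options without selected evidence
        lines.append(f"=== Evidence supporting ({letter}) ===")
        lines += [f"{k}. {frag}" for k, frag in enumerate(fragments, start=1)]
    lines.append("Which answer is most strongly supported?")
    return "\n".join(lines)


def parse_reply(reply: str, n_choices: int) -> Optional[str]:
    """Return the letter in a reply such as "(A)", or None if invalid."""
    match = re.search(r"\(([A-Z])\)", reply)
    if match is None:
        return None
    letter = match.group(1)
    return letter if ord(letter) - ord("A") < n_choices else None
```

Passing an empty `evidence` dictionary yields a message with only the question and options, which corresponds to the shape of the zero\-shot baseline input\.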
#### Phase 4 \(zero\-shot baseline\)\.
The ZS baseline uses the same Phase 4 system message with the evidence blocks omitted \(the user message contains only the question and options\)\.
## Appendix B Additional Details
Hyperparameters used throughout the paper are listed in Table [14](https://arxiv.org/html/2606.06941#A2.T14)\.
Table 14: Hyperparameters\. Category\-specific $(\alpha,\beta,\gamma,\lambda_{\text{supp}},\lambda_{\text{contra}},\lambda_{\text{coh}})$ values are in Table [2](https://arxiv.org/html/2606.06941#S3.T2)\.
#### Proof of Theorem 4\.4
###### Proof\.
Each disagreement is a Bernoulli$(\mathrm{prec}_{H})$ trial: either HUBO wins or HUBO hurts\. Hoeffding’s inequality applied to the empirical mean of $n$ such trials yields Eq\. [9](https://arxiv.org/html/2606.06941#S4.E9)\. ∎
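For a sense of scale, the standard two\-sided Hoeffding bound for the mean of $n$ independent variables in $[0,1]$, $\Pr(|\hat{p}-p|\geq\varepsilon)\leq 2\exp(-2n\varepsilon^{2})$, gives a half\-width $\varepsilon=\sqrt{\ln(2/\delta)/(2n)}$ at confidence $1-\delta$\. The sketch below evaluates that half\-width; Eq\. 9 is not reproduced in this appendix, so whether it uses this two\-sided form or a one\-sided variant is an assumption here, and the function name is hypothetical\.

```python
import math


def hoeffding_half_width(n: int, delta: float = 0.05) -> float:
    """Half-width eps such that |p_hat - p| <= eps with probability >= 1 - delta.

    Uses the two-sided Hoeffding bound for n independent [0, 1] trials.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(math.log(2.0 / delta) / (2.0 * n))
```

Because the half\-width scales as $1/\sqrt{n}$, quadrupling the number of disagreements halves the uncertainty on the estimated HUBO precision\.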