@dair_ai: NEW paper worth reading. GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, mat…
Summary
A new paper shows that using a weak model with k=8 proposals and a critic-comparator selection loop can match frontier model performance on SWE-bench Verified, reaching 76.4% accuracy. The key insight is that correct patches are often already present in a weak model's top-k candidates, and the challenge is effective selection using execution verification.
View Cached Full Text
Cached at: 05/18/26, 10:38 PM
NEW paper worth reading.
GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking.
The trick is to select from k=8 weak-model proposals using execution and proof signals.
What does this mean?
Many of the patches you’d expect from a frontier model are already inside a weak model’s top-8 candidates.
When you have 8 candidate patches from a weak model, don’t ask the model which is best. Run them and verify them. That’s enough to match a frontier model’s accuracy.
The takeaway for AI devs: a weak model’s top-k often already contains the right answer. What limits you is the quality of your selector, not the capability of the model.
Paper: https://arxiv.org/abs/2605.14163
Learn to build effective AI agents in our academy: https://academy.dair.ai
Agentic Systems as Boosting Weak Reasoning Models
Source: https://arxiv.org/html/2605.14163 Varun Sunkaraneni Texas A&M University &Pierfrancesco Beneventano∗ MIT &Riccardo Neumarker MIT &Tomaso Poggio MIT &Tomer Galanti Texas A&M University Equal contribution. Random coin flip determined author order between the first two authors.Corresponding author: [email protected].
Abstract
Can a committee of weak reasoning-model calls reach the performance of much stronger models? We study verifier-backed committee search as inference-time boosting for reasoning language models. The mechanism is not simply that “more agents help”: samples expose latent correct solutions, while critics and comparators must recover them without access to the hidden verifier. We formalize this view by separating proposal coverage, local identifiability, progress, and diversity. We prove that coverage can be amplified by repeated sampling, but cannot by itself create useful critics or comparators; reliable amplification requires an additional local soundness signal, such as execution, proof checking, type checking, tests, or constraint solving. We give rank-based bounds showing when local selection errors compose into reliable trajectories, and characterize the proposer-side ceiling: oracle best-of-kkconverges only to the mass of task slices on which the proposal system assigns nonzero useful probability. Empirically, on SWE-bench Verified, a singleGPT-5.4 nanoproposal solves67.0%67.0\%of tasks. Using the same nano model, our critic–comparator orchestration reaches76.4%76.4\%withk=8k=8proposals, matching the standalone performance ofGemini 3 ProandClaude Opus 4.5Thinking and approaching the79.0%79.0\%oracle best-of-88upper bound. Thus, many correct patches are already present in weak-model proposal pools; the main challenge is selecting them. The remaining failures are mostly proposal-coverage failures, indicating shared blind spots that stronger selection alone cannot close.
1Introduction
Boosting turns weak predictors into strong predictors by repeatedly combining imperfect but useful signals[45,17,18]. Modern language-model systems use a related idea at inference time: they sample several candidates, check or compare them, search over partial states, and select a final output[52,34,3,6,25,60]. However, reasoning is not ordinary supervised boosting. In supervised prediction, each weak learner returns a label that can be evaluated against training examples. In reasoning, the system must instead generate an intermediate move, decide whether that move is useful, and avoid letting small local errors accumulate into a wrong final answer.
We study this mechanism for verifier-backed reasoning tasks such as code repair, theorem proving, and program synthesis. These domains provide tests, proof checkers, type checkers, execution, or constraint solvers that can supply local soundness signals[13,38,28,59,63,57]. We model agentic systems asinference-time boostingfor reasoning language models: repeated weak proposals increase the chance of producing a useful next move, critics or comparators help identify that move, and verifier-backed progress allows useful moves to be chained into a terminal solution.
The analysis separates four quantities: proposal coverage, local selection signal or identifiability, progress, and diversity. Coverage asks whether a good move appears; identifiability asks whether the system can recognize it; progress makes local choices compose; diversity determines whether more calls escape different failure modes.
Figure 1:A committee ofGPT-5.4 nanocalls reaches much stronger models.Increasing proposer diversity lifts nano orchestration far above the nano baseline and up toGemini 3 ProandClaude Opus 4.5Thinking. The oracle best-of-nncurve shows that correct solutions are often already in the proposal pool; the remaining gap is selection. Dashed lines denote single-model resolve rates.This separation is essential. Sampling more candidates can increase the chance that a useful move appears, but sampling alone does not explain how the system recognizes that move. Final-answer verification is also not enough: for a multi-step task, the system needs intermediate states where progress can be generated, checked, and safely composed. On the proposer side, more calls reduce ordinary sampling noise, but they cannot fix shared blind spots. If all proposers assign near-zero probability to the useful moves needed for a particular type of instance, then even an ideal critic cannot recover those moves from the sample pool. Thus best-of-kkwith an oracle critic measures an upper bound on what inference-time selection can recover from the proposed candidates, not the full capability of an ideal reasoner.
This viewpoint also changes how such systems should be evaluated. Pass@1 measures one-shot generation. Oracle best-of-kkmeasures whether a correct solution appears anywhere in the sampled candidate pool. Implemented system success measures how much of this oracle gap is recovered by a finite critic, comparator, verifier, or search harness. Budgeted success curves measure how quickly the harness approaches its best achievable performance as the number of calls, checks, or search steps increases.
Running example.Consider SWE-bench Verified[28]. The system receives a repository, an issue description, and visible tests, but success is measured by hidden tests. A single patch may choose the wrong design, miss a cross-file dependency, or overfit visible tests. A diverse nano pool may already contain a hidden-test-passing patch, but this latent capability matters only if the harness can recover it. Figure1shows this effect:GPT-5.4 nanostarts at67.0%67.0\%, while our reviewer-comparator harness reaches76.4%76.4\%, matchingGemini 3 ProandClaude Opus 4.5Thinking and exceedingGPT-5.4 mini. The oracle best-of-kkcurve reaches79.0%79.0\%, showing that correct patches are often already in the nano proposal pool. Thus, the harness-oracle gap diagnoses selection, while the oracle-stronger-model gap reflects remaining generation and shared-blind-spot limitations.
Contributions.
This paper makes five contributions.
- (i)Inference-time boosting for reasoning language models.We formalize verifier-backed sample–identify–advance systems over partial reasoning states as inference-time boosting of LLMs. We isolate four amplification quantities: proposal coverage, local identifiability, progress depth, and diversity.
- (ii)Coverage does not imply identifiability.We prove a black-box separation: nontrivial probability of generating progressing-sound moves does not by itself yield a useful critic or comparator. Reliable amplification requires an additional local identifiability signal, such as execution, proof checking, type checking, constraints, tests, or a learned reviewer.
- (iii)Local-to-global oracle–identifiability bounds.We decompose failure into an oracle miss term, asking whether any of thekkproposals contains a progressing-sound move, and an identifiability miss term, asking whether finite critics/comparators recover one. Along rank-bounded trajectories these errors add:Pr(failure)≤L(εorc(k)+εid(k,m,r))\Pr(\mathrm{failure})\leq L(\varepsilon_{\mathrm{orc}}(k)+\varepsilon_{\mathrm{id}}(k,m,r)), withεid(k,m,r)≲k2e−βm−2rσ2\varepsilon_{\mathrm{id}}(k,m,r)\lesssim k^{2}e^{-\beta m-2r\sigma^{2}}.
- (iv)Blind-spot ceilings and boostable capability.We characterize the oracle miss term under a latent subpopulation model asεorc(k)=B+o(k)\varepsilon_{\mathrm{orc}}(k)=B+o(k), whereBBis blind-spot mass ando(k)o(k)is finite-sampling residual. Hence oracle best-of-kkconverges to1−B1-B, giving a formal boostable-capability ceiling for a proposal system.
- (v)Weak-to-frontier empirical amplification.On SWE-bench Verified, critic–comparator orchestration liftsGPT-5.4 nanofrom weak one-shot performance to the level of substantially stronger standalone models. Ablations show the mechanism: diversity exposes latent correct patches, critics filter flawed candidates, comparators rank plausible alternatives, and remaining failures are mostly proposal-coverage failures.
2Related Work
Boosting and inference-time amplification.Classical boosting converts weak predictive edges into strong predictors under supervised feedback[45,17,18]. The analogy is useful but incomplete for verifier-backed reasoning: the system must generate a useful local move, identify it, and repeat this without losing progress. Prior work studies LLMs as weak learners, weak-to-strong generalization, and boosting-style uses of language models[41,5,10]. We instead study the black-box inference-time regime, where model weights are fixed and the question is when repeated calls plus local selection amplify reasoning.
Sampling and test-time scaling.Many inference-time methods improve performance by sampling more candidates. Self-consistency, rationale ensembles, universal self-consistency, multi-agent voting, and large-scale repeated sampling all exploit the fact that useful answers may appear away from the greedy path[52,53,9,34,3]. Recent work studies scaling laws for more LM calls, best-of-NNselection, and broader test-time scaling[6,25,62]. Our contribution is structural: sampling helps only when it exposes useful candidates, and its ceiling is set by shared blind spots.
Selection, verifiers, and judges.A second line studies how to choose among generated candidates. Learned verifiers, process reward models, pairwise rankers, multi-verifier systems, reward-model benchmarks, and tool-backed checking show that selection can be as important as generation[13,39,51,27,37,43,64,31,19,14]. These works motivate our coverage–identifiability separation. A correct candidate in the sample pool is useful only if the harness has a critic, comparator, verifier, or test signal strong enough to recover it.
Search, agents, and compound systems.Inference-time reasoning often proceeds over partial states rather than flat answers. Least-to-most prompting, Tree of Thoughts, RAP, LATS, ReAct, Reflexion, and Self-Refine use decomposition, search, feedback, planning, or tool interaction[66,60,22,65,61,48,40]. Multi-agent and compound-system frameworks such as CAMEL, AutoGen, MetaGPT, Mixture-of-Agents, Archon, and Smoothie explore larger orchestration spaces[33,56,23,50,44,21]. We isolate one mechanism inside this design space: verifier-backed committee search over bounded-depth trajectories.
Verifier-backed benchmarks and oversight.Code and formal-reasoning benchmarks make this mechanism concrete because candidates can often be checked by tests, execution, types, proof checkers, or other local signals. AlphaCode, Codex, SWE-bench, SWE-agent, AutoCodeRover, and Agentless all use some combination of sampling, localization, repair, validation, and tool-backed selection[35,7,28,59,63,57]. Debate, prover-verifier games, and scalable oversight also decompose hard judgments into simpler checks or comparisons, but usually study argument evaluation, supervision, or strategic interaction rather than verifier-backed local actions[26,12,16,36,4,1,30,29]. Our setting is narrower and more mechanistic: local moves, local signals, bounded progress, and measurable blind spots.
3Verifier-Backed Committee Search
We model verifier-backed agent systems as bounded-depth search over partial objects with local progress. Let𝒳\mathcal{X}denote a family of tasks, such as SWE-bench, and fix a task instancex∈𝒳x\in\mathcal{X}, for example a particular SWE-bench problem.
Our setting is inspired by reinforcement learning and consists of states and actions. We view an agentic workflow through the sequence of states it induces, and the system stops when it reaches a terminal state. A state represents a partial reasoning object, such as an intermediate proof state or a partially written program together with its current specification.
Let𝒮x\mathcal{S}_{x}be a countable state space. LetValidx⊆𝒮x\mathrm{Valid}_{x}\subseteq\mathcal{S}_{x}be the set of valid states, where validity means that some correct completion remains possible. Lets0(x)∈Validxs_{0}(x)\in\mathrm{Valid}_{x}be the initial state. LetRx:𝒮x→{0,1}R_{x}:\mathcal{S}_{x}\to\{0,1\}be a verifier over states, and letTermx⊆Validx\mathrm{Term}_{x}\subseteq\mathrm{Valid}_{x}be the set of terminal states. Terminal states are accepted by the verifier, meaning thatRx(s)=1R_{x}(s)=1for alls∈Termxs\in\mathrm{Term}_{x}.
Definition 1(Setting: valid state system with progress).
Avalid state systemis a tuple containing𝒮x,s0(x),Validx,Termx\mathcal{S}_{x},s_{0}(x),\mathrm{Valid}_{x},\mathrm{Term}_{x}, and action sets𝒜(s)\mathcal{A}(s)where an actiona∈𝒜(s)a\in\mathcal{A}(s)leads to the next state, and a rank functiondx:Validx→{0,1,…,Lx}d_{x}:\mathrm{Valid}_{x}\to\{0,1,\ldots,L_{x}\}describing the “distance from solution” such that
dx(s)=0fors∈Termx,anddx(s)≥1fors∈Validx∖Termx.d_{x}(s)=0\text{ for }s\in\mathrm{Term}_{x},\quad\text{and}\quad d_{x}(s)\geq 1\text{ for }s\in\mathrm{Valid}_{x}\setminus\mathrm{Term}_{x}.A state is reachable if it is obtained froms0(x)s_{0}(x)by finitely many actions.
The rank certifies progress: if every chosen action decreases the rank while preserving validity, the process terminates withinLxL_{x}steps.
Definition 2(Progressing-sound actions).
At a reachable valid nonterminal statess, an actiona∈𝒜(s)a\in\mathcal{A}(s)isprogressing-soundifsa∗s^{*}_{a}obtained by usingaafromsssatisfies
sa∗∈Validxanddx(sa∗)<dx(s).s^{*}_{a}\in\mathrm{Valid}_{x}\qquad\text{and}\qquad d_{x}(s^{*}_{a})<d_{x}(s).Write the set of sound actionsSoundx(s):={a∈𝒜(s):sa∗∈Validxanddx(sa∗)<dx(s)}.\mathrm{Sound}_{x}(s):=\{a\in\mathcal{A}(s):s^{*}_{a}\in\mathrm{Valid}_{x}\text{ and }d_{x}(s^{*}_{a})<d_{x}(s)\}.
For example, in SWE-bench, the state contains the current repository worktree, the issue, and the visible tests. A progressing-sound action is a code edit that preserves some hidden-test-passing patch while reducing the remaining work, such as by fixing failing visible tests. Visible tests, types, and linters reject many unsound edits but do not certify correctness, since hidden tests may still fail.
Committee rotocolΠk,m,r\Pi_{k,m,r}.At each reachable nonterminal statess, the protocol:
- (1)sampleskkcandidate actions from the proposer harness;
- (2)appliesmmindependent critic calls per candidate and discards candidates rejected at least once;
- (3)declares local failure if no candidate survives;
- (4)otherwise selects a Copeland winner among survivors usingrrcomparator votes per pair, applies it, and repeats.
propose (kk)critique (mm)compare (rr)sts_{t}a1a_{1}a2a_{2}a3a_{3}✓×\times✓a1a_{1}st+1s_{t+1}repeatLLtimesFigure 2:One step of the committee protocol.The protocol separates generation from identification. Proposers create breadth, critics remove locally refutable errors, and comparators select among surviving candidates. The theory below shows that this architecture amplifies weak local competence only when two distinct resources are present: proposal coverage and local identifiability.
Assumption 1(Per-state local coverage).
For each problem sizeNN, there exists a proposer portfolioPNP_{N}with|PN|=poly(N)|P_{N}|=\mathrm{poly}(N)such that for every reachable valid nonterminal statessand inputxxof sizeNN, some proposer policy or promptps∈PNp_{s}\in P_{N}satisfies
α(s,ps):=ℙ[LLM(ps)outputs an action inSoundx(s)]≥α0>0.\alpha(s,p_{s}):=\mathbb{P}[\textnormal{LLM}(p_{s})\text{ outputs an action in }\mathrm{Sound}_{x}(s)]\geq\alpha_{0}>0.
Assumption 2(Efficient local identifiability).
At every valid nonterminal statess, there exist polynomial-time randomized procedures𝖢𝗋𝗂𝗍s\mathsf{Crit}_{s}and𝖢𝗈𝗆𝗉s\mathsf{Comp}_{s}and constantsβ0,σ0>0\beta_{0},\sigma_{0}>0such that:
- (i)a∈Soundx(s)⇒𝖢𝗋𝗂𝗍s(a)a\in\mathrm{Sound}_{x}(s)\Rightarrow\mathsf{Crit}_{s}(a)never emits a verified rejection;
- (ii)a∉Soundx(s)⇒ℙ[𝖢𝗋𝗂𝗍s(a)=reject]≥β0a\notin\mathrm{Sound}_{x}(s)\Rightarrow\mathbb{P}[\mathsf{Crit}_{s}(a)=\textsc{reject}]\geq\beta_{0};
- (iii)ifa∈Soundx(s)a\in\mathrm{Sound}_{x}(s)andb∉Soundx(s)b\notin\mathrm{Sound}_{x}(s), thenℙ[𝖢𝗈𝗆𝗉s(a,b)=a]≥1/2+σ0\mathbb{P}[\mathsf{Comp}_{s}(a,b)=a]\geq 1/2+\sigma_{0}.
Assumption1says that good moves can be generated. Assumption2says that bad moves can be identified or ranked below good ones. These are different capabilities.
4From Coverage to Reliable Local Steps
Assumption1is a generation condition, not a verification condition. It says that some portfolio policy can sample a progressing-sound action, but not how to recognize one. This section shows that sampling can amplify coverage but cannot manufacture critics or comparators; local identifiability must come from an additional soundness signal.
Proposition 1(Coverage does not imply local identifiability).
For everyM≥2M\geq 2, there exists a one-step task family with action set𝒜={1,…,M}\mathcal{A}=\{1,\ldots,M\}and hidden world parameterθ∈{1,…,M}\theta\in\{1,\ldots,M\}such that the proposer distribution isUnif(𝒜)\mathrm{Unif}(\mathcal{A})in every world, the progressing-sound set isSoundθ=𝒜∖{θ}\mathrm{Sound}_{\theta}=\mathcal{A}\setminus\{\theta\}, and no procedure observing only candidate actions and polynomially many samples from the proposer distribution has a uniform critic or comparator edge over all worlds.
Thus Assumption1⇒(β,σ)\Rightarrow(\beta,\sigma)fails in general. Local identifiability needs an accessible signal—proof checking, execution, type checking, constraint solving, or another certificate mechanism—that can reject or rank bad moves.
Theorem 1(Bridge theorem: coverage plus identifiability).
Under Assumption1with proposer portfolioPNP_{N}and Assumption2with edges(β0,σ0)(\beta_{0},\sigma_{0}). At every reachable valid nonterminal statess, round-robin assignment over the proposer portfolio withk≥|PN|⌈ln(1/δprop)/α0⌉k\geq|P_{N}|\lceil\ln(1/\delta_{\mathrm{prop}})/\alpha_{0}\rceilproposer calls gives
αcommittee(s)≥1−(1−α0)⌊k/|PN|⌋≥1−δprop.\alpha_{\mathrm{committee}}(s)\geq 1-(1-\alpha_{0})^{\lfloor k/|P_{N}|\rfloor}\geq 1-\delta_{\mathrm{prop}}.The critic and comparator edges remainβ(s)≥β0\beta(s)\geq\beta_{0}andσ(s)≥σ0\sigma(s)\geq\sigma_{0}.
The theorem separates two resources: portfolio calls amplify the chance that a progressing-sound action appears, while critic and comparator edges come from Assumption2. The proof is in AppendixD.
Verifier-backed instantiation.If a one-sided local verifier can reject unsound moves, it supplies Assumption2. This is stronger than final-answer verification: the decomposition must expose useful local checks.
Assumption 3(One-sided local verifier).
At every reachable valid nonterminal statess, there exists a poly-time randomized verifierVx(s,a)∈{accept,reject}V_{x}(s,a)\in\{\textsc{accept},\textsc{reject}\}such thata∈Soundx(s)a\in\mathrm{Sound}_{x}(s)impliesVx(s,a)=acceptV_{x}(s,a)=\textsc{accept}almost surely, whilea∉Soundx(s)a\notin\mathrm{Sound}_{x}(s)impliesℙ[Vx(s,a)=reject]≥1−ν\mathbb{P}[V_{x}(s,a)=\textsc{reject}]\geq 1-\nu.
Corollary 1(Verifier-backed bridge).
Under Assumptions1,2, and3holds withβ0=1−ν\beta_{0}=1-\nuandσ0=(1−ν)/2\sigma_{0}=(1-\nu)/2.
The corollary uses the verifier as the critic. For comparison, we verify both candidates, choose the unrejected candidate when exactly one is rejected, and break ties uniformly. The full argument is in AppendixD.
5Amplification Along a Trajectory
The previous section bounds the local committee error. We now show how to reduce global failure to local committee error. For an inputxx, define
errx(k,m,r):=ℙ(Rx(Πk,m,r(x))=0),\mathrm{err}_{x}(k,m,r):=\mathbb{P}\!\left(R_{x}(\Pi_{k,m,r}(x))=0\right),the probability that the full protocol fails.
At a reachable valid nonterminal statess, let
εloc(s):=ℙ(At∉Soundx(s)∣St=s),\varepsilon_{\mathrm{loc}}(s):=\mathbb{P}\!\left(A_{t}\notin\mathrm{Sound}_{x}(s)\mid S_{t}=s\right),where local failure is counted as selecting an unsound action. This is the one-step error of the committee. It has two sources:
εloc(s)≤εprop(k;s)⏟no good proposal+k2e−βm−2rσ2⏟bad proposal survives and wins.\varepsilon_{\mathrm{loc}}(s)\leq\underbrace{\varepsilon_{\mathrm{prop}}(k;s)}_{\text{no good proposal}}+\underbrace{k^{2}e^{-\beta m-2r\sigma^{2}}}_{\text{bad proposal survives and wins}}.The first term is proposal failure; the second is identification failure. If each reachable local step fails with probability at mostε\varepsilonand the trajectory depth is at mostLxL_{x}, thenerrx(k,m,r)≤Lxε\mathrm{err}_{x}(k,m,r)\leq L_{x}\varepsilon.
Lemma 1(Adaptive cumulative-error bound).
Letxxhave rank boundLxL_{x}. Suppose that along the protocol trajectory,ℙ(At∉Soundx(St)∣ℱt)≤εt\mathbb{P}(A_{t}\notin\mathrm{Sound}_{x}(S_{t})\mid\mathcal{F}_{t})\leq\varepsilon_{t}wheneverSt∈Validx∖TermxS_{t}\in\mathrm{Valid}_{x}\setminus\mathrm{Term}_{x}, with local failure counted as an invalid action. Thenerrx(k,m,r)≤∑t=0Lx−1𝔼[εt]\mathrm{err}_{x}(k,m,r)\leq\sum_{t=0}^{L_{x}-1}\mathbb{E}[\varepsilon_{t}]. In particular, ifεt≤ε\varepsilon_{t}\leq\varepsilonfor alltt, thenerrx(k,m,r)≤Lxε\mathrm{err}_{x}(k,m,r)\leq L_{x}\varepsilon.
At a fixed reachable valid nonterminal statess, letεprop(k;s)\varepsilon_{\mathrm{prop}}(k;s)be the probability, conditional on reachingss, that none of thekkproposers outputs an action inSoundx(s)\mathrm{Sound}_{x}(s). Letεloc(s)\varepsilon_{\mathrm{loc}}(s)be the conditional probability that the local committee step either declares local failure or selects an action outsideSoundx(s)\mathrm{Sound}_{x}(s).
Theorem 2(Local error decomposition).
At any reachable valid nonterminal statess, under the local role model with critic edgeβ\betaand comparator edgeσ\sigma, and under the conditional independence assumptions stated in AppendixE,
εloc(s)≤εprop(k;s)+k2(1−β)me−2rσ2≤εprop(k;s)+k2e−βm−2rσ2.\varepsilon_{\mathrm{loc}}(s)\leq\varepsilon_{\mathrm{prop}}(k;s)+k^{2}(1-\beta)^{m}e^{-2r\sigma^{2}}\leq\varepsilon_{\mathrm{prop}}(k;s)+k^{2}e^{-\beta m-2r\sigma^{2}}.If the proposer calls are conditionally independent and each succeeds with probability at leastα\alpha, thenεprop(k;s)≤(1−α)k≤e−αk\varepsilon_{\mathrm{prop}}(k;s)\leq(1-\alpha)^{k}\leq e^{-\alpha k}.
Sizing rule for global success.Letεprop(k):=supsεprop(k;s)\varepsilon_{\mathrm{prop}}(k):=\sup_{s}\varepsilon_{\mathrm{prop}}(k;s), where the supremum is over reachable valid nonterminal states. Combining the cumulative and local bounds giveserrx(k,m,r)≤Lx(εprop(k)+k2e−βm−2rσ2).\mathrm{err}_{x}(k,m,r)\leq L_{x}\bigl(\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta m-2r\sigma^{2}}\bigr).Thus, ifεprop(k)+k2e−βm−2rσ2≤δ/Lx\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta m-2r\sigma^{2}}\leq\delta/L_{x}, then the full depth-LxL_{x}protocol fails with probability at mostδ\delta.
The bound also yields a standard polynomial-resource corollary when the rank, proposal coverage, critic/comparator edges, and per-call runtimes are polynomially bounded; we state this consequence in AppendixB.
6Oracle Error and Blind-Spot Limits
Section5reduced global failure to proposal failure plus identifiability failure. This section studies the proposal term. If no progressing-sound action appears among thekkproposals, then even a perfect local identifier cannot help. The main point is that this oracle miss probability splits into an irreducible blind-spot floor and a finite-sampling residual. For extensions and additional details, see AppendixF.
Lemma 2(Local oracle miss and blind-spot floor).
Fix a reachable valid nonterminal statess. Suppose that, conditional on a latent variableZZ, thekkproposal calls are independent and each lies inSoundx(s)\mathrm{Sound}_{x}(s)with probabilityqs(Z)q_{s}(Z). Then
εprop(k;s)=𝔼[(1−qs(Z))k].\varepsilon_{\mathrm{prop}}(k;s)=\mathbb{E}[(1-q_{s}(Z))^{k}].Equivalently,
εprop(k;s)=Bs+Rk(s),\varepsilon_{\mathrm{prop}}(k;s)=B_{s}+R_{k}(s),where
Bs:=ℙ(qs(Z)=0),Rk(s):=𝔼[(1−qs(Z))k𝟏{qs(Z)>0}].B_{s}:=\mathbb{P}(q_{s}(Z)=0),\qquad R_{k}(s):=\mathbb{E}[(1-q_{s}(Z))^{k}\mathbf{1}\{q_{s}(Z)>0\}].In particular,εprop(k;s)→Bs\varepsilon_{\mathrm{prop}}(k;s)\to B_{s}ask→∞k\to\infty.
The termBsB_{s}is the local blind-spot floor: on these latent subpopulations, the proposal system assigns zero probability to any progressing-sound action. The termRk(s)R_{k}(s)is finite-sampling error on non-blind subpopulations. Thus more proposals can reduceRk(s)R_{k}(s), but they cannot reduceBsB_{s}. ReducingBsB_{s}requires changing the proposal system itself: the model, prompts, tools, retrieval, decomposition, or proposer diversity. An average coverage condition can still hide blind spots: it is possible that𝔼[qs(Z)]>0\mathbb{E}[q_{s}(Z)]>0whileqs(Z)=0q_{s}(Z)=0on a nontrivial subpopulation. The floor vanishes only under a stronger conditional coverage condition, for exampleqs(Z)≥α0q_{s}(Z)\geq\alpha_{0}almost surely. Lemma2is useful precisely because it separates average proposal success from latent subpopulations with zero proposal mass.
Let
B:=supsBs,Rk:=supsRk(s),B:=\sup_{s}B_{s},\qquad R_{k}:=\sup_{s}R_{k}(s),where the suprema range over reachable valid nonterminal states. Combining Lemma2with Theorem2and the cumulative bound gives
errx(k,m,r)≤Lx[B⏟blind spots+Rk⏟finite proposal sampling+k2e−βm−2rσ2⏟identifiability error].\mathrm{err}_{x}(k,m,r)\leq L_{x}\Big[\underbrace{B}_{\textnormal{blind spots}}+\underbrace{R_{k}}_{\textnormal{finite proposal sampling}}+\underbrace{k^{2}e^{-\beta m-2r\sigma^{2}}}_{\textnormal{identifiability error}}\Big].This is the main error decomposition. The first term is irreducible for a fixed proposal system, the second decreases with proposal width, and the third decreases with critic and comparator resources. The appendix gives the heterogeneous-portfolio version of Lemma2, finite-kkconvergence rates forRkR_{k}, and the common-shock specialization.
The same calculation also gives the task-level oracle quantities used in evaluation. The theorem above is local:qs(Z)q_{s}(Z)is the probability of proposing a progressing-sound next action at statess. For complete outputs, define
p1(P):=ℙx,Y∼P(⋅∣x)(Rx(Y)=1),poracle(k;P):=ℙ(∃i≤k:Rx(Yi)=1).p_{1}(P):=\mathbb{P}_{x,Y\sim P(\cdot\mid x)}(R_{x}(Y)=1),\qquad p_{\mathrm{oracle}}(k;P):=\mathbb{P}(\exists i\leq k:\ R_{x}(Y_{i})=1).IfqP(Z):=ℙ(Rx(Y)=1∣Z)q_{P}(Z):=\mathbb{P}(R_{x}(Y)=1\mid Z), then
poracle(k;P)=1−𝔼[(1−qP(Z))k],limk→∞poracle(k;P)=1−ℙ(qP(Z)=0).p_{\mathrm{oracle}}(k;P)=1-\mathbb{E}[(1-q_{P}(Z))^{k}],\qquad\lim_{k\to\infty}p_{\mathrm{oracle}}(k;P)=1-\mathbb{P}(q_{P}(Z)=0).Precisely, pass@1 measures one-shot success, oracle best-of-kkmeasures terminal boostable capability, and the limiting oracle curve measures the boostable ceiling of the proposal system.
For an implemented system evaluated on the same candidate pool, letpsystem(k,m,r;P)p_{\mathrm{system}}(k,m,r;P)denote its success probability and define oracle-gap recovery by
Rec(k,m,r;P):=psystem(k,m,r;P)−p1(P)poracle(k;P)−p1(P),\mathrm{Rec}(k,m,r;P):=\frac{p_{\mathrm{system}}(k,m,r;P)-p_{1}(P)}{p_{\mathrm{oracle}}(k;P)-p_{1}(P)},when the denominator is positive. This measures how much of the oracle-exposed capability is recovered by the implemented selector, rather than lost to imperfect critics, verifiers, or comparators.
This gives a role-wise benchmark decomposition. Pass@1 measures theone-shot capabilityof the proposal system. Oracle best-of-kkmeasuresthe capability that is latent in the proposal distributionunder perfect identification. The gap between oracle best-of-kkand pass@1 measureshow much there is to recover by boosting. The gap between oracle best-of-kkand the implemented system measuresthe remaining identifiability bottleneck. Thus these quantities tell us whether failures come from weak proposal coverage, shared blind spots, or insufficient local identifiability.
7Experiments
We use SWE-bench Verified[28]as a diagnostic testbed for inference-time orchestration. Beyond testing whether additional calls improve solve rate, we ask where the gains come from. In the theory’s terminology, the experiment separates proposal coverage, critic filtering, comparator ranking, and residual shared blind spots.
The design estimates three role-wise quantities. First, the oracle best-of-kkcurve measures proposal coverage: whether a correct patch appears in the generated pool. Second, the gap between deployed orchestration and the oracle measures harness recovery: how much latent pool capability the selector recovers. Third, tasks unreachable by any generated proposal estimate the remaining coverage failure, or shared-blind-spot mass, of the current proposer family.
7.1Setup
SWE-bench Verified contains500500software-engineering tasks. Each task provides a repository, an issue description, and visible tests, while success is measured by held-out hidden tests. For each task, we generate a fixed pool ofk=8k=8candidate patches using independentGPT-5.4 nanoproposer runs. All selector ablations reuse this same cached proposal pool. Thus, differences in solve rate reflect differences in selection rather than differences in generation.
The harness uses two local selection signals. First, binary critics evaluate individual patches using local evidence such as the issue, the patch, visible tests, and execution traces. A patch survives the critic gate if at leastτ\tauout of five critic votes judge it plausible. Second, pairwise comparators rank candidate patches against each other. Comparator outcomes are aggregated with a tournament rule to select a final patch.
We evaluate these signals separately and jointly. Critics-only selection tests how far coarse patch filtering can go. Comparator-only selection tests whether pairwise preferences can recover correct patches without a critic gate. The full harness tests whether filtering and ranking are complementary. To reduce presentation-order bias, each pair is compared in both patch orders. A pairwise win is counted only if both orders select the same patch after mapping the swapped response back to the original indices; disagreements and ties are treated as ties.
We also report oracle-gap recovery. Letp1p_{1}denote the single-proposal solve rate,poracle(k)p_{\mathrm{oracle}}(k)the oracle best-of-kksolve rate, andpsystem(k)p_{\mathrm{system}}(k)the solve rate of the deployed harness using the samekk-proposal pool. We defineRec(k):=psystem(k)−p1poracle(k)−p1\mathrm{Rec}(k):=\frac{p_{\mathrm{system}}(k)-p_{1}}{p_{\mathrm{oracle}}(k)-p_{1}}, whenever the denominator is positive. This measures how much of the latent proposal-pool capability exposed by oracle best-of-kkis recovered by the actual selector.
We report hidden-test solve rate over all500500tasks. We also report an oracle best-of-kkupper bound, which counts a task as solved if any of thekkgenerated patches passes the hidden tests. This oracle is not deployable, since it uses hidden-test outcomes; it serves only as a diagnostic of latent proposal coverage. Diagnostics requiring complete critic or comparator vote logs are computed on the subset of tasks with complete traces, with the denominator reported where applicable.
Full implementation details, including candidate generation, critic prompting, comparator aggregation, tie handling, and evaluation protocol, appear in AppendixC.
7.2Results

Figure 3:Scaling with the number of proposals and orchestration components.Solve rate as the proposal budgetkkincreases. The oracle best-of-kkcurve succeeds whenever any generated proposal solves the task. Full orchestration reaches76.4%76.4\%atk=8k=8, close to the79.0%79.0\%oracle upper bound, while component-only variants remain below the full system.
Figure 4:Failure decomposition by benchmark category.For each category, we decompose tasks into those solved by orchestration, those that were oracle reachable but missed by selection, and those that were oracle unreachable under the proposal budget. The small oracle-reachable-but-missed segments indicate that most remaining failures are coverage failures rather than selection failures.Proposal diversity exposes latent correct patches, but does not by itself solve selection.Figure3shows the central scaling behavior. With a single nano proposal, solve rate is substantially below the oracle best-of-kkcurve. As the proposal budget increases, the oracle rises to79.0%79.0\%, showing that many tasks already contain a hidden-test-passing patch somewhere in the nano-generated pool. This is the proposal-coverage effect: additional calls increase the probability that a correct patch is present.
However, oracle best-of-kkis not a deployable system. The deployed harness must identify the correct patch without access to hidden tests. Full orchestration reaches76.4%76.4\%atk=8k=8, recovering most of the gap between one-shot nano performance and the oracle. Thus the main empirical effect is not “more samples” alone. The additional samples expose latent capability, while the critic–comparator harness converts a large fraction of that latent capability into realized solve rate.
Critics and comparators are complementary.The component ablations show that neither selection signal is sufficient on its own. Critics-only selection improves over the single-proposal baseline but plateaus well below the full system. This suggests that binary critics are useful for rejecting clearly flawed patches, but are too coarse to reliably choose among multiple plausible candidates. Comparator-only selection is stronger, indicating that direct pairwise ranking carries substantial information about patch quality. Still, it remains below full orchestration, showing that pairwise comparison also benefits from first removing implausible candidates.
This supports the role-wise picture from the theory. Critics provide a local filtering signal: they reduce the number of obviously bad actions that can win. Comparators provide a local ranking signal: they decide among candidates that survive this coarse filter. Full orchestration works best because it composes these two forms of identifiability.
Conservative comparator aggregation matters.The pairwise comparator is noisy and can be sensitive to presentation order. We therefore query each unordered pair in both orders, once withpip_{i}shown as Patch A and once withpjp_{j}shown as Patch A, and map the second judgment back to the original patch indices. A pairwise comparison is counted as a win only when both orders agree on the same patch. If the two orders disagree, or if either order returnsTIE\mathrm{TIE}, the pairwise outcome is treated as a tie.
This conservative rule is important conceptually. It treats comparator decisions as weak local evidence rather than as ground truth. A patch receives a pairwise win only when the preference is robust to a superficial change in presentation. Thus the comparator stage is not merely a majority vote over judge outputs; it is a debiased local-identification mechanism. The strong performance of the comparator and full-orchestration curves suggests that much of the oracle gap can be recovered by stable pairwise preferences, while the remaining gap reflects cases where either no correct proposal was generated or the available local signals were insufficient to identify it.
Remaining failures are mostly coverage failures, not selection failures.Figure4decomposes full-budget failures by benchmark category. For each category, we separate tasks solved by orchestration, tasks that were oracle reachable but missed by the selector, and tasks that were oracle unreachable because none of the generated proposals passed the hidden tests. The oracle-reachable-but-missed region is relatively small compared with the oracle-unreachable region in most categories. This indicates that, after critic–comparator orchestration recovers most of the best-of-kkgain, the dominant remaining bottleneck is proposal coverage.
This is exactly the blind-spot diagnosis predicted by the theory. If no proposal in the pool is correct, no selector can recover a correct patch. Further gains therefore require either more diverse proposers, stronger proposal mechanisms, or new ways of targeting slices on which the current nano proposer family has low correct-patch probability. Stronger selectors may still help on the oracle-reachable-but-missed tasks, but they cannot close the oracle-unreachable gap.
8Discussion and Limitations
The main lesson is that weak-model failure is often not a lack of information, but a failure to select it. On SWE-bench Verified, hidden-test-passing patches often appear in a pool ofGPT-5.4 nanoproposals even when a single nano call fails. Thus the central question is not only whether weak models can generate correct solutions, but whether an inference-time harness can identify them among several imperfect candidates.
Our results show that this selection problem is substantially solvable, but not for free. Critics remove candidates with clear local defects, while comparators rank the remaining plausible patches. Together, they recover most of the oracle best-of-kkgap, turning latent proposal-pool capability into actual solve rate. This explains how a committee of weak nano calls can approach much stronger standalone models: sampling exposes correct patches, and local selection makes them usable.
The same decomposition also identifies the ceiling. When a correct patch is in the pool but the harness chooses another one, the bottleneck is identifiability: better critics, comparators, tests, or aggregation can help. When no correct patch appears, the bottleneck is proposal coverage: no selector can recover an absent solution. Our remaining failures are mostly of this second kind, so further gains likely require more diverse proposers, stronger proposal mechanisms, or targeted methods for hard slices where the current proposer family has shared blind spots. More broadly, verifier-backed orchestration is most natural when tasks provide useful local evidence, and its gains must be weighed against added model calls, verification cost, latency, and system complexity.
References
- [1](2021)Learning to give checkable answers with prover-verifier games.External Links:2108.12099,Document,LinkCited by:§2.
- [2]L. Breiman(1996)Bagging predictors.Machine Learning24(2),pp. 123–140.External Links:Document,LinkCited by:§G.1.
- [3]B. Brown, J. Juravsky, R. Ehrlich, R. Clark, Q. V. Le, C. Ré, and A. Mirhoseini(2024)Large language monkeys: scaling inference compute with repeated sampling.External Links:2407.21787,Document,LinkCited by:§G.3,§1,§2.
- [4]J. Brown-Cohen, G. Irving, and G. Piliouras(2024)Scalable AI safety via doubly-efficient debate.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol.235,pp. 4585–4602.External Links:LinkCited by:§2.
- [5]C. Burns, P. Izmailov, J. H. Kirchner, B. Baker, L. Gao, L. Aschenbrenner, Y. Chen, A. Ecoffet, M. Joglekar, J. Leike, I. Sutskever, and J. Wu(2024)Weak-to-strong generalization: eliciting strong capabilities with weak supervision.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol.235,pp. 4971–5012.External Links:LinkCited by:§G.2,§2.
- [6]L. Chen, J. Q. Davis, B. Hanin, P. Bailis, I. Stoica, M. A. Zaharia, and J. Y. Zou(2024)Are more LM calls all you need? towards the scaling properties of compound AI systems.InAdvances in Neural Information Processing Systems,Vol.37,pp. 45767–45790.External Links:Document,LinkCited by:§G.4,§1,§2.
- [7]M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y. Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Herbert-Voss, W. H. Guss, A. Nichol, A. Paino, N. Tezak, J. Tang, I. Babuschkin, S. Balaji, S. Jain, W. Saunders, C. Hesse, A. N. Carr, J. Leike, J. Achiam, V. Misra, E. Morikawa, A. Radford, M. Knight, M. Brundage, M. Murati, K. Mayer, P. Welinder, B. McGrew, D. Amodei, S. McCandlish, I. Sutskever, and W. Zaremba(2021)Evaluating large language models trained on code.External Links:2107.03374,Document,LinkCited by:§G.9,§2.
- [8]W. Chen, Y. Su, J. Zuo, C. Yang, C. Yuan, C. Chan, H. Yu, Y. Lu, Y. Hung, C. Qian, Y. Qin, X. Cong, R. Xie, Z. Liu, M. Sun, and J. Zhou(2023)AgentVerse: facilitating multi-agent collaboration and exploring emergent behaviors.External Links:2308.10848,Document,LinkCited by:§G.8.
- [9]X. Chen, R. Aksitov, U. Alon, J. Ren, K. Xiao, P. Yin, S. Prakash, C. Sutton, X. Wang, and D. Zhou(2023)Universal self-consistency for large language model generation.External Links:2311.17311,Document,LinkCited by:§G.3,§2.
- [10]Z. Chen, T. Ai, Y. Li, G. Li, Y. Wei, W. Zhou, G. Li, B. Yu, Z. Chen, H. Sun, F. Zhuang, J. Li, D. Wang, and Y. Ban(2025)LLMBoost: make large language models stronger with boosting.Note:Preprint; also submitted to ICLR 2026 on OpenReviewExternal Links:2512.22309,Document,LinkCited by:§G.2,§2.
- [11]Z. Chen, X. Lu, J. Li, P. Chen, Z. Li, K. Sun, Y. Luo, Q. Mao, M. Li, L. Xiao, D. Yang, X. Huang, Y. Ban, H. Sun, and P. S. Yu(2025)Harnessing multiple large language models: a survey on LLM ensemble.External Links:2502.18036,Document,LinkCited by:§G.4.
- [12]P. Christiano, B. Shlegeris, and D. Amodei(2018)Supervising strong learners by amplifying weak experts.External Links:1810.08575,Document,LinkCited by:§G.2,§2.
- [13]K. Cobbe, V. Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman(2021)Training verifiers to solve math word problems.External Links:2110.14168,Document,LinkCited by:§G.5,§1,§2.
- [14]S. Dhuliawala, M. Komeili, J. Xu, R. Raileanu, X. Li, A. Celikyilmaz, and J. Weston(2024-08)Chain-of-verification reduces hallucination in large language models.InFindings of the Association for Computational Linguistics: ACL 2024,Bangkok, Thailand,pp. 3563–3578.External Links:Document,LinkCited by:§G.6,§2.
- [15]T. G. Dietterich(2000)Ensemble methods in machine learning.InMultiple Classifier Systems,Lecture Notes in Computer Science, Vol.1857,pp. 1–15.External Links:Document,LinkCited by:§G.1.
- [16]Y. Du, S. Li, A. Torralba, J. B. Tenenbaum, and I. Mordatch(2024)Improving factuality and reasoning in language models through multiagent debate.InProceedings of the 41st International Conference on Machine Learning,Proceedings of Machine Learning Research, Vol.235,pp. 11733–11763.External Links:LinkCited by:§2.
- [17]Y. Freund and R. E. Schapire(1995)A decision-theoretic generalization of on-line learning and an application to boosting.InComputational Learning Theory: Second European Conference, EuroCOLT 1995,Lecture Notes in Computer Science, Vol.904,pp. 23–37.External Links:Document,LinkCited by:§G.1,§1,§2.
- [18]Y. Freund and R. E. Schapire(1997)A decision-theoretic generalization of on-line learning and an application to boosting.Journal of Computer and System Sciences55(1),pp. 119–139.External Links:Document,LinkCited by:§G.1,§1,§2.
- [19]Z. Gou, Z. Shao, Y. Gong, Y. Shen, Y. Yang, N. Duan, and W. Chen(2024)CRITIC: large language models can self-correct with tool-interactive critiquing.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.6,§2.
- [20]J. Gu, X. Jiang, Z. Shi, H. Tan, X. Zhai, C. Xu, W. Li, Y. Shen, S. Ma, H. Liu, S. Wang, K. Zhang, Y. Wang, W. Gao, L. M. Ni, and J. Guo(2024)A survey on LLM-as-a-judge.External Links:2411.15594,Document,LinkCited by:§G.10,§G.5.
- [21]N. Guha, M. F. Chen, T. Chow, I. S. Khare, and C. Ré(2024)Smoothie: label free language model routing.External Links:2412.04692,Document,LinkCited by:§G.8,§2.
- [22]S. Hao, Y. Gu, H. Ma, J. J. Hong, Z. Wang, D. Z. Wang, and Z. Hu(2023)Reasoning with language model is planning with world model.InProceedings of the 2023 Conference on Empirical Methods in Natural Language Processing,Singapore,pp. 8154–8173.External Links:Document,LinkCited by:§G.7,§2.
- [23]S. Hong, M. Zhuge, J. Chen, X. Zheng, Y. Cheng, J. Wang, C. Zhang, Z. Wang, S. K. S. Yau, Z. Lin, L. Zhou, C. Ran, L. Xiao, C. Wu, and J. Schmidhuber(2024)MetaGPT: meta programming for a multi-agent collaborative framework.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.8,§2.
- [24]A. Huang, A. Block, D. J. Foster, D. Rohatgi, C. Zhang, M. Simchowitz, and A. Krishnamurthy(2024)Self-improvement in language models: the sharpening mechanism.Note:ICLR 2025External Links:2412.01951,Document,LinkCited by:§G.2.
- [25]A. Huang, A. Block, D. Rohatgi, C. Zhang, M. Simchowitz, and A. Krishnamurthy(2025)Is best-of-N the best of them? coverage, scaling, and optimality in inference-time alignment.External Links:2503.21878,Document,LinkCited by:§G.10,§1,§2.
- [26]G. Irving, P. Christiano, and D. Amodei(2018)AI safety via debate.External Links:1805.00899,Document,LinkCited by:§2.
- [27]D. Jiang, X. Ren, and B. Y. Lin(2023-07)LLM-blender: ensembling large language models with pairwise ranking and generative fusion.InProceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),Toronto, Canada,pp. 14165–14178.External Links:Document,LinkCited by:§G.5,§2.
- [28]C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. R. Narasimhan(2024)SWE-bench: can language models resolve real-world GitHub issues?.InInternational Conference on Learning Representations,External Links:LinkCited by:§C.1,§G.9,§1,§1,§2,§7.
- [29]Z. Kenton, N. Y. Siegel, J. Kramár, J. Brown-Cohen, S. Albanie, J. Bulian, R. Agarwal, D. Lindner, Y. Tang, N. D. Goodman, and R. Shah(2024)On scalable oversight with weak LLMs judging strong LLMs.InAdvances in Neural Information Processing Systems,Vol.37.External Links:Document,LinkCited by:§2.
- [30]J. H. Kirchner, Y. Chen, H. Edwards, J. Leike, N. McAleese, and Y. Burda(2024)Prover-verifier games improve legibility of LLM outputs.External Links:2407.13692,Document,LinkCited by:§2.
- [31]N. Lambert, V. Pyatkin, J. Morrison, L. J. V. Miranda, B. Y. Lin, K. Chandu, N. Dziri, S. Kumar, T. Zick, Y. Choi, N. A. Smith, and H. Hajishirzi(2024)RewardBench: evaluating reward models for language modeling.External Links:2403.13787,Document,LinkCited by:§G.10,§G.5,§2.
- [32]J. Lee, V. Ma, S. Zhao, Y. Nair, A. Spector, R. Cohen, and E. J. Candès(2026)FUSE: ensembling verifiers with zero labeled data.External Links:2604.18547,Document,LinkCited by:§G.10,§G.5.
- [33]G. Li, H. A. A. K. Hammoud, H. Itani, D. Khizbullin, and B. Ghanem(2023)CAMEL: communicative agents for “mind” exploration of large scale language model society.External Links:2303.17760,Document,LinkCited by:§G.8,§2.
- [34]J. Li, Q. Zhang, Y. Yu, Q. Fu, and D. Ye(2024)More agents is all you need.Transactions on Machine Learning Research.External Links:LinkCited by:§G.4,§1,§2.
- [35]Y. Li, D. Choi, J. Chung, N. Kushman, J. Schrittwieser, R. Leblond, T. Eccles, J. Keeling, F. Gimeno, A. D. Lago, T. Hubert, P. Choy, C. de Masson d’Autume, I. Babuschkin, X. Chen, P. Huang, J. Welbl, S. Gowal, A. Cherepanov, J. Molloy, D. Mankowitz, E. S. Robson, P. Kohli, N. de Freitas, K. Kavukcuoglu, and O. Vinyals(2022)Competition-level code generation with AlphaCode.Science378(6624),pp. 1092–1097.External Links:Document,LinkCited by:§G.9,§2.
- [36]T. Liang, Z. He, W. Jiao, X. Wang, Y. Wang, R. Wang, Y. Yang, S. Shi, and Z. Tu(2024-11)Encouraging divergent thinking in large language models through multi-agent debate.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,Miami, Florida, USA,pp. 17889–17904.External Links:Document,LinkCited by:§2.
- [37]S. Lifshitz, S. A. McIlraith, and Y. Du(2025)Multi-agent verification: scaling test-time compute with multiple verifiers.External Links:2502.20379,Document,LinkCited by:§G.5,§2.
- [38]H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe(2023)Let’s verify step by step.External Links:2305.20050,Document,LinkCited by:§1.
- [39]H. Lightman, V. Kosaraju, Y. Burda, H. Edwards, B. Baker, T. Lee, J. Leike, J. Schulman, I. Sutskever, and K. Cobbe(2024)Let’s verify step by step.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.5,§2.
- [40]A. Madaan, N. Tandon, P. Gupta, S. Hallinan, L. Gao, S. Wiegreffe, U. Alon, N. Dziri, S. Prabhumoye, Y. Yang, S. Gupta, B. P. Majumder, K. Hermann, S. Welleck, A. Yazdanbakhsh, and P. Clark(2023)Self-refine: iterative refinement with self-feedback.InAdvances in Neural Information Processing Systems,Vol.36.External Links:LinkCited by:§G.6,§2.
- [41]H. Manikandan, Y. Jiang, and J. Z. Kolter(2023)Language models are weak learners.InAdvances in Neural Information Processing Systems,Vol.36.External Links:LinkCited by:§G.2,§2.
- [42]C. Qian, W. Liu, H. Liu, N. Chen, Y. Dang, J. Li, C. Yang, W. Chen, Y. Su, X. Cong, J. Xu, D. Li, Z. Liu, and M. Sun(2024-08)ChatDev: communicative agents for software development.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),Bangkok, Thailand,pp. 15174–15186.External Links:Document,Link,2307.07924Cited by:§G.8.
- [43]J. Saad-Falcon, E. K. Buchanan, M. F. Chen, T. Huang, B. McLaughlin, T. Bhathal, S. Zhu, B. Athiwaratkun, F. Sala, S. Linderman, A. Mirhoseini, and C. Ré(2025)Shrinking the generation-verification gap with weak verifiers.External Links:2506.18203,Document,LinkCited by:§G.10,§G.5,§2.
- [44]J. Saad-Falcon, A. G. Lafuente, S. Natarajan, N. Maru, H. Todorov, E. K. Guha, E. K. Buchanan, M. F. Chen, N. Guha, C. Ré, and A. Mirhoseini(2024)Archon: an architecture search framework for inference-time techniques.External Links:2409.15254,Document,LinkCited by:§G.8,§2.
- [45]R. E. Schapire(1990)The strength of weak learnability.Machine Learning5(2),pp. 197–227.External Links:Document,LinkCited by:§G.1,§1,§2.
- [46]T. Schick, J. Dwivedi-Yu, R. Dessì, R. Raileanu, M. Lomeli, E. Hambro, L. Zettlemoyer, N. Cancedda, and T. Scialom(2023)Toolformer: language models can teach themselves to use tools.InAdvances in Neural Information Processing Systems,Vol.36.External Links:Link,2302.04761Cited by:§G.6.
- [47]Y. Shen, K. Song, X. Tan, D. Li, W. Lu, and Y. Zhuang(2023)HuggingGPT: solving AI tasks with ChatGPT and its friends in Hugging Face.InAdvances in Neural Information Processing Systems,Vol.36.External Links:Link,2303.17580Cited by:§G.6.
- [48]N. Shinn, F. Cassano, E. Berman, A. Gopinath, K. Narasimhan, and S. Yao(2023)Reflexion: language agents with verbal reinforcement learning.InAdvances in Neural Information Processing Systems,Vol.36.External Links:LinkCited by:§G.6,§2.
- [49]V. Venktesh, M. Rathee, and A. Anand(2025)Trust but verify! a survey on verification design for test-time scaling.External Links:2508.16665,Document,LinkCited by:§G.4.
- [50]J. Wang, J. Wang, B. Athiwaratkun, C. Zhang, and J. Zou(2024)Mixture-of-agents enhances large language model capabilities.External Links:2406.04692,Document,LinkCited by:§G.8,§2.
- [51]P. Wang, L. Li, Z. Shao, R. Xu, D. Dai, Y. Li, D. Chen, Y. Wu, and Z. Sui(2024-08)Math-shepherd: verify and reinforce LLMs step-by-step without human annotations.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers),Bangkok, Thailand,pp. 9426–9439.External Links:Document,LinkCited by:§G.5,§2.
- [52]X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, S. Narang, A. Chowdhery, and D. Zhou(2023)Self-consistency improves chain of thought reasoning in language models.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.3,§1,§2.
- [53]X. Wang, J. Wei, D. Schuurmans, Q. V. Le, E. H. Chi, and D. Zhou(2022)Rationale-augmented ensembles in language models.External Links:2207.00747,Document,LinkCited by:§G.3,§2.
- [54]J. Wei, X. Wang, D. Schuurmans, M. Bosma, B. Ichter, F. Xia, E. Chi, Q. V. Le, and D. Zhou(2022)Chain-of-thought prompting elicits reasoning in large language models.InAdvances in Neural Information Processing Systems,Vol.35,pp. 24824–24837.External Links:Link,2201.11903Cited by:§G.3.
- [55]D. H. Wolpert(1992)Stacked generalization.Neural Networks5(2),pp. 241–259.External Links:Document,LinkCited by:§G.1.
- [56]Q. Wu, G. Bansal, J. Zhang, Y. Wu, B. Li, E. Zhu, L. Jiang, X. Zhang, S. Zhang, J. Liu, A. H. Awadallah, R. W. White, D. Burger, and C. Wang(2023)AutoGen: enabling next-gen LLM applications via multi-agent conversation.External Links:2308.08155,Document,LinkCited by:§G.8,§2.
- [57]C. S. Xia, Y. Deng, S. Dunn, and L. Zhang(2025)Demystifying LLM-based software engineering agents.Proceedings of the ACM on Software Engineering2(FSE),pp. 801–824.External Links:Document,LinkCited by:§G.9,§1,§2.
- [58]Y. Xie, K. Kawaguchi, Y. Zhao, J. X. Zhao, M. Kan, J. He, and M. Xie(2023)Self-evaluation guided beam search for reasoning.InAdvances in Neural Information Processing Systems,Vol.36.External Links:Link,2305.00633Cited by:§G.7.
- [59]J. Yang, C. E. Jimenez, A. Wettig, K. Lieret, S. Yao, K. Narasimhan, and O. Press(2024)SWE-agent: agent-computer interfaces enable automated software engineering.InAdvances in Neural Information Processing Systems,Vol.37.External Links:Document,LinkCited by:§G.9,§1,§2.
- [60]S. Yao, D. Yu, J. Zhao, I. Shafran, T. L. Griffiths, Y. Cao, and K. Narasimhan(2023)Tree of thoughts: deliberate problem solving with large language models.InAdvances in Neural Information Processing Systems,Vol.36.External Links:LinkCited by:§G.7,§1,§2.
- [61]S. Yao, J. Zhao, D. Yu, N. Du, I. Shafran, K. Narasimhan, and Y. Cao(2023)ReAct: synergizing reasoning and acting in language models.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.7,§2.
- [62]Q. Zhang, F. Lyu, Z. Sun, L. Wang, W. Zhang, W. Hua, H. Wu, Z. Guo, Y. Wang, N. Muennighoff, I. King, X. Liu, and C. Ma(2025)A survey on test-time scaling in large language models: what, how, where, and how well?.External Links:2503.24235,Document,LinkCited by:§G.4,§2.
- [63]Y. Zhang, H. Ruan, Z. Fan, and A. Roychoudhury(2024)AutoCodeRover: autonomous program improvement.Note:Published version appears in the ACM SIGSOFT/ISSTA proceedingsExternal Links:2404.05427,Document,LinkCited by:§G.9,§1,§2.
- [64]L. Zheng, W. Chiang, Y. Sheng, S. Zhuang, Z. Wu, Y. Zhuang, Z. Lin, Z. Li, D. Li, E. P. Xing, H. Zhang, J. E. Gonzalez, and I. Stoica(2023)Judging LLM-as-a-judge with MT-Bench and chatbot arena.InAdvances in Neural Information Processing Systems,Vol.36.External Links:LinkCited by:§G.10,§G.5,§2.
- [65]A. Zhou, K. Yan, M. Shlapentokh-Rothman, H. Wang, and Y. Wang(2023)Language agent tree search unifies reasoning, acting, and planning in language models.External Links:2310.04406,Document,LinkCited by:§G.7,§2.
- [66]D. Zhou, N. Schärli, L. Hou, J. Wei, N. Scales, X. Wang, D. Schuurmans, C. Cui, O. Bousquet, Q. V. Le, and E. H. Chi(2023)Least-to-most prompting enables complex reasoning in large language models.InInternational Conference on Learning Representations,External Links:LinkCited by:§G.7,§2.
Appendix ABroader Impacts
This work studies inference-time amplification for verifier-backed reasoning tasks, including software engineering, theorem proving, and program synthesis. A positive implication is that stronger performance may be obtained from weaker model calls by sampling multiple candidates and selecting among them with local critics, comparators, tests, proof checkers, or constraint solvers. This could make capable reasoning systems more accessible without requiring the training or deployment of substantially larger models.
The main risk is that the same selection mechanism can amplify harmful uses of reasoning models. If a weak model can occasionally generate a useful candidate, a committee system may make that capability more reliable, including in dual-use settings such as code generation, exploit repair, or automated search over formal constraints. Our analysis also cautions against overtrusting such systems: when all proposals share a blind spot, critics and comparators can only select among flawed candidates. Deployment should therefore use sandboxing, access controls, provenance logs, held-out verification, and human review when outputs affect security, privacy, or critical infrastructure.
Appendix BPolynomial-Resource Consequence of the Main Amplification Bound
Section5shows that the failure probability ofΠk,m,r\Pi_{k,m,r}is controlled by two local quantities: proposer failure and identification failure. In particular, the main text gives the bound
errx(k,m,r)≤Lx(εprop(k)+k2e−βm−2rσ2).\mathrm{err}_{x}(k,m,r)\leq L_{x}\bigl(\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta m-2r\sigma^{2}}\bigr).This appendix records the corresponding polynomial-resource regime. The result is only a sufficient condition: it identifies when the committee protocol gives polynomial reliable amplification, but does not imply that all verifier-backed tasks satisfy these assumptions.
Definition 3(Committee-decomposable verifier family).
A family of search relationsℱ={Rx:x∈𝒳N,N∈ℕ}\mathcal{F}=\{R_{x}:x\in\mathcal{X}_{N},\ N\in\mathbb{N}\}iscommittee-decomposableif every instancex∈𝒳Nx\in\mathcal{X}_{N}admits the ingredients used in Sections3–5:
- (i)a valid state system with rank boundLx≤poly(N)L_{x}\leq\mathrm{poly}(N);
- (ii)a proposer portfolioPNP_{N}of sizepoly(N)\mathrm{poly}(N)satisfying Assumption1;
- (iii)efficient local identifiability satisfying Assumption2;
- (iv)proposer, critic, comparator, and verifier calls with runtime polynomial inNN.
Definition 4(Polynomial reliable amplification).
Fix a class𝔉\mathfrak{F}of search-relation families. A committee protocol achievespolynomial reliable amplificationover𝔉\mathfrak{F}if, for everyℱ∈𝔉\mathcal{F}\in\mathfrak{F}, every instancex∈𝒳Nx\in\mathcal{X}_{N}, and everyδ∈(0,1)\delta\in(0,1), it outputs a witnessyysatisfyingRx(y)=1R_{x}(y)=1with probability at least1−δ1-\delta, using at mostpoly(N,log(1/δ))\mathrm{poly}(N,\log(1/\delta))resources whenever such a witness exists.
Theorem 3(Polynomial amplification for committee-decomposable families).
Let𝔉\mathfrak{F}be a class of committee-decomposable verifier families, and letx∈𝒳Nx\in\mathcal{X}_{N}be an instance with rank boundLxL_{x}. Suppose the proposer harness has a uniform proposer-failure boundεprop(k)\varepsilon_{\mathrm{prop}}(k)over all reachable valid nonterminal states, and suppose local identifiability holds with edges(β0,σ0)(\beta_{0},\sigma_{0}). If
εprop(k)+k2exp(−β0m−2σ02r)≤δLx,\varepsilon_{\mathrm{prop}}(k)+k^{2}\exp(-\beta_{0}m-2\sigma_{0}^{2}r)\leq\frac{\delta}{L_{x}},(1)thenΠk,m,r\Pi_{k,m,r}outputs a witnessyysatisfyingRx(y)=1R_{x}(y)=1with probability at least1−δ1-\delta. The all-pairs implementation uses
O(Lx(k+mk+rk2))O\!\left(L_{x}(k+mk+rk^{2})\right)(2)role calls.
In particular, ifLxL_{x},|PN||P_{N}|, per-call runtime,1/α01/\alpha_{0},1/β01/\beta_{0}, and1/σ01/\sigma_{0}are polynomially bounded, and ifεprop(k)\varepsilon_{\mathrm{prop}}(k)can be driven below the target accuracy with polynomially many proposer calls, then the family is solvable withpoly(N,log(1/δ))\mathrm{poly}(N,\log(1/\delta))resources.
Proof.
Fix an instancex∈𝒳Nx\in\mathcal{X}_{N}with rank boundLxL_{x}, and suppose Eq. (1) holds:
εprop(k)+k2e−β0m−2σ02r≤δLx.\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta_{0}m-2\sigma_{0}^{2}r}\leq\frac{\delta}{L_{x}}.By Theorem2, every reachable valid nonterminal statesshas local error
εloc(s)≤εprop(k)+k2e−β0m−2σ02r≤δLx.\varepsilon_{\mathrm{loc}}(s)\leq\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta_{0}m-2\sigma_{0}^{2}r}\leq\frac{\delta}{L_{x}}.The bound is uniform over reachable states, so Lemma1gives
ℙ(Rx(Πk,m,r(x))=0)≤Lx⋅δLx=δ.\mathbb{P}(R_{x}(\Pi_{k,m,r}(x))=0)\leq L_{x}\cdot\frac{\delta}{L_{x}}=\delta.ThusΠk,m,r\Pi_{k,m,r}outputs a witnessyysatisfyingRx(y)=1R_{x}(y)=1with probability at least1−δ1-\delta.
For the role-call bound, each step useskkproposer calls,mkmkcritic calls, and at mostr(k2)=O(rk2)r\binom{k}{2}=O(rk^{2})comparator calls in the all-pairs implementation. Since every successful trajectory has length at mostLxL_{x}, the total number of role calls is
O(Lx(k+mk+rk2)).O\!\left(L_{x}(k+mk+rk^{2})\right).∎
Proof.
Fix an instancex∈𝒳Nx\in\mathcal{X}_{N}with rank boundLxL_{x}, and suppose Eq. (1) holds:
εprop(k)+k2e−β0m−2σ02r≤δLx.\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta_{0}m-2\sigma_{0}^{2}r}\leq\frac{\delta}{L_{x}}.By Theorem2, every reachable valid nonterminal statesshas local error
εloc(s)≤εprop(k)+k2e−β0m−2σ02r≤δLx.\varepsilon_{\mathrm{loc}}(s)\leq\varepsilon_{\mathrm{prop}}(k)+k^{2}e^{-\beta_{0}m-2\sigma_{0}^{2}r}\leq\frac{\delta}{L_{x}}.The bound is uniform over reachable states, so Lemma1gives
ℙ(Rx(Πk,m,r(x))=0)≤Lx⋅δLx=δ.\mathbb{P}(R_{x}(\Pi_{k,m,r}(x))=0)\leq L_{x}\cdot\frac{\delta}{L_{x}}=\delta.ThusΠk,m,r\Pi_{k,m,r}outputs a witnessyysatisfyingRx(y)=1R_{x}(y)=1with probability at least1−δ1-\delta.
For the role-call bound, each step useskkproposer calls,mkmkcritic calls, and at mostr(k2)=O(rk2)r\binom{k}{2}=O(rk^{2})comparator calls in the all-pairs implementation. Running the bounded-depth protocol for at mostLxL_{x}steps therefore uses
O(Lx(k+mk+rk2))O\!\left(L_{x}(k+mk+rk^{2})\right)role calls. ∎
Explicit parameter choices for the polynomial-resource claim.Under the round-robin portfolio assignment from Theorem1, the good promptps∈PNp_{s}\in P_{N}is queried at least⌊k/|PN|⌋\lfloor k/|P_{N}|\rfloortimes at every reachable state. Hence
εprop(k)≤(1−α0)⌊k/|PN|⌋≤exp(−α0⌊k|PN|⌋).\varepsilon_{\mathrm{prop}}(k)\leq(1-\alpha_{0})^{\lfloor k/|P_{N}|\rfloor}\leq\exp\!\left(-\alpha_{0}\left\lfloor\frac{k}{|P_{N}|}\right\rfloor\right).Thus it is enough to choose
k≥|PN|⌈log(2Lx/δ)α0⌉.k\geq|P_{N}|\left\lceil\frac{\log(2L_{x}/\delta)}{\alpha_{0}}\right\rceil.For the identification term, it suffices that
k2e−β0m−2σ02r≤δ2Lx,k^{2}e^{-\beta_{0}m-2\sigma_{0}^{2}r}\leq\frac{\delta}{2L_{x}},equivalently,
β0m+2σ02r≥log2k2Lxδ.\beta_{0}m+2\sigma_{0}^{2}r\geq\log\frac{2k^{2}L_{x}}{\delta}.One separated choice is
m≥⌈12β0log2k2Lxδ⌉,r≥⌈14σ02log2k2Lxδ⌉.m\geq\left\lceil\frac{1}{2\beta_{0}}\log\frac{2k^{2}L_{x}}{\delta}\right\rceil,\qquad r\geq\left\lceil\frac{1}{4\sigma_{0}^{2}}\log\frac{2k^{2}L_{x}}{\delta}\right\rceil.Therefore, ifLxL_{x},|PN||P_{N}|, per-call runtime,1/α01/\alpha_{0},1/β01/\beta_{0}, and1/σ01/\sigma_{0}are polynomially bounded, the required parameters and total runtime are polynomial inNNandlog(1/δ)\log(1/\delta).
The theorem makes explicit the efficient regime implicit in the main amplification bound. Inference-time boosting can fail to be polynomial for several different reasons: exponentially small proposal mass, weak critic or comparator edges, long progress depth, or expensive verification. Thus the result applies to the locally identifiable core of a task, not automatically to every task with a final verifier.
Appendix CImplementation Details
This appendix describes the implementation used for the SWE-bench Verified experiments in Section7. The experiments are designed to separate proposal coverage from selection quality. We therefore first generate a fixed pool of candidate patches for each task and then reuse this same pool across all selector ablations.
C.1Benchmark and Evaluation
We evaluate on SWE-bench Verified[28], which contains500500real software-engineering tasks. Each task provides a repository state and an issue description. A submitted patch is counted as correct only if it passes the benchmark’s held-out hidden tests. All headline solve rates are computed over the full set of500500tasks.
Candidate patches are evaluated with the SWE-bench evaluation harness. Hidden test outcomes are used only for scoring and for oracle diagnostics. They are never exposed to the deployed selector.
C.2Proposal Generation
For each taskxx, we generate a pool ofk=8k=8candidate patches
𝒫(x)={p1(x),…,pk(x)}\mathcal{P}(x)=\{p_{1}(x),\ldots,p_{k}(x)\}using independentGPT-5.4 nanoproposer calls. Each proposer is implemented as a mini-swe-agent trajectory over the repository. The agent receives the issue and interacts with the repository environment to produce a patch. The proposer runs are independent, and their outputs are cached before any selector ablation is performed.
This fixed-pool design is important. All selector variants, including binary-judge-only selection, comparator-only selection, hybrid tie-breaking, and full orchestration, select from the same candidate patches. Therefore differences in solve rate come from the selection rule rather than from changes in proposal generation.
C.3Oracle Best-of-kk
The oracle best-of-kkscore measures latent proposal coverage. For each task, we evaluate every cached proposal against the hidden tests and count the task as oracle solved if at least one proposal succeeds:
𝟏oracle(x)=𝟏{∃j≤k:pj(x)passes the hidden tests}.\mathbf{1}_{\mathrm{oracle}}(x)=\mathbf{1}\{\exists j\leq k:\ p_{j}(x)\text{ passes the hidden tests}\}.This oracle is not deployable because it uses hidden-test outcomes. It is used only as a diagnostic: it measures whether the proposal pool contains a correct patch somewhere, independently of whether the selector can identify that patch.
C.4Binary Patch Judges
The single-patch selection signal is implemented as a binary patch judge. For each candidate patch, the judge receives the issue text and the proposed patch and answers whether the patch resolves the issue without breaking previously working behavior. Each judgment returns a binary decision, a short rationale, and a confidence score. The prompt asks the judge to trace the failure mode from the issue to the code path and decide whether the patch changes the relevant behavior.
For each task and patch, multiple independent binary judgments are cached. A binary-judge-only selector ranks patches by the number of positive judgments, with ties broken by the lowest proposal index. We also evaluate a confidence-weighted variant, where positive judgments are weighted by the reported confidence. These cached judgments are reused for voter-count and proposal-budget ablations.
C.5Pairwise Comparators
The pairwise comparator evaluates two candidate patches for the same issue. Given the issue and two patches, the comparator first states a hypothesis for the failure mode, lists the relevant changes made by each patch, and then decides whether each patch is consistent with the hypothesized failure and whether it introduces collateral behavioral changes. It then outputs a winner in{A,B,TIE}\{A,B,\mathrm{TIE}\}together with a confidence score.
Pairwise comparator outcomes are aggregated into a tournament. A Copeland-style rule scores each candidate by its pairwise wins and selects the highest-scoring candidate. Ties in the final Copeland score are broken deterministically by the lowest proposal index in the offline ablations. This gives a comparator-only selector when applied to the full proposal pool.
C.6Hybrid and Full Orchestration Selectors
The hybrid selector combines the binary patch judge with pairwise comparators. First, the binary judge scores each patch. Patches that pass a fixed yes-rate threshold are retained as candidates for pairwise comparison. If no patch passes the gate, the selector falls back to the highest-scoring patch under the binary judge. Pairwise comparator aggregation is then applied only to the retained candidate set.
The full orchestration system uses the same cached proposal pool and combines the binary-judge signal with pairwise comparator aggregation. Thus, the comparison between binary-judge-only, comparator-only, hybrid tie-breaking, and full orchestration isolates the value of different selection signals while holding proposal generation fixed.
C.7Proposal-Budget Ablations
To study the effect of proposal diversity, we compute solve rate as a function of the proposal budgetkk. For eachkk, the selector is restricted to a subset ofkkproposals from the cachedk=8k=8pool. The oracle best-of-kkcurve is computed using hidden-test outcomes for thosekkproposals. The implemented selector curves are computed using only cached binary-judge and comparator signals.
For subset-based ablations, results are averaged over proposal subsets of the same size when applicable. This avoids attributing performance changes to an arbitrary ordering of proposal indices.
C.8Vote-Count Ablations
The binary-judge and comparator vote caches also allow offline vote-count ablations. For binary judges, we recompute the selector using subsets of the available judge votes and report the resulting solve rate. For comparators, we similarly recompute the Copeland tournament using subsets of the available pairwise votes. These ablations test how much selection accuracy depends on the number of local judgments rather than on additional proposal generation.
C.9Failure Decomposition
For the category-level decomposition in Figure4, each task is assigned to one of three mutually exclusive groups:
- •Solved by orchestration:the patch selected by the deployed orchestration system passes the hidden tests.
- •Oracle reachable, missed:at least one generated proposal passes the hidden tests, but the deployed selector does not select a passing patch.
- •Oracle unreachable:none of the generated proposals passes the hidden tests.
This decomposition separates selection failures from proposal-coverage failures. An oracle-reachable-but-missed task is a selection failure: the correct patch was present in the proposal pool, but the selector failed to recover it. An oracle-unreachable task is a coverage failure: no selector could solve the task using the available proposal pool.
C.10Reported Quantities
The main metric is hidden-test solve rate. For a deployed selectorAA, the solve rate is
1500∑x𝟏{A(x)passes the hidden tests}.\frac{1}{500}\sum_{x}\mathbf{1}\{A(x)\text{ passes the hidden tests}\}.We also report the oracle best-of-kksolve rate and the oracle-gap recovery ratio
psystem(k)−p1poracle(k)−p1,\frac{p_{\mathrm{system}}(k)-p_{1}}{p_{\mathrm{oracle}}(k)-p_{1}},wherep1p_{1}is the single-proposal baseline,poracle(k)p_{\mathrm{oracle}}(k)is the oracle best-of-kkrate, andpsystem(k)p_{\mathrm{system}}(k)is the implemented orchestration rate. This ratio measures how much of the latent best-of-kkimprovement is recovered by the deployed selector.
C.11Prompts and Parsing
This subsection gives the selector prompts and parsing rules used in the offline SWE-bench Verified selector experiments. The proposer patches are generated once and cached; the prompts below are used only for selecting among the cached proposals.
Binary patch-judge prompt.The binary patch judge evaluates one patch at a time. The prompt asks whether the patch resolves the issue and preserves previously passing behavior.
Binary patch\-judge prompt The issue text is truncated to 20002000 characters and the proposed patch to 80008000 characters\. Each patch receives multiple independent binary judgments\. Malformed responses, refusals, or parse failures are treated as abstentions and do not contribute positive votes\. Binary\-judge aggregation\. For each patch, we compute the number of positive votes, syes\(p\)=\#\{v:v\.resolves=true\}\.s\_\{\\mathrm\{yes\}\}\(p\)=\\\#\\\{v:\\ v\.\\mathrm\{resolves\}=\\mathrm\{true\}\\\}\. The majority binary selector chooses the patch with the largest syess\_\{\\mathrm\{yes\}\}\. Ties are broken by the lowest proposal index\. We also compute a confidence\-weighted score sconf\(p\)=∑v:v\.resolves=truev\.confidence,s\_\{\\mathrm\{conf\}\}\(p\)=\\sum\_\{v:\\ v\.\\mathrm\{resolves\}=\\mathrm\{true\}\}v\.\\mathrm\{confidence\}, again breaking ties by the lowest proposal index\. Unless otherwise stated, the main binary selector uses the yes\-count rule\. Pairwise comparator prompt\. The comparator evaluates two patches directly\. It is instructed to prefer the patch that minimally resolves the issue, rather than the patch that is longer, more ambitious, or more visually complex\. Pairwise comparator prompt As in the binary patch\-judge prompt, the issue text is truncated to 20002000 characters and each patch to 80008000 characters\. Responses are parsed as JSON\. If the response is malformed after the allowed retries, the comparator judgment is treated as missing and handled as a tie in the offline aggregation\.
C\.12 Selection, Aggregation, and Tie\-Breaking Position\-swap debiasing\. For each unordered pair of patches \(pi,pj\)\(p\_\{i\},p\_\{j\}\), the comparator is queried in both orders: first with pip\_\{i\} as Patch A and pjp\_\{j\} as Patch B, and then with the positions swapped\. The swapped response is mapped back to the original patch indices before aggregation\. This position\-swap protocol reduces lead\-position bias and prevents the selector from favoring a patch because it was shown first rather than because it better addresses the issue\. In the conservative offline rule, a pairwise comparison is counted as a win for pip\_\{i\} only if both position orders select pip\_\{i\}, and as a win for pjp\_\{j\} only if both position orders select pjp\_\{j\}\. If the two orders disagree, or if either response is missing or tied, the aggregated pairwise outcome is treated as a tie\. Comparator aggregation\. Pairwise comparator outcomes are aggregated with a Copeland\-style tournament\. For each unordered pair \(pi,pj\)\(p\_\{i\},p\_\{j\}\), the aggregated pairwise outcome is one of pip\_\{i\} wins, pjp\_\{j\} wins, or tie\. A win contributes one point to the winning patch; a tie contributes no win to either patch in the offline aggregation\. The selected patch is the one with the largest Copeland score\. Ties in the final Copeland score are broken deterministically by the lowest proposal index\. Hybrid gated comparator\. The hybrid selector first applies the binary patch judge and retains patches whose yes\-rate exceeds a threshold τ\\tau\. If no patch passes the gate, the selector falls back to the highest\-scoring patch under the binary judge\. Pairwise comparator aggregation is then applied only to the retained patches\. Unless otherwise specified, the main gated\-comparator ablations use τ=0\.5\\tau=0\.5\. Threshold\-sweep experiments are treated as offline diagnostics\. Parse failures and abstentions\. All selector prompts require strict JSON outputs\. If a response cannot be parsed after the allowed retries, the corresponding judgment is treated as an abstention\. For the binary patch judge, abstentions are not counted as positive votes\. For pairwise comparison, missing or unparseable comparator outputs are treated as ties in the offline aggregation\. This prevents malformed outputs from creating artificial wins\. Oracle diagnostics\. The oracle best\-of\-kk diagnostic is computed only after hidden\-test evaluation\. It counts a task as solved if any cached proposal passes the hidden tests\. The deployed selector never observes these hidden\-test outcomes\. Therefore, at a fixed proposal budget kk, the gap between oracle best\-of\-kk and implemented orchestration measures missed selection opportunities: cases where a correct patch was present but not selected\. The remaining gap between oracle best\-of\-kk and perfect performance reflects finite\-budget proposal\-coverage failures, including possible shared blind spots of the proposer pool\. C\.13 Selector Ablations Table 1 gives additional selector ablations for the SWE\-bench Verified experiment\. These ablations use the same cached k=8k=8 GPT\-5\.4 nano proposal pool as the main experiment, so changes in solve rate reflect the selector rather than proposal generation\. The first block varies the comparator aggregation rule using GPT\-5\.4 nano comparators with five comparator calls per pairwise matchup\. Copeland round\-robin and strict dominance both reach 75\.8%75\.8\\%, while the single\-elimination bracket drops to 74\.6%74\.6\\%\. This supports the main\-text claim that all\-pairs aggregation preserves useful pairwise evidence better than cheaper tournament structures\. The second block varies the critic\-gate threshold\. With no critic gate, pure Copeland over the eight proposals reaches 75\.8%75\.8\\%\. Dropping only patches with zero critic support raises performance to 76\.4%76\.4\\%, matching the best full\-harness result\. Increasing the threshold beyond τ≥2\\tau\\geq 2 slightly reduces performance\. Thus the critic stage is most useful as a permissive coarse filter, not as a strict correctness certificate\. Table 1: Selector ablations on SWE\-bench Verified\. All rows use the same k=8k=8 GPT\-5\.4 nano proposal pool\. Tournament rows vary the GPT\-5\.4 nano comparator aggregation rule with 55 comparator calls per pairwise matchup\. Threshold rows vary the five\-critic gate with 55 critics and 55 comparators\. Oracle\-gap recovery is computed using the metric defined in Section 7\.1\. Appendix D Proofs for Section 4 This section proves the bridge results\. The central point is that proposer coverage and local identifiability are logically distinct\. Coverage says that good actions can be sampled with nontrivial probability\. Identifiability says that bad actions can be rejected or ranked below good ones\. The latter cannot be obtained from black\-box sampling alone\. D\.1 Formal information model for Proposition 1 A local identification procedure is allowed to know the family construction, the action set 𝒜\\mathcal\{A\}, and the proposer distribution μ\\mu\. It may observe any finite list of candidate actions and polynomially many i\.i\.d\. samples from μ\\mu\. It does not observe the hidden world parameter θ\\theta\. A critic edge is required to hold uniformly over worlds: for every θ\\theta, the critic must never reject actions in Soundθ\\mathrm\{Sound\}\_\{\\theta\} and must reject actions outside Soundθ\\mathrm\{Sound\}\_\{\\theta\} with probability bounded away from zero\. A comparator edge is also required to hold uniformly: for every θ\\theta, whenever one candidate is in Soundθ\\mathrm\{Sound\}\_\{\\theta\} and the other is not, the comparator must prefer the sound candidate with probability at least 1/2\+σ1/2\+\\sigma for some σ\>0\\sigma\>0\. D\.2 Proof of Proposition 1 See 1 Proof\. Fix M≥2M\\geq 2\. Let the action set be 𝒜=\{1,…,M\}\\mathcal\{A\}=\\\{1,\\ldots,M\\\} and let the hidden world parameter be θ∈\{1,…,M\}\\theta\\in\\\{1,\\ldots,M\\\}\. In world θ\\theta, define Soundθ=𝒜∖\{θ\}\.\\mathrm\{Sound\}\_\{\\theta\}=\\mathcal\{A\}\\setminus\\\{\\theta\\\}\. Let the proposer distribution be μ=Unif\(𝒜\)\\mu=\\mathrm\{Unif\}\(\\mathcal\{A\}\) in every world\. Then, for every θ\\theta, ℙa∼μ\(a∈Soundθ\)=M−1M=1−1M\.\\mathbb\{P\}\_\{a\\sim\\mu\}\(a\\in\\mathrm\{Sound\}\_\{\\theta\}\)=\\frac\{M\-1\}\{M\}=1\-\\frac\{1\}\{M\}\. Thus the proposer succeeds with probability α0=1−1/M\\alpha\_\{0\}=1\-1/M in every world\. The observable transcript of any procedure that sees only candidate actions and samples from μ\\mu is independent of θ\\theta, because the proposer distribution is identical in every world\. We use this to rule out uniform critic and comparator edges\. First consider critics\. Fix an action i∈𝒜i\\in\\mathcal\{A\}\. In world θ=i\\theta=i, the action ii is unsound\. In any world θ=j≠i\\theta=j\\neq i, the same action ii is sound\. Since the transcript distribution is identical in these two worlds, the critic has the same probability of rejecting ii in both worlds\. Uniform one\-sided soundness requires that this rejection probability be zero in every world where ii is sound\. Hence the rejection probability for ii must be zero in every world θ≠i\\theta\\neq i\. But the transcript distribution is the same in world θ=i\\theta=i, so the rejection probability is also zero in the world where ii is unsound\. Therefore no critic can reject the unsound action with probability at least β\>0\\beta\>0 uniformly over worlds\. Now consider comparators\. Fix two distinct actions i,j∈𝒜i,j\\in\\mathcal\{A\}\. In world θ=i\\theta=i, action ii is unsound and action jj is sound, so a comparator with edge σ\>0\\sigma\>0 must prefer jj to ii with probability at least 1/2\+σ1/2\+\\sigma\. In world θ=j\\theta=j, action jj is unsound and action ii is sound, so the comparator must prefer ii to jj with probability at least 1/2\+σ1/2\+\\sigma\. But the observable transcript distribution is identical in these two worlds\. Therefore the probability of preferring ii over jj must be the same in both worlds\. It cannot be both at least 1/2\+σ1/2\+\\sigma and at most 1/2−σ1/2\-\\sigma\. Hence no uniform comparator edge exists\. Thus proposer coverage can be arbitrarily strong while local identifiability is impossible in this black\-box information model\. ∎ D\.3 Proof of Theorem 1 See 1 Proof\. Fix a reachable valid nonterminal state ss\. By Assumption 1, there exists a policy or prompt ps∈PNp\_\{s\}\\in P\_\{N\} such that ℙ\(LLM\(ps\)∈Soundx\(s\)\)≥α0\\mathbb\{P\}\(\\mathrm\{LLM\}\(p\_\{s\}\)\\in\\mathrm\{Sound\}\_\{x\}\(s\)\)\\geq\\alpha\_\{0\}\. Under round\-robin assignment, if the committee makes kk proposer calls, then psp\_\{s\} is used at least q=⌊k/\|PN\|⌋q=\\lfloor k/\|P\_\{N\}\|\\rfloor times\. Since these calls use fresh independent randomness, the probability that none of them returns an action in Soundx\(s\)\\mathrm\{Sound\}\_\{x\}\(s\) is at most \(1−α0\)q\(1\-\\alpha\_\{0\}\)^\{q\}\. Hence αcommittee\(s\)≥1−\(1−α0\)⌊k/\|PN\|⌋\.\\alpha\_\{\\mathrm\{committee\}\}\(s\)\\geq 1\-\(1\-\\alpha\_\{0\}\)^\{\\lfloor k/\|P\_\{N\}\|\\rfloor\}\. If k≥\|PN\|⌈α0−1ln\(1/δprop\)⌉k\\geq\|P\_\{N\}\|\\left\\lceil\\alpha\_\{0\}^\{\-1\}\\ln\(1/\\delta\_\{\\mathrm\{prop\}\}\)\\right\\rceil, then ⌊k/\|PN\|⌋≥α0−1ln\(1/δprop\)\\lfloor k/\|P\_\{N\}\|\\rfloor\\geq\\alpha\_\{0\}^\{\-1\}\\ln\(1/\\delta\_\{\\mathrm\{prop\}\}\)\. Using 1−u≤e−u1\-u\\leq e^\{\-u\} for u∈\[0,1\]u\\in\[0,1\], \(1−α0\)⌊k/\|PN\|⌋≤exp\(−α0⌊k/\|PN\|⌋\)≤δprop\.\(1\-\\alpha\_\{0\}\)^\{\\lfloor k/\|P\_\{N\}\|\\rfloor\}\\leq\\exp\\\!\\left\(\-\\alpha\_\{0\}\\left\\lfloor k/\|P\_\{N\}\|\\right\\rfloor\\right\)\\leq\\delta\_\{\\mathrm\{prop\}\}\. Therefore αcommittee\(s\)≥1−δprop\\alpha\_\{\\mathrm\{committee\}\}\(s\)\\geq 1\-\\delta\_\{\\mathrm\{prop\}\}\. The critic and comparator guarantees are not derived from Assumption 1; they are exactly the local\-identifiability guarantees supplied by Assumption 2\. Thus β\(s\)≥β0\\beta\(s\)\\geq\\beta\_\{0\} and σ\(s\)≥σ0\\sigma\(s\)\\geq\\sigma\_\{0\}\. This proves the theorem\. ∎ D\.4 Proof of Corollary 1 See 1 Proof\. Use the verifier Vx\(s,a\)V\_\{x\}\(s,a\) as the critic\. If a∈Soundx\(s\)a\\in\\mathrm\{Sound\}\_\{x\}\(s\), Assumption 3 gives Vx\(s,a\)=acceptV\_\{x\}\(s,a\)=\\textsc\{accept\} almost surely, so the critic never rejects a progressing\-sound action\. If a∉Soundx\(s\)a\\notin\\mathrm\{Sound\}\_\{x\}\(s\), Assumption 3 gives ℙ\(Vx\(s,a\)=reject\)≥1−ν\\mathbb\{P\}\(V\_\{x\}\(s,a\)=\\textsc\{reject\}\)\\geq 1\-\\nu\. Hence the critic edge is β0=1−ν\\beta\_\{0\}=1\-\\nu\. For comparison, suppose a∈Soundx\(s\)a\\in\\mathrm\{Sound\}\_\{x\}\(s\) and b∉Soundx\(s\)b\\notin\\mathrm\{Sound\}\_\{x\}\(s\)\. Verify both candidates\. The sound candidate aa is never rejected, while bb is rejected with probability at least 1−ν1\-\\nu\. If bb is rejected, choose aa; if both candidates are accepted, break the tie uniformly at random\. Therefore the probability of choosing the sound candidate is at least \(1−ν\)\+ν2=12\+1−ν2\.\(1\-\\nu\)\+\\frac\{\\nu\}\{2\}=\\frac\{1\}\{2\}\+\\frac\{1\-\\nu\}\{2\}\. Thus the comparator edge is σ0=\(1−ν\)/2\\sigma\_\{0\}=\(1\-\\nu\)/2\. This proves the corollary\. ∎ D\.5 Why disagreement\-based identification is insufficient One might try to identify bad actions by comparing them to a sampled reference action\. This is not valid without a much stronger canonical\-witness assumption\. In many verifier\-backed domains there are many distinct progressing\-sound actions at the same state\. Two proof steps may differ syntactically while both preserve solvability and make progress\. Two program continuations may use different implementations while both satisfy the local specification\. Therefore token agreement, syntactic proximity, or agreement with one sampled reference is not a reliable proxy for soundness\. A theorem based on sampled\-reference agreement would require an additional assumption, such as the existence of a canonical progressing\-sound witness at each state and a known representation in which soundness is equivalent, or nearly equivalent, to agreement with that witness\. That is much stronger than Assumption 1\. Without such an assumption, Proposition 1 rules out deriving local identification from coverage alone\. Appendix E Proofs for Section 5 This section proves the cumulative and local error bounds\. We explicitly include the local\-failure/no\-survivor case in the bad event\. E\.1 Proof of Lemma 1 See 1 Proof\. For each t<Lxt<L\_\{x\}, let BtB\_\{t\} be the event that, when St∈Validx∖TermxS\_\{t\}\\in\\mathrm\{Valid\}\_\{x\}\\setminus\\mathrm\{Term\}\_\{x\}, the protocol either declares local failure or selects an action At∉Soundx\(St\)A\_\{t\}\\notin\\mathrm\{Sound\}\_\{x\}\(S\_\{t\}\)\. If St∉Validx∖TermxS\_\{t\}\\notin\\mathrm\{Valid\}\_\{x\}\\setminus\\mathrm\{Term\}\_\{x\}, set Bt=∅B\_\{t\}=\\emptyset\. Equivalently, local failure may be represented as selecting a formal invalid absorbing action\. By assumption, ℙ\(Bt∣ℱt\)≤εt\\mathbb\{P\}\(B\_\{t\}\\mid\\mathcal\{F\}\_\{t\}\)\\leq\\varepsilon\_\{t\}\. Let G=⋂t=0Lx−1BtcG=\\bigcap\_\{t=0\}^\{L\_\{x\}\-1\}B\_\{t\}^\{c\}\. On GG, whenever the protocol is at a valid nonterminal state, it selects an action in Soundx\(St\)\\mathrm\{Sound\}\_\{x\}\(S\_\{t\}\)\. By Definition 2, the next state remains valid and the rank strictly decreases: St\+1=Φx\(St,At\)∈Validx,dx\(St\+1\)<dx\(St\)\.S\_\{t\+1\}=\\Phi\_\{x\}\(S\_\{t\},A\_\{t\}\)\\in\\mathrm\{Valid\}\_\{x\},\\qquad d\_\{x\}\(S\_\{t\+1\}\)<d\_\{x\}\(S\_\{t\}\)\. Since dxd\_\{x\} is a nonnegative integer with rank bound LxL\_\{x\}, the protocol reaches a terminal state within at most LxL\_\{x\} progressing steps\. By Definition 1, every terminal state s∈Termxs\\in\\mathrm\{Term\}\_\{x\} satisfies Rx\(Outx\(s\)\)=1R\_\{x\}\(\\mathrm\{Out\}\_\{x\}\(s\)\)=1\. Therefore, \{Rx\(Πk,m,r\(x\)\)=0\}⊆⋃t=0Lx−1Bt\.\\\{R\_\{x\}\(\\Pi\_\{k,m,r\}\(x\)\)=0\\\}\\subseteq\\bigcup\_\{t=0\}^\{L\_\{x\}\-1\}B\_\{t\}\. The union bound and tower property give ℙ\(Rx\(Πk,m,r\(x\)\)=0\)≤∑t=0Lx−1ℙ\(Bt\)=∑t=0Lx−1𝔼\[ℙ\(Bt∣ℱt\)\]≤𝔼\[∑t=0Lx−1εt\]\.\\mathbb\{P\}\(R\_\{x\}\(\\Pi\_\{k,m,r\}\(x\)\)=0\)\\leq\\sum\_\{t=0\}^\{L\_\{x\}\-1\}\\mathbb\{P\}\(B\_\{t\}\)=\\sum\_\{t=0\}^\{L\_\{x\}\-1\}\\mathbb\{E\}\[\\mathbb\{P\}\(B\_\{t\}\\mid\\mathcal\{F\}\_\{t\}\)\]\\leq\\mathbb\{E\}\\\!\\left\[\\sum\_\{t=0\}^\{L\_\{x\}\-1\}\\varepsilon\_\{t\}\\right\]\. If εt≤ε\\varepsilon\_\{t\}\\leq\\varepsilon almost surely for every tt, this becomes ℙ\(Rx\(Πk,m,r\(x\)\)=0\)≤Lxε\.\\mathbb\{P\}\(R\_\{x\}\(\\Pi\_\{k,m,r\}\(x\)\)=0\)\\leq L\_\{x\}\\varepsilon\. This proves the lemma\. ∎ E\.2 Proof of Theorem 2 See 2 Proof\. Fix a reachable valid nonterminal state ss, and let C1,…,CkC\_\{1\},\\ldots,C\_\{k\} be the proposed candidates\. Define Eprop:=\{Ci∉Soundx\(s\) for all i=1,…,k\}\.E\_\{\\mathrm\{prop\}\}:=\\\{C\_\{i\}\\notin\\mathrm\{Sound\}\_\{x\}\(s\)\\text\{ for all \}i=1,\\ldots,k\\\}\. By definition, ℙ\(Eprop\)=εprop\(k;s\)\\mathbb\{P\}\(E\_\{\\mathrm\{prop\}\}\)=\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\. If EpropE\_\{\\mathrm\{prop\}\} occurs, then no progressing\-sound candidate is available, so the protocol may either select an unsound action or declare local failure; both are catastrophic local errors and are charged to εprop\(k;s\)\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\. Now condition on EpropcE\_\{\\mathrm\{prop\}\}^\{c\}\. At least one progressing\-sound candidate has been proposed\. Since critics are one\-sided, no progressing\-sound candidate is ever rejected, so at least one progressing\-sound candidate survives criticism and the no\-survivor case cannot occur\. Fix an ordered pair \(a,b\)\(a,b\) of proposed candidates with a∈Soundx\(s\)a\\in\\mathrm\{Sound\}\_\{x\}\(s\) and b∉Soundx\(s\)b\\notin\\mathrm\{Sound\}\_\{x\}\(s\)\. The mm critics assigned to bb reject it independently with probability at least β\\beta, so ℙ\(b survives all m critics\)≤\(1−β\)m\.\\mathbb\{P\}\(b\\text\{ survives all \}m\\text\{ critics\}\)\\leq\(1\-\\beta\)^\{m\}\. Conditioned on bb surviving, each of the rr fresh comparator votes prefers aa to bb with probability at least 1/2\+σ1/2\+\\sigma, independently across votes\. Let Yj∈\{0,1\}Y\_\{j\}\\in\\\{0,1\\\} indicate whether vote jj prefers aa to bb\. Since 𝔼\[Yj\]≥1/2\+σ\\mathbb\{E\}\[Y\_\{j\}\]\\geq 1/2\+\\sigma, the unsound candidate bb wins the majority comparison only if 1r∑j=1rYj≤1/2\\frac\{1\}\{r\}\\sum\_\{j=1\}^\{r\}Y\_\{j\}\\leq 1/2, where ties may be broken adversarially against aa\. By Hoeffding’s inequality, ℙ\(1r∑j=1rYj≤12\)≤e−2rσ2\.\\mathbb\{P\}\\\!\\left\(\\frac\{1\}\{r\}\\sum\_\{j=1\}^\{r\}Y\_\{j\}\\leq\\frac\{1\}\{2\}\\right\)\\leq e^\{\-2r\\sigma^\{2\}\}\. Since the comparison randomness is fresh relative to the critic randomness, for this fixed ordered pair, ℙ\(b survives criticism and defeats a\)≤\(1−β\)me−2rσ2\.\\mathbb\{P\}\(b\\text\{ survives criticism and defeats \}a\)\\leq\(1\-\\beta\)^\{m\}e^\{\-2r\\sigma^\{2\}\}\. There are at most k2k^\{2\} ordered sound/unsound pairs, so by a union bound the probability that some unsound candidate survives criticism and defeats some sound candidate is at most k2\(1−β\)me−2rσ2\.k^\{2\}\(1\-\\beta\)^\{m\}e^\{\-2r\\sigma^\{2\}\}\. Call this event EidE\_\{\\mathrm\{id\}\}\. If neither EpropE\_\{\\mathrm\{prop\}\} nor EidE\_\{\\mathrm\{id\}\} occurs, then at least one sound candidate survives, and every surviving unsound candidate loses its pairwise comparison to every surviving sound candidate\. We claim that every Copeland winner is then sound\. Let S≥1S\\geq 1 be the number of surviving sound candidates and U≥0U\\geq 0 the number of surviving unsound candidates\. Every sound candidate beats all UU unsound candidates, so its Copeland score is at least UU\. Every unsound candidate loses to all SS sound candidates, so it can only score wins against other unsound candidates and has Copeland score at most U−1U\-1\. Thus no unsound candidate can be a Copeland winner\. Therefore catastrophic local error can occur only if EpropE\_\{\\mathrm\{prop\}\} or EidE\_\{\\mathrm\{id\}\} occurs\. Hence εloc\(s\)≤εprop\(k;s\)\+k2\(1−β\)me−2rσ2\.\\varepsilon\_\{\\mathrm\{loc\}\}\(s\)\\leq\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\+k^\{2\}\(1\-\\beta\)^\{m\}e^\{\-2r\\sigma^\{2\}\}\. Using 1−β≤e−β1\-\\beta\\leq e^\{\-\\beta\}, εloc\(s\)≤εprop\(k;s\)\+k2e−βm−2rσ2\.\\varepsilon\_\{\\mathrm\{loc\}\}\(s\)\\leq\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\+k^\{2\}e^\{\-\\beta m\-2r\\sigma^\{2\}\}\. Finally, if the kk proposer calls are conditionally independent and each succeeds with probability at least α\\alpha, then εprop\(k;s\)≤\(1−α\)k≤e−αk\.\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\\leq\(1\-\\alpha\)^\{k\}\\leq e^\{\-\\alpha k\}\. This proves the theorem\. ∎ Appendix F Proofs and Extensions for Section 6 This appendix contains the derivations behind the blind\-spot discussion in Section 6\. The main text states the exchangeable local version because it is the clearest way to see the message: proposal failure equals a blind\-spot floor plus a finite\-sampling residual\. We record the proof, the heterogeneous\-portfolio extension, finite\-kk rates, and the terminal\-output analogue used by oracle best\-of\-kk benchmarks\. F\.1 Proof of Lemma 2 See 2 Proof\. Fix a reachable valid nonterminal state ss\. Conditional on ZZ, the kk proposal\-success indicators are independent Bernoulli variables with common success probability qs\(Z\)q\_\{s\}\(Z\)\. Therefore the conditional probability that all kk proposals fail is ℙ\(all proposals fail∣Z\)=\(1−qs\(Z\)\)k\.\\mathbb\{P\}\(\\text\{all proposals fail\}\\mid Z\)=\(1\-q\_\{s\}\(Z\)\)^\{k\}\. Taking expectation over ZZ gives εprop\(k;s\)=𝔼\[\(1−qs\(Z\)\)k\]\.\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)=\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\]\. Split the expectation according to whether qs\(Z\)=0q\_\{s\}\(Z\)=0: 𝔼\[\(1−qs\(Z\)\)k\]=ℙ\(qs\(Z\)=0\)\+𝔼\[\(1−qs\(Z\)\)k𝟏\{qs\(Z\)\>0\}\]\.\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\]=\\mathbb\{P\}\(q\_\{s\}\(Z\)=0\)\+\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\\mathbf\{1\}\\\{q\_\{s\}\(Z\)\>0\\\}\]\. This is exactly Bs\+Rk\(s\)B\_\{s\}\+R\_\{k\}\(s\)\. Since \(1−qs\(Z\)\)k𝟏\{qs\(Z\)\>0\}→0\(1\-q\_\{s\}\(Z\)\)^\{k\}\\mathbf\{1\}\\\{q\_\{s\}\(Z\)\>0\\\}\\to 0 pointwise and is bounded by 11, dominated convergence gives Rk\(s\)→0R\_\{k\}\(s\)\\to 0\. Hence εprop\(k;s\)→Bs\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)\\to B\_\{s\}\. ∎ F\.2 Heterogeneous proposer portfolios The exchangeable formula in the main text can be replaced by a heterogeneous portfolio calculation\. This version is useful when prompts, tools, retrieval contexts, decoding policies, or base models define different proposer families\. Proposition 2 \(Heterogeneous portfolio oracle miss\)\. Fix a reachable valid nonterminal state ss\. Suppose the proposal system contains families g=1,…,Gg=1,\\ldots,G, with ngn\_\{g\} calls from family gg\. Conditional on a latent variable ZZ, calls are independent and each call from family gg proposes an action in Soundx\(s\)\\mathrm\{Sound\}\_\{x\}\(s\) with probability qg\(Z\)q\_\{g\}\(Z\)\. Then εprop\(\{ng\};s\)=𝔼\[∏g=1G\(1−qg\(Z\)\)ng\]\.\\varepsilon\_\{\\mathrm\{prop\}\}\(\\\{n\_\{g\}\\\};s\)=\\mathbb\{E\}\\left\[\\prod\_\{g=1\}^\{G\}\(1\-q\_\{g\}\(Z\)\)^\{n\_\{g\}\}\\right\]\. If Q\(Z\):=∑g=1Gngqg\(Z\)Q\(Z\):=\\sum\_\{g=1\}^\{G\}n\_\{g\}q\_\{g\}\(Z\), then εprop\(\{ng\};s\)≤𝔼\[e−Q\(Z\)\]\.\\varepsilon\_\{\\mathrm\{prop\}\}\(\\\{n\_\{g\}\\\};s\)\\leq\\mathbb\{E\}\[e^\{\-Q\(Z\)\}\]\. Consequently, if Q\(Z\)≥λQ\(Z\)\\geq\\lambda except on an event of probability at most η\\eta, then εprop\(\{ng\};s\)≤η\+e−λ\.\\varepsilon\_\{\\mathrm\{prop\}\}\(\\\{n\_\{g\}\\\};s\)\\leq\\eta\+e^\{\-\\lambda\}\. If ng→∞n\_\{g\}\\to\\infty for every family gg, then εprop\(\{ng\};s\)→ℙ\(∀g,qg\(Z\)=0\)\.\\varepsilon\_\{\\mathrm\{prop\}\}\(\\\{n\_\{g\}\\\};s\)\\to\\mathbb\{P\}\(\\forall g,\\ q\_\{g\}\(Z\)=0\)\. Proof\. Conditional on ZZ, all calls are independent\. The probability that all calls from family gg fail is \(1−qg\(Z\)\)ng\(1\-q\_\{g\}\(Z\)\)^\{n\_\{g\}\}, and multiplying over families gives the product formula\. The upper bound follows from 1−u≤e−u1\-u\\leq e^\{\-u\}: ∏g=1G\(1−qg\(Z\)\)ng≤exp\(−∑g=1Gngqg\(Z\)\)=e−Q\(Z\)\.\\prod\_\{g=1\}^\{G\}\(1\-q\_\{g\}\(Z\)\)^\{n\_\{g\}\}\\leq\\exp\\\!\\left\(\-\\sum\_\{g=1\}^\{G\}n\_\{g\}q\_\{g\}\(Z\)\\right\)=e^\{\-Q\(Z\)\}\. If Q\(Z\)≥λQ\(Z\)\\geq\\lambda on an event AA with ℙ\(A\)≥1−η\\mathbb\{P\}\(A\)\\geq 1\-\\eta, then the integrand is at most e−λe^\{\-\\lambda\} on AA and at most 11 on AcA^\{c\}, giving η\+e−λ\\eta\+e^\{\-\\lambda\}\. Finally, for fixed ZZ, the product converges to 11 exactly when all qg\(Z\)=0q\_\{g\}\(Z\)=0, and to 0 otherwise\. Dominated convergence proves the limiting floor\. ∎ This proposition is the formal version of the diversity claim in the main text\. Adding more calls from the same family only helps where that family has positive conditional proposal mass\. Diversifying the portfolio can reduce the blind\-spot floor only when different families cover different latent subpopulations\. F\.3 Finite\-sampling residual and lower\-tail rates The main text writes the local oracle miss as εprop\(k;s\)=Bs\+Rk\(s\),Rk\(s\)=𝔼\[\(1−qs\(Z\)\)k𝟏\{qs\(Z\)\>0\}\]\.\\varepsilon\_\{\\mathrm\{prop\}\}\(k;s\)=B\_\{s\}\+R\_\{k\}\(s\),\\qquad R\_\{k\}\(s\)=\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\\mathbf\{1\}\\\{q\_\{s\}\(Z\)\>0\\\}\]\. The rate at which Rk\(s\)R\_\{k\}\(s\) decays is determined by the lower tail of qs\(Z\)q\_\{s\}\(Z\) near zero\. Proposition 3 \(Lower\-tail bound\)\. Suppose that for some a,C\>0a,C\>0 and all sufficiently small t\>0t\>0, ℙ\(0<qs\(Z\)≤t\)≤Cta\.\\mathbb\{P\}\(0<q\_\{s\}\(Z\)\\leq t\)\\leq Ct^\{a\}\. Then for all sufficiently large kk, Rk\(s\)≤C\(alogkk\)a\+k−a\.R\_\{k\}\(s\)\\leq C\\left\(\\frac\{a\\log k\}\{k\}\\right\)^\{a\}\+k^\{\-a\}\. In particular, if qs\(Z\)≥τ\>0q\_\{s\}\(Z\)\\geq\\tau\>0 whenever qs\(Z\)\>0q\_\{s\}\(Z\)\>0, then Rk\(s\)≤e−τk\.R\_\{k\}\(s\)\\leq e^\{\-\\tau k\}\. Proof\. For the uniform lower bound, simply use \(1−qs\(Z\)\)k≤e−τk\(1\-q\_\{s\}\(Z\)\)^\{k\}\\leq e^\{\-\\tau k\} on \{qs\(Z\)\>0\}\\\{q\_\{s\}\(Z\)\>0\\\}\. For the lower\-tail bound, set tk=alog\(k\)/kt\_\{k\}=a\\log\(k\)/k\. Split Rk\(s\)=𝔼\[\(1−qs\(Z\)\)k𝟏\{0<qs\(Z\)≤tk\}\]\+𝔼\[\(1−qs\(Z\)\)k𝟏\{qs\(Z\)\>tk\}\]\.R\_\{k\}\(s\)=\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\\mathbf\{1\}\\\{0<q\_\{s\}\(Z\)\\leq t\_\{k\}\\\}\]\+\\mathbb\{E\}\[\(1\-q\_\{s\}\(Z\)\)^\{k\}\\mathbf\{1\}\\\{q\_\{s\}\(Z\)\>t\_\{k\}\\\}\]\. The first term is at most ℙ\(0<qs\(Z\)≤tk\)≤Ctka\\mathbb\{P\}\(0<q\_\{s\}\(Z\)\\leq t\_\{k\}\)\\leq Ct\_\{k\}^\{a\}\. For the second term, \(1−q\)k≤e−kq≤e−ktk=k−a\(1\-q\)^\{k\}\\leq e^\{\-kq\}\\leq e^\{\-kt\_\{k\}\}=k^\{\-a\}\. Substituting tk=alog\(k\)/kt\_\{k\}=a\\log\(k\)/k proves the claim\. ∎ Thus oracle boosting is complete up to blind spots, but the finite\-kk cost depends on the lower tail of proposal mass\. Uniform positive mass gives logarithmic sample complexity in 1/ε1/\\varepsilon, polynomial lower tails give polynomial convergence, and exponentially small proposal mass gives exponentially expensive brute\-force boosting\. F\.4 Task\-level oracle curves The theorem in the main text is local: qs\(Z\)q\_\{s\}\(Z\) is the probability of proposing a progressing\-sound next action at state ss\. Benchmarking often uses the terminal\-output analogue\. For a proposer harness PP that samples complete outputs Y∼P\(⋅∣x\)Y\\sim P\(\\cdot\\mid x\), define poracle\(k;P\):=ℙ\(∃i≤k:Rx\(Yi\)=1\)\.p\_\{\\mathrm\{oracle\}\}\(k;P\):=\\mathbb\{P\}\(\\exists i\\leq k:\\ R\_\{x\}\(Y\_\{i\}\)=1\)\. If qP\(Z\):=ℙ\(Rx\(Y\)=1∣Z\)q\_\{P\}\(Z\):=\\mathbb\{P\}\(R\_\{x\}\(Y\)=1\\mid Z\), the same calculation gives poracle\(k;P\)=1−𝔼\[\(1−qP\(Z\)\)k\],limk→∞poracle\(k;P\)=1−ℙ\(qP\(Z\)=0\)\.p\_\{\\mathrm\{oracle\}\}\(k;P\)=1\-\\mathbb\{E\}\[\(1\-q\_\{P\}\(Z\)\)^\{k\}\],\\qquad\\lim\_\{k\\to\\infty\}p\_\{\\mathrm\{oracle\}\}\(k;P\)=1\-\\mathbb\{P\}\(q\_\{P\}\(Z\)=0\)\. This is the terminal\-output version of the blind\-spot floor\. It is the quantity estimated by oracle best\-of\-kk curves in verifier\-backed benchmarks\. If an implemented system is evaluated on the same candidate pool, the oracle\-gap recovery ratio compares its success to this oracle curve\. F\.5 Common\-shock special case A useful closed\-form dependence model is the following\. With probability ρ\\rho, all proposers share a common fate\. In this branch, the whole committee fails with probability 1−α1\-\\alpha\. With probability 1−ρ1\-\\rho, the proposers act independently with marginal success probability α\\alpha, so the whole committee fails only if all kk proposers fail, with probability \(1−α\)k\(1\-\\alpha\)^\{k\}\. Therefore εprop\(k\)=ρ\(1−α\)\+\(1−ρ\)\(1−α\)k\.\\varepsilon\_\{\\mathrm\{prop\}\}\(k\)=\\rho\(1\-\\alpha\)\+\(1\-\\rho\)\(1\-\\alpha\)^\{k\}\. Equivalently, the proposer success probability is αcommittee=1−ρ\(1−α\)−\(1−ρ\)\(1−α\)k\.\\alpha\_\{\\mathrm\{committee\}\}=1\-\\rho\(1\-\\alpha\)\-\(1\-\\rho\)\(1\-\\alpha\)^\{k\}\. As k→∞k\\to\\infty, the failure probability converges to ρ\(1−α\)\\rho\(1\-\\alpha\), the common\-shock floor\. Appendix G Extended Literature Review This appendix expands the main related\-work discussion\. We organize prior work by mechanism rather than by chronology\. The unifying question is how existing methods instantiate, implicitly or explicitly, the four ingredients in our framework: coverage,identifiability,progress,diversity\.\\text\{coverage\},\\qquad\\text\{identifiability\},\\qquad\\text\{progress\},\\qquad\\text\{diversity\}\. The literature already contains many empirical demonstrations that extra inference\-time compute can improve LLM performance: sampling more candidates, voting, ranking, searching, debating, using tools, or adding agents\. Our goal is not to claim novelty for any of these practices\. Rather, the framework in this paper gives a role\-wise account of one mechanism behind the gains and its limits: proposer coverage creates an oracle gap, local identifiability recovers part of that gap, progress composes local decisions into trajectories, and diversity controls the ceiling through shared blind spots\. G\.1 Classical boosting, weak learning, and ensemble methods Classical boosting is the historical source of the weak\-to\-strong amplification analogy\. The PAC weak\-learning theorem of \[45\] shows that weak and strong learnability are equivalent under the classical PAC formalism\. \[17\] and \[18\] developed algorithms for combining weak hypotheses, with AdaBoost becoming the canonical practical method\. Bagging, stacking, and ensemble methods more broadly provide related mechanisms for variance reduction and model combination \[2, 55, 15\]\. The analogy to LLM agent systems is useful but partial\. Classical boosting assumes labeled examples and a weak learner that returns hypotheses in an evaluable label space\. In our setting, there may be no label for an intermediate state unless the task exposes a local verifier\. Moreover, the system must repeatedly generate and select local moves along a trajectory\. Thus the classical weak edge is split into several quantities: local proposal coverage, local identification edge, progress depth, and dependence/diversity\. This is the first way in which verifier\-backed committee search differs from ordinary supervised boosting\. G\.2 LLMs as weak learners, weak supervision, and boosting\-style LLM methods Several papers explicitly use weak\-learning, weak\-supervision, or boosting language for language models\. \[41\] show that prompt\-based language models can act as weak learners inside a classical boosting pipeline for tabular classification\. In their setting, the LLM produces textual summaries or templates that serve as weak classifiers, and boosting is applied in a supervised learning loop\. This is close in terminology but different in object: our system performs black\-box inference\-time search rather than training\-time tabular boosting\. LLMBoost \[10\] is another close terminological neighbor\. It proposes an ensemble fine\-tuning framework that uses cross\-model attention, chain training, error\-suppression objectives, and near\-parallel inference\. This is a representation\-sharing and training/fine\-tuning approach\. By contrast, our framework treats the generator, critic, and comparator as black\-box inference\-time components\. We therefore do not require internal hidden states, cross\-model attention, or parameter updates\. Weak\-to\-strong generalization is also conceptually adjacent\. \[5\] study whether weak model supervision can elicit strong capabilities from a stronger model\. Iterated amplification \[12\] similarly studies how weak experts can supervise stronger learners through recursive decomposition\. These works share our interest in using weak signals to elicit stronger behavior, but the object is different: they study supervision and training protocols, whereas we study fixed\-weight inference\-time committees operating over verifier\-backed search trajectories\. Recent self\-improvement and sharpening\-style works are also related\. The sharpening mechanism studies how a model can use verifier\-like feedback to concentrate probability mass on high\-quality outputs and amortize expensive inference\-time computation \[24\]\. These works are aligned with our interest in eliciting latent capability, but our focus is a theorem for committee search and its ceilings: coverage, local identifiability, progress, and shared blind spots\. G\.3 Multi\-sample inference: self\-consistency, best\-of\-NN, and repeated sampling A large line of work improves LLM reasoning by sampling multiple outputs\. Chain\-of\-thought prompting elicits intermediate reasoning traces \[54\]; self\-consistency then samples multiple reasoning paths and marginalizes over answers rather than using greedy decoding \[52\]\. This is one of the clearest empirical antecedents of our coverage story\. The method works when at least some sampled traces reach the correct answer and the majority/consistency heuristic can identify the right answer\. Rationale\-augmented ensembles sample rationales and aggregate predictions, showing that rationale diversity can improve few\-shot robustness \[53\]\. Universal Self\-Consistency extends the aggregation idea to settings where answers are not trivially comparable by exact match, using an LLM to select the most consistent answer among free\-form candidates \[9\]\. Best\-of\-NN, pass@kk, and oracle\-selection evaluations in code and reasoning instantiate the same principle: repeated sampling reveals latent correct mass\. Large Language Monkeys \[3\] is especially close to our benchmarking section\. It studies repeated sampling as inference\-time compute scaling and measures coverage, the fraction of problems solved by at least one sample\. In verifier\-backed domains such as coding and formal proofs, this coverage can directly translate into performance if correct samples can be identified\. The paper also finds that in domains without automatic verifiers, majority voting and reward models may plateau, leaving a generation\-verification gap\. Our theory formalizes this distinction: oracle best\-of\-kk estimates latent boostable capability, while practical selectors recover only part of the oracle gap\. G\.4 More agents, compound inference systems, and voting laws More Agents Is All You Need \[34\] studies a simple Agent Forest method: instantiate many LLM agents, sample outputs, and aggregate by voting\. The paper reports that performance scales with the number of agents and that the degree of improvement correlates with task difficulty\. This is one of the clearest empirical motivations for our theory\. In our language, adding agents helps when it increases coverage or diversity; it saturates when additional agents are shared\-blind\-spot clones\. \[6\] study scaling laws for compound inference systems, focusing on Vote and Filter\-Vote architectures\. They show that increasing the number of calls can yield non\-monotone performance because easy and hard instances respond differently to majority voting\. This complements our analysis: their theory focuses on flat voting systems and majority aggregation, while ours studies sequential verifier\-backed search with explicit identifiability and progress conditions\. Recent surveys on test\-time scaling, LLM ensembles, and verifier design organize this rapidly expanding landscape \[62, 11, 49\]\. These surveys emphasize that additional inference compute can be allocated in many ways: parallel sampling, sequential refinement, search, verification, aggregation, routing, fusion, or distillation\. Our paper contributes a mechanism\-level theorem for one important slice of that landscape\. G\.5 Verifiers, reward models, process supervision, and judges Verifier\-based selection is a central precedent for our identifiability assumption\. \[13\] train verifiers for math word problems and show that generating many candidate solutions and selecting with a verifier improves GSM8K performance\. This is the flat\-output analogue of our proposer\-plus\-critic decomposition\. Process supervision moves verification from final answers to intermediate steps\. \[39\] compare outcome and process supervision and release PRM800K, demonstrating that process\-supervised reward models can improve mathematical reasoning\. \[51\] construct automatic process supervision for math reasoning and use Math\-Shepherd both for reranking and reinforcement\. These works are strongly aligned with our local\-identifiability story: step\-level verifiers instantiate local signals that can reject or score partial reasoning moves\. LLM\-as\-a\-judge and pairwise ranking methods provide another route to identifiability\. LLM\-Blender \[27\] uses a PairRanker to compare candidate outputs and a GenFuser to synthesize improved outputs\. Multi\-Agent Verification \[37\] scales test\-time compute by adding multiple aspect verifiers and combining them with best\-of\-nn sampling\. Weaver \[43\] studies the generation\-verification gap and combines many weak verifiers using weak supervision, showing that verifier aggregation can recover much of the gap between pass@1 and oracle selection\. Very recent verifier\-ensemble work such as FUSE \[32\] pushes this idea further by ensembling verifiers without ground\-truth labels\. Reward\-model and judge benchmarks show that selection models have their own biases and failure modes\. MT\-Bench and Chatbot Arena use LLM\-as\-a\-judge as a scalable approximation to human preferences \[64\]; RewardBench evaluates reward models for preference modeling and RLHF\-style selection \[31\]; recent surveys map the broader LLM\-as\-a\-judge landscape \[20\]\. These works reinforce a central point of our paper: selection is not a cosmetic add\-on\. It is a separate resource\. Our bridge theorem and black\-box impossibility result formalize this separation: local proposal coverage does not imply a usable critic or comparator edge\. G\.6 Tool\-backed checking, self\-correction, and verification loops Several works improve LLM outputs by adding external feedback, tools, or iterative verification\. CRITIC \[19\] uses tool\-interactive critiquing to validate and revise outputs\. Chain\-of\-Verification \[14\] asks models to draft answers, generate verification questions, answer them independently, and then revise\. Self\-Refine \[40\] iteratively generates feedback and refines outputs without additional training data\. Reflexion \[48\] stores verbal feedback in memory to improve subsequent trials\. Tool\-using systems also change the effective verification and proposal process\. Toolformer \[46\] trains models to decide when and how to call external APIs; HuggingGPT \[47\] uses an LLM as a controller over a collection of external models; ReAct\-style systems interleave reasoning and environment actions\. In our terminology, tools can improve local identifiability, increase proposal coverage, reduce blind spots, or change the decomposition so that progress becomes easier to certify\. These systems instantiate repeated local improvement, but their reliability depends on whether feedback exposes true errors and whether revisions make progress\. Our framework makes those requirements explicit\. A feedback loop is useful only insofar as it increases local identifiability, improves proposal coverage on later attempts, changes the state decomposition, or reduces the cost of reaching the same boostable ceiling\. G\.7 Search over partial states and planning\-style inference Search\-based prompting and agent methods are natural neighbors of our valid\-state framework\. Least\-to\-most prompting decomposes hard problems into easier subproblems solved sequentially \[66\]\. Tree of Thoughts generalizes chain\-of\-thought into a search over intermediate thoughts, using evaluation and backtracking \[60\]\. Self\-evaluation\-guided beam search uses stepwise self\-evaluation to guide stochastic beam search through reasoning chains \[58\]\. RAP uses the LLM as both world model and reasoning agent, and applies Monte Carlo tree search \[22\]\. LATS integrates reasoning, acting, planning, self\-reflection, and MCTS\-style search \[65\]\. ReAct interleaves reasoning traces with external actions \[61\]\. These works already embody the idea that reasoning is not merely one\-shot generation\. Our paper contributes a set of sufficient and necessary conditions under which such search\-like systems can amplify weak local reasoning\. In particular, decomposition alone is not enough\. The decomposition must expose progressing\-sound actions that the proposer can sample and that local signals can identify\. G\.8 General multi\-agent orchestration, routing, and model aggregation A broad multi\-agent literature studies how LLM agents should communicate, specialize, coordinate, route tasks, and use tools\. CAMEL introduces role\-playing communicative agents \[33\]; AutoGen provides a programmable multi\-agent conversation framework \[56\]; MetaGPT encodes standardized operating procedures for collaborative software development \[23\]; ChatDev studies communicative agents for software development workflows \[42\]; AgentVerse studies multi\-agent collaboration and emergent behavior \[8\]\. Mixture\-of\-Agents constructs layered multi\-model aggregation where each layer conditions on outputs from the previous layer \[50\]\. Archon searches over inference\-time architectures combining generation, ranking, fusion, verification, and other inference\-time techniques under compute budgets \[44\]\. Routing methods such as Smoothie study how to choose among LLMs without labeled data or with weak supervision \[21\]\. These works explore design spaces rather than proving a universal theory\. Our framework offers a diagnostic lens for them\. A role or subagent is valuable if it improves one of the load\-bearing quantities: coverage, identifiability, progress, or diversity per unit cost\. Additional agents are not inherently useful; they help if they reduce blind spots, provide independent evidence, improve selection, expose new decompositions, or approach the same boostable ceiling with lower budget\. G\.9 Software\-engineering agents and verifier\-backed code benchmarks Software engineering is one of the best testbeds for our theory because candidate artifacts can often be checked by execution, tests, type checkers, or repository\-specific validators\. Codex \[7\] and AlphaCode \[35\] already showed the importance of sampling, filtering, and program behavior for code generation; AlphaCode in particular generated many candidate programs, filtered by behavior, and clustered candidates before submission\. SWE\-bench \[28\] introduced realistic GitHub issue resolution tasks that require repository navigation, code editing, and execution\-based validation\. SWE\-agent \[59\] shows that agent\-computer interface design can substantially improve software\-engineering agents\. AutoCodeRover \[63\] combines LLMs with code search and program analysis\. Agentless \[57\] argues that simpler localization\-repair\-validation pipelines can achieve strong SWE\-bench Lite performance with lower cost than more complex autonomous agents\. These systems support one of our practical implications: better agent systems are not necessarily more elaborate systems\. They are systems that move the coverage\-identifiability\-progress\-diversity frontier efficiently\. A simple pipeline with strong localization and validation can beat a more ornate agent if it recovers the oracle gap with less cost\. G\.10 Inference\-time alignment, reward\-model limitations, and verifier aggregation Inference\-time alignment studies how to use additional compute and imperfect reward models to improve responses without updating the base model\. \[25\] analyze best\-of\-NN alignment and show that coverage over high\-quality responses is crucial; they also show that naively increasing NN can suffer from reward hacking under imperfect reward models\. This is conceptually close to our separation between oracle best\-of\-kk and implemented selection\. In our terminology, a large oracle gap indicates latent correct mass, but a weak selector may fail to recover it or may select reward\-hacked candidates\. Verifier aggregation methods such as Weaver and FUSE \[43, 32\] are especially close to our identification story\. They treat imperfect verifiers as weak signals and combine them to approach stronger verification behavior\. Our theory abstracts this into critic/comparator edges and highlights the missing robustness question: repeated verifiers can also share blind spots, just as repeated proposers can\. Reward\-model and judge benchmarks such as RewardBench and LLM\-as\-a\-judge evaluations further show that selection models have their own biases and failure modes \[31, 64, 20\]\. This reinforces the need to treat identifiability as a separate assumption rather than deriving it from generation\. G\.11 Benchmarking implications The literature increasingly evaluates models and systems using pass@1, best\-of\-NN, pass@kk, oracle selection, verifier selection, and budgeted success\. Our framework suggests a unified reporting scheme for verifier\-backed benchmarks: p1\(P\),poracle\(K;P\),psystem\(K,m,r;P\),Rec\(K,m,r;P\),p\_\{1\}\(P\),\\qquad p\_\{\\mathrm\{oracle\}\}\(K;P\),\\qquad p\_\{\\mathrm\{system\}\}\(K,m,r;P\),\\qquad\\mathrm\{Rec\}\(K,m,r;P\), together with token/call/latency budgets\. This separates one\-shot capability, boostable capability, selector quality, and efficiency\. Prior work often reports one or two of these quantities; our theory explains why all four matter\. In particular, oracle best\-of\-kk measures latent proposal mass, residual oracle failure estimates blind\-spot mass, implemented system performance measures harness quality, and budgeted curves measure how efficiently the system approaches the boostable ceiling\.
Similar Articles
Introducing BenchBench (5 minute read)
Introduces BenchBench, a benchmark that tests AI models' ability to create effective benchmarks for other models, with GPT 5.2 being the only successful winner so far while frontier models like GPT 5.5 and Opus 4.6 struggled.
@berryxia: Small model, big wisdom? It's now real! A 7B small model now acts as the boss of top large models like GPT-5, Claude Sonnet 4, Gemini 2.5 Pro. A new paper shows an RL-trained 7B model learned to write natural language subtasks, assign them to different models, precisely...
A new paper proposes training a 7B small model via reinforcement learning as a task scheduler, automatically decomposing subtasks and assigning them to top models like GPT-5 and Claude. It surpasses individual frontier models on several hard benchmarks, demonstrating that end-to-end reward learning can effectively replace manual prompt engineering and multi-agent pipeline design.
@omarsar0: The efficiency frontier! Where do you think GPT-5.6 will land?
Discussion of recent benchmark results for Claude Opus 4.8 and GPT-5.5 on DeepSWE Bench, with speculation about future GPT-5.6 performance and efficiency trends.
Introducing GPT-5 for developers
OpenAI releases GPT-5 in their API platform, a state-of-the-art model achieving 74.9% on SWE-bench Verified and excelling at coding, agentic tasks, and long-context reasoning. The release includes three model sizes (gpt-5, gpt-5-mini, gpt-5-nano) and new API features like verbosity control, minimal reasoning mode, and custom tools.
Introducing GPT-5.1 for developers
OpenAI releases GPT-5.1, a new model in the GPT-5 series that dynamically adapts thinking time based on task complexity, offering 2-3x faster performance than GPT-5 while maintaining frontier intelligence. The release includes extended prompt caching (24-hour retention), new coding tools (apply_patch and shell), and a 'no reasoning' mode for latency-sensitive applications.