CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

arXiv cs.CL Papers

Summary

CP-Agent presents a calibrated risk-controlled approach for feedback-driven competitive programming using large language models, achieving significant improvements on benchmarks without parameter updates.

arXiv:2605.24693v1 Announce Type: new Abstract: Large language models still struggle with contest-level programming, while many agentic remedies rely on massive inference-time sampling or expensive multi-stage post-training. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains. We model feedback-driven solving as a calibrated stopped process and identify three quantities: false-admission risk, program-level evidence against bad programs, and the active-state success hazard. Under held-out trace calibration and selection from a pre-declared finite controller manifest, the resulting structural certificate lower-bounds the clean success probability before false admission. We instantiate mechanisms targeting these quantities as Dual-Granularity Verification, Test Augmentation, and Experience-Driven Self-Evolving, yielding CP-Agent. Without updating any parameters, CP-Agent raises Pass@1 from 25.8\% to 48.5\% on LiveCodeBench Pro and improves Refine@5 by 11.0\% on ICPC-Eval. Across three LLM backbones, CP-Agent lies on the cost--accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity.
Original Article
View Cached Full Text

Cached at: 05/26/26, 09:05 AM

# CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming
Source: [https://arxiv.org/html/2605.24693](https://arxiv.org/html/2605.24693)
,Bowen LiuThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChina,Zehua LiThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChina,Yuyao WangThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChina,Zhiwei MaThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChina,Yuhan LiThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChinaandJia LiThe Hong Kong University of Science and Technology \(Guangzhou\)GuangzhouChina

\(2026\)

###### Abstract\.

Large language models still struggle with contest\-level programming, while many agentic remedies rely on massive inference\-time sampling or expensive multi\-stage post\-training\. We study when execution feedback reliably helps an LLM CP solver and which mechanisms govern the gains\. We model feedback\-driven solving as a calibrated stopped process and identify three quantities: false\-admission risk, program\-level evidence against bad programs, and the active\-state success hazard\. Under held\-out trace calibration and selection from a pre\-declared finite controller manifest, the resulting structural certificate lower\-bounds the clean success probability before false admission\. We instantiate mechanisms targeting these quantities as Dual\-Granularity Verification, Test Augmentation, and Experience\-Driven Self\-Evolving, yieldingCP\-Agent\. Without updating any parameters,CP\-Agentraises Pass@1 from 25\.8% to 48\.5% on LiveCodeBench Pro and improves Refine@5 by 11\.0% on ICPC\-Eval\. Across three LLM backbones,CP\-Agentlies on the cost–accuracy efficiency frontier, and ablations show that each component primarily affects its corresponding certificate quantity\.

competitive programming, large language models, agentic reasoning, execution feedback, code generation

††copyright:none††journalyear:2026††doi:XXXXXXX\.XXXXXXX††conference:Make sure to enter the correct conference title from your rights confirmation email; June 03–05, 2018; Woodstock, NY††isbn:978\-1\-4503\-XXXX\-X/2018/06††ccs:Computing methodologies Natural language processing††ccs:Mathematics of computing Stochastic processes![Refer to caption](https://arxiv.org/html/2605.24693v1/x1.png)Figure 1\.Multi\-backbone Pass@1 versus cost on LiveCodeBench Pro for DeepSeek\-V3\.2\-Chat, GPT\-4o, and Qwen3\-235B\-A22B\-Instruct, with the VRS/BCC \(DEA\) efficiency frontier shown as the upper envelope of each backbone\. Cost is the average API cost per problem on a log scale\.CP\-Agentlies on the efficiency frontier for all three backbones\.A scatter plot with average API cost per problem on a logarithmic x\-axis ranging from about 0\.001 to 0\.05 US dollars, and Pass@1 accuracy on the y\-axis ranging from 0 to about 55 percent\. Points are grouped into three backbones using different colors and marker shapes: DeepSeek\-V3\.2\-Chat, GPT\-4o, and Qwen3\-235B\-A22B\-Instruct\. For each backbone, a piecewise\-linear upper\-envelope curve connects its most cost\-efficient methods\. Within every backbone the CP\-Agent point sits on this envelope in the upper\-right region; baselines such as Reflexion lie on the envelope only for DeepSeek\-V3\.2\-Chat and fall strictly below it for the other backbones, indicating that CP\-Agent attains the highest Pass@1 at comparable per\-problem cost across all three backbones\.## 1\.Introduction

Large language models \(LLMs\) now perform well across a broad range of downstream tasks, including code generation and everyday software\-engineering assistance\. Contest\-level competitive programming \(CP\), however, remains a sharp failure case: on ICPC/IOI\-style benchmarks\(Jainet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib23); Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2); Zhenget al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib3)\), even strong LLMs reach below 30% Pass@1 \(Section[4](https://arxiv.org/html/2605.24693#S4)\)\. CP is judged by hidden correctness and asymptotic feasibility under strict time and memory limits, so plausible implementations earn no credit when the algorithm or an edge case is wrong, and software\-engineering virtues such as maintainability or modular design do not compensate\.

Recent work has explored two directions: \(i\) tackling competitive programming problems through massive code generation combined with heuristic\-based inference and execution\-based pruning, exemplified by AlphaCode\(Liet al\.,[2022](https://arxiv.org/html/2605.24693#bib.bib25)\), AlphaCode2\(Leblondet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib39)\), and o1\-ioi\(El\-Kishkyet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib12)\); \(ii\) enhancing the intrinsic reasoning ability of LLMs, including large reasoning models\(Guoet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib16); El\-Kishkyet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib12); OpenAI,[2025b](https://arxiv.org/html/2605.24693#bib.bib38); Liuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib26)\)that are often trained with reinforcement learning\. While these directions improve performance, they often come with substantial computational cost\-large\-scale sampling during inference, or expensive rollouts during training\.

CP offers rich, executable feedback: compilation errors, test failures, and counterexamples directly expose concrete failure modes\(Shinnet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib30)\)\. This makes agentic solving a natural fit\. Two gaps, however, remain for agentic CP\. First, the literature disagrees on which mechanisms reliably help\. MapCoder\(Islamet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib21)\)reports that additional test\-case generation is unhelpful while AlphaCodium\(Ridniket al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib29)\)reports gains; MapCoder and Oi\-Assistant\(Yuhuaet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib37)\)find retrieval of similar problems useful, while AlgoSimBench\(Li and Mooney,[2025](https://arxiv.org/html/2605.24693#bib.bib24)\)finds it unstable\. Second, recent strong agentic CP systems increasingly invest in multi\-stage agentic post\-training\(Teamet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib32)\)\. In contrast, for training\-free LLM agents on CP, it remains unclear whether tool orchestration alone improves performance and which mechanisms drive its gains\.

These gaps converge to one essential question: which controllable mechanisms turn execution feedback into measurable gains on contest\-level CP? We answer it through a calibrated risk\-control view of feedback\-driven solving\. We analyze feedback\-driven CP solving as a calibrated, risk\-controlled stopped process\. Its state reduces to three certificate quantities: the residual bad\-candidate envelopeρt\\rho\_\{t\}, the accumulated program\-level evidenceItI\_\{t\}, and the active\-state success hazardh¯t\\underline\{h\}\_\{t\}\. Each is moved by exactly one abstract action class:RiskProbeonρt\\rho\_\{t\},EvidenceAcquireonItI\_\{t\}, andContextAcquireonh¯t\\underline\{h\}\_\{t\}\. The calibrated false\-admission UCBq¯t≈ρ¯t​exp⁡\(−ℐt\)\\bar\{q\}\_\{t\}\\approx\\bar\{\\rho\}\_\{t\}\\exp\(\-\\mathcal\{I\}\_\{t\}\)follows as a derived quantity\. This one\-to\-one reduction yields a unified design\-and\-selection framework: for any pre\-declared finite manifest of controllers, it supplies a post\-selection\-valid lower bound on clean\-before\-false success\. Guided by this framework, we designCP\-Agentwith one controller per action class: Dual\-Granularity Verification contractsρt\\rho\_\{t\}, Test Augmentation adds toItI\_\{t\}, and Experience\-Driven Self\-Evolving raisesh¯t\\underline\{h\}\_\{t\}\. A frozen admission gate and a refinement loop complete the agent\. Extensive empirical results across three backbones on two contest\-level benchmarks, together with theory\-grounded mechanism ablations on the five\-controller calibration class, jointly validateCP\-Agent’s effectiveness and its calibrated certificate\.

Our contributions are threefold:

- •A calibrated risk\-control theory for feedback\-driven CP\.We prove an admission\-risk theorem and a simultaneously valid finite\-manifest selection rule over five abstract action classes \(Section[2](https://arxiv.org/html/2605.24693#S2)\)\.
- •A theory\-induced agent\.We instantiate this controller asCP\-Agent, a tool\-using framework with no parameter updates \(Section[3](https://arxiv.org/html/2605.24693#S3)\)\.111Our implementation is open\-sourced at[https://github\.com/NineAbyss/CP\-Agent](https://github.com/NineAbyss/CP-Agent)\.
- •Trace\-calibrated mechanism evidence\.On LiveCodeBench Pro and ICPC\-Eval across three backbones,CP\-Agentraises Pass@1 from 25\.8% to 48\.5% and Refine@5 by\+11\.0\+11\.0%; calibration attributes the gains to TA \(evidence\), HP \(bad\-plan survival\), and Experience \(success hazard\) \(Section[4](https://arxiv.org/html/2605.24693#S4)\)\.

![Refer to caption](https://arxiv.org/html/2605.24693v1/x2.png)Figure 2\.Overview ofCP\-Agent\. The risk\-control theory defines a certificate state and abstract action classes; the concreteCP\-Agentcontroller is declared before target testing and audited on held\-out traces through the calibrated lower\-bound objective\.CP\-Agentimplements these handles with Dual\-Granularity Verification, Test Augmentation, and Experience\-Driven Self\-Evolving\.A wide block diagram with three horizontal layers connected by arrows\. The top layer is the theoretical certificate state, listing the three quantities rho\_t \(residual bad\-candidate envelope\), I\_t \(program\-level evidence in nats\), and underline h\_t \(clean\-success hazard\), each linked to one abstract action class: RiskProbe acts on rho\_t, EvidenceAcquire acts on I\_t, and ContextAcquire acts on underline h\_t\. The middle layer is the CP\-Agent controller, with three modules drawn as boxes: a Hypothesis Prober that compiles and runs short C\+\+ snippets on the left, a Solution Validator and Test Augmentation pair in the middle that runs candidates against public tests and consensus\-filtered augmented tests, and an Experience\-Driven Self\-Evolving memory bank on the right that retrieves frozen Fail\-to\-Pass triplets\. Arrows from each middle\-layer module point up to its matching certificate quantity\. The bottom layer shows the temporal data\-split pipeline as three boxes in a row: D\_hist\_mem for memory construction, D\_cal for held\-out calibration and controller selection, and D\_test for target evaluation, with horizontal arrows indicating that the controller manifest and the memory snapshot M\-star are frozen before D\_cal and D\_test\.
## 2\.A Calibrated Risk\-Controlled Theory of Feedback Control

A feedback\-driven CP agent repeatedly faces the same decision: admit the current candidate, refine it, or acquire more information\. We specify certificate quantities that audit this decision, abstract action classes that act on them, and a calibration rule that turns them into a frozen controller; the takeaway box summarizes the role of the analysis and Appendix[B](https://arxiv.org/html/2605.24693#A2)consolidates the Section[2](https://arxiv.org/html/2605.24693#S2)notation\.

###### Definition 2\.1 \(Controlled stopped process and certificate state\)\.

Fix a horizonTT, false\-admission budgetα\\alpha, and a frozen policyπ\\pi\. Throughout, “steptt” denotes a*refinement step*—one response–revision attempt \(oneRefineadvance, equivalently one outer\-loop iteration of Algorithm[1](https://arxiv.org/html/2605.24693#alg1)\), matched to Refine@KKand Pass@11\. Within\-trajectory interaction events such as individual LLM outputs or tool calls are termed*interaction steps*and appear only in diagnostic figures \(e\.g\., Figure[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)\)\. Before refinement steptt, the trajectory is active whenAt=\{τS\>t−1,τF\>t−1\}A\_\{t\}=\\\{\\tau\_\{S\}\>t\-1,\\tau\_\{F\}\>t\-1\\\}\. On active trajectories, the implementation state isXtX\_\{t\}\(problem text, candidate code, feedback, retrieved memory, and tool outputs\)\. The certificate exposes a lower\-dimensional summary state

Zt:=ϕ​\(Xt\)=\(ρt,It,h¯t,t\),Z\_\{t\}:=\\phi\(X\_\{t\}\)=\(\\rho\_\{t\},I\_\{t\},\\underline\{h\}\_\{t\},t\),whereρt\\rho\_\{t\}is an upper envelope on residual bad\-candidate mass,ItI\_\{t\}is accumulated program\-level evidence, andh¯t\\underline\{h\}\_\{t\}is a lower envelope on clean\-success hazard\. The residual admission\-risk score is

\(1\)rt​\(Zt\):=ρt​exp⁡\(−It\)\.r\_\{t\}\(Z\_\{t\}\):=\\rho\_\{t\}\\exp\(\-I\_\{t\}\)\.Hereρt\\rho\_\{t\}upper\-bounds the bad\-candidate probability after upstream screening andexp⁡\(−It\)\\exp\(\-I\_\{t\}\)upper\-bounds the probability that such a candidate survives the evidence gate, so their product upper\-bounds false admission;ItI\_\{t\}is in nats so evidence accumulates additively asIt=∑s≤tΔ​IsI\_\{t\}=\\sum\_\{s\\leq t\}\\Delta I\_\{s\}\. LetΓt∈\{0,1\}\\Gamma\_\{t\}\\in\\\{0,1\\\}denote the controller’s observable admission gate at steptt, the binary admit/reject signal logged on every active trace\. If the controller admits a candidate through gateΓt\\Gamma\_\{t\}, hidden adjudication defines

Ft:=𝟏\{At,Γt=1,Btϵ\},St:=𝟏\{At,Γt=1,¬Btϵ\},F\_\{t\}:=\\mathbf\{1\}\\\{A\_\{t\},\\Gamma\_\{t\}=1,B\_\{t\}^\{\\epsilon\}\\\},\\qquad S\_\{t\}:=\\mathbf\{1\}\\\{A\_\{t\},\\Gamma\_\{t\}=1,\\neg B\_\{t\}^\{\\epsilon\}\\\},whereBtϵB\_\{t\}^\{\\epsilon\}denotes the event that the admitted candidate fails the deterministic hidden\-judge on at least one hidden test, which we treat as theϵ\\epsilon\-significant error event under the benchmark’s hidden\-test distribution\. The stopping times areτF:=inf\{t:Ft=1\}\\tau\_\{F\}:=\\inf\\\{t:F\_\{t\}=1\\\}andτS:=inf\{t:St=1\}\\tau\_\{S\}:=\\inf\\\{t:S\_\{t\}=1\\\}; the certified event is\{τS≤T,τS<τF\}\\\{\\tau\_\{S\}\\leq T,\\tau\_\{S\}<\\tau\_\{F\}\\\}\.

###### Definition 2\.2 \(Abstract action classes\)\.

We choose action classes to span the certificate: each certificate quantity inZtZ\_\{t\}admits one action that can move it in the direction of a larger lower bound—RiskProbeonρt\\rho\_\{t\},EvidenceAcquireonItI\_\{t\}, andContextAcquireonh¯t\\underline\{h\}\_\{t\}—together withStopandRefinefor the commit/advance decision\. The controller chooses actions from

\(2\)𝒜\(Zt\)=\{\\displaystyle\\mathcal\{A\}\(Z\_\{t\}\)=\\\{Stop,Refine,RiskProbe,\\displaystyle\\textsf\{Stop\},\\textsf\{Refine\},\\textsf\{RiskProbe\},EvidenceAcquire\(m\),ContextAcquire\(k\)\}\.\\displaystyle\\textsf\{EvidenceAcquire\}\(m\),\\textsf\{ContextAcquire\}\(k\)\\\}\.Each action acts on the certificate as follows:

\(3\)Stop:\\displaystyle\\textsf\{Stop\}:\\Γtα=𝟏​\{rt​\(Zt\)≤θ^t,α\},\\displaystyle\\Gamma\_\{t\}^\{\\alpha\}=\\mathbf\{1\}\\\{r\_\{t\}\(Z\_\{t\}\)\\leq\\hat\{\\theta\}\_\{t,\\alpha\}\\\},\(4\)RiskProbe:\\displaystyle\\textsf\{RiskProbe\}:\\ρt\+≤λt​\(Zt\)​ρt,λt​\(Zt\)∈\[0,1\],\\displaystyle\\rho\_\{t\+\}\\leq\\lambda\_\{t\}\(Z\_\{t\}\)\\,\\rho\_\{t\},\\quad\\lambda\_\{t\}\(Z\_\{t\}\)\\in\[0,1\],\(5\)EvidenceAcquire​\(m\):\\displaystyle\\textsf\{EvidenceAcquire\}\(m\):\\It\+=It\+Δ​It​\(m,Zt\),\\displaystyle I\_\{t\+\}=I\_\{t\}\+\\Delta I\_\{t\}\(m,Z\_\{t\}\),\(6\)ContextAcquire​\(k\):\\displaystyle\\textsf\{ContextAcquire\}\(k\):\\h¯t\+≥h¯t\+Δ​ht​\(k,Zt\),\\displaystyle\\underline\{h\}\_\{t\+\}\\geq\\underline\{h\}\_\{t\}\+\\Delta h\_\{t\}\(k,Z\_\{t\}\),whereρt\+,It\+,h¯t\+\\rho\_\{t\+\},I\_\{t\+\},\\underline\{h\}\_\{t\+\}denote the post\-action certificate components, andRefineadvances to the next candidate distribution without admitting the current candidate\. The integersmmandkkare action intensities, e\.g\., evidence budget and context\-retrieval budget\.

###### Assumption 2\.3 \(Held\-out exchangeability\)\.

Calibration and evaluation problem trajectories are independent draws from the same benchmark distribution\. Independence is required across problem trajectories at a fixed steptt; it is*not*required across refinement steps within a trajectory\.

###### Definition 2\.4 \(Controller manifest\)\.

The controller manifest is

π:=\(Γ1:Tobs,𝒰probe,m,γevid,kgen,ℓalg,Rroute,Ddedup,\\displaystyle\\pi=\\bigl\(\\Gamma^\{\\rm obs\}\_\{1:T\},\\,\\mathcal\{U\}\_\{\\rm probe\},\\,m,\\,\\gamma\_\{\\rm evid\},\\,k\_\{\\rm gen\},\\,\\ell\_\{\\rm alg\},\\,R\_\{\\rm route\},\\,D\_\{\\rm dedup\},pprompt,ddecode,T,M⋆,ωwrite\)\\displaystyle p\_\{\\rm prompt\},\\,d\_\{\\rm decode\},\\,T,\\,M^\{\\star\},\\,\\omega\_\{\\rm write\}\\bigr\)containing: the observable admission gateΓ1:Tobs\\Gamma^\{\\rm obs\}\_\{1:T\}; theRiskProbetool interface𝒰probe\\mathcal\{U\}\_\{\\rm probe\}and its model\-controlled invocation policy; theEvidenceAcquireintensitymmand aggregation ruleγevid\\gamma\_\{\\rm evid\}; the twoContextAcquireretrieval channels—general\-bank injection budgetkgenk\_\{\\rm gen\}and algorithmic\-bucket retrieval budgetℓalg\\ell\_\{\\rm alg\}\(so the action intensitykkin Eq\. \([6](https://arxiv.org/html/2605.24693#S2.E6)\) is realized as the pair\(kgen,ℓalg\)\(k\_\{\\rm gen\},\\ell\_\{\\rm alg\}\)\); the retrieval routingRrouteR\_\{\\rm route\}; the deduplication ruleDdedupD\_\{\\rm dedup\}; the prompt templateppromptp\_\{\\rm prompt\}; the decoding parametersddecoded\_\{\\rm decode\}; the horizonTT; the memory snapshotM⋆M^\{\\star\}; and the write modeωwrite\\omega\_\{\\rm write\}\. For every certificate\-bearing result,ωwrite=audit\-only\\omega\_\{\\rm write\}=\\textsf\{audit\-only\}, i\.e\.,M⋆M^\{\\star\}is read\-only and evaluation\-time experiences are written only to the audit logM~\\widetilde\{M\}\. Each entry ofπ\\piis either pre\-declared or selected on held\-out traces, and all entries are frozen before final evaluation under the splitDhistmem→Dcal→DtestD\_\{\\rm hist\}^\{\\rm mem\}\\to D\_\{\\rm cal\}\\to D\_\{\\rm test\}withM⋆M^\{\\star\}built fromDhistmemD\_\{\\rm hist\}^\{\\rm mem\}\.

##### Calibration estimators\.

For each fixedπ\\piand steptt, letntπn\_\{t\}^\{\\pi\}active calibration traces containftπf\_\{t\}^\{\\pi\}false admissions andstπs\_\{t\}^\{\\pi\}clean successes\. Under Assumption[2\.3](https://arxiv.org/html/2605.24693#S2.Thmtheorem3)the counts are binomial conditional onntπn\_\{t\}^\{\\pi\}, so we use the exact one\-sided Clopper–Pearson intervals\(Clopper and Pearson,[1934](https://arxiv.org/html/2605.24693#bib.bib11)\)

\(7\)q¯t,rawπ:=UCBBinom​\(ftπ,ntπ,δ2​T\),h¯tπ:=LCBBinom​\(stπ,ntπ,δ2​T\)\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}:=\\mathrm\{UCB\}\_\{\\rm Binom\}\\\!\\left\(f\_\{t\}^\{\\pi\},n\_\{t\}^\{\\pi\},\\tfrac\{\\delta\}\{2T\}\\right\),\\quad\\underline\{h\}\_\{t\}^\{\\pi\}:=\\mathrm\{LCB\}\_\{\\rm Binom\}\\\!\\left\(s\_\{t\}^\{\\pi\},n\_\{t\}^\{\\pi\},\\tfrac\{\\delta\}\{2T\}\\right\)on the deployment active\-state hazardsqtπ:=Prπ⁡\(Ft=1∣At\)q\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(F\_\{t\}\{=\}1\\mid A\_\{t\}\)andhtπ:=Prπ⁡\(St=1∣At\)h\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(S\_\{t\}\{=\}1\\mid A\_\{t\}\), jointly valid fort≤Tt\\leq Tby a union bound \(proof of Theorem[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6)\); operationally, both bounds share the same active\-row denominatorntπn\_\{t\}^\{\\pi\}onDcalD\_\{\\rm cal\}\.

###### Theorem 2\.5\(Calibrated admission gate\)\.

LetΘt\\Theta\_\{t\}be a finite grid of admission thresholds for the risk score in Eq\. \([1](https://arxiv.org/html/2605.24693#S2.E1)\)\. On held\-out calibration traces, letnt​\(θ\)n\_\{t\}\(\\theta\)be the number of active candidate admissions withrt​\(Zt\)≤θr\_\{t\}\(Z\_\{t\}\)\\leq\\theta, and letft​\(θ\)f\_\{t\}\(\\theta\)be the number of such admissions that hidden evaluation rejects\. Define

R¯t​\(θ\)=UCBBinom​\(ft​\(θ\),nt​\(θ\),δgateT​\|Θt\|\)\\bar\{R\}\_\{t\}\(\\theta\)=\\mathrm\{UCB\}\_\{\\rm Binom\}\\\!\\left\(f\_\{t\}\(\\theta\),n\_\{t\}\(\\theta\),\\frac\{\\delta\_\{\\mathrm\{gate\}\}\}\{T\|\\Theta\_\{t\}\|\}\\right\)and select

\(8\)θ^t,α=max⁡\{θ∈Θt:R¯t​\(θ\)≤αt\},\\hat\{\\theta\}\_\{t,\\alpha\}=\\max\\\{\\theta\\in\\Theta\_\{t\}:\\bar\{R\}\_\{t\}\(\\theta\)\\leq\\alpha\_\{t\}\\\},withθ^t,α:=−∞\\hat\{\\theta\}\_\{t,\\alpha\}\{:=\}\-\\infty,Γtα≡0\\Gamma\_\{t\}^\{\\alpha\}\{\\equiv\}0if the set is empty, and the conservative conventionsUCBBinom​\(0,0,η\):=1\\mathrm\{UCB\}\_\{\\rm Binom\}\(0,0,\\eta\)\{:=\}1and zero\-probability conditioning evaluated as0\. Then, with probability at least1−δgate1\-\\delta\_\{\\mathrm\{gate\}\}over calibration, the gateΓtα=𝟏​\{rt​\(Zt\)≤θ^t,α\}\\Gamma\_\{t\}^\{\\alpha\}=\\mathbf\{1\}\\\{r\_\{t\}\(Z\_\{t\}\)\\leq\\hat\{\\theta\}\_\{t,\\alpha\}\\\}satisfies

Pr⁡\(Ft=1∣At,Γtα=1\)≤αt\\Pr\(F\_\{t\}=1\\mid A\_\{t\},\\Gamma\_\{t\}^\{\\alpha\}=1\)\\leq\\alpha\_\{t\}simultaneously for allt≤Tt\\leq Tand all data\-dependentθ^t,α∈Θt\\hat\{\\theta\}\_\{t,\\alpha\}\\in\\Theta\_\{t\}\.

##### Controller certificate\.

Theorem[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5)controls risk\-threshold stopping when a score threshold is part of the manifest, and audits the frozen observable gate otherwise\. To select information\-acquisition actions, we compare two separately calibrated UCBs on the active false\-admission hazard — the*raw*bound from admitted\-failure counts and the*mechanism\-factorized*bound that multiplies probe\-survival and evidence\-gate\-survival rates — and take their minimum \(the mechanism factor wins when the probe and evidence channel together reject most of the bad mass; the raw bound wins when calibration data is sparse, since unfit mechanism factors fall back to11\):

q¯t,mechπ\\displaystyle\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}=ρ¯tprobe,π​exp⁡\[−ℐtevid,π\],\\displaystyle=\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\exp\\\!\\bigl\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\\bigr\],\(9\)q¯t,ctrlπ\\displaystyle\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}=min⁡\{q¯t,rawπ,q¯t,mechπ\}\.\\displaystyle=\\min\\\!\\left\\\{\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\},\\,\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\\right\\\}\.Hereq¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}is the Clopper–Pearson UCB onqtπq\_\{t\}^\{\\pi\}from active calibration trajectories\. For the mechanism factor, letWtprobe,Wtevid∈\{0,1\}W\_\{t\}^\{\\rm probe\},W\_\{t\}^\{\\rm evid\}\\in\\\{0,1\\\}denote survival through the upstream risk\-probe stage and the evidence gate, respectively\. Since\{Ft=1\}⊆\{Btϵ,\\\{F\_\{t\}\{=\}1\\\}\\subseteq\\\{B\_\{t\}^\{\\epsilon\},Wtprobe=1,W\_\{t\}^\{\\rm probe\}\{=\}1,Wtevid=1\}W\_\{t\}^\{\\rm evid\}\{=\}1\\\}onAtA\_\{t\}, the chain rule of conditional probabilities \(not independence\) gives

Prπ⁡\(Ft=1∣At\)≤Prπ⁡\(Btϵ,Wtprobe=1∣At\)⏟≤ρ¯tprobe,π​\(Prop\.[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)\)⋅Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)⏟≤exp⁡\[−ℐtevid,π\]​\(Prop\.[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)\)\.\\Pr\_\{\\pi\}\(F\_\{t\}\{=\}1\\mid A\_\{t\}\)\\leq\\underbrace\{\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},W\_\{t\}^\{\\rm probe\}\{=\}1\\mid A\_\{t\}\)\}\_\{\\leq\\,\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\ \\text\{\(Prop\.~\\ref\{prop:probe\-odds\}\)\}\}\\cdot\\underbrace\{\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\rm evid\}\{=\}1\\mid B\_\{t\}^\{\\epsilon\},W\_\{t\}^\{\\rm probe\}\{=\}1,A\_\{t\}\)\}\_\{\\leq\\,\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\]\\ \\text\{\(Prop\.~\\ref\{prop:evidence\-acquire\}\)\}\}\.We splitδ=δraw\+δgate\+δprobe\+δevid\\delta=\\delta\_\{\\mathrm\{raw\}\}\+\\delta\_\{\\mathrm\{gate\}\}\+\\delta\_\{\\mathrm\{probe\}\}\+\\delta\_\{\\mathrm\{evid\}\}, allocatingδraw/\(2​T\)\\delta\_\{\\mathrm\{raw\}\}/\(2T\)to each ofq¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}andh¯tπ\\underline\{h\}\_\{t\}^\{\\pi\}\(Eqs\. \([7](https://arxiv.org/html/2605.24693#S2.E7)\), \([10](https://arxiv.org/html/2605.24693#S2.E10)\)\),δgate\\delta\_\{\\mathrm\{gate\}\}to the threshold grid \(Theorem[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5)\),δprobe/T\\delta\_\{\\mathrm\{probe\}\}/Ttoρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}, andδevid/T\\delta\_\{\\mathrm\{evid\}\}/Ttob¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)\. A single union bound makes all UCBs/LCBs jointly valid with calibration probability≥1−δ\\geq 1\-\\delta, soq¯t,ctrlπ\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}is a valid UCB onqtπq\_\{t\}^\{\\pi\}\. If a mechanism factor is not separately calibrated, the corresponding term is set to11and Eq\. \([9](https://arxiv.org/html/2605.24693#S2.E9)\) falls back toq¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}\. The context\-conditioned success term is

\(10\)h¯tπ​\(M\)=LCBBinom​\(stM,π,ntM,π,δraw2​T\),\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)=\\mathrm\{LCB\}\_\{\\rm Binom\}\\\!\\left\(s\_\{t\}^\{M,\\pi\},n\_\{t\}^\{M,\\pi\},\\tfrac\{\\delta\_\{\\rm raw\}\}\{2T\}\\right\),and the symbolα\\alphain Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)denotes the false\-admission*risk budget*, distinct from the calibration confidenceδ\\delta\. The controller\-level certificate is

\(11\)CTctrl​\(π\)=\[∏t=1T\(1−q¯t,ctrlπ\)−∏t=1T\(1−h¯tπ​\(M\)\)\]\+\.C\_\{T\}^\{\\rm ctrl\}\(\\pi\)=\\bigl\[\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}\)\-\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)\)\\bigr\]\_\{\+\}\.
###### Theorem 2\.6\(Simultaneously valid finite\-manifest calibration\)\.

LetΠα\\Pi\_\{\\alpha\}be a finite, pre\-declared class of frozen controllers generated from the abstract action classes in Eq\. \([2](https://arxiv.org/html/2605.24693#S2.E2)\); each controller maps certificate states to stop/refine decisions, model\-controlled risk\-probe calls, evidence intensitymm, context intensitykk, prompts, routing/deduplication rules, decoding parameters, a retrievable memory snapshotM⋆M^\{\\star\}, and write modeωwrite\\omega\_\{\\rm write\}\. Allocate the calibration failure probabilityδ\\deltaacross the joint family of UCB/LCB statements as

δ=δraw\+δgate\+δprobe\+δevid,\\delta=\\delta\_\{\\mathrm\{raw\}\}\+\\delta\_\{\\mathrm\{gate\}\}\+\\delta\_\{\\mathrm\{probe\}\}\+\\delta\_\{\\mathrm\{evid\}\},with the per\-component per\-step splitting of Section[2](https://arxiv.org/html/2605.24693#S2)applied for eachπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}separately when estimating Eq\. \([7](https://arxiv.org/html/2605.24693#S2.E7)\), Eq\. \([9](https://arxiv.org/html/2605.24693#S2.E9)\), and Eq\. \([10](https://arxiv.org/html/2605.24693#S2.E10)\)\. The calibrated selection rule is

\(12\)π⋆∈arg⁡maxπ∈Πα⁡CTctrl​\(π\)\.\\pi^\{\\star\}\\in\\arg\\max\_\{\\pi\\in\\Pi\_\{\\alpha\}\}C\_\{T\}^\{\\rm ctrl\}\(\\pi\)\.Then, by a single union bound overΠα\\Pi\_\{\\alpha\}, threshold grids, mechanism factors, and steps, with probability≥1−δ\\geq 1\-\\deltaover calibration the UCBs/LCBs in Eq\. \([11](https://arxiv.org/html/2605.24693#S2.E11)\) are simultaneously valid for everyπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}\. Consequently, anyπ⋆\\pi^\{\\star\}selected by Eq\. \([12](https://arxiv.org/html/2605.24693#S2.E12)\)—and any pre\-declared frozenπ\\pi—inherits a post\-selection\-valid lower boundCTctrl​\(π⋆\)C\_\{T\}^\{\\rm ctrl\}\(\\pi^\{\\star\}\); if its manifest includes a score\-threshold admission rule, the threshold also satisfies Theorem[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5)\.

##### Numerical instantiation\.

BeforeDcalD\_\{\\rm cal\}opens we fixδ=0\.10\\delta\{=\}0\.10split equally asδraw=δgate=δprobe=δevid=0\.025\\delta\_\{\\mathrm\{raw\}\}\{=\}\\delta\_\{\\mathrm\{gate\}\}\{=\}\\delta\_\{\\mathrm\{probe\}\}\{=\}\\delta\_\{\\mathrm\{evid\}\}\{=\}0\.025,\|Θt\|=1\|\\Theta\_\{t\}\|\{=\}1\(two\-stage realization in §[3](https://arxiv.org/html/2605.24693#S3)\),\|Πα\|=5\|\\Pi\_\{\\alpha\}\|\{=\}5\(Table[4](https://arxiv.org/html/2605.24693#S4.T4)\),T=5T\{=\}5for ICPC Refine@5 \(orT=1T\{=\}1for LCB\-Pro Pass@1\), and per\-stepαt=0\.10\\alpha\_\{t\}\{=\}0\.10\. The per\-step per\-controller Clopper–Pearson\(Clopper and Pearson,[1934](https://arxiv.org/html/2605.24693#bib.bib11)\)allocations for Eqs\. \([7](https://arxiv.org/html/2605.24693#S2.E7)\)–\([10](https://arxiv.org/html/2605.24693#S2.E10)\), Theorem[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5), and Propositions[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)–[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)are, respectively,

δraw2​T​\|Πα\|,δgateT​\|Θt\|​\|Πα\|,δprobeT​\|Πα\|,δevidT​\|Πα\|,\\tfrac\{\\delta\_\{\\mathrm\{raw\}\}\}\{2T\|\\Pi\_\{\\alpha\}\|\},\\ \\tfrac\{\\delta\_\{\\mathrm\{gate\}\}\}\{T\|\\Theta\_\{t\}\|\|\\Pi\_\{\\alpha\}\|\},\\ \\tfrac\{\\delta\_\{\\mathrm\{probe\}\}\}\{T\|\\Pi\_\{\\alpha\}\|\},\\ \\tfrac\{\\delta\_\{\\mathrm\{evid\}\}\}\{T\|\\Pi\_\{\\alpha\}\|\},withUCBBinom​\(0,0,η\)=1\\mathrm\{UCB\}\_\{\\rm Binom\}\(0,0,\\eta\)\{=\}1andLCBBinom​\(0,0,η\)=0\\mathrm\{LCB\}\_\{\\rm Binom\}\(0,0,\\eta\)\{=\}0for empty cells\. Concretely, the raw per\-row confidence on the Full controller at any singlettis0\.025/\(2⋅5⋅5\)=5×10−40\.025/\(2\{\\cdot\}5\{\\cdot\}5\)\{=\}5\{\\times\}10^\{\-4\}\.

###### Proposition 2\.7\(RiskProbe bad\-and\-survive UCB\)\.

LetWtprobeW\_\{t\}^\{\\mathrm\{probe\}\}be a\{0,1\}\\\{0,1\\\}\-valued probe\-survival indicator at steptt, set to11when the probe is not invoked\. On held\-out calibration traces, letntπn\_\{t\}^\{\\pi\}be the number of trajectories active at stepttunder controllerπ\\pi\(soAtA\_\{t\}holds\), and letftprobe,πf\_\{t\}^\{\\mathrm\{probe\},\\pi\}count those for which the hidden\-judge labelBtϵB\_\{t\}^\{\\epsilon\}holds and the recordedWtprobe=1W\_\{t\}^\{\\mathrm\{probe\}\}=1\. Define

ρ¯tprobe,π:=UCBBinom​\(ftprobe,π,ntπ,δprobe/T\)\.\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}:=\\mathrm\{UCB\}\_\{\\rm Binom\}\\\!\\left\(f\_\{t\}^\{\\mathrm\{probe\},\\pi\},\\,n\_\{t\}^\{\\pi\},\\,\\delta\_\{\\mathrm\{probe\}\}/T\\right\)\.Thenρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}is a valid one\-sided UCB onPrπ⁡\(Btϵ,Wtprobe=1∣At\)\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},W\_\{t\}^\{\\mathrm\{probe\}\}\{=\}1\\mid A\_\{t\}\)fort≤Tt\\leq Twith confidence1−δprobe1\-\\delta\_\{\\mathrm\{probe\}\}, factoring asσtbad,π​ρtπ\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}\\rho\_\{t\}^\{\\pi\}withσtbad,π:=Prπ⁡\(Wtprobe=1∣Btϵ,At\)\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}:=\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{probe\}\}\{=\}1\\mid B\_\{t\}^\{\\epsilon\},A\_\{t\}\)andρtπ:=Prπ⁡\(Btϵ∣At\)\\rho\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}\\mid A\_\{t\}\), so the probe contracts bad\-and\-survive mass whenever bad candidates survive at a strictly lower rate than the prior\.

###### Proposition 2\.8\(Candidate\-level evidence from EvidenceAcquire\)\.

LetWtevid∈\{0,1\}W\_\{t\}^\{\\mathrm\{evid\}\}\\in\\\{0,1\\\}denote the program\-level evidence\-gate survival at steptt\(recorded as11when the gate is not invoked\)\. On held\-out calibration traces calibrated under the chain\-rule conditional, letb¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)be a one\-sided UCB onPrπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},W\_\{t\}^\{\\mathrm\{probe\}\}=1,A\_\{t\}\)at levelδevid/T\\delta\_\{\\mathrm\{evid\}\}/Tfrom active rows that are simultaneously bad and probe\-surviving\. Then settingΔ​It​\(m\):=−log⁡b¯tπ​\(m\)\\Delta I\_\{t\}\(m\):=\-\\log\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)andℐtevid,π:=∑s≤tΔ​Is​\(ms\)\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}:=\\sum\_\{s\\leq t\}\\Delta I\_\{s\}\(m\_\{s\}\)yields, simultaneously fort≤Tt\\leq Twith confidence1−δevid1\-\\delta\_\{\\mathrm\{evid\}\},

Prπ⁡\(Wtevid=1\|Btϵ,Wtprobe=1,At\)≤exp⁡\[−Δ​It​\(m\)\]\.\\Pr\_\{\\pi\}\\bigl\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\,\\big\|\\,B\_\{t\}^\{\\epsilon\},W\_\{t\}^\{\\mathrm\{probe\}\}=1,A\_\{t\}\\bigr\)\\leq\\exp\\\!\\bigl\[\-\\Delta I\_\{t\}\(m\)\\bigr\]\.The bound is suite\-level:b¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)is a Bernoulli rate over the consensus\-filtered suite, with no independence required across themmtests\.

###### Proposition 2\.9\(ContextAcquire as calibrated hazard envelope\)\.

Under the splitDhistmem→Dcal→DtestD\_\{\\rm hist\}^\{\\rm mem\}\\to D\_\{\\rm cal\}\\to D\_\{\\rm test\}, aContextAcquire​\(k\)\\textsf\{ContextAcquire\}\(k\)action with frozenM⋆M^\{\\star\}realizes Eq\. \([6](https://arxiv.org/html/2605.24693#S2.E6)\) by assigningh¯t\+←h¯tπ​\(M⋆\)\\underline\{h\}\_\{t\+\}\\leftarrow\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\), withh¯tπ​\(M⋆\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)the LCB in Eq\. \([10](https://arxiv.org/html/2605.24693#S2.E10)\) estimated on traces usingM⋆M^\{\\star\}\. The incrementΔ​ht​\(k,Zt\):=h¯tπ​\(M⋆\)−h¯tπ,∅\\Delta h\_\{t\}\(k,Z\_\{t\}\):=\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)\-\\underline\{h\}\_\{t\}^\{\\pi,\\varnothing\}is the snapshot\-on/off LCB gap, reported as a deployment diagnostic\. Both the controller\-level certificate Eq\. \([11](https://arxiv.org/html/2605.24693#S2.E11)\) and the generic certificate Eq\. \([13](https://arxiv.org/html/2605.24693#S2.E13)\) use onlyh¯tπ​\(M⋆\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)as the LCB factor, read off afterM⋆M^\{\\star\}is frozen\.

###### Corollary 2\.10\(Stopped clean\-before\-false certificate\)\.

Under Assumption[2\.3](https://arxiv.org/html/2605.24693#S2.Thmtheorem3), suppose a frozen controllerπ\\pihas simultaneous active\-step boundsqtπ≤q¯tπq\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t\}^\{\\pi\}andhtπ≥h¯tπh\_\{t\}^\{\\pi\}\\geq\\underline\{h\}\_\{t\}^\{\\pi\}for alltt\. Then, with probability at least1−δ1\-\\deltaover calibration, a fresh evaluation problem satisfies

\(13\)Prπ⁡\(τS≤T,τS<τF\)≥CT​\(π\):=\[∏t=1T\(1−q¯tπ\)−∏t=1T\(1−h¯tπ\)\]\+\.\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\tau\_\{S\}<\\tau\_\{F\}\)\\geq C\_\{T\}\(\\pi\):=\\bigl\[\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t\}^\{\\pi\}\)\-\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\)\\bigr\]\_\{\+\}\.

###### Proof\.

The product form follows from the chain rule over stopped events, not from step independence\. We assume the trajectory begins active,Pr⁡\(A1\)=1\\Pr\(A\_\{1\}\)=1, and adopt the convention that any conditional probability with a zero\-probability conditioning event evaluates to0\(these conditioning events contribute0to the bound\)\. We argue in three steps\.

Step 1: no\-false survival\.The events\{τF\>t\}\\\{\\tau\_\{F\}\>t\\\}form a decreasing filtration with\{τF\>t\}=\{τF\>t−1\}∩\{Ft=0\}\\\{\\tau\_\{F\}\>t\\\}=\\\{\\tau\_\{F\}\>t\-1\\\}\\cap\\\{F\_\{t\}=0\\\}\. SinceFt=1F\_\{t\}=1implies the trajectory is active attt, we have the chain\{Ft=1\}⊆At⊆\{τF\>t−1\}\\\{F\_\{t\}=1\\\}\\subseteq A\_\{t\}\\subseteq\\\{\\tau\_\{F\}\>t\-1\\\}, and for eventsE⊆B⊆CE\\subseteq B\\subseteq CwithPr⁡\(B\)\>0\\Pr\(B\)\>0the identitiesPr⁡\(E∣C\)=Pr⁡\(E\)/Pr⁡\(C\)≤Pr⁡\(E\)/Pr⁡\(B\)=Pr⁡\(E∣B\)\\Pr\(E\\mid C\)=\\Pr\(E\)/\\Pr\(C\)\\leq\\Pr\(E\)/\\Pr\(B\)=\\Pr\(E\\mid B\)give

Prπ⁡\(Ft=1​∣τF\>​t−1\)≤Prπ⁡\(Ft=1∣At\)=qtπ≤q¯tπ\.\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid\\tau\_\{F\}\>t\-1\)\\leq\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid A\_\{t\}\)=q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t\}^\{\\pi\}\.Iterating the chain rule onPr⁡\(τF\>T\)=∏t=1TPr⁡\(Ft=0​∣τF\>​t−1\)\\Pr\(\\tau\_\{F\}\>T\)=\\prod\_\{t=1\}^\{T\}\\Pr\(F\_\{t\}=0\\mid\\tau\_\{F\}\>t\-1\)yields

\(14\)Prπ⁡\(τF\>T\)≥∏t=1T\(1−q¯tπ\)\.\\Pr\_\{\\pi\}\(\\tau\_\{F\}\>T\)\\geq\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t\}^\{\\pi\}\)\.
Step 2: still\-active upper bound\.The active event satisfiesAt\+1=At∩\{St=0,Ft=0\}⊆At∩\{St=0\}A\_\{t\+1\}=A\_\{t\}\\cap\\\{S\_\{t\}=0,F\_\{t\}=0\\\}\\subseteq A\_\{t\}\\cap\\\{S\_\{t\}=0\\\}, hence

Prπ⁡\(At\+1∣At\)≤Prπ⁡\(St=0∣At\)=1−htπ≤1−h¯tπ\.\\Pr\_\{\\pi\}\(A\_\{t\+1\}\\mid A\_\{t\}\)\\leq\\Pr\_\{\\pi\}\(S\_\{t\}=0\\mid A\_\{t\}\)=1\-h\_\{t\}^\{\\pi\}\\leq 1\-\\underline\{h\}\_\{t\}^\{\\pi\}\.Iterating, usingAT\+1⊆AT⊆⋯⊆A1A\_\{T\+1\}\\subseteq A\_\{T\}\\subseteq\\cdots\\subseteq A\_\{1\}andPr⁡\(A1\)=1\\Pr\(A\_\{1\}\)=1,

\(15\)Prπ⁡\(AT\+1\)=∏t=1TPrπ⁡\(At\+1∣At\)≤∏t=1T\(1−h¯tπ\)\.\\Pr\_\{\\pi\}\(A\_\{T\+1\}\)=\\prod\_\{t=1\}^\{T\}\\Pr\_\{\\pi\}\(A\_\{t\+1\}\\mid A\_\{t\}\)\\leq\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\)\.
Step 3: clean\-before\-false event\.The certified event\{τS≤T,τS<τF\}\\\{\\tau\_\{S\}\\leq T,\\tau\_\{S\}<\\tau\_\{F\}\\\}is exactly the complement, inside\{τF\>T\}\\\{\\tau\_\{F\}\>T\\\}, of the still\-active event\{AT\+1\}\\\{A\_\{T\+1\}\\\}: trajectories with no false admission byTTeither succeed byTT\(clean stopping at someτS≤T<τF\\tau\_\{S\}\\leq T<\\tau\_\{F\}\) or remain active atT\+1T\+1\. Therefore

Prπ⁡\(τS≤T,τS<τF\)=Prπ⁡\(τF\>T\)−Prπ⁡\(AT\+1\)\.\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\tau\_\{S\}<\\tau\_\{F\}\)=\\Pr\_\{\\pi\}\(\\tau\_\{F\}\>T\)\-\\Pr\_\{\\pi\}\(A\_\{T\+1\}\)\.Combining \([14](https://arxiv.org/html/2605.24693#S2.E14)\) and \([15](https://arxiv.org/html/2605.24693#S2.E15)\), and taking the positive part since probabilities are non\-negative, yields Eq\. \([13](https://arxiv.org/html/2605.24693#S2.E13)\)\. The decomposition includes the degenerate caseτS=τF=∞\\tau\_\{S\}=\\tau\_\{F\}=\\infty, which hasτF\>T\\tau\_\{F\}\>TandτS\>T\\tau\_\{S\}\>Tand therefore lies inAT\+1A\_\{T\+1\}\. The validity of the per\-step boundsq¯tπ\\bar\{q\}\_\{t\}^\{\\pi\}andh¯tπ\\underline\{h\}\_\{t\}^\{\\pi\}follows from Clopper–Pearson on independent active calibration rows at fixedtt\(Assumption[2\.3](https://arxiv.org/html/2605.24693#S2.Thmtheorem3)\); the union bound overttis absorbed by theδraw/\(2​T\)\\delta\_\{\\mathrm\{raw\}\}/\(2T\)allocation in Eq\. \([7](https://arxiv.org/html/2605.24693#S2.E7)\)\. ∎

Action classCertificate effectCP\-Agentrealization \(control variable\)Stopenforce calibrated risk gate / frozen observable gatefinal candidate gate \(admission threshold\)Refinemove to next active candidatesolution generation / repair \(horizonTT, retry\)RiskProbeρt↦λt​ρt\\rho\_\{t\}\\\!\\mapsto\\\!\\lambda\_\{t\}\\rho\_\{t\}Hypothesis Prober \(model\-decided\)EvidenceAcquire​\(m\)\\textsf\{EvidenceAcquire\}\(m\)It↦It\+Δ​It​\(m\)I\_\{t\}\\\!\\mapsto\\\!I\_\{t\}\+\\Delta I\_\{t\}\(m\)Validator \+ Test Augmentation \(mm,γevid\\gamma\_\{\\rm evid\}\)ContextAcquire​\(k\)\\textsf\{ContextAcquire\}\(k\)h¯t↦h¯t\+Δ​ht​\(k\)\\underline\{h\}\_\{t\}\\\!\\mapsto\\\!\\underline\{h\}\_\{t\}\+\\Delta h\_\{t\}\(k\)Experience memory \(budgetkk\)Table 1\.Theory\-to\-system map\. The theory derives abstract action classes from certificate quantities;CP\-Agentinstantiates each class with concrete tools and calibrated knobs\.\\MakeFramed\\FrameRestore

\(T1\) Stopped\-process state\.Feedback\-driven CP solving compresses into three certificate quantities\(ρt,It,h¯t\)\(\\rho\_\{t\},I\_\{t\},\\underline\{h\}\_\{t\}\), with calibrated realizations\(ρ¯tprobe,π,ℐtevid,π,h¯tπ​\(M\)\)\(\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\},\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\},\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)\)entering the certificate\.\(T2\) One action per quantity\.Each quantity is moved by exactly one action class—RiskProbe onρt\\rho\_\{t\}, EvidenceAcquire onItI\_\{t\}, ContextAcquire onh¯t\\underline\{h\}\_\{t\}—which specifies what a useful tool must change\.\(T3\) Audit target\.The calibration rule and stopped\-process certificate select and audit frozen manifests on held\-out traces; Section[3](https://arxiv.org/html/2605.24693#S3)instantiates the three classes as Hypothesis Prober, Solution Validator / Test Augmentation, and frozen Experience retrieval\.\\endMakeFramed

The detailed proofs of all theorems are provided in Appendix[C](https://arxiv.org/html/2605.24693#A3)\.

## 3\.Method

##### Motivation & Overview\.

CP\-Agentis the instantiation of the selected controllerπ⋆\\pi^\{\\star\}: each abstract action class maps to a concrete tool \(Table[1](https://arxiv.org/html/2605.24693#S2.T1)\)\. All manifest entries—prompts, the observable admission gate, the HP and TA interfaces, the retrieval budget, the historical memory snapshotM⋆M^\{\\star\}, the write mode, and the maximum horizon—are frozen before benchmark evaluation, either pre\-declared or audited on held\-out tracesDcalD\_\{\\rm cal\}via Theorem[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6)\.

### 3\.1\.Dual\-Granularity Verification

RiskProbeandEvidenceAcquire​\(m\)\\textsf\{EvidenceAcquire\}\(m\)act at different granularities: the former screens a reasoning state before it is admitted, the latter gates a complete candidate before it is trusted\.CP\-Agentrealizes them as a Hypothesis Prober for the logged upstream gateGtHPG\_\{t\}^\{\\mathrm\{HP\}\}and a Solution Validator for the program\-level admission screen insideΓt\\Gamma\_\{t\}\. Separating the two roles prevents the upstream screen from being double\-counted as calibrated program\-level evidence\.

##### Hypothesis Prober\.

During early\-stage reasoning, the agent often needs to verify local claims that are hard to establish purely in natural language—e\.g\., whether a recurrence holds on random small instances, whether a greedy rule fails on a constructed counterexample, or whether an invariant is preserved by a transition\.CP\-Agentexposes a Hypothesis Prober that compiles and runs short C\+\+ snippets for this purpose\. Under the frozen prompt and decoding policy the backbone itself decides when to call HP from its observable reasoning state—no hand\-written trigger grid—and every call/pass/fail outcome is logged for calibration\. A surviving HP call contracts the posterior odds of an upstream reasoning error \(Proposition[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)\) and tightens the residual envelopeρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}entering Eq\. \([9](https://arxiv.org/html/2605.24693#S2.E9)\); HP thus movesρt\\rho\_\{t\}\(and the aggregateq¯t\\bar\{q\}\_\{t\}\) alone, not the program\-level evidence channel\.

##### Solution Validator\.

Once the agent produces a complete candidate implementation, the Solution Validator compiles and runs it against the local test suite: any public/local failure rejects the candidate and returns counterexample\-level feedback\. By default SV is a deterministic hard screen and contributes no calibrated statistical evidence; only when the suite uses fresh sampled tests or an externally calibrated set is an all\-pass event additionally folded intoℐt\\mathcal\{I\}\_\{t\}as the binary survival channel of Proposition[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)\.

### 3\.2\.Test Augmentation

The action classEvidenceAcquire​\(m\)\\textsf\{EvidenceAcquire\}\(m\)makes the admission gate more informative against public\-test\-passing errors and exposes the evidence\-acquisition intensity as a frozen controller entry\. Public test suites𝒯basic\\mathcal\{T\}\_\{\\text\{basic\}\}are fixed and shallow, so passing them is a weak certificate: candidates routinely clear𝒯basic\\mathcal\{T\}\_\{\\text\{basic\}\}yet fail hidden tests in corner regimes\. Test Augmentation \(TA\) instantiatesEvidenceAcquireby appending consensus\-filtered checks beyond𝒯basic\\mathcal\{T\}\_\{\\text\{basic\}\}; an all\-pass on the augmented suite entersℐt\\mathcal\{I\}\_\{t\}as the binary survival channel of Proposition[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)\.

##### Augmentation pipeline\.

CP\-Agentimplements test augmentation via three stages\.Input Synthesis\.Given the statement, the agent synthesizes a C\+\+ input generator based ontestlib\.h222[https://github\.com/MikeMirzayanov/testlib](https://github.com/MikeMirzayanov/testlib)and command suites that sweep corner regimes such as min/max values, degenerate structures, adversarial patterns, and random sizes\.Oracle Labeling\.Ignoring time and memory limits often makes it easier to write a correct brute\-force solver than an efficient one\(Ridniket al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib29); Heet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib17)\);CP\-Agenttherefore executes independently generated brute\-force solvers on synthesized inputs to obtain candidate reference outputs\.Consensus Filtering\.Because brute\-force programs can still be imperfect,CP\-Agentruns multiple solvers in parallel and majority\-votes their outputs\. Candidates are evaluated on augmented tests only after passing public tests, so augmentation serves as additional diagnostic evidence after the basic admission screen\.Budget and retries\.The frozen TA manifest caps synthesis at 32 candidate augmented inputs per problem\. This is an upper bound rather than a fixed suite size: the generator does not keep sampling until a nonempty suite is obtained, and there is no separate retention cap after filtering\. Instead,CP\-Agentretains all valid inputs that pass synthesis, compilation/execution, and consensus labeling within the cap; generator synthesis/compilation/execution may be retried up to five times\. Consequently, GenRate measures the fraction of problems for which this capped protocol yields a nonempty suite, while the candidate\-level evidence calibration treats the resulting suite as one binary survival channel\.

### 3\.3\.Experience\-Driven Self\-Evolving

The action classContextAcquire​\(k\)\\textsf\{ContextAcquire\}\(k\)calls for a compact context priorMMthat raises the active\-state success hazardh¯t​\(M\)\\underline\{h\}\_\{t\}\(M\)in Corollary[2\.10](https://arxiv.org/html/2605.24693#S2.Thmtheorem10)\. A naive instantiation—continuously updatingMMacross the entire problem stream—would let target\-split outcomes leak intoMMand invalidate the calibration premise of Theorem[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6): the hazard lift would no longer be data\-independent of the evaluation split\.CP\-Agentresolves this with a temporal frozen\-memory protocol,

Dhistmem→Dcal→Dtest\.D\_\{\\rm hist\}^\{\\rm mem\}\\rightarrow D\_\{\\rm cal\}\\rightarrow D\_\{\\rm test\}\.In*memory\-building mode*, validated fail\-to\-pass traces fromDhistmemD\_\{\\rm hist\}^\{\\rm mem\}are written into a retrievable memory bank\. In*certificate\-evaluation mode*, the resulting snapshotM⋆M^\{\\star\}is frozen beforeDcalD\_\{\\rm cal\}andDtestD\_\{\\rm test\}; both readM⋆M^\{\\star\}read\-only, and target\-split experiences are written only to the audit log, never retrieved by subsequent target problems\. Dataset instantiation ofDhistmem/Dcal/DtestD\_\{\\rm hist\}^\{\\rm mem\}/D\_\{\\rm cal\}/D\_\{\\rm test\}is reported in §[4](https://arxiv.org/html/2605.24693#S4)\.

##### Memory\-building stage\.

*Notation\.*OnDhistmem=\(q1,…,qn\)D\_\{\\rm hist\}^\{\\rm mem\}=\(q\_\{1\},\\ldots,q\_\{n\}\)we index problems byrrand within\-problem attempts bytt; the target\-split indexi∈Dcal∪Dtesti\\in D\_\{\\rm cal\}\\cup D\_\{\\rm test\}is disjoint fromrr\.

The design question is what evidence each experience should carry so that the resulting prior can move the hazard but not the calibration premise\. We require each retrievable item to be a self\-contained*Fail→\\toPass*triplet—a failed attempt, its observable failure evidence, and a later observable passing correction on the*same*problem\. The triplet anchors every learnt pattern to an observable counterexample, so retrieval at evaluation time substitutes for an additional probe rather than supplying labels; and because it is built only onDhistmemD\_\{\\rm hist\}^\{\\rm mem\}, it leaves the active\-row denominatorntπn\_\{t\}^\{\\pi\}onDcalD\_\{\\rm cal\}untouched\.

Concretely, on problemqrq\_\{r\}the agent maintains a build\-time memory stateMrbuildM^\{\\rm build\}\_\{r\}distilled from validated fixes\. Candidate solutions are generated as

\(16\)sr,t=Gen​\(qr,Mrbuild,hr,t−1\),s\_\{r,t\}=\\mathrm\{Gen\}\(q\_\{r\},M^\{\\rm build\}\_\{r\},h\_\{r,t\-1\}\),wherehr,t−1h\_\{r,t\-1\}summarizes within\-problem history\. The Solution Validator returns a pass indicator and public/local feedback,

\(17\)\(vr,t,fr,t\)=Φ​\(qr,sr,t\),vr,t∈\{0,1\},\(v\_\{r,t\},\\,f\_\{r,t\}\)=\\Phi\(q\_\{r\},s\_\{r,t\}\),\\qquad v\_\{r,t\}\\in\\\{0,1\\\},withfr,tf\_\{r,t\}the actionable failure feedback \(unused whenvr,t=1v\_\{r,t\}\{=\}1\), matching the\(v,f\)←Φ​\(⋅\)\(v,f\)\\leftarrow\\Phi\(\\cdot\)call in Algorithm[1](https://arxiv.org/html/2605.24693#alg1)\. For any failing attempttt, define the first later observable pass

\(18\)t\+​\(r,t\):=min⁡\{u\>t∣vr,u=1\},t^\{\+\}\(r,t\):=\\min\\\{u\>t\\mid v\_\{r,u\}=1\\\},when it exists\. The extracted experience is

\(19\)er,t=Exp​\(qr,sr,t,sr,t\+​\(r,t\),fr,t\),e\_\{r,t\}=\\mathrm\{Exp\}\(q\_\{r\},s\_\{r,t\},s\_\{r,t^\{\+\}\(r,t\)\},f\_\{r,t\}\),realising the Fail→\\toPass triplet above as a reusable memory item\. The build\-time memory update is

\(20\)Mr\+1build=Update​\(Mrbuild,er,t\)\.M^\{\\rm build\}\_\{r\+1\}=\\mathrm\{Update\}\(M^\{\\rm build\}\_\{r\},e\_\{r,t\}\)\.Because bothsr,t\+​\(r,t\)s\_\{r,t^\{\+\}\(r,t\)\}andfr,tf\_\{r,t\}are observable onDhistmemD\_\{\\rm hist\}^\{\\rm mem\}, no hidden label fromDcal∪DtestD\_\{\\rm cal\}\\cup D\_\{\\rm test\}entersExp​\(⋅\)\\mathrm\{Exp\}\(\\cdot\),Update​\(⋅\)\\mathrm\{Update\}\(\\cdot\), prompts, or routing\.

##### Frozen certificate\-evaluation stage\.

After historical memory construction, calibration chooses a snapshotM⋆M^\{\\star\}and controller manifestπ⋆\\pi^\{\\star\}usingDcalD\_\{\\rm cal\}with memory fixed\. For every target calibration and test problem,

\(21\)Mitarget≡M⋆,i∈Dcal∪Dtest\.M\_\{i\}^\{\\rm target\}\\equiv M^\{\\star\},\\qquad i\\in D\_\{\\rm cal\}\\cup D\_\{\\rm test\}\.An extractable target\-split experience is recorded only in an audit logM~\\widetilde\{M\}for optional post\-benchmark deployment:

\(22\)M~i\+1=M~i∪\{ei\},Mi\+1target=M⋆\.\\widetilde\{M\}\_\{i\+1\}=\\widetilde\{M\}\_\{i\}\\cup\\\{e\_\{i\}\\\},\\qquad M\_\{i\+1\}^\{\\rm target\}=M^\{\\star\}\.Thus the calibrated certificate’s memory prior is data\-independent of the target evaluation outcomes—exactly the i\.i\.d\.\-given\-controller premise that Theorem[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6)requires of the held\-out rows\.

##### Experience Curation\.

The hazard lift attributable toMMis bounded by the description length of the retrieved evidence \(Proposition[2\.9](https://arxiv.org/html/2605.24693#S2.Thmtheorem9)\), so curation is the part of the manifest that directly determines how muchh¯t\\underline\{h\}\_\{t\}can move\. We design three controllers, one per MDL component, and freeze all three into the manifest \(Definition[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4)\):

*\(i\) Routing*restricts the active hypothesis bucket by splitting the snapshot into two repositories,M⋆:=\(ℳgen,ℳalg\)M^\{\\star\}:=\(\\mathcal\{M\}^\{\\mathrm\{gen\}\},\\mathcal\{M\}^\{\\mathrm\{alg\}\}\), whereℳgen\\mathcal\{M\}^\{\\mathrm\{gen\}\}collects task\-agnostic debugging patterns andℳτ1,τ2alg\\mathcal\{M\}^\{\\mathrm\{alg\}\}\_\{\\tau\_\{1\},\\tau\_\{2\}\}is indexed by a two\-level taxonomy\(τ1,τ2\)\(\\tau\_\{1\},\\tau\_\{2\}\)with 8 primary categories and 68 secondary tags\.

*\(ii\) LLM\-mediated deduplication*merges redundant traces: on each insertion the curation prompt compares the candidate against existing summaries in the target bank and rejects entries judged duplicate or covered by an existing generalization, so retrieved memory carries non\-redundant evidence rather than near\-copies of the same pattern\.

*\(iii\) Capacity*controls the description length of retrieved memory by capping each bank,\|ℳgen\|≤20\|\\mathcal\{M\}^\{\\mathrm\{gen\}\}\|\\leq 20and\|ℳτ1,τ2alg\|≤K=10\|\\mathcal\{M\}^\{\\mathrm\{alg\}\}\_\{\\tau\_\{1\},\\tau\_\{2\}\}\|\\leq K\{=\}10; once a bank is at capacity the same semantic rule keeps at most the allowed number of representative experiences,

ℬ←Dedup​\(ℬ;K\)\.\\mathcal\{B\}\\leftarrow\\mathrm\{Dedup\}\(\\mathcal\{B\};K\)\.

##### Experience Utilization\.

At evaluation time the two repositories play distinct roles in the hazard\. The general bank supplies a*task\-agnostic prior shift*applied to every problem: a compact summary is appended to the system prompt,

SysPrompti=\\displaystyle\\textsf\{SysPrompt\}\_\{i\}=\{\}BaseSysPrompt∥Format\(ℳgen,kgen\),\\displaystyle\\\|\\ \\mathrm\{Format\}\(\\mathcal\{M\}^\{\\mathrm\{gen\}\},k\_\{\\text\{gen\}\}\),raising the baseline component ofh¯t​\(M⋆\)\\underline\{h\}\_\{t\}\(M^\{\\star\}\)uniformly acrossqiq\_\{i\}\. The algorithmic bank supplies a*task\-conditional lift*that fires only whenqiq\_\{i\}is routed to a populated bucket:CP\-Agentpredicts a taxonomy label and retrieves the corresponding frozen bucket,

\(23a\)\(τ^1,τ^2\)\\displaystyle\(\\hat\{\\tau\}\_\{1\},\\hat\{\\tau\}\_\{2\}\)=ClassifyAlg​\(qi\),\\displaystyle=\\mathrm\{ClassifyAlg\}\(q\_\{i\}\),\(23b\)Eialg\\displaystyle E\_\{i\}^\{\\mathrm\{alg\}\}=RetrieveAlgExp​\(τ^1,τ^2;ℓ,M⋆\),\\displaystyle=\\mathrm\{RetrieveAlgExp\}\(\\hat\{\\tau\}\_\{1\},\\hat\{\\tau\}\_\{2\};\\ell,M^\{\\star\}\),contributing the residual, problem\-specific portion of the lift\. Both pathways readM⋆M^\{\\star\}read\-only, so the realized hazardh¯tπ​\(M⋆\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)remains the calibrated quantity entering Corollary[2\.10](https://arxiv.org/html/2605.24693#S2.Thmtheorem10), and the refinement\-step\-11\(t=1t\{=\}1\) acceptance gain in Table[5](https://arxiv.org/html/2605.24693#S4.T5)is its empirical instantiation\.

### 3\.4\.Tool Orchestration

CP\-Agentruns a stage\-wise refinement loop coordinating the three components\. For each problemqiq\_\{i\}, the system prompt is initialized with the general bank ofM⋆M^\{\\star\}and, when helpful, retrieved algorithmic experiencesEialgE\_\{i\}^\{\\mathrm\{alg\}\}\. The frozen backbone then drafts a solution and may call the Hypothesis Prober to falsify local hypotheses before committing; the Solution Validator compiles and runs each draft on public tests, whose feedback drives subsequent edits; once all public tests pass, Test Augmentation generates consensus\-filtered cases\. In deployment mode, attempts may be appended to retrievable memory via Eq\. \([20](https://arxiv.org/html/2605.24693#S3.E20)\); in certificate\-evaluation mode they are written only to the audit logM~\\widetilde\{M\}via Eq\. \([22](https://arxiv.org/html/2605.24693#S3.E22)\)\. The mapping to the abstract controller of Section[2](https://arxiv.org/html/2605.24693#S2)is direct: HP→RiskProbe\\to\\textsf\{RiskProbe\}, SV/TA→EvidenceAcquire​\(m\)\\to\\textsf\{EvidenceAcquire\}\(m\), Experience→ContextAcquire​\(k\)\\to\\textsf\{ContextAcquire\}\(k\)\. Algorithm[1](https://arxiv.org/html/2605.24693#alg1)gives the full loop, with an outer loop over theTTrefinement steps of Def\.[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)and an inner loop over within\-step HP probes \(the interaction steps of Fig\.[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)\), so a single refinement step—includingT=1T\{=\}1on LCB\-Pro—can host many HP calls before the candidate is committed\.

ModelLCB\-ProICPC\-EvalPass@1Refine@5\#TFrontier LLMsDeepSeek\-R139\.514\.42\.1DeepSeek\-V3\.2\-Reasoner58\.133\.91\.6Gemini\-3\-pro71\.948\.31\.4o4\-mini\-high59\.342\.41\.2GPT\-4o BasedGPT\-4o5\.43\.41\.8GPT\-4o Reflexion9\.67\.72\.4GPT\-4o CodeSim8\.45\.11\.5GPT\-4o AlphaCodium5\.45\.92\.7GPT\-4oCP\-Agent\(ours\)10\.2 \(\+4\.8\)7\.7 \(\+4\.3\)1\.6Qwen3\-235B\-A22B\-Instruct BasedQwen3\-235B\-A22B\-Instruct26\.921\.21\.4Qwen3\-235B\-A22B Reflexion35\.327\.11\.2Qwen3\-235B\-A22B CodeSim45\.525\.42\.1Qwen3\-235B\-A22B AlphaCodium25\.825\.41\.6Qwen3\-235B\-A22BCP\-Agent\(ours\)46\.7 \(\+19\.8\)26\.3 \(\+5\.1\)1\.6DeepSeek\-V3\.2\-Chat BasedDeepSeek\-V3\.2\-Chat25\.822\.91\.2DeepSeek\-V3\.2\-Chat Reflexion37\.132\.51\.2DeepSeek\-V3\.2\-Chat CodeSim40\.129\.71\.2DeepSeek\-V3\.2\-Chat AlphaCodium24\.529\.71\.7DeepSeek\-V3\.2\-ChatCP\-Agent\(ours\)48\.5 \(\+22\.7\)33\.9 \(\+11\.0\)1\.3

Table 2\.Performance ofCP\-Agentand strong baselines on competitive\-programming benchmarks\. LiveCodeBench Pro reports Pass@1 on the 167\-problem 2025Q2 target split; ICPC\-Eval reports Refine@5 and the average number of refinement turns \#T\. Highlighted rows correspond toCP\-Agentvariants built upon the same backbone\.

## 4\.Experiment

### 4\.1\.Experiment Setup

##### Backbone LLMs

To evaluate the general applicability of our method across different LLMs, we conducted experiments using three distinct backbone LLMs: GPT\-4o\(Hurstet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib20)\), Qwen3\-235B\-A22B\-Instruct\(Yanget al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib33)\), and DeepSeek\-V3\.2\-Chat\(Liuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib26)\)for all multi\-agent workflows andCP\-Agentvariants\.

##### Baselines

We compare against two baseline groups:\(i\) Frontier LLMs:DeepSeek\-R1\(Guoet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib16)\), DeepSeek\-V3\.2\-Reasoner\(Liuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib26)\), Gemini\-3\-pro\(Pichaiet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib14)\), and o4\-mini\-high\(OpenAI,[2025b](https://arxiv.org/html/2605.24693#bib.bib38)\)\.\(ii\) Agent frameworks/workflows:For a fair comparison, we evaluate three agent frameworks: Reflexion\(Shinnet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib30)\), CodeSim\(Islamet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib22)\), and AlphaCodium\(Ridniket al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib29)\), each with the same three backbones, all targeting C\+\+\.

##### Evaluation datasets\.

We evaluate all baselines andCP\-Agentvariants on two competition\-style, contamination\-controlled benchmarks: LiveCodeBench Pro\(Zhenget al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib3)\)and ICPC\-Eval\(Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)\. The target split for LCB\-Pro is the held\-out 2025Q2 subset\. The temporal partition that Experience uses on LCB\-Pro and the controller transfer to ICPC\-Eval are described in §[4\.2](https://arxiv.org/html/2605.24693#S4.SS2)\(summarized in Table[3](https://arxiv.org/html/2605.24693#S4.T3)\)\.

##### Metrics

For ICPC\-Eval, we use Refine@K\(Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)to measure iterative refinement under execution feedback, i\.e\., whether the model passes the hidden judge withinKKresponse–revision attempts; we report Refine@5 \(K=5K\{=\}5\) to reflect the interactive nature of competitive programming\. For LiveCodeBench Pro, we report Pass@1 to match its standard single\-shot setting and prior work, and use per\-refinement\-step analyses \(t=1,2,3t\{=\}1,2,3\) when studying the role of experience\.

##### Implementations\.

The full frozen controller manifest—refinement horizonTT, retrieval budgets\(kgen,ℓalg\)\(k\_\{\\rm gen\},\\ell\_\{\\rm alg\}\), TA intensitymmand consensus ruleγevid\\gamma\_\{\\rm evid\}, decoding parameters, and prompt template—is given in Table[3](https://arxiv.org/html/2605.24693#S4.T3)\. Baselines \(Reflexion, CodeSim, AlphaCodium\) use their official prompts and control flow with the same three backbones\. HP, TA, and Experience are added only insideCP\-Agentvariants and never touch baselines\.

### 4\.2\.Calibration Protocol and Auditability

The controller is calibrated before final benchmark reporting and is never recomputed after observing evaluation outcomes\. We use a temporal three\-stage protocol,

Dhistmem→Dcal→Dtest\.D\_\{\\rm hist\}^\{\\rm mem\}\\rightarrow D\_\{\\rm cal\}\\rightarrow D\_\{\\rm test\}\.First, the 387 historical pre\-test problems inbiannual\_2024\_7\_12construct candidate memory snapshots through the memory\-building rule in Eq\. \([20](https://arxiv.org/html/2605.24693#S3.E20)\)\. Second, for each candidate policyπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}, we run the frozen agent with a fixed snapshotM⋆M^\{\\star\}on the disjoint 166\-problem calibration splitquater\_2025\_1\_3, record every active trace row, compute the one\-sided bounds in Eq\. \([7](https://arxiv.org/html/2605.24693#S2.E7)\), audit the observable admission gate using Eq\. \([8](https://arxiv.org/html/2605.24693#S2.E8)\), and select the best declared manifest fromΠα\\Pi\_\{\\alpha\}by Eq\. \([12](https://arxiv.org/html/2605.24693#S2.E12)\)\. Third, final benchmark runs onDtestD\_\{\\rm test\}use the selected controller in certificate\-evaluation mode: the policy, memory snapshot, TA consensus threshold, HP tool interface, prompts, decoding parameters, maximum horizon, and write mode are frozen, and target\-test experiences are written only to the audit log but not retrievable\. For ICPC\-Eval we transfer the LCB\-Pro–selected controller and the same frozen snapshotM⋆M^\{\\star\}directly to the ICPC target split, under the identical certificate\-evaluation mode\.

Hidden\-judge outcomes are used in calibration only as labels forStS\_\{t\}andFtF\_\{t\}after a candidate has passed the observable gateΓt\\Gamma\_\{t\}; they are not exposed to the agent during refinement and are never used for target\-split memory writes, retrieval, prompt updates, routing, or controller selection\. This separation is important: public/local execution can guide repair, whereas hidden outcomes define the calibration labels for the stopped\-process certificate\. A trace contributes to the denominatorntn\_\{t\}only if it is active at steptt; a candidate that fails SV or TA contributes neither tosts\_\{t\}nor toftf\_\{t\}because it was not admitted\. These conventions align the empirical hazards with Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)and the admission risk with Theorem[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5)\.

Controller entrySymbolDeclared / fixed value beforeDcalD\_\{\\rm cal\}Data splits \(Assumption[2\.3](https://arxiv.org/html/2605.24693#S2.Thmtheorem3)\)Temporal splits—LCB\-Pro:\|Dhistmem\|=387\|D\_\{\\rm hist\}^\{\\rm mem\}\|\{=\}387,\|Dcal\|=166\|D\_\{\\rm cal\}\|\{=\}166,\|Dtest\|=167\|D\_\{\\rm test\}\|\{=\}167\(97/57/13 Easy/Med/Hard\)\.Manifest entries \(Definition[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4)\)Admission gateΓ1:Tobs\\Gamma^\{\\rm obs\}\_\{1:T\}SV/TA hard screen; risk\-score grid\|Θt\|=1\|\\Theta\_\{t\}\|\{=\}1\.Probe interface𝒰probe\\mathcal\{U\}\_\{\\rm probe\}C\+\+ snippet HP tool; backbone\-decided invocation\.TA intensitymm≤32\\leq 32synthesized inputs/problem;≤5\\leq 5generation retries\.TA aggregationγevid\\gamma\_\{\\rm evid\}16 brute\-force samples at sampling temperature1\.01\.0; accept on≥9/16\\geq 9/16agreement\.General retrievalkgenk\_\{\\rm gen\}Injectkgen=10k\_\{\\rm gen\}\{=\}10general\-bank items; cap2020\.Algorithmic retrievalℓalg\\ell\_\{\\rm alg\}Retrieveℓalg=5\\ell\_\{\\rm alg\}\{=\}5items per bucket; per\-bucket capK=10K\{=\}10\.RoutingRrouteR\_\{\\rm route\}Two\-level taxonomy \(8 primary, 68 secondary tags\)\.DeduplicationDdedupD\_\{\\rm dedup\}LLM\-mediated insertion / merge decision\.Prompt templateppromptp\_\{\\rm prompt\}Frozen system \+ user prompts \(Appendix\)\.Decodingddecoded\_\{\\rm decode\}Backbone\-default temperature; greedy where supported\.HorizonTTT=5T\{=\}5\(ICPC Refine@5\);T=1T\{=\}1\(LCB\-Pro Pass@1\)\.Memory snapshotM⋆M^\{\\star\}Built fromDhistmemD\_\{\\rm hist\}^\{\\rm mem\}; read\-only onDcal∪DtestD\_\{\\rm cal\}\\cup D\_\{\\rm test\}\.Write modeωwrite\\omega\_\{\\rm write\}audit\-onlyon certificate\-bearing runs\.Manifest classΠα\\Pi\_\{\\alpha\}\(5 controller variants, Theorem[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6)\)Base: backbone only \(no SV/HP/TA, noM⋆M^\{\\star\}\); admission is the trivial all\-pass gate\. \+SV: Base \+ Solution Validator \(admission gated by SV\)\. \+SV\+HP: above \+ Hypothesis Prober\. \+SV\+HP\+TA: above \+ Test Augmentation\. Full \(=π⋆\\pi^\{\\star\}\): above \+ frozen Experience retrieval \(M⋆,kgen,ℓalgM^\{\\star\},k\_\{\\rm gen\},\\ell\_\{\\rm alg\}\)\. All five sharepprompt,ddecode,T,ωwritep\_\{\\rm prompt\},d\_\{\\rm decode\},T,\\omega\_\{\\rm write\}and differ only by which manifest entries are switched on; mechanism factors of switched\-off components fall back to11in Eq\. \([9](https://arxiv.org/html/2605.24693#S2.E9)\) and to the trivial gate inΓ1:Tobs\\Gamma^\{\\rm obs\}\_\{1:T\}\.Table 3\.Frozen controller manifest for certificate\-bearing evaluation\. Each row maps a manifest entry of Definition[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4)to its declared value; the bottom block defines the five controller variants inΠα\\Pi\_\{\\alpha\}\. No entry is tuned onDtestD\_\{\\rm test\}\.
### 4\.3\.Main Results

From Table[2](https://arxiv.org/html/2605.24693#S3.T2): \(1\)CP\-Agentimproves consistently across backbones, with best gains on DeepSeek\-V3\.2\-Chat of\+11\.0\+11\.0\(ICPC\-Eval Refine@5\) and\+22\.7\+22\.7\(LCB\-Pro Pass@1,81/16781/167vs\.43/16743/167, Wilson 95% CIs non\-overlapping\); \(2\) it matches or beats same\-backbone baselines and equals DeepSeek\-V3\.2\-Reasoner on ICPC\-Eval at33\.933\.9Refine@5 under frozen memory; \(3\) gains scale with backbone strength\. Section[4\.4](https://arxiv.org/html/2605.24693#S4.SS4)audits Corollary[2\.10](https://arxiv.org/html/2605.24693#S2.Thmtheorem10)’s clean component\.

### 4\.4\.Trace\-Calibrated Certificate and Mechanism Evidence

We instantiate the certificate on a held\-out calibration split using the logging schema in Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)\. The pre\-declared manifest class isΠα=\{Base,\+SV,\+SV\+HP,\+SV\+HP\+TA,Full\}\\Pi\_\{\\alpha\}=\\\{\\text\{Base\},\\,\\text\{\+SV\},\\,\\text\{\+SV\+HP\},\\,\\text\{\+SV\+HP\+TA\},\\,\\text\{Full\}\\\}\(\|Πα\|=5\|\\Pi\_\{\\alpha\}\|=5\); each controller is evaluated onDcalD\_\{\\rm cal\}with the per\-controller\-split Clopper–Pearson allocations of the Numerical instantiation paragraph, and Eq\. \([12](https://arxiv.org/html/2605.24693#S2.E12)\) selects the one with the largestCTctrl​\(π\)C\_\{T\}^\{\\rm ctrl\}\(\\pi\)for final evaluation\. Table[4](https://arxiv.org/html/2605.24693#S4.T4)reports the theorem\-level products and certificate values; compact display rates are derived from these products and are not homogeneous\-step assumptions\.

Config\(SV,HP,TA,Exp\)Calibration certificate \(DcalD\_\{\\rm cal\},T=5T\{=\}5\)Empirical \(target split\)∏\(1−q¯\)\\prod\(1\{\-\}\\bar\{q\}\)∏\(1−h¯\)\\prod\(1\{\-\}\\underline\{h\}\)C5ctrlC\_\{5\}^\{\\rm ctrl\}LCB\-ProPass@1ICPC\-EvalR@5 \(\#T\)Cost✗✗✗✗63\.9%60\.3%3\.6%25\.822\.9 \(1\.2\)$0\.002✓✗✗✗69\.5%63\.4%6\.1%43\.731\.4 \(1\.3\)$0\.023✓✓✗✗73\.4%61\.9%11\.5%44\.332\.2 \(1\.3\)$0\.027✓✓✓✗76\.5%60\.3%16\.2%46\.133\.1 \(1\.3\)$0\.043✓✓✓✓79\.5%54\.1%25\.4%48\.533\.9 \(1\.3\)$0\.035

Table 4\.Trace\-calibrated stopping\-time certificate and component ablation onΠα\\Pi\_\{\\alpha\}\(DeepSeek\-V3\.2\-Chat\)\.*Calibration certificate*: theorem\-level products andC5ctrlC\_\{5\}^\{\\rm ctrl\}on the 166\-problem held\-outDcalD\_\{\\rm cal\}; a missing mechanism factor falls back to the admitted\-failure UCB\.*Empirical*: LCB\-Pro Pass@1 \(T=1T\{=\}1, 167\-problem 2025Q2 split\), transferred ICPC\-Eval Refine@5 with mean \#refinements \(transfer per §[4\.2](https://arxiv.org/html/2605.24693#S4.SS2)\), and per\-problem API cost\.C5ctrlC\_\{5\}^\{\\rm ctrl\}lower\-boundsPr⁡\(τS≤5,τS<τF\)\\Pr\(\\tau\_\{S\}\\leq 5,\\tau\_\{S\}<\\tau\_\{F\}\); Pass@1 is cross\-reference, not an empirical bound onC5ctrlC\_\{5\}^\{\\rm ctrl\}\.Table[4](https://arxiv.org/html/2605.24693#S4.T4)shows that the certificate ranking matches the ablation trend: the Full temporal frozen\-memory controller has the largestC5ctrl​\(π\)C^\{\\rm ctrl\}\_\{5\}\(\\pi\), andC5ctrl​\(π\)C^\{\\rm ctrl\}\_\{5\}\(\\pi\)lies below the matched\-horizon ICPC\-Eval Refine@5 in every row, as required\. Mechanism calibration on the 166\-problem calibration split gives: HP odds contractionσ¯tbad/σ¯tgood=0\.46/0\.71=0\.65\\bar\{\\sigma\}\_\{t\}^\{\\rm bad\}/\\underline\{\\sigma\}\_\{t\}^\{\\rm good\}=0\.46/0\.71=0\.65, withσ¯tbad\\bar\{\\sigma\}\_\{t\}^\{\\rm bad\}the Clopper–Pearson UCB onσtbad\\sigma\_\{t\}^\{\\rm bad\}\(Proposition[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)\) andσ¯tgood\\underline\{\\sigma\}\_\{t\}^\{\\rm good\}the LCB on the analogous good\-plan ratePr⁡\(Wtprobe=1∣¬Btϵ,At\)\\Pr\(W\_\{t\}^\{\\rm probe\}\{=\}1\\mid\\neg B\_\{t\}^\{\\epsilon\},A\_\{t\}\); TA evidence incrementΔ​It​\(m=32\)=−log⁡b¯t​\(m=32\)=1\.00\\Delta I\_\{t\}\(m\{=\}32\)=\-\\log\\bar\{b\}\_\{t\}\(m\{=\}32\)=1\.00nats with post\-survival bug rate11/58=19\.0%11/58=19\.0\\%on the 117\-candidate diagnostic pool\. For Experience, Table[5](https://arxiv.org/html/2605.24693#S4.T5)reports cumulative judge\-acceptance with and without the frozen snapshotM⋆M^\{\\star\}over the first three refinement steps on both benchmarks\. The refinement stepttis the response–revision attempt of Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1), and we operationalize it per benchmark: on LCB\-Pro,ttindexes thett\-th Solution Validator admission within a trajectory—the per\-step counterpart of the single\-shot Pass@1 protocol \(T=1T\{=\}1in Table[3](https://arxiv.org/html/2605.24693#S4.T3)\)—while on ICPC\-Eval,ttindexes thett\-th official judge submission under the Refine@5 protocol\. Each cell is the fraction ofDcalD\_\{\\rm cal\}problems whose first judge\-accepted submission arrives by steptt; both*w/ Exp*rows read the same pre\-builtM⋆M^\{\\star\}, so the comparison isolates Experience and is order\-independent\. The lift concentrates att=1t\{=\}1\(\+9\.6\+9\.6pp on LCB\-Pro and\+3\.4\+3\.4pp on ICPC\-Eval\) and shrinks at later steps, consistent withContextAcquireacting on the active\-state hazardh¯t​\(M\)\\underline\{h\}\_\{t\}\(M\)in Proposition[2\.9](https://arxiv.org/html/2605.24693#S2.Thmtheorem9)rather than onρ¯tprobe\\bar\{\\rho\}\_\{t\}^\{\\rm probe\}orℐt\\mathcal\{I\}\_\{t\}\.

Cumulative judge\-acceptance \(%\)LCB\-Pro:tt\-th SV admission; ICPC\-Eval:tt\-th official submissiont=1t\{=\}1t=2t\{=\}2t=3t\{=\}3LCB\-Pro w/o Exp50\.661\.468\.7LCB\-Pro w/Exp60\.267\.569\.9Δ\\Delta\(Exp−\-no Exp\)\+9\.6\+6\.1\+1\.2ICPC\-Eval w/o Exp18\.622\.027\.1ICPC\-Eval w/Exp22\.025\.428\.8Δ\\Delta\(Exp−\-no Exp\)\+3\.4\+3\.4\+1\.7Table 5\.Cumulative judge\-acceptance by refinement steptt\(response–revision attempt of Def\.[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1); not the interaction step of Fig\.[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)\) onDcalD\_\{\\rm cal\}, with vs\. without the frozen Experience snapshotM⋆M^\{\\star\}\(DeepSeek\-V3\.2\-Chat\)\. Per\-benchmark operationalization ofttis in the text\. The lift concentrates att=1t\{=\}1, matchingContextAcquireraising the active\-state hazardh¯t​\(M\)\\underline\{h\}\_\{t\}\(M\)\(Prop\.[2\.9](https://arxiv.org/html/2605.24693#S2.Thmtheorem9)\) rather thanρ¯tprobe\\bar\{\\rho\}\_\{t\}^\{\\rm probe\}orℐt\\mathcal\{I\}\_\{t\}\.
### 4\.5\.Efficiency Analysis

We analyze cost–performance trade\-offs with the variable\-returns\-to\-scale Banker–Charnes–Cooper DEA model\(Charneset al\.,[1978](https://arxiv.org/html/2605.24693#bib.bib7); Bankeret al\.,[1984](https://arxiv.org/html/2605.24693#bib.bib5)\)\(input==Cost, output==Pass@1;θ=1\\theta\{=\}1marks the BCC\-efficient frontier\)\. Figure[1](https://arxiv.org/html/2605.24693#S0.F1)shows thatCP\-Agentlies on the frontier for all three backbones \(DeepSeek\-V3\.2\-Chat, GPT\-4o, Qwen3\-235B\), whereas baselines such as Reflexion are on the frontier only on some backbones \(DeepSeek\-V3\.2\-Chat\) and dominated on others \(Qwen3\-235B\)\. On DeepSeek\-V3\.2\-Chat, agentic baselines andCP\-Agentall sit in the10−210^\{\-2\}USD per\-problem range \(CP\-Agent: $0\.0350\.035, Table[4](https://arxiv.org/html/2605.24693#S4.T4)\); the frontier is therefore driven byCP\-Agent’s higherPass@1at comparable cost rather than by an order\-of\-magnitude cost gap\. Tracked components account for92\.4%92\.4\\%of total cost \(HP69\.7%69\.7\\%, SV17\.6%17\.6\\%, Experience3\.4%3\.4\\%, TA1\.7%1\.7\\%\)\.

### 4\.6\.Tool\-Use Analysis

We analyze howCP\-Agent’s tool use affects the three quantities identified by the theory: false\-admission risk, calibrated admission evidence, and experience\-conditioned success hazard\.

##### Dual\-Granularity Verification\.

Figure[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)\(LCB\-Pro, DeepSeek\-V3\.2\-Chat\) shows that, indexed by*interaction step*\(theii\-th in\-trajectory LLM/tool event, distinct from refinement stepttin Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)\), HP peaks at the first few steps and decays as the approach stabilizes, while SV stays roughly uniform—early hypothesis checks, then solution\-level validation on executable candidates\.

##### Test Augmentation\.

LetGenRateaug\\mathrm\{GenRate\}\_\{\\rm aug\}be the fraction of problems with a non\-empty TA suite, and call problemii*rescued*if a public\-passing candidate that initially failed augmented tests later passes them and is accepted\. Figure[3\(b\)](https://arxiv.org/html/2605.24693#S4.F3.sf2)reports these on the 167\-problem target split: GPT\-4o reachesGenRateaug=39\.5%\\mathrm\{GenRate\}\_\{\\rm aug\}\{=\}39\.5\\%vs\. Qwen3\-235B\-A22B68\.3%68\.3\\%and DeepSeek\-V3\.2\-Chat70\.1%70\.1\\%, mainly reflecting brute\-force quality;RescueRateaug\\mathrm\{RescueRate\}\_\{\\rm aug\}is substantial across backbones \(9\.39\.3–23\.9%23\.9\\%\), with DeepSeek\-V3\.2\-Chat highest, indicating that both test strength and the backbone’s ability to act on feedback matter\. Both diagnostics map to the evidence variable of Proposition[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8):GenRateaug\\mathrm\{GenRate\}\_\{\\rm aug\}is the availability of the evidence channel \(𝒯aug≠∅\\mathcal\{T\}\_\{\\rm aug\}\\neq\\emptyset, required forb¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)to be informative\), andRescueRateaug\\mathrm\{RescueRate\}\_\{\\rm aug\}tracks the “counterexample\-found” rate that tightensb¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\); §[4\.4](https://arxiv.org/html/2605.24693#S4.SS4)converts this toΔ​It​\(m=32\)≈1\.00\\Delta I\_\{t\}\(m\{=\}32\)\{\\approx\}1\.00nats in Table[4](https://arxiv.org/html/2605.24693#S4.T4)\.

![Refer to caption](https://arxiv.org/html/2605.24693v1/x3.png)\(a\)Verification tool usage across*interaction steps*\.A line plot for DeepSeek\-V3\.2\-Chat on LiveCodeBench Pro\. The x\-axis is the within\-trajectory interaction\-step index from 1 to about 29, and the y\-axis is the call frequency of each verification tool\. Two curves are shown: the Hypothesis Prober curve starts high at the first few interaction steps and decays monotonically toward zero as the trajectory progresses, while the Solution Validator curve stays roughly flat across all interaction steps\. The pattern indicates that hypothesis\-level probes concentrate early in reasoning and solution\-level validation is invoked roughly uniformly once executable candidates are available\.
![Refer to caption](https://arxiv.org/html/2605.24693v1/x4.png)\(b\)Test Augmentation statistics across backbones \(y\-axis: %\)\.A grouped bar chart over the 167\-problem LiveCodeBench Pro target split with three backbone groups along the x\-axis: GPT\-4o, Qwen3\-235B\-A22B\-Instruct, and DeepSeek\-V3\.2\-Chat\. The y\-axis is percentage from 0 to about 80\. Each backbone has two bars: GenRate\_aug, the fraction of problems for which Test Augmentation produces a non\-empty consensus\-filtered suite, and RescueRate\_aug, the fraction of public\-passing candidates whose hidden\-wrong label is caught by the augmented suite\. GenRate\_aug is 39\.5 percent for GPT\-4o, 68\.3 percent for Qwen3\-235B\-A22B\-Instruct, and 70\.1 percent for DeepSeek\-V3\.2\-Chat\. RescueRate\_aug ranges from about 9\.3 to 23\.9 percent across backbones, with DeepSeek\-V3\.2\-Chat the highest\.

Figure 3\.Tool\-Use Analysis on LCB\-Pro\. \(a\) HP/SV frequency per*interaction step*on DeepSeek\-V3\.2\-Chat\. Panel \(a\)’s x\-axis indexes interaction steps \(1​…​291\\ldots\\\!29per trajectory\),*not*refinement stept∈\{1,…,T\}t\\in\\\{1,\\ldots,T\\\}of Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)and Table[5](https://arxiv.org/html/2605.24693#S4.T5)\. \(b\)GenRateaug\\mathrm\{GenRate\}\_\{\\rm aug\}= fraction of problems with non\-empty TA suite;RescueRateaug\\mathrm\{RescueRate\}\_\{\\rm aug\}= fraction of public\-passing candidates whose hidden\-wrong label is caught by TA, the operational counterpart ofb¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)in Prop\.[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)\.A two\-panel composite figure for tool\-use analysis on LiveCodeBench Pro with DeepSeek\-V3\.2\-Chat\. Panel \(a\) is a line plot of Hypothesis Prober and Solution Validator call frequencies as a function of within\-trajectory interaction step from 1 to about 29: HP decays from a high early\-step peak toward zero while SV stays approximately uniform\. Panel \(b\) is a grouped bar chart of two Test\-Augmentation diagnostics, GenRate\_aug and RescueRate\_aug, across the three backbones GPT\-4o, Qwen3\-235B\-A22B\-Instruct, and DeepSeek\-V3\.2\-Chat: GenRate\_aug grows from 39\.5 to 70\.1 percent across backbones and RescueRate\_aug ranges roughly between 9 and 24 percent, peaking on DeepSeek\-V3\.2\-Chat\.
##### Experience\-Driven Self\-Evolving

Certificate\-bearing benchmark results use the temporal frozen snapshot protocol in Section[4\.2](https://arxiv.org/html/2605.24693#S4.SS2): historical problems buildM⋆M^\{\\star\}, calibration audits the controller withM⋆M^\{\\star\}fixed, and final testing reads the same snapshot\. Deployment\-mode growth diagnostics in the full study show that general experiences saturate quickly while algorithmic memories keep expanding as new problem types appear; the curriculum effect follows our write\-in criterion, because experiences are stored only after a failure is repaired\. These diagnostics explain snapshot construction onDhistmemD\_\{\\rm hist\}^\{\\rm mem\}, while all reported certificates freeze the resulting snapshot beforeDcalD\_\{\\rm cal\}andDtestD\_\{\\rm test\}\.

### 4\.7\.Ablation Study

Table[4](https://arxiv.org/html/2605.24693#S4.T4)also serves as the component ablation on DeepSeek\-V3\.2\-Chat: SV and HP constitute Dual\-Granularity Verification, TA generates extra tests beyond public cases, and Exp retrieves reusable repair knowledge from past successful trajectories\. Vanilla LLM yields22\.922\.9Refine@5 on ICPC\-Eval and25\.825\.8Pass@1 on LCB\-Pro at $0\.002 per problem\. Adding SV markedly improves both \(31\.4/43\.731\.4/43\.7at $0\.023\); SV\+HP further improves to32\.2/44\.332\.2/44\.3, supporting their complementarity\. Adding TA on top of Dual\-Granularity Verification brings additional Pass@1 gains at higher cost \(33\.1/46\.133\.1/46\.1at $0\.043\)\. Finally, historical frozen Experience achieves the best performance \(33\.9/48\.533\.9/48\.5\) and reduces per\-problem cost back to $0\.035, indicating that retrieved repair experiences reduce ineffective trial\-and\-error without using calibration\- or test\-split write\-ins\. The certificate columnC5ctrlC\_\{5\}^\{\\rm ctrl\}rises monotonically with the same component additions, so the ablation ranking reproduces the certificate ranking onDcalD\_\{\\rm cal\}\.

##### Additional diagnostics\.

TA reliability on 117 public\-pass candidates: TA flags 40/51 hidden\-wrong programs while letting 47/66 hidden\-correct survive, leaving the11/58=19\.0%11/58\{=\}19\.0\\%post\-survival bug rate used by Prop\.[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8); samplingkktests per suite raises detection to74\.5%74\.5\\%atk=15k\{=\}15\.

## 5\.Related Work

##### Competitive programming agents\.

Among tool\-augmented LLM agents that ground reasoning on deterministic execution feedback\(Yaoet al\.,[2022](https://arxiv.org/html/2605.24693#bib.bib36); Madaanet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib27); Gaoet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib13); Chenet al\.,[2022b](https://arxiv.org/html/2605.24693#bib.bib9); Gouet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib15); Wuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib35); Daset al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib1)\), work on CP benchmarks\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.24693#bib.bib18); Chenet al\.,[2021](https://arxiv.org/html/2605.24693#bib.bib4); Jainet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib23); Zhenget al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib3); Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)mostly raises accuracy by scale: dedicated training or massive sampling\(Liet al\.,[2022](https://arxiv.org/html/2605.24693#bib.bib25); Leblondet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib39); El\-Kishkyet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib12); Guoet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib16); Liuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib26)\)\. Inference\-time pipelines such as MapCoder, AlphaCodium, and CodeSim avoid retraining by iterating generation and verification, but are session\-local and discard repair experience after each problem\(Islamet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib21); Ridniket al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib29); Islamet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib22); Yuhuaet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib37); Li and Mooney,[2025](https://arxiv.org/html/2605.24693#bib.bib24)\)\. Reflexion\-style self\-refinement\(Shinnet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib30)\)keeps verbal memory across problems but offers no admission rule for when retrieved memory should enter the loop\.CP\-Agentadds neither training nor sampling: a calibrated controller ties generation, verification, and a frozen memory snapshot to a single held\-out audit, closing the gap to a stronger reasoner\(Liuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib26)\)on ICPC\-Eval Refine@5\.

##### Testing, risk control, and frozen memory\.

Randomized testing and generated checks have long driven program repair and synthesis\(Claessen and Hughes,[2000](https://arxiv.org/html/2605.24693#bib.bib10); Solar\-Lezamaet al\.,[2006](https://arxiv.org/html/2605.24693#bib.bib31); Chenet al\.,[2022a](https://arxiv.org/html/2605.24693#bib.bib8); Heet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib17)\), but are usually consumed as raw signals\. We treat SV and TA as calibrated evidence channels via binomial calibration, conformal\-style risk control, and anytime\-valid evidence\(Clopper and Pearson,[1934](https://arxiv.org/html/2605.24693#bib.bib11); Angelopouloset al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib6); Howardet al\.,[2021](https://arxiv.org/html/2605.24693#bib.bib19); Vovk and Wang,[2021](https://arxiv.org/html/2605.24693#bib.bib34)\), so coverage and post\-survival risk are read off held\-out traces\. Experience enters as a temporal frozen memory prior with PAC\-Bayes/MDL complexity control\(McAllester,[1999](https://arxiv.org/html/2605.24693#bib.bib28),[1999](https://arxiv.org/html/2605.24693#bib.bib28)\): tools and memory are fixed onDhistmemD\_\{\\rm hist\}^\{\\rm mem\}beforeDcalD\_\{\\rm cal\}andDtestD\_\{\\rm test\}\(Section[4\.2](https://arxiv.org/html/2605.24693#S4.SS2)\) so the audit remains valid; memory may still grow in deployment\.

## 6\.Conclusion

A calibrated risk\-controlled controller substantially improves contest\-level programming without additional training\. The theory derives certificate quantities and abstract action classes that are audited on held\-out traces and frozen before evaluation;CP\-Agentinstantiates them as Dual\-Granularity Verification, Test Augmentation, and Experience\-Driven Self\-Evolving, matches DeepSeek\-V3\.2\-Reasoner on ICPC\-Eval Refine@5, and provides trace\-calibrated mechanism evidence for each action class\.

## References

- A\. Angelopoulos, S\. Bates, A\. Fisch, L\. Lei, and T\. Schuster \(2024\)Conformal risk control\.InInternational conference on learning representations,Vol\.2024,pp\. 55198–55218\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- J\. Baek, S\. K\. Jauhar, S\. Cucerzan, and S\. J\. Hwang \(2025\)Researchagent: iterative research idea generation over scientific literature with large language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 6709–6738\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- M\. Balog, A\. L\. Gaunt, M\. Brockschmidt, S\. Nowozin, and D\. Tarlow \(2016\)Deepcoder: learning to write programs\.arXiv preprint arXiv:1611\.01989\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1)\.
- R\. D\. Banker, A\. Charnes, and W\. W\. Cooper \(1984\)Some models for estimating technical and scale inefficiencies in data envelopment analysis\.Management science30\(9\),pp\. 1078–1092\.Cited by:[§D\.4](https://arxiv.org/html/2605.24693#A4.SS4.p1.1),[§4\.5](https://arxiv.org/html/2605.24693#S4.SS5.p1.10)\.
- A\. Charnes, W\. W\. Cooper, and E\. Rhodes \(1978\)Measuring the efficiency of decision making units\.European journal of operational research2\(6\),pp\. 429–444\.Cited by:[§D\.4](https://arxiv.org/html/2605.24693#A4.SS4.p1.1),[§4\.5](https://arxiv.org/html/2605.24693#S4.SS5.p1.10)\.
- B\. Chen, F\. Zhang, A\. Nguyen, D\. Zan, Z\. Lin, J\. Lou, and W\. Chen \(2022a\)Codet: code generation with generated tests\.arXiv preprint arXiv:2207\.10397\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. D\. O\. Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman,et al\.\(2021\)Evaluating large language models trained on code\.arXiv preprint arXiv:2107\.03374\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2022b\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.arXiv preprint arXiv:2211\.12588\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- K\. Claessen and J\. Hughes \(2000\)QuickCheck: a lightweight tool for random testing of haskell programs\.InProceedings of the fifth ACM SIGPLAN international conference on Functional programming,pp\. 268–279\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- C\. J\. Clopper and E\. S\. Pearson \(1934\)The use of confidence or fiducial limits illustrated in the case of the binomial\.Biometrika26\(4\),pp\. 404–413\.Cited by:[§2](https://arxiv.org/html/2605.24693#S2.SS0.SSS0.Px1.p1.6),[§2](https://arxiv.org/html/2605.24693#S2.SS0.SSS0.Px3.p1.8),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- D\. Das, D\. Banerjee, S\. Aditya, and A\. Kulkarni \(2024\)MATHSENSEI: a tool\-augmented large language model for mathematical reasoning\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 942–966\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- A\. El\-Kishky, A\. Wei, A\. Saraiva, B\. Minaiev, D\. Selsam, D\. Dohan, F\. Song, H\. Lightman, I\. Clavera, J\. Pachocki,et al\.\(2025\)Competitive programming with large reasoning models\.arXiv preprint arXiv:2502\.06807\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)Pal: program\-aided language models\.InInternational conference on machine learning,pp\. 10764–10799\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- P\. Gao, Z\. Tian, X\. Meng, X\. Wang, R\. Hu, Y\. Xiao, Y\. Liu, Z\. Zhang, J\. Chen, C\. Gao,et al\.\(2025\)Trae agent: an llm\-based agent for software engineering with test\-time scaling\.arXiv preprint arXiv:2507\.23370\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- Z\. Gou, Z\. Shao, Y\. Gong, Y\. Yang, M\. Huang, N\. Duan, W\. Chen,et al\.\(2024\)Tora: a tool\-integrated reasoning agent for mathematical problem solving\.InInternational Conference on Learning Representations,Vol\.2024,pp\. 48362–48395\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi,et al\.\(2025\)Deepseek\-r1: incentivizing reasoning capability in llms via reinforcement learning\.arXiv preprint arXiv:2501\.12948\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- H\. He, W\. Yao, K\. Ma, W\. Yu, Y\. Dai, H\. Zhang, Z\. Lan, and D\. Yu \(2024\)Webvoyager: building an end\-to\-end web agent with large multimodal models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 6864–6890\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- Z\. He, Y\. M\. Choi, K\. Zhang, J\. Ji, J\. Zhou, D\. Xu, I\. Bercovich, A\. Zhang, and L\. Li \(2025\)Hardtests: synthesizing high\-quality test cases for llm coding\.arXiv preprint arXiv:2505\.24098\.Cited by:[§3\.2](https://arxiv.org/html/2605.24693#S3.SS2.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- D\. Hendrycks, S\. Basart, S\. Kadavath, M\. Mazeika, A\. Arora, E\. Guo, C\. Burns, S\. Puranik, H\. He, D\. Song,et al\.\(2021\)Measuring coding challenge competence with apps\.arXiv preprint arXiv:2105\.09938\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- S\. R\. Howard, A\. Ramdas, J\. McAuliffe, and J\. Sekhon \(2021\)Time\-uniform, nonparametric, nonasymptotic confidence sequences\.The Annals of Statistics49\(2\),pp\. 1055–1080\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- A\. Hurst, A\. Lerer, A\. P\. Goucher, A\. Perelman, A\. Ramesh, A\. Clark, A\. Ostrow, A\. Welihinda, A\. Hayes, A\. Radford,et al\.\(2024\)Gpt\-4o system card\.arXiv preprint arXiv:2410\.21276\.Cited by:[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px1.p1.1)\.
- M\. A\. Islam, M\. E\. Ali, and M\. R\. Parvez \(2024\)Mapcoder: multi\-agent code generation for competitive problem solving\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 4912–4944\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2605.24693#S1.p3.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- M\. A\. Islam, M\. E\. Ali, and M\. R\. Parvez \(2025\)Codesim: multi\-agent code generation and problem solving through simulation\-driven planning and debugging\.InFindings of the Association for Computational Linguistics: NAACL 2025,pp\. 5113–5139\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p2.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Jain, A\. Gu, W\. Li, F\. Yan, T\. Zhang, S\. Wang, A\. Solar\-Lezama, K\. Sen, and I\. Stoica \(2025\)Livecodebench: holistic and contamination free evaluation of large language models for code\.InInternational Conference on Learning Representations,Vol\.2025,pp\. 58791–58831\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Jiang, H\. Xie, S\. Shen, Y\. Shen, Z\. Zhang, M\. Lei, Y\. Zheng, Y\. Li, C\. Li, D\. Huang,et al\.\(2024\)Siriusbi: a comprehensive llm\-powered solution for data analytics in business intelligence\.arXiv preprint arXiv:2411\.06102\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- H\. Lai, X\. Liu, I\. L\. Iong, S\. Yao, Y\. Chen, P\. Shen, H\. Yu, H\. Zhang, X\. Zhang, Y\. Dong,et al\.\(2024\)Autowebglm: a large language model\-based web navigating agent\.InProceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining,pp\. 5295–5306\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- R\. Leblond, F\. Gimeno, F\. Altché, A\. Saade, A\. Ruddock, C\. Tallec, G\. Powell, J\. Grill, M\. Mikuła, M\. Lochbrunner,et al\.\(2023\)AlphaCode 2 technical report\.Technical reportGoogle DeepMind\.Note:Accessed: 2025\-01\-14External Links:[Link](https://storage.googleapis.com/deepmind-media/AlphaCode2/AlphaCode2_Tech_Report.pdf)Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- J\. Li and R\. Mooney \(2025\)AlgoSimBench: identifying algorithmically similar problems for competitive programming\.arXiv preprint arXiv:2507\.15378\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p3.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Li, D\. Choi, J\. Chung, N\. Kushman, J\. Schrittwieser, R\. Leblond, T\. Eccles, J\. Keeling, F\. Gimeno, A\. Dal Lago,et al\.\(2022\)Competition\-level code generation with alphacode\.Science378\(6624\),pp\. 1092–1097\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Liu, A\. Mei, B\. Lin, B\. Xue, B\. Wang, B\. Xu, B\. Wu, B\. Zhang, C\. Lin, C\. Dong,et al\.\(2025\)Deepseek\-v3\. 2: pushing the frontier of open large language models\.arXiv preprint arXiv:2512\.02556\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- C\. Lu, C\. Lu, R\. T\. Lange, J\. Foerster, J\. Clune, and D\. Ha \(2024\)The ai scientist: towards fully automated open\-ended scientific discovery\.arXiv preprint arXiv:2408\.06292\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- Q\. Luo, Y\. Ye, S\. Liang, Z\. Zhang, Y\. Qin, Y\. Lu, Y\. Wu, X\. Cong, Y\. Lin, Y\. Zhang,et al\.\(2024\)Repoagent: an llm\-powered open\-source framework for repository\-level code documentation generation\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations,pp\. 436–464\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- A\. Madaan, N\. Tandon, P\. Gupta, S\. Hallinan, L\. Gao, S\. Wiegreffe, U\. Alon, N\. Dziri, S\. Prabhumoye, Y\. Yang,et al\.\(2023\)Self\-refine: iterative refinement with self\-feedback\.Advances in neural information processing systems36,pp\. 46534–46594\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- Z\. Manna and R\. Waldinger \(1980\)A deductive approach to program synthesis\.ACM Transactions on Programming Languages and Systems \(TOPLAS\)2\(1\),pp\. 90–121\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1)\.
- D\. A\. McAllester \(1999\)PAC\-bayesian model averaging\.InProceedings of the twelfth annual conference on Computational learning theory,pp\. 164–170\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- L\. Ning, Z\. Liang, Z\. Jiang, H\. Qu, Y\. Ding, W\. Fan, X\. Wei, S\. Lin, H\. Liu, P\. S\. Yu,et al\.\(2025\)A survey of webagents: towards next\-generation ai agents for web automation with large foundation models\.InProceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\. 2,pp\. 6140–6150\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- OpenAI \(2025a\)OpenAI o3 system card\.Technical Report\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1)\.
- OpenAI \(2025b\)System card: openai o4‑mini \(including the o4‑mini‑high variant\)\.Note:[https://openai\.com/index/o3\-o4\-mini\-system\-card/](https://openai.com/index/o3-o4-mini-system-card/)Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p2.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1)\.
- S\. Pichai, D\. Hassabis, and K\. Kavukcuoglu \(2025\)A new era of intelligence with gemini 3\.Google\. URL: https://blog\. google/products\-and\-platforms/products/gemini/gemini3\.Cited by:[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1)\.
- T\. Ridnik, D\. Kredo, and I\. Friedman \(2024\)Code generation with alphacodium: from prompt engineering to flow engineering\.arXiv preprint arXiv:2401\.08500\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p2.1),[§1](https://arxiv.org/html/2605.24693#S1.p3.1),[§3\.2](https://arxiv.org/html/2605.24693#S3.SS2.SSS0.Px1.p1.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- N\. Shang, Y\. Liu, Y\. Zhu, L\. L\. Zhang, W\. Xu, X\. Guan, B\. Zhang, B\. Dong, X\. Zhou, B\. Zhang,et al\.\(2025\)Rstar2\-agent: agentic reasoning technical report\.arXiv preprint arXiv:2508\.20722\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1)\.
- N\. Shinn, F\. Cassano, A\. Gopinath, K\. Narasimhan, and S\. Yao \(2023\)Reflexion: language agents with verbal reinforcement learning\.Advances in neural information processing systems36,pp\. 8634–8652\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p3.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px2.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Solar\-Lezama, L\. Tancau, R\. Bodik, S\. Seshia, and V\. Saraswat \(2006\)Combinatorial sketching for finite programs\.InProceedings of the 12th international conference on Architectural support for programming languages and operating systems,pp\. 404–415\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- L\. Spector and A\. Robinson \(2002\)Genetic programming and autoconstructive evolution with the push programming language\.Genetic Programming and Evolvable Machines3\(1\),pp\. 7–40\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1)\.
- H\. Sun, K\. Yu, Y\. Wang, B\. Liu, X\. Li, R\. Li, N\. Chen, and J\. Li \(2026\)AlgBench: to what extent do large reasoning models understand algorithms?\.arXiv preprint arXiv:2601\.04996\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1)\.
- D\. Team, X\. Li, X\. Sun, G\. Wang, S\. Su, C\. Shum, and J\. Li \(2026\)GrandCode: achieving grandmaster level in competitive programming via agentic reinforcement learning\.arXiv preprint arXiv:2604\.02721\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p3.1)\.
- E\. Toledo, K\. Hambardzumyan, M\. Josifoski, R\. Hazra, N\. Baldwin, A\. Audran\-Reiss, M\. Kuchnik, D\. Magka, M\. Jiang, A\. Lupidi,et al\.\(2026\)Ai research agents for machine learning: search, exploration, and generalization in mle\-bench\.Advances in Neural Information Processing Systems38,pp\. 35309–35348\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- V\. Vovk and R\. Wang \(2021\)E\-values: calibration, combination and applications\.The Annals of Statistics49\(3\),pp\. 1736–1754\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px2.p1.3)\.
- J\. Wu, J\. Zhu, Y\. Liu, M\. Xu, and Y\. Jin \(2025\)Agentic reasoning: a streamlined framework for enhancing llm reasoning with agentic tools\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 28489–28503\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p2.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Xu, H\. Yiwen, Y\. Min, Z\. Chen, X\. Zhao, and J\. Wen \(2026\)ICPC\-eval: probing the frontiers of llm reasoning with competitive programming contests\.Advances in Neural Information Processing Systems38\.Cited by:[Appendix E](https://arxiv.org/html/2605.24693#A5.SS0.SSS0.Px1),[Appendix F](https://arxiv.org/html/2605.24693#A6.SS0.SSS0.Px1),[Appendix F](https://arxiv.org/html/2605.24693#A6.SS0.SSS0.Px2),[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px4.p1.3),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px1.p1.1)\.
- Z\. Yang, Z\. Zhou, S\. Wang, X\. Cong, X\. Han, Y\. Yan, Z\. Liu, Z\. Tan, P\. Liu, D\. Yu,et al\.\(2024\)Matplotagent: method and evaluation for llm\-based agentic scientific data visualization\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 11789–11804\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, N\. Du, I\. Shafran, K\. Narasimhan, and Y\. Cao \(2022\)React: synergizing reasoning and acting in language models\.arXiv preprint arXiv:2210\.03629\.Cited by:[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- S\. Yuhua, P\. NIe, and X\. MeNG \(2025\)OI\-assistant: a retrieval augmented system for similar problem discovery and interactive learning in competitive programming\.Cited by:[§1](https://arxiv.org/html/2605.24693#S1.p3.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhang, Z\. Ma, Y\. Ma, Z\. Han, Y\. Wu, and V\. Tresp \(2025\)Webpilot: a versatile and autonomous multi\-agent system for web task execution with strategic exploration\.InProceedings of the AAAI Conference on Artificial Intelligence,Vol\.39,pp\. 23378–23386\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.
- Z\. Zheng, Z\. Cheng, Z\. Shen, S\. Zhou, K\. Liu, H\. He, D\. Li, S\. Wei, H\. Hao, J\. Yao,et al\.\(2026\)Livecodebench pro: how do olympiad medalists judge llms in competitive programming?\.Advances in Neural Information Processing Systems38\.Cited by:[Appendix E](https://arxiv.org/html/2605.24693#A5.SS0.SSS0.Px2),[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.24693#S1.p1.1),[§4\.1](https://arxiv.org/html/2605.24693#S4.SS1.SSS0.Px3.p1.1),[§5](https://arxiv.org/html/2605.24693#S5.SS0.SSS0.Px1.p1.1)\.
- Y\. Zhu, L\. Wang, C\. Yang, X\. Lin, B\. Li, W\. Zhou, X\. Liu, Z\. Peng, T\. Luo, Y\. Li,et al\.\(2025\)A survey of data agents: emerging paradigm or overstated hype?\.arXiv preprint arXiv:2510\.23587\.Cited by:[Appendix I](https://arxiv.org/html/2605.24693#A9.SS0.SSS0.Px1.p1.1)\.

## Appendix ATool\-Orchestrated Pseudocode

Algorithm[1](https://arxiv.org/html/2605.24693#alg1)formalizes §[3\.4](https://arxiv.org/html/2605.24693#S3.SS4)with two nested loops: the outer loop indexes refinement stepst=1,…,Tt\{=\}1,\\ldots,T\(one admission each, matched to Refine@KK/Pass@1\), and the inner loop indexes interaction steps withintt—HP probes consume interaction steps but not refinement steps, so a singlett\(evenT=1T\{=\}1on LCB\-Pro\) can absorb many HP calls beforeCommit\. Memory writes follow Def\.[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4)and uncalled channels record their gate as neutral \(Def\.[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)\)\.

Algorithm 1CP\-Agent: tool\-orchestrated refinement on a problemqiq\_\{i\}\. The outerforloop indexes refinement stepst=1,…,Tt\{=\}1,\\ldots,T\(Def\.[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1), matched to Refine@KK/Pass@1\); the innerrepeat…\\ldotsuntilloop indexes interaction steps—HP probes within steptt, which do*not*consumett\(cf\. Fig\.[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)\)\. The deterministic SV→\\tooptional TA→\\toadmission cascade closes each step\.0:

qiq\_\{i\},

M=\(ℳgen,ℳalg\)M\{=\}\(\\mathcal\{M\}^\{\\mathrm\{gen\}\},\\mathcal\{M\}^\{\\mathrm\{alg\}\}\),

TT,

\(kgen,ℓalg\)\(k\_\{\\rm gen\},\\ell\_\{\\rm alg\}\),

ωwrite\\omega\_\{\\rm write\}
1:

SysPrompti←BaseSysPrompt∥Format​\(ℳgen,kgen\)\\textsf\{SysPrompt\}\_\{i\}\\leftarrow\\textsf\{BaseSysPrompt\}\\,\\\|\\,\\mathrm\{Format\}\(\\mathcal\{M\}^\{\\mathrm\{gen\}\},k\_\{\\rm gen\}\)\{ContextAcquire\}

2:

\(τ^1,τ^2\)←ClassifyAlg​\(qi\)\(\\hat\{\\tau\}\_\{1\},\\hat\{\\tau\}\_\{2\}\)\\leftarrow\\mathrm\{ClassifyAlg\}\(q\_\{i\}\);

Eialg←RetrieveAlgExp​\(τ^1,τ^2;ℓalg,M\)E\_\{i\}^\{\\mathrm\{alg\}\}\\leftarrow\\mathrm\{RetrieveAlgExp\}\(\\hat\{\\tau\}\_\{1\},\\hat\{\\tau\}\_\{2\};\\ell\_\{\\rm alg\},M\)
3:

𝒯←public tests\\mathcal\{T\}\\leftarrow\\text\{public tests\};

𝒯aug←∅\\mathcal\{T\}\_\{\\rm aug\}\\leftarrow\\emptyset;

hi,0←Eialgh\_\{i,0\}\\leftarrow E\_\{i\}^\{\\mathrm\{alg\}\}
4:for

t=1,…,Tt=1,\\ldots,Tdo

5:

Xi,t←hi,t−1X\_\{i,t\}\\leftarrow h\_\{i,t\-1\};

Gi,tHP←G^\{\\mathrm\{HP\}\}\_\{i,t\}\\leftarrowneutral

6:repeat

7:

a←NextAction​\(qi,Xi,t,M\),a∈\{Probe,Commit\}a\\leftarrow\\mathrm\{NextAction\}\(q\_\{i\},X\_\{i,t\},M\),\\ a\\in\\\{\\textsf\{Probe\},\\textsf\{Commit\}\\\}
8:if

a=Probea=\\textsf\{Probe\}then

9:run HP on

Xi,tX\_\{i,t\}; fold pass/fail into

Xi,tX\_\{i,t\}; AND outcome into

Gi,tHPG^\{\\mathrm\{HP\}\}\_\{i,t\}\{Prop\.[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)\}

10:endif

11:until

a=Commita=\\textsf\{Commit\}
12:

si,t←Gen​\(qi,M,Xi,t\)s\_\{i,t\}\\leftarrow\\mathrm\{Gen\}\(q\_\{i\},M,X\_\{i,t\}\)\{Eq\. \([16](https://arxiv.org/html/2605.24693#S3.E16)\)\}

13:

\(vi,t,fi,t\)←Φ​\(qi,si,t;𝒯\)\(v\_\{i,t\},f\_\{i,t\}\)\\leftarrow\\Phi\(q\_\{i\},s\_\{i,t\};\\mathcal\{T\}\);

Gi,tSV←vi,tG^\{\\mathrm\{SV\}\}\_\{i,t\}\\leftarrow v\_\{i,t\}\{public\-test screen\}

14:if

vi,t=1v\_\{i,t\}=1then

15:if

𝒯aug=∅\\mathcal\{T\}\_\{\\rm aug\}=\\emptysetthen

16:

𝒯aug←Augment​\(qi\)\\mathcal\{T\}\_\{\\rm aug\}\\leftarrow\\mathrm\{Augment\}\(q\_\{i\}\)\{TA once per

qiq\_\{i\}\(Tab\.[3](https://arxiv.org/html/2605.24693#S4.T3)\)\}

17:endif

18:

\(vi,t,fi,t\)←Φ​\(qi,si,t;𝒯∪𝒯aug\)\(v\_\{i,t\},f\_\{i,t\}\)\\leftarrow\\Phi\(q\_\{i\},s\_\{i,t\};\\mathcal\{T\}\\cup\\mathcal\{T\}\_\{\\rm aug\}\);

Gi,tTA←vi,tG^\{\\mathrm\{TA\}\}\_\{i,t\}\\leftarrow v\_\{i,t\}\{Prop\.[2\.8](https://arxiv.org/html/2605.24693#S2.Thmtheorem8)\}

19:else

20:

Gi,tTA←G^\{\\mathrm\{TA\}\}\_\{i,t\}\\leftarrowneutral

21:endif

22:

Γi,t←Gi,tHP∧Gi,tSV∧Gi,tTA\\Gamma\_\{i,t\}\\leftarrow G^\{\\mathrm\{HP\}\}\_\{i,t\}\\wedge G^\{\\mathrm\{SV\}\}\_\{i,t\}\\wedge G^\{\\mathrm\{TA\}\}\_\{i,t\}; log

\(Zi,t,Ai,t,Γi,t,Si,t,Fi,t\)\(Z\_\{i,t\},A\_\{i,t\},\\Gamma\_\{i,t\},S\_\{i,t\},F\_\{i,t\}\)
23:if

Γi,t=1\\Gamma\_\{i,t\}=1then

24:return

si,ts\_\{i,t\}\{Stopadmits\}

25:endif

26:

hi,t←Update​\(Xi,t,fi,t\)h\_\{i,t\}\\leftarrow\\mathrm\{Update\}\(X\_\{i,t\},f\_\{i,t\}\)\{Refineto

t\+1t\{\+\}1\}

27:endfor

28:if

ωwrite=audit\-only\\omega\_\{\\rm write\}=\\textsf\{audit\-only\}then

29:append experience to

M~\\widetilde\{M\}\{Eq\. \([22](https://arxiv.org/html/2605.24693#S3.E22)\)\}

30:else

31:update memory via Eq\. \([20](https://arxiv.org/html/2605.24693#S3.E20)\)

32:endif

## Appendix BNotation Glossary

Table[6](https://arxiv.org/html/2605.24693#A2.T6)consolidates every symbol used in Section[2](https://arxiv.org/html/2605.24693#S2)\.

SymbolMeaningTrajectory events and stopping timesTTRefinement horizon\.AtA\_\{t\}Active\-trajectory event\{τS\>t−1,τF\>t−1\}\\\{\\tau\_\{S\}\>t\{\-\}1,\\tau\_\{F\}\>t\{\-\}1\\\}\.Γt,Γtα\\Gamma\_\{t\},\\,\\Gamma\_\{t\}^\{\\alpha\}Observable admission gate / risk\-score\-thresholded gate\.BtϵB\_\{t\}^\{\\epsilon\}Hidden\-judgeϵ\\epsilon\-significant failure event for the admitted candidate\.Ft,StF\_\{t\},\\,S\_\{t\}Indicators of a false admission / clean success at steptt\.τF,τS\\tau\_\{F\},\\,\\tau\_\{S\}Stopping times of false admission and clean success\.Certificate state, risk score, and abstract actionsXt,ϕX\_\{t\},\\,\\phiImplementation state and projectionZt=ϕ​\(Xt\)Z\_\{t\}=\\phi\(X\_\{t\}\)\.ZtZ\_\{t\}Certificate summary state\(ρt,It,h¯t,t\)\(\\rho\_\{t\},I\_\{t\},\\underline\{h\}\_\{t\},t\)\.ρt\\rho\_\{t\}Upper envelope on residual bad\-candidate mass \(calibrated asρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\)\.ItI\_\{t\}Accumulated program\-level evidence in nats \(calibrated asℐtevid,π\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\)\.h¯t\\underline\{h\}\_\{t\}Lower envelope on clean\-success hazard \(calibrated ash¯tπ​\(M\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)\)\.ρt\+,It\+,h¯t\+\\rho\_\{t\+\},\\,I\_\{t\+\},\\,\\underline\{h\}\_\{t\+\}Post\-action certificate components after a single action\.rt​\(Zt\)r\_\{t\}\(Z\_\{t\}\)Residual admission\-risk scoreρt​exp⁡\(−It\)\\rho\_\{t\}\\exp\(\-I\_\{t\}\)\.θ^t,α,Θt\\hat\{\\theta\}\_\{t,\\alpha\},\\,\\Theta\_\{t\}Calibrated admission threshold forrtr\_\{t\}and its finite grid\.StopCommit current candidate through gateΓtα\\Gamma\_\{t\}^\{\\alpha\}\.RefineAdvance to next candidate distribution\.RiskProbeAction contractingρt\\rho\_\{t\}viaλt​\(Zt\)∈\[0,1\]\\lambda\_\{t\}\(Z\_\{t\}\)\\in\[0,1\]\.λt​\(Zt\)\\lambda\_\{t\}\(Z\_\{t\}\)Probe\-induced contraction factor onρt\\rho\_\{t\}\.EvidenceAcquire​\(m\)\\textsf\{EvidenceAcquire\}\(m\)Action adding incrementΔ​It​\(m,Zt\)\\Delta I\_\{t\}\(m,Z\_\{t\}\)toItI\_\{t\}\.ContextAcquire​\(k\)\\textsf\{ContextAcquire\}\(k\)Action addingΔ​ht​\(k,Zt\)\\Delta h\_\{t\}\(k,Z\_\{t\}\)toh¯t\\underline\{h\}\_\{t\}via memory snapshot\.Manifest entries \(Definition[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4)\)π,Πα,π⋆\\pi,\\,\\Pi\_\{\\alpha\},\\,\\pi^\{\\star\}Frozen controller; finite pre\-declared class; selected controller\.Γ1:Tobs\\Gamma^\{\\rm obs\}\_\{1:T\}Observable admission gate trajectory\.𝒰probe\\mathcal\{U\}\_\{\\rm probe\}RiskProbetool interface \(HP\)\.m,γevidm,\\,\\gamma\_\{\\rm evid\}TA generation intensity and consensus aggregation rule\.kgen,ℓalgk\_\{\\rm gen\},\\,\\ell\_\{\\rm alg\}General\-bank injection budget and algorithmic\-bucket retrieval budget \(joint realization ofkk\)\.Rroute,DdedupR\_\{\\rm route\},\\,D\_\{\\rm dedup\}Retrieval routing rule and deduplication rule\.pprompt,ddecodep\_\{\\rm prompt\},\\,d\_\{\\rm decode\}Prompt template and decoding parameters\.M⋆,M~M^\{\\star\},\\,\\widetilde\{M\}Frozen retrievable memory snapshot and audit\-only log\.ωwrite\\omega\_\{\\rm write\}Memory write mode \(audit\-onlyon certificate\-bearing runs\)\.Calibration counts and boundsntπ,ftπ,stπn\_\{t\}^\{\\pi\},\\,f\_\{t\}^\{\\pi\},\\,s\_\{t\}^\{\\pi\}Active calibration traces, false admissions, clean successes at stepttunderπ\\pi\.nt​\(θ\),ft​\(θ\),R¯t​\(θ\)n\_\{t\}\(\\theta\),\\,f\_\{t\}\(\\theta\),\\,\\bar\{R\}\_\{t\}\(\\theta\)Active admissions, hidden\-rejected admissions, and Clopper–Pearson UCB at thresholdθ\\theta\.ftprobe,πf\_\{t\}^\{\\mathrm\{probe\},\\pi\}Calibration count of bad\-and\-probe\-surviving traces\.qtπ,htπq\_\{t\}^\{\\pi\},\\,h\_\{t\}^\{\\pi\}Deployment active\-state false / clean hazards\.q¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}Clopper–Pearson UCB onqtπq\_\{t\}^\{\\pi\}from admitted failures\.h¯tπ​\(M\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)Clopper–Pearson LCB onhtπh\_\{t\}^\{\\pi\}under memoryMM;h¯tπ:=h¯tπ​\(∅\)\\underline\{h\}\_\{t\}^\{\\pi\}:=\\underline\{h\}\_\{t\}^\{\\pi\}\(\\varnothing\)\.Wtprobe,WtevidW\_\{t\}^\{\\rm probe\},\\,W\_\{t\}^\{\\rm evid\}Survival indicators at probe / evidence stages\.ρtπ\\rho\_\{t\}^\{\\pi\}Deployment bad\-priorPrπ⁡\(Btϵ∣At\)\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}\\mid A\_\{t\}\)\.σtbad,π,σtgood,π\\sigma\_\{t\}^\{\\rm bad,\\pi\},\\,\\sigma\_\{t\}^\{\\rm good,\\pi\}Probe\-survival rates conditional onBtϵB\_\{t\}^\{\\epsilon\}/¬Btϵ\\neg B\_\{t\}^\{\\epsilon\}\.ρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}UCB on joint bad\-and\-probe\-survive mass \(Prop\.[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)\)\.b¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)UCB on evidence\-gate survival given bad and probe\-survive\.Δ​It​\(m\),ℐtevid,π\\Delta I\_\{t\}\(m\),\\,\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}Per\-step evidence−log⁡b¯tπ​\(m\)\-\\log\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)and cumulative evidence\.q¯t,mechπ\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}Mechanism\-factorized UCBρ¯tprobe,π​exp⁡\[−ℐtevid,π\]\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\]\.q¯t,ctrlπ\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}min⁡\{q¯t,rawπ,q¯t,mechπ\}\\min\\\{\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\},\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\\\}\.CT​\(π\),CTctrl​\(π\)C\_\{T\}\(\\pi\),\\,C\_\{T\}^\{\\rm ctrl\}\(\\pi\)Clean\-before\-false lower bound \(raw / mechanism\-tightened\)\.Confidence allocations and data splitsα,αt\\alpha,\\,\\alpha\_\{t\}False\-admission risk budget \(overall / per\-step\)\.δ\\deltaCalibration failure probability\.δraw,δgate,δprobe,δevid\\delta\_\{\\mathrm\{raw\}\},\\,\\delta\_\{\\mathrm\{gate\}\},\\,\\delta\_\{\\mathrm\{probe\}\},\\,\\delta\_\{\\mathrm\{evid\}\}Per\-channel allocations ofδ\\delta\(each=0\.025=0\.025\)\.DhistmemD\_\{\\rm hist\}^\{\\rm mem\}Historical split used only for memory construction\.DcalD\_\{\\rm cal\}Held\-out split used for calibration and selection\.DtestD\_\{\\rm test\}Target evaluation split \(read\-onlyM⋆M^\{\\star\}\)\.Table 6\.Consolidated notation for Section[2](https://arxiv.org/html/2605.24693#S2)\. Each symbol used in Definitions[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)–[2\.4](https://arxiv.org/html/2605.24693#S2.Thmtheorem4), Theorems[2\.5](https://arxiv.org/html/2605.24693#S2.Thmtheorem5)and[2\.6](https://arxiv.org/html/2605.24693#S2.Thmtheorem6), Propositions[2\.7](https://arxiv.org/html/2605.24693#S2.Thmtheorem7)–[2\.9](https://arxiv.org/html/2605.24693#S2.Thmtheorem9), and Corollary[2\.10](https://arxiv.org/html/2605.24693#S2.Thmtheorem10)appears here\.
## Appendix CSection 2 Proofs

This appendix contains the proofs for Section 2 \(”A Calibrated Risk\-Controlled Theory of Feedback Control”\) of the paper, covering the calibration estimators after Definition 2\.4, Theorem 2\.5, Theorem 2\.6, Propositions 2\.7–2\.9, Corollary 2\.10, and the validity of the mechanism factorization in Eq\. \(9\)\.

### 0\. Setting and Conventions

#### 0\.1 Probability space and stopped events

Fix a frozen controllerπ\\pi\. A trajectory is generated by sampling one problem from the benchmark distribution and runningπ\\pifor at mostTTrefinement steps\. All probabilitiesPrπ⁡\(⋅\)\\Pr\_\{\\pi\}\(\\cdot\)are taken with respect to a fresh evaluation trajectory; calibration probabilities are taken with respect to held\-out calibration trajectories\.

The stopping times are

τF:=inf\{t:Ft=1\},τS:=inf\{t:St=1\},\\tau\_\{F\}:=\\inf\\\{t:F\_\{t\}=1\\\},\\qquad\\tau\_\{S\}:=\\inf\\\{t:S\_\{t\}=1\\\},
and the active event before stepttis

At:=\{τS\>t−1,τF\>t−1\}\.A\_\{t\}:=\\\{\\tau\_\{S\}\>t\-1,\\;\\tau\_\{F\}\>t\-1\\\}\.
By Definition 2\.1,

Ft:=𝟏\{At,Γt=1,Btϵ\},St:=𝟏\{At,Γt=1,¬Btϵ\},F\_\{t\}:=\\mathbf\{1\}\\\{A\_\{t\},\\,\\Gamma\_\{t\}=1,\\,B\_\{t\}^\{\\epsilon\}\\\},\\qquad S\_\{t\}:=\\mathbf\{1\}\\\{A\_\{t\},\\,\\Gamma\_\{t\}=1,\\,\\neg B\_\{t\}^\{\\epsilon\}\\\},
soFtF\_\{t\}andStS\_\{t\}are mutually exclusive at every fixed stepttviaBtϵB\_\{t\}^\{\\epsilon\}\. HenceτS\\tau\_\{S\}andτF\\tau\_\{F\}cannot both be finite and equal \(they may, however, both equal\+∞\+\\inftywhen neither stop occurs byTT, with the conventioninf∅:=\+∞\\inf\\varnothing:=\+\\infty\)\. The certified event is

\{τS≤T,τS<τF\}\.\\\{\\tau\_\{S\}\\leq T,\\;\\tau\_\{S\}<\\tau\_\{F\}\\\}\.
We adopt the stopped\-process convention: once a clean success or false admission occurs at some stept⋆t^\{\\star\},StS\_\{t\}andFtF\_\{t\}are set to0at all subsequent stepst\>t⋆t\>t^\{\\star\}and the trajectory is inactive\. As a consequence, ifτS<∞\\tau\_\{S\}<\\inftythenFt=0F\_\{t\}=0for allt\>τSt\>\\tau\_\{S\}, soτF=\+∞\\tau\_\{F\}=\+\\inftyon\{τS<τF\}\\\{\\tau\_\{S\}<\\tau\_\{F\}\\\}; symmetrically,τS=\+∞\\tau\_\{S\}=\+\\inftyon\{τF<τS\}\\\{\\tau\_\{F\}<\\tau\_\{S\}\\\}\. We further assumePrπ⁡\(A1\)=1\\Pr\_\{\\pi\}\(A\_\{1\}\)=1, i\.e\., the trajectory begins active at step11\.

#### 0\.2 Admission\-gate semantics required by the mechanism factorization

Eq\. \(9\) and Section 5 below use the containment

\{Ft=1\}⊆\{Btϵ=1,Wtprobe=1,Wtevid=1\}​on​At\.\\\{F\_\{t\}=1\\\}\\subseteq\\\{B\_\{t\}^\{\\epsilon\}=1,\\;W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\;W\_\{t\}^\{\\mathrm\{evid\}\}=1\\\}\\quad\\text\{on \}A\_\{t\}\.
This is equivalent to the requirement that the observable admission gate satisfies

Γt=1⟹Wtprobe=1​and​Wtevid=1​on​At,\\Gamma\_\{t\}=1\\;\\Longrightarrow\\;W\_\{t\}^\{\\mathrm\{probe\}\}=1\\;\\text\{and\}\\;W\_\{t\}^\{\\mathrm\{evid\}\}=1\\quad\\text\{on \}A\_\{t\},
i\.e\., any candidate released byΓt\\Gamma\_\{t\}has first passed the risk\-probe and evidence stages \(with the conventionWprobe=Wevid=1W^\{\\mathrm\{probe\}\}=W^\{\\mathrm\{evid\}\}=1when the corresponding stage is not invoked\)\. We treat this implication as part of the manifest contract onπ\\pi; ifΓ1:Tobs\\Gamma\_\{1:T\}^\{\\mathrm\{obs\}\}does not chain the upstream screens, the mechanism UCB is invalid and only the raw UCBq¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}applies\.

#### 0\.3 Held\-out exchangeability

Under Assumption 2\.3, calibration and evaluation trajectories are independent draws from the same benchmark distribution\. Independence is required across problem trajectories at a fixed steptt, not across refinement steps within a trajectory\. For fixedttandπ\\pi, the active calibration rows yield conditional Bernoulli samples; a random active denominator does not invalidate the bounds, because we condition on the denominator first and then mix\.

#### 0\.4 Zero\-probability conditioning

For any conditional probabilityPr⁡\(H∣E\)\\Pr\(H\\mid E\)withPr⁡\(E\)=0\\Pr\(E\)=0, we setPr⁡\(H∣E\):=0\\Pr\(H\\mid E\):=0\. This convention never inflates an upper bound; in chain\-rule decompositions, the corresponding event has probability0and contributes nothing\.

#### 0\.5 Clopper–Pearson UCB / LCB

LetY1,…,Yn​∼i\.i\.d\.​Bernoulli​\(p\)Y\_\{1\},\\dots,Y\_\{n\}\\overset\{\\text\{i\.i\.d\.\}\}\{\\sim\}\\mathrm\{Bernoulli\}\(p\)andK=∑iYiK=\\sum\_\{i\}Y\_\{i\}\. Write

U​\(K,n,η\)=UCBBinom​\(K,n,η\),L​\(K,n,η\)=LCBBinom​\(K,n,η\)U\(K,n,\\eta\)=\\mathrm\{UCB\}\_\{\\mathrm\{Binom\}\}\(K,n,\\eta\),\\qquad L\(K,n,\\eta\)=\\mathrm\{LCB\}\_\{\\mathrm\{Binom\}\}\(K,n,\\eta\)
for the exact one\-sided Clopper–Pearson bounds, so that

Prp⁡\{p≤U​\(K,n,η\)\}≥1−η,Prp⁡\{p≥L​\(K,n,η\)\}≥1−η\.\\Pr\_\{p\}\\\{p\\leq U\(K,n,\\eta\)\\\}\\geq 1\-\\eta,\\qquad\\Pr\_\{p\}\\\{p\\geq L\(K,n,\\eta\)\\\}\\geq 1\-\\eta\.
If the denominatorNNis random withK∣N=n∼Binomial​\(n,p\)K\\mid N=n\\sim\\mathrm\{Binomial\}\(n,p\), then

Pr⁡\{p\>U​\(K,N,η\)\}=∑nPr⁡\(N=n\)​Pr⁡\{p\>U​\(K,n,η\)∣N=n\}≤η,\\Pr\\\{p\>U\(K,N,\\eta\)\\\}=\\sum\_\{n\}\\Pr\(N=n\)\\,\\Pr\\\{p\>U\(K,n,\\eta\)\\mid N=n\\\}\\leq\\eta,
and the analogous bound holds for the LCB\. We use the conservative conventions

UCBBinom​\(0,0,η\)=1,LCBBinom​\(0,0,η\)=0\.\\mathrm\{UCB\}\_\{\\mathrm\{Binom\}\}\(0,0,\\eta\)=1,\\qquad\\mathrm\{LCB\}\_\{\\mathrm\{Binom\}\}\(0,0,\\eta\)=0\.

#### 0\.6 Pre\-declaration of score\-defining objects

The proof of Theorem 2\.5 requires that the score functionrt​\(Zt\)=ρt​exp⁡\(−It\)r\_\{t\}\(Z\_\{t\}\)=\\rho\_\{t\}\\exp\(\-I\_\{t\}\), the finite threshold gridΘt\\Theta\_\{t\}\(with1≤\|Θt\|<∞1\\leq\|\\Theta\_\{t\}\|<\\infty\), and the gate rule𝟏​\{rt≤θ\}\\mathbf\{1\}\\\{r\_\{t\}\\leq\\theta\\\}be determined byπ\\pialone, using only the observable trajectory state\. Concretely:

1. \(1\)The measurable mapϕπ:Xt↦Zt=\(ρt,It,h¯t,t\)\\phi\_\{\\pi\}:X\_\{t\}\\mapsto Z\_\{t\}=\(\\rho\_\{t\},I\_\{t\},\\underline\{h\}\_\{t\},t\)depends only on observable mechanism outputs \(probe survival, evidence\-gate survival, retrieval slots, step index\), not on hidden labelsBtϵB\_\{t\}^\{\\epsilon\}\.
2. \(2\)The finite gridΘt\\Theta\_\{t\}is determined byπ\\piwith1≤\|Θt\|<∞1\\leq\|\\Theta\_\{t\}\|<\\infty\.
3. \(3\)The gate rule𝟏​\{rt≤θ\}\\mathbf\{1\}\\\{r\_\{t\}\\leq\\theta\\\}is determined byπ\\pi\.

Ifϕπ\\phi\_\{\\pi\},Θt\\Theta\_\{t\}, the aggregation ruleγevid\\gamma\_\{\\mathrm\{evid\}\}, the memory snapshotM⋆M^\{\\star\}, or any calibrated scalar inrtr\_\{t\}\(such asρ¯tprobe,π\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}orb¯tπ\\bar\{b\}\_\{t\}^\{\\pi\}\) is tuned onDcalD\_\{\\mathrm\{cal\}\}, every candidate value must be enumerated in the finite, pre\-declared manifest classΠα\\Pi\_\{\\alpha\}\(with1≤\|Πα\|<∞1\\leq\|\\Pi\_\{\\alpha\}\|<\\infty\) in advance, so that the union bound of Theorem 2\.6 absorbs both the score selection and the gate threshold selection\. Throughout Theorem 2\.5 and Theorem 2\.6 we treat this pre\-declaration as a standing hypothesis\.

### 1\. Validity of the Calibration Estimators in Eq\. \(7\)

For fixedπ\\piand steptt, letntπn\_\{t\}^\{\\pi\}be the number of active calibration rows,ftπf\_\{t\}^\{\\pi\}the count of false admissions, andstπs\_\{t\}^\{\\pi\}the count of clean successes\. Define

q¯t,rawπ:=UCBBinom​\(ftπ,ntπ,δ2​T\),h¯tπ:=LCBBinom​\(stπ,ntπ,δ2​T\)\.\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}:=\\mathrm\{UCB\}\_\{\\mathrm\{Binom\}\}\\\!\\left\(f\_\{t\}^\{\\pi\},n\_\{t\}^\{\\pi\},\\tfrac\{\\delta\}\{2T\}\\right\),\\qquad\\underline\{h\}\_\{t\}^\{\\pi\}:=\\mathrm\{LCB\}\_\{\\mathrm\{Binom\}\}\\\!\\left\(s\_\{t\}^\{\\pi\},n\_\{t\}^\{\\pi\},\\tfrac\{\\delta\}\{2T\}\\right\)\.
The deployment hazards areqtπ:=Prπ⁡\(Ft=1∣At\)q\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid A\_\{t\}\)andhtπ:=Prπ⁡\(St=1∣At\)h\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(S\_\{t\}=1\\mid A\_\{t\}\)\.

Proof\.Fixttandπ\\pi\. LetAi,tA\_\{i,t\}be the active indicator for calibration trajectoryii\. On\{Ai,t=1\}\\\{A\_\{i,t\}=1\\\},Fi,tF\_\{i,t\}is Bernoulli with parameterqtπq\_\{t\}^\{\\pi\}\. By Assumption 2\.3, different trajectories are independent at fixedtt, so the active rows yield i\.i\.d\.Bernoulli​\(qtπ\)\\mathrm\{Bernoulli\}\(q\_\{t\}^\{\\pi\}\)samples; the random denominatorntπn\_\{t\}^\{\\pi\}is handled by Section 0\.5\. The Clopper–Pearson UCB gives

Pr⁡\{qtπ≤q¯t,rawπ\}≥1−δ2​T\.\\Pr\\\{q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}\\\}\\geq 1\-\\tfrac\{\\delta\}\{2T\}\.
The same argument applied toSi,tS\_\{i,t\}givesPr⁡\{htπ≥h¯tπ\}≥1−δ2​T\\Pr\\\{h\_\{t\}^\{\\pi\}\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\\\}\\geq 1\-\\tfrac\{\\delta\}\{2T\}\. A union bound over the2​T2Tone\-sided eventst=1,…,Tt=1,\\dots,Tyields

Pr\{∀t≤T:qtπ≤q¯t,rawπandhtπ≥h¯tπ\}≥1−δ\.□\\Pr\\\!\\left\\\{\\forall t\\leq T:\\;q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}\\;\\text\{and\}\\;h\_\{t\}^\{\\pi\}\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\\right\\\}\\geq 1\-\\delta\.\\qquad\\square
The argument requires only across\-trajectory independence at fixedtt, not within\-trajectory step independence\.

### 2\. Theorem 2\.5: Calibrated Admission Gate

Statement\.Assume the pre\-declaration of Section 0\.6:ϕπ\\phi\_\{\\pi\}, the score functionrt​\(Zt\)=ρt​exp⁡\(−It\)r\_\{t\}\(Z\_\{t\}\)=\\rho\_\{t\}\\exp\(\-I\_\{t\}\), the finite non\-empty gridΘt\\Theta\_\{t\}\(with1≤\|Θt\|<∞1\\leq\|\\Theta\_\{t\}\|<\\infty\), and the gate rule are fixed byπ\\pibefore label counting; any calibrated scalar enteringrtr\_\{t\}is enumerated in the finite, non\-empty manifestΠα\\Pi\_\{\\alpha\}\(with1≤\|Πα\|<∞1\\leq\|\\Pi\_\{\\alpha\}\|<\\infty\)\. On held\-out calibration traces, letnt​\(θ\)n\_\{t\}\(\\theta\)be the number of active candidate admissions withrt​\(Zt\)≤θr\_\{t\}\(Z\_\{t\}\)\\leq\\thetaandft​\(θ\)f\_\{t\}\(\\theta\)the count of such admissions rejected by hidden evaluation\. Define

R¯t​\(θ\):=UCBBinom​\(ft​\(θ\),nt​\(θ\),δgateT​\|Θt\|\)\\bar\{R\}\_\{t\}\(\\theta\):=\\mathrm\{UCB\}\_\{\\mathrm\{Binom\}\}\\\!\\left\(f\_\{t\}\(\\theta\),\\,n\_\{t\}\(\\theta\),\\,\\tfrac\{\\delta\_\{\\mathrm\{gate\}\}\}\{T\|\\Theta\_\{t\}\|\}\\right\)
and select

θ^t,α=max⁡\{θ∈Θt:R¯t​\(θ\)≤αt\},\\hat\{\\theta\}\_\{t,\\alpha\}=\\max\\\{\\theta\\in\\Theta\_\{t\}:\\bar\{R\}\_\{t\}\(\\theta\)\\leq\\alpha\_\{t\}\\\},
withθ^t,α:=−∞\\hat\{\\theta\}\_\{t,\\alpha\}:=\-\\inftyandΓtα≡0\\Gamma\_\{t\}^\{\\alpha\}\\equiv 0when the set is empty\. Then, with calibration probability at least1−δgate1\-\\delta\_\{\\mathrm\{gate\}\}, the gateΓtα=𝟏​\{rt​\(Zt\)≤θ^t,α\}\\Gamma\_\{t\}^\{\\alpha\}=\\mathbf\{1\}\\\{r\_\{t\}\(Z\_\{t\}\)\\leq\\hat\{\\theta\}\_\{t,\\alpha\}\\\}satisfies

Pr⁡\(Ft=1∣At,Γtα=1\)≤αt\\Pr\(F\_\{t\}=1\\mid A\_\{t\},\\,\\Gamma\_\{t\}^\{\\alpha\}=1\)\\leq\\alpha\_\{t\}
simultaneously for allt≤Tt\\leq Tand all data\-dependentθ^t,α∈Θt\\hat\{\\theta\}\_\{t,\\alpha\}\\in\\Theta\_\{t\}\.

Proof\.Section 0\.6 asserts thatϕπ\\phi\_\{\\pi\},Θt\\Theta\_\{t\}, and the gate rule are pre\-declared \(or absorbed intoΠα\\Pi\_\{\\alpha\}by Theorem 2\.6\)\. Fixttandθ∈Θt\\theta\\in\\Theta\_\{t\}, and define the deployment conditional rejection rate

pt,θ:=Prπ⁡\(Ft=1∣At,rt​\(Zt\)≤θ\)\.p\_\{t,\\theta\}:=\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid A\_\{t\},\\,r\_\{t\}\(Z\_\{t\}\)\\leq\\theta\)\.
The conditioning event\{At,rt​\(Zt\)≤θ\}\\\{A\_\{t\},\\,r\_\{t\}\(Z\_\{t\}\)\\leq\\theta\\\}is determined by observables\. By Assumption 2\.3, the calibration rows satisfying this event yield i\.i\.d\.Bernoulli​\(pt,θ\)\\mathrm\{Bernoulli\}\(p\_\{t,\\theta\}\)samples for hidden rejection, with a random denominatornt​\(θ\)n\_\{t\}\(\\theta\)handled by Section 0\.5\. The Clopper–Pearson UCB gives

Pr⁡\{pt,θ≤R¯t​\(θ\)\}≥1−δgateT​\|Θt\|\.\\Pr\\\{p\_\{t,\\theta\}\\leq\\bar\{R\}\_\{t\}\(\\theta\)\\\}\\geq 1\-\\tfrac\{\\delta\_\{\\mathrm\{gate\}\}\}\{T\|\\Theta\_\{t\}\|\}\.
Union\-bounding overt≤Tt\\leq Tandθ∈Θt\\theta\\in\\Theta\_\{t\}defines the good event

Ggate:=\{∀t≤T,∀θ∈Θt:pt,θ≤R¯t​\(θ\)\},Pr⁡\(Ggate\)≥1−δgate\.G\_\{\\mathrm\{gate\}\}:=\\\{\\forall t\\leq T,\\,\\forall\\theta\\in\\Theta\_\{t\}:\\;p\_\{t,\\theta\}\\leq\\bar\{R\}\_\{t\}\(\\theta\)\\\},\\quad\\Pr\(G\_\{\\mathrm\{gate\}\}\)\\geq 1\-\\delta\_\{\\mathrm\{gate\}\}\.
OnGgateG\_\{\\mathrm\{gate\}\}, consider the data\-dependent choiceθ^t,α\\hat\{\\theta\}\_\{t,\\alpha\}\.

*Case 1: the selection set is non\-empty\.*Thenθ^t,α∈Θt\\hat\{\\theta\}\_\{t,\\alpha\}\\in\\Theta\_\{t\}andR¯t​\(θ^t,α\)≤αt\\bar\{R\}\_\{t\}\(\\hat\{\\theta\}\_\{t,\\alpha\}\)\\leq\\alpha\_\{t\}\. SinceGgateG\_\{\\mathrm\{gate\}\}holds at every grid point,

pt,θ^t,α≤R¯t​\(θ^t,α\)≤αt,p\_\{t,\\hat\{\\theta\}\_\{t,\\alpha\}\}\\leq\\bar\{R\}\_\{t\}\(\\hat\{\\theta\}\_\{t,\\alpha\}\)\\leq\\alpha\_\{t\},
andΓtα=1\\Gamma\_\{t\}^\{\\alpha\}=1is equivalent tort​\(Zt\)≤θ^t,αr\_\{t\}\(Z\_\{t\}\)\\leq\\hat\{\\theta\}\_\{t,\\alpha\}, so

Pr⁡\(Ft=1∣At,Γtα=1\)=pt,θ^t,α≤αt\.\\Pr\(F\_\{t\}=1\\mid A\_\{t\},\\,\\Gamma\_\{t\}^\{\\alpha\}=1\)=p\_\{t,\\hat\{\\theta\}\_\{t,\\alpha\}\}\\leq\\alpha\_\{t\}\.
*Case 2: the selection set is empty\.*ThenΓtα≡0\\Gamma\_\{t\}^\{\\alpha\}\\equiv 0, andPr⁡\(At,Γtα=1\)=0\\Pr\(A\_\{t\},\\,\\Gamma\_\{t\}^\{\\alpha\}=1\)=0; by Section 0\.4 the conditional probability is0≤αt0\\leq\\alpha\_\{t\}\.□\\qquad\\square

The post\-selection validity follows from establishing the UCB simultaneously over the finite grid before any data\-dependent choice is made\.

### 3\. Proposition 2\.7: RiskProbe Bad\-and\-Survive UCB

Statement\.LetWtprobe∈\{0,1\}W\_\{t\}^\{\\mathrm\{probe\}\}\\in\\\{0,1\\\}be the probe\-survival indicator at steptt\(set to11when the probe is not invoked\)\. On held\-out calibration traces, letntπn\_\{t\}^\{\\pi\}be the number of active rows at stepttunderπ\\pi, and letftprobe,πf\_\{t\}^\{\\mathrm\{probe\},\\pi\}count those rows where bothBtϵ=1B\_\{t\}^\{\\epsilon\}=1andWtprobe=1W\_\{t\}^\{\\mathrm\{probe\}\}=1\. Define

ρ¯tprobe,π:=UCBBinom​\(ftprobe,π,ntπ,δprobeT\)\.\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}:=\\mathrm\{UCB\}\_\{\\mathrm\{Binom\}\}\\\!\\left\(f\_\{t\}^\{\\mathrm\{probe\},\\pi\},\\,n\_\{t\}^\{\\pi\},\\,\\tfrac\{\\delta\_\{\\mathrm\{probe\}\}\}\{T\}\\right\)\.
Then, with calibration probability at least1−δprobe1\-\\delta\_\{\\mathrm\{probe\}\}, simultaneously fort≤Tt\\leq T,

ρ¯tprobe,π≥Prπ⁡\(Btϵ,Wtprobe=1∣At\),\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\geq\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\),
and

Prπ⁡\(Btϵ,Wtprobe=1∣At\)=σtbad,π​ρtπ,\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)=\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}\\,\\rho\_\{t\}^\{\\pi\},
whereσtbad,π:=Prπ⁡\(Wtprobe=1∣Btϵ,At\)\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}:=\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,A\_\{t\}\)andρtπ:=Prπ⁡\(Btϵ∣At\)\\rho\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}\\mid A\_\{t\}\)\.

Proof\.Fixtt\. For calibration trajectoryii, let

Yi,tprobe:=𝟏​\{Ai,t=1,Bi,tϵ=1,Wi,tprobe=1\}Y\_\{i,t\}^\{\\mathrm\{probe\}\}:=\\mathbf\{1\}\\\{A\_\{i,t\}=1,\\,B\_\{i,t\}^\{\\epsilon\}=1,\\,W\_\{i,t\}^\{\\mathrm\{probe\}\}=1\\\}
and consider it only on active rows\. On\{Ai,t=1\}\\\{A\_\{i,t\}=1\\\},

𝟏​\{Bi,tϵ=1,Wi,tprobe=1\}\\mathbf\{1\}\\\{B\_\{i,t\}^\{\\epsilon\}=1,\\,W\_\{i,t\}^\{\\mathrm\{probe\}\}=1\\\}
is Bernoulli with parameterptprobe,π:=Prπ⁡\(Btϵ,Wtprobe=1∣At\)p\_\{t\}^\{\\mathrm\{probe\},\\pi\}:=\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)\. By Assumption 2\.3, the active rows yield i\.i\.d\. samples; Section 0\.5 handles the random denominator\. The Clopper–Pearson UCB gives

Pr⁡\{ptprobe,π≤ρ¯tprobe,π\}≥1−δprobeT,\\Pr\\\{p\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\leq\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\\}\\geq 1\-\\tfrac\{\\delta\_\{\\mathrm\{probe\}\}\}\{T\},
and a union bound overt=1,…,Tt=1,\\dots,Tyields the simultaneous statement with confidence1−δprobe1\-\\delta\_\{\\mathrm\{probe\}\}\.

The factorization follows from the chain rule of conditional probability \(no independence is required\):

Prπ⁡\(Btϵ,Wtprobe=1∣At\)\\displaystyle\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)=Prπ⁡\(Wtprobe=1∣Btϵ,At\)​Prπ⁡\(Btϵ∣At\)\\displaystyle=\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,A\_\{t\}\)\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}\\mid A\_\{t\}\)=σtbad,πρtπ\.□\\displaystyle=\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}\\,\\rho\_\{t\}^\{\\pi\}\.\\hskip 18\.49988pt\\square
Ifσtbad,π<1\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}<1, the probe contracts the bad\-and\-survive mass strictly below the bad prior; if additionallyσtbad,π<σtgood,π\\sigma\_\{t\}^\{\\mathrm\{bad\},\\pi\}<\\sigma\_\{t\}^\{\\mathrm\{good\},\\pi\}, the probe also shrinks the bad\-versus\-good odds\.

### 4\. Proposition 2\.8: Candidate\-Level Evidence from EvidenceAcquire

Statement\.LetWtevid∈\{0,1\}W\_\{t\}^\{\\mathrm\{evid\}\}\\in\\\{0,1\\\}denote the program\-level evidence\-gate survival at steptt\(set to11when the gate is not invoked\)\. On calibration rows that are simultaneously active, bad, and probe\-surviving, letb¯tπ​\(m\)\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)be the one\-sided Clopper–Pearson UCB on

Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)
at levelδevid/T\\delta\_\{\\mathrm\{evid\}\}/T\. SetΔ​It​\(m\):=−log⁡b¯tπ​\(m\)\\Delta I\_\{t\}\(m\):=\-\\log\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)\. Then, with calibration probability at least1−δevid1\-\\delta\_\{\\mathrm\{evid\}\}, simultaneously fort≤Tt\\leq T,

Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)≤exp⁡\[−Δ​It​\(m\)\]\.\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\\leq\\exp\[\-\\Delta I\_\{t\}\(m\)\]\.
Proof\.Fixttand restrict the calibration set to rows withAt=1A\_\{t\}=1,Btϵ=1B\_\{t\}^\{\\epsilon\}=1, andWtprobe=1W\_\{t\}^\{\\mathrm\{probe\}\}=1\. Letntevid,πn\_\{t\}^\{\\mathrm\{evid\},\\pi\}be the number of such rows andktevid,πk\_\{t\}^\{\\mathrm\{evid\},\\pi\}the number withWtevid=1W\_\{t\}^\{\\mathrm\{evid\}\}=1\. Conditional on this restriction,WtevidW\_\{t\}^\{\\mathrm\{evid\}\}is Bernoulli with parameter

btπ​\(m\):=Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)\.b\_\{t\}^\{\\pi\}\(m\):=\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\.
By Assumption 2\.3 and Section 0\.5, the Clopper–Pearson UCB gives

Pr⁡\{btπ​\(m\)≤b¯tπ​\(m\)\}≥1−δevidT,\\Pr\\\{b\_\{t\}^\{\\pi\}\(m\)\\leq\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)\\\}\\geq 1\-\\tfrac\{\\delta\_\{\\mathrm\{evid\}\}\}\{T\},
and a union bound overt=1,…,Tt=1,\\dots,Tproduces the good event

Gevid:=\{∀t≤T:btπ​\(m\)≤b¯tπ​\(m\)\},Pr⁡\(Gevid\)≥1−δevid\.G\_\{\\mathrm\{evid\}\}:=\\\{\\forall t\\leq T:\\;b\_\{t\}^\{\\pi\}\(m\)\\leq\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)\\\},\\quad\\Pr\(G\_\{\\mathrm\{evid\}\}\)\\geq 1\-\\delta\_\{\\mathrm\{evid\}\}\.
OnGevidG\_\{\\mathrm\{evid\}\},exp⁡\[−Δ​It​\(m\)\]=b¯tπ​\(m\)\\exp\[\-\\Delta I\_\{t\}\(m\)\]=\\bar\{b\}\_\{t\}^\{\\pi\}\(m\), so

Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)=btπ​\(m\)≤b¯tπ​\(m\)=exp⁡\[−Δ​It​\(m\)\]\.\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)=b\_\{t\}^\{\\pi\}\(m\)\\leq\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)=\\exp\[\-\\Delta I\_\{t\}\(m\)\]\.
If the conditioning subset is empty,b¯tπ​\(m\)=1\\bar\{b\}\_\{t\}^\{\\pi\}\(m\)=1by Section 0\.5 and the bound is trivial\.□\\qquad\\square

#### 4\.1 Suite\-level interpretation

The bound treats the all\-pass event over the consensus\-filtered suite as a single Bernoulli channel,

Wtevid=1⇔\\displaystyle W\_\{t\}^\{\\mathrm\{evid\}\}=1\\iff\{\}a bad\-and\-probe\-surviving candidatepasses the entire suite\.\\displaystyle\\text\{passes the entire suite\}\.
We calibrate the suite\-level ratebtπ​\(m\)b\_\{t\}^\{\\pi\}\(m\)directly; themmgenerated tests need not be independent, andmmis not treated as a sample size\.

#### 4\.2 Channel\-level accumulation

If multiple evidence channelsj=1,…,Jtj=1,\\dots,J\_\{t\}act on the same admission decision at steptt, with survival eventsEt,jE\_\{t,j\}calibrated under sequential conditional survival

Prπ⁡\(Et,j=1∣Btϵ,Wtprobe=1,At,Et,1=1,…,Et,j−1=1\)≤bt,j,\\Pr\_\{\\pi\}\(E\_\{t,j\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\},\\,E\_\{t,1\}=1,\\dots,E\_\{t,j\-1\}=1\)\\leq b\_\{t,j\},
where eachbt,jb\_\{t,j\}is a Clopper–Pearson UCB at levelδevid/\(T⋅Jmax⋅\|Πα\|\)\\delta\_\{\\mathrm\{evid\}\}/\(T\\cdot J\_\{\\max\}\\cdot\|\\Pi\_\{\\alpha\}\|\)on its conditional subset \(withJmax=maxt⁡JtJ\_\{\\max\}=\\max\_\{t\}J\_\{t\}\), then the chain rule gives

Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)≤∏j=1Jtbt,j\.\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\\leq\\prod\_\{j=1\}^\{J\_\{t\}\}b\_\{t,j\}\.
In this setting we define the cumulative evidence by channel index,

ℐtevid,π:=∑j=1JtΔ​It,j,Δ​It,j:=−log⁡bt,j,\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}:=\\sum\_\{j=1\}^\{J\_\{t\}\}\\Delta I\_\{t,j\},\\qquad\\Delta I\_\{t,j\}:=\-\\log b\_\{t,j\},
yielding

Prπ⁡\(Wtevid=1∣Btϵ,Wtprobe=1,At\)≤exp⁡\[−ℐtevid,π\]\.\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\\leq\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\]\.
The accumulation is over channels acting on the current step’s admission decision, not over refinement stepss≤ts\\leq t\. Across refinement steps, the candidates differ, no sequential conditional survival is calibrated, and the chain\-rule decomposition does not apply\.

For CP\-Agent, the implementation uses a single TA channel per admission decision \(Jt≡1J\_\{t\}\\equiv 1\), andℐtevid,π=Δ​It​\(m\)\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}=\\Delta I\_\{t\}\(m\)\.

### 5\. Validity of the Mechanism Factorization in Eq\. \(9\)

Eq\. \(9\) defines

q¯t,mechπ=ρ¯tprobe,π​exp⁡\[−ℐtevid,π\],q¯t,ctrlπ=min⁡\{q¯t,rawπ,q¯t,mechπ\}\.\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}=\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\,\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\],\\qquad\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}=\\min\\\{\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\},\\,\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\\\}\.
We show that on the good events of Propositions 2\.7 and 2\.8,q¯t,ctrlπ\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}is a valid UCB onqtπ:=Prπ⁡\(Ft=1∣At\)q\_\{t\}^\{\\pi\}:=\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid A\_\{t\}\)\.

Proof\.Under the gate\-semantics requirement of Section 0\.2, onAtA\_\{t\},

\{Ft=1\}⊆\{Btϵ=1,Wtprobe=1,Wtevid=1\}\.\\\{F\_\{t\}=1\\\}\\subseteq\\\{B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,W\_\{t\}^\{\\mathrm\{evid\}\}=1\\\}\.
The chain rule of conditional probability gives

qtπ\\displaystyle q\_\{t\}^\{\\pi\}≤Prπ⁡\(Btϵ=1,Wtprobe=1,Wtevid=1∣At\)\\displaystyle\\leq\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid A\_\{t\}\)=Prπ⁡\(Btϵ=1,Wtprobe=1∣At\)\\displaystyle=\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)×Prπ⁡\(Wtevid=1∣Btϵ=1,Wtprobe=1,At\)\.\\displaystyle\\qquad\\times\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\.
On the good event of Proposition 2\.7,

Prπ⁡\(Btϵ=1,Wtprobe=1∣At\)≤ρ¯tprobe,π\.\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)\\leq\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\.
On the good event of Proposition 2\.8 \(or Section 4\.2 when multiple channels are present\),

Prπ⁡\(Wtevid=1∣Btϵ=1,Wtprobe=1,At\)≤exp⁡\[−ℐtevid,π\],\\Pr\_\{\\pi\}\(W\_\{t\}^\{\\mathrm\{evid\}\}=1\\mid B\_\{t\}^\{\\epsilon\}=1,\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1,\\,A\_\{t\}\)\\leq\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\],
whereℐtevid,π\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}accumulates over channels acting on the step\-ttadmission decision \(Section 4\.2\)\. Combining,

qtπ≤ρ¯tprobe,π​exp⁡\[−ℐtevid,π\]=q¯t,mechπ\.q\_\{t\}^\{\\pi\}\\leq\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\]=\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\.
Section 1 givesqtπ≤q¯t,rawπq\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}on its own good event\. On the joint good event,

qtπ≤min\{q¯t,rawπ,q¯t,mechπ\}=q¯t,ctrlπ\.□q\_\{t\}^\{\\pi\}\\leq\\min\\\{\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\},\\,\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\\\}=\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}\.\\qquad\\square
If a mechanism factor is not separately calibrated, we set it to11; this can only relaxq¯t,mechπ\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}, so Eq\. \(9\) falls back toq¯t,rawπ\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}without losing validity\.

### 6\. Proposition 2\.9: ContextAcquire as a Calibrated Hazard Envelope

Statement\.Under the splitDhistmem→Dcal→DtestD\_\{\\mathrm\{hist\}\}^\{\\mathrm\{mem\}\}\\to D\_\{\\mathrm\{cal\}\}\\to D\_\{\\mathrm\{test\}\}and a frozen memory snapshotM⋆M^\{\\star\}, ContextAcquire\(k\)\(k\)assignsh¯t\+←h¯tπ​\(M⋆\)\\underline\{h\}\_\{t\+\}\\leftarrow\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\), where

h¯tπ​\(M⋆\)=LCBBinom​\(stM⋆,π,ntM⋆,π,ηh\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)=\\mathrm\{LCB\}\_\{\\mathrm\{Binom\}\}\\\!\\left\(s\_\{t\}^\{M^\{\\star\},\\pi\},\\,n\_\{t\}^\{M^\{\\star\},\\pi\},\\,\\eta\_\{h\}\\right\)
is a Clopper–Pearson LCB estimated on calibration traces usingM⋆M^\{\\star\}\. The incrementΔ​ht​\(k,Zt\):=h¯tπ​\(M⋆\)−h¯tπ,∅\\Delta h\_\{t\}\(k,Z\_\{t\}\):=\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)\-\\underline\{h\}\_\{t\}^\{\\pi,\\varnothing\}is the snapshot\-on/off LCB gap, used only as a deployment diagnostic; the certificate usesh¯tπ​\(M⋆\)\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)directly\.

Proof\.Fixπ\\pi,tt, andM⋆M^\{\\star\}\. The split ensures thatM⋆M^\{\\star\}is constructed fromDhistmemD\_\{\\mathrm\{hist\}\}^\{\\mathrm\{mem\}\}and used read\-only onDcalD\_\{\\mathrm\{cal\}\}andDtestD\_\{\\mathrm\{test\}\}, so the evaluation rows do not leak into memory construction\. IfM⋆M^\{\\star\}or any other manifest entry is selected from a finite set onDcalD\_\{\\mathrm\{cal\}\}, all candidates are absorbed into the union bound of Theorem 2\.6; the present statement is for a fixedM⋆M^\{\\star\}\.

LetntM⋆,πn\_\{t\}^\{M^\{\\star\},\\pi\}be the number of active calibration rows usingM⋆M^\{\\star\}andstM⋆,πs\_\{t\}^\{M^\{\\star\},\\pi\}the count of clean successes\. The deployment hazard is

htπ​\(M⋆\):=Prπ⁡\(St=1∣At,M⋆\)\.h\_\{t\}^\{\\pi\}\(M^\{\\star\}\):=\\Pr\_\{\\pi\}\(S\_\{t\}=1\\mid A\_\{t\},\\,M^\{\\star\}\)\.
By Assumption 2\.3 and Section 0\.5, on\{At=1\}\\\{A\_\{t\}=1\\\}the indicatorStS\_\{t\}is i\.i\.d\. Bernoulli with parameterhtπ​\(M⋆\)h\_\{t\}^\{\\pi\}\(M^\{\\star\}\), and the Clopper–Pearson LCB satisfies

Pr⁡\{htπ​\(M⋆\)≥h¯tπ​\(M⋆\)\}≥1−ηh\.\\Pr\\\{h\_\{t\}^\{\\pi\}\(M^\{\\star\}\)\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)\\\}\\geq 1\-\\eta\_\{h\}\.
Union\-bounding over steps and any manifest\-level memory/controller choices yields the simultaneous LCB\. Assigningh¯t\+←h¯tπ​\(M⋆\)\\underline\{h\}\_\{t\+\}\\leftarrow\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)givesht\+≥h¯tπ​\(M⋆\)h\_\{t\}^\{\+\}\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\(M^\{\\star\}\)\. The certificate \(Corollary 2\.10, Theorem 2\.6\) uses this LCB directly, not the gapΔ​ht​\(k,Zt\)\\Delta h\_\{t\}\(k,Z\_\{t\}\)\.□\\qquad\\square

### 7\. Corollary 2\.10: Stopped Clean\-Before\-False Certificate

Statement\.Fix a frozen controllerπ\\pi\. Assume the stopped\-process convention of Section 0\.1 and the initial\-activity conditionPrπ⁡\(A1\)=1\\Pr\_\{\\pi\}\(A\_\{1\}\)=1\. On a calibration good event giving simultaneous active\-step boundsqtπ≤q¯tπq\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t\}^\{\\pi\}andhtπ≥h¯tπh\_\{t\}^\{\\pi\}\\geq\\underline\{h\}\_\{t\}^\{\\pi\}for allt≤Tt\\leq T, a fresh evaluation trajectory satisfies

Prπ⁡\(τS≤T,τS<τF\)≥CT​\(π\):=\[∏t=1T\(1−q¯tπ\)−∏t=1T\(1−h¯tπ\)\]\+\.\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)\\geq C\_\{T\}\(\\pi\):=\\left\[\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t\}^\{\\pi\}\)\-\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\)\\right\]\_\{\+\}\.
Proof\.The argument uses only the chain rule on stopped events; no across\-step independence is required\.

*Step 1 \(no\-false survival\)\.*The event\{Ft=1\}\\\{F\_\{t\}=1\\\}implies the trajectory is active attt, hence\{Ft=1\}⊆At⊆\{τF\>t−1\}\\\{F\_\{t\}=1\\\}\\subseteq A\_\{t\}\\subseteq\\\{\\tau\_\{F\}\>t\-1\\\}\. For eventsE⊆B⊆CE\\subseteq B\\subseteq CwithPr⁡\(B\)\>0\\Pr\(B\)\>0,

Pr⁡\(E∣C\)=Pr⁡\(E\)Pr⁡\(C\)≤Pr⁡\(E\)Pr⁡\(B\)=Pr⁡\(E∣B\),\\Pr\(E\\mid C\)=\\frac\{\\Pr\(E\)\}\{\\Pr\(C\)\}\\leq\\frac\{\\Pr\(E\)\}\{\\Pr\(B\)\}=\\Pr\(E\\mid B\),
so

Prπ⁡\(Ft=1​∣τF\>​t−1\)≤Prπ⁡\(Ft=1∣At\)=qtπ≤q¯tπ\.\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid\\tau\_\{F\}\>t\-1\)\\leq\\Pr\_\{\\pi\}\(F\_\{t\}=1\\mid A\_\{t\}\)=q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t\}^\{\\pi\}\.
IfPr⁡\(At\)=0\\Pr\(A\_\{t\}\)=0, Section 0\.4 sets the conditional to0and the inequality remains valid\. Iterating the chain rule on\{τF\>t\}=\{τF\>t−1\}∩\{Ft=0\}\\\{\\tau\_\{F\}\>t\\\}=\\\{\\tau\_\{F\}\>t\-1\\\}\\cap\\\{F\_\{t\}=0\\\},

Prπ⁡\(τF\>T\)=∏t=1TPrπ⁡\(Ft=0​∣τF\>​t−1\)≥∏t=1T\(1−q¯tπ\)\.\\Pr\_\{\\pi\}\(\\tau\_\{F\}\>T\)=\\prod\_\{t=1\}^\{T\}\\Pr\_\{\\pi\}\(F\_\{t\}=0\\mid\\tau\_\{F\}\>t\-1\)\\geq\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t\}^\{\\pi\}\)\.
*Step 2 \(still\-active upper bound\)\.*SinceAt\+1=At∩\{St=0,Ft=0\}⊆At∩\{St=0\}A\_\{t\+1\}=A\_\{t\}\\cap\\\{S\_\{t\}=0,\\,F\_\{t\}=0\\\}\\subseteq A\_\{t\}\\cap\\\{S\_\{t\}=0\\\},

Prπ⁡\(At\+1∣At\)≤Prπ⁡\(St=0∣At\)=1−htπ≤1−h¯tπ\.\\Pr\_\{\\pi\}\(A\_\{t\+1\}\\mid A\_\{t\}\)\\leq\\Pr\_\{\\pi\}\(S\_\{t\}=0\\mid A\_\{t\}\)=1\-h\_\{t\}^\{\\pi\}\\leq 1\-\\underline\{h\}\_\{t\}^\{\\pi\}\.
Iterating, and usingPrπ⁡\(A1\)=1\\Pr\_\{\\pi\}\(A\_\{1\}\)=1from the Section 0\.1 initial\-activity assumption together with the inclusionAT\+1⊆AT⊆⋯⊆A1A\_\{T\+1\}\\subseteq A\_\{T\}\\subseteq\\cdots\\subseteq A\_\{1\},

Prπ⁡\(AT\+1\)=∏t=1TPrπ⁡\(At\+1∣At\)≤∏t=1T\(1−h¯tπ\)\.\\Pr\_\{\\pi\}\(A\_\{T\+1\}\)=\\prod\_\{t=1\}^\{T\}\\Pr\_\{\\pi\}\(A\_\{t\+1\}\\mid A\_\{t\}\)\\leq\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\)\.
*Step 3 \(decomposition\)\.*We claim the disjoint identity

\{τF\>T\}=\{τS≤T,τS<τF\}​∪˙​AT\+1\.\\\{\\tau\_\{F\}\>T\\\}=\\\{\\tau\_\{S\}\\leq T,\\;\\tau\_\{S\}<\\tau\_\{F\}\\\}\\;\\dot\{\\cup\}\\;A\_\{T\+1\}\.
To verify it, fix a trajectory in\{τF\>T\}\\\{\\tau\_\{F\}\>T\\\}and split on whetherτS≤T\\tau\_\{S\}\\leq TorτS\>T\\tau\_\{S\}\>T:

\(i\) IfτS≤T\\tau\_\{S\}\\leq TthenτS≤T<τF\\tau\_\{S\}\\leq T<\\tau\_\{F\}, so the trajectory lies in\{τS≤T,τS<τF\}\\\{\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\\\}\.

\(ii\) IfτS\>T\\tau\_\{S\}\>TthenτS\>T\\tau\_\{S\}\>TandτF\>T\\tau\_\{F\}\>T, hence the trajectory lies inAT\+1=\{τS\>T,τF\>T\}A\_\{T\+1\}=\\\{\\tau\_\{S\}\>T,\\,\\tau\_\{F\}\>T\\\}\.

For the reverse containment, note that the two right\-hand events are disjoint: inAT\+1A\_\{T\+1\}we haveτS\>T\\tau\_\{S\}\>T, contradictingτS≤T\\tau\_\{S\}\\leq T\. To see that every trajectory in\{τS≤T,τS<τF\}\\\{\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\\\}also lies in\{τF\>T\}\\\{\\tau\_\{F\}\>T\\\}, we invoke the stopped\-process convention of Section 0\.1: on\{τS<τF\}\\\{\\tau\_\{S\}<\\tau\_\{F\}\\\},τS\\tau\_\{S\}is the first stop, soFt=0F\_\{t\}=0at every active stept<τSt<\\tau\_\{S\}\(by minimality ofτS\\tau\_\{S\}\) andFt=0F\_\{t\}=0at every stept\>τSt\>\\tau\_\{S\}\(by the stopped\-process zeroing\); the within\-step exclusivity ofFτSF\_\{\\tau\_\{S\}\}andSτSS\_\{\\tau\_\{S\}\}viaBtϵB\_\{t\}^\{\\epsilon\}\(Section 0\.1\) givesFτS=0F\_\{\\tau\_\{S\}\}=0\. HenceτF=inf\{t:Ft=1\}=\+∞\>T\\tau\_\{F\}=\\inf\\\{t:F\_\{t\}=1\\\}=\+\\infty\>T, and the trajectory is in\{τF\>T\}\\\{\\tau\_\{F\}\>T\\\}\. Without this convention the identity fails \(e\.g\.,τS=2,τF=3,T=5\\tau\_\{S\}=2,\\,\\tau\_\{F\}=3,\\,T=5would satisfyτS≤T,τS<τF\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}butτF≤T\\tau\_\{F\}\\leq T\)\.

Therefore

Prπ⁡\(τS≤T,τS<τF\)=Prπ⁡\(τF\>T\)−Prπ⁡\(AT\+1\)\.\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)=\\Pr\_\{\\pi\}\(\\tau\_\{F\}\>T\)\-\\Pr\_\{\\pi\}\(A\_\{T\+1\}\)\.
Combining \(14\) and \(15\) and taking the positive part,

Prπ\(τS≤T,τS<τF\)≥\[∏t=1T\(1−q¯tπ\)−∏t=1T\(1−h¯tπ\)\]\+\.□\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)\\geq\\left\[\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t\}^\{\\pi\}\)\-\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\)\\right\]\_\{\+\}\.\\qquad\\square
*Specialization toT=1T=1\.*C1​\(π\)=\[h¯1π−q¯1π\]\+C\_\{1\}\(\\pi\)=\[\\underline\{h\}\_\{1\}^\{\\pi\}\-\\bar\{q\}\_\{1\}^\{\\pi\}\]\_\{\+\}, recovering the Pass@1 form\.

### 8\. Theorem 2\.6: Simultaneously Valid Finite\-Manifest Calibration

Statement\.LetΠα\\Pi\_\{\\alpha\}be a finite, pre\-declared, non\-empty class of frozen controllers with1≤\|Πα\|<∞1\\leq\|\\Pi\_\{\\alpha\}\|<\\infty, declared beforeDcalD\_\{\\mathrm\{cal\}\}is opened\. Eachπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}specifies the stop/refine rule, risk\-probe invocation policy, evidence intensitymm, context intensitykk, prompts, routing and deduplication rules, decoding parameters, retrievable memory snapshotM⋆M^\{\\star\}, write modeωwrite\\omega\_\{\\mathrm\{write\}\}, and \(when a threshold gate is used\) a finite non\-empty gridΘt\\Theta\_\{t\}with1≤\|Θt\|<∞1\\leq\|\\Theta\_\{t\}\|<\\inftyper step\. With probability at least1−δ1\-\\deltaover calibration, the UCB/LCB quantities entering

CTctrl​\(π\):=\[∏t=1T\(1−q¯t,ctrlπ\)−∏t=1T\(1−h¯tπ​\(M\)\)\]\+C\_\{T\}^\{\\mathrm\{ctrl\}\}\(\\pi\):=\\left\[\\prod\_\{t=1\}^\{T\}\(1\-\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}\)\-\\prod\_\{t=1\}^\{T\}\(1\-\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)\)\\right\]\_\{\+\}
are simultaneously valid for everyπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}, so the data\-dependent selection

π⋆∈arg⁡maxπ∈Πα⁡CTctrl​\(π\)\\pi^\{\\star\}\\in\\arg\\max\_\{\\pi\\in\\Pi\_\{\\alpha\}\}C\_\{T\}^\{\\mathrm\{ctrl\}\}\(\\pi\)
inherits a post\-selection\-valid lower boundCTctrl​\(π⋆\)C\_\{T\}^\{\\mathrm\{ctrl\}\}\(\\pi^\{\\star\}\)onPrπ⋆⁡\(τS≤T,τS<τF\)\\Pr\_\{\\pi^\{\\star\}\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)\.

#### 8\.1 Confidence budget

Choose non\-negativeδraw,δgate,δprobe,δevid\\delta\_\{\\mathrm\{raw\}\},\\delta\_\{\\mathrm\{gate\}\},\\delta\_\{\\mathrm\{probe\}\},\\delta\_\{\\mathrm\{evid\}\}with

δraw\+δgate\+δprobe\+δevid≤δ\.\\delta\_\{\\mathrm\{raw\}\}\+\\delta\_\{\\mathrm\{gate\}\}\+\\delta\_\{\\mathrm\{probe\}\}\+\\delta\_\{\\mathrm\{evid\}\}\\leq\\delta\.
For eachπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}, allocate per\-row Clopper–Pearson confidences as

δraw2​T​\|Πα\|,δgateT​\|Θt\|​\|Πα\|,δprobeT​\|Πα\|,δevidT​\|Πα\|⋅Jmax\\frac\{\\delta\_\{\\mathrm\{raw\}\}\}\{2T\|\\Pi\_\{\\alpha\}\|\},\\quad\\frac\{\\delta\_\{\\mathrm\{gate\}\}\}\{T\|\\Theta\_\{t\}\|\|\\Pi\_\{\\alpha\}\|\},\\quad\\frac\{\\delta\_\{\\mathrm\{probe\}\}\}\{T\|\\Pi\_\{\\alpha\}\|\},\\quad\\frac\{\\delta\_\{\\mathrm\{evid\}\}\}\{T\|\\Pi\_\{\\alpha\}\|\\cdot J\_\{\\max\}\}
for the raw, gate, probe, and evidence bounds, respectively, withJmax=maxt⁡JtJ\_\{\\max\}=\\max\_\{t\}J\_\{t\}\(which equals11in the single\-channel implementation\)\. Substitutingδ=0\.10\\delta=0\.10,δ∙=0\.025\\delta\_\{\\bullet\}=0\.025,\|Πα\|=5\|\\Pi\_\{\\alpha\}\|=5,T=5T=5,\|Θt\|=1\|\\Theta\_\{t\}\|=1,Jmax=1J\_\{\\max\}=1gives the per\-row raw confidence0\.025/\(2⋅5⋅5\)=5×10−40\.025/\(2\\cdot 5\\cdot 5\)=5\\times 10^\{\-4\}used in the main text\.

#### 8\.2 Proof

For eachπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}define the good events

Graw​\(π\)\\displaystyle G\_\{\\mathrm\{raw\}\}\(\\pi\):∀t,qtπ≤q¯t,rawπ​and​htπ​\(M\)≥h¯tπ​\(M\),\\displaystyle:\\forall t,\\;q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}\\text\{ and \}h\_\{t\}^\{\\pi\}\(M\)\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\(M\),Ggate​\(π\)\\displaystyle G\_\{\\mathrm\{gate\}\}\(\\pi\):∀t,∀θ∈Θt,pt,θπ≤R¯tπ\(θ\),\\displaystyle:\\forall t,\\,\\forall\\theta\\in\\Theta\_\{t\},\\;p\_\{t,\\theta\}^\{\\pi\}\\leq\\bar\{R\}\_\{t\}^\{\\pi\}\(\\theta\),Gprobe​\(π\)\\displaystyle G\_\{\\mathrm\{probe\}\}\(\\pi\):∀t,Prπ⁡\(Btϵ,Wtprobe=1∣At\)≤ρ¯tprobe,π,\\displaystyle:\\forall t,\\;\\Pr\_\{\\pi\}\(B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\mid A\_\{t\}\)\\leq\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\},Gevid​\(π\)\\displaystyle G\_\{\\mathrm\{evid\}\}\(\\pi\):∀t,∀j≤Jt,Prπ\(Et,j=1∣⋯\)≤bt,jπ\.\\displaystyle:\\forall t,\\,\\forall j\\leq J\_\{t\},\\;\\Pr\_\{\\pi\}\(E\_\{t,j\}=1\\mid\\cdots\)\\leq b\_\{t,j\}^\{\\pi\}\.
The per\-row allocations of Section 8\.1 give

Pr⁡\(Graw​\(π\)c\)\\displaystyle\\Pr\(G\_\{\\mathrm\{raw\}\}\(\\pi\)^\{c\}\)≤δraw\|Πα\|,\\displaystyle\\leq\\tfrac\{\\delta\_\{\\mathrm\{raw\}\}\}\{\|\\Pi\_\{\\alpha\}\|\},Pr⁡\(Ggate​\(π\)c\)\\displaystyle\\Pr\(G\_\{\\mathrm\{gate\}\}\(\\pi\)^\{c\}\)≤δgate\|Πα\|,\\displaystyle\\leq\\tfrac\{\\delta\_\{\\mathrm\{gate\}\}\}\{\|\\Pi\_\{\\alpha\}\|\},Pr⁡\(Gprobe​\(π\)c\)\\displaystyle\\Pr\(G\_\{\\mathrm\{probe\}\}\(\\pi\)^\{c\}\)≤δprobe\|Πα\|,\\displaystyle\\leq\\tfrac\{\\delta\_\{\\mathrm\{probe\}\}\}\{\|\\Pi\_\{\\alpha\}\|\},Pr⁡\(Gevid​\(π\)c\)\\displaystyle\\Pr\(G\_\{\\mathrm\{evid\}\}\(\\pi\)^\{c\}\)≤δevid\|Πα\|\.\\displaystyle\\leq\\tfrac\{\\delta\_\{\\mathrm\{evid\}\}\}\{\|\\Pi\_\{\\alpha\}\|\}\.
Let

G:=⋂π∈Πα\(Graw​\(π\)∩Ggate​\(π\)∩Gprobe​\(π\)∩Gevid​\(π\)\)\.G:=\\bigcap\_\{\\pi\\in\\Pi\_\{\\alpha\}\}\\\!\\bigl\(G\_\{\\mathrm\{raw\}\}\(\\pi\)\\cap G\_\{\\mathrm\{gate\}\}\(\\pi\)\\cap G\_\{\\mathrm\{probe\}\}\(\\pi\)\\cap G\_\{\\mathrm\{evid\}\}\(\\pi\)\\bigr\)\.
A union bound overΠα\\Pi\_\{\\alpha\}, steps, threshold grids, mechanism factors, and channels gives

Pr⁡\(G\)≥1−\(δraw\+δgate\+δprobe\+δevid\)≥1−δ\.\\Pr\(G\)\\geq 1\-\(\\delta\_\{\\mathrm\{raw\}\}\+\\delta\_\{\\mathrm\{gate\}\}\+\\delta\_\{\\mathrm\{probe\}\}\+\\delta\_\{\\mathrm\{evid\}\}\)\\geq 1\-\\delta\.
OnGG, for anyπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}and anyt≤Tt\\leq T:

1. \(1\)qtπ≤q¯t,rawπq\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\}fromGraw​\(π\)G\_\{\\mathrm\{raw\}\}\(\\pi\)\.
2. \(2\)qtπ≤ρ¯tprobe,π​exp⁡\[−ℐtevid,π\]=q¯t,mechπq\_\{t\}^\{\\pi\}\\leq\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\},\\pi\}\\exp\[\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}\]=\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}from Section 5 usingGprobe​\(π\)∩Gevid​\(π\)G\_\{\\mathrm\{probe\}\}\(\\pi\)\\cap G\_\{\\mathrm\{evid\}\}\(\\pi\)\.
3. \(3\)Thereforeqtπ≤q¯t,ctrlπ:=min⁡\{q¯t,rawπ,q¯t,mechπ\}q\_\{t\}^\{\\pi\}\\leq\\bar\{q\}\_\{t,\\mathrm\{ctrl\}\}^\{\\pi\}:=\\min\\\{\\bar\{q\}\_\{t,\\mathrm\{raw\}\}^\{\\pi\},\\,\\bar\{q\}\_\{t,\\mathrm\{mech\}\}^\{\\pi\}\\\}\.
4. \(4\)htπ​\(M\)≥h¯tπ​\(M\)h\_\{t\}^\{\\pi\}\(M\)\\geq\\underline\{h\}\_\{t\}^\{\\pi\}\(M\)fromGraw​\(π\)G\_\{\\mathrm\{raw\}\}\(\\pi\)\.

Corollary 2\.10 then givesPrπ⁡\(τS≤T,τS<τF\)≥CTctrl​\(π\)\\Pr\_\{\\pi\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)\\geq C\_\{T\}^\{\\mathrm\{ctrl\}\}\(\\pi\)for everyπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}\. SinceGGholds simultaneously for allπ∈Πα\\pi\\in\\Pi\_\{\\alpha\}, any data\-dependentπ⋆\\pi^\{\\star\}in this finite class also satisfies

Prπ⋆⁡\(τS≤T,τS<τF\)≥CTctrl​\(π⋆\)\.\\Pr\_\{\\pi^\{\\star\}\}\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\)\\geq C\_\{T\}^\{\\mathrm\{ctrl\}\}\(\\pi^\{\\star\}\)\.
Ifπ⋆\\pi^\{\\star\}’s manifest includes a score\-threshold admission rule,Ggate​\(π⋆\)G\_\{\\mathrm\{gate\}\}\(\\pi^\{\\star\}\)provides the Theorem 2\.5 guarantee at the corresponding threshold\.□\\qquad\\square

### 9\. Monotonicity of the Certificate

In the unclipped regime, define

CT=∏s=1T\(1−q¯s\)−∏s=1T\(1−h¯s\),q¯t=ρ¯tprobe​exp⁡\(−ℐtevid\)\.C\_\{T\}=\\prod\_\{s=1\}^\{T\}\(1\-\\bar\{q\}\_\{s\}\)\-\\prod\_\{s=1\}^\{T\}\(1\-\\underline\{h\}\_\{s\}\),\\qquad\\bar\{q\}\_\{t\}=\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\}\}\\exp\(\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}\)\.
We have

∂CT∂q¯t=−∏s≠t\(1−q¯s\)≤0,\\frac\{\\partial C\_\{T\}\}\{\\partial\\bar\{q\}\_\{t\}\}=\-\\prod\_\{s\\neq t\}\(1\-\\bar\{q\}\_\{s\}\)\\leq 0,
so any reduction inq¯t\\bar\{q\}\_\{t\}does not decreaseCTC\_\{T\}\. Since

∂q¯t∂ρ¯tprobe=exp⁡\(−ℐtevid\)≥0,∂q¯t∂ℐtevid=−q¯t≤0,\\frac\{\\partial\\bar\{q\}\_\{t\}\}\{\\partial\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\}\}\}=\\exp\(\-\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}\)\\geq 0,\\qquad\\frac\{\\partial\\bar\{q\}\_\{t\}\}\{\\partial\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}\}=\-\\bar\{q\}\_\{t\}\\leq 0,
it follows that

∂CT∂ρ¯tprobe≤0,∂CT∂ℐtevid≥0,∂CT∂h¯t=∏s≠t\(1−h¯s\)≥0\.\\frac\{\\partial C\_\{T\}\}\{\\partial\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\}\}\}\\leq 0,\\qquad\\frac\{\\partial C\_\{T\}\}\{\\partial\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}\}\\geq 0,\\qquad\\frac\{\\partial C\_\{T\}\}\{\\partial\\underline\{h\}\_\{t\}\}=\\prod\_\{s\\neq t\}\(1\-\\underline\{h\}\_\{s\}\)\\geq 0\.
RiskProbe contractsρ¯tprobe\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\}\}, EvidenceAcquire increasesℐtevid\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}, and ContextAcquire raisesh¯t\\underline\{h\}\_\{t\}, each weakly increasing the certificate\. The same conclusions hold in the clipped regime by subgradient monotonicity of\[⋅\]\+\[\\cdot\]\_\{\+\}\.

### 10\. Summary of Structural Conditions

1. \(1\)Across\-trajectory independence at fixedttis required; within\-trajectory step independence is not \(Section 0\.3\)\.
2. \(2\)The suite\-level Bernoulli treatment does not require independence across themmgenerated tests \(Section 4\.1\); test\-level independence is needed only for sequential channel accumulation \(Section 4\.2\)\.
3. \(3\)The Hypothesis Prober contributes only toρ¯tprobe\\bar\{\\rho\}\_\{t\}^\{\\mathrm\{probe\}\}; Solution Validator and Test Augmentation contribute only toℐtevid\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\}\}\(Sections 3, 4, 5\)\.
4. \(4\)Finite grids and finite manifests give post\-selection validity \(Theorems 2\.5, 2\.6\)\.
5. \(5\)Empty cells use the conservative conventionsUCB​\(0,0,η\)=1\\mathrm\{UCB\}\(0,0,\\eta\)=1andLCB​\(0,0,η\)=0\\mathrm\{LCB\}\(0,0,\\eta\)=0\(Section 0\.5\)\.
6. \(6\)Memory must be frozen, or all candidate snapshots, routes, and retrieval budgets must be absorbed intoΠα\\Pi\_\{\\alpha\}\(Section 6\)\.
7. \(7\)Corollary 2\.10 lower\-boundsPr⁡\(τS≤T,τS<τF\)\\Pr\(\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\), which is strictly stricter than aggregate Refine@TTor Pass@1\.
8. \(8\)FtF\_\{t\}andStS\_\{t\}are mutually exclusive by construction;τS=τF\\tau\_\{S\}=\\tau\_\{F\}does not occur \(Section 0\.1\)\.
9. \(9\)The mechanism containment requiresΓt=1⇒Wtprobe=Wtevid=1\\Gamma\_\{t\}=1\\Rightarrow W\_\{t\}^\{\\mathrm\{probe\}\}=W\_\{t\}^\{\\mathrm\{evid\}\}=1onAtA\_\{t\}; without it, only the raw UCB applies \(Section 0\.2\)\.
10. \(10\)The score mapϕπ\\phi\_\{\\pi\}, threshold gridΘt\\Theta\_\{t\}, aggregation ruleγevid\\gamma\_\{\\mathrm\{evid\}\}, memory snapshotM⋆M^\{\\star\}, and any calibrated scalar inrtr\_\{t\}must be pre\-declared or absorbed intoΠα\\Pi\_\{\\alpha\}before label counting \(Section 0\.6\)\.
11. \(11\)Evidence accumulates over channels acting on the same admission decision, not over refinement steps\. The CP\-Agent implementation uses a single channel per step, soℐtevid,π=Δ​It​\(m\)\\mathcal\{I\}\_\{t\}^\{\\mathrm\{evid\},\\pi\}=\\Delta I\_\{t\}\(m\)\(Section 4\.2\)\.
12. \(12\)The confidence budget satisfiesδraw\+δgate\+δprobe\+δevid≤δ\\delta\_\{\\mathrm\{raw\}\}\+\\delta\_\{\\mathrm\{gate\}\}\+\\delta\_\{\\mathrm\{probe\}\}\+\\delta\_\{\\mathrm\{evid\}\}\\leq\\delta\(Section 8\.1\)\.

### 11\. Compact Proof Skeleton

1. \(1\)*Lemma \(Clopper–Pearson with random active denominator\)\.*Conditioning on the active\-row count, the CP UCB/LCB is valid; a union bound covers steps, grids, and manifests\.
2. \(2\)*Theorem 2\.5\.*For each\(t,θ\)\(t,\\theta\), calibratept,θ=Pr⁡\(Ft=1∣At,rt≤θ\)p\_\{t,\\theta\}=\\Pr\(F\_\{t\}=1\\mid A\_\{t\},\\,r\_\{t\}\\leq\\theta\); the finite\-grid simultaneous UCB yields a post\-selection\-valid threshold\.
3. \(3\)*Proposition 2\.7\.*On active rows,𝟏​\{Btϵ,Wtprobe=1\}\\mathbf\{1\}\\\{B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\}=1\\\}is Bernoulli; the chain rule gives the factorizationσtbad​ρt\\sigma\_\{t\}^\{\\mathrm\{bad\}\}\\rho\_\{t\}\.
4. \(4\)*Proposition 2\.8\.*On bad\-and\-probe\-surviving rows, calibrate the suite\-level Bernoulli survival;Δ​I=−log⁡b¯\\Delta I=\-\\log\\bar\{b\}requires no independence across themmtests; channel\-level accumulation requires sequential conditional calibration\.
5. \(5\)*Eq\. \(9\)\.*The chain rule on\{Ft=1\}⊆\{Btϵ,Wtprobe,Wtevid\}\\\{F\_\{t\}=1\\\}\\subseteq\\\{B\_\{t\}^\{\\epsilon\},\\,W\_\{t\}^\{\\mathrm\{probe\}\},\\,W\_\{t\}^\{\\mathrm\{evid\}\}\\\}onAtA\_\{t\}givesqtπ≤ρ¯​exp⁡\(−ℐ\)q\_\{t\}^\{\\pi\}\\leq\\bar\{\\rho\}\\exp\(\-\\mathcal\{I\}\); the minimum with the raw UCB is again a UCB\.
6. \(6\)*Proposition 2\.9\.*With frozenM⋆M^\{\\star\}, the active\-success hazard admits the CP LCB; the snapshot\-on/off gap is a diagnostic, not a certificate component\.
7. \(7\)*Corollary 2\.10\.*No\-false survival lower bound, still\-active upper bound, and disjoint decomposition\{τF\>T\}=\{τS≤T,τS<τF\}​∪˙​AT\+1\\\{\\tau\_\{F\}\>T\\\}=\\\{\\tau\_\{S\}\\leq T,\\,\\tau\_\{S\}<\\tau\_\{F\}\\\}\\dot\{\\cup\}A\_\{T\+1\}\.
8. \(8\)*Theorem 2\.6\.*All UCB/LCB quantities are simultaneously valid over the finite manifest; the argmax inherits the bound\.

## Appendix DAdditional Empirical Analyses

This appendix probes the main results along seven diagnostic axes: whether Test Augmentation produces a useful signal at all \(§[D\.1](https://arxiv.org/html/2605.24693#A4.SS1)\); whether Dual\-Granularity Verification absorbs the TA noise we measure \(§[D\.2](https://arxiv.org/html/2605.24693#A4.SS2)\); whether the gains are an artifact of weak backbones \(§[D\.3](https://arxiv.org/html/2605.24693#A4.SS3)\); whether the cost–accuracy position is genuinely on the efficiency frontier \(§[D\.4](https://arxiv.org/html/2605.24693#A4.SS4)\); where the tokens actually go \(§[D\.5](https://arxiv.org/html/2605.24693#A4.SS5)\); how the tool budget shifts with problem difficulty \(§[D\.6](https://arxiv.org/html/2605.24693#A4.SS6)\); and whether Experience’s hazard lift translates into observable per\-problem savings in refinement steps and cost \(§[D\.7](https://arxiv.org/html/2605.24693#A4.SS7)\)\. All seven analyses reuse the frozen manifestπ⋆\\pi^\{\\star\}from Section[4](https://arxiv.org/html/2605.24693#S4), leavingΠα\\Pi\_\{\\alpha\}andCTctrl​\(π⋆\)C\_\{T\}^\{\\rm ctrl\}\(\\pi^\{\\star\}\)unchanged; the detailed proofs of all theorems, propositions, and corollaries are provided in Appendix[C](https://arxiv.org/html/2605.24693#A3)\.

### D\.1\.Test\-Augmentation Diagnostics

This analysis evaluates whether Test Augmentation \(TA\) provides useful candidate\-level evidence beyond public tests\. To avoid circular validation, the augmented tests were generated byCP\-Agentwith the DeepSeek\-V3\.2\-Chat backbone, while the evaluated candidate solutions were produced independently by DeepSeek\-V3\.2\-Reasoner on the same LCB\-Pro problems\.

##### TA label accuracy\.

The diagnostic pool contains 117 problems where TA successfully generated augmented tests and the candidate solution had already passed all public tests\. Among them,5151candidates are actually incorrect and6666are actually correct according to the online judge\. A positive prediction means the augmented tests flag the candidate as buggy\.

Pred\. BuggyPred\. CorrectTotalActually IncorrectTP=40=40FN=11=1151Actually CorrectFP=19=19TN=47=4766Total5958117
MetricValueAccuracy87/117=74\.4%87/117=74\.4\\%Precision40/59=67\.8%40/59=67\.8\\%Recall40/51=78\.4%40/51=78\.4\\%F172\.7%72\.7\\%Specificity47/66=71\.2%47/66=71\.2\\%FPR19/66=28\.8%19/66=28\.8\\%FNR11/51=21\.6%11/51=21\.6\\%

Table 7\.Confusion matrix and derived metrics for the quality of TA\-generated augmented test cases\. Augmented tests are produced byCP\-Agentwith the DeepSeek\-V3\.2\-Chat backbone; the evaluated candidates are produced independently by DeepSeek\-V3\.2\-Reasoner on the same LCB\-Pro problems\.Although TA is an imperfect verifier, it improves the quality of the accepted set in this diagnostic pool\. Before TA filtering, the hidden\-bug rate is51/117=43\.6%51/117=43\.6\\%\. Among candidates that survive TA, the hidden\-bug rate drops to11/58=19\.0%11/58=19\.0\\%, so the post\-survival correctness rate is47/58=81\.0%47/58=81\.0\\%\.

##### Test\-case diversity\.

We estimate diversity by randomly samplingkkgenerated test cases from each TA suite and measuring whether at least one sampled case exposes the incorrect candidate\. The evaluation is conducted on the5151actually\-incorrect candidates from the diagnostic pool\. Results are mean±\\pmstandard deviation over100100random draws\.

kk\(sampled cases\)Bug detection rate137\.6%±\\pm5\.8%251\.4%±\\pm5\.1%358\.9%±\\pm4\.4%565\.7%±\\pm3\.3%970\.0%±\\pm2\.0%1574\.5%±\\pm0\.8%Table 8\.Bug\-detection rate when randomly samplingkkaugmented test cases per problem \(LCB\-Pro, DeepSeek\-V3\.2\-Chat backbone\)\.Detection grows steadily withkk, from37\.6%37\.6\\%atk=1k\{=\}1to74\.5%74\.5\\%atk=15k\{=\}15\. The fact that additional samples keep adding bug\-detection coverage—rather than saturating after the first one or two cases—indicates that TA cases cover complementary failure modes rather than only duplicating the same signal\.

### D\.2\.Dual\-Granularity Verification Robustness

This analysis targets the main failure mode exposed by the TA diagnostic: verifier\-level false positives\. A false positive occurs when a correct candidate fails generated augmented tests because the generated expected outputs or test conditions are noisy\.

##### Counterfactual study\.

We examine the1919verifier\-level false\-positive cases from the TA label\-accuracy diagnostic\. The comparison starts from the same intermediate candidate/state and runs the fullCP\-Agentrefinement loop with TA versus without TA\.

Counterfactual outcome on 19 FP casesCountSuccessful with or without TA14Failed with or without TA5Succeeded without TA but failed with TA0Table 9\.End\-to\-end counterfactual analysis on the1919verifier\-level TA false\-positive cases\.In these cases, verifier\-level TA false positives do not produce observed end\-to\-end harm: no case succeeds without TA but fails with TA\. Among the1414cases that remain successful in both conditions, TA introduces5\.645\.64additional steps on average to recover\.

##### Why recovery is possible\.

Dual\-Granularity Verification \(DGV\) givesCP\-Agenttwo independent executable feedback channels:

- •Solution Validator \(SV\) and TA surface program\-level failure signals\.
- •Hypothesis Prober \(HP\) checks local claims, brute\-force comparisons, edge cases, and alternative failure hypotheses\.

When these channels disagree,CP\-Agentarbitrates the evidence rather than blindly trusting a single noisy verifier\.

##### Case study: Codeforces 2098B\.

![Refer to caption](https://arxiv.org/html/2605.24693v1/figs/extra/case_study_2098B.png)Figure 4\.Problem 2098B \(“Sasha and the Apartment Purchase”\) case study: a noisy TA alarm triggers extra HP validation but does not cause the agent to discard the correct formula\-based solution\.In Problem 2098B, “Sasha and the Apartment Purchase”, a correct formula\-based solution passes the public/basic case but fails several generated TA cases\. The agent treats the signal as a possible bug and performs multiple independent HP checks:

- •Step 18: first brute\-force check on small instances\.
- •Step 19: second brute\-force round with duplicates, edge cases, and re\-derivation\.
- •Steps 20–21: stress testing and extended counterexample search\.
- •Step 22: additional hypotheses such as ceil\-vs\-floor behavior and arithmetic issues\.
- •Step 29: no genuine counterexample is found, so the agent retains the formula\-based solution\.
- •Step 30: the final submission is accepted\.

This illustrates the practical role of DGV: a noisy TA alarm can trigger extra validation without causing the agent to discard a correct solution\.

### D\.3\.Frontier\-Backbone Transfer

This experiment appliesCP\-Agentto the stronger DeepSeek\-V3\.2\-Reasoner backbone\. The goal is to test whetherCP\-Agentonly helps weaker chat backbones or remains useful when the base model already has stronger reasoning ability\.

ModelICPC R@5\#TEasyMed\.HardAvg\.Reasoner \(base\)33\.91\.684\.524\.67\.758\.1CP\-Agent\+ Reasoner43\.21\.891\.831\.67\.764\.7Δ\\Delta\+9\.3–\+7\.3\+7\.00\.0\+6\.6Table 10\.CP\-Agentapplied on the frontier reasoning backbone DeepSeek\-V3\.2\-Reasoner\. ICPC R@5 is Refine@5 on ICPC\-Eval; \#T is the average number of refinement turns; Easy/Med\./Hard/Avg\. are Pass@1 on LCB\-Pro\.CP\-Agentimproves the Reasoner backbone by\+9\.3\+9\.3Refine@5 on ICPC\-Eval and\+6\.6\+6\.6average Pass@1 on LCB\-Pro\. Gains concentrate on Easy and Medium problems; Hard remains unchanged, consistent with the conclusion that hard problems require stronger high\-level algorithmic discovery beyond feedback utilization alone\.

### D\.4\.Efficiency and BCC Derivation

We analyze cost–performance trade\-offs with Data Envelopment Analysis \(DEA\)\(Charneset al\.,[1978](https://arxiv.org/html/2605.24693#bib.bib7)\), using the variable\-returns\-to\-scale Banker–Charnes–Cooper \(BCC\) model\(Bankeret al\.,[1984](https://arxiv.org/html/2605.24693#bib.bib5)\)\. The input is average API cost per problem, and the output is Pass@1\. A method is BCC\-efficient when no observed method, nor any convex combination of observed methods, can achieve at least the same output with no more input\. The multi\-backbone Pass@1\-versus\-cost view supporting this analysis is shown in the main\-text Figure[1](https://arxiv.org/html/2605.24693#S0.F1)\.

##### Input\-oriented BCC program\.

For each decision\-making unit \(DMU\)kkwith\(xk,yk\)\(x\_\{k\},y\_\{k\}\), the input\-oriented BCC efficiency is

minθ,\{λj\}\\displaystyle\\min\_\{\\theta,\\\{\\lambda\_\{j\}\\\}\}\\quadθ\\displaystyle\\thetas\.t\.∑jλj​xj≤θ​xk,\\displaystyle\\sum\_\{j\}\\lambda\_\{j\}x\_\{j\}\\leq\\theta\\,x\_\{k\},∑jλj​yj≥yk,\\displaystyle\\sum\_\{j\}\\lambda\_\{j\}y\_\{j\}\\geq y\_\{k\},∑jλj=1,\\displaystyle\\sum\_\{j\}\\lambda\_\{j\}=1,λj≥0,∀j\.\\displaystyle\\lambda\_\{j\}\\geq 0,\\ \\forall j\.The optimumθk⋆∈\(0,1\]\\theta\_\{k\}^\{\\star\}\\in\(0,1\]\.θk⋆=1\\theta\_\{k\}^\{\\star\}=1means the method lies on the BCC frontier;θk⋆<1\\theta\_\{k\}^\{\\star\}<1means its input could be proportionally reduced by\(1−θk⋆\)\(1\-\\theta\_\{k\}^\{\\star\}\)while preserving at least the same output\.

##### DeepSeek\-V3\.2\-Chat data\.

Let\(xj,yj\)\(x\_\{j\},y\_\{j\}\)denote\(Cost,Pass@1\)\(\\text\{Cost\},\\text\{Pass@1\}\)\.

MethodCostxxPass@1yyIn Tok\.Out Tok\.DeepSeek\-V3\.2\-Chat \(base\)0\.00116123625\.8892\.32,671\.9Reflexion0\.00679876137\.129,631\.613,100\.9CodeSim0\.01498427040\.141,883\.231,314\.0AlphaCodium0\.02304680126\.989,478\.145,552\.7CP\-Agent0\.03537718448\.5582,547\.123,549\.4Table 11\.Per\-method cost and Pass@1 on the DeepSeek\-V3\.2\-Chat backbone\.AlphaCodium is Pareto\-dominated by Reflexion because Reflexion has lower cost and higher Pass@1\. The VRS/BCC frontier for the remaining DeepSeek points is

DeepSeek\-V3\.2\-Chat \(base\)→Reflexion→CP\-Agent,\\text\{DeepSeek\-V3\.2\-Chat \(base\)\}\\rightarrow\\text\{Reflexion\}\\rightarrow\\textsc\{CP\-Agent\},and these three frontier points haveθ=1\\theta=1\.

##### Closed\-form one\-input one\-output computation\.

With one input, one output, and the BCC convexity constraint, any feasible reference point is a convex combination of observed DMUs\. For a target outputyybetween two frontier outputsy1≤y≤y2y\_\{1\}\\leq y\\leq y\_\{2\}, the minimal attainable input on the frontier is obtained by linear interpolation:

λ​\(y\)\\displaystyle\\lambda\(y\)=y−y1y2−y1,\\displaystyle=\\frac\{y\-y\_\{1\}\}\{y\_\{2\}\-y\_\{1\}\},x⋆​\(y\)\\displaystyle x^\{\\star\}\(y\)=\(1−λ​\(y\)\)​x1\+λ​\(y\)​x2,\\displaystyle=\(1\-\\lambda\(y\)\)\\,x\_\{1\}\+\\lambda\(y\)\\,x\_\{2\},θk\\displaystyle\\theta\_\{k\}=x⋆​\(yk\)xk\.\\displaystyle=\\frac\{x^\{\\star\}\(y\_\{k\}\)\}\{x\_\{k\}\}\.

##### CodeSim\.

CodeSim hasyS=40\.1y\_\{S\}=40\.1, between Reflexion \(37\.137\.1\) andCP\-Agent\(48\.548\.5\):

λS\\displaystyle\\lambda\_\{S\}=40\.1−37\.148\.5−37\.1=311\.4≈0\.2631579,\\displaystyle=\\tfrac\{40\.1\-37\.1\}\{48\.5\-37\.1\}=\\tfrac\{3\}\{11\.4\}\\approx 0\.2631579,x⋆​\(yS\)\\displaystyle x^\{\\star\}\(y\_\{S\}\)=0\.7368421⋅0\.006798761\+0\.2631579⋅0\.035377184\\displaystyle=0\.7368421\\cdot 0\.006798761\+0\.2631579\\cdot 0\.035377184=0\.014319399,\\displaystyle=0\.014319399,θS\\displaystyle\\theta\_\{S\}=0\.0143193990\.014984270≈0\.9556\.\\displaystyle=\\tfrac\{0\.014319399\}\{0\.014984270\}\\approx 0\.9556\.CodeSim is slightly inefficient under VRS/BCC and could reduce cost by about4\.44%4\.44\\%while maintaining the same Pass@1\.

##### AlphaCodium\.

AlphaCodium hasyA=26\.9y\_\{A\}=26\.9, between DeepSeek\-V3\.2\-Chat base \(25\.825\.8\) and Reflexion \(37\.137\.1\):

λA\\displaystyle\\lambda\_\{A\}=26\.9−25\.837\.1−25\.8=1\.111\.3≈0\.0973451,\\displaystyle=\\tfrac\{26\.9\-25\.8\}\{37\.1\-25\.8\}=\\tfrac\{1\.1\}\{11\.3\}\\approx 0\.0973451,x⋆​\(yA\)\\displaystyle x^\{\\star\}\(y\_\{A\}\)=0\.9026549⋅0\.001161236\+0\.0973451⋅0\.006798761\\displaystyle=0\.9026549\\cdot 0\.001161236\+0\.0973451\\cdot 0\.006798761=0\.00171002162,\\displaystyle=0\.00171002162,θA\\displaystyle\\theta\_\{A\}=0\.001710021620\.023046801≈0\.0742\.\\displaystyle=\\tfrac\{0\.00171002162\}\{0\.023046801\}\\approx 0\.0742\.

##### Final DeepSeek VRS/BCC efficiencies\.

MethodBCCθ\\thetaFrontier statusDeepSeek\-V3\.2\-Chat \(base\)1\.0000FrontierReflexion1\.0000FrontierCodeSim0\.9556InefficientAlphaCodium0\.0742InefficientCP\-Agent1\.0000FrontierTable 12\.Final input\-oriented VRS/BCC efficiencies for the DeepSeek\-V3\.2\-Chat methods\.The multi\-backbone frontier \(Fig\.[1](https://arxiv.org/html/2605.24693#S0.F1)\) extends this analysis to DeepSeek\-V3\.2\-Chat, GPT\-4o, and Qwen3\-235B\.CP\-Agentlies on the BCC frontier for all three backbones, whereas some baselines are frontier\-efficient only under particular backbones\.

### D\.5\.Per\-Component Token and Cost Breakdown

This table attributesCP\-Agent’s API token usage and cost across its four main components on LCB\-Pro\. Results use the DeepSeek\-V3\.2\-Chat backbone and average over167167problems\.

HP and SV are non\-LLM tools for code execution and validation; their token counts measure the associated LLM reasoning overhead, including accumulated multi\-turn conversation context\. TA and Experience are separate LLM calls with self\-contained prompts\. Costs are computed using DeepSeek API pricing\.

ComponentIn Tok\.Out Tok\.Cost% TotalHP \(Hypothesis Prober\)392,07617,898$0\.02569\.7%SV \(Solution Validator\)115,5322,826$0\.00617\.6%Exp \(Experience\)20,477706$0\.0013\.4%TA \(Test Augmentation\)9,078471$0\.0011\.7%Total \(tracked\)537,16321,901$0\.03392\.4%Table 13\.Per\-component API token usage ofCP\-Agenton LCB\-Pro \(DeepSeek\-V3\.2\-Chat backbone, averaged over167167problems\)\.The tracked components account for92\.4%92\.4\\%of total API cost\. The remaining∼7\.6%\\sim 7\.6\\%comes from problem reading, system prompts, and submission formatting\. HP dominates cost because it is invoked repeatedly inside the multi\-turn refinement loop, while TA and Experience together account for only5\.1%5\.1\\%of total cost\.

### D\.6\.Tool Use vs\. Problem Difficulty

Figure[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1)analyzes verification\-tool usage along the*interaction\-step*axis on a single backbone\. This subsection complements that view with the orthogonal*difficulty*axis: how much tool\-mediated effortCP\-Agentspends per problem as problems become harder\. The diagnostic uses the DeepSeek\-V3\.2\-Chat backbone on LCB\-Pro and counts the average number of HP/SV/TA invocations per problem, broken down by the benchmark’s Easy/Medium/Hard difficulty bands\.

![Refer to caption](https://arxiv.org/html/2605.24693v1/x5.png)Figure 5\.Average tool\-use frequency \(calls per problem\) on LCB\-Pro by difficulty band \(Easy/Medium/Hard\), broken down by tool channel \(HP, SV, TA\)\. DeepSeek\-V3\.2\-Chat backbone\. See body for the resulting per\-band totals and their interpretation\.##### Adaptive scaling on Easy/Medium\.

Compared with a fixed\-pipeline baseline \(each method calls every tool a constant number of times\),CP\-Agentissues5\.55\.5tool calls per problem on Easy and5\.15\.1on Medium\. The slight drop from Easy to Medium reflects two competing effects under the certificate machinery of Section[2](https://arxiv.org/html/2605.24693#S2): \(i\) Medium problems trigger longer chain\-of\-thought before an executable candidate is produced, so HP and SV start firing later in the trajectory; \(ii\) once an executable candidate is produced, the admission gateΓt\\Gamma\_\{t\}continues to require evidence, so tool calls remain non\-trivial\. Empirically the two effects roughly cancel, andCP\-Agentstill relies heavily on external feedback on Easy and Medium problems\.

##### Why Hard drops to3\.33\.3\.

On Hard problems, average tool calls drop to3\.33\.3\. This is consistent with two structural facts already documented in the main paper:

- •*Few executable candidates are reached\.*Hard problems often consume the budget on high\-level algorithmic discovery before any candidate is produced\. SV is only meaningful on executable candidates, and TA is only invoked when SV first returns pass on public tests; both channels stay neutral on trajectories that never reach this state\. This is the same structural barrier that produces the unchanged Hard column in Table[10](https://arxiv.org/html/2605.24693#A4.T10): refinement\-loop tools cannot substitute for missing high\-level ideas\.
- •*Public\-test\-only filter\.*The augmented suite𝒯aug\\mathcal\{T\}\_\{\\rm aug\}is generated at most once per problem \(Algorithm[1](https://arxiv.org/html/2605.24693#alg1)\), conditional on the candidate first passing public tests\. On Hard problems this trigger fires rarely, suppressing the TA contribution to the per\-problem call count\.

In other words, the drop on Hard is not a controller\-level decision to invoke fewer tools, but a downstream consequence of the candidate stream itself becoming sparse: the gatesGi,tHP,Gi,tSV,Gi,tTAG^\{\\mathrm\{HP\}\}\_\{i,t\},G^\{\\mathrm\{SV\}\}\_\{i,t\},G^\{\\mathrm\{TA\}\}\_\{i,t\}from Algorithm[1](https://arxiv.org/html/2605.24693#alg1)record neutral on inactive channels \(Definition[2\.1](https://arxiv.org/html/2605.24693#S2.Thmtheorem1)\)\. Combined with Fig\.[3\(a\)](https://arxiv.org/html/2605.24693#S4.F3.sf1), which shows HP concentrated in early interaction steps, this gives a coherent picture:CP\-Agentspends tool budget where evidence is actionable, and remains efficient elsewhere by leaving inactive channels neutral rather than firing them spuriously\.

### D\.7\.Experience\-Driven Efficiency

Section[3\.3](https://arxiv.org/html/2605.24693#S3.SS3)attributes the Experience module toContextAcquire​\(k\)\\textsf\{ContextAcquire\}\(k\), whose effect in Proposition[2\.9](https://arxiv.org/html/2605.24693#S2.Thmtheorem9)is to lift the active\-state success hazardh¯t​\(M\)\\underline\{h\}\_\{t\}\(M\)\. Beyond the per\-step acceptance lift reported in Table[5](https://arxiv.org/html/2605.24693#S4.T5), a higherh¯t\\underline\{h\}\_\{t\}should also letCP\-Agentcommit earlier in the refinement loop, with downstream effects on both the average number of refinement steps and the per\-problem cost\. We isolate this efficiency channel by an internal ablation ofπ⋆\\pi^\{\\star\}that toggles only the frozen Experience snapshotM⋆M^\{\\star\}on the LCB\-Pro target split \(DeepSeek\-V3\.2\-Chat backbone, manifest of Table[3](https://arxiv.org/html/2605.24693#S4.T3)\)\.

![Refer to caption](https://arxiv.org/html/2605.24693v1/x6.png)Figure 6\.Average refinement steps and average per\-problem API cost on LCB\-Pro with and without the frozen Experience snapshotM⋆M^\{\\star\}\(DeepSeek\-V3\.2\-Chat backbone,π⋆\\pi^\{\\star\}from Table[3](https://arxiv.org/html/2605.24693#S4.T3)\)\. Both axes are averaged over all167167LCB\-Pro problems\.##### Both axes move in the right direction\.

With Experience, the average refinement step count drops and the average per\-problem cost drops jointly \(Fig\.[6](https://arxiv.org/html/2605.24693#A4.F6)\)\. The same direction shows up in the component ablation of Table[4](https://arxiv.org/html/2605.24693#S4.T4): adding Experience on top of\+SV\+HP\+TA\+\\textsf\{SV\}\+\\textsf\{HP\}\+\\textsf\{TA\}movesCP\-Agentfrom $0\.0430\.043to $0\.0350\.035per problem and improves the only\-solved\#​T\\\#Ton ICPC\-Eval from2\.22\.2to1\.81\.8, while raising LCB\-Pro Pass@1 from46\.146\.1to48\.548\.5\. Two measurement axes—LCB\-Pro all\-problem average in Fig\.[6](https://arxiv.org/html/2605.24693#A4.F6)and ICPC\-Eval only\-solved\#​T\\\#Tin Table[4](https://arxiv.org/html/2605.24693#S4.T4)—thus agree on the sign of the effect, while differing in absolute magnitude due to the averaging convention\.

##### Mechanism interpretation\.

The cost reduction is not driven by Experience being a cheaper module: Table[13](https://arxiv.org/html/2605.24693#A4.T13)shows Experience itself consumes only∼3\.4%\\sim\\\!3\.4\\%of the total tracked API cost\. Instead, the savings act through the certificate state\. A higherh¯t​\(M\)\\underline\{h\}\_\{t\}\(M\)lets the admission gateΓt\\Gamma\_\{t\}close earlier on the trajectories that would otherwise have continued, which in turn truncates the HP/SV invocations downstream of those refinement steps\. Since HP and SV together account for87\.3%87\.3\\%of tracked cost \(Table[13](https://arxiv.org/html/2605.24693#A4.T13)\), the saved HP/SV calls dominate the additional retrieval cost incurred by Experience\. In the language of Section[2](https://arxiv.org/html/2605.24693#S2),ContextAcquireconverts hazard lift into refinement\-step savings, and the savings on the expensive verification channels outweigh the retrieval overhead—giving the qualitative claim in Section[4](https://arxiv.org/html/2605.24693#S4)that “retrieved repair experiences reduce ineffective trial\-and\-error” a concrete per\-problem efficiency signature\.

## Appendix EDatasets

##### ICPC\-Eval\(Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)

is a competitive\-programming benchmark designed to reflect real contest settings\. It is built from problems drawn from1111recent ICPC contests \(collected via QOJ and VJudge\): starting from139139problems, the dataset is cleaned and finalized to118118problems\. It filters out problems that rely on crucial non\-textual figures, interactive problems, or problems without an official reference solution\. For tasks requiring special judging, ICPC\-Eval retains them and additionally implements1212custom special judges \(e\.g\., for floating\-point outputs or multiple valid answers\)\. All problem statements are standardized into LaTeX for consistent parsing and model input\. The benchmark matches ICPC\-style problem diversity and difficulty, and provides annotations over eight major algorithmic areas \(e\.g\., DP, mathematics, graph algorithms, and data structures\), with many problems spanning multiple areas to capture the compositional and challenging nature of ICPC problems\.

##### LiveCodeBench Pro\(Zhenget al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib3)\)

is a challenging benchmark for evaluating large language models on competitive programming\. Its central goal is to mitigate overestimation caused by training\-data contamination by maintaining a continuously updated problem set\. The benchmark curates problems from top\-tier contests such as Codeforces, ICPC, and IOI\. Problems are collected as soon as possible after they appear in real contests, further reducing the risk of data leakage\.

## Appendix FMetrics

##### Refine@K\(Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)

is an execution\-feedback\-based evaluation metric designed to measure an LLM’s ability to iteratively improve a solution under a bounded turn budget\. In ICPC\-Eval, each problem is solved through up toKKresponse–refinement chances, and the metric asks whether the model can “pass the test within response and refinement chances” \(i\.e\., succeed withinKKturns\)\. Concretely, the evaluation is multi\-turn: in the first turn, the model receives the full problem statement; in subsequent turns, the model is additionally provided with its previous response and the corresponding evaluation feedback\. Thus, a model is counted as successful under Refine@K if any of its firstKKattempts \(conditioned on prior code \+ feedback\) produces a correct program\. Compared with Pass@K \(independent resampling\), Refine@K better captures the behavior of reasoning/reflection\-capable models because it explicitly evaluates whether a model can use feedback to correct mistakes rather than merely getting lucky with another sample\.

##### \# Turn\(Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2)\)

represents the average number of correct response turns, computed only over problems that are successfully solved within the refinement budget\. Specifically, for each solved problem, we record the turn index at which the model first produces a correct solution, and then average this index across the solved subset\. A smaller \#T indicates the model typically fixes its solution within fewer turns, suggesting more efficient use of execution feedback, whereas a larger \#T implies the model often needs more trial\-and\-error refinement rounds before arriving at a correct program\.

##### Pass@1

measures the single\-try success rate of code generation: for each problem, the model produces exactly one candidate solution \(one response/submission\)\. If this first solution passes the evaluation \(e\.g\., all test cases\), the problem is counted as solved\. Pass@1 is the fraction of problems solved on the first attempt:

Pass@1=1N∑i=1N𝕀\[\\displaystyle\\mathrm\{Pass@1\}=\\frac\{1\}\{N\}\\sum\_\{i=1\}^\{N\}\\mathbb\{I\}\\big\[the model’s first solution forproblemipasses\]\.\\displaystyle\\text\{problem \}i\\text\{ passes\}\\big\]\.Intuitively, Pass@1 reflects the model’s*initial answer quality*without any retries or feedback\-driven refinement, in contrast to iterative metrics such as Refine@K\.

## Appendix GPrompts

### G\.1\.System Prompt

\\MakeFramed\\FrameRestore

You are a professional competitive programming contestant who specializes in solving algorithm and data structure problems\.

Please carefully read the problem statement in the input, and first analyze:

\*\*Problem Category\*\*

\- Determine which type\(s\) of algorithmic problem it mainly belongs to\.

\- You may choose or combine from the following categories:

Use the format ‘"Primary Tag: \[Secondary Tag1, Secondary Tag2, \.\.\.\]"‘ \- Basic Algorithms: \[Simulation, Enumeration, Recursion, Greedy, Divide and Conquer, Binary Search, Sorting, Doubling, Construction, Prefix Sum & Difference Array\]

\- Dynamic Programming: \[Memoized Search, Knapsack DP, Interval DP, DP on DAG, Tree DP, Bitmask DP, Digit DP, Counting DP, Probabilistic DP, DP Optimization\]

\- Strings: \[String Matching \(KMP/Boyer–Moore\), String Hashing, Trie, Aho–Corasick Automaton, Suffix Array \(SA\), Suffix Automaton \(SAM\), Manacher, Palindromic Tree\]

\- Mathematics: \[Number Theory, Combinatorics, Linear Algebra, Game Theory, Probability & Expectation, Polynomials, Fast Exponentiation, Primes, Divisors, Congruence\]

\- Data Structures: \[Stack, Queue, Linked List, Hash Table, Disjoint Set Union \(DSU\), Fenwick Tree \(BIT\), Segment Tree, Sqrt Decomposition, Balanced BST, Tree of Trees, Persistent Data Structures\]

\- Graph Theory: \[Graph Traversal \(DFS/BFS\), Shortest Path, Minimum Spanning Tree, Topological Sort, Bipartite Graph, Network Flow, Strongly Connected Components \(SCC\), Biconnected Components, Lowest Common Ancestor \(LCA\), Tree Problems\]

\- Computational Geometry: \[Vector Operations, Convex Hull, Half\-Plane Intersection, Rotating Calipers, Sweep Line\]

\- Miscellaneous: \[Two Pointers, Discretization, Randomization, Search \(Pruning/Heuristics/A\*\)\]

Next, please write concise, efficient, and correct C\+\+ code\.

Finally, please use tools to validate the code’s correctness and debug issues\.

\#\# Experience System

\#\#\# General Programming Experience \(Auto\-injected\) Historical general programming experiences \(e\.g\., syntax errors, header includes, type conversions\) will be shown below\. Please pay attention to these common issues while coding\.

%\- if long\_term\_experiences % long\_term\_experiences %\- endif %

Here are some examples of using tools:

Task: "Solve a graph shortest path problem: N cows stand in a line, each cow has a breed attribute\. You need to send information from cow 1 to cow N\. Whether two breeds can communicate is determined by a matrix S, and the cost is the distance between positions\."

Thought: This is a shortest path problem with constraints, belonging to graph theory\. Let me first retrieve historical experience on graph theory \- shortest path, to see if there are similar solutions\.

\{\{code\_block\_opening\_tag\}\}

experience\_data = retrieve\_algorithm\_experience\(primary\_tag="Graph Theory", secondary\_tag="Shortest Path"\)

print\(experience\_data\) \{\{code\_block\_closing\_tag\}\}

Observation: "\[Graph Theory \- Shortest Path Experience\] 1\. When using Dijkstra, ensure all edge weights are non\-negative\.\.\." Thought: This experience is useful; I need to ensure edge weights are non\-negative\. Let me use Hypothesis Prober to verify whether the edge weights are non\-negative and whether Dijkstra is applicable\.

\{\{code\_block\_opening\_tag\}\}

cpp\_code = r’’’

\#include <bits/stdc\+\+\.h\>

using namespace std;

int main\(\)

int N = 10;

cout << "Edges in directed complete graph: " << 1LL \* N \* \(N \- 1\) << " n";

for \(int i = 1; i <= N; i\+\+\)

for \(int j = 1; j <= N; j\+\+\)

if \(i \!= j && abs\(i \- j\) < 0\)

cout << "Negative weight? n";

cout << "All \|i\-j\| \>= 0 \-\> Dijkstra applicable n";

return 0;

’’’;

result = hypo\_prober\(cpp\_code\);

print\(result\)

\{\{code\_block\_closing\_tag\}\}

Observation: The weights are non\-negative, but building the complete graph is too large and needs layering/graph\-construction optimization\.

Thought: After understanding the historical experience, I will write the final code example\. To avoid blow\-up, do not build an O\(N2N^\{2\}\) graph; avoid Dijkstra and instead use 0\-1 BFS \(deque\) on a layered graph to compute shortest paths\.

\{\{code\_block\_opening\_tag\}\}

cpp\_code = r’’’

\#include <bits/stdc\+\+\.h\>

using namespace std;

\.\.\.

int ans = dist\[id\_pos\(N\)\];

if \(ans \>= INF/2\)

cout << \-1 << " n";

else cout << ans << " n";

return 0;

’’’

result = cpp\_validation\(cpp\_code\)

print\(result\)

\{\{code\_block\_closing\_tag\}\}\\endMakeFramed

\\MakeFramed\\FrameRestore

The above examples use hypothetical tools that may not exist in real tasks\. Aside from executing computations inside the Python code snippets you create, you can only access the tools listed below, whose behavior is similar to regular Python functions:

\{\{code\_block\_opening\_tag\}\}

\{%\- for tool in tools\.values\(\) %\}

\{\{ tool\.to\_code\_prompt\(\) \}\}

\{% endfor %\}

\{\{code\_block\_closing\_tag\}\}

Here are the rules you should follow:

1\. C\+\+ code must be passed to tool functions \(e\.g\., cpp\_validation, hypo\_validator, final\_answer\) as a string\. Correct format example:

Thought: \[Your thought\]

Next, I will write C\+\+ code to implement this idea\.

\{\{code\_block\_opening\_tag\}\}

code = ’’’

\#include <iostream\>

int main\(\)

\.\.\.

’’’

result = cpp\_validation\(code\)

print\(result\)

\{\{code\_block\_closing\_tag\}\}

2\. Only use variables that you have defined\!

3\. C\+\+ code must not be executed directly by the Python interpreter\!

4\. Do not name any new variable as any tool name: e\.g\., do not name a variable "final\_answer"\.

7\. Do not create any dummy variables in the code, because these variables appearing in logs may mislead you\.

8\. Do not give up\! You have no time or step limit for writing code\. You are the one solving the task, not providing guidance on how to solve it\.

Please start solving the problem\.\\endMakeFramed

### G\.2\.Tool Descriptions

#### G\.2\.1\.Test Case Generator

\\MakeFramed\\FrameRestore

You are an elite competitive\-programming assistant\.

Produce only valid C\+\+17 code\.

Use standard I/O \(cin/cout\)\.

Include every helper necessary for a brute\-force solution\.

Return only the C\+\+ source code, wrapped in a single fenced code block: ‘‘‘cpp \.\.\. ‘‘‘

You are an expert competitive programming assistant\.

Your task is to write a correct and brute\-force C\+\+ solution for the following problem\. The program must:

\- Read input from standard input and write output to standard output, strictly following the problem’s input/output format\.

\- Implement a straightforward, brute\-force \(but correct\) algorithm, prioritizing clarity and correctness over efficiency\.

\- Avoid any optimizations or advanced algorithms; use only simple, exhaustive approaches\.

\- Ensure the code is self\-contained and compiles with GNU G\+\+23 14\.2 \(64bit, msys2\)\.

\-\-\-

Question

\-\-\-

Language and Compiler

\- Language: cpp

\- Compiler: GNU G\+\+23 14\.2 \(64bit, msys2\)

Output:

Provide only the complete C\+\+ source code, enclosed in a single code block using the syntax ‘‘‘cpp \.\.\. ‘‘‘\.\\endMakeFramed

#### G\.2\.2\.Hypothesis Prober

\\MakeFramed\\FrameRestore

This tool compiles and executes C\+\+ code in a local sandbox environment\.

It takes C\+\+ code as input, compiles it, and returns the execution output\. The tool can be used for quick numerical computation and verification: write small C\+\+ programs to compute results \(e\.g\., prefix sums, combinations, matrix operations, simulations, randomized experiments\), and treat the printed output as a reliable computed reference to support your reasoning\.

\*\*When input is needed, you should hard\-code the input directly into the C\+\+ program\*\* \(e\.g\., embed test data in the code\)\. This is not a tool for validating the code you intend to submit\.

If any compilation or runtime errors occur, it will return the corresponding error messages, along with historical error reminders\.\\endMakeFramed

#### G\.2\.3\.Solution Validator

\\MakeFramed\\FrameRestore

This tool validates your C\+\+ code by compiling and running it against local test cases\.

It automatically injects test inputs and compares the actual output with the expected output\.

For failed tests, it returns the input data, the expected output, and the actual output\.

The output is limited to 500 words to prevent excessive verbosity\. This tool is used to verify whether the code you intend to submit is correct, and it does not require you to provide input\.

Before submitting the final code, you must use this tool at least once\.\\endMakeFramed

#### G\.2\.4\.Experience Retriever

\\MakeFramed\\FrameRestore

Retrieve historical fix experiences for a specific algorithm type\.

When you need to solve an algorithm\-related problem, you can use this tool to obtain past experiences for that algorithm type\. These experiences come from previously solved similar problems and may help with the current task\.

Algorithm taxonomy:

\- Basic Algorithms: Simulation, Enumeration, Recursion, Greedy, Divide and Conquer, Binary Search, Sorting, Doubling, Construction, Prefix Sum & Difference Array

\- Dynamic Programming: Memoized Search, Knapsack DP, Interval DP, DP on DAG, Tree DP, Bitmask DP, Digit DP, Counting DP, Probabilistic DP, DP Optimization

\- Strings: String Matching, String Hashing, Trie, Aho–Corasick Automaton, Suffix Array, Manacher, Palindromic Tree

\- Mathematics: Number Theory, Combinatorics, Linear Algebra, Game Theory, Probability & Expectation, Fast Exponentiation, Primes, Divisors, Congruence

\- Data Structures: Stack, Queue, Linked List, Hash Table, Disjoint Set Union \(DSU\), Fenwick Tree \(BIT\), Segment Tree, Sqrt Decomposition, Balanced BST

\- Graph Theory: Graph Traversal \(DFS/BFS\), Shortest Path, Minimum Spanning Tree, Topological Sort, Bipartite Graph, Network Flow, Strongly Connected Components \(SCC\), Lowest Common Ancestor \(LCA\)

\- Computational Geometry: Vector Operations, Convex Hull, Half\-Plane Intersection, Rotating Calipers, Sweep Line

\- Miscellaneous: Two Pointers, Discretization, Randomization, Search \(Pruning/Heuristics/A\*\)

\\endMakeFramed

### G\.3\.Experience Generation

\\MakeFramed\\FrameRestore

\#\# Your Task As an expert in extracting code\-fix experiences, you need to analyze the debugging process that turns a failed solution into a successful one, and distill reusable debugging experience and repair strategies\.

\#\# Core Responsibilities 1\. Accuracy: Ensure the extracted experiences are accurate and verified\. 2\. Reusability: The summarized experience should be applicable to similar problems\. 3\. Practicality: Provide concrete, actionable fix methods\. 4\. Conciseness: Express the key points in the most concise language\.

\#\# Repair Process to Analyze

\#\#\# First Verification \(Failed\) Error Type: \{error\_summary\}

Error Message: \{failure\_result\}

Code Snippet at Failure:

```
{failure_code}
```

\#\#\# Intermediate Improvement Steps \{steps\_text\}

\#\#\# Second Verification \(Succeeded\) Code Snippet After Success:

```
{success_code}
```

\#\# Analysis Requirements

\#\#\# 1\. Identify the Root Cause \- What is the fundamental cause of this error? \- Is it an algorithmic logic issue, a boundary\-condition issue, or a syntax/library\-usage issue?

\#\#\# 2\. Extract the Fix Strategy \- What specific method did the agent use to fix the issue? \- What are the key steps of this fix? \- Are there any debugging tricks or thought processes worth recording?

\#\#\# 3\. Summarize Reusable Experience \- If a similar issue occurs in the future, what approach should be taken? \- What was the most critical insight in this fix process? \- What general cautions or best practices apply?

\#\# Output Format Requirements

Return a JSON object with the following fields:

```
{
  "error\_context": "When/where the error occurred (what was being implemented/handled)",
  "error\_cause": "The specific cause of the error",
  "fix\_method": "The concrete modification made",
  "fix\_result": "The effect after the fix",
  "key\_insight": "A reusable key takeaway (one-sentence lesson)"
}
```

\#\#\# Output Specifications

1\. error\_context \(required\): \- Describe the scenario and timing of the error\. \- Examples: "While implementing divide\-and\-conquer optimized DP", "When handling large\-scale array inputs", "When enumerating subsets using bit operations"\. \- Length: 10\-\-30 words\.

2\. error\_cause \(required\): \- Clearly state the cause of the error\. \- Examples: "Incorrect subproblem evaluation order in recursion, violating state dependencies\." \- Examples: "Using <= in the loop bound caused out\-of\-bounds array access\." \- Length: 15\-\-50 words\.

3\. fix\_method \(required\): \- Describe the concrete fix\. \- Examples: "Move the recursive call solve\(l, mid\-1\) before computing dp\[mid\]\." \- Examples: "Change the loop condition from i<=n to i<n\." \- Length: 15\-\-50 words\.

4\. fix\_result \(required\): \- Describe the effect after applying the fix\. \- Examples: "State dependencies are satisfied; outputs match expected results\." \- Examples: "All array accesses are within bounds; no segmentation fault\." \- Length: 10\-\-30 words\.

5\. key\_insight \(required\): \- The most important reusable lesson\. \- It should help others avoid the same mistake at a glance\. \- Examples: "In divide\-and\-conquer DP optimization, subproblems that the current state depends on must be computed before evaluating dp\[mid\]\." \- Examples: "Array indices range from 0 to n\-1; use a strict < n loop bound when iterating\." \- Length: 20\-\-60 words\.

\#\# Example Outputs

```
{
  "error\_context": "While implementing divide-and-conquer optimized DP",
  "error\_cause": "Left-half subproblems dp[l..mid-1] were not computed before dp[mid], breaking the required
  dependency order",
  "fix\_method": "Move the recursive call solve(l, mid-1) to execute before computing dp[mid]",
  "fix\_result": "Dependencies are satisfied and all test cases pass",
  "key\_insight": "For divide-and-conquer DP optimization, compute all dependent subproblems before evaluating
  the current state"
}
```

```
{
  "error\_context": "While iterating over array elements",
  "error\_cause": "The loop condition used i<=n instead of i<n, causing out-of-bounds access beyond the last
  element",
  "fix\_method": "Change for(int i=0; i<=n; i++) to for(int i=0; i<n; i++)",
  "fix\_result": "Array access stays in range and the program runs normally",
  "key\_insight": "Array indices are 0..n-1; always use a strict <n loop bound when traversing arrays"
}
```

\\endMakeFramed

### G\.4\.Experience Deduplication

\\MakeFramed\\FrameRestore

Please determine whether the new experience is a duplicate of, or highly similar to, any existing experience\.

\#\# New Experience \{new\_exp\.get\(’fix\_summary’, ’’\)\}

\#\# Existing Experience List \{existing\_summaries\}

\#\# Criteria \- If the problem described and the solution method are essentially the same as an existing experience \(even if phrased differently\), it is considered a duplicate\. \- If the new experience is a special case or a generalization of an existing one, it is also considered a duplicate\. \- If the new experience involves a different problem or a different solution method, it is considered non\-duplicate\.

\#\# Output Format Return a JSON object:

```
{
  "is\_duplicate": true or false,
  "duplicate\_index": the index of the duplicated experience (1-based; null if not duplicate),
  "reason": "Justification for the decision"
}
```

\\endMakeFramed

### G\.5\.Experience Routing

\\MakeFramed\\FrameRestore

Please analyze the following repair experience and determine which category it belongs to\.

\#\# Repair Experience \- Error Description: \{original\_error\} \- Fix Method: \{fix\_summary\} \{f’\- Code Context: \{code\_context\[:500\]\}’ if code\_context else ’’\}

\#\# Classification Rules

\#\#\# 1\. General \(General\-Purpose Errors\) Applicable to common programming errors that are not tied to a specific algorithm, such as: \- Syntax errors \(missing semicolons, mismatched braces/parentheses, etc\.\) \- Spelling mistakes \(typos\) \- Header/library import issues \- Undeclared/uninitialized variables \- Type conversion/casting errors \- Input/output format issues \- Memory management issues \(not algorithm\-related\)

\#\#\# 2\. Algorithm \(Algorithmic Errors\) Applicable to errors tied to a particular algorithm, where you must choose the corresponding algorithm tags: \{taxonomy\_text\}

\#\# Output Format Return a JSON object:

```
{
  "category": "general" or "algorithm",
  "primary\_tag": "Primary tag (required only for algorithm)",
  "secondary\_tag": "Secondary tag (required only for algorithm)",
  "reason": "Brief justification"
}
```

Note: \- If the category is general, set primary\_tag and secondary\_tag to null\. \- If the category is algorithm, you must choose from the tags above \(pick the closest match if needed\)\.

\\endMakeFramed

## Appendix HCase Study

Given the substantial length of both the inputs and outputs ofCP\-Agent, we limit the presentation below to the full problem description,CP\-Agent’s tool\-use behavior, and the final submission\.

Problem

\\MakeFramed\\FrameRestore

##### Alea Iacta Est

Time limit:10000 ms Memory limit:1024 MB

##### Description

You play a game with multiple fair six\-sided dice\. Each die’s face displays a single symbol\. The objective of the game is to roll the dice and create a valid word from the symbols on top of each die\. If you cannot form a word, you may reroll the dice for another attempt\.

Suppose there are five dice: one of them contains letters A, B, C, D, E, and P \(abbreviated as ABCDEP\), and the other dice contain AEHOXU, AISOLR, ABCDEF, and ABCSCC\. The first roll yields the following letters on the tops of respective dice:P,X,R,EP,X,R,E, andSS\. As it is impossible to arrange these letters into a valid word, you decide to keep theP,SP,S, andEE, and reroll the other dice, in an attempt to make words like PARSE, PAUSE, PHASE, POISE, PROSE, PULSE, or PURSE\. The two dice yieldEEandAA, resulting in the following five letters:P,E,A,EP,E,A,E, andSS\. You still cannot think of a valid word, so you decide to keep four letters and reroll only the last die, which has three sides with letter C\. By doing so, there is a50%50\\%chance that it will be possible to make a final valid word: PEACE, as shown in Figure K\.1\.

When you roll a die, it lands on any one of its faces with equal probability\. What is the expected number of rolls needed to make a valid word, assuming you use an optimal strategy?

##### Input

The first line of input contains two numbersddandww, whered​\(1≤d≤6\)d\\ \(1\\leq d\\leq 6\)is the number of dice andw​\(1≤w≤2⋅105\)w\\ \(1\\leq w\\leq 2\\cdot 10^\{5\}\)is the number of valid words in the dictionary\. The followingddlines each have66symbols, one for each face of the die\. The finalwwlines containwwdistinct valid words in the dictionary\. Every word has exactlyddsymbols\.

All symbols in the input are either uppercase letters \(A\-\-Z\) or digits \(0\-\-9\)\.

##### Output

If it is possible to make a valid word, output the expected number of rolls needed to make a valid word when using an optimal strategy\. Otherwise, output impossible\. Your answer should have an absolute or relative error of at most10−610^\{\-6\}\.

##### Sample Input 1

```
5 8
ABCDEP
AEHOXU
AISOLR
ABCDEF
ABCSCC
PARSE
PAUSE
PHASE
POISE
PROSE
PULSE
PURSE
PEACE
```

##### Sample Output 1

##### Sample Input 2

##### Sample Output 2

##### Sample Input 3

```
3 1
123456
123456
123456
666
```

##### Sample Output 3

##### Sample Input 4

##### Sample Output 4

\\MakeFramed\\FrameRestore

Generating test case\.\.\.

Thought I need to carefully analyze this problem\. This is a game involving dice and words, and we need to compute the expected number of rolls under an optimal strategy\.

Problem Analysis: …

Let us retrieve prior experience on similar problems\.

This problem involves minimizing an expectation and searching over a state space, which likely falls under dynamic programming or a Markov decision process\. Let me retrieve experience under the tag ‘‘DP \-\- Probability DP’’\.

```
<code>
exp = retrieve_algorithm_experience(primary_tag="Dynamic programming", secondary_tag="Probabilistic DP")
print(exp)
</code>
```

Tool\_Response: When solutions based on complex state transitions \(e\.g\., graph DP\) are difficult to debug, one should revisit the essence of the problem and try to find a simpler mathematical or counting structure that may completely bypass the original complicated logic\.

Thought:The experience may be useful later\. Let me continue analyzing the problem\. Sinced≤6d\\leq 6, the state space is finite, so I can try value iteration, but the computation needs to be optimized\.

I will first write a small program to validate some ideas and compute the number of states and possible actions\.Thought:The C\+\+ code contains a syntax error\. Let me fix it\.

```
cpp_code = ’’’
#include <bits/stdc++.h>
using namespace std;

int main() {
    int d = 6;
    long long states = 1;
    for (int i = 0; i < d; i++) states *= 6;
    cout << "Maximum number of configs: " << states << endl;

    // Number of possible retain subsets (non-empty complement)
    // R can be any subset except the full set (all dice retained)
    // Actually R can be any subset, including the empty set (reroll all)
    // But if R = full set, then no reroll; the game stops if winning,
    // otherwise it is impossible. Hence, for a non-winning configuration,
    // R cannot be the full set.
    // So the number of actions = 2^d - 1 (excluding the full set).
    int actions = (1 << d) - 1;
    cout << "Maximum number of actions per config: " << actions << endl;

    // For each action, the number of successor configurations is 6^{|U|},
    // where U is the complement of R.
    // The average |U| = d/2, so the average number of successors is
    // 6^{d/2} \approx 6^3 = 216.
    // Total computations per iteration:
    // states * actions * successors
    // 46656 * 63 * 216
    // 6.35 \times 10^8,
    // which is large but may be manageable with optimization.
    double avg_successors = pow(6, d / 2.0);
    cout << "Approximate total successors per iteration: "
         << states * actions * avg_successors << endl;

    return 0;
}
’’’
result = cpp_interpreter(cpp_code)
print(result)
```

```
<code>
cpp_code = ’’’
#include <bits/stdc++.h>
using namespace std;

int main() {
    int d = 6;
    long long states = 1;
    for (int i = 0; i < d; i++) states *= 6;
    cout << "Maximum number of configs: " << states << endl;

    // Number of possible retain subsets (non-empty complement)
    // R can be any subset except full set (all dice retained)
    // Actually R can be any subset, including empty set (reroll all)
    // But if R = full set, then no reroll, game stops if winning, else impossible
    // So for non-winning config, R cannot be full set.
    // So number of actions = 2^d - 1 (excluding full set)
    int actions = (1 << d) - 1;
    cout << "Maximum number of actions per config: " << actions << endl;

    // For each action, number of successor configs = 6^{|U|}
    // where U = complement of R
    // Average |U| = d/2, so average successors = 6^{d/2}
    // 6^3=216
    // Total computations per iteration: states * actions * successors
    // 46656 * 63 * 216
    // 635M, too large but maybe manageable with optimization?
    cout << "Approximate total successors per iteration: "
         << states * actions * pow(6, d/2.0) << endl;

    return 0;
}
’’’
result = cpp_interpreter(cpp_code)
print(result)
</code>
```

Tool\_Response: Error: Sandbox error: Compile stderr: /tmp/tmpx95y45qt/tmp6iulsg2f\.cpp: In function ‘int main\(\)’: /tmp/tmpx95y45qt/tmp6iulsg2f\.cpp:20:10: error: expected primary\-expression before ‘<<’ token 20 ^

Thought:The C\+\+ code contains a syntax error\. Let me fix it\.

```
<code>
cpp_code = ’’’
#include <bits/stdc++.h>
using namespace std;

int main() {
    int d = 6;
    long long states = 1;
    for (int i = 0; i < d; i++) states *= 6;
    cout << "Maximum number of configs: " << states << endl;

    // Number of possible retain subsets (non-empty complement)
    // R can be any subset except the full set (all dice retained)
    // Actually R can be any subset, including the empty set (reroll all)
    // But if R = full set, then no reroll; the game stops if winning,
    // otherwise it is impossible. Hence, for a non-winning configuration,
    // R cannot be the full set.
    // So the number of actions = 2^d - 1 (excluding the full set).
    int actions = (1 << d) - 1;
    cout << "Maximum number of actions per config: " << actions << endl;

    // For each action, the number of successor configurations is 6^{|U|},
    // where U is the complement of R.
    // The average |U| = d/2, so the average number of successors is
    // 6^{d/2} \approx 6^3 = 216.
    // Total computations per iteration:
    // states * actions * successors
    // \approx 46656 * 63 * 216 \approx 6.35 \times 10^8,
    // which is large but may be manageable with optimization.
    double avg_successors = pow(6, d / 2.0);
    cout << "Approximate total successors per iteration: "
         << states * actions * avg_successors << endl;

    return 0;
}
’’’
result = cpp_interpreter(cpp_code)
print(result)
</code>
```

Tool\_Response: Maximum number of configs: 46656 Maximum number of actions per config: 63 Approximate total successors per iteration: 6\.34895e\+08

Thought:6\.35×1086\.35\\times 10^\{8\}computations per iteration is likely too slow for value iteration\. We need optimization\.

\.\.\.

Implementation plan:

1. \(1\)Enumerate all full configurations \(6d6^\{d\}\)\. For each configuration id, store its face array \(characters\) and a win/lose flag\.
2. \(2\)Precomputesum​\[mask\]​\[proj\]\\mathrm\{sum\}\[\\mathrm\{mask\}\]\[\\mathrm\{proj\}\]: for each maskmask∈\[0,2d−1\]\\mathrm\{mask\}\\in\[0,2^\{d\}\-1\], the projectionproj\\mathrm\{proj\}is a length\-\|mask\|\|\\mathrm\{mask\}\|array of face indices\. Encodeproj\\mathrm\{proj\}as an integer in base\-66: for the positionsiiin mask \(in order\), with face indexfi∈\{0,…,5\}f\_\{i\}\\in\\\{0,\\dots,5\\\}, encode as a base\-66number\.
3. \(3\)Initialize dp: winning configurations have dp=0; losing ones have dp=INF \(or a large number\)\.
4. \(4\)Value iteration: 1. \(a\)Update sum from the current dp\. 2. \(b\)For each non\-winning configurationcc, compute dp​\_​new​\[c\]=minmask≠full⁡\[1\+sum​\[mask\]​\[projc\]6d−\|mask\|\]\.\\displaystyle\\mathrm\{dp\\\_new\}\[c\]=\\min\_\{\\mathrm\{mask\}\\neq\\mathrm\{full\}\}\\\!\\\!\\bigg\[1\\\!\+\\\!\\tfrac\{\\mathrm\{sum\}\[\\mathrm\{mask\}\]\[\\mathrm\{proj\}\_\{c\}\]\}\{6^\{d\-\|\\mathrm\{mask\}\|\}\}\\\!\\bigg\]\. 3. \(c\)Check convergence\.

Complexity\.Each iteration updates sum: for each configurationccand each mask \(2d2^\{d\}\), compute the projection and accumulatedp​\[c\]\\mathrm\{dp\}\[c\]\. This costs6d⋅2d=46656⋅64≈3×1066^\{d\}\\cdot 2^\{d\}=46656\\cdot 64\\approx 3\\times 10^\{6\}operations, acceptable\.

Then for each non\-winning configurationcc, computing dp\_new scans all masks \(2d−12^\{d\}\-1\): another46656⋅63≈3×10646656\\cdot 63\\approx 3\\times 10^\{6\}operations\.

So each iteration is about6×1066\\times 10^\{6\}ops\. If value iteration converges in tens of iterations, the total is a few hundred million ops, acceptable\.

Memory\.sum has size7d=1176497^\{d\}=117649doubles\. dp has size4665646656\.

Now I need to validate the algorithm\. I will first implement a small\-ddtest\.

First, I need to handle word matching\. For each full configurationcc, I need to check whether its multiset of symbols matches any word\. Sinced≤6d\\leq 6, I can sort the symbols of the configuration and compare against sorted strings of the words\.

Preprocessing:Convert each word into its sorted string and store them in an unordered\_set\.

For each configurationcc, collect theddsymbols, sort them, concatenate into a string, and test membership in the set\.

Now let me write a C\+\+ test to verify the bookkeeping logic\.

```
<code>
cpp_code = ’’’
#include <bits/stdc++.h>
using namespace std;

int main() {
    // Test small case: d=1
    int d = 1;
    int num_configs = pow(6, d); // 6
    int num_masks = 1 << d; // 2
    int num_partial = pow(7, d); // 7

    cout << "d=" << d << ": configs=" << num_configs;
    cout << ", masks=" << num_masks;
    cout << ", partial=" << num_partial << endl;

    d = 2;
    num_configs = pow(6, d); // 36
    num_masks = 1 << d; // 4
    num_partial = pow(7, d); // 49
    cout << "d=" << d << ": configs=" << num_configs;
    cout << ", masks=" << num_masks;
    cout << ", partial=" << num_partial << endl;

    d = 3;
    num_configs = pow(6, d); // 216
    num_masks = 1 << d; // 8
    num_partial = pow(7, d); // 343
    cout << "d=" << d << ": configs=" << num_configs;
    cout << ", masks=" << num_masks;
    cout << ", partial=" << num_partial << endl;

    d = 6;
    num_configs = pow(6, d); // 46656
    num_masks = 1 << d; // 64
    num_partial = pow(7, d); // 117649
    cout << "d=" << d << ": configs=" << num_configs;
    cout << ", masks=" << num_masks;
    cout << ", partial=" << num_partial << endl;

    return 0;
}
’’’
result = cpp_interpreter(cpp_code)
print(result)
</code>
```

Tool\_Response: d=1: configs=6, masks=2, partial=7 d=2: configs=36, masks=4, partial=49 d=3: configs=216, masks=8, partial=343 d=6: configs=46656, masks=64, partial=117649

Thought:In Sample 4, the output is60016001instead of ‘‘impossible’’\. This happens because in my algorithm, even when the game is impossible, it still returns a finite value\. The issue is that when it is impossible, the dp value converges to a very large number, but my threshold101210^\{12\}is not large enough\. Since Sample 4 returns6001<10126001<10^\{12\}, it was not recognized as impossible\.

However, according to the statement, Sample 4 should output ‘‘impossible’’\. Let me analyze this: …

I need to modify the algorithm to correctly handle the impossible case\. …

Therefore, I need to detect whether there exists at least one winning configuration\. If not, directly output ‘‘impossible’’\.

Let me add this check\.

```
<code>
cpp_code = ’’’
#include <bits/stdc++.h>
using namespace std;

...
        cout << "Sample 4: ";
        if (ans >= INF/2) {
            cout << "impossible" << endl;
        } else {
            cout << ans << endl;
        }
        cout << "Expected: impossible" << endl;
    }

    return 0;
}
’’’
result = cpp_interpreter(cpp_code)
print(result)
</code>
```

Tool\_Response: Sample 1: 9\.677887141 \(expected 9\.677887141\) Sample 2: 1\.000000000 \(expected 1\.0\) Sample 3: 10\.555444555 \(expected 10\.555444555\) Sample 4: impossible Expected: impossible

Thought:Perfect\! All samples pass\. Now I need to write the final solution, handle I/O, and ensure efficiency\.

Let me summarize the final algorithm: …

Details to note:

- •Use double; a precision of10−610^\{\-6\}is sufficient\.
- •At most50005000value\-iteration steps should be enough to converge\.
- •Memory: dp has size6d≤466566^\{d\}\\leq 46656, and sum has size7d≤1176497^\{d\}\\leq 117649, which is acceptable\.
- •Time complexity: each iteration isO​\(6d⋅2d\)≈3×106O\(6^\{d\}\\cdot 2^\{d\}\)\\approx 3\\times 10^\{6\}, multiplied by the number of iterations \(about150150\-\-200200\), giving roughly5×1085\\times 10^\{8\}operations, which should fit within a 10\-second limit after C\+\+ optimization\.

Now I will write the final code and test it with the validator\.

```
<code>
final_code = ’’’
#include <bits/stdc++.h>
using namespace std;

const double INF = 1e18;
const double EPS = 1e-12;

int main() {
...
        // Compute dp_new
        max_diff = 0.0;
        for (int c = 0; c < total_configs; c++) {
            if (win[c]) {
                dp_new[c] = 0.0;
                continue;
            }
            double best = INF;
            int full_mask = (1 << d) - 1;
            for (int R = 0; R < full_mask; R++) {
                int U = full_mask ^ R;
                int k = popcount[U];
                int proj_id = getProjId(c, R);
                int idx = offset[R] + proj_id;
                double expected = 1.0 + sum[idx] / pow6[k];
                if (expected < best) best = expected;
            }
            dp_new[c] = best;
            max_diff = max(max_diff, fabs(dp_new[c] - dp[c]));
        }

        swap(dp, dp_new);
        iter++;
    } while (max_diff > EPS && iter < 5000);

    // Compute initial expected rolls
    double total = 0.0;
    for (int c = 0; c < total_configs; c++) {
        total += dp[c];
    }
    double ans = 1.0 + total / pow6[d];

    cout << fixed << setprecision(9) << ans << endl;
    return 0;
}
’’’

result = cpp_validation(final_code)
print(result)
</code>
```

Tool\_Response ALL TESTS PASSED \(4/4\) ================================================== Test 1: PASSED Test 2: PASSED Test 3: PASSED Test 4: PASSED

Thought:All tests passed\! Now I can submit the final answer\.

```
<code>
final_answer(final_code)
</code>
```

\\endMakeFramed

Submission\\MakeFramed\\FrameRestore

```
#include <bits/stdc++.h>
using namespace std;

const double INF = 1e18;
const double EPS = 1e-12;

int main() {
    ios::sync_with_stdio(false);
    cin.tie(nullptr);

    int d, w;
    cin >> d >> w;

    vector<string> dice(d);
    for (int i = 0; i < d; i++) {
        cin >> dice[i];
        // each dice[i] has exactly 6 characters
    }

    vector<string> words(w);
    for (int i = 0; i < w; i++) {
        cin >> words[i];
    }

    // Preprocess words: sort characters
    unordered_set<string> wordSet;
    for (string& wrd : words) {
        string sorted = wrd;
        sort(sorted.begin(), sorted.end());
        wordSet.insert(sorted);
    }

    int total_configs = pow(6, d);

    // faces[config][i] = face index of die i
    vector<vector<int>> faces(total_configs, vector<int>(d));
    vector<bool> win(total_configs, false);
    bool any_win = false;

    for (int id = 0; id < total_configs; id++) {
        int tmp = id;
        vector<char> symbols(d);
        for (int i = 0; i < d; i++) {
            int f = tmp % 6;
            tmp /= 6;
            faces[id][i] = f;
            symbols[i] = dice[i][f];
        }
        string sorted_sym(symbols.begin(), symbols.end());
        sort(sorted_sym.begin(), sorted_sym.end());
        if (wordSet.count(sorted_sym)) {
            win[id] = true;
            any_win = true;
        }
    }

    if (!any_win) {
        cout << "impossible" << endl;
        return 0;
    }

    // DP arrays
    vector<double> dp(total_configs, 0.0);
    vector<double> dp_new(total_configs, 0.0);
    for (int i = 0; i < total_configs; i++) {
        dp[i] = win[i] ? 0.0 : 1000.0; // initial guess
    }

    // Precompute popcount and powers of 6
    vector<int> popcount(1 << d, 0);
    for (int m = 0; m < (1 << d); m++) {
        popcount[m] = __builtin_popcount(m);
    }
    vector<double> pow6(d + 1, 1.0);
    for (int i = 1; i <= d; i++) {
        pow6[i] = pow6[i - 1] * 6.0;
    }

    // Precompute offsets for sum array
    int total_partial = pow(7, d);
    vector<double> sum(total_partial, 0.0);

    vector<int> offset(1 << d, 0);
    int cur_offset = 0;
    for (int m = 0; m < (1 << d); m++) {
        offset[m] = cur_offset;
        cur_offset += pow(6, popcount[m]);
    }
```

```
// Helper: compute projection id for config id and mask
    auto getProjId = [&](int config_id, int mask) -> int {
        int proj_id = 0;
        int base = 1;
        for (int i = 0; i < d; i++) {
            if (mask >> i & 1) {
                int face = faces[config_id][i];
                proj_id += face * base;
                base *= 6;
            }
        }
        return proj_id;
    };

    // Value iteration
    int iter = 0;
    double max_diff;
    do {
        // Update sum array
        fill(sum.begin(), sum.end(), 0.0);
        for (int c = 0; c < total_configs; c++) {
            double val = dp[c];
            for (int m = 0; m < (1 << d); m++) {
                int proj_id = getProjId(c, m);
                int idx = offset[m] + proj_id;
                sum[idx] += val;
            }
        }

        // Compute dp_new
        max_diff = 0.0;
        for (int c = 0; c < total_configs; c++) {
            if (win[c]) {
                dp_new[c] = 0.0;
                continue;
            }
            double best = INF;
            int full_mask = (1 << d) - 1;
            for (int R = 0; R < full_mask; R++) {
                int U = full_mask ^ R;
                int k = popcount[U];
                int proj_id = getProjId(c, R);
                int idx = offset[R] + proj_id;
                double expected = 1.0 + sum[idx] / pow6[k];
                if (expected < best) best = expected;
            }
            dp_new[c] = best;
            max_diff = max(max_diff, fabs(dp_new[c] - dp[c]));
        }

        swap(dp, dp_new);
        iter++;
    } while (max_diff > EPS && iter < 5000);

    // Compute initial expected rolls
    double total = 0.0;
    for (int c = 0; c < total_configs; c++) {
        total += dp[c];
    }
    double ans = 1.0 + total / pow6[d];

    cout << fixed << setprecision(9) << ans << endl;
    return 0;
}
```

\\endMakeFramed

## Appendix IExtended Discussion on Related Work

##### LLM Agents\.

Large Language Models have been increasingly deployed as autonomous agents capable of interacting with external environments and solving multi\-step tasks\. In the domain of web navigation, researchers have developed agents to autonomously browse and interact with online interfaces\(Ninget al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib41); Heet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib42); Laiet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib43); Zhanget al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib44)\)\. In the field of software engineering, agents such as SWE\-Agent have been proposed to handle repository\-level maintenance and development tasks\(Luoet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib45); Gaoet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib46)\)\. Furthermore, the application of agents has extended to high\-level intellectual domains, including automated scientific discovery\(Toledoet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib47); Baeket al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib48); Luet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib49)\)and data science analysis, where agents assist in data visualization and interpretation\(Zhuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib50); Yanget al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib51); Jianget al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib52)\)\. These advancements collectively demonstrate the potential of LLM agents to generalize across diverse and complex workflows\.

In contrast to the extensive body of work on autonomous agents interacting with dynamic external environments, a more focused stream of research investigates Agentic Reasoning—the use of deterministic tools to enhance the model’s internal cognitive processes\. Unlike autonomous agents that use tools as “effectors” to change the environment state, this paradigm employs tools—primarily code interpreters \(e\.g\., Python\) and calculators—as “cognitive prosthetics” to ground the reasoning process itself\. Early works such as PAL\(Gaoet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib13)\)and PoT\(Chenet al\.,[2022b](https://arxiv.org/html/2605.24693#bib.bib9)\)demonstrated that decoupling reasoning from computation significantly reduces hallucinations in mathematical and symbolic tasks\. Advancing this trajectory, recent frameworks\(Wuet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib35); Gouet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib15); Daset al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib1); Shanget al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib40)\)establish a more sophisticated interleaved reasoning–action loop\. By integrating code execution directly into the chain of thought, these methods utilize deterministic feedback for iterative step verification and error correction, enabling the model to navigate complex logical paths with higher reliability\. Although fewer in number compared to general agent papers, these studies highlight a critical distinction: for challenging reasoning tasks, the value of a tool lies not in its complexity or connectivity, but in its ability to provide deterministic, verifiable feedback that constrains the model’s probabilistic output\.

##### Competitive Programming\.

Program synthesis has evolved significantly from early deductive\(Manna and Waldinger,[1980](https://arxiv.org/html/2605.24693#bib.bib53)\)and inductive\(Spector and Robinson,[2002](https://arxiv.org/html/2605.24693#bib.bib54)\)synthesis to neural approaches like DeepCoder\(Baloget al\.,[2016](https://arxiv.org/html/2605.24693#bib.bib55)\)\. However, the release of challenging benchmarks\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.24693#bib.bib18); Jainet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib23); Zhenget al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib3); Xuet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib2); Sunet al\.,[2026](https://arxiv.org/html/2605.24693#bib.bib57)\)revealed a harsh reality: early models failed near\-completely on competition\-level problems\. AlphaCode\(Liet al\.,[2022](https://arxiv.org/html/2605.24693#bib.bib25)\)marked a major milestone in this domain by leveraging massive sampling for competitive programming problems, while AlphaCode2\(Leblondet al\.,[2023](https://arxiv.org/html/2605.24693#bib.bib39)\)and OpenAI’s o1\-ioi\(El\-Kishkyet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib12)\)later adopted reinforcement learning and heuristic inference strategies to achieve gold\-medal performance\. Most recently, OpenAI o3\(OpenAI,[2025a](https://arxiv.org/html/2605.24693#bib.bib56)\)demonstrated that general\-purpose reasoning models could reach Grandmaster status on Codeforces without domain\-specific hand\-crafting\.

Since training such reasoning models is computationally prohibitive, recent research has pivoted to inference\-time enhancements using hand\-crafting LLM workflows like MapCoder\(Islamet al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib21)\), AlphaCodium\(Ridniket al\.,[2024](https://arxiv.org/html/2605.24693#bib.bib29)\), and CODESIM\(Islamet al\.,[2025](https://arxiv.org/html/2605.24693#bib.bib22)\)\. These methods simulate human problem\-solving through iterative “generate\-and\-verify” cycles\. However, they mainly rely on fixed, stateless workflows, which suffer from rigidity: they lack the adaptivity to scale reasoning effort within a problem, and crucially, they fail to transfer experience across problems, discarding valuable debugging insights after each session\. This limitation highlights the need for more dynamic and long\-term evolving mechanisms\.

Similar Articles

Code execution with MCP: Building more efficient agents

Anthropic Engineering

This article from Anthropic explores how integrating code execution with the Model Context Protocol (MCP) can improve the efficiency of AI agents. It addresses challenges like token overload from tool definitions and intermediate results, proposing code execution as a solution to reduce latency and costs.