Agentic Proving for Program Verification
Summary
This paper evaluates Claude Code in an agentic proving framework on the Clever benchmark for program verification, achieving over 98% success in specification generation and end-to-end verification, revealing that existing benchmarks may be insufficient for evaluating modern agentic provers.
View Cached Full Text
Cached at: 05/25/26, 08:59 AM
# Agentic Proving for Program Verification
Source: [https://arxiv.org/html/2605.23772](https://arxiv.org/html/2605.23772)
Alessandro Sosso Department of Computer Science Aarhus University Aarhus, Denmark sosso@cs\.au\.dk &Akhil Arora Department of Computer Science Aarhus University Aarhus, Denmark akhil\.arora@cs\.au\.dk &Bas Spitters Department of Computer Science Aarhus University Aarhus, Denmark spitters@cs\.au\.dk
###### Abstract
Agentic systems have recently emerged as state\-of\-the\-art approaches for automated theorem proving in formal mathematics\. To assess how far these capabilities extend to*program verification*, we evaluate Claude Code in an agentic proving framework onClever, a Lean 4 benchmark for verifiable code generation\. Our results show that Claude generates arguably valid specifications for 98\.8% of problems \(with 81\.3% also accepted byClever’s isomorphism\-based scoring on the correct portion of the benchmark\), certifies implementations against correct ground\-truth specifications for 87\.5% of problems, and reaches a 98\.1% success rate on the end\-to\-end program generation and verification pipeline over entries with self\-consistent premises\. Across all stages, Claude further provides high\-quality feedback on its own attempts \(as confirmed under manual review\), identifying underlying causes of failure and lingering bugs in the dataset\. These findings highlight a growing mismatch between the difficulty of existing program verification benchmarks and the capabilities of modern agentic provers, and point to the need for more rigorous, bug\-resilient evaluation methodologies, and in particular for alternatives to isomorphism\-based scoring of generated specifications\. More broadly, our results provide empirical evidence that tight compiler\-in\-the\-loop agentic paradigms are currently the most effective approach for foundational program verification\.
## 1Introduction
Large Language Models \(LLMs\) are known to hallucinate, and even recent “reasoning” models may produce arguments that appear coherent but are logically invalid\. In contrast, formal logic offers a long\-standing and rigorous alternative: mathematical arguments can be expressed in formal languages and mechanically checked for correctness\. This approach is embodied in interactive proof assistants based on type theory \(e\.g\., Rocq\[[28](https://arxiv.org/html/2605.23772#bib.bib35)\], Lean\[[21](https://arxiv.org/html/2605.23772#bib.bib16)\], Agda\[[23](https://arxiv.org/html/2605.23772#bib.bib36)\]\), higher\-order logic \(e\.g\., Isabelle\[[22](https://arxiv.org/html/2605.23772#bib.bib37)\], HOL4\[[29](https://arxiv.org/html/2605.23772#bib.bib38)\], HOL\-Light\[[12](https://arxiv.org/html/2605.23772#bib.bib39)\]\), and set theory \(e\.g\., Mizar\[[3](https://arxiv.org/html/2605.23772#bib.bib40)\], MetaMath\[[5](https://arxiv.org/html/2605.23772#bib.bib41)\]\)\. In this work, we focus on the first category, and in particular on the Lean 4 theorem prover\.
Recent years have seen striking progress in applying LLMs to automated proof construction in type\-theoretic settings\. Notable systems achieve strong performance on formalized mathematics benchmarks in Lean 4\[[14](https://arxiv.org/html/2605.23772#bib.bib42),[1](https://arxiv.org/html/2605.23772#bib.bib1),[7](https://arxiv.org/html/2605.23772#bib.bib43)\], as well as in Isabelle\[[18](https://arxiv.org/html/2605.23772#bib.bib33),[35](https://arxiv.org/html/2605.23772#bib.bib34)\]and Rocq\[[4](https://arxiv.org/html/2605.23772#bib.bib31),[33](https://arxiv.org/html/2605.23772#bib.bib32)\]\. These systems employ a range of architectural strategies, including whole\-proof generation, proof\-state search, and increasingly, agentic designs that integrate planning, tool use, and iterative refinement\.
In this paper, we study how these advances translate to*program verification*, a setting that is both practically relevant and structurally challenging\. Unlike informal mathematics, program verification requires reasoning about executable artifacts, explicit specifications, and subtle side conditions, all under strict syntactic and semantic constraints\. To this end, we evaluate state\-of\-the\-art agentic paradigms onClever\[[30](https://arxiv.org/html/2605.23772#bib.bib12)\], a recent Lean 4 benchmark for verifiable code generation\. Figure[1](https://arxiv.org/html/2605.23772#S1.F1)provides a schematic overview of our experimental pipeline\.
#### Preliminary experiments\.
Prior testing by the authors evaluated leading agentic and non\-agentic provers on the proof\-generation task of two Lean 4 benchmarks for verifiable code generation,CleverandVerina\[[36](https://arxiv.org/html/2605.23772#bib.bib11)\]\. Those experiments showed that agentic systems based on Claude Code achieved near\-saturation performance on the verified\-correct portions of these benchmarks, with Aristotle\[[1](https://arxiv.org/html/2605.23772#bib.bib1)\]close behind, while specialized whole\-proof generation models and symbolic tactics trailed substantially\. The strongest agents also consistently identified errors in the specifications and implementations of the benchmarks themselves, in several cases proposing fixes and successfully proving the corrected statements\. These findings suggested that compiler\-in\-the\-loop agentic systems are the most effective scaffold for foundational program verification, and that existing program verification benchmarks no longer stress modern agentic provers as much as their design intended\. The present work follows up on both observations, evaluating an agentic system on the fullCleverpipeline using its native evaluation infrastructure rather than the custom distillation used in the preliminary study; we refer to Appendix[A](https://arxiv.org/html/2605.23772#A1)for details\.
#### Contributions\.
Our agentic Claude Code setup reaches a new state of the art on the fullCleverpipeline\. Specifically:
- •Specification certification\.Claude generates arguably valid specifications for98\.8%of problems, of which81\.3%are also accepted byClever’s isomorphism\-based scoring on the portion of the benchmark where no issues in the ground\-truth specifications were identified\.
- •Implementation certification\.Claude generates and successfully certifies implementations against the correct ground\-truth specifications for87\.5%of problems\.
- •End\-to\-end pipeline\.On the entries with self\-consistent premises, Claude reaches a remarkable98\.1%success rate over the full specification \+ implementation \+ proof pipeline\.
- •Self\-diagnostic feedback\.Beyond raw performance, Claude consistently produces well\-argued analyses of its own outputs, identifying and classifying both the underlying causes of failure and lingering bugs in theCleverbenchmark\.
- •Methodological insight\.Our findings expose structural limitations of isomorphism\-against\-a\-ground\-truth as an evaluation method for autoformalization, and motivate concrete proposals for alternative evaluation strategies\.
Figure 1:Schematic overview of our experimental pipeline\. The four generation and proof tasks are identified in yellow:Clever’s original intended setup spans horizontally in red across the top row, while other pathways represent our custom variations of the setup\. The dashed portion in the top\-right corner is the preliminary experimentation setup of Appendix[A](https://arxiv.org/html/2605.23772#A1)\.
## 2Methodology
### 2\.1Dataset
TheCleverbenchmark\[[30](https://arxiv.org/html/2605.23772#bib.bib12)\]is a curated dataset of 161 problems adapted fromHumanEval\[[6](https://arxiv.org/html/2605.23772#bib.bib13)\], testing end\-to\-end automatic code generation and verification in Lean\. Each problem is provided as an annotated Lean file containing multiple sections: \(1\) the natural language \(NL\) specification of a function \(formatted as a Python\-style docstring in a comment\), complete with the function signature and example usages; \(2\) the Lean 4 signature of the formal specificationgenerated\_spec, with asorryplaceholder to be replaced by the generated specification; \(3\) a human\-authored ground\-truth specificationproblem\_specwith the same signature; \(4\) anisomorphismtheorem stating the semantical equivalence betweengenerated\_specandproblem\_spec, with placeholdersorry; \(5\) the Lean 4 signature of the implementation, with placeholdersorry; and \(6\) acorrectnesstheorem stating that the implementation satisfies theproblem\_specspecification, with placeholdersorry\. Helper definitions for problems that require them are also provided, as well as some formal test cases for the implementation\.
To evaluate the models, the benchmark employs a staged pipeline, defining 4 different tasks grouped in two stages:Specification certificationconsists of generating thegenerated\_specspecification \(spec\_gen\), then proving it is isomorphic to the ground truthproblem\_spec\(spec\_iso\);Implementation certificationconsists of implementing the function \(impl\_gen\), then proving it satisfies theproblem\_specspecification \(proof\_gen\)\.
The testing harness provided with the benchmark allows the selective retrieval of the sections related to each task, that can then be merged back into a unique Lean script\. The benchmark then evaluates the solutions by formatting a new Lean script with the generated elements, and testing it for successful compilation and absence ofsorrykeywords\.
#### Benchmark emendation\.
A custom fork of the original benchmark repository was used, in order to track the implementation of fixes in the benchmark and of minor edits in the evaluation scripts required by our custom setup\. Emendations to the benchmark entries were made in order to fix errors and inconsistencies in formatting, typos and erroneous test cases, as experimentation progressed and issues of this kind emerged\. All other elements such as ground\-truth specifications, helper functions, and theorem signatures were left untouched\.
#### Prior work\.
The original paper\[[30](https://arxiv.org/html/2605.23772#bib.bib12)\]reported a strict end\-to\-end ceiling of1/1611/161\(≈0\.62%\\approx 0\.62\\%\) under a pass@600s protocol, achieved by o4\-mini, Claude\-3\.7 and DeepSeek\-R1 in a few\-shot configuration, and by GPT\-4o and Claude\-3\.7 paired with the COPRA proof agent\[[31](https://arxiv.org/html/2605.23772#bib.bib49)\]\. The best per\-stage figures were14/16114/161\(8\.7%8\.7\\%\) on implementation certification \(Claude\-3\.7\+COPRA\) and3/1613/161\(1\.86%1\.86\\%\) on specification certification \(GPT\-4o\+COPRA\)\. The strongest partial\-pipeline follow\-up\[[16](https://arxiv.org/html/2605.23772#bib.bib50)\]reports54\.0%54\.0\\%onClever, but the figure measures only the correctness\-proof stage given a separately\-supplied implementation generated by GPT\-5\.2; frontier reasoning models in the same setting reach at most≈23\.6%\\approx 23\.6\\%\.
Independent follow\-up works have also surfaced spec\-quality issues inClever\. Specification\-level property\-based testing\[[10](https://arxiv.org/html/2605.23772#bib.bib51)\]identifies 18 defective entries \(11\.2%\), comprising 16 underspecified postconditions, 1 implementation bug and 1 incorrect specification; integrated quickcheck on the top\-level correctness theorem\[[16](https://arxiv.org/html/2605.23772#bib.bib50)\]additionally disproves 10 entries\.
### 2\.2Experimental Setup
Our experimental setup is built on top of the Claude Agent SDK\[[2](https://arxiv.org/html/2605.23772#bib.bib26)\]running Claude Opus 4\.6, with each agent instance primed with*lean\-lsp\-mcp*\[[9](https://arxiv.org/html/2605.23772#bib.bib21)\], a specialized MCP server that interfaces with the Lean LSP and provides search tools to find relevant lemmas in the project context and Mathlib, and*lean4\-skills*\[[11](https://arxiv.org/html/2605.23772#bib.bib17)\], a package injecting Lean\-specific instructions and commands, tactic best\-practices and workflow patterns into the model’s context\. For each attempt, the system initializes a temporary directory from a template as a Lean project, then creates a Lean file formatted using relevant components of a benchmark entry, and prompts a fresh agent instance to edit the file according to the current task\. Upon completion or timeout \(set at 3600 seconds\), the resulting file is retrieved along with the trace of the agent conversation, of tool invocations, and metadata of the run\.
Each new Claude instance is provided with a description of the task as system prompt, outlining details and the objective of the task, the format of the input file and of the desired output, behavior rules, and eventual guidelines that the agent is expected to follow\. The agent is then simply prompted with either a short NL prompt asking to fill sections of the target Lean file following the previously provided instructions \(spec\_gen,impl\_gen\), or, in case the task requires proving a given theorem \(spec\_iso,proof\_gen\), the/lean4:autoprovecommand from the skills package, which starts a multi\-cycle theorem\-proving routine\.
Claude is then asked to report on the outcome of the task with a flag, either in the same session as a follow\-up prompt after termination or through the successive analysis of the Lean outputs and agent replies\. The agent can classify it as successful \(ok\), unsuccessful \(fail\), or rigorously argued as impossible due to bugs in the input or incorrect assumptions \(issue\) \(see Appendix[B\.1](https://arxiv.org/html/2605.23772#A2.SS1)\)\.
For theorem\-proving tasks \(spec\_iso,proof\_gen\), the ‘impossibility’ case was further subdivided intomismatch, when the theorem was not provable but the input elements were different yet valid interpretations of the source NL description and test cases, andissue, when the proof failed due to at least one element \(specification / implementation / fixed benchmark component such as signatures\) being erroneous, i\.e\. contradicting the NL description \(again, in both cases rigorous supporting arguments or counterexamples must have been provided\)\. To extract deeper diagnostic insights, results in theissuecategory were further analyzed with granularity to identify where the bug originated: in the ground\-truthproblem\_spec, the generatedgenerated\_specor both forspec\_iso; and in the specification, implementation, or benchmark structural components forproof\_gen\.
Entries in themismatchcategory were found only in the solutions ofspec\_iso\. Further analysis was conducted in this case to identify if either direction of the isomorphism theorem was provable \(i\.e\. whether the generated specification was strictly stronger or weaker than the ground truth\) or whether neither held\.
Cross\-checks of the Claude\-reported flags against the evaluation through the benchmark harness showed no discrepancies, as all and onlyoksolutions passed the evaluation, except for two instances: one where compilation failed due to a bug in the test cases, but the solution was otherwise correct; the other, an instance that successfully passed the evaluation, but was reported asfaildue to Claude timing\-out before termination\.
## 3Results
We ran experiments on all four pipeline tasks using a variety of prompts of increasing detail and guidance\. In all cases, a single attempt \(pass@1\) was made, as this proved to be enough for Claude to produce specifications, implementations and proofs of consistently high quality, as confirmed under manual inspection\. The pass rates reported by the benchmark harness, however, did not always reflect this: in several configurations the success rate fell substantially below what manual review of the outputs would suggest\. We therefore designed the sequence of prompts described below to progressively address this discrepancy\. More details on results can be found in Appendix[C](https://arxiv.org/html/2605.23772#A3)\.
### 3\.1Specification certification
#### Simple prompt\.
The first run used a minimal prompt outlining only the task instructions and the expected input/output format, without any specific guidance on how to proceed\. Thespec\_gentask ran on the NL description of the problem alone, without example usages or formal test cases, matching the protocol of the original benchmark authors\. It returned theokflag for all 161 entries\. Thespec\_isotask on the generated specifications did not exhibit the same pass rate\. Only 4 of the failures stemmed from a failure in synthesizing the proof \(timeouts\); the remainder corresponded to non\-isomorphic specifications\. Among the non\-isomorphic entries, Claude flagged 22 asmismatch, and 67 asissuedue to bugs in the ground\-truthproblem\_spec, against only 9 attributable to bugs in thegenerated\_specand 1 in both\. In other words, all but 14 entries \(issues ingenerated\_specplus failures\) corresponded to valid generated specifications\. Within the 22mismatchentries,generated\_specwas strictly stronger thanproblem\_specin 15 cases against only 1 instance of the contrary, with the remaining 6 admitting neither direction of the implication\. This skew is consistent with the tendency of the ground\-truth specifications towards less restrictive formulations: they typically do not constrain the behaviour of the implementation on inputs outside the valid domain, for instance\.
#### Guided prompt\.
A second prompt aimed to align the generated specifications more closely with the ground\-truth, without revealing it\. We obtained candidate guidelines from an automated analysis of the previous run’s outputs and conversation traces, aimed at identifying the recurring design choices and patterns underlying the ground\-truth specifications, and refined them by hand\. We also removed any direct references or examples drawn from the benchmark entries at this stage\. The guidelines covered subjects such as the structural pattern of the specification, the treatment of edge cases and degenerate inputs, the preference for*relational*formulations of properties, and useful Lean 4 syntactic recommendations\. In addition, the prompt imposed a specific format for the specification body \(again derived from the analysis of the ground\-truth specifications\), and the input file now included the example usages from the NL description together with the formal Lean test cases, with the goal of clarifying which interpretation of the problem the ground\-truth had adopted\.
spec\_genagain produced high\-quality specifications, conforming to the required format and guidelines\.spec\_isodid not show a significant overall improvement, except for a decrease ofissueentries due to bugs ingenerated\_spec\(down to 2\), most of which now appeared asmismatchinstead\. This plausibly reflects the disambiguation provided by the more detailed instructions and the additional test cases\. The total number ofmismatchentries thus increased to 31, whileissueentries arising from bugs in the ground\-truth specification grew to 70, also driven by some ground\-truth specifications that now conflicted with the supplied test cases\. The skew withinmismatchentries towards stronger generated specifications was less pronounced than before \(17 strictly stronger, 8 strictly weaker, and 6 admitting neither direction\), although the direction remained consistent with the previous run for entries already classified as mismatches\.
#### Multiple specifications prompt\.
The infeasibility of proving the isomorphism formismatchentries \(where, recall, both specifications validly interpreted the NL description\) stemmed from the generated specification not aligning with the same interpretation adopted by the ground\-truth\. To increase the likelihood of producing a specification matching that interpretation, we designed a third prompt to obtain, for each problem, multiple candidate specifications covering distinct admissible readings of the NL description\.
We instructed the agent to output “all and only” valid interpretations of the problem, that is the set of semantically distinct specifications that a reasonable reader could defend as correct without contradicting the NL description or the test cases\. Variation ranged across dimensions such as precondition boundary, edge\-case permissiveness, quantifier interpretation, operation semantics, and property strength \(minimal vs\. maximal\)\. The prompt retained the same guidelines and example test cases as the previous one\.
We tested this prompt on the 105 entries flagged asissueormismatchin the previous run\. Claude generated up to 4 specifications per problem, with an average of 2\.76 and a total of 290 candidate specifications\. We then ran thespec\_isotask on each problem against all of its candidate specifications, with multiple Lean files as inputs\. Only 15 candidategenerated\_specsuccessfully proved isomorphic to the respectiveproblem\_spec, spanning 11 entries; among the remaining entries, 66 received theissueflag in at least one attempt, while 24 receivedmismatchoverall\.
#### Final classification\.
The outputs of the three runs underwent a final manual examination, that additionally fixed the classification where runs disagreed and identified lingering bugs not captured by the automated flags \(e\.g\. cases wherespec\_isosucceeded due to bugs in the underlying helper definitions\)\. This examination also confirmed that Claude’s per\-entry classifications and supporting arguments were reliable, accurate, and well\-reasoned in all cases\. The resulting consolidated classification tallies 65okentries, where Claude generated at least one valid specification that proved isomorphic to the ground\-truth; 81issueentries,111all flagged asissuein at least onespec\_isorun, except for one \(problem 160\) where most runs succeeded, but as a consequence of all specs relying on the same bugged helper function\.with arguably bugged ground\-truthproblem\_spec; 13mismatchentries, where the ground\-truthproblem\_specwas valid but never matched by any validgenerated\_spec; and 2failentries, exhibiting only timeouts or bugs in the generated specifications\. Therefore, Claude Code generated specifications that validly interpreted the source NL description of the function in at least one run for98\.8%of the entries\.
Figure 2:Distribution of output flags forspec\_isoruns\. The following color scheme will be used throughout the paper: green forokentries, red forfail, yellow formismatch, blue forissuein benchmark components, orange for those in generated components, purple for issues in both\.
### 3\.2Implementation certification
#### From ground\-truth specifications\.
Following the originalCleverprotocol, the first run of this stage generated implementations starting from the ground\-truth specifications provided in the benchmark\. Theimpl\_gentask succeeded on all entries, despite Claude initially flagging a few runs asissueafter identifying bugs in the provided test cases, which we subsequently fixed\. Theproof\_gentask on the resulting implementations produced 104okentries successfully certified, 9failentries due to timeouts, and 48 entries flagged asissue\. The vast majority of theseissueflags \(42 out of 48\) originated in the provided specifications, and all of them belong to the 81 problems identified asissuein the final classification ofspec\_iso\. Bugs in the ground\-truth specifications partly explain this high rate of issues, as Claude sometimes generates correct implementations that do not satisfy the erroneous specification\. On the other hand, Claude sees the specification at generation time, and can therefore produce implementations that satisfy it; combined with the fact that many specification bugs come from underspecification, this also explains why a number of entries with bugged ground\-truth specifications still succeed atproof\_gen\. Among the 80 problems with no bug found in the finalspec\_isoanalysis, 70 implementations were successfully generated and certified, for a final performance rate of87\.5%\.
#### From generated specifications\.
A second run generated implementations starting from the specifications produced by the simplespec\_genprompt\. As before,impl\_genreturnedokon all entries, except for three instances where Claude identified a mismatch between the generated specification and the provided test cases\. This is unsurprising, as the simplespec\_genprompt did not include the formal test cases or the example usages from the NL description, so we could expect inconsistencies between the two\. The correspondingproof\_genrun produced 144 successfully certifiedokentries, 6fail, and 11issue\.
We then carried out the same pipeline on the specifications produced by the guidedspec\_genprompt\.impl\_genagain succeeded on all entries; no specification–test\-cases mismatch arose this time, as the test cases had already been supplied atspec\_gentime\. Theproof\_gentask produced 148 successfully certifiedokentries, again 6fail, and 7issue\.
In both runs, theissueentries originate from a few distinct sources: bugs in the generated implementations \(e\.g\. implementations relying on finite fuel and thus not generalizing across all valid inputs\); bugs in the generated specifications \(e\.g\. incorrect handling of borderline cases\); and, in some cases, the fixed elements in the benchmark itself\. We discuss these in detail in Section[4](https://arxiv.org/html/2605.23772#S4)\.
Across both runs, 154 entries had at least oneokattempt, or a 95\.7% success rate on the end\-to\-end program verification pipeline, having automatically generated arguably correct specifications, implementations and correctness proofs over two attempts \(simple and guided prompts\)\. If we remove the 4 instances ofissuein the benchmark itself, this success rate further rises to98\.1%\.
Figure 3:Distribution of output flags forproof\_genruns\. Note the change of coloring for specificationissuebetween rows, as their source changes from the ground\-truth to automatic generation\.
## 4Discussion
### 4\.1Specification certification
Autoformalization is hard to evaluate\. Beyond the technical hurdle of producing a Lean specification that compiles, two more substantive issues dominate: choosing the right*strength*for the specification \(loose enough not to over\-constrain valid implementations, but tight enough to rule out incorrect ones\), and committing to a single*interpretation*of an inherently ambiguous NL description\. Problem 124, for instance, requires dates in themm\-dd\-yyyyformat, but it is left implicit whether single\-digit months and days should be accepted or rejected; and the ground\-truth specification of problem 27 only requires the output characters to have the opposite case of the corresponding input ones, an uncommon but defensible reading of the prose\.
In this sense, the central challenge in autoformalizing NL sources is that of*formalizing the intent*behind them, bridging the gap between the informal human specification and its formal counterpart\[[15](https://arxiv.org/html/2605.23772#bib.bib46)\]\. Even direct human evaluation can leave subtle bugs unnoticed, as they are often hard to spot\. Despite this, our results indicate that the agentic paradigm copes well with this kind of evaluation, as evidenced by the fact that only a handful of generated specifications \(and no ground\-truth ones\) slipped throughspec\_isoas ostensibly valid, only to be flagged asissuedownstream atproof\_gentime, when the impossibility of certifying a valid implementation against them surfaced bugs that the equivalence check had missed\.
#### Ground\-truth specifications\.
Table 1:Frequencies by type of ground\-truth specification issues identified across the Simple, Guided, Multiple runs ofspec\_isoand theproof\_genrun from ground\-truth specs\. Asterisks count the entries that were flagged asissuein both specs \(spec\_iso\) or in spec and impl \(proof\_gen\); superscripts mark entries classified under multiple types via the partner type’s initial\.TypeSimp\.Guid\.MultiProofProblem IDs*Conjunct used as guard*1413299\+1O9\{\+\}1^\{O\}24, 44, 49, 50, 72, 81, 84, 110, 115, 122, 140, 150, 152, 158, 163*Precedence bug*101129∗∗∗29^\{\*\*\*\}323, 45, 51, 69, 75, 85, 88, 93, 109, 126, 132, 154*Universal over invalid domain*761657, 67, 111, 125, 130, 143, 155*Nat arithmetic truncation*3∗3^\{\*\}28236, 109, 139, 153*Out\-of\-bounds default access*2372\+1C2\{\+\}1^\{C\}26, 64, 66, 81*Type\-system mismatch*3283100, 106, 151*Vacuous disjunct*337∗∗∗7^\{\*\*\*\}12, 68, 161*Trivial existential*12534, 39*Wrong semantics*181938∗38^\{\*\}15∗∗15^\{\*\*\}27, 47, 65, 86, 89, 90, 91, 94, 95, 98, 99, 103, 108, 114, 116, 117, 121, 129, 130, 134, 144*Missing constraint*1512220, 79, 97, 99, 105, 138, 139, 159*Extra/over\-restrictive constraint*215171, 156*Missing edge\-case clause*33715, 62, 78, 104As stated in Section[3](https://arxiv.org/html/2605.23772#S3), 80 of the 161 ground\-truthproblem\_specwere flagged at least once as having bugs\. We then had Claude classify the bugs that arose in eachissueentry across the threespec\_isoruns and theproof\_genrun from ground\-truth specifications, identifying a comprehensive taxonomy of 12 error types \(see Appendix[B\.2](https://arxiv.org/html/2605.23772#A2.SS2)\)\. For 5 of those problems \(numbers 81, 99, 109, 130, 139\), two types of issue were identified\. Most issues cluster on*Lean\-encoding*pitfalls:*conjunct\-as\-guard*patterns using\(P∧Q\)\(P\\wedge Q\)in place of\(P→Q\)\(P\\to Q\)\(15 problems\);*precedence bugs*from omitted parentheses around∧,→,↔\\wedge,\\to,\\leftrightarrow\(12\); universal quantifiers over inputs where the body is unsatisfiable \(7\); plus smaller counts of lossyNatarithmetic, default\-value access viagetElem\!/get\!, type\-system mismatches, vacuous disjuncts, and trivial existentials\. Together, these Lean\-language hazards affect 48 of the 80 problems\. The remaining 34 problems carry errors in the specification semantics:*over\-*or*under\-restrictive constraints*\(10 problems\),*missing edge\-case clauses*\(4\), and, most prominently, overall wrong interpretations of the problem departing from the NL description \(21 problems\)\. See Table[1](https://arxiv.org/html/2605.23772#S4.T1)\.
#### Generated specifications\.
Table 2:Frequencies by type of Claude\-generated specification*semantic*issues only acrossspec\_isoruns as in Table[1](https://arxiv.org/html/2605.23772#S4.T1)and theproof\_genruns from generated specs \(Simple and Guided prompts\)\. Asterisks and single\-letter superscripts formatted similarly to Table[1](https://arxiv.org/html/2605.23772#S4.T1)\.In contrast, Claude\-generated specifications fail in a*semantic*register, misstating the relation or under\-specifying it, and rarely in a*syntactic*one \(only two instances: a*trivial existential*and a*type\-system mismatch*\)\. Within the semantic class,*missing constraints*are the most frequent and*wrong semantics*less so: an effective inversion of the ground\-truth profile, suggesting that Claude accurately formalizes valid interpretations\. Per\-run counts expose two phenomena\. First, guidance is effective: the design guidelines instructed Claude on the policy for imposing constraints, producing a five\-fold drop in issues \(from 10 to 2\) and the complete disappearance of*missing\-constraint*and*over\-/under\-restrictive\-constraint*errors\. Second, those errors reappear in the multi\-specification run despite the same guidelines: beyond the mechanical multiplication of surface area from the larger number of generated specifications per problem, this also reflects Claude’s intentionally more “adventurous” generation, attempting to cover as much of the “interpretation space” as possible\. See Table[2](https://arxiv.org/html/2605.23772#S4.T2)\.
#### Mismatches\.
Table 3:Mismatch frequencies by type and direction across thespec\_isoruns\. Direction symbols:\>\>generated stronger than reference,//neither,<<generated weaker\. Single\-letter superscripts are formatted similarly to Table[1](https://arxiv.org/html/2605.23772#S4.T1)\.The distribution ofmismatchentries, where neither specification is buggy but the two are not isomorphic, reflects*authoring choices*rather than errors, often along several dimensions per entry\. 73 firings concern the boundary of the input domain \(gaps in preconditions, edge\-case handling\); 37 concern conditions imposed on the function output \(ordering of lists vs\. multisets, canonical\-form impositions\); 22 concern structural and definitional choices; and a minority of 10 are semantic in nature, e\.g\. different but mathematically defensible formulations, or quantification over indices instead of values\. See Table[3](https://arxiv.org/html/2605.23772#S4.T3)\.
### 4\.2Implementation certification
#### Implementations\.
Claude detected issues in the generated implementations for only 7 problems across the threeproof\_genruns\. Three involve hard\-coded*bounded fuel*\(e\.g\. the 200\-iteration Newton–Raphson loop in problem 32, see Appendix[D](https://arxiv.org/html/2605.23772#A4)\) that does not suffice for the arbitrarily large inputs that the specification quantifies over\. The remaining four are off\-by\-one indexing errors \(problems 47 and 163\), a missing case \(a stationary point for Newton’s method in problem 32 again\), and a single wrong\-algorithm choice \(problem 116: the implementation sorts numerically, while the specification requires sorting by binary digit count\)\.
#### Benchmark\.
Four problems carry issues in the fixed, non\-editable parts of the benchmark itself, which neither the generated specification nor the generated implementation can repair\. Problem 123 asks for acollatziterator: any proof of correctness against the formal specification essentially requires a proof of the Collatz conjecture itself\. Problem 39 \(prime\_fib\) similarly hinges on the open question of the existence of infinitely many Fibonacci primes\. Problem 160 \(do\_algebra\) ships a fixed helperapplyOpthat returns anoneon division by zero, which is not parsed by other parts of the script\. A notable example, Problem 32, exhibits a missing rational\-root precondition, and highlights other issues and phenomena observed in the results\. It is discussed in detail in Appendix[D](https://arxiv.org/html/2605.23772#A4)\.
### 4\.3Resources
All experiments ran on a*HP EliteBook 850 G6*laptop, as no particular computational resources were required: the workstation hosted our minimal evaluation infrastructure and the Lean\-related routines accessed via MCP, while inference was performed through Anthropic’s Claude API\. Runtimes per solve were usually between 10 and 15 minutes for proving tasks \(spec\_isoandproof\_gen\), against around 90 seconds on average for formalization tasks \(spec\_genandimpl\_gen\), with an exception for the multi\-specificationspec\_genrun averaging around 200 seconds\. Costs per solve \(as reported by the Claude Agent SDK\) scaled linearly with time, on average around $0\.3 for formalization tasks \($0\.5 in the multi\-specspec\_gencase\) and $2\.0–2\.5 for proving tasks\. The longest\-running instances that reached the 1\-hour timeout stood at around $10 per run\. Appendix[C\.2](https://arxiv.org/html/2605.23772#A3.SS2)provides scatter plots with a more detailed breakdown of runtimes\.
## 5Conclusions
The experiments reported in this paper meet the contributions announced in Section[1](https://arxiv.org/html/2605.23772#S1)\. In our setup, Claude reaches a new state of the art on the fullCleverpipeline: 98\.8% of problems receive an arguably valid generated specification \(of which 81\.3% also pass the isomorphism scoring on the benchmark’s correct portion\), 87\.5% of implementation certifications succeed from correct ground\-truth specifications, and 98\.1% of problems with self\-consistent premises are solved end\-to\-end\. Across all stages, Claude consistently produces well\-argued classifications of its own outputs, surfacing lingering bugs in the benchmark\. These findings corroborate the claim that compiler\-in\-the\-loop agentic systems are currently the most effective for foundational program verification, and that existing program\-verification benchmarks no longer adequately stress modern agentic provers\.
The same findings also expose structural limitations in how theCleverbenchmark is designed and evaluated, motivating the need for alternatives in future iterations\. Indeed, for isomorphism\-based scoring to remain a meaningful evaluation signal, the generated and reference specifications must come from a shared intent: this calls for more precise NL descriptions, explicit formalization hints \(e\.g\. pre\-committing to a type signature\), or an iterative disambiguation pass run before formalization, in which an LLM judge surfaces and resolves multiple admissible readings until none remains\. Otherwise, where formal isomorphism is intractable, the reliability of Claude’s analyses under manual review suggests agentic LLM judges, complemented by property\-based testing of the specifications themselves, as pragmatic alternatives\. Finally, asking the model to generate an implementation and a correctness proof against its own specification is also effective at surfacing vacuous, contradictory or under\-constrained specs that the equivalence check had missed, suggesting that future harnesses could combine these signals rather than rely on any of them in isolation\.
## Acknowledgments and Disclosure of Funding
We thank the Ethereum foundation for partially funding this research\. Arora’s lab is partly supported by grants from the Novo Nordisk Foundation \(NNF24OC0099109\), the Pioneer Centre for AI, and EU Horizon 2020 \(101168951\)\. We also gratefully acknowledge generous gifts from Microsoft and It\-vest \- networking universities\.
## References
- \[1\]T\. Achim, A\. Best, A\. Bietti, K\. Der, M\. Fédérico, S\. Gukov, D\. Halpern\-Leistner, K\. Henningsgard, Y\. Kudryashov, A\. Meiburg, M\. Michelsen, R\. Patterson, E\. Rodriguez, L\. Scharff, V\. Shanker, V\. Sicca, H\. Sowrirajan, A\. Swope, M\. Tamas, V\. Tenev, J\. Thomm, H\. Williams, and L\. Wu\(2025\-10\)Aristotle: IMO\-level Automated Theorem Proving\.arXiv\.External Links:2510\.01346,[Document](https://dx.doi.org/10.48550/arXiv.2510.01346)Cited by:[1st item](https://arxiv.org/html/2605.23772#A1.I1.i1.p1.1),[§1](https://arxiv.org/html/2605.23772#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[2\]\(2026\)Claude\-agent\-sdk\-python\.Cited by:[§2\.2](https://arxiv.org/html/2605.23772#S2.SS2.p1.1)\.
- \[3\]G\. Bancerek, C\. Byliński, A\. Grabowski, A\. Korniłowicz, R\. Matuszewski, A\. Naumowicz, K\. Pąk, and J\. Urban\(2015\)Mizar: state\-of\-the\-art and beyond\.InIntelligent Computer Mathematics,M\. Kerber, J\. Carette, C\. Kaliszyk, F\. Rabe, and V\. Sorge \(Eds\.\),Cham,pp\. 261–279\.External Links:ISBN 978\-3\-319\-20615\-8,[Document](https://dx.doi.org/10.1007/978-3-319-20615-8%5F17)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[4\]B\. Bayazıt, Y\. Li, and X\. Si\(2025\)A case study on the effectiveness of llms in verification with proof assistants\.External Links:2508\.18587Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[5\]M\. Carneiro, C\. E\. Brown, and J\. Urban\(2023\)Automated theorem proving for metamath\.In14th International Conference on Interactive Theorem Proving \(ITP 2023\),A\. Naumowicz and R\. Thiemann \(Eds\.\),Leibniz International Proceedings in Informatics \(LIPIcs\), Vol\.268,Dagstuhl, Germany,pp\. 9:1–9:19\.External Links:[Document](https://dx.doi.org/10.4230/LIPIcs.ITP.2023.9)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[6\]M\. Chen, J\. Tworek, H\. Jun, Q\. Yuan, H\. P\. de Oliveira Pinto, J\. Kaplan, H\. Edwards, Y\. Burda, N\. Joseph, G\. Brockman, A\. Ray, R\. Puri, G\. Krueger, M\. Petrov, H\. Khlaaf, G\. Sastry, P\. Mishkin, B\. Chan, S\. Gray, N\. Ryder, M\. Pavlov, A\. Power, L\. Kaiser, M\. Bavarian, C\. Winter, P\. Tillet, F\. P\. Such, D\. Cummings, M\. Plappert, F\. Chantzis, E\. Barnes, A\. Herbert\-Voss, W\. H\. Guss, A\. Nichol, A\. Paino, N\. Tezak, J\. Tang, I\. Babuschkin, S\. Balaji, S\. Jain, W\. Saunders, C\. Hesse, A\. N\. Carr, J\. Leike, J\. Achiam, V\. Misra, E\. Morikawa, A\. Radford, M\. Knight, M\. Brundage, M\. Murati, K\. Mayer, P\. Welinder, B\. McGrew, D\. Amodei, S\. McCandlish, I\. Sutskever, and W\. Zaremba\(2021\)Evaluating Large Language Models Trained on Code\.arXiv\.External Links:2107\.03374,[Document](https://dx.doi.org/10.48550/arXiv.2107.03374)Cited by:[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.p1.1)\.
- \[7\]W\. Chen, S\. Wei, H\. Wang,et al\.\(2025\)Seed\-prover: deep and broad reasoning for automated theorem proving\.External Links:2507\.23726Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[8\]J\. Clune\(2026\)LeanHammer\.GitHub\.Cited by:[3rd item](https://arxiv.org/html/2605.23772#A1.I1.i3.p1.1)\.
- \[9\]O\. Dressler\(2026\)Lean\-lsp\-mcp\.GitHub\.Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px2.p1.1),[§2\.2](https://arxiv.org/html/2605.23772#S2.SS2.p1.1)\.
- \[10\]Y\. Feng, D\. Kafle, V\. Gladshtein, V\. Kurin, G\. Pîrlea, Q\. Zhao, P\. Müller, and I\. Sergey\(2026\)Certified program synthesis with a multi\-modal verifier\.External Links:2604\.16584,[Link](https://arxiv.org/abs/2604.16584)Cited by:[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.SSS0.Px2.p2.1)\.
- \[11\]C\. Freer\(2026\)Lean4\-skills\.GitHub\.Note:\(preliminary experiments of Appendix[A](https://arxiv.org/html/2605.23772#A1)on the earlier commit[5d6bd03](https://github.com/cameronfreer/lean4-skills/commit/5d6bd03a024d63c02225bd9814c981275e1c9ee5)\)Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px1.p1.1),[§2\.2](https://arxiv.org/html/2605.23772#S2.SS2.p1.1)\.
- \[12\]J\. Harrison\(1996\)HOL light: a tutorial introduction\.InFormal Methods in Computer\-Aided Design \(FMCAD\),Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[13\]Y\. Huang and L\. F\. Yang\(2025\-09\)Winning Gold at IMO 2025 with a Model\-Agnostic Verification\-and\-Refinement Pipeline\.arXiv\.External Links:2507\.15855,[Document](https://dx.doi.org/10.48550/arXiv.2507.15855)Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px2.p1.1)\.
- \[14\]T\. Hubert, H\. Mehta, L\. Sartran,et al\.\(2025\)Olympiad\-level formal mathematical reasoning with reinforcement learning\.Nature\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09833-y)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[15\]S\. K\. Lahiri\(2026\-03\)Intent Formalization: A Grand Challenge for Reliable Coding in the Age of AI Agents\.arXiv\.External Links:2603\.17150,[Document](https://dx.doi.org/10.48550/arXiv.2603.17150)Cited by:[§4\.1](https://arxiv.org/html/2605.23772#S4.SS1.p2.1)\.
- \[16\]Z\. Li, Z\. Yang, D\. He, H\. Zhao, A\. Zhao, S\. Tang, K\. Yang, A\. Gupta, Z\. Su, and C\. Jin\(2026\)Goedel\-Code\-Prover: hierarchical proof search for open state\-of\-the\-art code verification\.External Links:2603\.19329,[Link](https://arxiv.org/abs/2603.19329)Cited by:[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.SSS0.Px2.p1.8),[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.SSS0.Px2.p2.1)\.
- \[17\]J\. Limperg and A\. H\. From\(2023\-01\)Aesop: White\-Box Best\-First Proof Search for Lean\.InProceedings of the 12th ACM SIGPLAN International Conference on Certified Programs and Proofs,CPP 2023,New York, NY, USA,pp\. 253–266\.External Links:[Document](https://dx.doi.org/10.1145/3573105.3575671),ISBN 979\-8\-4007\-0026\-2Cited by:[3rd item](https://arxiv.org/html/2605.23772#A1.I1.i3.p1.1)\.
- \[18\]X\. Lin, Q\. Cao, Y\. Huang, H\. Wang, J\. Lu, Z\. Liu, L\. Song, and X\. Liang\(2024\)FVEL: interactive formal verification environment with large language models via theorem proving\.External Links:2406\.14408Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[19\]Y\. Lin, S\. Tang, B\. Lyu, Z\. Yang, J\. Chung, H\. Zhao, L\. Jiang, Y\. Geng, J\. Ge, J\. Sun, J\. Wu, J\. Gesi, X\. Lu, D\. Acuna, K\. Yang, H\. Lin, Y\. Choi, D\. Chen, S\. Arora, and C\. Jin\(2025\-08\)Goedel\-Prover\-V2: Scaling Formal Theorem Proving with Scaffolded Data Synthesis and Self\-Correction\.arXiv\.External Links:2508\.03613,[Document](https://dx.doi.org/10.48550/arXiv.2508.03613)Cited by:[2nd item](https://arxiv.org/html/2605.23772#A1.I1.i2.p1.1)\.
- \[20\]J\. Liu, Z\. Zhou, Z\. Zhu, M\. D\. Santos, W\. He, J\. Liu, R\. Wang, Y\. Xie, J\. Zhao, Q\. Wang, L\. Zhi, J\. Li, and W\. Li\(2026\-01\)Numina\-Lean\-Agent: An Open and General Agentic Reasoning System for Formal Mathematics\.arXiv\.External Links:2601\.14027,[Document](https://dx.doi.org/10.48550/arXiv.2601.14027)Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px2.p1.1)\.
- \[21\]L\. d\. Moura and S\. Ullrich\(2021\)The lean 4 theorem prover and programming language\.InAutomated Deduction – CADE 28: 28th International Conference on Automated Deduction, Virtual Event, July 12–15, 2021, Proceedings,Berlin, Heidelberg,pp\. 625–635\.External Links:ISBN 978\-3\-030\-79875\-8,[Document](https://dx.doi.org/10.1007/978-3-030-79876-5%5F37)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[22\]T\. Nipkow, L\. C\. Paulson, and M\. Wenzel\(2002\)Isabelle/hol: a proof assistant for higher\-order logic\.Lecture Notes in Computer Science, Vol\.2283,Springer\.Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[23\]U\. Norell\(2007\)Towards a practical programming language based on dependent type theory\.Ph\.D\. Thesis,Chalmers University of Technology and Göteborg University\.Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[24\]C\. Norman and J\. Avigad\(2025\-04\)Canonical for Automated Theorem Proving in Lean\.arXiv\.External Links:2504\.06239,[Document](https://dx.doi.org/10.48550/arXiv.2504.06239)Cited by:[3rd item](https://arxiv.org/html/2605.23772#A1.I1.i3.p1.1)\.
- \[25\]Project Numina\(2026\)Lean\-lsp\-mcp\.Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px2.p1.1)\.
- \[26\]Project Numina\(2026\)Numina\-lean\-agent\.Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px2.p1.1)\.
- \[27\]Z\. Z\. Ren, Z\. Shao, J\. Song, H\. Xin, H\. Wang, W\. Zhao, L\. Zhang, Z\. Fu, Q\. Zhu, D\. Yang, Z\. F\. Wu, Z\. Gou, S\. Ma, H\. Tang, Y\. Liu, W\. Gao, D\. Guo, and C\. Ruan\(2025\-04\)DeepSeek\-Prover\-V2: Advancing Formal Mathematical Reasoning via Reinforcement Learning for Subgoal Decomposition\.arXiv\.External Links:2504\.21801,[Document](https://dx.doi.org/10.48550/arXiv.2504.21801)Cited by:[2nd item](https://arxiv.org/html/2605.23772#A1.I1.i2.p1.1)\.
- \[28\]Rocq Development Team\(2026\)The rocq prover\.Note:Project website[https://rocq\-prover\.org](https://rocq-prover.org/)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[29\]K\. Slind and M\. Norrish\(2008\)A brief overview of HOL4\.InTheorem Proving in Higher Order Logics \(TPHOLs 2008\),Lecture Notes in Computer Science, Vol\.5170,pp\. 28–32\.External Links:[Document](https://dx.doi.org/10.1007/978-3-540-71067-7%5F6)Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p1.1)\.
- \[30\]A\. Thakur, J\. Lee, G\. Tsoukalas, M\. Sistla, M\. Zhao, S\. Zetzsche, G\. Durrett, Y\. Yue, and S\. Chaudhuri\(2025\-10\)CLEVER: A Curated Benchmark for Formally Verified Code Generation\.arXiv\.External Links:2505\.13938,[Document](https://dx.doi.org/10.48550/arXiv.2505.13938)Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.p1.1),[§1](https://arxiv.org/html/2605.23772#S1.p3.1),[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.SSS0.Px2.p1.8),[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.p1.1)\.
- \[31\]A\. Thakur, G\. Tsoukalas, Y\. Wen, J\. Xin, and S\. Chaudhuri\(2024\)An in\-context learning agent for formal theorem\-proving\.External Links:2310\.04353,[Link](https://arxiv.org/abs/2310.04353)Cited by:[§2\.1](https://arxiv.org/html/2605.23772#S2.SS1.SSS0.Px2.p1.8)\.
- \[32\]The Lean Developers\(2024\)The lean language reference: the grind tactic\.Lean FRO\.Note:Lean version 4\.15\.0 or laterExternal Links:[Link](https://lean-lang.org/doc/reference/latest/The--grind--tactic/)Cited by:[3rd item](https://arxiv.org/html/2605.23772#A1.I1.i3.p1.1)\.
- \[33\]J\. Viennot, G\. Baudart, E\. J\. G\. Arias, and M\. Lelarge\(2025\)MiniF2F in rocq: automatic translation between proof assistants – a case study\.External Links:2503\.04763Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[34\]H\. Wang, M\. Unsal, X\. Lin, M\. Baksys, J\. Liu, M\. D\. Santos, F\. Sung, M\. Vinyes, Z\. Ying, Z\. Zhu, J\. Lu, H\. de Saxcé, B\. Bailey, C\. Song, C\. Xiao, D\. Zhang, E\. Zhang, F\. Pu, H\. Zhu, J\. Liu, J\. Bayer, J\. Michel, L\. Yu, L\. Dreyfus\-Schmidt, L\. Tunstall, L\. Pagani, M\. Machado, P\. Bourigault, R\. Wang, S\. Polu, T\. Barroyer, W\. Li, Y\. Niu, Y\. Fleureau, Y\. Hu, Z\. Yu, Z\. Wang, Z\. Yang, Z\. Liu, and J\. Li\(2025\-04\)Kimina\-Prover Preview: Towards Large Formal Reasoning Models with Reinforcement Learning\.arXiv\.External Links:2504\.11354,[Document](https://dx.doi.org/10.48550/arXiv.2504.11354)Cited by:[2nd item](https://arxiv.org/html/2605.23772#A1.I1.i2.p1.1)\.
- \[35\]Q\. Xu, R\. Wang, P\. Wang, H\. Li, and C\. Watt\(2025\)A minimalist proof language for neural theorem proving over isabelle/hol\.External Links:2507\.18885Cited by:[§1](https://arxiv.org/html/2605.23772#S1.p2.1)\.
- \[36\]Z\. Ye, Z\. Yan, J\. He, T\. Kasriel, K\. Yang, and D\. Song\(2025\-10\)VERINA: Benchmarking Verifiable Code Generation\.arXiv\.External Links:2505\.23135,[Document](https://dx.doi.org/10.48550/arXiv.2505.23135)Cited by:[Appendix A](https://arxiv.org/html/2605.23772#A1.SS0.SSS0.Px3.p1.2),[Appendix A](https://arxiv.org/html/2605.23772#A1.p1.1),[§1](https://arxiv.org/html/2605.23772#S1.SS0.SSS0.Px1.p1.1)\.
## Appendix APreliminary Experiments
In prior experiments, the authors evaluated a suite of automated provers on proof generation on theClever\[[30](https://arxiv.org/html/2605.23772#bib.bib12)\]andVerina\[[36](https://arxiv.org/html/2605.23772#bib.bib11)\]datasets, finding that agentic paradigms substantially outperformed all alternatives\. We summarise its findings here, both because it directly motivates the present work and because some of its observations reappear, refined, in our current results\.
#### Setup\.
This earlier study used a different experimental setup than the one of Section[2\.2](https://arxiv.org/html/2605.23772#S2.SS2)\. It focused exclusively on proof generation starting from ground\-truth specifications and implementations, a task outside the scope of either benchmark’s provided harness \(in Figure[1](https://arxiv.org/html/2605.23772#S1.F1), the setup illustrated in the dashed portion of the diagram in the top\-right corner\)\. For this reason it relied on a custom distillation of the two datasets rather than on their provided evaluation infrastructure: derivative versions \(denotedClever\-P,Verina\-P\) were obtained by reusing the ground\-truth specifications and implementations, with other variants distilled to meet requirements for specific provers; NL descriptions were not provided\. The agentic configurations were further built on Claude Opus 4\.5 and on an earlier release of*lean4\-skills*\[[11](https://arxiv.org/html/2605.23772#bib.bib17)\]that predated the current/autoprovecommand, relying instead on the/fill\-sorrycommand, and with three independent one\-shot attempts per problem\. The benchmark releases evaluated wereCleverv1\.4\.0 and a contemporaryVerinabuild, both of which have since been partially patched in light of the bugs reported by those experiments\.
#### Provers\.
The evaluation covered four families of provers\. The*agentic*family included Claude Code with the earlier version of*lean4\-skills*and*lean\-lsp\-mcp*\[[9](https://arxiv.org/html/2605.23772#bib.bib21)\], and the*Numina\-Lean\-Agent*\[[20](https://arxiv.org/html/2605.23772#bib.bib27),[26](https://arxiv.org/html/2605.23772#bib.bib28)\]\. The package*lean4\-skills*injects domain\-specific instructions, best practices, and workflow patterns into the model prompt, and bundles a wide array of proof\-development utilities, including experimental subagents for specific tasks; the testing relied specifically on its/fill\-sorrycommand \(provided by the*lean4\-theorem\-proving*plugin\), which systematically targets and resolves outstandingsorrygoals\. The companion MCP server*lean\-lsp\-mcp*interfaces with the Lean LSP and exposes search tools for relevant lemmas in the project context and Mathlib, enabling an iterative agentic loop in which the model inspects the current proof state, proposes tactics, and refines them based on compiler diagnostics\. The*Numina\-Lean\-Agent*is a wrapper around Claude Code that combines it with*Numina\-Lean\-MCP*\[[25](https://arxiv.org/html/2605.23772#bib.bib29)\], a fork of*lean\-lsp\-mcp*extended with semantic search and other functionality:*LeanDex*, a semantic search engine for theorems and definitions in standard Lean libraries; an*Informal Prover*\[[13](https://arxiv.org/html/2605.23772#bib.bib30)\]used to generate informal proof sketches; and a*Discussion Partner*tool for querying external LLMs for reasoning and planning\. It was run with the providedprompts/prompt\_complete\_file\.txtprompt and the settingmax\_rounds=3, allowing up to three continuations of the same Claude Code conversation before exiting\.
The remaining provers served as performance baselines:
- •Monte Carlo Graph Search: Aristotle\[[1](https://arxiv.org/html/2605.23772#bib.bib1)\], Harmonic’s high\-level, API\-accessible proof system, achieves state\-of\-the\-art performance on established advanced\-mathematics benchmarks \(most notably gold\-medal performance at IMO 2025\)\. It performs a proof\-state\-aware search over a hypertree of proving steps and tactics, together with an informal reasoning system generating lemmas and proof sketches, and a specialized geometry solver\.
- •Whole\-proof generation models: specialized models trained or fine\-tuned for formal verification, that interleave chain\-of\-thought reasoning with formal code generation\. This family was represented by Kimina\-Prover\-72B\[[34](https://arxiv.org/html/2605.23772#bib.bib5)\], the highest\-performing open model of its kind onMiniF2F, with DeepSeek\-Prover\-V2\[[27](https://arxiv.org/html/2605.23772#bib.bib4)\]and Goedel\-Prover\-V2\[[19](https://arxiv.org/html/2605.23772#bib.bib2)\]in the same class\.
- •Symbolic tactics: symbolic automated reasoning tools integrated directly into Lean as tactics, serving as a baseline for symbolic automation — Grind\[[32](https://arxiv.org/html/2605.23772#bib.bib15)\], LeanHammer\[[8](https://arxiv.org/html/2605.23772#bib.bib14)\], Aesop\[[17](https://arxiv.org/html/2605.23772#bib.bib7)\], and Canonical\[[24](https://arxiv.org/html/2605.23772#bib.bib6)\]\.
#### Benchmarks\.
As mentioned, the prior study used both theCleverandVerinabenchmarks\.Clever’s format and staged evaluation pipeline are described in Section[2\.1](https://arxiv.org/html/2605.23772#S2.SS1); of its 167 problems \(161 adapted fromHumanEvaland 6 sample examples\), only 49 carry the complete ground\-truth implementation that the proof\-generation distillation requires\. Beyond the baseClever\-P andVerina\-P distillations, further derivative variants were created to probe the sensitivity to problem formulation displayed by some provers:
- •*\-PU*, where an initial unfolding and simplification of the specification and implementation definitions \(inspired by existing solutions in each dataset\) is prepended to the proofs before they are passed to the provers;
- •*\-PM*, where Mathlib is always imported as a dependency \(required by Aristotle\);
- •*\-PF*, carrying the auto\-generated fixes to the wrong statements identified in the datasets\.
Verina\[[36](https://arxiv.org/html/2605.23772#bib.bib11)\]comprises 189 manually curated tasks split into two difficulty tiers:Verina\-A \(basic\), 108 tasks derived from existing human\-written Dafny benchmarks \(MBPP\-DFY, CloverBench\) and translated to Lean; andVerina\-B \(advanced\), 81 more complex algorithmic problems adapted from platforms such as LeetCode and from challenging datasets such as LiveCodeBench\. Ground\-truth specifications and implementations are provided for all entries, while proof\-generation solutions were supplied for only 46 of the basic entries, the remainder being left assorry; this allowed the whole dataset to be used in the proof\-generation evaluation\.
#### Results\.
Figure 4:Overview of prover performance on theCleverandVerinabenchmarks\. Solid bars indicate the number of tests solved on the verified\-correct portions of each dataset, while the patterned bar segments above show additional solved instances obtained after correcting erroneous benchmark entries\. Numbers annotated with a ‘\+’ denote the corresponding count of these additional solves\.The evaluation revealed a clear performance hierarchy, as displayed by Figure[4](https://arxiv.org/html/2605.23772#A1.F4): agentic systems based on Claude Code defined the state\-of\-the\-art, with Aristotle trailing by a small margin; Kimina\-Prover displayed much lower pass rates \(7/32 onClever\-P, 34/183 onVerina\-P\), while symbolic provers provided a stable but significantly lower baseline, consistent with their design as gap\-fillers between assertions rather than full proof generators\. They also displayed a marked sensitivity to problem formulation, and in particular to term unfolding, as reflected in the performance gap between the \-P and \-PU variants of both datasets\. Considering only the verified\-correct portions of the datasets, Numina\-Lean\-Agent solved all 32Clever\-P entries, while Claude Code with*lean4\-skills*and Aristotle solved 31 and 28 respectively; out of the 183 correctVerina\-P entries, the same three provers solved 159, 136 and 132 \(Table[4](https://arxiv.org/html/2605.23772#A1.T4)\)\. The high performance of the agentic systems was attributed largely to their direct access to compiler feedback through their search and tool\-use loop, the main documented difference in implementation between the various approaches; conversely, the whole\-proof\-generation models were tested in a single one\-shot attempt without successive refinement on compiler feedback, and their performance might improve substantially under iterative protocols\.
Table 4:Number of solved tests by each prover over the correct entries in the tested benchmarks\. Numbers in brackets represent the number of solved entries among the auto\-fixed ones\. The second column reports the benchmark variant on which each prover was evaluated \(see the*Benchmarks*paragraph above\)\.A second, less expected finding was that Claude Code and Aristotle consistently identified errors in the benchmarks themselves: Aristotle through its goal\-negation search mechanism, Claude Code through its NL replies \(sometimes already proposing fixes and certifying the fixed statements without being prompted to do so\)\. Cumulatively, the provers identified 17 erroneous entries inClever222Reported publicly on 18 December 2025, see[https://github\.com/trishullab/clever/issues/65](https://github.com/trishullab/clever/issues/65)\.and 6 inVerina333Reported publicly on 26 January 2026, see[https://github\.com/sunblaze\-ucb/verina/issues/16](https://github.com/sunblaze-ucb/verina/issues/16)\. A total of 16 wrong entries were found in the original version ofVerina, some of which were later fixed by the benchmark’s authors on 5 January 2026 \(commite9926ecbc6d203b8e2ca008b500ccda4d09a8c6e\); those errors had already been reported by Aristotle and were identified independently by us on 16 November 2025\.\. When subsequently given the NL descriptions and the previous proof attempts, Claude Code provided fixes for all of them and proved a substantial fraction of the corrected statements: Numina\-Lean\-Agent proved 15/17 and 5/6 of the fixedCleverandVerinastatements respectively, while Claude Code with*lean4\-skills*proved 9/17 and 2/6, and Aristotle 9/17 and 3/6\. Most of the fixes, broken down by kind in Table[5](https://arxiv.org/html/2605.23772#A1.T5), addressed wrong specification logic \(9 cases\), implementation bugs \(6\) and missing preconditions \(5\), with a few off\-by\-one and boundary errors \(3\)\.
Table 5:Number of fixes per benchmark by kind\.
## Appendix BResults Classification
### B\.1Status Flags
At completion of each task, Claude was prompted to assign a status flag to the result, according to the following criteria:
- •oktask completed, no sorry remains, code compiles;
- •failunable to complete, sorry remains or code does not compile;
- •issuetask is impossible due to bugs/incorrect assumptions, and rigorous arguments in support have been provided\.
For thespec\_isotask, theissueflag was replaced by the following two cases:
- •mismatchequivalence is unprovable because the specifications differ, yet both are valid interpretations of the NL description and test cases; rigorous arguments in support have been provided;
- •issueequivalence is unprovable because at least one specification is erroneous \(contradicts the NL description or test cases\); rigorous arguments in support have been provided\.
For theproof\_gentask, theissueflag was replaced by the following two cases:
- •mismatchequivalence is unprovable because the specification/implementation differ, yet both are valid interpretations of the NL description and test cases; rigorous arguments in support have been provided;
- •issueequivalence is unprovable because the specification, the implementation or both are erroneous \(contradicting the NL description or test cases\); rigorous arguments in support have been provided\.
### B\.2Taxonomy of issue and mismatch kinds
Table 6:Taxonomy of specificationissuetypes\.Table 7:Taxonomy of specificationmismatchtypes\.Table 8:Taxonomy of implementationissuetypes\.
## Appendix CDetailed Breakdown of Results
Figure 5:Output flag for each problem ID across the threespec\_isoruns \+ the final spec classification, and theproof\_genruns\. Color scheme corresponds to the ones in Figures[2](https://arxiv.org/html/2605.23772#S3.F2)and[3](https://arxiv.org/html/2605.23772#S3.F3), while one\-character annotations indicate: formismatch, specifications generated stronger \(\>\), weaker \(<\) than reference, or neither \(⋅\\cdot\); forissue, bugs originating in ground\-truth \(P\) and generated \(G\) specs, or both \(B\) inspec\_iso, and bugs originating in the specification \(S\), the implementation \(I\), both \(B\), or the fixed parts of the benchmark \(\*\) inproof\_gen\.### C\.1Multi\-specspec\_isoresults
Figure 6:Breakdown of results for the Multi\-specspec\_isorun\. Output flag for tested problems in the Multi\-specspec\_isorun: overall result and detailed per\-spec flagging\. Color scheme corresponds to the one in Figure[2](https://arxiv.org/html/2605.23772#S3.F2), while one\-character annotations indicate: formismatch, specifications generated stronger \(\>\), weaker \(<\) than reference, or neither \(⋅\\cdot\); forissue, bugs originating in ground\-truth \(P\) and generated \(G\) specs, or both \(B\)\.
### C\.2Runtimes and costs
Figure 7:Scatter plot of attempt costs \(USD\) vs\. runtimes \(s\), across all tasks and runs per task\. Markers colored according to the schemes of Figures[2](https://arxiv.org/html/2605.23772#S3.F2)and[3](https://arxiv.org/html/2605.23772#S3.F3)\. Labeled black triangles report the average cost and runtime per run\.
## Appendix DA Notable Example: Problem 32
Problem 32 deserves a separate look, both because it is arguably the hardest problem in the dataset and because it gathers in a single entry several of the phenomena observed throughout our evaluation\. The task asks to find a real root of an odd\-degree polynomial given by its list of coefficients\.
Already in the preliminary experiments \(Appendix[A](https://arxiv.org/html/2605.23772#A1)\), the provided ground\-truth implementation was found to be incorrect: it was based on the Newton–Raphson method, which is not globally convergent and can enter cycles even on simple polynomials\. Following insight from the agent, it was replaced with a bisection method initialized from the Cauchy root bound, with bisection fuel adaptive in the bit\-size of the coefficients, ensuring convergence to the target precision \(see Figures[8](https://arxiv.org/html/2605.23772#A4.F8)and[9](https://arxiv.org/html/2605.23772#A4.F9)\)\.
Figure 8:Fix ofCleverproblem 32 generated by Claude Code\.Here’s a summary of the fix and why it’s correct: Root cause:Newton\-Raphson is not globally convergent\. Even with a correct derivative, it can enter cycles \(as shown by\[2, \-2, 0, 1\]cycling between 0 and 1\)\. No starting point is safe for all polynomials\. Fix:Replaced Newton\-Raphson with thebisection method, which is globally convergent: 1\.findBound: Finds a boundb\(starting from 1, doubling up to 2100\{\}^\{\\text\{100\}\}\) such thatpoly\(b\)andpoly\(\-b\)have opposite signs\. This is guaranteed to exist for odd\-degree polynomials \(even\-length coefficient list\) with non\-zero leading coefficient \-\- exactly the docstring’s preconditions\. 2\.bisect: Standard bisection on\[\-bound, bound\]with 1,000,000 iterations\. Each step halves the interval containing the root\. After n steps, the interval has width2·bound / 2n\{\}^\{\\text\{n\}\}, giving astronomically precise convergence \-\- far more than needed foreps = 10\-6\{\}^\{\\text\{\-6\}\}\. Verification against test cases: \-\[1, 2\]\(1 \+ 2x\): bisect finds mid = \-0\.5 where poly = 0 exactly in 2 steps \-\[\-6, 11, \-6, 1\]\(x³\-6x²\+11x\-6\): bisect converges to a value near root 1 with \|poly\|≤\\leqeps
The fix has two changes to the implementation \(spec unchanged\): Fix 1 \- Cauchy bound initialization \(lines 20\-22\): Instead offindBound xs 1 100, we compute the Cauchy root bound1 \+∑\\sum\|ai\{\}\_\{\\text\{i\}\}\|/\|ai\{\}\_\{\\text\{i\}\}\|as the starting value\. This guarantees the initial bound is already past all roots, regardless of coefficient magnitude\. Fix 2 \- Adaptive bisection fuel \(lines 38\-39\): The bisection fuel scales with the total bit\-size of the coefficients \(coeffBits \* \(xs\.length \+ 1\) \* 40 \+ 1000000\)\. This ensures enough precision even for polynomials with large/steep coefficients\.
Figure 9:Extract of Claude’s replies explaining the edits it made in fixing problem 32\.At specification certification, no run produced agenerated\_specthat proved isomorphic to the ground\-truth one\. The reason was systematic: every generated specification required the implementation to return an*exact*root of the polynomial, whereas the ground\-truth specification only required an*approximate*root within a fixed tolerance\. Thespec\_genagent could not have inferred this from context, since neither the NL description nor the example usages mentioned approximation or any threshold for it\. The corresponding entries were flagged either asmismatchor asissueingenerated\_spec\.
At implementation certification, no run succeeded either\. Two factors compound: the intrinsic difficulty of the implementation \(global convergence of the root\-finding method, Newton–Raphson vs\. bisection, finite fuel\), already witnessed by the fix to the ground\-truth implementation; and the function signatures imposed by the benchmark, which require a rational return type to side\-step the fact thatRealis not a computable type in Lean \(as a developer comment in the entry itself acknowledges\)\. Any implementation is therefore incorrect on odd\-degree polynomials with no rational root, such asx3\+2x^\{3\}\+2\. Theproof\_genattempts were accordingly flagged asissuein either the implementation or the benchmark\-imposed signatures\. The same observation explains why logically correct specifications were flagged asissueatspec\_isotime: the agent recognized that any specification asking for an exact root must fail on polynomials without rational roots, regardless of the implementation — a trap from which the ground\-truth specification escapes precisely by asking for an approximate root\.
Problem 32 thus condenses several of the difficulties this kind of benchmark exposes: a literal reading of the NL description leads to a specification incompatible with the structure imposed by the benchmark; reconciling the two requires a creative workaround \(here, an approximation threshold\); and the implementation itself calls for non\-trivial mathematical insight\.Similar Articles
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver is a unified framework for agentic formal theorem proving in Lean 4 that iteratively improves proof generation through training with verified proofs and compiler feedback, achieving state-of-the-art results on multiple benchmarks.
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).
Claude Code: Best practices for agentic coding
This article outlines best practices for using Claude Code, an agentic coding environment by Anthropic. It emphasizes managing context windows, providing verification criteria for code, and separating exploration from execution to improve performance.
@logic_int: Aleph, our fully autonomous AI agent system for formal verification, aced all major theorem proving benchmarks includin…
Aleph, a fully autonomous AI agent system for formal verification, achieved top performance on major theorem proving benchmarks including PutnamBench, VeriSoftBench, and Verina.
Verifying Agentic Development at Scale (8 minute read)
Cognition's Ido Pesok shares lessons from building autonomous end-to-end testing into Devin, noting that for the first time, more Devin sessions are now triggered asynchronously than interactively, making verified-before-merge results a hard requirement rather than a nicety.