LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks

arXiv cs.AI Papers

Summary

LEAP is an agentic framework that enables general-purpose LLMs to achieve state-of-the-art performance in formal theorem proving in Lean, solving all 12 problems from the 2025 Putnam Competition and boosting formal solve rates from below 10% to 70% on a new benchmark (Lean-IMO-Bench), surpassing specialized systems.

arXiv:2606.03303v1 Announce Type: new Abstract: Large Language Models (LLMs) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean. We present LEAP, an agentic framework that enables general-purpose foundation models to achieve state-of-the-art performance on automated formal theorem proving. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self-refinement. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean-IMO-Bench, a benchmark of IMO-style problems formalized in Lean, with short statements yet highly non-routine and multi-step proofs across a wide range of difficulty levels. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models. On Lean-IMO-Bench, LEAP boosts the one-shot formal solve rate of general-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold-medal-caliber IMO system. Furthermore, we demonstrate LEAP's research-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth's Hamiltonian decomposition of even-order Cayley graphs.
Original Article
View Cached Full Text

Cached at: 06/03/26, 09:43 AM

# LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
Source: [https://arxiv.org/html/2606.03303](https://arxiv.org/html/2606.03303)
\\uselogo\\correspondingauthor

ponienkung@google\.com\\reportnumber

Linfeng SongGoogle CloudDawsen HwangGoogle DeepMindJinsung YoonGoogle Cloud AI ResearchChun\-Liang LiGoogle Cloud AI ResearchSimone SeveriniGoogle CloudMirek OlšákGoogle DeepMindEdward LockhartGoogle DeepMindQuoc V LeGoogle DeepMindBurak GokturkGoogle Cloud AI ResearchThang LuongGoogle DeepMindTomas PfisterGoogle Cloud AI ResearchNanyun PengGoogle Cloud AI Research

###### Abstract

Large Language Models \(LLMs\) exhibit strong informal mathematical reasoning but struggle to generate mechanically verifiable proofs in formal languages like Lean\. We present LEAP \(LLM\-in\-Lean Environment Agentic Prover\), an agentic framework that enables general\-purpose foundation models to achieve state\-of\-the\-art performance on automated formal theorem proving\. LEAP leverages foundation model capabilities, such as informal reasoning, instruction following, and iterative self\-refinement\. By decomposing complex problems into smaller units, the system bridges formal proof construction with informal blueprints through continuous interaction with the Lean compiler\. To provide a rigorous evaluation beyond increasingly saturated benchmarks, we introduce Lean\-IMO\-Bench, a benchmark of IMO\-style problems formalized in Lean, with short statements yet highly non\-routine and multi\-step proofs across a wide range of difficulty levels\. Empirically, on the latest 2025 Putnam Competition, an annual mathematics competition for undergraduate students in North America, LEAP solves all 12 problems, matching recent breakthroughs by frontier formal mathematical models; On Lean\-IMO\-Bench, LEAP boosts the one\-shot formal solve rate of general\-purpose LLMs from below 10% to 70%, notably surpassing the 48% benchmark set by a specialized, gold\-medal\-caliber IMO system\. Furthermore, we demonstrate LEAP’s research\-level utility by autonomously formalizing complex proofs for open combinatorial challenges, including a verified proof for a key subproblem in Knuth’s Hamiltonian decomposition of even\-order Cayley graphs\.

## 1Introduction

Large Language Models \(LLMs\) have made impressive progress on mathematical reasoning with natural language, also known as “informal math reasoning”, demonstrating strong capabilities in complex reasoning and problem\-solving for both math competitions and research level maths\[huang2025winning,luong2025geminiimo,feng2026towards,feng2026aletheia,feng2026semi\]\. However, as discussed in recent works like Hilbert\[hilbert\]and Goedel\-Prover\-V2\[goedel\_prover\_v2\], solutions in natural language frequently suffer from logical fallacies and hallucinations, and they are hard to automatically verify\. This difficulty in verification is not merely a limitation of automated systems; even for human mathematicians, verifying complex proofs is a notoriously time\-consuming process requiring scarce expert labor\[greiffenhagen2024checking\]\. A famous example is the proof of the Kepler conjecture\[hales2005proof\], which required four years of peer review before the referees could only claim to be “99% certain” of its correctness,111[https://en\.wikipedia\.org/wiki/Kepler\_conjecture](https://en.wikipedia.org/wiki/Kepler_conjecture)eventually necessitating a decade\-long formal verification effort\[hales2017formal\]\. This verification bottleneck underscores that assessing correctness is a hard task in natural language, motivating the exploration of formal mathematics, where proofs are written in a machine\-checkable language and verified by a rigorous kernel like in Lean\[moura2021lean\], Isabelle\[nipkow2002isabelle\], Coq\[huet1997coq\], HOL Light\[harrison2009hol\], offers automated verification with guaranteed accuracy\. Yet, bridging the gap to formal theorem proving remains a significant challenge, and the performance of automated formal provers currently lags substantially behind that of general\-purpose LLMs operating in natural language\.

To bridge this gap, recent efforts in the research community predominantly fine\-tunes specialized prover models \(e\.g\., AlphaProof\[alphaproof\], DeepSeek Prover V2\[deepseek\_prover\_v2\], Seed Prover\[seed\_prover\], Goedel Prover V2\[goedel\_prover\_v2\]\) on formal corpora, with the assumption that general LLMs are ineffective for rigorous formal tasks without specialization\. Indeed, according to the FormalProofBench\[formalproofbench\]222The paper is associated with a private dataset with a live leaderboard[https://www\.vals\.ai/benchmarks/proof\_bench](https://www.vals.ai/benchmarks/proof_bench)\. We contacted them several times to participate on the leaderboard without receiving a response\.and TaoBench\[taobench\], general LLMs often underperform compared to specialized prover models\.

While some recent works explored agentic or inference\-time search, they still depend on specialized models\. For instance, Hilbert\[hilbert\], AlphaProofNexus\[sketchprover\], Aristotle\[aristotle\], and Seed Prover V1\.5\[seed\_prover\_v1\_5\]use general LLMs for informal reasoning but rely on specialized models for Lean proving steps\. Axiom333[https://github\.com/AxiomMath/putnam2025](https://github.com/AxiomMath/putnam2025)and Numina444[https://github\.com/project\-numina/Numina\-Putnam2025](https://github.com/project-numina/Numina-Putnam2025)claim strong results on Putnam 2025 while remained closed source without public access, making them scientifically unverifiable\.

In this paper, we show that while general LLMs remain limited in one\-shot theorem proving, the bottleneck is not language comprehension but generating long, complex, correct proofs in one attempt\. General LLMs offer complementary skills to specialized models: strong informal reasoning, instruction following, tool use, and self\-refinement\. These make them ideal for agentic ATP frameworks, where proof construction is decomposed and iteratively improved\. To this end, we introduceLEAP \(LLM\-in\-Lean Environment Agentic Prover\), an agentic framework using*only*general LLMs for formal math\. Inspired by human workflow, LEAP generates a high\-level blueprint forming a directed acyclic graph \(DAG\), then generates the Lean proof, iteratively correcting errors via compiler feedback\.

To evaluate progress beyond saturated benchmarks such as MiniF2F\[zheng2022minif2f\]and PutnamBench\[putnambench\], we introduce Lean\-IMO\-Bench, formalizing the challenging informal math benchmark IMO\-Bench\[imo\-bench\]problem statements into Lean\. In contrast to existing benchmarks, which either focus on shorter problems or emphasize broad undergraduate coverage, Lean\-IMO\-Bench targets the complementary regime of elementary statements whose solutions often hinge on highly non\-routine insights and unfold through long, multi\-step, and structurally intricate proofs, providing a sharper test of formal theorem proving\.

Empirically, on the latest 2025 Putnam Competition, a challenging annual undergraduate mathematics competition in North America whose 2025 top score was 110 out of 120 while the median was only 2, LEAP solves all 12 problems in Lean, achieving perfect performance\. This matches recent breakthrough results from frontier formal mathematical reasoning models such as Axiom[3](https://arxiv.org/html/2606.03303#footnote3)and Numina[4](https://arxiv.org/html/2606.03303#footnote4)\. On Lean\-IMO\-Bench, LEAP substantially improves general LLMs’ solve rate from under 10% to 70%, surpassing specialized ATP models \(5%\) and Aristotle \(48%\), a strong system with specialized ATP components that earned the score for Gold medal at IMO 2025\. The contribution of the paper is three\-fold:

- •Workflow\-Inspired Agentic DesignWe introduce LEAP, an agentic framework that codifies the human mathematical workflow – combining high\-level blueprint sketching with low\-level formal proof generation and iterative compiler feedback\. Crucially, LEAP demonstrates that state\-of\-the\-art formal theorem proving can be achieved using*only*general\-purpose LLMs, challenging the belief that specialized fine\-tuning is indispensable\.
- •Lean\-IMO\-Bench Dataset:To evaluate progress beyond saturated benchmarks such as MiniF2F and PutnamBench, we introduce Lean\-IMO\-Bench, a new challenging dataset that translates informal problem statements from IMO\-Bench into formal Lean statements\. Resources are available at[https://imobench\.github\.io](https://imobench.github.io/)\.
- •Strong Empirical Results and Insights:LEAP solves all 12 problems on Putnam 2025 and achieves a large improvement over prior baselines on Lean\-IMO\-Bench\. Our analysis suggests that the primary bottleneck in formal mathematics for general\-purpose LLMs is not formal language comprehension alone, but the lack of structured, iterative interaction with the proof environment\. The Lean solutions generated by LEAP are available at[https://github\.com/google\-deepmind/superhuman/leap](https://github.com/google-deepmind/superhuman/leap)\.

## 2LEAP: Blueprint\-Driven Automated Theorem Proving

### 2\.1Formalizing Proofs with Blueprints

Formalizing mathematical proofs is rarely a one\-shot task: it requires a structured plan for progressively translating a high\-level argument into Lean\. To manage this complexity, recent formalization efforts often use the Lean Blueprint tool555[https://github\.com/PatrickMassot/leanblueprint](https://github.com/PatrickMassot/leanblueprint), which let mathematicians write a human\-readable proof roadmap linked to Lean code and visualized as a directed acyclic graph \(DAG\), where each node represents a proof obligation\. This workflow has been instrumental in coordinating large\-scale projects such as the formalization roadmap for Fermat’s Last Theorem666[https://leanprover\-community\.github\.io/blog/posts/FLT\-announcement/](https://leanprover-community.github.io/blog/posts/FLT-announcement/), where a multi\-year proof effort is organized through an explicit dependency graph\.

Inspired by this workflow, we introduce LEAP, an agent for automated theorem proving with hierarchical decomposition and planning\. Rather than synthesizing a complete proof in a single pass, LEAP incrementally drafts blueprints, decomposes Lean goals into supporting lemmas, and maintains the evolving proof plan as an AND\-OR DAG\.

![Refer to caption](https://arxiv.org/html/2606.03303v1/x1.png)Figure 1:LEAP workflow\.LEAP first attempts direct formalization with compiler\-feedback revision and LeanSearch\[leansearch\]retrieval\. If this fails, it generates an informal blueprint and formal proof sketch, adding verified subgoals back to the DAG only when dependencies remain acyclic\.
### 2\.2Overview

[Figure 1](https://arxiv.org/html/2606.03303#S2.F1)illustrates the workflow of LEAP\. Given an input theorem, LEAP registers its Lean statement as the rootgoal,777Agoalis any theorem or lemma statement to be proved; decomposition introducessubgoals\. See[Appendix C](https://arxiv.org/html/2606.03303#A3)\.represented as an OR node in the AND\-OR DAG\. To process an open goal, astate readerretrieves its statement, dependencies, and related lemmas\. LEAP then first attempts adirect proofby generating an informal proof, translating it into Lean code, and checking the candidate with the Lean compiler\.

If direct proving fails, LEAP shifts todecomposition\. It first drafts an informal blueprint that proposes intermediate lemmas, then translates the blueprint into a Lean proof sketch\. The sketch proves the current goal assuming only the proposed lemmas: the main theorem body issorry\-free, whilesorryplaceholders are permitted only in the newly proposed lemma statements\. If the sketch is accepted by the Lean compiler, it is added as an AND node, and the proposed lemmas are added as child OR nodes\. This ensures that once all child subgoals are proved, the parent goal is also proved\. The verified sketch is then passed to thestate writer, which checks that the update preserves acyclicity before committing it to the DAG\. The agent then recursively processes the newly created subgoals\.

This workflow relies on three tightly coupled design choices:DAG\-based hierarchical memoization, which preserves progress and reuses lemmas across branches;interleaved informal\-formal planning, which connects natural\-language strategies with executable Lean code; andverification\-guided proof search, which uses compiler feedback and LLM\-based review to accept, revise, decompose, or abandon candidate branches\.

### 2\.3Hierarchical Memoization via DAG

LEAP uses the AND\-OR DAG not only to record proof progress, but also to structure hierarchical memoization\. OR nodes represent open goals or lemma statements, each of which may be resolved by any valid proof strategy, while AND nodes represent candidate decompositions whose success depends on proving all constituent subgoals\.[Figure 2](https://arxiv.org/html/2606.03303#S2.F2)illustrates this structure\.

The DAG provides two central advantages\. First,monotone refinement: once a goal is decomposed into supporting subgoals, subsequent search can focus on expanding and resolving these descendants without restructuring the established dependency order\. This separates local proof exploration from global proof organization: individual proof attempts may be revised, expanded, or abandoned, while the DAG preserves the stable dependency structure of the overall proof plan\. Second,lemma memoization: intermediate lemma statements are stored as shared proof nodes and can be reused whenever the same subproblem arises in different branches\. This also supportsanticipatory lemma planning: during blueprint generation, LEAP may propose auxiliary lemma statements that are not immediately required by the current sketch but could support later proof steps\. Such prospective lemmas remain available in the graph memory without being necessary for resolving the current AND node\. Together, these properties allow independent proof plans to converge on common dependencies while reducing redundant derivations\.

The resulting dependency structure also improves transparency: it exposes which goals remain open, which lemmas have been resolved, and which nodes block downstream progress\. This helps LEAP identify where additional lemmas, revised decompositions, or stronger assumptions may be needed, while providing an interpretable blueprint\-style workspace for human\-AI collaboration\.

### 2\.4Interleaved Informal–Formal Planning

As shown in[Figure 1](https://arxiv.org/html/2606.03303#S2.F1), both the direct proof path and the blueprint decomposition path in LEAP pass through an informal proof sketch\. This reflects the complementary strengths of LLMs and Lean: LLMs are effective at informal reasoning, strategy generation, and refinement, while Lean provides strict machine\-checkable verification\.

In direct proving, LEAP first generates an informal argument for the current goal before translating it into a candidate Lean proof\. In decomposition, it drafts an informal blueprint explaining how the goal can be reduced to supporting subgoals, then converts this plan into a Lean sketch that records the proposed dependencies\. In both cases, the informal sketch provides a planning space before formalization, making proof construction less brittle than direct code generation alone \(see[Appendix C](https://arxiv.org/html/2606.03303#A3)for examples of informal proofs and blueprints\)\.

This interleaving also makes proof progress more interpretable: each formal attempt is paired with an informal rationale, allowing users to inspect why a proof step or decomposition was proposed rather than only reading Lean code or compiler feedback\.

### 2\.5Verification\-Guided Proof Search

As shown in[Figure 1](https://arxiv.org/html/2606.03303#S2.F1), LEAP uses verification at two levels\. First, the Lean compiler formally checks candidate proofs and sketches, ensuring that accepted code is syntactically valid and type\-correct\. For proof sketches, LEAP only permitssorryplaceholders for the proposed subgoals \(lemmas\)\. This preserves the AND\-OR semantics of the proof DAG: once all referenced subgoals are proved, the parent goal is also proved\. Second, after a blueprint proposes new subgoals, an LLM reviewer assesses the quality of the decomposition: whether the subgoals are relevant to the parent goal, make the problem easier, and offer a plausible route to completing the proof\. This planning\-level review is crucial for complex goals, where a Lean sketch can be syntactically valid while introducing subgoals that are ill\-posed or no simpler than the original statement\. Without this filter, the agent may repeatedly expand weak blueprints, spending search budget on branches that do not make real progress\. We study this failure mode through an ablation without the LLM reviewer in[subsection 5\.3](https://arxiv.org/html/2606.03303#S5.SS3)\.

The LLM reviewer therefore acts as a search filter: it identifies unpromising decompositions, triggers backtracking, and encourages exploration of alternative strategies\. Currently, LEAP uses a simple DFS over the DAG with backtracking\. The effectiveness of this reviewer suggests a broader future direction: LLMs may also serve as heuristic evaluators for guiding search in formal proof spaces\.

![Refer to caption](https://arxiv.org/html/2606.03303v1/figures/dag_example.png)Figure 2:DAG example for Putnam 2025 Problem A6\.LEAP decomposes the theorem into a proof sketch and supporting lemmas\. Throughanticipatory lemma planning, the agent may also propose auxiliary lemma statements that are not immediately required but could be useful later; these are shown with dashed edges and are not needed to prove the main theorem\. Green nodes are proven nodes, and brown blocks denote definitions, structures, or variables introduced at a node\.

## 3Lean\-IMO\-Bench: Formalizing IMO Problems in Lean

Table 1:Baseline performance onLean\-IMO\-Benchacross three evaluation tasks\. Natural Language Proof performance is based on human expert review\.### 3\.1Lean\-IMO\-Bench

We introduceLean\-IMO\-Bench, a curated collection of 60 problems building upon the foundational work ofimo\-bench\.imo\-benchintroducedIMO\-ProofBench, a rigorous suite vetted by an expert panel of mathematicians and IMO medalists\. The benchmark contains 60 problems split evenly into aBasicset and anAdvancedset of 30 problems each\. TheBasicset spans pre\-IMO to IMO\-Medium difficulty and includes 8 algebra, 8 combinatorics, 8 number theory, and 6 geometry problems\. TheAdvancedset includes novel problems up to IMO\-Hard difficulty, with 8 algebra, 8 combinatorics, 6 number theory, and 8 geometry problems\. Overall, the benchmark is approximately balanced across algebra, combinatorics, geometry, and number theory\.

To ensure the highest level of accuracy inLean\-IMO\-Bench, Lean experts manually formalized and verified all 60 problem statements\. Because these problems are at the IMO level, the required mathematical background is elementary\. Consequently, we expect the corresponding Lean solutions to be concise, deliberately removing the overhead associated with formalizing complex, modern mathematical theories\.

The dataset can be used to evaluate models across three distinct tasks:Natural Language Proof,Formal Theorem Proving, andFormal Proof Translation, while we focus on formal theorem proving in this paper\. The baseline performance onLean\-IMO\-Benchis summarized in[Table 1](https://arxiv.org/html/2606.03303#S3.T1)\. For the Natural Language Proof task, we citeimo\-benchas reference: Gemini 2\.5 Pro shows strong informal reasoning performance\. However, as shown in[Table 1](https://arxiv.org/html/2606.03303#S3.T1), this does not directly translate to formal theorem proving: Gemini 3\.1 Pro performs substantially worse on Formal Theorem Proving, especially on the Advanced set\. Providing a correct informal proof in the Formal Proof Translation task also yields little improvement, with Pass@128 unchanged and only a marginal gain in Average@128\.

[Table 1](https://arxiv.org/html/2606.03303#S3.T1)demonstrates a stark gap in the models’ Lean capabilities\. Because the model can already successfully solve these problems in natural language, mathematical reasoning is not the bottleneck, thus reliably generating valid Lean code remains the primary challenge\.

## 4Experimental Results

We evaluate LEAP with Gemini\-3\.1\-pro as the backend large language model and compare it against four baselines:Gemini\-3\.1\-pro, which tests one\-shot proof generation by a strong general\-purpose model;Goedel\-Prover\-V2\-32B\[goedel\_prover\_v2\], a state\-of\-the\-art open\-source ATP model specialized for Lean;Hilbert\[hilbert\], an agentic Lean formalization framework that combines Goedel\-Prover\-V2\-32B with Gemini\-3\.1\-pro; andAristotle\[aristotle\], a specialized automated theorem\-proving system with dedicated ATP components that achieved gold\-medal\-level performance at the 2025 IMO\.

We evaluate formal proving ability on two datasets:Putnam 2025and our proposedLean\-IMO\-Bench\. Putnam 2025 contains twelve undergraduate\-level problems from the 86th William Lowell Putnam Mathematical Competition,888Mathematical Association of America,[*Results of the 86th William Lowell Putnam Mathematical Competition*](https://maa.org/news/results-of-the-86th-william-lowell-putnam-mathematical-competition/)\.a highly challenging North American mathematics competition\. In the 2025 competition, the top score was 110 out of 120, the average score was approximately 10, and the median score was 2\.

### 4\.1Results on Putnam 2025

Table[2](https://arxiv.org/html/2606.03303#S4.T2)presents the evaluation results on the Putnam 2025 benchmark\. Under a Pass@128 setting, the direct formalization baselines \(Gemini\-3\.1\-pro and Goedel\-Prover\-V2\-32B\) fail to solve any problems, indicating that single\-pass generation is insufficient for the logical complexity of this dataset\.

Table 2:Putnam 2025 results\. Green checkmarks \(✓\) indicate successfully solved problems, while red crosses \(×\\times\) indicate failures\. Evaluation settings:⋄indicates pass@128, while†indicates rollout=2\.The open\-source agentic framework Hilbert improves upon direct generation, solving 4 out of 12 problems\. However, during evaluation, we observed that Hilbert’s recursive search design leads to an exponential time complexity of𝒪​\(\(n⋅b\)d\)\\mathcal\{O\}\(\(n\\cdot b\)^\{d\}\), wherennis the number of lemma retries,bbis the average branching factor, andd=10d=10is the maximum proof depth\. Due to the high volume of redundant LLM calls required by this approach, we bounded each Hilbert rollout to a 7\-day time limit\. For context against state\-of\-the\-art proprietary systems, we also report the performance of Aristotle\. While the system is closed\-source, it serves as a strong baseline, solving 9 out of 12 problems given two rollouts\.999An[unofficial report](https://www.reddit.com/r/mlscaling/comments/1pjnccr/aristotle_smashes_putnam_by_solving_formally/)indicates Aristotle solved 10 out of 12 problems on this benchmark; however, neither that reported run nor our evaluation successfully solved problem A5\.

LEAP successfully solves all 12 Putnam 2025 problems, improving the benchmark solve rate from 0% via direct formalization to 100% with our agentic framework\. This performance directly results from the LEAP’s blueprint\-inspired AND\-OR DAG architecture, which resolves the search bottlenecks observed in standard recursive frameworks like Hilbert\. By supporting hierarchical memoization, LEAP allows independent proof branches to reuse shared intermediate lemmas, significantly mitigating exponential search complexity and allow LEAP to solve problems efficiently\. For a detailed, problem\-level breakdown of the computational cost and search efficiency required to achieve these results, See Table[3](https://arxiv.org/html/2606.03303#S4.T3)for runtime and efficiency statistics\.

Table 3:Runtime and search efficiency of LEAP on Putnam 2025\.For each problem, we report the computational cost \(total LLM calls for a verified proof\), the search space explored \(active DAG nodes/lemmas\), and the final Lean proof line count\.
### 4\.2Results on Lean\-IMO\-Bench

Table[4](https://arxiv.org/html/2606.03303#S4.T4)presents evaluation results on Lean\-IMO\-Bench\. We include this dataset to test model robustness across a broader spectrum of mathematical disciplines and distinct complexity tiers, providing a complementary challenge to the Putnam benchmark\.

Direct formalization baselines \(Gemini\-3\.1\-Pro and Goedel\-Prover\-V2\-32B\) and the open\-source Hilbert framework struggle significantly on this dataset, exhibiting severe performance degradation on the Advanced set\. While the proprietary Aristotle system resolves a majority of the Basic problems, its effectiveness drops sharply as complexity increases\. Notably, across all evaluated methods, performance in the Geometry category remains near zero\. This aligns with the well\-established difficulty of formalizing olympiad\-level geometry in Lean without the aid of supplementary, domain\-specific frameworks\. We retain this category strictly to evaluate general\-purpose reasoning under extreme formalization constraints\.

Against these baselines, LEAP achieves the highest overall solve rates, scoring 83\.3% on the Basic set and 56\.7% on the Advanced set\. By effectively leveraging its DAG\-based architecture, LEAP demonstrates strong domain generalization, maintaining a 100% solve rate in both Algebra and Number Theory regardless of the difficulty tier\.

Table 4:Results onLean\-IMO\-Bench\. We report the solve rate \(%\) across different mathematical categories for theBasicandAdvancedsets separately\. Evaluation settings:⋄indicates pass@128, while†indicates rollout=2\. Best results in each section are in bold\.

## 5Discussion

### 5\.1Beyond One\-Shot Formalization

A central motivation of LEAP is that general foundation models can be effective iterative formalizers, even when they are not specialized Lean provers\. While specialized provers are trained for formal proof synthesis, general models offer complementary capabilities such as instruction following, long\-context reasoning, informal planning, tool use, and feedback\-based revision\.

To isolate this effect, we evaluate theDirect Formalizationcomponent labeled in[Figure 1](https://arxiv.org/html/2606.03303#S2.F1)under two settings\. In the one\-shot setting, each model is evaluated with Pass@128 over independently sampled proof attempts\. In the iterative setting, each model receives a single initial attempt and up to 20 compiler\-feedback revision steps, yielding a Pass@1 result under a smaller sampling budget\. As shown in[Table 5](https://arxiv.org/html/2606.03303#S5.T5), Goedel\-Prover\-V2\-32B does not benefit from this feedback loop, while Gemini\-3\.1\-pro improves substantially from20\.0%20\.0\\%to36\.6%36\.6\\%\.

This suggests that iterative formalization depends on capabilities beyond local Lean proof synthesis\. Interpreting compiler errors, maintaining context, and revising proof attempts over multiple steps can be as important as one\-shot formal proving accuracy\. These results support using a general foundation model as the reasoning backbone of LEAP, while leaving open the possibility of combining it with specialized provers for local proof generation\.

Table 5:One\-shotvs\.iterative formalizationperformance on Lean\-IMO\-Bench Basic set\.
### 5\.2Effect of DAG\-Based Memoization

LEAP maintains proof progress as a DAG\-based memory rather than an isolated decomposition tree\. This allows intermediate lemmas to be stored as shared nodes and reused across branches, while exposing graph context such as existing goals, dependencies, and previously proposed lemmas\.

To evaluate this design, we compare LEAP with a tree\-structured variant that follows the same workflow but removes global lemma sharing\. As shown in[Table 6](https://arxiv.org/html/2606.03303#S5.T6), the tree variant already substantially outperforms Hilbert\[hilbert\], which achieves 36\.6% and 6\.6% on the Basic and Advanced sets, respectively \([Table 4](https://arxiv.org/html/2606.03303#S4.T4)\)\. This indicates that interleaved informal–formal planning and verification\-guided search are effective even without DAG\-based memoization\. The full DAG version further improves performance from 73\.3% to 83\.3% on the Basic set and from 40\.0% to 56\.7% on the Advanced set, showing the benefit of global proof memory\.

The improvement is especially pronounced on harder categories, such as Advanced Algebra and Advanced Number Theory, where shared lemmas and graph context are more likely to matter\. We attribute this gain to two effects\. First, the DAG supports anticipatory lemma planning: auxiliary lemmas proposed at higher\-level nodes can later be reused by downstream subgoals \([Figure 2](https://arxiv.org/html/2606.03303#S2.F2)\)\. Second, repeated subproblems can be shared across branches, avoiding the need to rediscover or reprove the same lemma multiple times\. Together, these properties reduce redundant derivations and improve proof search efficiency\.

Table 6:DAG memoization ablation\.Solve rate \(%\) by category on Lean\-IMO\-Bench Basic \(B\)/Advanced \(A\) sets\.
### 5\.3Toward LLM\-Guided Proof Search

Compiler verification checks whether a proof sketch is formally well\-typed, but not whether its decomposition is useful\. A sketch may prove the parent goal from proposed lemmas that are unhelpful, overly difficult, or nearly equivalent to the original goal\. In LEAP, the LLM reviewer acts as a local search heuristic: it filters candidate decompositions by judging whether they meaningfully simplify the parent goal before they are committed to the DAG\.

We focus this ablation on Putnam 2025 Problem A5 because it is one of the most challenging cases in our evaluation, requiring the longest runtime and two rollouts for LEAP to formalize the proof successfully\. Removing the LLM\-based decomposition reviewer causes the agent to fail even after eight rollout attempts\. This contrast suggests that local LLM review provides a useful search signal: it rejects weak decompositions early, triggers backtracking, and prevents the agent from spending rollouts on branches that do not make substantive progress\. We further inspect decomposition traces from the ablated setting; a representative failure case is shown in[Figure 3](https://arxiv.org/html/2606.03303#S5.F3)\. The decomposition is formally admissible but does not simplify the mathematical state\. The agent first unfolds the definitions in the grandparent goal to create an intermediate lemma, then folds them back into a proposed subgoal that is syntactically identical to the original statement\. Without semantic review, this duplicate lemma is treated as a new step, causing the agent to repeat the same unproductive decomposition until its search budget is exhausted\. This failure highlights the potential of LLM\-guided proof search: a reviewer can assess whether a proposed lemma actually advances the proof, prune cyclic or non\-simplifying branches, and direct compute toward more promising paths\.

Figure 3:Unproductive decomposition without LLM review\.The proposed subgoal restates the grandparent goal, so the decomposition is formally admissible but does not simplify proof search\.
### 5\.4Perspective: General LLMs as Formal Provers: From Zero to Hero

As demonstrated by LEAP, the seemingly insurmountable gap between the poor one\-shot theorem proving performance of general LLMs and state\-of\-the\-art results can be effectively bridged by a well\-designed agentic framework\. By shifting the paradigm away from relying solely on small specialized LLMs, we show that the extensive knowledge, instruction following, and self\-correction capabilities of foundation models are more than sufficient\. When scaffolded correctly, these foundation models can progress from near zero formal math performance to solving highly complex problems\.

While small specialized LLMs lack the overarching agentic capabilities of their foundation counterparts, we acknowledge that they still hold value\. A hybrid architecture combining the high\-level, structural reasoning of a foundation model with the focused, formal step generation of a fine\-tuned specialized model could be a highly effective design pattern\. However, exploring this hybrid approach remains outside the scope of this paper, as our primary goal is to highlight the standalone power of general purpose LLMs in an agentic workflow\.

## 6Case Studies: Formalizing Open Problems in Combinatorics

#### Hamiltonian Decomposition of Directed Cayley Graphs\.

To evaluate LEAP on a highly complex mathematical task, we targeted a recently solved open problem in combinatorics: the Hamiltonian decomposition of the directed Cayley graphΓm=C​a​y​\(ℤm3,\{e1,e2,e3\}\)\\Gamma\_\{m\}=Cay\(\\mathbb\{Z\}\_\{m\}^\{3\},\\\{e\_\{1\},e\_\{2\},e\_\{3\}\\\}\)for evenmm\. Originally posed by Donald Knuth, the problem asks whether the graph’s directed arcs can be partitioned into exactly three distinct, spanning Hamiltonian cycles \. The informal mathematical proof for the even\-case construction is exceptionally intricate, relying on heavy combinatorial analysis and localized defect routing across different layers of the graph\. We focused our formalization efforts on a critical subproblem: rigorously verifying that the 2D planar projection of a single color class’s routing dynamics forms an unbroken mathematical cycle of lengthm2m^\{2\}\. The informal arguments for this specific dynamic span roughly 20 pages of dense piecewise maps, parity\-dependent intervals, and complex cross\-row transitions\. To tackle a formal verification of this magnitude, we deployed LEAP, which successfully decomposed the monolithic informal proof into a granular, highly structured proof graph\. By autonomously and systematically resolving the interdependent nodes of this graph, LEAP managed to fully verify the complex cycle\-merging dynamics, ultimately synthesizing over 5000 lines of rigorous Lean 4 code to complete the formal proof for this subproblem\. Full problem descriptions and informal proofs are available at[https://github\.com/dpwoodru/knuthCycles/tree/main](https://github.com/dpwoodru/knuthCycles/tree/main)\.

#### Formalizing Erdős Problem 457\.

We further tested LEAP on Erdős Problem 457, a classic graph theory problem concerning the density of triangle\-free graphs\. Although this problem is already resolved, it served as an ideal benchmark to assess LEAP’s ability to autonomously reconstruct and verify established mathematical results\. Tasked with deriving the known proof from first principles in Lean 4, LEAP effectively navigated the combinatorial constraints to confirm the theorem’s validity\. This successful reproduction demonstrates LEAP’s capability to reliably translate complex, existing literature into high\-assurance formal proofs without human intervention\.

Formal statements and detailed problem descriptions are provided in[Appendix B](https://arxiv.org/html/2606.03303#A2)\.

## 7Conclusion and Future Work

The success of LEAP suggests that modern general\-purpose LLMs possess substantial reasoning capabilities for rigorous domain\-specific tasks, provided they are coupled with appropriate structural scaffolding\. In formal mathematics, this scaffolding naturally takes the form of proof decomposition and verifier\-guided refinement: the model decomposes complex theorems into smaller subgoals, while the Lean compiler checks each formal step\. This design provides a structured mechanism for translating informal reasoning into mechanically verified proofs\. A central challenge for future work is how to navigate the resulting proof trees efficiently\. As decomposition produces increasingly fine\-grained subgoals, the search space can grow rapidly\. Future systems should therefore improve branch prioritization, decomposition strategies, and compute allocation across large proof searches\. Such advances will be critical for scaling agentic formal proving systems to more complex mathematical problems\.

## Acknowledgements

We thank Michael P\. Brenner, Honghao Lin, David Woodruff, Vahab Mirrokni for providing the informal proof of the even\-case of Knuth’s Cycles problem\. We would also like to thank Ashley Aragorn Khoo, Paul Lezeau, Calle Sönne, and Moritz Firsching for formalizing the Lean problem statements in Lean\-IMO\-Bench\.

## References

## Appendix ARelated Work

#### Neural Theorem Proving

Early work in neural theorem proving mainly utilize in\-house symbolic engines, such as Metamath\[polu2020generative\], MM0\[carneiro2019metamath\]or some dedicated formal language for geometry problems\[lu2021inter\]\. Later work such as mathlib\[mathlib\], LeanDojo\[leandojo\]and MiniF2F\[zheng2022minif2f\]pioneered the use of LLMs for generative theorem proving in Lean\. They serve as pillar that provide a rich library of known theorems, an interactive environment for step\-level search and a descent\-level of difficulty evaluation set\. To manage the large search space, HyperTree Proof Search\[hypertree\]and related Monte Carlo tree search methods\[lin2024lean,xin2025bfs\]have been explored\. While search\-based methods operate at the tactic level, Baldur\[baldur\]and DeepSeek\-prover\-v1\.5\[xin2024deepseek\]explored whole\-proof generation, attempting to produce a complete proof in a single step\. Another promising direction is guiding formal proof search with informal proofs or sketches\. The “draft, sketch, and prove” methodology\[draftsketchprove\]demonstrated that using an informal proof as a blueprint can significantly guide and improve formal theorem proving\. Our work, LEAP, builds on this intuition by utilizing general LLMs to generate informal blueprints and iteratively refine formal proofs based on compiler feedback, but without relying on specialized fine\-tuned models for the formalization step\.

#### Specialized Prover Models

Recent breakthroughs have often relied on extensive fine\-tuning of large models on formal mathematical corpora\. Representative works include AlphaProof\[alphaproof\], DeepSeek Prover V2\[deepseek\_prover\_v2\], Seed Prover\[seed\_prover\], Kimina Prover\[wang2025kimina\]and Goedel Prover V2\[goedel\_prover\_v2\]\. These models achieve state\-of\-the\-art performance by scaling up training and search on formal systems\. However, they require substantial computational resources for training and are highly specialized for specific formal languages\. In contrast, LEAP demonstrates that general\-purpose LLMs, when placed in a proper agentic environment, can achieve competitive performance without such specialized fine\-tuning\.

#### Auto\-Formalization

Auto\-formalization, the task of translating natural language mathematics into formal statements and proofs, is a critical bridge between informal and formal reasoning\. Early work relied on neural machine translation techniques\[wu2022autoformalization\]\. More recently, LLMs have been used to generate formal statements for training provers at scale, as seen in the auto\-formalization pipeline of AlphaProof\[alphaproof\]\. LEAP utilizes the strong auto\-formalization capabilities of general LLMs within its agentic harness to bridge the gap between informal blueprints and formal proofs\.

#### Mathematical Reasoning with LLMs

Large Language Models have shown impressive progress in solving natural\-language mathematical problems, demonstrating strong capabilities in complex reasoning\. Recent advancements, such as OpenAI o1\[openai\_o1\]and DeepSeek R1\[deepseek\_r1\], have demonstrated the effectiveness of scaling reinforcement learning for complex mathematical tasks, achieving high scores on competitive benchmarks like AIME\. However, direct evaluation of these models on formal theorem proving benchmarks often yields low solve rates, highlighting the gap between informal reasoning and formal verification\. LEAP addresses this by leveraging the strong informal reasoning and instruction\-following capabilities of general LLMs within an agentic harness, enabling them to interact with the Lean compiler and iteratively self\-correct, thus bridging the formalization gap without specialized fine\-tuning\.

## Appendix BProblem Statements

We present the LEAN statements of the open problems that we tested with LEAP\.

#### Hamiltonian Decomposition of Directed Cayley Graphs

The Hamiltonian decomposition problem for the directed Cayley graphΓm=C​a​y​\(ℤm3,\{e1,e2,e3\}\)\\Gamma\_\{m\}=Cay\(\\mathbb\{Z\}\_\{m\}^\{3\},\\\{e\_\{1\},e\_\{2\},e\_\{3\}\\\}\)asks whether its edges can be partitioned into three distinct directed Hamiltonian cycles\. For the even\-case construction \(m=2​h≥10m=2h\\geq 10\), the 3D routing dynamics of individual color classes can be analytically projected onto a 2D planar “round map” defined on aℤm×ℤm\\mathbb\{Z\}\_\{m\}\\times\\mathbb\{Z\}\_\{m\}grid\. The formal statement below encodes the exact operational semantics for the Color 2 subgraph—including its parity\-dependent structural defects, coordinate shifts, and piecewise transitions—and asserts that its round map forms a single, unbroken cycle of lengthm2m^\{2\}\.

`Lean Statement for the Hamiltonian Decomposition of Directed Cayley Graphs`

`Erdős 457 Erdős Problem 457 is a number theory challenge concerning the prime divisors of consecutive integers\. Specifically, it conjectures the existence of a real number ε\>0\\varepsilon\>0 such that for infinitely many integers nn, every prime number p≤\(2\+ε\)​log⁡np\\leq\(2\+\\varepsilon\)\\log n divides the product of the ⌊log⁡n⌋\\lfloor\\log n\\rfloor consecutive integers starting from n\+1n\+1\. The Lean formalization below captures this exact asymptotic prime divisibility condition\. Lean Statement for Erdős Problem 457 Appendix C Proof Contexts and Artifacts This section describes the formal and informal artifacts used by LEAP during proof planning\. Formal artifacts correspond to Lean\-level objects that are checked by the compiler or represented in the proof DAG, while informal artifacts correspond to natural\-language planning objects used to guide direct proving and decomposition\. Formal context\. A proof goal is a Lean theorem or lemma statement that remains to be proved\. The original input theorem is the root proof goal, while lemma statements introduced by decomposition become subgoals in the proof DAG\. A formal proof is a complete Lean proof of the current proof goal that does not rely on newly proposed unproven lemmas; once accepted by the Lean compiler, the corresponding goal is marked as resolved\. A proof sketch is a Lean artifact that proves the current goal assuming a set of proposed lemmas\. In LEAP, a proof sketch may contain sorry placeholders only for these explicitly proposed lemmas\. Thus, a verified proof sketch defines a valid decomposition: once all referenced proposed lemmas are later proved, the current goal is also proved\. We present examples of these context using Lean\-IMO\-Bench Problem 001 and 009 in the Basic Set\. All artifacts, except for the Proof Goal of the root problem, are created by LEAP automatically\.\) Example Proof Goal \(Lean\-IMO\-Bench, Basic 001\) Example Formal Proof \(Lean\-IMO\-Bench, Basic 001\) Example Proof Sketch \(Lean\-IMO\-Bench, Basic 006\) Informal context\. An informal proof is a natural\-language plan for proving the current goal directly, without introducing a decomposition into new subgoals\. It guides the generation of a complete formal proof\. An informal blueprint is a higher\-level natural\-language plan that explains how the current goal can be reduced to useful supporting lemmas\. Unlike an informal proof, an informal blueprint may introduce proposed lemmas that are not immediately proved and may include auxiliary lemmas for later proof steps\. These proposed lemmas are then translated into formal subgoals and organized in the proof DAG\. Examples are shown below: Example Informal Proof \(Lean\-IMO\-Bench, Basic 001\) Let S=\{f:ℤ→ℤ∣∀x,y∈ℤ,f​\(2​x\)\+2​f​\(y\)=f​\(f​\(x\+y\)\)\}S=\\\{f:\\mathbb\{Z\}\\to\\mathbb\{Z\}\\mid\\forall x,y\\in\\mathbb\{Z\},f\(2x\)\+2f\(y\)=f\(f\(x\+y\)\)\\\} and T=\{0\}∪\{x↦2​x\+c∣c∈ℤ\}T=\\\{0\\\}\\cup\\\{x\\mapsto 2x\+c\\mid c\\in\\mathbb\{Z\}\\\}\. We want to show that S=TS=T\. Part 1: Proof that T⊆ST\\subseteq S Suppose f∈Tf\\in T\. Then either ff is the constant zero function \(f=0f=0\) or f​\(x\)=2​x\+cf\(x\)=2x\+c for some c∈ℤc\\in\\mathbb\{Z\}\. • Case 1: If f=0f=0, then for any x,y∈ℤx,y\\in\\mathbb\{Z\}, the left\-hand side is f​\(2​x\)\+2​f​\(y\)=0\+2​\(0\)=0f\(2x\)\+2f\(y\)=0\+2\(0\)=0\. The right\-hand side is f​\(f​\(x\+y\)\)=f​\(0\)=0f\(f\(x\+y\)\)=f\(0\)=0\. The two sides match, so f∈Sf\\in S\. • Case 2: If f​\(x\)=2​x\+cf\(x\)=2x\+c for some c∈ℤc\\in\\mathbb\{Z\}, then for any x,y∈ℤx,y\\in\\mathbb\{Z\}, the left\-hand side is: f​\(2​x\)\+2​f​\(y\)=\(2​\(2​x\)\+c\)\+2​\(2​y\+c\)=4​x\+4​y\+3​cf\(2x\)\+2f\(y\)=\(2\(2x\)\+c\)\+2\(2y\+c\)=4x\+4y\+3c\. The right\-hand side is: f​\(f​\(x\+y\)\)=f​\(2​\(x\+y\)\+c\)=2​\(2​x\+2​y\+c\)\+c=4​x\+4​y\+3​cf\(f\(x\+y\)\)=f\(2\(x\+y\)\+c\)=2\(2x\+2y\+c\)\+c=4x\+4y\+3c\. The two sides are equal, so f∈Sf\\in S\. This proves T⊆ST\\subseteq S\. Part 2: Proof that S⊆TS\\subseteq T Let f∈Sf\\in S\. Then for all x,y∈ℤx,y\\in\\mathbb\{Z\}, we have: \(1\) f​\(2​x\)\+2​f​\(y\)=f​\(f​\(x\+y\)\)f\(2x\)\+2f\(y\)=f\(f\(x\+y\)\)\. Substitute x=0x=0 into Eq\. \(1\): f​\(0\)\+2​f​\(y\)=f​\(f​\(y\)\)f\(0\)\+2f\(y\)=f\(f\(y\)\)\. Let c=f​\(0\)c=f\(0\)\. Then for all y∈ℤy\\in\\mathbb\{Z\}, we have: \(2\) f​\(f​\(y\)\)=2​f​\(y\)\+cf\(f\(y\)\)=2f\(y\)\+c\. Using Eq\. \(2\) on the right\-hand side of Eq\. \(1\) \(since it holds for any input, including x\+yx\+y\), we obtain: \(3\) f​\(2​x\)\+2​f​\(y\)=2​f​\(x\+y\)\+cf\(2x\)\+2f\(y\)=2f\(x\+y\)\+c\. Substitute y=0y=0 into Eq\. \(3\): f​\(2​x\)\+2​f​\(0\)=2​f​\(x\)\+cf\(2x\)\+2f\(0\)=2f\(x\)\+c\. Since f​\(0\)=cf\(0\)=c, this becomes f​\(2​x\)\+2​c=2​f​\(x\)\+cf\(2x\)\+2c=2f\(x\)\+c, which yields: \(4\) f​\(2​x\)=2​f​\(x\)−cf\(2x\)=2f\(x\)\-c\. Now, substitute Eq\. \(4\) back into the left\-hand side of Eq\. \(3\): \(2​f​\(x\)−c\)\+2​f​\(y\)=2​f​\(x\+y\)\+c\(2f\(x\)\-c\)\+2f\(y\)=2f\(x\+y\)\+c\. Rearranging the terms gives: 2​f​\(x\+y\)=2​f​\(x\)\+2​f​\(y\)−2​c2f\(x\+y\)=2f\(x\)\+2f\(y\)\-2c\. Because the codomain is ℤ\\mathbb\{Z\}, we can divide both sides by 22 to obtain: \(5\) f​\(x\+y\)=f​\(x\)\+f​\(y\)−cf\(x\+y\)=f\(x\)\+f\(y\)\-c\. Define a new function g:ℤ→ℤg:\\mathbb\{Z\}\\to\\mathbb\{Z\} by g​\(x\)=f​\(x\)−cg\(x\)=f\(x\)\-c\. Then we can rewrite Eq\. \(5\) as: g​\(x\+y\)\+c=\(g​\(x\)\+c\)\+\(g​\(y\)\+c\)−cg\(x\+y\)\+c=\(g\(x\)\+c\)\+\(g\(y\)\+c\)\-c\. Simplifying this yields Cauchy’s functional equation: g​\(x\+y\)=g​\(x\)\+g​\(y\)g\(x\+y\)=g\(x\)\+g\(y\) for all x,y∈ℤx,y\\in\\mathbb\{Z\}\. Since gg satisfies g​\(x\+y\)=g​\(x\)\+g​\(y\)g\(x\+y\)=g\(x\)\+g\(y\) over integers, it must be a linear function\. Let k=g​\(1\)∈ℤk=g\(1\)\\in\\mathbb\{Z\}\. By standard induction, we have g​\(x\)=k​xg\(x\)=kx for all integers x≥0x\\geq 0 \(with g​\(0\)=0g\(0\)=0\)\. For negative integers, 0=g​\(0\)=g​\(x\+\(−x\)\)=g​\(x\)\+g​\(−x\)0=g\(0\)=g\(x\+\(\-x\)\)=g\(x\)\+g\(\-x\), meaning g​\(x\)=−g​\(−x\)=−\(−k​x\)=k​xg\(x\)=\-g\(\-x\)=\-\(\-kx\)=kx\. Thus g​\(x\)=k​xg\(x\)=kx for all x∈ℤx\\in\\mathbb\{Z\}\. As a result, we have f​\(x\)=k​x\+cf\(x\)=kx\+c for all x∈ℤx\\in\\mathbb\{Z\}\. To determine the permissible values of kk and cc, substitute f​\(x\)=k​x\+cf\(x\)=kx\+c back into the original Eq\. \(1\): Left\-hand side: f​\(2​x\)\+2​f​\(y\)=k​\(2​x\)\+c\+2​\(k​y\+c\)=2​k​x\+2​k​y\+3​cf\(2x\)\+2f\(y\)=k\(2x\)\+c\+2\(ky\+c\)=2kx\+2ky\+3c\. Right\-hand side: f​\(f​\(x\+y\)\)=k​\(f​\(x\+y\)\)\+c=k​\(k​\(x\+y\)\+c\)\+c=k2​x\+k2​y\+\(k\+1\)​cf\(f\(x\+y\)\)=k\(f\(x\+y\)\)\+c=k\(k\(x\+y\)\+c\)\+c=k^\{2\}x\+k^\{2\}y\+\(k\+1\)c\. For the equality 2​k​x\+2​k​y\+3​c=k2​x\+k2​y\+\(k\+1\)​c2kx\+2ky\+3c=k^\{2\}x\+k^\{2\}y\+\(k\+1\)c to hold for all x,y∈ℤx,y\\in\\mathbb\{Z\}, the corresponding coefficients must match\. Setting x=0x=0 and y=0y=0 yields: \(6\) 3​c=\(k\+1\)​c3c=\(k\+1\)c\. Setting x=1x=1 and y=0y=0 yields: 2​k\+3​c=k2\+\(k\+1\)​c2k\+3c=k^\{2\}\+\(k\+1\)c\. Subtracting Eq\. \(6\) from this gives 2​k=k22k=k^\{2\}, which simplifies to k​\(k−2\)=0k\(k\-2\)=0\. Since k∈ℤk\\in\\mathbb\{Z\}, the only solutions are k=0k=0 or k=2k=2\. • Case A: k=0k=0\. Substitute k=0k=0 into Eq\. \(6\): 3​c=c⟹2​c=0⟹c=03c=c\\implies 2c=0\\implies c=0\. Thus, f​\(x\)=0​x\+0=0f\(x\)=0x\+0=0, meaning ff is the zero function\. Hence f∈\{0\}⊆Tf\\in\\\{0\\\}\\subseteq T\. • Case B: k=2k=2\. Substitute k=2k=2 into Eq\. \(6\): 3​c=3​c3c=3c, which is true for any c∈ℤc\\in\\mathbb\{Z\}\. Thus, f​\(x\)=2​x\+cf\(x\)=2x\+c for some c∈ℤc\\in\\mathbb\{Z\}\. Hence f∈\{x↦2​x\+c∣c∈ℤ\}⊆Tf\\in\\\{x\\mapsto 2x\+c\\mid c\\in\\mathbb\{Z\}\\\}\\subseteq T\. In all cases, any function f∈Sf\\in S is also in TT, proving S⊆TS\\subseteq T\. Since both subset inclusions have been established, S=TS=T\. Example Informal Blueprint \(Lean\-IMO\-Bench, Basic 006\) High\-Level Mathematical Idea The goal is to prove that k≤K​\(c\)k\\leq K\(c\), where K​\(c\)=c12−2​c0​c2K\(c\)=c\_\{1\}^\{2\}\-2c\_\{0\}c\_\{2\}, given that the polynomial P​\(x\)=∑i=0kci​xiP\(x\)=\\sum\_\{i=0\}^\{k\}c\_\{i\}x^\{i\} with integer coefficients has degree kk and exactly kk real roots\. The proof elegantly uses multiset symmetric polynomials and the AM\-GM inequality, avoiding rational functions or polynomial derivatives: 1\. Polynomial Roots and Splitting: Since PP has natural degree kk and kk distinct real roots, it splits completely over ℝ\\mathbb\{R\}, and the multiset of its roots SS has cardinality kk\. 2\. Vieta’s Formulas: Vieta’s formulas express the coefficients c0,c1,c2c\_\{0\},c\_\{1\},c\_\{2\} in terms of the elementary symmetric polynomials of SS: c0=ck​\(−1\)k​Ekc\_\{0\}=c\_\{k\}\(\-1\)^\{k\}E\_\{k\} c1=ck​\(−1\)k−1​Ek−1c\_\{1\}=c\_\{k\}\(\-1\)^\{k\-1\}E\_\{k\-1\} c2=ck​\(−1\)k−2​Ek−2c\_\{2\}=c\_\{k\}\(\-1\)^\{k\-2\}E\_\{k\-2\} where Ei=esymmi​\(S\)E\_\{i\}=\\text\{esymm\}\_\{i\}\(S\)\. 3\. Multiset Identity: We construct a new multiset YY by mapping each root x∈Sx\\in S to the product of all other roots, \(S∖\{x\}\)\(S\\setminus\\\{x\\\}\)\.prod\. Purely multiset inductive identities show: ∑Y=Ek−1\\sum Y=E\_\{k\-1\} esymm2​\(Y\)=Ek​Ek−2\\text\{esymm\}\_\{2\}\(Y\)=E\_\{k\}E\_\{k\-2\} ∏Y=\(Ek\)k−1\\prod Y=\(E\_\{k\}\)^\{k\-1\} 4\. Sum of Squares: For the multiset Z=ck​YZ=c\_\{k\}Y, we evaluate the sum of its squares W=\{z2∣z∈Z\}W=\\\{z^\{2\}\\mid z\\in Z\\\}\. By the relation \(∑Z\)2=∑\(Z2\)\+2​esymm2​\(Z\)\(\\sum Z\)^\{2\}=\\sum\(Z^\{2\}\)\+2\\text\{esymm\}\_\{2\}\(Z\), we can algebraically compute ∑W=c12−2​c0​c2=K​\(c\)\\sum W=c\_\{1\}^\{2\}\-2c\_\{0\}c\_\{2\}=K\(c\)\. 5\. Integer Product Bound: The product of ZZ evaluates to ck​\(\(−1\)k​c0\)k−1c\_\{k\}\(\(\-1\)^\{k\}c\_\{0\}\)^\{k\-1\}\. Since PP has degree kk and c0≠0c\_\{0\}\\neq 0, both c0c\_\{0\} and ckc\_\{k\} are non\-zero integers\. Thus, the product of ZZ is a non\-zero integer, implying ∏W=\(∏Z\)2≥1\\prod W=\(\\prod Z\)^\{2\}\\geq 1\. 6\. AM\-GM Inequality: Applying the AM\-GM inequality to the multiset WW \(which consists of kk non\-negative real numbers whose product is ≥1\\geq 1\), we obtain ∑W≥k⟹K​\(c\)≥k\\sum W\\geq k\\implies K\(c\)\\geq k\. Required Global Definitions, Variables, or Structures No new definitions, axioms, or structures are needed\. We use purely standard Mathlib components \(like Multiset, Polynomial, and esymm\)\. Smaller Lemmas \(Subproblems\) lemma coeff\_of\_sum\_Icc \(c : ℕ\\mathbb\{N\} →\\rightarrow ℤ\\mathbb\{Z\}\) \(k : ℕ\\mathbb\{N\}\) \(hk : \(2 : ℕ\\mathbb\{N\}\) ≤\\leq k\) \(i : ℕ\\mathbb\{N\}\) \(hi : i ≤\\leq k\) : \(∑\\sum j ∈\\in Finset\.Icc 0 k, Polynomial\.monomial j \(c j\)\)\.coeff i = c i Purpose: Simplifies polynomial coefficient extraction from the given sum format\. lemma card\_roots\_eq\_of\_ncard\_rootSet \{k : ℕ\\mathbb\{N\}\} \{P : Polynomial ℝ\\mathbb\{R\}\} \(h\_deg : P\.natDegree = k\) \(h\_ncard : \(P\.rootSet ℝ\\mathbb\{R\}\)\.ncard = k\) : P\.roots\.card = k Purpose: Shows that a polynomial with kk distinct real roots and degree kk has exactly kk roots counting multiplicities\. lemma multiset\_map\_erase\_prod\_sum \{R : Type\*\} \[CommRing R\] \(s : Multiset R\) : \(s\.map \(fun x =\> \(s\.erase x\)\.prod\)\)\.sum = s\.esymm \(s\.card \- 1\) Purpose: Multiset identity relating the sum of element\-wise excluded products to Ek−1E\_\{k\-1\}\. lemma multiset\_map\_erase\_prod\_esymm\_two \{R : Type\*\} \[CommRing R\] \(s : Multiset R\) : \(s\.map \(fun x =\> \(s\.erase x\)\.prod\)\)\.esymm 2 = s\.prod \* s\.esymm \(s\.card \- 2\) Purpose: Multiset identity relating the 2nd elementary symmetric polynomial of excluded products to Ek​Ek−2E\_\{k\}E\_\{k\-2\}\. lemma multiset\_sum\_sq\_eq \{R : Type\*\} \[CommRing R\] \(s : Multiset R\) : \(s\.map \(fun x =\> x^2\)\)\.sum = \(s\.sum\)^2 \- \(2 : R\) \* s\.esymm 2 Purpose: Expresses the sum of squares of a multiset in terms of its sum and its 2nd elementary symmetric polynomial\. lemma multiset\_map\_erase\_prod\_prod \{R : Type\*\} \[CommRing R\] \(s : Multiset R\) : \(s\.map \(fun x =\> \(s\.erase x\)\.prod\)\)\.prod = s\.prod ^ \(s\.card \- 1\) Purpose: Computes the full product of the excluded products multiset\. lemma multiset\_sum\_ge\_card\_of\_prod\_ge\_one \(W : Multiset ℝ\\mathbb\{R\}\) \(hw : ∀\\forall x ∈\\in W, 0 ≤\\leq x\) \(hp : \(1 : ℝ\\mathbb\{R\}\) ≤\\leq W\.prod\) : \(W\.card : ℝ\\mathbb\{R\}\) ≤\\leq W\.sum Purpose: The AM\-GM inequality specialized for a multiset whose product is at least 1, proving that the sum is bounded below by its cardinality\. Proof Body Outline 1\. Define PP as the sum ∑i∈Finset\.Icc ​0​kmonomial ​i​\(ci\)\\sum\_\{i\\in\\text\{Finset\.Icc \}0k\}\\text\{monomial \}i\(c\_\{i\}\) and PRP\_\{R\} as P\.map ​\(algebraMap ​ℤ​ ​ℝ\)P\.\\text\{map \}\(\\text\{algebraMap \}\\mathbb\{Z\}\\text\{ \}\\mathbb\{R\}\)\. 2\. Apply coeff\_of\_sum\_Icc to assert PR\.coeff i=\(ci:ℝ\)P\_\{R\}\.\\text\{coeff \}i=\(c\_\{i\}:\\mathbb\{R\}\) for i∈\{0,1,2,k\}i\\in\\\{0,1,2,k\\\}\. 3\. Establish PR\.roots\.card=kP\_\{R\}\.\\text\{roots\.card\}=k using card\_roots\_eq\_of\_ncard\_rootSet and the natural degree injectivity\. 4\. Establish that PR\.splits ​\(RingHom\.id ​ℝ\)P\_\{R\}\.\\text\{splits \}\(\\text\{RingHom\.id \}\\mathbb\{R\}\) follows from Polynomial\.splits\_iff\_card\_roots\. 5\. Let s=PR\.rootss=P\_\{R\}\.\\text\{roots\}\. Invoke Vieta’s formulas \(Polynomial\.coeff\_eq\_esymm\_roots\_of\_splits\) to express c0c\_\{0\}, c1c\_\{1\}, and c2c\_\{2\} in terms of s\.esymm ​is\.\\text\{esymm \}i\. 6\. Define multisets YY and ZZ matching the theoretical blueprint\. Use multiset\_sum\_sq\_eq, multiset\_map\_erase\_prod\_sum, and multiset\_map\_erase\_prod\_esymm\_two to show that the sum of the squared elements of ZZ expands algebraically to exactly \(c12−2c0c2:ℝ\)=\(K\(c\):ℝ\)\(c\_\{1\}^\{2\}\-2c\_\{0\}c\_\{2\}:\\mathbb\{R\}\)=\(K\(c\):\\mathbb\{R\}\)\. 7\. Use multiset\_map\_erase\_prod\_prod to find Z\.prod=ck​\(\(−1\)k​c0\)k−1Z\.\\text\{prod\}=c\_\{k\}\(\(\-1\)^\{k\}c\_\{0\}\)^\{k\-1\}\. 8\. Observe that since c0c\_\{0\} and ckc\_\{k\} are non\-zero integers, their algebraic combination Z\.prodZ\.\\text\{prod\} represents a non\-zero integer, so its square \(the product of W=Z2W=Z^\{2\}\) is ≥1\\geq 1\. 9\. Feed WW to multiset\_sum\_ge\_card\_of\_prod\_ge\_one to deduce that W\.sum≥W\.cardW\.\\text\{sum\}\\geq W\.\\text\{card\}\. 10\. Using W\.card=kW\.\\text\{card\}=k and W\.sum=\(K\(c\):ℝ\)W\.\\text\{sum\}=\(K\(c\):\\mathbb\{R\}\), deduce \(k:ℝ\)≤\(K\(c\):ℝ\)\(k:\\mathbb\{R\}\)\\leq\(K\(c\):\\mathbb\{R\}\)\. Use norm\_cast to translate this back to \(k:ℤ\)≤K\(c\)\(k:\\mathbb\{Z\}\)\\leq K\(c\)\.`

Similar Articles

@FinanceYF5: Google new paper: Let LLM solve math competition problems, accuracy jumps from 10% to 70%. [LEAP framework] Instead of having the model write a complete proof at once, it breaks down the problem into a goal tree, learns step by step from Lean verifier feedback, and reuses proven lemmas. Result: All 12 problems of Putnam 2025 solved, IMO style…

X AI KOLs Timeline

Google new paper proposes the LEAP framework, which decomposes math problems into goal trees, learns from Lean verifier feedback, and improves LLM accuracy on math competition problems from 10% to 70%. It solves all 12 problems of Putnam 2025 and surpasses dedicated gold-medal-level systems on IMO-style benchmarks.

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

arXiv cs.CL

This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).