Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
Summary
This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).
View Cached Full Text
Cached at: 04/20/26, 08:30 AM
# An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4 Source: https://arxiv.org/html/2604.15839 Chengwu Liu¹*, Yichun Yin², Ye Yuan¹, Jiaxuan Xie³, Botao Li⁴, Siqi Li⁴, Jianhao Shen², Yan Xu², Lifeng Shang², Ming Zhang¹ ¹State Key Laboratory for Multimedia Information Processing, School of Computer Science, PKU-Anker LLM Lab, Peking University ²Huawei Technologies Co., Ltd. ³School of Software & Microelectronics, Peking University ⁴School of Electronics Engineering and Computer Science, Peking University Correspondence: Ming Zhang ([email protected]) ## Abstract Most ATP benchmarks embed the final answer within the formal statement — a convention we call "Easy Mode" — a design that simplifies the task relative to what human competitors face and may lead to optimistic estimates of model capability. We call the stricter, more realistic setting "Hard Mode": the system must independently discover the answer before constructing a formal proof. To enable Hard Mode research, we make two contributions. First, we release MiniF2F-Hard and FIMO-Hard, expert-reannotated Hard Mode variants of two widely-used ATP benchmarks. Second, we introduce Discover And Prove (DAP), an agentic framework that uses LLM natural-language reasoning with explicit self-reflection to discover answers, then rewrites Hard Mode statements into Easy Mode ones for existing ATP provers. DAP sets the state of the art: on CombiBench it raises solved problems from 7 (previous SOTA, Pass@16) to 10; on PutnamBench it is the first system to formally prove 36 theorems in Hard Mode — while simultaneously revealing that state-of-the-art LLMs exceed 80% answer accuracy on the same problems where formal provers manage under 10%, exposing a substantial gap that Hard Mode benchmarks are uniquely suited to measure. --- ## 1 Introduction The use of AI to solve mathematical problems has attracted considerable research interest, not only because of the potential for concrete applications in domains like education and mathematical research, but also because tackling highly abstract mathematical problems generally requires capabilities that may generalize and transfer to complex real-world tasks. These capabilities include planning, search, deductive reasoning, and induction (Yang et al., 2024). Within the spectrum of mathematical tasks, competition problems — especially those at the International Mathematical Olympiad (IMO) level — have garnered particular attention. These problems go beyond numerical computation or simple formula application. They typically demand abstraction and modeling, rigorous logical argumentation, and often require elements of intuition and creativity. Accordingly, the ability to solve IMO-level problems is widely regarded as an important milestone for AI (Yang et al., 2024). Existing approaches to solving mathematical problems fall into two broad categories: informal methods and formal methods. Informal methods solve mathematical problems in natural language and leverage the strong reasoning abilities of large language models (LLMs), whereas formal methods use formal languages such as Lean (Moura & Ullrich, 2021) and Isabelle (Nipkow et al., 2002) to express the solution. A key advantage of formal methods is that proofs written in formal languages can be automatically and rigorously verified by a proof assistant program. At the International Mathematical Olympiad (IMO) 2024, participating AI systems employed formal methods (AlphaProof and AlphaGeometry, 2024). By IMO 2025, however, most evaluated systems had shifted toward informal approaches (Luong et al., 2025; Huang & Yang, 2025; Wei, 2025; Huawei-xiaoyi, 2025). We observe that current practice in many formalization efforts often embeds the final answer directly into the statement to be proved, which we refer to as "Easy Mode". We point out that Easy Mode may substantially reduce the difficulty of formal problem-solving tasks. To address this issue, we draw inspiration from prior works (PutnamBench by Tsoukalas et al., 2024 and CombiBench by Liu et al., 2025): answer-oriented problems are encoded in Lean 4 with two separate goals (two distinct `sorry`s). In this "Hard Mode" configuration, the model must first supply the final answer by replacing the first `sorry` with the answer and then produce a conventional formal proof for the remaining goal. This setup prevents embedding extra information that human contestants must discover by themselves in the formal statement. By definition, **Hard Mode** requires that any quantity a human competitor must derive through reasoning is not supplied as a premise in the formal statement; it must be independently discovered by the AI system. We adopt the term "Hard Mode" following the convention in the Lean community.¹ CombiBench (Liu et al., 2025) refers to the same distinction as "without solution" vs. "with solution", while PutnamBench (Tsoukalas et al., 2024) uses "no answer" vs. "with answer". An example illustrating the difference between Easy Mode and Hard Mode is presented in Figure 1. We commissioned expert annotators to reannotate two widely used ATP competition datasets, namely MiniF2F and FIMO, producing MiniF2F-Hard and FIMO-Hard. During reannotation, we corrected known alignment issues in existing formal benchmarks (Wang et al., 2025a; Lin et al., 2025b). To solve Hard Mode problems, we introduce the **Discover and Prove (DAP)** framework, a fully open-source, agent-based ATP framework for Hard Mode ATP tasks. DAP consists of two components: a Discovery Module and a Proving Module. We prompt an open-source LLM to generate and iteratively refine its reasoning and answers using self-verification procedures. After the Discovery module "discovers" a plausible answer and fills the first `sorry`, the Proving module attempts to produce complete formal proofs by invoking traditional ATP provers. DAP achieves state-of-the-art results. Evaluating on the full PutnamBench dataset, DAP solves 36 problems in total. On solution-style problems with Hard Mode variants, it solves 19 problems — to our knowledge, this constitutes the first public result on PutnamBench under this Hard Mode evaluation setting. On CombiBench Hard Mode, it solves 10 problems, improving significantly on the previous state of the art (Kimina-Prover Preview), which solved 8 problems. Our contributions are threefold: 1. We reannotate two commonly used ATP competition datasets, MiniF2F and FIMO, to align tasks presented to human competitors with those given to AI systems, removing the Easy Mode discrepancy and providing a more principled basis for evaluating AI mathematical capability. 2. We propose DAP, an open-source, agentic Hard Mode ATP framework. With a simple and straightforward design, DAP achieves state-of-the-art performance on PutnamBench and CombiBench. 3. We provide an analysis quantifying the individual contributions of the two modules of our proposed DAP prover to overall performance. These results shed light on the relative strengths of informal and formal method approaches for solving competition-level mathematical problems. Code and datasets are available at https://github.com/liuchengwucn/discover-and-prove. ## 2 Related Work **Figure 1:** Differences between Easy Mode and Hard Mode configurations in automated theorem proving. The example shown is a Lean 4 formalization of an IMO problem in two different styles. This example was intentionally selected to illustrate the kind of semantic misalignment that our re-annotation effort corrects: the Easy Mode formalization proves only that x must lie within certain ranges, but does not establish that every value in those ranges is attainable, weakening the original if-and-only-if requirement to a one-directional implication. By contrast, our Hard Mode formalization represents the answer as a set, thereby fully capturing the natural-language problem's requirement in the formal statement. ### 2.1 Mathematical Problem-Solving with AI Systems Powered by Chain-of-Thought (CoT) prompting and RLVR training, LLMs have made substantial progress in mathematical reasoning. Frontier models (e.g., OpenAI o1 by Jaech et al., 2024; DeepSeek R1 by Guo et al., 2025; Google Gemini 2.5 Pro by Comanici et al., 2025) now achieve near-saturation performance on widely-used math benchmarks, including GSM8K (Cobbe et al., 2021), MATH-500 (Lightman et al., 2023), and AIME 2024/2025. One downside of informal methods is that the solution traces they produce are notoriously difficult to verify automatically. Assessing the validity of a generated proof would require domain experts to inspect it carefully to detect subtle errors (Lightman et al., 2023), which is infeasible at scale. For this reason, informal systems are typically applied to problems that require only a final numerical or symbolic answer, verified by direct comparison against ground-truth (Wen et al., 2025). ### 2.2 Automated Theorem Proving The key strength of automated theorem proving (ATP) is that formal proofs can be checked rigorously and automatically by proof assistants (Yousefzadeh & Cao, 2025; Wang et al., 2023). Although formal approaches have faced challenges of limited formal-data availability (Xin et al., 2024), they have advanced rapidly with larger LLMs and scaling search compute (Xin et al., 2025; Chen et al., 2025). Recent systems such as Kimina-Prover Preview (Wang et al., 2025a), DeepSeek-Prover-V2 (Ren et al., 2025), and Goedel-Prover-V2 (Lin et al., 2025b) have made substantial progress on the MiniF2F benchmark (Zheng et al., 2021). Seed-Prover (Chen et al., 2025) is a lemma-style whole-proof reasoning model that iteratively refines proofs via Lean compiler feedback, proved lemmas, and self-summarization, achieving over 50% on PutnamBench and saturating MiniF2F. DSP (Jiang et al., 2023) guides formal theorem provers with natural-language draft proofs and structured sketches; DSP+ (Cao et al., 2025) revives this paradigm with modern reasoning models. DAP differs from DSP/DSP+ in three key respects; see Appendix B for a detailed comparison. Several agentic frameworks have also targeted Lean-based ATP (Thakur et al., 2023; Kumarappan et al., 2024; Baba et al., 2025; Wang et al., 2025b); see Appendix A. To handle informal mathematical problems that require a final answer, prior efforts typically convert problems requiring solutions into proof problems by embedding the desired answer into the formalized statement and proving that statement (Zheng et al., 2021; Liu et al., 2023; Xiong et al., 2023). This practice raises two concerns. First, embedding the answer in the statement can reduce the intrinsic difficulty: for many solution-oriented problems, the primary challenge is discovering the answer rather than proving a consequence once known, so supplying it acts as a substantial hint. Second, the resulting formalized statements are sometimes not perfectly semantically aligned with the tasks human contestants face. As prior analyses (e.g., FMCX by Xie et al., 2025; OlympiadBench by He et al., 2024) have noted, some existing formal benchmarks are misaligned: some formal statements may capture only a subset of the goals that human solvers must address. ### 2.3 Formalization & Data Curation High-quality formal datasets are scarce and require substantial expert effort (Xin et al., 2024). MiniF2F (Zheng et al., 2021) is one of the most widely used ATP benchmarks, containing formalized statements from mathematical olympiads and high-school and undergraduate courses; Kimina-Prover Preview (Wang et al., 2025a) supplies corrections to a subset of its problems. FIMO (Liu et al., 2023) is constructed from IMO shortlist problems but lacked a publicly available Lean 4 version, which impeded its adoption in recent work. PutnamBench (Tsoukalas et al., 2024) comprises hand-constructed formalizations of Putnam Competition problems and, for the first time, provides both "with answer" and "no answer" evaluation settings. CombiBench (Liu et al., 2025) offers 100 combinatorics problems ranging from middle-school level through IMO and university level, complementing the number-theoretic and algebraic focus of the other benchmarks; IMOSLLean4 (Sanjaya, 2025) and IMO-Steps (Yousefzadeh & Cao, 2024) additionally supply complete proofs. To alleviate data scarcity, automated formalization generation has also been explored.
Similar Articles
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver is a unified framework for agentic formal theorem proving in Lean 4 that iteratively improves proof generation through training with verified proofs and compiler feedback, achieving state-of-the-art results on multiple benchmarks.
LEAP: Supercharging LLMs for Formal Mathematics with Agentic Frameworks
LEAP is an agentic framework that enables general-purpose LLMs to achieve state-of-the-art performance in formal theorem proving in Lean, solving all 12 problems from the 2025 Putnam Competition and boosting formal solve rates from below 10% to 70% on a new benchmark (Lean-IMO-Bench), surpassing specialized systems.
Pythagoras-Prover: Advancing Efficient Formal Proving via Augmented Lean Formalisation
Pythagoras-Prover is a compute-efficient family of Lean theorem provers that achieves strong performance using curriculum supervised fine-tuning and a novel Augmented Lean Formalisation technique. The 4B model surpasses DeepSeek-Prover-V2-671B at pass@32 on MiniF2F-Test, and the 32B model sets a new state-of-the-art among open-source provers.
Process-Verified Reinforcement Learning for Theorem Proving via Lean
This paper presents Process-Verified Reinforcement Learning, using the Lean proof assistant as a process oracle to provide fine-grained tactic-level feedback during training, improving theorem proving performance.
Evaluating the Robustness of Proof Autoformalization in Lean 4
This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.