Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics

arXiv cs.AI Papers

Summary

This paper introduces Formal Conjectures, an evolving benchmark of 2615 mathematical statements formalized in Lean 4, including open research conjectures for proof discovery and solved problems for auto-formalization, designed to evaluate automated reasoning systems with zero contamination.

arXiv:2605.13171v1 Announce Type: new Abstract: As automated reasoning systems advance rapidly, there is a growing need for research-level formal mathematical problems to accurately evaluate their capabilities. To address this, we present Formal Conjectures, an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof autoformalization. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures. We describe our approach to ensuring the correctness of these formalizations in a collaborative open-source project where contributions stem from an active community. In this framework, AI-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research-level mathematics.
Original Article
View Cached Full Text

Cached at: 05/14/26, 06:15 AM

# Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
Source: [https://arxiv.org/html/2605.13171](https://arxiv.org/html/2605.13171)
ean\]lean4breaklines,breakbefore=\. ,breakafter=\_ eancode\]lean4autogobble \[ Extension=\.ttf, UprightFont=\*\-Regular, BoldFont=\*\-Bold, ItalicFont=\*\-RegularItalic, BoldItalicFont=\*\-BoldItalic, RawFeature=\-calt, \]

Moritz Firsching1Paul Lezeau211footnotemark:1Salvatore Mercuri211footnotemark:1Miklós Z\. Horváth111footnotemark:1 Yaël Dillies3Calle Sönne1Eric Wieser1Fred Zhang1 Thomas Hubert1Blaise Agüera y Arcas4Pushmeet Kohli1

1Google DeepMind2Imperial College London3Stockholms universitet4Google

###### Abstract

As automated reasoning systems advance rapidly, there is a growing need for research\-level formal mathematical problems to accurately evaluate their capabilities\. To address this, we present Formal Conjectures,111[https://github\.com/google\-deepmind/formal\-conjectures](https://github.com/google-deepmind/formal-conjectures)an evolving benchmark of currently 2615 mathematical problem statements formalized in Lean 4\. Sourced from areas of active mathematical research, the dataset features 1029 open research conjectures providing a zero\-contamination benchmark for mathematical proof discovery, and 836 solved problems for proof auto\-formalization\. Notably, the repository provides a structured interface connecting mathematicians who formalize and clarify problems with the AI systems and humans attempting to solve them\. Demonstrating its immediate utility, the benchmark has already been leveraged to make new mathematical discoveries, including the resolution of open research conjectures\. We describe our approach to ensuring the correctness of these formalizations in a collaborative open\-source project where contributions stem from an active community\. In this framework, AI\-generated proofs and disproofs serve as a valuable auditing mechanism to iteratively improve the fidelity of the benchmark\. Finally, we provide a standardized evaluation setup and report baseline results on frozen evaluation subsets, demonstrating a climbable signal that measures the current frontier of automated reasoning on research\-level mathematics\.

## 1Introduction

As automated reasoning systems, including large language models \(LLMs\), continue improving their capabilities, the benchmarks used to measure their progress face significant challenges\. In the domain of \(formal\) mathematics, many existing evaluation frameworks suffer from:\(i\) Data leakage:As solutions to benchmark problems appear online, it is difficult to distinguish genuine reasoning from memorization\. This is a challenge for benchmarks likeZhenget al\.\([2021](https://arxiv.org/html/2605.13171#bib.bib1)\); Tsoukalaset al\.\([2024](https://arxiv.org/html/2605.13171#bib.bib14)\); Jianget al\.\([2025](https://arxiv.org/html/2605.13171#bib.bib10)\); Yuet al\.\([2025](https://arxiv.org/html/2605.13171#bib.bib5)\);\(ii\) Secretive evaluation:To combat leakage, some benchmarks require that the evaluation set be kept private, which hinders open and reproducible research\(Cholletet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib12); Glazeret al\.,[2024](https://arxiv.org/html/2605.13171#bib.bib3); Center for AI Safetyet al\.,[2026](https://arxiv.org/html/2605.13171#bib.bib11)\);\(iii\) Oversimplified success criteria:Many benchmarks rely on verifying a simple, machine\-checkable final answer \(such as a number\), which fails to capture the complex, multi\-step reasoning inherent in a full mathematical proof\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.13171#bib.bib9); Glazeret al\.,[2024](https://arxiv.org/html/2605.13171#bib.bib3)\); and\(iv\) Saturation:As models improve, some benchmarks become saturated\. For instance, MiniF2F\(Zhenget al\.,[2021](https://arxiv.org/html/2605.13171#bib.bib1)\)is now routinely solved with over 99%; seeHubertet al\.\([2025](https://arxiv.org/html/2605.13171#bib.bib15)\); Chenet al\.\([2025b](https://arxiv.org/html/2605.13171#bib.bib6)\); Linet al\.\([2025](https://arxiv.org/html/2605.13171#bib.bib4)\)\.

To address these limitations, we introduceFormal Conjectures, an open\-source and growing collection of mathematical problems formalized in Lean 4\(Moura and Ullrich,[2021](https://arxiv.org/html/2605.13171#bib.bib31)\)using Mathlib\(The mathlib Community,[2020](https://arxiv.org/html/2605.13171#bib.bib25)\)\. The benchmark focuses on*open conjectures*as the primary challenge, providing a zero\-contamination testbed where solutions cannot be found in existing training data\. The secondary challenge consists of \(informally\) solved problems for auto\-formalization, which do not share the same zero\-contamination guarantees\. We formalize all problems in a formal language to allow for objective, rigorous, and automated evaluation: a proposed solution is definitively correct if and only if its corresponding proof is accepted by the kernel without relying on forbidden axioms likesorry\. Our choice of Lean 4 is pragmatic: Mathlib, the largest formal mathematics library available, is essential for stating a wide variety of advanced conjectures\. While we focus on Lean 4, the core principles of our benchmark are language\-agnostic and extendable to other formal systems\.

This paper details the construction and methodology of theFormal Conjecturesbenchmark\. The benchmark is publicly available under the Apache 2\.0 license and has already been utilized for real\-world mathematical discovery\(Alexeev and Mixon,[2025](https://arxiv.org/html/2605.13171#bib.bib19); DeepMind,[2026](https://arxiv.org/html/2605.13171#bib.bib30)\)\. In summary, our main contributions are:

- •The Formal Conjectures Benchmark:A large\-scale, evolving dataset of 2615 mathematical statements in Lean 4, including 1029 open conjectures for zero\-contamination proof discovery and 836 solved problems for auto\-formalization\.
- •A Unified API for Mathematical Reasoning:A repository interface connecting mathematicians with an ecosystem of automated solvers to facilitate simultaneous statement auditing and rapid dissemination of problems to AI systems\.
- •Methodology for Statement Formalization:We introduce a collaborative pipeline and a three\-level taxonomy of misformalizations \(Translation, Underspecified, and Source\) to ensure formalization fidelity\. This includes theanswer\(sorry\)mechanism, which decouples the mathematical value discovery from proof verification\. We provide insights from 291 fixed cases of research\-level formalization\.
- •A Standardized Evaluation Framework:We provide a rigorous evaluation setup using the Lean 4 kernel to verify multi\-step reasoning without relying on oversimplified numerical answers\. We establish frozen subsets,FC100SolvedSet1andFC100OpenSet1, to enable stable, reproducible model comparisons\.

## 2Related Work

Our work intersects mathematical reasoning benchmarks and formal theorem proving\.

##### General mathematical reasoning benchmarks\.

Several benchmarks evaluate AI systems on advanced problem\-solving without requiring formal proofs\. FrontierMath\(Glazeret al\.,[2024](https://arxiv.org/html/2605.13171#bib.bib3)\)uses expert\-crafted problems verified by numerical answers, which, while machine\-checkable, do not fully capture the multi\-step reasoning process\. In January 2026, Epoch AI launched*FrontierMath: Open Problems*, a website\-only pilot of a small collection of unsolved research\-level informal problems with evaluation offered as a service\. Humanity’s Last Exam\(Center for AI Safetyet al\.,[2026](https://arxiv.org/html/2605.13171#bib.bib11)\)crowdsources expert questions but relies on answer\-checking rather than proof verification\. MathArena\(Balunovićet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib18)\)evaluates LLMs on fresh competition problems to reduce contamination risk and assesses informal proof\-writing\. The ARC Prize\(Cholletet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib12),[2026](https://arxiv.org/html/2605.13171#bib.bib13)\)targets abstract reasoning, sharing our focus on measuring genuine reasoning over pattern matching\. Formal benchmarks, by contrast, allow automatic verification of full reasoning or proof\.

##### Formal mathematics benchmarks and automated theorem proving\.

Interest in automated theorem proving has accelerated AI reasoning, leading to the rapid saturation of existing formal benchmarks\. MiniF2F\(Zhenget al\.,[2021](https://arxiv.org/html/2605.13171#bib.bib1)\)pioneered high\-school\-level evaluation but became increasingly saturated as models improved\. This shift was defined by AlphaProof\(Hubertet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib15)\)and AlphaGeometry 2\(Chervonyiet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib16)\)reaching silver\-medal performance at the 2024 IMO, where AlphaProof solved Lean problems and introduced the Formal\-IMO\(Hubertet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib15)\)dataset\. Systems like Aristotle\(Achimet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib17)\)and Seed\-Prover\(Chenet al\.,[2025b](https://arxiv.org/html/2605.13171#bib.bib6)\)achieved gold\-medal equivalents at the 2025 IMO\. This capability leap cascaded to undergraduate math, where benchmarks like PutnamBench\(Tsoukalaset al\.,[2024](https://arxiv.org/html/2605.13171#bib.bib14)\)and Putnam 2025 are being saturated by models such as Seed\-Prover 1\.5\(Chenet al\.,[2025a](https://arxiv.org/html/2605.13171#bib.bib7)\)and Numina\(Liuet al\.,[2026](https://arxiv.org/html/2605.13171#bib.bib8)\), and other industry systems\. While SorryDB\(Letsonet al\.,[2026](https://arxiv.org/html/2605.13171#bib.bib20)\)evaluates AI on practical theorem completion and FATE\(Jianget al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib10)\)targets varying difficulties of frontier algebra, the saturation of earlier benchmarks highlights an acute need for research\-level evaluation: benchmarking reasoning on unsolved research\-level mathematics\.

##### Adoption of Formal Conjectures\.

Since its open\-source release in May 2025, the repository has been adopted by both mathematicians and tool builders\. Boris Alexeev used a repository formalization to prove Erdős Problem 124\(Bloom,[2025](https://arxiv.org/html/2605.13171#bib.bib21)\)with Aristotle\(Achimet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib17)\)\. Although the statement was initially misformalized and included an unintended extra hypothesis, Aristotle’s proof succeeded without relying on it\. Numerous subsequent successes such as\(Alexeev and Mixon,[2025](https://arxiv.org/html/2605.13171#bib.bib19)\)highlight the repository’s value for discovery and demonstrate how AI\-generated proofs serve as a vital auditing mechanism to improve formalization fidelity\. A DeepMind prover agent\(DeepMind,[2026](https://arxiv.org/html/2605.13171#bib.bib30)\)conducted a systematic evaluation of the full repository, solving several open problems\. To facilitate standardized, reproducible evaluation, this paper introduces named, versioned snapshots \(Section[4\.1](https://arxiv.org/html/2605.13171#S4.SS1.SSS0.Px5)\), moving beyond individual tests to systematic benchmarking across diverse methods\.

## 3Formal Conjectures

### 3\.1Problem Selection and Composition

Informal Source ProblemsErdős Problems1318Wikipedia476Green’s Open Problems175Papers131Open Quantum Problems125Written on the Wall II110OEIS105arXiv56MathOverflow53Other sources66A mix of established archives, research collections, and open problem lists\.Formal Conjectures BenchmarkResearch Open \(1029\) ↪\\hookrightarrowPrimary Goal:New Proof Discovery Zero\-contamination testbed for genuine reasoning\.Research Solved \(836\) ↪\\hookrightarrowSecondary Goal:Auto\-formalization\.All Other Statements \(750\) ∙\\bulletTest \(467\)∙\\bulletAPI \(155\)∙\\bulletTextbook \(128\) ↪\\hookrightarrowPurpose:Sanity & Definition Testing\.Formalization Pipeline

Figure 1:Left: informal source distribution\. Right: formal category distribution, including open \(1029\) and solved \(836\) problems\.May 2025Apr 20260200200400400600600Lean filesFigure 2:Repository growth over time\.The benchmark’s problems are sourced from diverse mathematical literature for broad coverage\. Key sources include the collection of problems posed by Paul Erdős, as catalogued on*erdosproblems\.com*\(Bloom,[2023](https://arxiv.org/html/2605.13171#bib.bib22)\); the Kourovka Notebook of unsolved problems in group theory\(Khukhro and Mazurov,[2026](https://arxiv.org/html/2605.13171#bib.bib23)\); the IQOQI Vienna list of open quantum problems\(IQOQI Vienna and Werner,[2017](https://arxiv.org/html/2605.13171#bib.bib24)\); well\-known Wikipedia conjectures; conjectures from recent publications in academic journals and on arXiv; and questions asked on MathOverflow\. The number of problems per informal source is in Figure[1](https://arxiv.org/html/2605.13171#S3.F1)\. Furthermore, the repository has grown steadily since its initial open\-source release, as visualized in Figure[2](https://arxiv.org/html/2605.13171#S3.F2)\.

The collection is divided into two primary categories:Unsolved Conjectures, open research problems where a formal or informal proof would represent a new mathematical discovery; andSolved Problems, established theorems with known*informal*proofs that benchmark an AI’s ability to formalize existing arguments, though only a fraction currently have a*formal*proof available\. We include both categories not only because they independently serve as benchmarks for different tasks \(proof generation and auto\-formalization\), but also because they are often intertwined\. When formalizing an open conjecture, it is natural to also state and prove simpler, solved variations\. For instance, a general conjecture about the existence of Hadamard matrices of size4​k4kfor all validk∈ℕk\\in\\mathbb\{N\}can be presented alongside the solved cases \(e\.g\., for allk≤166k\\leq 166\) and the first open case \(k=167k=167\)\.

Other sources of*solved*problems are collections of conjectures that include both open and solved questions\. In those cases, we also formalize the statements of the solved cases\. They are often related in subject to the open conjectures, and hence the formalization of their statements and the auto\-formalization of their proofs support the open conjectures\.

The collection of solved problems also includes simpler statements, ranging from trivial to textbook level, which help validate the correctness of the underlying formalizations\. Including such simpler problems helps detect incorrect formalizations of open problems: a disproof of an easy special case or known variant might indicate that some of the definitions used have been stated incorrectly\. See below for examples of how those initial misformalizations can lead to improved formal and informal problem statements\. We strive to always state the simplest open variant of a conjecture, as well as the conjecture in greater generality\.

##### The@\[category\]attribute\.

To help navigate this collection, each statement is labeled with one of the following category tags:researchopen\(an open research\-level problem\),researchsolved\(a solved research\-level problem\),textbook\(a textbook\-level math problem ranging from high school to graduate level\),API\(statements developing foundational theory for a new definition intended for general reuse\), andtest\(a statement serving as a sanity check\)\. The distribution of all categories is in Figure[1](https://arxiv.org/html/2605.13171#S3.F1)on the right\.

##### The@\[subject\]attribute\.

Statements are also labeled with subject tags following the AMS Mathematics Subject Classification\(Editors of Mathematical Reviews and zbMATH,[2020](https://arxiv.org/html/2605.13171#bib.bib2)\)\. The distribution by source collection and AMS subject classification is shown in Tables[3](https://arxiv.org/html/2605.13171#A1.T3)and[4](https://arxiv.org/html/2605.13171#A1.T4)\(Appendix[A\.1](https://arxiv.org/html/2605.13171#A1.SS1)\)\. While number theory and combinatorics account for over half of all statements, the benchmark already spans over 10 AMS subjects \(e\.g\., also quantum theory, or algebraic geometry\), which we aim to broaden further\.

### 3\.2Formalization Methodology

\{leancode\}/– There are no indecomposable vector bundles of rank 2 onℙn\\mathbb\{P\}^\{n\}forn≥7n\\geq 7\. \-/ @\[category research open, AMS 14\] theorem hartshorne\_conjecture \(n : ℕ\) \(hn : 7 ≤ n\) \(F : VectorBundles \(ProjectiveSpace \(Fin \(n \+ 1\)\) \(Spec \(\.of ℂ\)\)\)\) \(hF : F\.rank = 2\) : Nonempty \(F\.Splitting \(Fin 2\)\) := by sorry

Figure 3:A statement in Formal Conjectures, an open conjecture from\(Hartshorne,[1974](https://arxiv.org/html/2605.13171#bib.bib29)\)\. It includes an informal statement as a comment, tags, and a Lean statement\.All problems are presented as Lean theorems with informal statements\. While our core goal is providing open statements withsorryplaceholders for AI solvers to replace, we also include solved variants of conjectures and items from problem lists\. For open problems, no proofs are provided\. For solved items, we appreciate very short proofs or counterexamples \(under 25–50 lines\) that test the definitions\. Longer proofs are excluded to keep the repository lightweight; instead, contributors are encouraged to host proofs in their own repositories and link to them using the@\[formal\_proof\]mechanism described below\.

We provide an example statement in Figure[3](https://arxiv.org/html/2605.13171#S3.F3)\.

##### The\\leananswer\(sorry\) mechanism\.

Many mathematical problems ask not just for a proof but for a specific*answer*: a number, a function, or a set\. To handle this, we introduce an\\leananswer\(sorry\) construct, implemented as a custom Lean term elaborator, that separates the task of discovering a computable answer from proving it correct\. For example,“What is the smallestnnsuch thatP​\(n\)P\(n\)?”is formalized with\\leananswer\(sorry\) as a placeholder that a solver must replace with a concrete value, reducing the problem to a verifiable proof obligation\. The construct also handles unknown truth values: when\\leananswer\(sorry\) appears in a biconditional, a solver replaces it with\\leanTrue or\\leanFalse to conjecture or refute a proposition\. See Appendix[A\.2](https://arxiv.org/html/2605.13171#A1.SS2)for detailed examples and the full elaboration semantics\.

##### The“for Mathlib”pattern\.

The project maintains aFormalConjecturesForMathlibdirectory containing 88 files of auxiliary definitions and lemmas not yet in Mathlib but needed for conjectures\. These results are asynchronously contributed upstream to Mathlib\. See Appendix[A\.3](https://arxiv.org/html/2605.13171#A1.SS3)for details on the rationale and contents of this directory\.

##### The@\[formal\_proof\]attribute\.

To keep the repository lightweight, the@\[formal\_proof\]attribute decouples problem statements from their solutions\. It supports three modes: \(1\)\\leanformal\_conjectures for proofs solving exactly the type of the statement given here; it should point to a commit on top of a commit in themainbranch of our repo, filling in the\\leansorry; \(2\)lean4for equivalent problems solved in Lean 4 elsewhere \(e\.g\., in Mathlib or external repos\); and \(3\)other\_systemfor problems solved in other systems like Lean 3, Isabelle, or Rocq\. This design links to external verification sources while keeping the repository maintainable\. The attribute applies to any solved category, though usage on aresearchopenproblem triggers a linter warning\. Table[1](https://arxiv.org/html/2605.13171#S4.T1)summarizes proof coverage recorded via this attribute\.

### 3\.3Misformalizations

It is notoriously difficult to formalize a mathematical problem*correctly*without providing a formal proof\. A*misformalization*is a formal statement that is incorrect\. To provide a rigorous foundation for benchmark quality, we introduce a taxonomy of misformalizations\. It includes the following levels, where in all cases the formal statement is incorrect\.

1. 1\.Translation: the informal statement is accurate and explicitly phrased\.
2. 2\.Underspecified: the informal statement is accurate but lacks detail\.
3. 3\.Source: the informal statement is not as intended\.

These levels are ordered with respect to the degree of contribution from the formalizer\. Level 1 misformalizations arise from the formalization only, while level 3 misformalizations arise from the informal text only\. Moreover, for a Lean expert who does not necessarily have domain expertise in the informal statements, levels are ordered with respect to difficulty to fix\. Note that level 2 and 3 misformalizations have been useful in clarifying informal statements in the literature\. An illustrative example is Erdős Problem 978: multiple attempts to formalize the statement led to a refinement of the original informal problem222[https://www\.erdosproblems\.com/forum/thread/978](https://www.erdosproblems.com/forum/thread/978)\. The original statement,“Are there infinitely manynnfor whichf​\(n\)f\(n\)is\(k−2\)\(k\-2\)\-power\-free?”, was found to require multiple additional hypotheses and became:“Ifk\>3k\>3, and for all primesppthere existsnnsuch thatpk−2∤f​\(n\)p^\{k\-2\}\\nmid f\(n\), then are there infinitely manynnfor whichf​\(n\)f\(n\)is\(k−2\)\(k\-2\)\-power\-free?”More broadly, informal texts may exclude trivial cases without explicitly describing them, while formalization requires these to be explicit\.

Misformalizations can further be categorized into six types across the three levels:*syntactic*,*semantic*, and*misrepresentation*errors at the translation level;*implicit conventions*at the underspecified level; and*reporting*and*mathematical*errors at the source level\. The full taxonomy with definitions and representative pull requests is given in Appendix[A\.4](https://arxiv.org/html/2605.13171#A1.SS4)\. Across the repository a total of291291misformalizations have been fixed, with*misrepresentation*\(48%\) and*semantic*\(35%\) errors being the most common \(Table[5](https://arxiv.org/html/2605.13171#A1.T5)\)\. Detailed code diffs illustrating these misformalization examples can be found in Appendix[A\.5](https://arxiv.org/html/2605.13171#A1.SS5)\.

### 3\.4Avoiding Misformalizations

Informal SourceErdős, arXiv, etc\.Human FormalizationDraft Lean 4 code & testsMerge to Repositoryformal\-conjecturesAutomated RunsAI systems attempt proofsVerification & TriageManual inspection of resultsStatus UpdatedIn repo & informal sourceProofs &DisproofsMisformalizationSource ErrorValidResult

Figure 4:Iterative pipeline for Formal Conjectures\. Formalized statements are tested by periodic AI runs, solving problems or triggering a revision loop\.In Formal Conjectures, every contribution undergoes mandatory code review by humans with both Lean expertise and relevant mathematical domain knowledge\. While early contributions to the repository were written entirely by hand, the authoring process has evolved: most recent submissions use AI tooling, including agentic coding assistants and auto\-formalization systems\. To facilitate high\-quality AI\-assisted contributions, the repository provides anAGENTS\.mdfile with structured guidance for these tools\. Similarly, the review process increasingly leverages AI: reviewers use language models to cross\-check formalizations against informal sources, and AlphaProof runs on submissions to catch potential misformalizations before merging\.

Beyond this review process, we employ a number of additional strategies to mitigate the risk of misformalizations across the dataset\. Firstly, we employ and encourage a test\-based design for definitions\. Any new definitions introduced should be accompanied by a suite of proventestlemmas andAPIstatements that are designed to establish expected behavior of the definition and mitigate the risk of edge\-case errors\. Lean andMathlib’s tooling \(for exampledecide\) can be employed to prove many of theseteststatements\. As of the[bench\-v1\-lean4\.27\.0release](https://github.com/google-deepmind/formal-conjectures/releases/tag/bench-v1-lean4.27.0), the repository contains 467teststatements and 155APIstatements, amounting to 23\.8% of the repository statements\.

Secondly, automated theorem provers like AlphaProof regularly attempt proofs and disproofs across the repository \(Figure[4](https://arxiv.org/html/2605.13171#S3.F4)\); manually inspecting these results primarily reveals misformalizations\. In contrast to test\-based design, which proactively reduces the risk of edge\-case errors, this retroactive approach is capable of unearthing misformalizations across all categories as described in Section[3\.3](https://arxiv.org/html/2605.13171#S3.SS3)\. In addition, we use Gemini, AlphaProof, and other tools to automatically cross\-check formalizations against their informal source texts, flagging potential discrepancies for human review\. Future work involves further automating these processes in GitHub CI in order to reduce the overhead of manual post\-hoc fixes from the maintainers\.

Thirdly, custom Lean linters built on Mathlib’s framework enforce metadata and documentation standards across contributions\. These cover AMS classification tags,answer\(sorry\)usage, problem category annotations, and module docstrings\.

Finally, as Formal Conjectures has grown in stature, community engagement and proof/disproof attempts from external automated theorem provers have been increasingly valuable\. Furthermore, insights and corrections discovered through this process are actively upstreamed to the original sources, e\.g\., prompting the maintainer of the Erdős problems website to regularly clarify and adjust the informal problem statements\.

## 4Benchmark Evaluation

### 4\.1Evaluation Framework

##### Evaluation Paradigms\.

Formal Conjectures provides a dynamic, rigorous benchmark for advanced automated reasoning premised on novel mathematical insight as the ultimate intellectual hurdle for complex problems\. We define two distinct setups:

1. 1\.Primary Goal: New Formal Proof Discovery\.Measuring AI’s ability to discover formal proofs for open problems, these unsolved conjectures form a zero\-contamination testbed for genuine mathematical discovery\. Since no solutions exist in any training corpus, success provides a definitive signal of reasoning capabilities, removing the need for secretive evaluation sets common in other benchmarks\.
2. 2\.Secondary Goal: Proof Auto\-formalization\.This track provides a climbable benchmark treating the non\-trivial translation of established mathematics into formal code as a distinct, rigorous challenge\. It measures a model’s ability to work with Lean 4 and Mathlib to formalize known arguments\.

##### Evaluating Open Problems\.

Evaluating open problems is definitive: as no formal solutions exist at release time, the first kernel\-accepted proof provides a zero\-contamination success signal\. However, since a proof may settle a misformalized statement rather than the intended problem, any result for an open conjecture triggers manual inspection for fidelity\.

##### Fixed Benchmark Subsets\.

To enable stable model comparisons, we provide two frozen subsets of 100 problems each:FC100OpenSet1\(100research openstatements\) andFC100SolvedSet1\(100research solvedstatements\)\. These subsets are defined in the repository filesFC100OpenSet1\.leanandFC100SolvedSet1\.lean, which import exactly the corresponding theorem statements\. Because these files are compiled as part of the repository across all supported Lean versions, the problem sets remain fixed and comparable even as the repository evolves\. Further details on the construction of these subsets are in Appendix[B\.1](https://arxiv.org/html/2605.13171#A2.SS1)\.

##### Correctness\.

Following established formal math benchmark methodologies\(Zhenget al\.,[2021](https://arxiv.org/html/2605.13171#bib.bib1); Tsoukalaset al\.,[2024](https://arxiv.org/html/2605.13171#bib.bib14)\), our evaluation leverages the Lean 4 kernel for an objective, rigorous, and automatable success criterion\. A proposed solution is a Lean proof term replacing thesorryplaceholder\. Solutions are correct if and only if accepted by the kernel without relying on forbidden axioms, such assorry\. This binary criterion eliminates human grading ambiguity and provides an indisputable ground truth\. While theoretically susceptible to foundational bugs in the Lean kernel or unintended axioms \(e\.g\., a Lean Zulip thread333[https://leanprover\.zulipchat\.com/\#narrow/channel/270676\-lean4/topic/Soundness\.20bug\.3A\.20hasLooseBVars\.20is\.20not\.20conservative/with/520153084](https://leanprover.zulipchat.com/#narrow/channel/270676-lean4/topic/Soundness.20bug.3A.20hasLooseBVars.20is.20not.20conservative/with/520153084), which mentions a now\-fixed exploit\), in practice, kernel\-level verification provides the highest mathematical rigor\. Consequently, all evaluations are automatically verifiable, allowing results and proof terms to be shared publicly for transparent, reproducible research\.

##### Versioning and Reproducibility\.

Evaluating on a living repository is challenging: dependency updates can alter statement semantics, and new Mathlib theorems may simplify problems\. To enable reproducible comparisons, we tag frozen benchmark snapshots using a two\-part naming scheme:bench\-vNN\-lean4\.XX\.YY\. The*bench version*vNNidentifies a fixed problem set; the*Lean version*suffix pins the Mathlib tag \(and hence Lean toolchain\) against which statements compile and are evaluated\. Releases are issued every few months; when Lean versions bump, companion tags allow evaluating the same problem set against intermediate toolchains, e\.g\.bench\-v3\-lean4\.27\.0\. To preserve baselines, snapshots are immutable; misformalization fixes are never patched into existing versions but incorporated into the next \(e\.g\.,bench\-v\(\(N\+1\)\+1\)\)\.

##### A Living Benchmark and Unified API\.

Outside the frozen subsets, the repository operates as a dynamic benchmark that exhibits useful evaluation dynamics over time\. As new conjectures are added, solved, or corrected by the community, it naturally self\-adjusts its difficulty, expands, and improves\. The hardest problems remain open, pushing the frontier of automated reasoning\. Crucially, the repository acts as a unified API: researchers can submit formalized conjectures to a single centralized hub for simultaneous exposure to all evaluating AI systems\. This obviates the need to manually engage with individual provers, ensuring broad and concurrent attempts at mathematical discovery\.

Table 1:Proof coverage across all statement categories\.*Proved in repo*: the statement has asorry\-free proof in the repository;*Linked proof*: a proof is recorded via the@\[formal\_proof\]attribute but is not present in the repository itself;*With proof*: either of the above\.
##### Open Problems Baseline\.

For the primary proof\-discovery task on open problems, the baseline for the first tagged releasebench\-v1\-lean4\.27is0%0\\%for all systems: by definition, no formal proof exists for anyresearch openstatement at the time of tagging \(problems with known solutions are reclassified asresearch solved\)\. Over time, these open statements may receive formal or informal proofs\. Proofs that expose a misformalization \(Section[3\.3](https://arxiv.org/html/2605.13171#S3.SS3)\) rather than settling the intended problem trigger revisions incorporated into the next version\. In subsequent releases, interim\-solved problems are reclassified toresearch solved, with formal solutions receiving the@\[formal\_proof\]attribute\. The newresearch openset combines remaining unsolved problems with new additions; thus, successive versions start with a baseline of≥0%\\geq 0\\%solved statements and become increasingly more difficult\.

##### Proof Auto\-Formalization Baseline\.

For auto\-formalization, the baseline is system\-agnostic: we track all formal proofs contributed to the repository, regardless of origin\. These are tracked using the@\[formal\_proof\]attribute \(Section[3\.2](https://arxiv.org/html/2605.13171#S3.SS2)\), which records the proof system \(formal\_conjecturesfor proofs within the repository,lean4for proofs in external Lean 4 projects such as Mathlib, orother\_systemfor proofs in Isabelle, Rocq, etc\.\) and links to the source\. Table[1](https://arxiv.org/html/2605.13171#S4.T1)summarizes the current state\. Notably, 17\.3% ofresearch\_solvedstatements have a known formal proof \(5\.3% internal, 12\.1% linked externally\)\.

### 4\.2Illustrative Evaluation

We provide illustrative evaluations, with setup and cost details in Appendix[B\.2](https://arxiv.org/html/2605.13171#A2.SS2)\.

Table 2:Results on the frozenFC100SolvedSet1\.##### Frozen Evaluation Subsets\.

To ensure stable comparisons despite repository fluidity, we provide two frozen evaluation subsets, each containing 100 problems randomly sampled from thebench\-v1\-lean4\.27\.0tag:FC100SolvedSet1\(proof auto\-formalization\) andFC100OpenSet1\(proof discovery; 0% baseline\)\. By definition,FC100OpenSet1currently remains at a definitive 0% baseline across these evaluated methods, which highlights its status as a rigorous frontier for new discovery\. Results onFC100SolvedSet1\(Table[2](https://arxiv.org/html/2605.13171#S4.T2)\) demonstrate a clear, climbable signal across both compute and model capabilities\. To provide a diverse evaluation, we evaluate both a slightly improved version of the AlphaProof system used in the 2024 IMO\(Hubertet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib15)\)and a development version of a DeepMind prover agent\(DeepMind,[2026](https://arxiv.org/html/2605.13171#bib.bib30)\)\. These are different systems rather than incremental versions of the same architecture\. Utilizing a constrained tree\-search\-only inference setup on v6e TPUs, AlphaProof achieves a solve rate of 45% onFC100SolvedSet1with a low compute configuration of 1,000 simulations per problem\. By increasing the compute budget to 16,000 simulations, the solve rate improves to 50%, demonstrating a clear scaling signal\. We emphasize that these evaluations are conducted using tree\-search inference only and do not utilize AlphaProof’s test\-time reinforcement learning \(TTRL\) mechanism, which generates a curriculum to learn from during the proof finding process through weight updates\. Furthermore, utilizing a newer, development version of the DeepMind prover agent, the solve rate increases to 66%, illustrating that the benchmark effectively measures advancements across improved model architectures and methodologies\. Because running automated reasoning methods within Lean involves significant infrastructure overhead, we provide these specific systems as illustrative baselines to validate the benchmark’s signal\. We encourage the community to report results against these subsets as well as the evolving main repository\.

## 5Discussion

##### Broader Benefits\.

Beyond its primary role, Formal Conjectures providesprecise mathematicsby enforcing a level of formal precision that clarifies the exact meaning of statements\. This creates a library of conjectures referenced unambiguously via unique file names and commit hashes, ensuring fixed and verifiable mathematical intent; unlike standard citations where generality or precision is often left to interpretation\. Furthermore, the benchmark serves as a compass forMathlib development, as formalizing advanced conjectures reveals gaps in the existing library\. We address these viaFormalConjecturesForMathlib, a directory of results continuously upstreamed to Mathlib\. By identifying hard\-to\-state concepts, the repository both benchmarks reasoning and expands formalized mathematics\.

##### Limitations\.

Problem selection faces ascope biasconstrained by Mathlib’s current coverage, which we mitigate somewhat via ourForMathlibdirectory\. While some concepts remain hard to formalize, this limitation will diminish as Mathlib expands\. There is also aselection biastoward famous conjectures and Erdős problems, resulting in a concentration of number theory and combinatorics; we plan to mitigate this by sourcing diverse arXiv preprints in areas like computer science and physics\. Focusing on open problems introduces amisformalization riskfrom subtle statement or source errors\. While mitigated via new methods in Section[3\.4](https://arxiv.org/html/2605.13171#S3.SS4), these risks remain until a full proof is provided; fixing newly discovered errors may require revisions impacting historical comparisons\. The benchmark also facescontamination challenges: for solved problems, models may retrieve known arguments rather than reasoning from scratch; for open problems, zero\-contamination is not permanent as future solutions enter training data\. Our living repository tracks these developments and reclassifies problems accordingly\. Finally,verification and reproducibilityrely on the Lean 4 kernel’s integrity; while the theoretical exploitability of foundational bugs or unintended axioms is a general challenge for formal systems, it remains the highest standard for rigor\. To ensure stable results despite the repository’s evolution, we provide frozen, versioned subsets for standardized evaluation\.

##### Future Work\.

As an active open\-source project, we are committed to maintaining and extending Formal Conjectures with new problems, improved problems, and platform features; this ongoing commitment is reflected in the repository’s steady growth since its initial release \(see Figure[2](https://arxiv.org/html/2605.13171#S3.F2)\)\. See Appendix[C](https://arxiv.org/html/2605.13171#A3)for details on planned extensions, metadata improvements, and community features\.

## 6Conclusion

We presentFormal Conjectures, an evolving collection of mathematical problem statements formalized in Lean 4, including open conjectures\. Sourced from a diverse range of mathematical literature and areas, the benchmark provides broad coverage as an evaluation framework for AI systems’ capabilities in formal math\. By providing a zero\-contamination testbed of research\-level problems alongside a substantial set for proof auto\-formalization, we establish a rigorous environment for evaluating AI for formal mathematical research\. Crucially, to ensure stable and reproducible model comparisons as the repository evolves, we complement this dynamic dataset with versioned, frozen evaluation subsets\. We hope the work can drive the expansion of formalized math: both through its living repository and the continuous upstreaming of new definitions to Mathlib, as well as serve as a high\-fidelity benchmark that bridges the gap between human mathematical expertise and automated formal reasoning\. Ultimately, we believe this resource will not only measure current progress but also help accelerate the frontier of formal mathematical discovery\.

## Acknowledgments and Disclosure of Funding

We thank all contributors who have contributed to the repository since it was released as open source, among them Abel Doñate, Aditya Ramabadran, Amogh Parab, Anirudh Rao, Anthony Wang, Ayush Debnath, Bhavik Mehta, Bolton Bailey, Cong Lu, Daniel Chin, Felix Pernegger, Franz Huschenbeth, James Jordan, Jean\-Guillaume Durand, Jofre Costa, Junseok Lee, Junyan Xu, Madhu Shree Aravindan, Mario Krenn, Martin Bruse, Michael Rothgang, Mirek Olšák, Ralf Stephan, Reklle, Seewoo Lee, The bbchallenge Collaboration \(bbchallenge\.org\), Wojciech Nawrocki, Yan Yablonovskiy, Yoh Tanimoto, Yongxi Lin, Zeyu Zheng, and Zhen Ning David Liu\. A full list of contributors is available at[https://github\.com/google\-deepmind/formal\-conjectures/graphs/contributors](https://github.com/google-deepmind/formal-conjectures/graphs/contributors)\. Additionally, we thank Swarat Chaudhuri, George Tsoukalas, Anton Kovsharov, Henryk Michalewski, Edward Lockhart, and Goran Žužić for helpful discussions and support running experiments\.

## References

- T\. Achim, A\. Best, A\. Bietti, K\. Der, M\. Fédérico, S\. Gukov, D\. Halpern\-Leistner, K\. Henningsgard, Y\. Kudryashov, A\. Meiburg,et al\.\(2025\)Aristotle: imo\-level automated theorem proving\.arXiv preprint arXiv:2510\.01346\.Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Alexeev and D\. G\. Mixon \(2025\)Forbidden Sidon subsets of perfect difference sets, featuring a human\-assisted proof\.External Links:2510\.19804,[Link](https://arxiv.org/abs/2510.19804)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p3.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px3.p1.1)\.
- M\. Balunović, J\. Dekoninck, I\. Petrov, N\. Jovanović, and M\. T\. Vechev \(2025\)MathArena: Evaluating LLMs on Uncontaminated Math Competitions\.External Links:2505\.23281,[Link](https://arxiv.org/abs/2505.23281)Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Bloom \(2023\)Erdős Problems\.Note:Accessed: 2026\-03\-24External Links:[Link](https://www.erdosproblems.com/)Cited by:[§3\.1](https://arxiv.org/html/2605.13171#S3.SS1.p1.1)\.
- T\. Bloom \(2025\)Erdős Problem \#124\.Note:See comment section\. Accessed: 2026\-04\-20External Links:[Link](https://www.erdosproblems.com/124)Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px3.p1.1)\.
- Center for AI Safety, Scale AI, and HLE Contributors Consortium \(2026\)A benchmark of expert\-level academic questions to assess AI capabilities\.Nature649,pp\. 1139–1146\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09962-4),2501\.14249,[Link](https://arxiv.org/abs/2501.14249)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px1.p1.1)\.
- J\. Chen, W\. Chen, J\. Du, J\. Hu, Z\. Jiang, A\. Jie, X\. Jin, X\. Jin, C\. Li, W\. Shi,et al\.\(2025a\)Seed\-prover 1\.5: mastering undergraduate\-level theorem proving via learning from experience\.arXiv preprint arXiv:2512\.17260\.Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- L\. Chen, J\. Gu, L\. Huang, W\. Huang, Z\. Jiang, A\. Jie, X\. Jin, X\. Jin, C\. Li, K\. Ma,et al\.\(2025b\)Seed\-prover: deep and broad reasoning for automated theorem proving\.arXiv preprint arXiv:2507\.23726\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Chervonyi, T\. H\. Trinh, M\. Olšák, X\. Yang, H\. H\. Nguyen, M\. Menegali, J\. Jung, J\. Kim, V\. Verma, Q\. V\. Le,et al\.\(2025\)Gold\-medalist performance in solving olympiad geometry with alphageometry2\.Journal of Machine Learning Research26\(241\),pp\. 1–39\.Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- F\. Chollet, M\. Knoop, G\. Kamradt, and B\. Landers \(2025\)ARC Prize 2024: Technical Report\.External Links:2412\.04604,[Link](https://arxiv.org/abs/2412.04604)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px1.p1.1)\.
- F\. Chollet, M\. Knoop, G\. Kamradt, and B\. Landers \(2026\)ARC Prize 2025: Technical Report\.External Links:2601\.10904,[Link](https://arxiv.org/abs/2601.10904)Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px1.p1.1)\.
- G\. DeepMind \(2026\)DeepMind prover agent\.Note:In preparationCited by:[§B\.2](https://arxiv.org/html/2605.13171#A2.SS2.SSS0.Px2.p1.1),[§1](https://arxiv.org/html/2605.13171#S1.p3.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2605.13171#S4.SS2.SSS0.Px1.p1.1)\.
- Editors of Mathematical Reviews and zbMATH \(2020\)MSC2020\-Mathematics Subject Classification System\.External Links:[Link](https://mathscinet.ams.org/mathscinet/msc/pdfs/classifications2020.pdf)Cited by:[§3\.1](https://arxiv.org/html/2605.13171#S3.SS1.SSS0.Px2.p1.1)\.
- P\. Erdős and A\. Hajnal \(1968\)On chromatic number of infinite graphs\.Theory of Graphs \(Proc\. Colloq\., Tihany, 1966\),pp\. 83––98\.Cited by:[Figure 13](https://arxiv.org/html/2605.13171#A1.F13)\.
- P\. Erdős \(1969\)Problems and results in chromatic graph theory\.Proof techniques in graph theory,pp\. 27–35\.Cited by:[Figure 13](https://arxiv.org/html/2605.13171#A1.F13)\.
- P\. Erdős \(1993\)SOME of my favorite solved and unsolved problems in graph theory\.Quaestiones Mathematicae16\(3\),pp\. 333–350\.External Links:[Document](https://dx.doi.org/10.1080/16073606.1993.9631741),[Link](https://doi.org/10.1080/16073606.1993.9631741),https://doi\.org/10\.1080/16073606\.1993\.9631741Cited by:[Figure 12](https://arxiv.org/html/2605.13171#A1.F12)\.
- E\. Glazer, E\. Erdil, T\. Besiroglu, D\. Chicharro, E\. Chen, A\. Gunning, C\. F\. Olsson, J\. Denain, A\. Ho, E\. de Oliveira Santos, O\. Järviniemi, M\. Barnett, R\. Sandler, M\. Vrzala, J\. Sevilla, Q\. Ren, E\. Pratt, L\. Levine, G\. Barkley, N\. Stewart, B\. Grechuk, T\. Grechuk, S\. V\. Enugandla, and M\. Wildon \(2024\)FrontierMath: A Benchmark for Evaluating Advanced Mathematical Reasoning in AI\.External Links:2411\.04872,[Link](https://arxiv.org/abs/2411.04872)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px1.p1.1)\.
- R\. Hartshorne \(1974\)Varieties of small codimension in projective space\.Bulletin of the American Mathematical Society80\(6\),pp\. 1017–1032\.Cited by:[Figure 3](https://arxiv.org/html/2605.13171#S3.F3)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the math dataset\.InThirty\-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track,Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1)\.
- T\. Hubert, R\. Mehta, L\. Sartran, M\. Z\. Horváth, G\. Žužić, E\. Wieser, A\. Huang, J\. Schrittwieser, Y\. Schroecker, H\. Masoom, O\. Bertolli, T\. Zahavy, A\. Mandhane, J\. Yung, I\. Beloshapka, B\. Ibarz, V\. Veeriah, L\. Yu, O\. Nash, P\. Lezeau, S\. Mercuri, C\. Sönne, B\. Mehta, A\. Davies, D\. Zheng, F\. Pedregosa, Y\. Li, I\. von Glehn, M\. Rowland, S\. Albanie, A\. Velingker, S\. Schmitt, E\. Lockhart, E\. Hughes, H\. Michalewski, N\. Sonnerat, D\. Hassabis, P\. Kohli, and D\. Silver \(2025\)Olympiad\-level formal mathematical reasoning with reinforcement learning\.Nature651,pp\. 607–613\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09833-y),[Link](https://doi.org/10.1038/s41586-025-09833-y)Cited by:[§B\.2](https://arxiv.org/html/2605.13171#A2.SS2.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1),[§4\.2](https://arxiv.org/html/2605.13171#S4.SS2.SSS0.Px1.p1.1)\.
- IQOQI Vienna and R\. Werner \(2017\)Open Quantum Problems\.Note:Maintained by IQOQI Vienna since 2017\. Accessed: 2026\-03\-24External Links:[Link](https://oqp.iqoqi.oeaw.ac.at/open-quantum-problems)Cited by:[§3\.1](https://arxiv.org/html/2605.13171#S3.SS1.p1.1)\.
- J\. Jiang, W\. He, Y\. Wang, G\. Gao, Y\. Hu, J\. Wang, N\. Guan, P\. Wu, C\. Dai, L\. Xiao,et al\.\(2025\)Fate: a formal benchmark series for frontier algebra of multiple difficulty levels\.arXiv preprint arXiv:2511\.02872\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- E\. I\. Khukhro and V\. D\. Mazurov \(2026\)Unsolved Problems in Group Theory\. The Kourovka Notebook\.Note:21st edition\. First published in 1965; continuously updated on arXiv since 2014External Links:1401\.0300,[Link](https://arxiv.org/abs/1401.0300v41)Cited by:[§3\.1](https://arxiv.org/html/2605.13171#S3.SS1.p1.1)\.
- A\. Letson, L\. Sarra, A\. Poiroux, O\. Dressler, P\. Lezeau, D\. Aranha, F\. Pu, A\. Hill, M\. C\. Hidalgo, J\. Berman, G\. Tsoukalas, and L\. Taelman \(2026\)SorryDB: Can AI Provers Complete Real\-World Lean Theorems?\.External Links:2603\.02668,[Link](https://arxiv.org/abs/2603.02668)Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Lin, S\. Tang, B\. Lyu, Z\. Yang, J\. Chung, H\. Zhao, L\. Jiang, Y\. Geng, J\. Ge, J\. Sun,et al\.\(2025\)Goedel\-prover\-v2: scaling formal theorem proving with scaffolded data synthesis and self\-correction\.arXiv preprint arXiv:2508\.03613\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1)\.
- J\. Liu, Z\. Zhou, Z\. Zhu, M\. D\. Santos, W\. He, J\. Liu, R\. Wang, Y\. Xie, J\. Zhao, Q\. Wang,et al\.\(2026\)Numina\-lean\-agent: an open and general agentic reasoning system for formal mathematics\.arXiv preprint arXiv:2601\.14027\.Cited by:[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1)\.
- L\. d\. Moura and S\. Ullrich \(2021\)The lean 4 theorem prover and programming language\.InInternational Conference on Automated Deduction,pp\. 625–635\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p2.1)\.
- The mathlib Community \(2020\)The Lean Mathematical Library\.InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs,CPP 2020,New Orleans, LA, USA\.External Links:[Document](https://dx.doi.org/10.1145/3372885.3373824),[Link](https://doi.org/10.1145/3372885.3373824)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p2.1)\.
- G\. Tsoukalas, J\. Lee, J\. Jennings, J\. Xin, M\. Ding, M\. Jennings, A\. Thakur, and S\. Chaudhuri \(2024\)PutnamBench: Evaluating Neural Theorem\-Provers on the Putnam Mathematical Competition\.External Links:2407\.11214,[Link](https://arxiv.org/abs/2407.11214)Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.13171#S4.SS1.SSS0.Px4.p1.1)\.
- Z\. Yu, R\. Peng, K\. Ding, Y\. Li, Z\. Peng, M\. Liu, Y\. Zhang, Z\. Yuan, H\. Xin, W\. Huang,et al\.\(2025\)Formalmath: benchmarking formal mathematical reasoning of large language models\.arXiv preprint arXiv:2505\.02735\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1)\.
- K\. Zheng, J\. M\. Han, and S\. Polu \(2021\)MiniF2F: a cross\-system benchmark for formal Olympiad\-level mathematics\.arXiv preprint arXiv:2109\.00110\.Cited by:[§1](https://arxiv.org/html/2605.13171#S1.p1.1),[§2](https://arxiv.org/html/2605.13171#S2.SS0.SSS0.Px2.p1.1),[§4\.1](https://arxiv.org/html/2605.13171#S4.SS1.SSS0.Px4.p1.1)\.

## Supplementary Material

Here, we include various supplementary details\.

## Appendix AFormal Conjectures Details

### A\.1Source and AMS Classification Tables

Note that a single source problem typically gives rise to multiple formal statements due to variants \(e\.g\., special cases, generalizations, solved sub\-problems, and test lemmas\)\. We explicitly allow and encourage the inclusion of multiple alternative formalizations for a single informal claim to capture different mathematical perspectives or resolutions of ambiguity; thus, the counts in this table reflect formal statements, not distinct source problems\.

Table 3:Number of formal statements by source collection, split by category\. A single source problem may yield multiple statements \(variants, special cases, tests\)\.Table[4](https://arxiv.org/html/2605.13171#A1.T4)shows the distribution of statements across AMS subject classifications\. Other notable areas beyond number theory and combinatorics include quantum theory, linear algebra, and algebraic geometry\.

Table 4:Top 10 AMS subject classifications by number of statements in Formal Conjectures\.
### A\.2The\\leananswer\(sorry\) Mechanism in Detail

Problems requiring a solution are formalized using\\leananswer\( \), a custom Lean term elaborator\. Concretely, a problem like“What is the smallestnnsuch thatP​\(n\)P\(n\)?”is formalized as:\{leancode\}theorem smallest\_n : IsLeast n : ℕ — P n answer\(sorry\) := by sorry As usual,\\leansorry is a placeholder which must be filled in order to claim that a problem is solved, but now there are two of them\. If the solution is known, then\\leananswer\(sorry\) is replaced in the statement with\\leananswer\(42\)\. This mechanism allows our statements to include computational conjectures where discovering the answer is the primary challenge and verification is routine\.

Crucially, while providing a proof simply requires finding any way to replace\\lean:= sorry with a valid term, providing an answer is a fundamentally different task: it requires evaluating mathematical meaning to determine the correct value\. Lean’s type checker can verify that a proposed answer leads to a valid proof, but it cannot judge whether the answer itself is mathematically meaningful—that remains a job for human mathematicians or AI systems with genuine mathematical reasoning capabilities\.

There are a number of advantages to using the\\leananswer\( \) wrapper over directly writing a solution or a bare\\leansorry placeholder:

- •It makes clear which part of the formalization corresponds to the solution; this allows the solution to be masked from\\leanresearch solved statements in order to evaluate agents\. This can either be done by inspecting the metadata inserted in the Lean term, or more pragmatically, by a simple regex replacement\.
- •It provides some protection against Lean’s flexible elaboration; a problem stated as\\lean1 / 2 \* 2 = sorry could be solved by replacing the\\leansorry with either\\lean\(0 : ℕ\) or\\lean\(1 : ℚ\), as by default, Lean only infers the type of an equality after analyzing both sides\. This is clearly undesirable; we do not want the meaning of our questions to change depending on the answer provided\! Stating the problem instead as\\lean1 / 2 \* 2 = answer\(sorry\) protects against this, as the\\leananswer elaborator postpones analysis which forces Lean to decide the type of the equality before seeing the solution\.
- •It forces the statement to be written in such a way that does not pre\-assume what the solution is\.

#### A\.2\.1Propositional Solutions

A particularly common use case arises when the truth value of a propositionPPis itself unknown, i\.e\. when the problem can be phrased as a question: ”Is it true that…”\. In this case,\\leananswer\(sorry\) serves as a propositional placeholder that should be replaced with either\\leanTrue or\\leanFalse:\{leancode\}theorem is\_P\_true : answer\(sorry\) ↔ P := by sorry Replacing\\leananswer\(sorry\) with\\leanTrue amounts to conjecturing thatPPholds \(and the solver must provePP\), while replacing it with\\leanFalse amounts to conjecturing thatPPfails \(and the solver must disprovePP\)\. This pattern is widely used throughout the repository for open problems where even the expected truth value is uncertain\.

For the convenience of evaluating solver systems that are capable of filling proofs only, when\\leananswer\(sorry\) is detected to be used propositionally in this way, it elaborates to\\leanTrue under its default settings\. This permits these systems to decide the proposition by either proving or disproving the statement, rather than by filling the placeholder\.

### A\.3The“for Mathlib”Pattern

It is typical when undertaking formalization projects to discover missing \(and often trivial\) results that seem suitable for Mathlib\. Contributing these results to Mathlib is of course the natural resolution, but this has a number of downsides:

- •New conjectures that depend on these results find themselves at the end of a long dependency chain; the missing results must undergo Mathlib review, the merged change must land in a monthly Mathlib release, and the Formal Conjectures repository must be upgraded to that release
- •Some results may be too specialized for Mathlib, and a Mathlib contribution would require generalizing these immediately\.
- •When the results entail a whole new mathematical development, it may take time for the Mathlib community to find the right abstractions and evolve said development into the right shape\. Bypassing the dependency chain mentioned above allows much of this evolution to happen more rapidly outside of Mathlib\.

The Lean community has widely adopted the pattern of creating aForMathlibdirectory as a staging ground for such results, the oldest example of which being[thefor\_mathlibdirectory](https://github.com/leanprover-community/lean-liquid/tree/master/src/for_Mathlib)in the“liquid tensor experiment”444[https://github\.com/leanprover\-community/lean\-liquid](https://github.com/leanprover-community/lean-liquid)\. The formal\-conjectures project’sFormalConjecturesForMathlibdirectory follows this pattern, and contains 88 files with results such as natural and logarithmic density of sets of natural numbers \(with proofs of basic properties like monotonicity and the density of even numbers\), arithmetic progressions and AP\-free sets, hypergraph Ramsey numbers, Turing machines and busy beaver halting numbers, perfect powers with a decidability instance, VC dimension in abelian groups, Latin squares and transversals, and a library of graph invariants including the Wiener and Szeged indices, the Lovász theta function, and the Havel–Hakimi residue\. Over time, these results are asynchronously contributed upstream to Mathlib\.

### A\.4Misformalization Taxonomy

Misformalizations can be categorized roughly as follows, with the level shown in parentheses\.555Pull request \(PR\) numbers refer to the repository at[https://github\.com/google\-deepmind/formal\-conjectures](https://github.com/google-deepmind/formal-conjectures); Erdős Problem discussions are at[https://www\.erdosproblems\.com](https://www.erdosproblems.com/)\.

- •Syntactic \(1\): where the formalizer misses a subtlety in the way Lean parses a piece of syntax\. For example, this could be missing brackets changing the meaning of an expression in unexpected ways; see PR \#1338\.
- •Semantic \(1\): where the formalizer misses a subtlety in the way that Lean represents types or operators\. For example, the natural numbers in Lean contain0and have truncated subtraction; see PRs \#349, \#1259, \#1262\.
- •Misrepresentation \(1\): where the formalizer formally misrepresents aspects of the informal statement during translation\. For example, the use of incorrect quantifiers, the use of incorrect logical connectives, or the failure to include hypotheses which are present in the source text; see PRs \#1164, \#1156\.
- •Implicit conventions \(2\): where the source text assumes domain expertise and does not explicitly include certain hypotheses or conventions\. For example, in analytic number theory, questions about the magnitude of a set of numbers are often implicitly asymptotic; see PRs \#2136, \#1151\.
- •Reporting \(3\): where the statement changes in the literature through repetition, usually as the result of typos; see Erdős Problem 918 discussion\.
- •Mathematical \(3\): where the source text is clearly not as intended or contains genuine mathematical errors; see Erdős Problem 728 discussion\.

Table 5:Number and proportion of misformalizations across categories\.
### A\.5Misformalization Examples

This section contains the misformalization examples, mentioned in Section[3\.3](https://arxiv.org/html/2605.13171#S3.SS3)\.

#### A\.5\.1Syntactical Errors

An example of a syntactical error involving the correction of misplaced parentheses is shown in Figure[5](https://arxiv.org/html/2605.13171#A1.F5)\.

/–Let$f\(n\)$betheminimalinteger$m$suchthat$n$isthesumofthe

$k$smallestdivisorsof$m$forsome$k\\geq1$\.Isittruethat

$\\limsupf\(n\)/n=\\infty$?\-/

@\[categoryresearchopen,AMS11\]

theoremerdos\_1054\.parts\.iii:\(∃\(A:Setℕ\),A\.HasDensity1∧

\-atTop\.limsup\(funn↦\(fn:EReal\)/n\)=⊤\)↔answer\(sorry\):=by

\+atTop\.limsup\(funn↦\(fn:EReal\)/n=⊤\)\)↔answer\(sorry\):=by

sorry

Figure 5:Code diff showing an example of a syntactical error, and the correction of parentheses\.
#### A\.5\.2Semantic Errors

Figures[6](https://arxiv.org/html/2605.13171#A1.F6),[7](https://arxiv.org/html/2605.13171#A1.F7)and[8](https://arxiv.org/html/2605.13171#A1.F8)illustrate various semantic errors encountered during formalization\.

/–‘Polynomial\.HasOddCoeffsf‘meansthatallcoefficientsof

‘f:Polynomialℤ‘areodd\.\-/

defPolynomial\.HasOddCoeffs\(f:Polynomialℤ\):Prop:=

\-∀i:ℕ,Odd\(f\.coeffi\)

\+∀i∈f\.support,Odd\(f\.coeffi\)

Figure 6:Code diff showing an example of a semantic error, with the fix requiring quantification over the support of the polynomial\./–Agroup‘HasPolynomialGrowth‘ifthereexistsafinitegeneratingset

suchthatthegrowthfunctionisboundedabovebyapolynomial\.\-/

defHasPolynomialGrowth\(G:Type\*\)\[GroupG\]:Prop:=

∃\(S:SetG\),Set\.FiniteS∧Subgroup\.closureS=⊤∧

∃\(C:ℝ\)\(d:ℕ\),C\>0∧

\-∀n:ℕ,\(GrowthFunctionSn:ℝ\)≤C\*\(n:ℝ\)^d

\+∀n\>0,\(GrowthFunctionSn:ℝ\)≤C\*\(n:ℝ\)^d

Figure 7:Code diff showing an example of a semantic error, where the formalization had failed to account for behavior at the natural number0\.\+/–Anexactcoveringofagroup‘G‘isafinitecollectionofsubgroups

\+‘\{H\_1,…,H\_k\}‘andrepresentative‘\{g\_1,…,g\_k\}‘suchthatthe

\+cosets‘g\_iH\_i‘arepairwisedisjointandtheirunioncovers‘G‘\.

\+

\+Notethatthisdiffersfrom‘Partition\(α:=SubgroupG\)‘becausethe

\+coveringconditionthereinvokes‘Subgroup\.sup‘whichissubgroupgeneration

\+andthusstrongerthanunion\.Thisdefinitioniseasiertouseinthis

\+contextthanthealternative‘Partition\(α:=SetG\)‘,whichlacks

\+subgroupdefinitionssuchas‘Subgroup\.index‘\.\-/

\+structureGroup\.ExactCovering\(G:Type\*\)\[GroupG\]\(ι:Type\*\)\[Fintypeι\]where

\+parts:ι→SubgroupG

\+reps:ι→G

\+nonempty\(i:ι\):\(partsi:SetG\)\.Nonempty

\+disjoint:\(Set\.univ\(α:=ι\)\)\.PairwiseDisjoint

\+fun\(i:ι\)↦repsi•\(partsi:SetG\)

\+covers:⋃i,repsi•\(partsi:SetG\)=Set\.univ

\\par/–

If‘G‘isagroupthencanthereexistanexactcoveringof‘G‘bymorethan

onecosetsofdifferentsizes?\(i\.e\.eachelementiscontainedinexactly

oneofthecosets\.\)

\-/

@\[categoryresearchopen,AMS20\]

theoremerdos\_274\(G:Type\*\)\[GroupG\]\(hG:1<ENat\.cardG\):

\-\(∃\(P:Partition\(⊤:SubgroupG\)\),

\-1<P\.parts\.ncard∧

\-\(∀A∈P\.parts,∃ᵉ\(s:G\)\(H:SubgroupG\),s•\(H:SetG\)=A\)∧

\-P\.parts\.PairwisefunAB↦\#A≠\#B\)↔answer\(sorry\):=by

\+\(∃\(ι:Type\*\)\(\_:Fintypeι\)\(P:Group\.ExactCoveringGι\),

\+1<Fintype\.cardι∧\(Set\.rangeP\.parts\)\.PairwisefunAB↦\#A≠\#B\)↔

\+answer\(sorry\):=by

sorry

Figure 8:Code diff showing an example of a semantic error, where the formalization missed the subtlety that\\leanPartition invokes\\leansup for the covering condition, which means something different for subgroups than for sets\.
#### A\.5\.3Misrepresentation Errors

Figure[9](https://arxiv.org/html/2605.13171#A1.F9)provides an example of a misrepresentation error where a hypothesis from the informal text was omitted, and Figure[10](https://arxiv.org/html/2605.13171#A1.F10)shows an error involving incorrect quantification\.

/–

\-A\(weak\)Giuganumberisanumber$n$suchthat

\+A\(weak\)Giuganumberisacompositenumber$n$suchthat

$$\\sum\_\{i=1\}^\{n\-1\}i^\{\\varphi\(n\)\}\\equiv\-1\\pmod\{n\}$$\.

\-/

defIsWeakGiuga\(n:ℕ\):Prop:=

\-2≤n∧¬n\.Prime∧n∣1\+∑i∈Finset\.Ioo0n,i^φn

\+n\.Composite∧n∣1\+∑i∈Finset\.Ioo0n,i^φn

Figure 9:Code diff showing an example of a misrepresentation error, where the formalization was missing a hypothesis that was present in the informal text\.@\[categoryresearchopen,AMS11\]

theoremerdos\_944:

\-\(∀k≥4,∀r≥1,∃\(G:SimpleGraphV\),G\.IsErdos944kr\)↔

\-answer\(sorry\):=by

\+\(∀k≥4,∀r≥1,∃\(V:Typeu\)\(G:SimpleGraphV\),G\.IsErdos944kr\)↔

\+answer\(sorry\):=by

sorry

Figure 10:Code diff showing an example of a misrepresentation error, where the vertex type\\leanV was incorrectly quantified\. In the old version\\leanV was a section variable and thus appeared out of the scope of the existential quantifier\.
#### A\.5\.4Implicit Conventions

Examples of errors stemming from implicit conventions in the source material can be seen in Figures[7](https://arxiv.org/html/2605.13171#footnote7)and[11](https://arxiv.org/html/2605.13171#footnote11)\.

@\[categoryresearchopen,AMS11\]

theoremerdos\_510:

answer\(sorry\)↔∃\(c:ℝ\)\(hc:0<c\),

\-∀N\>0,∀\(A:Finsetℕ\),0∉A→\#A=N→

\+∀ᶠNinatTop,∀\(A:Finsetℕ\),0∉A→\#A=N→

∃θ,∑n∈A,cos\(n\*θ\)<\-c\*sqrtN:=by

sorry

Figure 11:Code diff showing an example of an implicit convention, where the source777[https://www\.erdosproblems\.com/510](https://www.erdosproblems.com/510)implicitly assumes that the inequality holds for sufficiently largeNN\. This is a common implicit convention across analytic number theory where one is typically interested in the asymptotic behavior of functions or sequences of natural numbers\.@\[categoryresearchopen,AMS5\]

theoremerdos\_128:

\-\(\(∀\(G’:G\.Subgraph\)\[FintypeG’\.verts\]\[FintypeG’\.edgeSet\],

\-letIn:=Fintype\.cardV;

\-2\*G’\.verts\.toFinset\.card≥n→

\-50\*G’\.edgeSet\.toFinset\.card\>n^2\)→¬\(G\.CliqueFree3\)\)

\+\(\(∀\(V’:SetV\),

\+2\*V’\.ncard≥Fintype\.cardV→

\+50\*\(G\.induceV’\)\.edgeSet\.ncard\>Fintype\.cardV^2\)→¬\(G\.CliqueFree3\)\)

↔answer\(sorry\):=by

sorry

Figure 12:Code diff showing an example of an implicit convention, where the source101010[https://www\.erdosproblems\.com/history/128](https://www.erdosproblems.com/history/128)wrote “subgraph” rather than “induced subgraph”\. The comments111111[https://www\.erdosproblems\.com/forum/thread/128](https://www.erdosproblems.com/forum/thread/128)demonstrate that the problem is trivial if one does not restrict to induced subgraphs, which is why we class this as implicit convention\. It might also be considered to be a reporting error since the original paper explicitly writes “induced subgraph”,\[Erdős,[1993](https://arxiv.org/html/2605.13171#bib.bib26), p\. 344\]\.
#### A\.5\.5Reporting Errors

Figure[13](https://arxiv.org/html/2605.13171#footnote13)demonstrates a reporting error caused by ambiguities in the historical source material regarding induced subgraphs\.

/–Isthereagraphwith$\\aleph\_2$verticesandchromaticnumber$\\aleph\_2$suchthatevery

subgraphon$\\aleph\_1$verticeshaschromaticnumber$\\leq\\aleph\_0$?\-/

–Formalisationnote:sourcematerial\[ErHa68b\]usesonlyinducedsubgraphs

@\[categoryresearchopen,AMS5\]

theoremerdos\_918\.parts\.i:

\(∃\(V:Typeu\)\(G:SimpleGraphV\),\#V=ℵ\_2∧G\.chromaticCardinal=ℵ\_2∧

\-∀\(W:SetV\)\(\_:\#W=ℵ₁\),\(G\.induceW\)\.chromaticCardinal=ℵ₀\)↔

\+∀\(W:SetV\)\(\_:\#W=ℵ₁\),\(G\.induceW\)\.chromaticCardinal≤ℵ₀\)↔

answer\(sorry\):=by

sorry

Figure 13:Code diff showing an example of a reporting error between an internal draft formalization of the problem\. This problem appears inErdős and Hajnal \[[1968](https://arxiv.org/html/2605.13171#bib.bib28)\]andErdős \[[1969](https://arxiv.org/html/2605.13171#bib.bib27)\]\. In the former citation the explicit use of≤ℵ0\\leq\\aleph\_\{0\}is used, while this does not appear in the latter citation\. This ambiguity was only clarified post draft formalization with the discovery of trivial solutions – in the \(non\-induced\) subgraph case by the website comments131313[https://www\.erdosproblems\.com/forum/thread/918](https://www.erdosproblems.com/forum/thread/918)and by AlphaProof in the induced subgraph case\. Note also thatErdős and Hajnal \[[1968](https://arxiv.org/html/2605.13171#bib.bib28)\]explicitly writes*induced*subgraph whileErdős \[[1969](https://arxiv.org/html/2605.13171#bib.bib27)\]and the website do not\.

## Appendix BExperimental Evaluation Details

### B\.1Frozen Subsets

The frozen subsetsFC100SolvedSet1andFC100OpenSet1consist of 100 problems each, sampled uniformly at random from statements in theresearch solvedandresearch opencategories, respectively\. They are defined by the filesFormalConjectures/Subsets/FC100OpenSet1\.leanandFormalConjectures/Subsets/FC100SolvedSet1\.leanin the repository, which import exactly the corresponding theorem statements\. These files are compiled as part of every tagged release, ensuring that the problem sets remain well\-defined and their statements compile correctly across all supported Lean versions\.

The subset names \(e\.g\.,FC100OpenSet1\) are*orthogonal*to the repository’s release tags \(bench\-vNN\-lean4\.XX\.YY, see Section[4\.1](https://arxiv.org/html/2605.13171#S4.SS1.SSS0.Px5)\)\. When the bench versionNNstays the same and only the Lean version changes \(e\.g\., frombench\-v1\-lean4\.27\.0tobench\-v1\-lean4\.29\.0\), the problem set is guaranteed to contain the same statements; the new tag only ensures compilation against the updated Lean toolchain and Mathlib\. When the bench version increments \(e\.g\., frombench\-v1tobench\-v2\), statements in an existing set may receive fixes \(e\.g\., corrected misformalizations\) or category updates \(e\.g\., anopenproblem reclassified assolved\), but the*membership*of the set remains stable: no problems are added or removed\. For full reproducibility, results should specify both the set name and the release tag, e\.g\.,FC100SolvedSet1@bench\-v1\-lean4\.27\.0\.

In the future, we plan to release additional sets \(e\.g\.,FC100SolvedSet2,FC500OpenSet1\) that may contain entirely different problems, enabling targeted evaluations at different scales or difficulty levels while preserving earlier sets\.

### B\.2Illustrative Evaluation Details

As noted in Section[4\.2](https://arxiv.org/html/2605.13171#S4.SS2), our evaluations focus on demonstrating the benchmark’s utility and its sensitivity to scaling\. For all systems, a solution is correct if and only if the Lean 4 kernel accepts the proof term without relying on forbidden axioms\.

##### AlphaProof Setup\.

We evaluated a slightly updated version of AlphaProof\[Hubertet al\.,[2025](https://arxiv.org/html/2605.13171#bib.bib15)\], with tree\-search inference, without the test\-time reinforcement learning loop\. As reported in Table[2](https://arxiv.org/html/2605.13171#S4.T2), we distinguish between two compute configurations for AlphaProof to demonstrate the benchmark’s sensitivity to search budget: 1k sims utilizes 1,000 simulations and 2 attempts per problem \(approximately 0\.1 TPUh on v6e TPUs\); and 16k sims and 2 attempts \(approximately 1\.6 TPUh on v6e TPUs\)\.

##### DeepMind Prover Agent Setup\.

The DeepMind prover agent\[DeepMind,[2026](https://arxiv.org/html/2605.13171#bib.bib30)\]results were obtained using a development version of the method under active experimentation\. Based on current API pricing, this evaluation costs up to $100 per problem\.

## Appendix CExtended Discussion

### C\.1Future Work

We plan to continuously extend Formal Conjectures with new problems unblocked by newly added Mathlib definitions or from community contributions\. While our implementation uses Lean 4, the methodology is language\-agnostic and portable to systems like Isabelle or Rocq should they develop comparable library support\. We will also track future Lean updates to maintain ecosystem compatibility\. To better evaluate the difficulty and impact of conjectures, we plan to extend the metadata by recording when a conjecture was first proposed\. This objective metric will enable users to filter for long\-standing open problems versus contemporary questions\. However, since difficulty and importance are often subjective, we also propose an interactive interface where the community can share insights\. Users will be able to vote on a conjecture’s perceived difficulty, its mathematical significance, and their conviction regarding its truth value \(e\.g\., True, False, or independent of ZFC\)\.

Furthermore, references to the mathematical literature are currently handled inconsistently across the repository, often requiring duplication in each file\. We plan to improve this by centralizing reference management in future iterations\.

This vision is being integrated into an accompanying website141414[https://google\-deepmind\.github\.io/formal\-conjectures/](https://google-deepmind.github.io/formal-conjectures/)for the repository\. The website already serves as a frontend for the benchmark, allowing users to explore the source code, viewLaTeX\-rendered docstrings, and search, sort, or filter conjectures\. It also contains metadata recording when aresearchopenproblem gets solved or when aresearchsolvedproblem receives a formal proof; in the future, we plan to make these changes more prominently visible\.

Moving forward, we plan to expand this platform into a collaborative hub with interactive community features, including a public leaderboard to track successful AI\-based solutions and recognize groups taking on these problems\.

Similar Articles

MathAtlas: A Benchmark for Autoformalization in the Wild

arXiv cs.AI

MathAtlas is a large-scale benchmark for autoformalization of graduate-level mathematics, containing ~52k theorems and definitions extracted from 103 textbooks, with a mathematical dependency graph of ~178k relations. Experiments show state-of-the-art models achieve at most 9.8% correctness, highlighting the difficulty.

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

arXiv cs.CL

This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).

Solving (some) formal math olympiad problems

OpenAI Blog

OpenAI achieved a new state-of-the-art 41.2% on the miniF2F formal math olympiad benchmark using a technique called 'statement curriculum learning,' which iteratively trains a neural prover on proofs of increasing difficulty. The approach builds on iterative proof search and retraining over 8 iterations to significantly outperform the previous best of 29.3%.

Learning to Reason with Insight for Informal Theorem Proving

arXiv cs.CL

This paper proposes DeepInsightTheorem, a hierarchical dataset and Progressive Multi-Stage SFT training strategy to improve LLMs' informal theorem proving by teaching them to identify and apply core techniques through insight-aware reasoning.