MathAtlas: A Benchmark for Autoformalization in the Wild
Summary
MathAtlas is a large-scale benchmark for autoformalization of graduate-level mathematics, containing ~52k theorems and definitions extracted from 103 textbooks, with a mathematical dependency graph of ~178k relations. Experiments show state-of-the-art models achieve at most 9.8% correctness, highlighting the difficulty.
View Cached Full Text
Cached at: 05/15/26, 06:20 AM
# A Benchmark for Autoformalization in the Wild
Source: [https://arxiv.org/html/2605.14061](https://arxiv.org/html/2605.14061)
Nilay Patel†\\dagger\* nilay@ucsc\.edu &Noah Arias ngarias@ucsc\.edu &Davit Babayan dbabaya1@ucsc\.edu Victoria Cochran vacochra@ucsc\.edu &Timothy Libman tlibman@ucsc\.edu &Hafsah Mahmood hzmahmoo@ucsc\.edu &Liam McCarty lmccarty@ucsc\.edu &Soli Munoz mmunoz35@ucsc\.edu &Laurel Willey lwwilley@ucsc\.edu Jeffrey Flanigan\* jmflanig@ucsc\.edu
University of California, Santa Cruz ∗\*Primary authors †\\daggerCorrespondence
###### Abstract
Current autoformalization benchmarks are largely focused on olympiad or undergraduate mathematics, while graduate and research\-level mathematics remains underexplored\. In this paper, we introduceMathAtlas, the first large\-scale autoformalization benchmark ofin the wildgraduate\-level mathematics, containing∼52k\\sim 52ktheorems, definitions, exercises, examples, and proofs extracted from 103 graduate mathematics textbooks\.MathAtlasis enriched with a mathematical dependency graph containing∼178k\\sim 178krelations, and is the first autoformalization benchmark to include such relations, facilitating evaluation and development of dependency\-aware autoformalization systems\. Our extensive experiments show thatMathAtlasis high quality but extremely challenging: strong baselines achieve at most 9\.8% correctness on theorem statements and 16\.7% on definitions\. Furthermore, we find performance of state\-of\-the\-art models degrades substantially with dependency depth: onMA\-Hard, a subset of 700 entities with the deepest dependency trees, the best model achieves only 2\.6% correctness for autoformalization on this challenging dataset\. We releaseMathAtlasto the community as a benchmark set for large\-scale autoformalization of graduate\-level mathematicsin the wild\.
## 1Introduction
The popularity of formal mathematics \(computer verifiable mathematics\) within the mathematics community has grown dramatically over recent years\[[30](https://arxiv.org/html/2605.14061#bib.bib1),[32](https://arxiv.org/html/2605.14061#bib.bib2),[15](https://arxiv.org/html/2605.14061#bib.bib67)\]\. Together with advances in AI, this has caused a surge of interest inautoformalization\[[10](https://arxiv.org/html/2605.14061#bib.bib19),[36](https://arxiv.org/html/2605.14061#bib.bib16),[13](https://arxiv.org/html/2605.14061#bib.bib30),[2](https://arxiv.org/html/2605.14061#bib.bib20),[18](https://arxiv.org/html/2605.14061#bib.bib63)\], the task of automatically converting natural language math to a formal verifiable language such as Lean\[[24](https://arxiv.org/html/2605.14061#bib.bib75)\]\.
Figure 1:An entity fromMathAtlastogether with a subgraph of its dependency tree\. In yellow \(bottom\), we have a target theorem statement to be formalized \(Proposition 15\)\. Proposition 15 contains three “object references” to previously defined mathematical concepts: Dedeking ring, prime ideals, and principal ideal ring, which are dependencies needed to correctly formalize the theorem\. These objects each have their own dependency tree \(shown with a red border\), and so on\. Formalizations of objects possibly already exist \(e\.g\., in Mathlib\), or they need to be synthesized before the target statement can be formalized\.Until recently, autoformalization work has been largely focused on simpler settings such as olympiad or undergraduate mathematics\[[2](https://arxiv.org/html/2605.14061#bib.bib20),[36](https://arxiv.org/html/2605.14061#bib.bib16),[41](https://arxiv.org/html/2605.14061#bib.bib26),[9](https://arxiv.org/html/2605.14061#bib.bib25),[1](https://arxiv.org/html/2605.14061#bib.bib12),[12](https://arxiv.org/html/2605.14061#bib.bib68),[19](https://arxiv.org/html/2605.14061#bib.bib69)\], where mathematical statements require limited prerequisite material to properly formalize, much of which already has been formalized by humans in libraries such as Mathlib\[[22](https://arxiv.org/html/2605.14061#bib.bib62)\]\.
However, in more advanced mathematics, autoformalization becomes increasingly complex\. Advanced mathematics has a large amount of prerequisite theory, much of which has yet to be formalized\. This means an autoformalization system must first retrieve or produce the necessary dependent theory, correctly formalize it, and then formalize the original target statement\.
Additionally, existing benchmarks typically focus on theorem statements, and ignore formalizations of mathematicaldefinitions\. Systems which can only autoformalize theorem statements are limited in their scope by the level of existing theory because they cannot formalize theorems using math concepts they do not already have definitions for\. Thus, creating scalable autoformalization systems necessitates the study of definition autoformalization, for which large\-scale real\-world benchmarks are lacking\[[39](https://arxiv.org/html/2605.14061#bib.bib70),[34](https://arxiv.org/html/2605.14061#bib.bib73)\]\.
To comprehensively evaluate the performance of autoformalization methods in higher\-level mathematics, we presentMathAtlas111[https://huggingface\.co/datasets/MathAtlas/MathAtlas](https://huggingface.co/datasets/MathAtlas/MathAtlas), a large\-scale, graduate\-level benchmark designed to measure autoformalization performance on mathematics in the wild\.MathAtlascontains∼51k\\sim 51kinformal entities, divided into statements \(theorems, examples, and exercises\), definitions, and proofs, which are extracted from 103 graduate math textbooks\. Entities inMathAtlasspan 87 areas of mathematics, including real analysis, differential geometry, category theory, topology, quantum groups, among others\. Some entities inMathAtlasare formalizable on top of Mathlib, but many require additional formal theory which may not currently exist\. To facilitate access to these required dependencies, we enrich each entity with long\-range dependency relations, local variable references, and local context \(see[Figure 1](https://arxiv.org/html/2605.14061#S1.F1)\)\.
MathAtlasis the first large\-scale,in the wildautoformalization benchmark of graduate\-level mathematics, the first to include rich contextual and dependency data, and the first to annotate their inter\-entity dependencies\. Our extensive experiments show thatMathAtlasis very challenging: strong baseline methods have low overall autoformalization correctness \(10\.5% on statements and 20\.3% on definitions\) onMathAtlas\. Furthermore, we find this performance is correlated with an entity’sdependency depth: our strongest baseline has only 2\.6% overall correctness onMA\-Hard, a subset of 700 entities with the deepest dependency trees\.
Finally, we investigate how well existing semantic faithfulness metrics for autoformalization generalize toMathAtlas\. We presentMA\-Align222[https://huggingface\.co/datasets/MathAtlas/MA\-Align](https://huggingface.co/datasets/MathAtlas/MA-Align), a binary classification benchmark of 200 entities fromMathAtlas\. Unlike prior benchmarks such as ConsistencyCheck\[[6](https://arxiv.org/html/2605.14061#bib.bib86)\]and CriticLeanBench\[[28](https://arxiv.org/html/2605.14061#bib.bib91)\],MA\-Aligncontains both labeled examples of theorem statements and definitions, and covers a wide variety of mathematics at the graduate level\. Our experiments show that prior baselines have significant room for improvement onMA\-Align\.
We organize the rest of the paper as follows: In[§ 2](https://arxiv.org/html/2605.14061#S2), we presentMathAtlas, its construction, contents and quality\. We then discuss the benchmark’s autoformalization metrics \([§ 3](https://arxiv.org/html/2605.14061#S3)\), before moving on to our autoformalization experiments and an analysis of the results \([§ 4](https://arxiv.org/html/2605.14061#S4)\)\. Finally, we discuss related works \([§ 5](https://arxiv.org/html/2605.14061#S5)\), and conclude \([§ 6](https://arxiv.org/html/2605.14061#S6)\)\.
## 2MathAtlas Benchmark
Definition 2\.7\.10\(Invariant bilinear form\)\.: LetV=V0¯⊕V1¯V=V\_\{\\bar\{0\}\}\\oplus V\_\{\\bar\{1\}\}be a superspace and carrying a representation of a Lie algebra𝔤\\mathfrak\{g\}\. A bilinear form\(⋅\|⋅\)\(\\cdot\|\\cdot\)is saidinvariantif, for any homogeneous elementX∈𝔤X\\in\\mathfrak\{g\}andv,w∈Vv,w\\in V,\(Xv\|w\)\+p\(v,X\)\(v\|Xw\)=0\(Xv\|w\)\+p\(v,X\)\(v\|Xw\)=0\. Moreover, ifVVis the representation space of the adjoint representation, then the bilinear form is saidadjointinvariant\.
Corollary 3\.A\.24: LetXXbe a compact Kahler manifold\. Then the inclusioni:\(𝒜∗\(X\)c,d\)⊂\(𝒜∗\(X\),d\)i:\(\\mathcal\{A\}^\{\*\}\(X\)^\{c\},d\)\\subset\(\\mathcal\{A\}^\{\*\}\(X\),d\)is a dga\-quasi\-isomorphism\.
Example 5\.1\.14\.: Find two fundamental matrices for the linear homogeneous system in Example 5\.1\.11\.
Figure 2:Several entities fromMathAtlasof various types: \(1\) a definition entity which defines two terms: “invariant” and “adjoint invariant”\. \(2\) a theorem statement of a corollary from complex geometry\. \(3\) an example from an ODE textbook\.MathAtlasis constructed automatically from textbook documents\. We discuss the entity and relation types inMathAtlas, and then discuss our entity and relation extraction pipeline\. We also measure the quality ofMathAtlaswith a human evaluation\.
### 2\.1Structure ofMathAtlas
The structure ofMathAtlasis as follows, with some relevant statistics\.
#### Entities
We use the termentityto mean a span of informal text which can be formalized, such as a theorem in a textbook\. InMathAtlas, entities are either definitions, theorems, proofs, examples, or exercises \(see[Figure 2](https://arxiv.org/html/2605.14061#S2.F2)\)\. Entities can, but don’t always, haveidentifiers, such as “Theorem 4\.3\.” These identifiers are sometimes referenced by other entities \(see[Figure 1](https://arxiv.org/html/2605.14061#S1.F1)\)\. Importantly, entities sometimes contain multiple formalizable items\. For example, in[Figure 2](https://arxiv.org/html/2605.14061#S2.F2), Definition 2\.7\.10 actually defines two objects, and so it’s formalization may require two definitions\.
#### References
Entities can containreferencesto other entities, dependent mathematical objects, or local variables which are not explicitly defined in the entity itself\. We call theseentity references,object references, andlocal variable references, respectively\.
#### Relations
We also include relations between entity references and the target entity, and object references and entities which define those objects\. For example, an entityemete\_\{met\}may define ametric space, and contain an object referencertopr\_\{top\}to atopological space, which is defined in an entityetope\_\{top\}\. In this case, we add a relation⟨emet,etop⟩\\left<e\_\{met\},e\_\{top\}\\right\>\(see[Figure 1](https://arxiv.org/html/2605.14061#S1.F1)\)\. We definedependency depthto be the maximum height of an items dependency tree, where nodes areentities\(e\.g\., definitions or theorems\) and edges areobject references\. Similarly, we definedependency massto be the total number of unique nodes in a particular entity’s dependency tree\.
#### Statistics
We now discuss some statistics ofMathAtlas\. Overall,MathAtlascontains of 52,052 entities, which are comprised of∼18\\sim 18k theorems,∼10\\sim 10k exercises,∼10\\sim 10k definitions,∼10\\sim 10k proofs, and55k examples\.
About 90% of entities contain at least one object reference\. In sum, there are∼\\sim193k object references, for an average of 3\.69 per entity\. Of these, our relation extraction system found a matching target entity for 83\.7%, for a total of∼\\sim161k total object relations in the full graph\. In addition, 20\.5% of entities contain an entity reference, summing to∼17k\\sim 17ktotal entity references\. Of these, over 99\.8% were matched with a target entity\.
We now briefly present some features ofMathAtlas’s dependency graph\. Nodes have an average dependency depth of 30 \(median of 17\), but range anywhere from 0 to 80\. We also find a stark difference in dependency depth based on the field of mathematics\. For example, entities from textbooks on Lie algebra or Quantum groups had an average depth of∼50\\sim 50, whereas those from books on set theory and logic had an average depth of∼6\\sim 6\. This is in\-line with the intuition that certain fields of mathematics have more prerequisites, and thus deeper dependency trees\.
### 2\.2Construction ofMathAtlas
Here we give an overview of our dataset construction pipeline\. For additional details, see[Appendix A](https://arxiv.org/html/2605.14061#A1)\.
To constructMathAtlas, we collect 103 PDFs of math texts from a variety of sources and several fields of mathematics including subfields of algebra, analysis, geometry, number theory, probability, quantum theory, category theory, topology, and others\. The math texts cover a total of 87 fields and subfields of mathematics\. Following our pipeline \([Figure 5](https://arxiv.org/html/2605.14061#A1.F5)\), we convert the PDFs to MMD \(mathematical markdown\) files and filter out pages containing metadata, such as indices or glossaries\. Then, we run an entity extraction model on the MMDs, which identifies spans of text corresponding to entity types, along with any identifier found in the entity \(e\.g\., “Example 3\.4”\)\. Next, on any extracted entity, we run a reference extractor to identify any references to other entities, objects, or local variables\. Lastly, ondefinitionentities, we run a name extraction system, which identifies the name\(s\) of the defined object\(s\)\.
After extracting entities and associated metadata, we construct our dependency graph with a relation extraction system \(see[Appendix A](https://arxiv.org/html/2605.14061#A1)for system details\)\. We add “refers\-to” relations between entity references and their target entities, and between an object references and the entity which defines that object, if present\. Finally, we add a “proves” relation between a proof entity and the theorem entity it proves\.
### 2\.3Quality Analysis ofMathAtlas
We measure the intrinsic entity and relation extraction quality of the entities inMathAtlas\.
To evaluate entity extraction quality, we annotate a random sample of250250extracted entities fromMathAtlas, 50 entities of each type, based on whether the entity is valid\.333Our annotators hold, at minimum, undergraduate math degrees\.For an entity to be valid, it must be complete \(e\.g\., not cut off at the start or end, no missing segments due to improper PDF\-markdown conversion\), and have a correctly labeled type\. We find that overall, 90\.4% of the annotated entities are correctly extracted \(see[Table 1](https://arxiv.org/html/2605.14061#S2.T1)\)\.
We also annotate 100 randomly sampled references, comprised of 50 object references and 50 entity references\. Like before, references are considered correct if they are complete and are labeled with the correct type\. We find that 96% of references were properly extracted\. Of the four examples with errors, three were identifiers which were mistaken as entity references, and one was a name mistaken for an object reference\.
Finally, we measure the performance of our relation extraction system\. For each of the 100 randomly sampled references above, we retrieve 10 candidate targets with LeanSearch and annotate which, if any, are correct\. We compute precision, recall, and F1 score of our automated system against the human annotations\. Overall, we find an F1 score of 94\.8% \(see[1\(a\)](https://arxiv.org/html/2605.14061#S2.T1.st1)\)\.
Table 1:Intrinsic data quality analysis ofMathAtlas\.\(a\)Performance of our relation extraction system on a human\-annotated subset of 100 extracted relations \(50 of each type\)\. To measure recall, we annotate gold target entities from among the retrieved candidates; if no selected candidate is an appropriate match, we assume none exists\.
\(b\)Precision of our entity extraction system on a human\-annotated subset of 250 examples \(50 of each type\)\.
## 3Autoformalization Metrics forMathAtlas
Following previous work, we use the standard autoformalization metric ofcompile rate\.444We use the software packageblv\[[26](https://arxiv.org/html/2605.14061#bib.bib90)\]to measure compile rate, compiled against Lean version v4\.24\.0\.However, compile rate does not measure faithfulness of the formalization, so we also reportsemantic faithfulness\(see[subsection 3\.1](https://arxiv.org/html/2605.14061#S3.SS1)for a full discussion\)\.
We report a combined metric of compile rate and semantic faithfulness which we callcorrectness\. A formal statement must both compile and be faithful to the informal statement for it to be considered correct\.
### 3\.1Semantic Faithfulness Metric
Here, we discuss semantic faithfulness as a metric, and present experiments showing limitations of prior methods and our improvements\.
Together with syntactic correctness \(compilation rate\), semantic correctness, orsemantic faithfulnessis a critical metric for autoformalization\. Faithfulness measures whether a formal statement has the same mathematical meaning as its informal source\. Prior work on semantic faithfulness\[[21](https://arxiv.org/html/2605.14061#bib.bib84),[6](https://arxiv.org/html/2605.14061#bib.bib86),[28](https://arxiv.org/html/2605.14061#bib.bib91),[20](https://arxiv.org/html/2605.14061#bib.bib92)\]focuses on easier domains \(olympiad or undergraduate math\), and does not measure alignment of definitions\.MathAtlashighlights this gap in prior work, as it contains both high\-level mathematics and definition entities\.
To assess the quality of semantic faithfulness metrics on graduate\-level,in the wildmathematics, we introduceMA\-Align, a semantic faithfulness benchmark of syntactically correct 200 statements and definitions fromMathAtlas\. Each informal entity is formalized bygpt\-oss\-120b, and annotated with a binary label, “aligned” or “misaligned”\.555We selectively sample 200 such entities to better balance the label distribution\.We evaluate prior LLM\-as\-judge systems, CriticLean\[[28](https://arxiv.org/html/2605.14061#bib.bib91)\], as well as our few\-shot LLM\-as\-judge on ConsistencyCheck\[[6](https://arxiv.org/html/2605.14061#bib.bib86)\]and CriticLeanBench\[[28](https://arxiv.org/html/2605.14061#bib.bib91)\], andMA\-Align\(see[Table 2](https://arxiv.org/html/2605.14061#S3.T2)\)\.666We note that we leave out FormalAlign\[[21](https://arxiv.org/html/2605.14061#bib.bib84)\]as a baseline as there is no official implementation, and our reimplementation was unable to match the reported results\.
### 3\.2Semantic Faithfulness Results
We now analyze the results of our semantic faithfulness experiments \([Table 2](https://arxiv.org/html/2605.14061#S3.T2)\)\. Overall, we find that CriticLean has strong performance on ConsistencyCheck and CriticLeanBench, but a notable drop in accuracy onMA\-Align\. In contrast, the strongest LLM\-as\-judge system \(gpt\-5\.2 with an engineered prompt\) drops performance on ConsistencyCheck and CriticLeanBench, but improves by 6% onMA\-Alignfor definitions and 5% for statements\. We suspect this tradeoff is in part due to the selected examples, which were selected fromMathAtlasand therefore in\-domain forMA\-Align\.
However, as gpt\-5\.2 is both proprietary and expensive to use, we elect to use CriticLean \(32B\) as our primary semantic faithfulness metric forMathAtlas, as it is still well\-correlated with human evaluations, is open source,
Table 2:Results of various methods on several semantic faithfulness benchmarks\. “ReForm prompt” refers to the faithfulness prompt used byChenet al\.\[[6](https://arxiv.org/html/2605.14061#bib.bib86)\], and “Our Prompt” is an engineered, few\-shot prompt optimized for gpt 5\.2 with examples taken fromMathAtlas\.
## 4Autoformalization Experiments
In this section, we present autoformalization experiments onMathAtlaswith several strong baseline methods\. We first detail the various models and settings, followed by an analysis of the results\.
### 4\.1Dataset Splits
MathAtlasis naturally divided into five splits corresponding to each entity type:definitions, theorems, exercises, examples, andproofs\. However, exercises, examples, and theorems are formalized similarly: that is, as a theorem statement in Lean 4\. Thus, we report results on a joint “statement” split comprised of all exercises, examples, and theorems\. Note that only a correctstatementis required, not the full proof\. We leave the study of proof and full\-theorem formalization for future work\.
Additionally, we analyze several variables pertinent to high\-level,in the wildautoformalization\. First, we investigate whether an item already existing in Mathlib affects a system’s autoformalization performance\. It is likely that LLMs have seen Mathlib in its training data\. Additionally, Mathlib is constructed bottom\-up, meaning entities lower\-level fields are more likely to already be formalized\. These reasons suggest that existing items will have a higher average correctness\. Secondly, we determine to what extentmax dependency depthaffects correctness, as well asdependency mass\(i\.e\., the total number of dependencies in the dependency tree\)\.
#### MA\-Hard
We select a subset of∼700\\sim 700examples with the deepest dependency trees, which we callMA\-Hard\. This set primarily consists of examples from more advanced graduate\-level mathematics, such as Algebraic Geometry and Lie Theory, and spans 14 subjects\. We also note thatMA\-Hardconsists of both statements \(∼600\\sim 600\) and definitions \(∼100\\sim 100\)\.
#### Open Split
Approximately 30% of entities inMathAtlasare extracted from Springer textbooks\. To comply with copyright law, we release code to generateMathAtlasin full, which can be run by those who have access to Springer textbooks\. However, to facilitate access toMathAtlas, we release an open split of the remaining 70% of the data publicly\.777[https://huggingface\.co/datasets/MathAtlas/MathAtlas](https://huggingface.co/datasets/MathAtlas/MathAtlas)All examples fromMA\-Hardare in the open split\. Our results on this open subset are reported in the Appendix in[Table 5](https://arxiv.org/html/2605.14061#A4.T5)\.
### 4\.2Evaluated Methods
We measure the performance variety of autoformalization baselines, including fine\-tuned methods and prompting methods\. We evaluate several strong models with zero and few\-shot prompting\. We test zero\-shot and few\-shot prompting withgpt\-oss\-120Bandgpt\-oss\-20b\[[27](https://arxiv.org/html/2605.14061#bib.bib88)\]\(see[Appendix B](https://arxiv.org/html/2605.14061#A2)for full prompts\)\. We supply the target entity text as input, and the expected output is a syntactically and semantically correct formal representation in Lean 4 \(see[§ 3](https://arxiv.org/html/2605.14061#S3)\)\. This approach has been extensively studied in prior work, especially on benchmarks such as miniF2F\[[41](https://arxiv.org/html/2605.14061#bib.bib26)\]and ProofNet\[[2](https://arxiv.org/html/2605.14061#bib.bib20)\]\. We perform an ablation, evaluating zero\-shot prompting, few\-shot with randomly selected examples from LeanWorkbook\[[38](https://arxiv.org/html/2605.14061#bib.bib31)\], adding an engineered prompt, and using annotated, graduate\-level examples\. We use the zero\-shot prompt fromZhanget al\.\[[39](https://arxiv.org/html/2605.14061#bib.bib70)\]\. Our engineered prompt is more detailed in its instructions\. Importantly, it instructs the model to synthesize background theory when necessary, and allows for multiple formal output statements as is sometimes required inMathAtlas\.
On statements, we also test fine\-tuned methods: Herald Translator\[[12](https://arxiv.org/html/2605.14061#bib.bib68)\], Goedel Formalizer v2\[[17](https://arxiv.org/html/2605.14061#bib.bib76)\], Kimina Autoformalizer\[[33](https://arxiv.org/html/2605.14061#bib.bib89)\], Atlas Translator\[[19](https://arxiv.org/html/2605.14061#bib.bib69)\], and ReForm\[[6](https://arxiv.org/html/2605.14061#bib.bib86)\]\. However, for definitions, we do not include any fine\-tuned methods, as our experiments revealed autoformalization methods for theorems were unable to generalize to definitions in this domain\.888We are unable to evaluate some prior definition autoformalization methods such as ARIA\[[34](https://arxiv.org/html/2605.14061#bib.bib73)\]or LoC\-Decomp\[[31](https://arxiv.org/html/2605.14061#bib.bib80)\]due to unreleased code\.
#### Added Local Context
As discussed in[subsection 2\.1](https://arxiv.org/html/2605.14061#S2.SS1), some entities inMathAtlasrequire context to define variables, entities, or otherwise elaborate a mathematical setting\. We evaluate a simple baseline of including the prior 500 tokens in the context window of the LLM\. For this setting, we test only the few\-shot prompted systems, as this is out\-of\-domain for the fine\-tuned models\.
### 4\.3Results
Table 3:Results for autoformalization experiments onMathAtlas\. “Compiles” measures the percentage of examples which successfully compile\. “Faithful” measures what percent of the compiling theorems are faithful\. “Correct” indicates the overall percentage of statements which compile and are faithful\.\(a\)Autoformalization results for definitions\. We do not evaluate fine\-tuned methods for definitions because there are no prior\-work models available for definitions\. The best performance in each category and column are bolded\.
ModelCompilesFaithfulCorrectFine\-tunedHerald 7B11\.8%12\.7%1\.5%Kimina 7B27\.3%8\.4%2\.3%ATLAS\-L 8B21\.4%16\.4%3\.5%Goedel 8B20\.3%34\.9%7\.1%Goedel 32B23\.2%30\.6%7\.1%ReForm 8B19\.0%48\.4%9\.8%Promptedgpt\-oss\-20b \(zs\)9\.9%42\.4%4\.2%\+few\-shot16\.5%35\.8%5\.9%\+tuned prompt16\.8%32\.7%5\.5%\+tuned exs\.19\.1%35\.1%6\.7%\+local context18\.6%36\.0%6\.7%gpt\-oss\-120b \(zs\)14\.3%47\.6%6\.8%\+few\-shot16\.8%43\.5%7\.3%\+tuned prompt17\.1%43\.9%7\.5%\+tuned exs\.21\.7%35\.9%7\.8%\+local context20\.0%34\.5%6\.9%\(b\)Autoformalization results for statements \(theorems, examples, and exercises\)\. The best performance in each category and column are bolded\.
In this section we analyze the results of our autoformalization experiments onMathAtlas\.
Overall, we findMathAtlasto be very challenging\. On thestatementsplit, our strongest baseline is ReForm 8B, which achieved an overall correctness of 9\.8% \(see[Table 3](https://arxiv.org/html/2605.14061#S4.T3)\)\. We note the disparity between compilation rate and correctness, especially in fine\-tuned models: Kimina 7B, for example, achieves 27\.3% compilation rate, but just 2\.3% of formalizations compile and are faithful\. On thedefinitionsplit, our strongest system isgpt\-oss\-120b, with 16\.7% correctness, likewise with a significant 11\.8% gap between compilation rate and overall correctness\. As shown in[Table 4](https://arxiv.org/html/2605.14061#S4.T4),MA\-Hardposes a significant challenge, with our strongest model achieving only 2\.6% correctness\.
#### Local Context
Many items inMathAtlasrequire local context to correctly solve\. However, including 300 tokens of local context seems to confuse prompted systems, and the input decreases system performance from 20\.3% to 17\.4% on definitions, and 10\.5% to 8\.6% on statements for the best model \(see[Table 3](https://arxiv.org/html/2605.14061#S4.T3)\)\. This drop in performance indicates further research is necessary on incorporating local context into autoformalization methods\.
Figure 3:Correctness of formalized entities from our best models compared with presence in Mathlib\. Entities which our retrieval system \(LeanSearch\[[11](https://arxiv.org/html/2605.14061#bib.bib96)\]\) found in Mathlib are significantly more likely to be correctly formalized than otherwise\.
#### Object Dependencies
Next, we compare correctness versus max dependency depth\. In[Figure 4](https://arxiv.org/html/2605.14061#S4.F4), we notice a clear trend, showing that our best systems is less likely to correctly formalize deeper targets\. We compute a two\-sample Kolmogorov–Smirnov, and find significant difference between the max depth distributions for correct and incorrect examples \(p<0\.001p<0\.001\)\.
#### Existence in Mathlib
We compare correctness of definitions which are and aren’t grounded in Mathlib \(see[Figure 3](https://arxiv.org/html/2605.14061#S4.F3)\)\. We find that 27\.9% of definitions which our linker grounded in Mathlib are correctly formalized, whereas only 16\.8% of missing definitions are correct, statistically significant difference \(chi\-squarep<0\.001p<0\.001\)\. This supports intuition, as Mathlib is likely to be in the training data for modern LLMs likegpt\-oss\-120b, which may result in inflated autoformalization performance on those entities\.
\(a\)Correctness versus dependency depth of our best performing system for definitions \(gpt\-oss\-120b\)\.
\(b\)Correctness versus dependency depth of our best performing system for statements \(ReForm 8B\)
Figure 4:Correctness of our strongest models compared with dependency depth on examples inMathAtlas\.Table 4:Performance of few\-shot prompted models onMA\-Hard, which contains both definitions and statements\.
## 5Related Works
#### LLMs for Autoformalization
Large language models \(LLMs\) have made significant advancements in autoformalization\. Early work with LLMs such as inWuet al\.\[[36](https://arxiv.org/html/2605.14061#bib.bib16)\]used naive approaches, but showed promise in the methodology\. Further refinements with fine\-tuning on various data made significant progress\[[12](https://arxiv.org/html/2605.14061#bib.bib68),[17](https://arxiv.org/html/2605.14061#bib.bib76),[37](https://arxiv.org/html/2605.14061#bib.bib77),[19](https://arxiv.org/html/2605.14061#bib.bib69),[3](https://arxiv.org/html/2605.14061#bib.bib78)\]but, as we show in this paper, face similar issues in more difficult settings\. More recently, there has been work on breaking down targets either by decomposing into subproblems\[[31](https://arxiv.org/html/2605.14061#bib.bib80)\]or retrieving dependencies first\[[34](https://arxiv.org/html/2605.14061#bib.bib73),[23](https://arxiv.org/html/2605.14061#bib.bib79)\]\. These methods have encouraging results in harder settings\[[14](https://arxiv.org/html/2605.14061#bib.bib71)\], but use proprietary and expensive models and are token\-inefficient\. ProofFlow\[[5](https://arxiv.org/html/2605.14061#bib.bib85)\]introduces a method for formalizing natural language proofs, as well as ProofScore, a metric for measuring faithfulness to the natural language\.
#### Autoformalization Benchmarks
Prior autoformalization benchmarks have largely focused on formalizing theorem statements\[[41](https://arxiv.org/html/2605.14061#bib.bib26),[2](https://arxiv.org/html/2605.14061#bib.bib20)\], primarily on olympiad or undergraduate mathematics\. LeanEuclid\[[25](https://arxiv.org/html/2605.14061#bib.bib81)\]is a benchmark for autoformalizing geometry from Euclid’sElementsand the UniGeo dataset\[[7](https://arxiv.org/html/2605.14061#bib.bib82)\]\.Zhanget al\.\[[39](https://arxiv.org/html/2605.14061#bib.bib70)\]present two definition\-specific benchmark of 100 definitions taken from arXiv and Wikipedia, and show proprietary LLMs have low performance\. The FATE\-H and FATE\-X benchmarks\[[14](https://arxiv.org/html/2605.14061#bib.bib71)\]consist of 100 problems from commutative and abstract algebra, and pose a significant challenge compared to previous work\. In contrast to the FATE benchmarks,MathAtlasis much larger, contains multiple entity types, is more varied in topic, andwincludes local and long\-range dependencies which facilitate the study of isolated system components, such as retrieval or synthesis\.
#### Autoformalization Metrics
Correctness in autoformalization is measured with two metrics: compilation and semantic faithfulness\. The former is simple to measure with the language’s compiler, or specialized tools such as Kimina Lean Server\[[29](https://arxiv.org/html/2605.14061#bib.bib83)\]or blv\[[26](https://arxiv.org/html/2605.14061#bib.bib90)\]\.
More difficult to measure is faithfulness, which measures whether the generated formal language code is semantically equivalent to the input informal statement\. FormalAlign\[[21](https://arxiv.org/html/2605.14061#bib.bib84)\]introduces a methodology for fine\-tuning an alignment model for statements\. However, the methodology does not scale well to more difficult domains, and fails to capture the types of mistakes made by more advanced autoformalization models\.999No implementation or model weights have been released as of this paper\. Our analysis of these results is based on our re\-implementation, which failed to match reported numbers\.For benchmarks with gold formal labels, BEq\[[18](https://arxiv.org/html/2605.14061#bib.bib63)\]judges semantic equivalence between two formal statements based on definitional equivalence, and TransTED\[[20](https://arxiv.org/html/2605.14061#bib.bib92)\]presents a tree\-based similarity metric\. ProofScore\[[5](https://arxiv.org/html/2605.14061#bib.bib85)\]as previously mentioned is a faithfulness metric specifically for proofs\. Using an LLM\-as\-judge approach has been common for measuring semantic faithfulness\[[40](https://arxiv.org/html/2605.14061#bib.bib72)\], and has shown promise as a scalable metric\. ConsistencyCheck\[[6](https://arxiv.org/html/2605.14061#bib.bib86)\]is a benchmark for semantic faithfulness metrics, but as we show in[subsection 3\.2](https://arxiv.org/html/2605.14061#S3.SS2), high performance on ConsistencyCheck does not necessarily imply high performance in harder domains such asMA\-Align\.
## 6Conclusion
In conclusion, we introduceMathAtlas, a large\-scale benchmark for graduate\-level autoformalization of mathematics in the wild\. Through extensive experiments, we show that thatMathAtlasis very challenging for existing autoformalization systems\.
In contrast to prior benchmarks,MathAtlaswith a dependency\-graph structure which identifies prerequisite theory for each entity\. Our analysis shows that entities with deeper dependency trees are substantially less likely to be formalized correctly\. To study this difficulty, we constructMA\-Hard, a subset of 700 entities with the deepest dependency trees inMathAtlas, on which our strongest system achieves only 2\.6% correctness\. We additionally observe that entities successfully grounded to existing formalizations in Mathlib are significantly more likely to be formalized correctly\.
Finally, we introduceMA\-Align, a benchmark for evaluating semantic faithfulness on both mathematical statements and definitions forMathAtlas\. Experiments onMA\-Alignshow that trained metrics such as CriticLean, as well as strong LLM\-as\-judge systems, have clear room for improvement in faithfulness evaluation\.
We releaseMathAtlas,MA\-Align, and our experimental code to support future research forin the wildautoformalization\.
## Limitations
MathAtlasprimarily deals with graduate\-level mathematics, and leaves more difficult domains such as research mathematics for future work\. Certain textbooks are also omitted fromMathAtlasdue to copyright concerns, which can potentially bias data\.
## Acknowledgements
This work was supported in part by the National Science Foundation under the Artificial Intelligence, Formal Methods, and Mathematical Reasoning \(AIMing\) NSF grant, award\#2523479\\\#2523479\. This work used resources available through the National Research Platform \(NRP\) at the University of California, San Diego\. NRP has been developed, and is supported in part, by funding from National Science Foundation, from awards 1730158, 1540112, 1541349, 1826967, 2112167, 2100237, and 2120019, as well as additional funding from community partner\.
## References
- \[1\]\(2022\-11\-14\)Towards a Mathematics Formalisation Assistant using Large Language Models\(Website\)External Links:[Link](http://arxiv.org/abs/2211.07524),2211\.07524Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1)\.
- \[2\]Z\. Azerbayev, B\. Piotrowski, and J\. AvigadProofNet: A Benchmark for Autoformalizing and Formally Proving Undergraduate\-Level Mathematics Problems\.Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1),[§1](https://arxiv.org/html/2605.14061#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
- \[3\]Z\. Azerbayev, H\. Schoelkopf, K\. Paster, M\. D\. Santos, S\. McAleer, A\. Q\. Jiang, J\. Deng, S\. Biderman, and S\. Welleck\(2023\-10\)Llemma: An Open Language Model For Mathematics\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2310.10631),2310\.10631Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[4\]L\. Blecher, G\. Cucurull, T\. Scialom, and R\. Stojnic\(2023\)Nougat: neural optical understanding for academic documents\.External Links:2308\.13418Cited by:[Appendix A](https://arxiv.org/html/2605.14061#A1.p1.1)\.
- \[5\]R\. Cabral, T\. M\. Do, X\. Yu, W\. M\. Tai, Z\. Feng, and X\. Shen\(2025\-10\)ProofFlow: A Dependency Graph Approach to Faithful Proof Autoformalization\.arXiv\.Note:arXiv:2510\.15981 \[cs\]External Links:[Link](http://arxiv.org/abs/2510.15981),[Document](https://dx.doi.org/10.48550/arXiv.2510.15981)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1)\.
- \[6\]G\. Chen, J\. Wu, X\. Chen, W\. X\. Zhao, R\. Song, C\. Li, K\. Fan, D\. Liu, and M\. Liao\(2026\)ReForm: reflective autoformalization with prospective bounded sequence optimization\.External Links:2510\.24592,[Link](https://arxiv.org/abs/2510.24592)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p7.1),[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p3.1),[Table 2](https://arxiv.org/html/2605.14061#S3.T2),[Table 2](https://arxiv.org/html/2605.14061#S3.T2.4.2),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1)\.
- \[7\]J\. Chen, T\. Li, J\. Qin, P\. Lu, L\. Lin, C\. Chen, and X\. Liang\(2022\)UniGeo: unifying geometry logical reasoning via reformulating mathematical expression\.External Links:2212\.02746,[Link](https://arxiv.org/abs/2212.02746)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
- \[8\]R\. Escriva, H\. Bashir, M\. Isom, J\. Huber, Macronova, S\. Kedia, itaismith, L\. VanderHart, J\. Radhakrishnan, tanujnay112, T\. Azarov, B\. Eggers, K\. Diaz, L\. Pei, jasonvigil, D\. Kim, A\. Troynikov, P\. I\. Thomas, R\. P, nicolasgere, G\. Shahbazian, E\. Culver, C\. Gamble, D\. Dash, T\. Krusinski, W\. Gu, swyx\.io, M\. Keller, R\. Him, and naynaly10\(2026\-may 7\)Chroma\-core/chroma\.Note:https://github\.com/chroma\-core/chromaExternal Links:[Link](https://github.com/chroma-core/chroma)Cited by:[Appendix A](https://arxiv.org/html/2605.14061#A1.SS0.SSS0.Px2.p1.1)\.
- \[9\]Exploration of neural machine translation in autoformalization of mathematics in Mizar \| Proceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs\(Website\)ACM Conferences\.External Links:[Document](https://dx.doi.org/10.1145/3372885.3373827),[Link](https://dl.acm.org/doi/10.1145/3372885.3373827)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1)\.
- \[10\]S\. Gadgil, A\. R\. Tadipatri, A\. Agrawal, A\. Narayanan, and N\. GoyalTowards automating formalisation of theorem statements using large language models\.Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[11\]G\. Gao, H\. Ju, J\. Jiang, Z\. Qin, and B\. Dong\(2025\)A semantic search engine for mathlib4\.External Links:2403\.13310,[Link](https://arxiv.org/abs/2403.13310)Cited by:[Appendix A](https://arxiv.org/html/2605.14061#A1.SS0.SSS0.Px3.p1.1),[Figure 3](https://arxiv.org/html/2605.14061#S4.F3),[Figure 3](https://arxiv.org/html/2605.14061#S4.F3.4.2)\.
- \[12\]G\. Gao, Y\. Wang, J\. Jiang, Q\. Gao, Z\. Qin, T\. Xu, and B\. Dong\(2025\)Herald: a natural language annotated lean 4 dataset\.External Links:[Link](https://arxiv.org/abs/2410.10878),2410\.10878Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[13\]A\. Q\. Jiang, W\. Li, and M\. Jamnik\(2023\)Multilingual mathematical autoformalization\.arXiv preprint arXiv: 2311\.03755\.Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[14\]J\. Jiang, W\. He, Y\. Wang, G\. Gao, Y\. Hu, J\. Wang, N\. Guan, P\. Wu, C\. Dai, L\. Xiao, and B\. Dong\(2025\)FATE: a formal benchmark series for frontier algebra of multiple difficulty levels\.External Links:[Link](https://arxiv.org/abs/2511.02872),2511\.02872Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
- \[15\]Prime Number Theorem and MoreExternal Links:[Link](https://github.com/AlexKontorovich/PrimeNumberTheoremAnd)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[16\]W\. Kwon, Z\. Li, S\. Zhuang, Y\. Sheng, L\. Zheng, C\. H\. Yu, J\. E\. Gonzalez, H\. Zhang, and I\. Stoica\(2023\)Efficient memory management for large language model serving with pagedattention\.InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles,Cited by:[Appendix C](https://arxiv.org/html/2605.14061#A3.p1.1)\.
- \[17\]Y\. Lin, S\. Tang, B\. Lyu, Z\. Yang, J\. Chung, H\. Zhao, L\. Jiang, Y\. Geng, J\. Ge, J\. Sun, J\. Wu, J\. Gesi, X\. Lu, D\. Acuna, K\. Yang, H\. Lin, Y\. Choi, D\. Chen, S\. Arora, and C\. Jin\(2025\)Goedel\-prover\-v2: scaling formal theorem proving with scaffolded data synthesis and self\-correction\.External Links:[Link](https://arxiv.org/abs/2508.03613),2508\.03613Cited by:[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[18\]Q\. Liu, X\. Zheng, X\. Lu, Q\. Cao, and J\. Yan\(2025\)Rethinking and improving autoformalization: towards a faithful metric and a dependency retrieval\-based approach\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=hUb2At2DsQ)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1)\.
- \[19\]X\. Liu, K\. Bao, J\. Zhang, Y\. Liu, Y\. Chen, Y\. Liu, Y\. Jiao, and T\. Luo\(2025\)ATLAS: autoformalizing theorems through lifting, augmentation, and synthesis of data\.External Links:[Link](https://arxiv.org/abs/2502.05567),2502\.05567Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[20\]X\. Liu, T\. Zhu, Z\. Dong, Y\. Liu, Q\. Guo, Z\. Liu, Y\. Chen, and T\. Luo\(2026\)ASSESS: a semantic and structural evaluation framework for statement similarity\.External Links:2509\.22246,[Link](https://arxiv.org/abs/2509.22246)Cited by:[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1)\.
- \[21\]J\. Lu, Y\. Wan, Y\. Huang, J\. Xiong, Z\. Liu, and Z\. Guo\(2024\-10\)FormalAlign: Automated Alignment Evaluation for Autoformalization\.arXiv\.Note:arXiv:2410\.10135 \[cs\]External Links:[Link](http://arxiv.org/abs/2410.10135),[Document](https://dx.doi.org/10.48550/arXiv.2410.10135)Cited by:[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1),[footnote 6](https://arxiv.org/html/2605.14061#footnote6)\.
- \[22\]T\. mathlib Community\(2020\-01\)The lean mathematical library\.InProceedings of the 9th ACM SIGPLAN International Conference on Certified Programs and Proofs,POPL ’20\.External Links:[Document](https://dx.doi.org/10.1145/3372885.3373824),[Link](http://dx.doi.org/10.1145/3372885.3373824)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1)\.
- \[23\]M\. J\. Min, Y\. Gao, W\. Sy, Z\. Li, X\. Si, and O\. Bastani\(2026\)Divide and abstract: autoformalization via decomposition and abstraction learning\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=NjgaeXNit3)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[24\]L\. d\. Moura and S\. Ullrich\(2021\)The lean 4 theorem prover and programming language\.Springer\-Verlag,Berlin, Heidelberg\.External Links:[Document](https://dx.doi.org/10.1007/978-3-030-79876-5%5F37),ISBN 978\-3\-030\-79875\-8,[Link](https://doi.org/10.1007/978-3-030-79876-5_37)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[25\]L\. Murphy, K\. Yang, J\. Sun, Z\. Li, A\. Anandkumar, and X\. Si\(2024\)Autoformalizing Euclidean geometry\.InInternational Conference on Machine Learning \(ICML\),Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
- \[26\]Nilay Patel\(2026\-mar 11\)Offendo/blv\.Note:https://github\.com/offendo/blvExternal Links:[Link](https://github.com/offendo/blv)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p1.1),[footnote 4](https://arxiv.org/html/2605.14061#footnote4)\.
- \[27\]OpenAI, :, S\. Agarwal, L\. Ahmad, J\. Ai, S\. Altman, A\. Applebaum, E\. Arbus, R\. K\. Arora, Y\. Bai, B\. Baker, H\. Bao, B\. Barak, A\. Bennett, T\. Bertao, N\. Brett, E\. Brevdo, G\. Brockman, S\. Bubeck, C\. Chang, K\. Chen, M\. Chen, E\. Cheung, A\. Clark, D\. Cook, M\. Dukhan, C\. Dvorak, K\. Fives, V\. Fomenko, T\. Garipov, K\. Georgiev, M\. Glaese, T\. Gogineni, A\. Goucher, L\. Gross, K\. G\. Guzman, J\. Hallman, J\. Hehir, J\. Heidecke, A\. Helyar, H\. Hu, R\. Huet, J\. Huh, S\. Jain, Z\. Johnson, C\. Koch, I\. Kofman, D\. Kundel, J\. Kwon, V\. Kyrylov, E\. Y\. Le, G\. Leclerc, J\. P\. Lennon, S\. Lessans, M\. Lezcano\-Casado, Y\. Li, Z\. Li, J\. Lin, J\. Liss, Lily, Liu, J\. Liu, K\. Lu, C\. Lu, Z\. Martinovic, L\. McCallum, J\. McGrath, S\. McKinney, A\. McLaughlin, S\. Mei, S\. Mostovoy, T\. Mu, G\. Myles, A\. Neitz, A\. Nichol, J\. Pachocki, A\. Paino, D\. Palmie, A\. Pantuliano, G\. Parascandolo, J\. Park, L\. Pathak, C\. Paz, L\. Peran, D\. Pimenov, M\. Pokrass, E\. Proehl, H\. Qiu, G\. Raila, F\. Raso, H\. Ren, K\. Richardson, D\. Robinson, B\. Rotsted, H\. Salman, S\. Sanjeev, M\. Schwarzer, D\. Sculley, H\. Sikchi, K\. Simon, K\. Singhal, Y\. Song, D\. Stuckey, Z\. Sun, P\. Tillet, S\. Toizer, F\. Tsimpourlas, N\. Vyas, E\. Wallace, X\. Wang, M\. Wang, O\. Watkins, K\. Weil, A\. Wendling, K\. Whinnery, C\. Whitney, H\. Wong, L\. Yang, Y\. Yang, M\. Yasunaga, K\. Ying, W\. Zaremba, W\. Zhan, C\. Zhang, B\. Zhang, E\. Zhang, and S\. Zhao\(2025\)Gpt\-oss\-120b & gpt\-oss\-20b model card\.External Links:2508\.10925,[Link](https://arxiv.org/abs/2508.10925)Cited by:[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p1.1)\.
- \[28\]Z\. Peng, Y\. Yao, K\. Ma, S\. Guo, Y\. Li, Y\. Zhang, C\. Zhang, Y\. Zhang, Z\. Yu, L\. Li, M\. Liu, Y\. Xia, J\. Shen, Y\. Wu, Y\. Cao, Z\. Zhang, W\. Huang, J\. Liu, and G\. Zhang\(2025\)CriticLean: critic\-guided reinforcement learning for mathematical formalization\.External Links:2507\.06181,[Link](https://arxiv.org/abs/2507.06181)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p7.1),[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p2.1),[§3\.1](https://arxiv.org/html/2605.14061#S3.SS1.p3.1)\.
- \[29\]M\. D\. Santos, H\. Wang, H\. de Saxcé, R\. Wang, M\. Baksys, M\. Unsal, J\. Liu, Z\. Liu, and J\. Li\(2025\)Kimina lean server: technical report\.External Links:2504\.21230,[Link](https://arxiv.org/abs/2504.21230)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p1.1)\.
- \[30\]P\. Scholze and K\. Buzzard\(2020\-12\-05\)Liquid tensor experiment\.External Links:[Link](https://xenaproject.wordpress.com/2020/12/05/liquid-tensor-experiment/)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[31\]J\. Shi, Z\. Zhang, B\. Ma, S\. Zhao, Y\. Yuan, and G\. Wang\(2026\)LoC\-decomp: LLM autoformalization via logical concept decomposition and iterative feedback correction\.InThe Fourteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=0KFQ4F9YEH)Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1),[footnote 8](https://arxiv.org/html/2605.14061#footnote8)\.
- \[32\]T\. Tao\(2023\-11\)Formalizing the proof of pfr in lean4 using blueprint: a short tour\.External Links:[Link](https://terrytao.wordpress.com/2023/11/18/formalizing-the-proof-of-pfr-in-lean4-using-blueprint-a-short-tour)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1)\.
- \[33\]H\. Wang, M\. Unsal, X\. Lin, M\. Baksys, J\. Liu, M\. D\. Santos, F\. Sung, M\. Vinyes, Z\. Ying, Z\. Zhu, J\. Lu, H\. de Saxcé, B\. Bailey, C\. Song, C\. Xiao, D\. Zhang, E\. Zhang, F\. Pu, H\. Zhu, J\. Liu, J\. Bayer, J\. Michel, L\. Yu, L\. Dreyfus\-Schmidt, L\. Tunstall, L\. Pagani, M\. Machado, P\. Bourigault, R\. Wang, S\. Polu, T\. Barroyer, W\. Li, Y\. Niu, Y\. Fleureau, Y\. Hu, Z\. Yu, Z\. Wang, Z\. Yang, Z\. Liu, and J\. Li\(2025\)Kimina\-prover preview: towards large formal reasoning models with reinforcement learning\.External Links:2504\.11354,[Link](https://arxiv.org/abs/2504.11354)Cited by:[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p2.1)\.
- \[34\]H\. Wang, R\. Xie, Y\. Wang, G\. Gao, X\. Yu, and B\. Dong\(2025\)Aria: an agent for retrieval and iterative auto\-formalization via dependency graph\.External Links:[Link](https://arxiv.org/abs/2510.04520),2510\.04520Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p4.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1),[footnote 8](https://arxiv.org/html/2605.14061#footnote8)\.
- \[35\]L\. Wang, N\. Yang, X\. Huang, L\. Yang, R\. Majumder, and F\. Wei\(2024\)Improving text embeddings with large language models\.External Links:2401\.00368,[Link](https://arxiv.org/abs/2401.00368)Cited by:[Appendix A](https://arxiv.org/html/2605.14061#A1.SS0.SSS0.Px2.p1.1)\.
- \[36\]Y\. Wu, A\. Q\. Jiang, W\. Li, M\. N\. Rabe, C\. Staats, M\. Jamnik, and C\. Szegedy\(2022\-05\-25\)Autoformalization with Large Language Models\(Website\)External Links:[Link](http://arxiv.org/abs/2205.12615),2205\.12615Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p1.1),[§1](https://arxiv.org/html/2605.14061#S1.p2.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[37\]Y\. Wu, D\. Huang, R\. Wan, Y\. Peng, S\. Shang, C\. Cao, L\. Qi, R\. Zhang, Z\. Du, J\. Yan, and X\. Hu\(2025\)StepFun\-formalizer: unlocking the autoformalization potential of llms through knowledge\-reasoning fusion\.External Links:[Link](https://arxiv.org/abs/2508.04440),2508\.04440Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px1.p1.1)\.
- \[38\]H\. Ying, Z\. Wu, Y\. Geng, J\. Wang, D\. Lin, and K\. Chen\(2024\)Lean workbook: a large\-scale lean problem set formalized from natural language math problems\.arXiv preprint arXiv: 2406\.03847\.Cited by:[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p1.1)\.
- \[39\]L\. Zhang, M\. Valentino, and A\. Freitas\(2025\-11\)Autoformalization in the wild: assessing LLMs on real\-world mathematical definitions\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,C\. Christodoulopoulos, T\. Chakraborty, C\. Rose, and V\. Peng \(Eds\.\),Suzhou, China,pp\. 1720–1738\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.90),ISBN 979\-8\-89176\-332\-6,[Link](https://aclanthology.org/2025.emnlp-main.90/)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p4.1),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
- \[40\]L\. Zhang, M\. Valentino, and A\. Freitas\(2025\-06\)Beyond Gold Standards: Epistemic Ensemble of LLM Judges for Formal Mathematical Reasoning\.arXiv\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2506.10903),2506\.10903Cited by:[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px3.p2.1)\.
- \[41\]K\. Zheng, J\. M\. Han, and S\. Polu\(2021\-08\-31\)MiniF2F: a cross\-system benchmark for formal Olympiad\-level mathematics\(Website\)arXiv\.org\.External Links:[Link](https://arxiv.org/abs/2109.00110v2)Cited by:[§1](https://arxiv.org/html/2605.14061#S1.p2.1),[§4\.2](https://arxiv.org/html/2605.14061#S4.SS2.p1.1),[§5](https://arxiv.org/html/2605.14061#S5.SS0.SSS0.Px2.p1.1)\.
## Appendix AExtraction Pipeline
Figure 5:MathAtlascreation pipeline\. For visual convenience, we combine the “name” and “reference” extractors into one step called “identifier extractor\.”Our extraction pipeline starts with 103 selected textbooks covering a wide range of mathematics\. We use Nougat\[[4](https://arxiv.org/html/2605.14061#bib.bib55)\]to convert PDFs into Mathematical Markdown \(MMD\)\. We then run our entity extraction system, followed by our reference extraction and name extraction systems, and finally our relation extraction system\. We detail each below\.
```
Text:
"M is compact iff G is abelian."
Reasoning:
- M not introduced → include local_variable_reference
- G not introduced → include local_variable_reference
- compact is a named property → include object_reference
- abelian is a named property → include object_reference
Output:
[
{"term": "M", "reference_type": "local_variable_reference"},
{"term": "G", "reference_type": "local_variable_reference"},
{"term": "compact", "reference_type": "object_reference"},
{"term": "abelian", "reference_type": "object_reference"}
]
```
Figure 6:Example of reference extraction system input and output\.#### Entity Extraction
Our entity extraction system is a few\-shot, prompt\-engineeredgpt\-oss\-120b\. We prompt the model to generate an array of JSON objects with keys “type” indicating the entity’s type, optionally an “identifier”, which are spans such as Definition 3\.9 or Theorem 15, and “text”, which is the actual text of the entity\. We enforce this structure with a JSON schema using structured decoding, a temperature of 0\.3, and otherwise default sampling parameters\.
We then run our reference extractor system, likewise a promptedgpt\-oss\-120bmodel\. The reference extractor takes as input an extracted entity, and outputs a list of JSON objects with keys “term,” indicating the text of the reference, and “reference\_type,” which is one ofobject\_reference,entity\_reference, orlocal\_variable\_reference\. This is likewise enforced with JSON schema, and uses a temperature of 0\.3\.
Then, we run the name extractor, also a promptedgpt\-oss\-120b, prompted to output the names of objects defined in the entity\. The output a JSON array of strings, again enforced with a schema and structured decoding\. We likewise use a temperature of 0\.3\.
#### Relation Extraction
Each entity is embedded into a ChromaDB instance\[[8](https://arxiv.org/html/2605.14061#bib.bib94)\]together with all its metadata, using the Mistral E5 model fromWanget al\.\[[35](https://arxiv.org/html/2605.14061#bib.bib95)\]\. Then, our relation extraction system works in two stages\. First, we use the same model to retrieve candidate matches from the entity database\. Secondly, we promptgpt\-oss\-120bto determine which, if any, is the correct match, accounting for usage and context\. If there are multiple matches, we prioritize those from the same textbook, or otherwise those with the highest matching score\.
#### Grounding in Mathlib
We use LeanSearch\[[11](https://arxiv.org/html/2605.14061#bib.bib96)\]to link definitions inMathAtlasto formal definitions in Mathlib\. We query the definition text to get 10 candidate matches, and select from the matches using a prompted “gpt\-oss\-120b\.” Unlike the authors however, we find that augmenting the query degrades performance, so we skip that step in our final run\. We suspect this is because our queries are already fully\-formed definitions rather than keywords or phrases\. We include the selected matches as well as all 10 candidates for each definition in the released dataset\.
## Appendix BPrompts
\#\#Task
Youaregivenablockofmarkdownmathematicaltext\.
Yourtaskistoidentifyandextractallcontiguoussnippetsthatcorrespondtoanyofthefollowingmathematicalentitytypes:
\-Definition
\-Theorem\(thisincludespropositions,lemmas,corollaries\)
\-Example
\-Exercise
\-Proof
Onlyextractgenuinemathematicalcontentofthesetypes\.
IgnoreanddoNOTextract:
\-Generalexpositionornarrativeexplanation
\-Chaptertitles,sectionheadings,orindexes
\-Transitionaltextormotivation
\-Referencestoearlierorlaterresultswithoutstatingthem
\#\#Whatqualifiesasanentity
\#\#\#1\.Definitions\(veryimportant\)
AdefinitiondoesNOTrequireaheadingornumbering\.
Youmustextractablockasadefinitionifitsprimarypurposeistointroduce,name,orpreciselydescribeamathematicalobjectorconcept,evenifitappearsinlineinthetext\.
Treatablockasadefinitionifitcontainsdefinitionallanguage,includingbutnotlimitedto:
\-"iscalled"
\-"isdefinedas"
\-"wecall"
\-"issaidtobe"
\-"the\.\.\.of\.\.\.is"
\-"anelement\.\.\.is"
\-anemphasizedoritalicizedtermfollowedbyanexplanation
\-amathematicalobjectfollowedbyacopularverb\("is","are"\)andaprecisecondition
Ifaparagraphanswersthequestion"WhatisX?",itisadefinition\.
Asingleparagraphmayintroducemultipleconcepts\(e\.g\.,minimumpolynomialandconjugate\)\.
Ifso,extracttheentireparagraphasonedefinitionentity\.
Unlabeleddefinitionsaremandatory\.
Failuretoextractthemisconsideredanincorrectoutput\.
\#\#\#2\.Theorems
Extractanyexplicitlystatedmathematicalclaimintendedtobetrue,including:
\-Theorem
\-Proposition
\-Lemma
\-Corollary
Treatalloftheseastype"theorem"\.
Theentitymustassertaresult,notmerelymentionone\.
\#\#\#3\.Proofs
Extractaproofonlywhen:
\-Itisexplicitlylabeled\(e\.g\.,"Proof\.","ProofofProposition3\.2\."\),and
\-Itcontainsactualmathematicalreasoningorargument
Specialrule:
Ifaproofappearsonlyaspartofanexerciseprompt\(e\.g\.,"Provethat\.\.\."\)andcontainsnomathematicalargumentbeyondthestatementitself,doNOTextractaseparateproofentity\.
Caution:
Becarefultoextractafullproof,anddonotcutoffanyinformationhalfwaythrough\.
\#\#\#4\.Examples
Extractexamplesthatdemonstrateorinstantiateamathematicalconcept,whetherlabeled\("Example2\.3"\)orclearlysignpostedinthetext\.
\#\#\#5\.Exercises
Extractexercisesorproblemsintendedforthereadertosolve,whetherlabeledorclearlyformattedassuch\.
\#\#Outputformat
Foreachextractedentity,produceaJSONobjectwiththefollowingkeys:
\-"type":oneof"definition","theorem","example","exercise","proof"\(alwayslowercase\)
\-"identifier":aconciselabeltakenverbatimfromthetextintroducingtheentity
\(e\.g\.,"Definition6\.9\.","Proposition6\.11\.","Proof\.","Exercise7\(b\)"\)
Ifnoexplicitlabel,number,ornameispresent,setthistonull\.
\-"text":theexacttextoftheentity,copiedverbatim\.
DoNOTsummarize,paraphrase,normalizespacing,orrewritemathematics\.
\#\#Orderingrules
\-Preservetheoriginalorderofappearance\.
\-EachentitymustbeitsownJSONobject\.
\-Donotmergeseparateentities\.
\-Donotinventorinfermissingtext\.
\#\#Reasoningstep\(required\)
Beforeproducingthefinaloutput:
1\.Scanthetextsystematicallyfromstarttoend\.
2\.Activelysearchfor:
\-Explicitlabels\(Definition,Theorem,Proof,etc\.\)
\-Implicitdefinitionsusingdefinitionallanguage
3\.Foreachcandidateblock:
\-Decidewhetheritisgenuinemathematicalcontent
\-Confirmitmatchesoneoftheallowedentitytypes
4\.Excludeallnon\-mathematicalexposition\.
Onlyaftercompletingthisidentificationandsegmentationshouldyouproducetheoutput\.
\#\#Outputconstraints
\-RespondwithaJSONlistonly
\-DoNOTwraptheJSONincodefences
\-DoNOTincludeexplanations,commentary,orreasoningintheoutput
\-Ifnovalidentitiesarefound,returnanemptylist:\[\]
\#\#Priorityreminder
Ifablockintroducesordefinesamathematicalobject\-evenwithoutaheading\-itMUSTbeextractedasa"definition"\.
Figure 7:Prompt for entity extractionYouaregivenamathematicaldefinitionwrittenintextbookstyle\.
Yourtaskistoextractthenamesofthemathematicalobjects,concepts,orpropertiesbeingdefined\.
\#\#\#Rules:
\-Returnonlythenamesoftheitemsbeingdefined\.
\-Donotincludesymbols,formulas,orexplanations\.
\-Ifmultipleitemsaredefined,returnallofthem\.
\-Ifadefinitiongivesmultiplesynonymousnames\(e\.g\."X\(orY\)"\),includethemoststandardorcommonlyusedname\.
\-Ignorepurelynotationalconventionsunlesstheydefineacommonlynamedconcept\.
\-Normalizenamestolowercaseunlesscapitalizationisstandard\.
\-OutputtheresultasaJSON\-stylelistofstrings\.
Definition:
\{\{DEFINITION\_TEXT\}\}
Outputformat:
\["name1","name2",\.\.\.\]
\#\#\#Examples:
Example1:
Definition:
\*\*Definition0\.3\.4\*\*\.:
1\.A\_union\_oftwosets\\\(A\\\)and\\\(B\\\)isdefinedas\\\[A\\cupB\\coloneqq\\\{\{x:x\\inA\\text\{\{or\}\}x\\inB\\\}\}\.\\\]
2\.An\_intersection\_oftwosets\\\(A\\\)and\\\(B\\\)isdefinedas\\\[A\\capB\\coloneqq\\\{\{x:x\\inA\\text\{\{and\}\}x\\inB\\\}\}\.\\\]
3\.A\_complementof\\\(B\\\)relativeto\\\(A\\\)\_\(or\_set\-theoreticdifference\_of\\\(A\\\)and\\\(B\\\)\)isdefinedas\\\[A\\setminusB\\coloneqq\\\{\{x:x\\inA\\text\{\{and\}\}x\\notinB\\\}\}\.\\\]
4\.Wesay\_complement\_of\\\(B\\\)andwrite\\\(B^\{\{c\}\}\\\)insteadof\\\(A\\setminusB\\\)iftheset\\\(A\\\)iseithertheentireuniverseorifitistheobvioussetcontaining\\\(B\\\),andisunderstoodfromcontext\.
5\.Wesaysets\\\(A\\\)and\\\(B\\\)are\_disjoint\_if\\\(A\\capB=\\emptyset\\\)
Output:
\["union","intersection","set\-theoreticdifference","complement","disjoint"\]
Example2:
Definition:
\*\*Definition6\.1\*\*\(\\\(\\mathsf\{\{RCA\}\}\_\{\{0\}\}\\\)\)\.:An\_integraldomain\_isaring\\\(R\\\)withnononzerozero\-divisors
Output:
\["integraldomain"\]
Figure 8:Prompt for name extractionYouareanexpertmathematicianandaspecialistinformalmathematics,especiallyLean4andmathlib4\.
Yourtaskistoidentifyallmathematicalreferencesthatappearinagivenmathematicaldefinition,theorem,lemma,orstatement\.
Areferenceisanyterm,symbol,orphrasewhosemeaningdependsonapreviouslydefinedmathematicalobject,entity,orexternalcontext\.
Youmustclassifyeachreferenceintoexactlyoneofthefollowingcategories\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
REFERENCETYPES
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
1\."object\_reference"
Astandard,namedmathematicalobject,structure,property,orconceptwhosedefinitionexistsindependentlyofthecurrenttext\.
Examples:
\-group
\-topologicalspace
\-metricspace
\-compactspace
\-realnumber
\-realvectorspace
\-grouphomomorphism
\-generallineargroup
\-continuousfunction
\-primenumber
\-finiteset
Include:
\-Namedmathematicalstructures
\-Namedmathematicalpropertiesiftheyhaveformaldefinitions
\-Standardnamedconstructions
DoNOTinclude:
\-Genericwordslike"element","set","function","number"
\-Logicalwordslike"if","then","and","exists"
\-Lessspecificformsofanobject\.I\.e\.,inthesentence"LetGbeanabeliangroup",youshouldinclude"abeliangroup",NOT"group"\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
2\."entity\_reference"
Areferencetoaspecificpreviouslydefined,labeled,ornumberedmathematicalitem\.
Examples:
\-Definition3\.4
\-Theorem2\.1
\-Lemma5
\-Proposition4\.2
\-Corollary1\.3
\-previouslemma
\-abovetheorem
\-followingdefinition
Theserefertospecificformalentitiesinthedocumentstructure\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
3\."local\_variable\_reference"
AsymbolthatreferstoamathematicalobjectbutisNOTexplicitlyintroducedinthecurrenttext\.
Asymbolisexplicitlyintroducedifthecurrenttextdeclaresitusingphrasessuchas:
\-letMbe\.\.\.
\-supposeMis\.\.\.
\-fixM\.\.\.
\-givenM\.\.\.
\-whereMis\.\.\.
\-forMa\.\.\.
\-denotebyM\.\.\.
\-letM:=\.\.\.
\-assumeMis\.\.\.
\-takeM\.\.\.
\-anyequivalentexplicitdeclaration
Ifasymbolisexplicitlyintroducedinthecurrenttext,DONOTincludeit\.
OnlyincludesymbolsthatareusedWITHOUTbeingintroduced\.
EXAMPLES
Example:
Text:
"LetGbeagroup\.ThenGisabelianiffthecenterofGequalsG\."
Reasoning:
\-Gisexplicitlyintroduced\-\>exclude
\-groupisanamedobject\-\>include
\-abelianisanamedproperty\-\>include
\-centerisanamedconstruction\-\>include
Output:
\[
\{"term":"group","reference\_type":"object\_reference"\},
\{"term":"abelian","reference\_type":"object\_reference"\},
\{"term":"center","reference\_type":"object\_reference"\}
\]
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
Example:
Text:
"MiscompactiffGisabelian\."
Reasoning:
\-Mnotintroduced\-\>includelocal\_variable\_reference
\-Gnotintroduced\-\>includelocal\_variable\_reference
\-compactisanamedproperty\-\>includeobject\_reference
\-abelianisanamedproperty\-\>includeobject\_reference
Output:
\[
\{"term":"M","reference\_type":"local\_variable\_reference"\},
\{"term":"G","reference\_type":"local\_variable\_reference"\},
\{"term":"compact","reference\_type":"object\_reference"\},
\{"term":"abelian","reference\_type":"object\_reference"\}
\]
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
Ifnoreferencesexist,output:
\[\]
Figure 9:Prompt for reference extractionYouareameticulousexpertinLean4and‘mathlib4‘\.
Yourtaskistoactasa"grounding"reasonerforaformalizationagent\.Yourgoalistodeterminewhetheragivenmathematicalconcepthasacanonicalformaldefinitionin‘mathlib‘,basedonalistofsearchcandidates\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
YourTask\(FollowTheseStepsPRECISELY\)
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
Step1:DirectMatchAnalysis
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
\-First,lookfora\*\*direct,canonicaldefinition\*\*amongthecandidates\.
\-Adirectmatchistypicallya‘class‘,‘structure‘,or‘def‘whosenameisverysimilartotheconceptname\.
Example:
\-Concept:"localring"
\-Match:‘classIsLocalRing‘
\-Ifyoufindaclear,directmatch,usethatasyourprimaryanswer\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
Step2:DeductionfromUsagePatterns
\(OnlyifnodirectmatchisfoundinStep1\)
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
\-IfnodirectmatchwasfoundinStep1,yourtaskisto\*\*deduce\*\*thecanonicalnamebyfindinga\*\*consistentusagepattern\*\*acrossmultiple‘theorem‘and‘instance‘candidates\.
\-\*\*Analyzethesignatures:\*\*
Lookforacommonidentifierthatisconsistentlyusedasa\*\*type\*\*or\*\*typeclass\*\*acrossmultiplecandidates\.
\-\*\*Example:\*\*
Ifyouarelookingfor"CharZero"andthesearchresultsinclude:
\-‘instance:CharZeroN‘
\-‘instance:CharZeroZ‘
\-‘theoremmy\_thm\[CharZeroR\]‘
Thentheidentifier‘CharZero‘appearsrepeatedlyasatypeclass\.Thisisoverwhelmingevidencethatthecanonicaldefinitionisnamed‘CharZero‘\.
\-\*\*StrictRule:\*\*
Thenameyouselect\*\*must\*\*beanidentifierthatisexplicitlypresentinthecandidatelist\.
\-Do\*\*not\*\*invent,combine,orguessanewname\.
\-Ifnosingle,consistentpatternemergesfromthecandidates,youmustconcludethatnoconfidentmatchcanbefound\.
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
Step3:FinalDecision
\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-\-
\-BasedonyouranalysisfromStep1andStep2,determinethesinglebestcandidatefortheconcept\.
\-YouranswerMUSTbeasingle,validJSONobjectwiththefollowingkeys:
\-‘"best\_match"‘:
The\*\*listindex\*\*ofthecandidatewhichpreciselymatchesthesearchquery\.Ifnoconfidentmatchcanbefoundthrougheitherdirectmatchingorinference,thevaluemustbe‘null‘\.
\-‘"reasoning"‘:
Anexplanationofwhyyouchoseamatch,ordidn’tchooseanymatch\.
Figure 10:Prompt for grounding entities to mathlib\. 10 candidate results are fed into this prompt, and the output is an index of the best match and a reasoning\.
## Appendix CAdditional Experimental Details
We use vLLM\[[16](https://arxiv.org/html/2605.14061#bib.bib93)\]as an efficient inference server for all of our dataset construction, autoformalization, and evaluation\. If specified, we use the author’s recommended sampling parameters\. Otherwise, we use a temperature of 0\.5, a top\-p of 0\.95, and otherwise default parameters for all autoformalization experiments\. We use deterministic sampling for our semantic faithfulness systems\.
All experiments were run on a single H200 GPU\. Depending on the model, full\-benchmark runs run from 1\-10 hours\. Checking compilation runs on 10 cpu workers for approximately 5 minutes\. Measuring faithfulness runs for∼\\sim1\-2 hours, depending on the number of successfully compiling examples\.
## Appendix DBroader impacts
We releaseMathAtlaswith the hope to advance the field of autoformalization, and with it, AI and mathematics in general\. This outcome can potentially have larger societal impact as AI becomes more powerful and more widely used\. However, it is unlikely that this contribution in particular will have large\-scale negative ramifications\.
Table 5:Autoformalization results on the “open” split of MathAtlas\. The results listed here are similar to those in[Table 3](https://arxiv.org/html/2605.14061#S4.T3), but restricted to a subset of the data\.\(a\)Autoformalization results for definitions\. The best performance in each category and column are bolded\.
ModelCompilesFaithfulCorrectFine\-tunedHerald 7B11\.8%14\.4%1\.7%Kimina 7B27\.1%8\.1%2\.2%ATLAS\-L 8B20\.1%16\.9%3\.4%Goedel 8B18\.9%37\.6%7\.1%Goedel 32B22\.3%32\.7%7\.3%ReForm 8B18\.1%50\.8%9\.2%Promptedgpt\-oss\-20b \(zs\)9\.3%49\.5%4\.6%\+few\-shot16\.5%35\.8%5\.9%\+tuned prompt16\.8%32\.7%5\.5%\+tuned exs\.19\.1%35\.1%6\.7%gpt\-oss\-120b \(zs\)15\.5%43\.2%6\.7%\+few\-shot16\.3%44\.2%7\.2%\+tuned prompt17\.1%43\.9%7\.5%\+tuned exs\.20\.7%38\.6%8\.0%\(b\)Autoformalization results for statements \(theorems, examples, and exercises\)\. The best performance in each category and column are bolded\.Similar Articles
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
This paper introduces Formal Conjectures, an evolving benchmark of 2615 mathematical statements formalized in Lean 4, including open research conjectures for proof discovery and solved problems for auto-formalization, designed to evaluate automated reasoning systems with zero contamination.
TabularMath: Understanding Math Reasoning over Tables with Large Language Models
TabularMath introduces a benchmark and AutoT2T framework for evaluating LLMs' mathematical reasoning over tabular data, revealing that table complexity, data quality, and modality significantly impact model performance. The study addresses a gap in LLM evaluation by systematically assessing robustness to incomplete or inconsistent table information in real-world scenarios.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet is a large-scale multilingual multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models like Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
MIT scientists build the world’s largest collection of Olympiad-level math problems, and open it to everyone
MIT researchers, in collaboration with KAUST and HUMAIN, have released MathNet, the largest open-source dataset of Olympiad-level math problems, containing over 30,000 expert-authored problems from 47 countries.
Solving (some) formal math olympiad problems
OpenAI achieved a new state-of-the-art 41.2% on the miniF2F formal math olympiad benchmark using a technique called 'statement curriculum learning,' which iteratively trains a neural prover on proofs of increasing difficulty. The approach builds on iterative proof search and retraining over 8 iterations to significantly outperform the previous best of 29.3%.