Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs
Summary
Introduces Mask-Proof, an LLM-based pipeline that converts mathematical proofs into masked-step tasks for automated evaluation, and presents MaskProofBench, a benchmark of 292 curated problems achieving 96.8% agreement with expert annotators.
View Cached Full Text
Cached at: 06/16/26, 11:44 AM
# Mask-Proof: An LLM-based Automated Data Curation Pipeline on Mathematical Proofs Source: [https://arxiv.org/html/2606.15258](https://arxiv.org/html/2606.15258) Jierui Zhang[2024140911@bupt\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Computer ScienceBeijing University of Posts and TelecommunicationsBeijingChinaSiyuan Tan[cream˙milk@bupt\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:cream%CB%[email protected])Graduate College for EngineersBeijing University of Posts and TelecommunicationsBeijingChina,Xinhang Li[24110180028@m\.fudan\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Mathematical SciencesFudan UniversityShanghaiChina,Longzhuangzhi Lin[linlong0591@bupt\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Cyberspace SecurityBeijing University of Posts and TelecommunicationsBeijingChina,Dailin Li[ldlbest@mail\.dlut\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Computer Science and TechnologyDalian University of TechnologyDalianChina,Chengfeng Gu[cgfa9563@gmail\.com](https://arxiv.org/html/2606.15258v1/mailto:[email protected])Chu Kochen Honors CollegeZhejiang UniversityHangzhouChina,Xinping Li[rccp@tsinghua\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])Department of Psychological and Cognitive SciencesTsinghua UniversityBeijingChina,Yaxian Hao[haoyaxian@bupt\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])Graduate College for EngineersBeijing University of Posts and TelecommunicationsBeijingChina,Shengjia Liang[liangshengjia@buaa\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])State Key Laboratory of Virtual Reality Technology and SystemsBeihang UniversityBeijingChina,Yuxiang Ren[renyuxiang@nju\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Intelligence Science and TechnologyNanjing UniversityNanjingChinaandWenhao Liu[wenhaoliu@bupt\.edu\.cn](https://arxiv.org/html/2606.15258v1/mailto:[email protected])School of Computer ScienceBeijing University of Posts and TelecommunicationsBeijingChina \(2026\) ###### Abstract\. Large language models \(LLMs\) are increasingly capable of mathematical problem solving and can even assist with research\-level proofs, yet we still lack a scalable and reproducible way to measure step\-level reasoning in long proofs across diverse sources\. This evaluation gap limits trustworthy AI assistance in proof\-certified scientific progress\. Existing evaluations often emphasize final answers or rely on costly expert grading, while end\-to\-end proof generation remains open\-ended and hard to verify automatically\. We introduceMask\-Proof, a pipeline that turns real proofs into automatically checkable masked\-step tasks\. It masks key formula steps, provides the necessary surrounding context, and evaluates model reconstructions with an LLM\-based equivalence judge using repeated votes for stability\. The resultingMask\-ProofBenchcontains 292 curated problems across diverse research areas\. Experiments with 17 models show that reasoning\-enhanced models outperform standard models by 12% to 27%\. Our evaluator achieves 96\.8% agreement with expert annotators, enabling faithful, reproducible, and comparable measurement of step\-level mathematical reasoning\. Benchmark, annotations, and code are available at[https://github\.com/weating/Mask\-Proof](https://github.com/weating/Mask-Proof)\. Mathematical Reasoning, Large Language Models, Benchmark, Proof Evaluation, Automated Curation ††journalyear:2026††copyright:cc††conference:Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2; August 09–13, 2026; Jeju Island, Republic of Korea††booktitle:Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V\.2 \(KDD ’26\), August 09–13, 2026, Jeju Island, Republic of Korea††doi:10\.1145/3770855\.3818886††isbn:979\-8\-4007\-2259\-2/2026/08††ccs:Computing methodologies Artificial intelligence††ccs:Computing methodologies Natural language processing## 1\.Introduction Large language models \(LLMs\) are increasingly competent at mathematical problem solving\(balunović2026matharenaevaluatingllmsuncontaminated\)and can even assist with research\-level proofs\(Weiet al\.,[2022](https://arxiv.org/html/2606.15258#bib.bib2); Trinhet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib3)\)\. Yet we still lack a scalable and reproducible measurement instrument for step\-level reasoning in long\-range proof—one that remains faithful to the underlying mathematics across heterogeneous sources such as research papers and competition benchmarks\(Maet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib35); Zhenget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib29); Panditet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib7)\)\. This gap is increasingly visible in recent frontier\-system investigations\. For instance, Sébastien Bubeck and collaborators at OpenAI report that GPT\-5 can sometimes make meaningful progress on research\-level proof tasks under expert guidance\(Bubecket al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib1)\)\. However, such evidence often relies on step\-wise expert judgment, which is costly to scale, difficult to reproduce, and hard to compare across domains\. ##### Why This Matters for AI for Sciences\. This measurement gap is not merely a benchmarking nuisance\. Mathematics is a foundational scientific discipline in which progress is certified by proofs\. Accordingly, trustworthy AI assistance in mathematics requires domain\-grounded evaluation of reasoning processes, not only final outcomes\(Chenet al\.,[2022](https://arxiv.org/html/2606.15258#bib.bib34); Maet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib35); Yanget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib36)\)\. Mask\-Proof contributes proof\-centric data and evaluation tools that make step\-level reasoning measurable, reproducible, and comparable across real\-world proof sources\. ##### The Measurement Problem in Step\-Level Proof Evaluation\. Treating step\-level evaluation as a scientific measurement reveals three intertwined obstacles: what to measure, what context is required, and how to score correctness\. \(1\)*Step selection is non\-trivial\.*Proofs vary widely in style and granularity\. Many lines are trivial, purely definitional, or dominated by local surface cues\. Masking such steps risks measuring shortcut exploitation rather than genuine inference, undermining construct validity\(Geirhoset al\.,[2020](https://arxiv.org/html/2606.15258#bib.bib5); Zhuet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib37); Yuet al\.,[2023](https://arxiv.org/html/2606.15258#bib.bib38)\)\. \(2\)*Dependencies are often missing\.*Research proofs frequently rely on external dependencies—cross\-referenced lemmas, macros, or unstated background facts\. If these dependencies are not recovered, a masked step may be underdetermined from the provided context, confounding reasoning ability with missing information and making failures uninterpretable\. \(3\)*Scalable checking is difficult\.*Even when a ground\-truth step is available, correctness checking at scale remains hard: string matching is inadequate, symbolic tools offer no general guarantees, and expert evaluation does not scale\. Automated judging therefore requires explicit mechanisms to control variance and ensure reproducibility\(Kimet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib39); Thakuret al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib40); Keet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib41); Zhenget al\.,[2023](https://arxiv.org/html/2606.15258#bib.bib8)\)\. Taken together, these challenges prevent step\-level reasoning in mathematics from being studied as a reproducible scientific phenomenon\. ##### Mask\-Proof: Making Step\-Level Reasoning Measurable\. We introduceMask\-Proof, a pipeline that directly addresses the obstacles above by transforming heterogeneous proofs into verifiable, self\-contained, step\-level evaluation problems with variance\-controlled automatic checking\. The core idea is intentionally simple: rather than scoring entire proofs end\-to\-end, we test whether a model can reconstruct a key formula step whose ground truth is extracted directly from the original proof\. Concretely, Mask\-Proof \(i\) uses Codex CLI to agentically select key steps that cannot be solved by local cues alone, \(ii\) recovers the minimal dependency closure needed to make each masked step self\-contained, meaning that no external lemmas, definitions, or macros are further required beyond the provided context, and \(iii\) checks model outputs using a calibrated LLM\-based equivalence judge, with repeated independent judgments and majority voting to reduce variance\. Beyond benchmarking, Mask\-Proof serves as a reusable measurement framework, exhibiting effectiveness, reproducibility, and generalization across proof sources\. Figure[1](https://arxiv.org/html/2606.15258#S1.F1)illustrates examples of the resulting masked\-step format\. In this work, we make three contributions: 1. \(1\)Mask\-Proof, an LLM\-based automated curation pipeline that identifies inference\-critical formula steps, reconstructs self\-contained proof context, and produces verifiable masked\-step problems from real\-world mathematical proofs\. 2. \(2\)Mask\-ProofJudge, an automatic step\-level evaluation framework calibrated against expert judgments, achieving 96\.8% human–judge agreement with variance reduction via repeated independent judgments and voting\. 3. \(3\)Mask\-ProofBench, a benchmark generated by this pipeline that enables scalable and reliable evaluation of step\-level reasoning in real\-world research\-level mathematical proofs\. This benchmark contains 292 Mask\-Proof problems, all of which were manually audited by human experts in mathematics\. Figure 1\.\(I\) IMO\-ProofBench\(Luonget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib19)\)represents competition\-level proof problems paired with full solutions and rubric\-style evaluation\. \(II\) After Mask\-Proof processing, the same proof data are converted into a verifiable masked format, where key formula/derivation steps are hidden as targets for automatic evaluation\. \(III\) Mask\-ProofBench extends this masked, verifiable format to research\-level proofs with substantially longer derivations, masking rigorous intermediate steps rather than final numerical results or highly templated fragments\. See Appendix[E](https://arxiv.org/html/2606.15258#A5)for complete details\.Three\-panel comparison of proof evaluation formatsPanel I shows an IMO\-ProofBench competition problem \(USAMO 2025\) with a full solution involving binomial coefficient identities and a partial\-credit rubric\. Panel II shows the same proof after Mask\-Proof processing, where a key congruence formula is replaced with a MASK token while the surrounding proof context is preserved\. Panel III shows a research\-level masked proof based on Lemma 2\.4 from Shankar et al\., involving SL3 orbit classification over fields of characteristic not 2, with additional self\-containedness information including Definition 2\.3, a masked matrix construction step, and an example LLM response from DeepSeek\-V3\.2\-Thinking that incorrectly inverts the sign of the \(2,2\) entry\. ## 2\.Related Work ##### Answer\-based mathematical reasoning benchmarks\. Evaluation of LLM mathematical reasoning has historically centered on benchmarks with uniquely verifiable endpoints\. GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2606.15258#bib.bib4)\)and MATH\(Hendryckset al\.,[2021](https://arxiv.org/html/2606.15258#bib.bib13)\)established final\-answer accuracy as the dominant metric, enabling scalable evaluation and providing limited visibility into reasoning processes\. As performance on these benchmarks saturated, subsequent work increased difficulty through olympiad\-level problems\(Heet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib25); Gaoet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib14)\), expert\-authored frontier evaluations\(Glazeret al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib30)\), and research\-level question\-answering\(Zhanget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib21)\)\. However, even sophisticated answer\-based benchmarks fundamentally evaluate*conclusions*rather than*formulas*, leaving proof quality unexamined\. ##### Proof\-centric evaluation and formal theorem proving\. Proof\-centric evaluation shifts focus from answer correctness to formula rigor, making verifiability the core challenge\. Formal theorem proving addresses this by delegating correctness to proof assistant kernels: miniF2F\(Zhenget al\.,[2022](https://arxiv.org/html/2606.15258#bib.bib24)\)and PutnamBench\(Tsoukalaset al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib26)\)benchmark formal proving on competition mathematics\. However, the natural\-language\-to\-formal gap introduces semantic errors that are difficult to eliminate; ReForm\(Chenet al\.,[2026](https://arxiv.org/html/2606.15258#bib.bib20)\)reveals that autoformalization is inherently challenging, with even human experts producing semantic errors in 38\.5% of cases\. Natural\-language proof evaluation is closer to real mathematical practice but substantially harder to grade at scale\. IMO\-ProofBench\(Luonget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib19)\)introduces rubric\-based grading for competition proofs, while STORM\-BORN\(Liuet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib12)\)curates challenging formulas from research papers with human\-in\-the\-loop verification\. More recently, IMProofBench\(Schmittet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib33)\)targets*research\-level*proof generation with expert\-authored problems, and combines expert grading of full proofs with gradable follow\-up subquestions\. Yet these evaluations still require substantial human expert annotation and can’t scale to evaluate in\-the\-wild research\-level proofs\. ##### Verifiers and scalable checking\. To reduce reliance on expert grading, an important line of work trains verifiers that score reasoning processes\. Process reward models\(Lightmanet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib23); Luoet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib28)\)provide step\-level feedback, while diagnostic studies reveal that LLMs struggle to locate reasoning errors but can correct them given error locations\(Tyenet al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib18)\)\. Self\-correction approaches have shown limited success: Huang et al\.\(Huanget al\.,[2024](https://arxiv.org/html/2606.15258#bib.bib31)\)demonstrate that LLMs cannot reliably self\-correct reasoning without external feedback\. More recent generator–verifier frameworks, such as DeepSeekMath\-V2\(Shaoet al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib11)\), scale verification through meta\-verification and self\-training, though their evaluation remains centered on competition\-style problems\. High\-profile demonstrations of AI\-assisted mathematics\(Bubecket al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib1)\)still emphasize expert oversight rather than fully automatic evaluation\. Collectively, prior work highlights a persistent capability–verifiability tension: as evaluations approach research\-level proofs, obtaining low\-cost, scalable, and reproducible verification becomes substantially harder\. Figure 2\.Mask\-Proof Pipeline\.Starting from raw arXiv LaTeX sources, our pipeline extracts complete proofs, repairs them into self\-contained contexts, and agentically masks a key formula step to produce Mask\-ProofBench\.Mask\-Proof pipeline overview diagramThe pipeline has three stages\. Stage one, Paper Collection: arXiv papers with LaTeX sources are downloaded and filtered into proof candidates\. Stage two, Automated Data Curation, contains two sub\-modules: Self\-contained Recovery, where Codex CLI takes a non\-self\-contained question and raw LaTeX source and resolves dependencies to produce a self\-contained question paired with the proof; and Agentic Masking for Proof, where Codex CLI selects a key formula step, replaces it with a MASK token, and records the masked content as ground truth, while trivial steps are filtered out\. Stage three, Mask\-ProofBench and Evaluation: the resulting benchmark instances are evaluated by multiple LLMs, and a Mask\-Proof Judge assesses semantic equivalence between model outputs and the ground\-truth masked content\. ## 3\.Methodology In this section, we describe the overall pipeline, Mask\-Proof, including automated paper collection and proof quality control, automated proof data curation and evaluation\. This LLM\-based pipeline is explicitly designed to be contamination\-resistant by collecting and curating recent arXiv submissions, reproducible by using powerful Codex CLI, and broadly generalizable by adding different sources of proof data, while ensuring that the resulting masked proof data supports fully automated verification and maintains rigorous guarantees of correctness and verifiability\. An overview of the pipeline is shown in Figure[2](https://arxiv.org/html/2606.15258#S2.F2)\. ### 3\.1\.Paper Collection and Proof Extraction To ensure the quality of the collected arXiv papers, proofs, and the resulting masked proof data, we curate a corpus of real\-world, research\-level mathematical papers under the following criteria\. First, we restrict our corpus to arXiv submissions between August and October 2025, so as to minimize potential contamination from pretraining data and reduce the risk of data leakage in large language models\. Second, we collect papers across a broad range of mathematical fields to mitigate domain bias; the distribution of the major subject categories is summarized in Table[4](https://arxiv.org/html/2606.15258#A4.T4)\. We mine these research\-level papers together with their full LaTeX sources to enable precise proof extraction and dependency resolution\. To ensure that the resulting evaluation instances require nontrivial reasoning, we further filter out trivial proofs, including very short arguments, purely definitional proofs, and pattern\-completable formulas, as such instances either do not involve substantive inference or are prone to context leakage\. ##### Human verification of benchmark quality\. Before downstream evaluation, all 292 finalMask\-ProofBenchinstances were manually audited by domain experts in mathematics\. The audit assessed the mathematical correctness of the masked ground\-truth step within the original proof, the sufficiency of the recovered context for solving the masked step without external documents, the completeness of the selected mask as a formula\-level reasoning step rather than a syntactic fragment, and screened each instance for obvious shortcut cues, malformed LaTeX, or unresolved paper\-specific notation\. Instances that failed these checks were discarded; the expert verification annotations will be released with the benchmark\. ### 3\.2\.Automated Proof Data Curation ##### Self\-contained Recovery Formally, we define a proof to be self\-contained if the pipeline can recover a sufficient dependency closure such that each masked step is solvable from the extracted theorem statement and the reconstructed proof context alone, with no reliance on external documents\. Concretely, the pipeline provides Codex CLI with \(a\) the proof snippet and \(b\) its source LaTeX data, and instructs it to resolveexplicit dependenciesby traversing internal cross\-references with explicit labels \(e\.g\.,\\ref,\\eqref\), locating the corresponding labeled targets in the source, appending the extracted content to the problem context, and expanding user\-defined macros that carry essential semantics into standard LaTeX\. In contrast,implicit dependenciesare not directly anchored by labels \(e\.g\., symbols whose closed\-form definitions appear elsewhere\); these are recovered by searching the source LaTeX for their defining statements and attaching the extracted content as Additional Information\. All relevant prompts are released with the code\. Cases dominated by external dependencies \(e\.g\., citations to external conclusions\) are discarded\. ##### Agentic Masking for Proof For each self\-contained proof, we construct masked steps by replacing exactly one contiguous reasoning step with\[MASK\]\. Each masked step contains only pure mathematical expressions \(no natural\-language text\)\. From each source proof, Mask\-Proof generates up to three such masked steps, each corresponding to an independent and well\-defined inference\. We defined a series of rubrics \(see Appendix[A](https://arxiv.org/html/2606.15258#A1)\) to guide the selection of masked positions\. These rubrics instruct Codex CLI to agentically mask key inference\-critical steps, ensuring that the resulting tasks necessarily require non\-trivial mathematical reasoning rather than shallow pattern matching\. ##### Faithful Mask Application A practical failure mode of directly applying LLM\-based masking is unintended proof rewriting, which can subtly alter semantics\. To preserve faithfulness to the original proof, our pipeline decouples*\(i\) selecting mask steps*from*\(ii\) applying the mask*\. We use Codex CLI to perform these two steps automatically under strict constraints, ensuring that only the target step is masked while the remaining proof text remains unchanged\. ### 3\.3\.Automated Proof Data Evaluation As illustrated in Figure[2](https://arxiv.org/html/2606.15258#S2.F2), our automated proof\-data curation pipeline converts each evaluation unit into a masked key formula step within a well\-defined, self\-contained proof context\. Compared to evaluating full proofs or sub\-proofs, focusing on such mask instances substantially reduces ambiguity and leads to more consistent automated judgments\. Nevertheless, exact string matching remains inadequate for mathematical proofs due to extensive notational diversity and equivalent reformulations\. To address this limitation, we introduce Mask\-ProofJudge, an LLM\-based evaluation framework where GPT\-OSS\-120B \(High\) serves as the backbone to assess semantic equivalence between model outputs and the ground\-truth masked content\. To further reduce evaluation variance arising from generation stochasticity, we sample four solutions per problem and aggregate multiple independent judgments via majority voting, reporting the resulting average accuracy\. In practice, the LLM judge achieves a 96\.8% agreement rate with human expert annotations, indicating that the proposed evaluation protocol provides a reliable and high\-confidence approach for expert\-level assessment\. ### 3\.4\.Case Study To explicitly demonstrate the workflow described in Sections[3\.2](https://arxiv.org/html/2606.15258#S3.SS2)and[3\.3](https://arxiv.org/html/2606.15258#S3.SS3), we trace the processing ofLemma 2\.4from Shankar et al\.\(Shankaret al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib15)\)\. This example, which corresponds to the case illustrated in Figure[1](https://arxiv.org/html/2606.15258#S1.F1), shows how each stage of our pipeline transforms a raw arXiv fragment into a rigorous evaluation instance\. Stage 1: Self\-contained Recovery\.In this case, the raw extraction ofLemma 2\.4requires constructing a matrix representation for the ”Δ\\Delta\-distinguished orbit\.” However, the critical definition of ”Δ\\Delta\-distinguished” is located inDefinition 2\.3, outside the local context\(Shankaret al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib15)\)\. As defined in our curation protocol, this constitutes an ”unresolved external dependency\.” Our pipeline’s dependency agent successfully identified this gap\. Instead of discarding the proof, it traced the term back toDefinition 2\.3, and appended it to the problem context\. This step ensured the instance satisfied the self\-contained criterion\. Stage 2: Agentic Mask Selection\.Following the recovery, the Mask Selection Agent scanned the proof to identify a non\-trivial reasoning step based on the rubrics in Appendix[A](https://arxiv.org/html/2606.15258#A1)\. The agent rejected steps involving simple algebraic rearrangements \(violating the ”non\-trivial” rubric\) and selected the specific matrix construction step\. This selection was strategic: correctly masking this matrix forces the model to synthesize the recovered global definition with local variablesa,b,c,da,b,c,d\. It ensures the task tests reasoning necessity rather than pattern matching or context copying\. Stage 3: Automated Evaluation\.Finally, we evaluated the response from DeepSeek\-V3\.2\-Thinking using Mask\-ProofJudge\. The model’s output visually resembled the ground truth but inverted the sign of the\(2,2\)\(2,2\)entry fromddto−d\-d\. Under the defining conventionRes\(A,B\)=4det\(xA−yB\)\\mathrm\{Res\}\(A,B\)=4\\det\(xA\-yB\), the ground\-truth representative yieldsf\(x,y\)f\(x,y\)\. The model response changes the signs of theB22B\_\{22\}andB33B\_\{33\}entries, yieldingf\(x,−y\)f\(x,\-y\)instead off\(x,y\)f\(x,y\)\. Thus the response does not lie in the required fiberRes−1\(f\)\\mathrm\{Res\}^\{\-1\}\(f\), so the error is mathematical\. ## 4\.Experiments In this section, we evaluate Mask\-Proof to ascertain its effectiveness in automated constructing masked problem following four questions: RQ1: How effective is Mask\-Proof as an automated data curation pipeline? RQ2: How reproducible is the automated Mask\-Proof pipeline? RQ3: Is our automated evaluation framework consistent with human experts in masked\-step evaluation? RQ4: Can the Mask\-Proof pipeline generalize to other data sources? Figure 3\.Avg@4 leaderboard on Mask\-ProofBench\.All 17 evaluated models are ranked by Avg@4 accuracy\. Reasoning\-enhanced models demonstrate a substantial performance advantage over standard models at comparable parameter scales\.Bar chart showing Avg@4 leaderboard on Mask\-ProofBenchA horizontal bar chart ranking 17 models by Avg@4 accuracy on Mask\-ProofBench\. GPT\-5\.2 High leads at 63\.3 percent, followed by Gemini\-3\-Flash\-preview at 59\.3 percent, GPT\-5\.2 Low at 52\.6 percent, Grok\-4\.1\-reasoning at 50\.9 percent, and DeepSeek\-V3\.2\-Thinking at 49\.2 percent\. Standard models score below 30 percent, with Qwen3\-14B lowest at 8\.1 percent\. Bars are color\-coded by closed\-source versus open\-source and reasoning\-enhanced versus standard, showing a clear performance boundary near 32 percent separating the two reasoning categories\. ### 4\.1\.Experimental Setup ##### Dataset\. To comprehensively evaluate the effectiveness of the Mask\-Proof pipeline, we curate two datasets: - •Mask\-ProofBench, constructed using the pipeline described in Section[3\.2](https://arxiv.org/html/2606.15258#S3.SS2)\. We collected 403 arXiv papers from August to October 2025, extracted 835 theorem–proof blocks, and retained 292 final masked instances\. - •To evaluate the cross\-dataset generalization of the automated pipeline, we additionally construct a comparison dataset by applying Mask\-Proof to all 30PB\-Advancedproofs from IMO\-ProofBench\(Luonget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib19)\)\. This comparison dataset consists of 35 mask instances\. ##### Evaluation Metrics\. We evaluate the Mask\-Proof pipeline and the curated dataset using three complementary metrics: Avg@4, Spearman’s rank correlation coefficient, and BLEU\. Avg@4 measures the average correctness of an LLM over four independent attempts on each masked proof step\. Spearman’s rank correlation coefficient assesses the transferability of Mask\-Proof by quantifying the consistency of model rankings of IMO\-ProofBench under masked evaluation\. BLEU evaluates n\-gram overlap to measure surface\-level linguistic consistency, and is used to assess the reproducibility of self\-contained proof recovery\. Together, these metrics characterize the effectiveness, transferability, and reproducibility of the Mask\-Proof pipeline\. ##### Evaluated Models\. To conduct a comprehensive evaluation, we categorize state\-of\-the\-art LLMs into two groups based on their reasoning capability\. - •Standard models:General\-purpose models that generate answers directly without explicit deliberation \(e\.g\., DeepSeek\-V3\.2, Grok\-4\.1, Qwen3\-14B\), serving as baselines to quantify the benefit of test\-time reasoning\. - •Reasoning\-enhanced models:Models enhanced for multi\-step thinking \(e\.g\., DeepSeek\-R1, Grok\-4\.1\-reasoning, Gemini\-3\-Flash\-preview\), which produce internal chains of thought before outputting the final answer\. ### 4\.2\.Effectiveness of the Automated Mask\-Proof Pipeline \(RQ1\) #### 4\.2\.1\.Main Results on Mask\-ProofBench We evaluate 17 models onMask\-ProofBench, including both reasoning and non\-reasoning categories\. FigureLABEL:fig:avg\_accuracypresents the full leaderboard\. ##### Reasoning Capability Creates a Qualitative Boundary\. The results indicate that Mask\-ProofBench clearly separates reasoning\-enhanced models from standard models\. A performance boundary emerges at approximately 32%: eight out of nine reasoning\-enhanced models \(except GPT\-OSS\-120B under low reasoning effort\) exceed this threshold, whereas all six standard models fall below it\. To further quantify this effect, we conduct paired comparisons within the same model families\. For example, Grok\-4\.1\-reasoning outperforms Grok\-4\.1 by 26\.7%, demonstrating that the benchmark reliably and effectively captures improvements attributable to enhanced reasoning capability\. ##### Highly discriminative, not saturated, and extensible\. The leaderboard exhibits a wide performance range, from a lower bound of 8\.1% \(Qwen3\-14B\) to an upper bound of 63\.3% \(GPT\-5\.2 High\) in Avg@4, indicating strong discriminative power\. Notably, these results are obtained under a single\-mask setting with our pipeline, where only one formula\-level step is masked per proof\. This design leaves substantial headroom for increasing benchmark difficulty, for example by masking multiple interdependent steps or requiring multi\-step reconstruction as model capabilities advance\. ##### No Anomalous Reversals\. As a sanity check, we verify that the ranking contains no major surprises: smaller models do not outperform larger ones, and models known to struggle with mathematics do not unexpectedly excel\. This coherence indicates that the benchmark measures a stable capability rather than task\-specific artifacts\. Figure 4\.Leaderboard distortion from random masking\.Avg@4 performance under*random*and*strategic*masking for representative models\. Random masking substantially inflates scores—especially for standard models—while agentic masking consistently lowers performance and restores meaningful separation, demonstrating the necessity of anti\-hack filtering for a discriminative benchmark\.Grouped bar chart comparing agentic and random maskingA grouped bar chart comparing Avg@4 performance of six models under agentic masking versus random masking\. Under agentic masking, scores range from 8\.1 percent for Qwen3\-14B to 49\.2 percent for DeepSeek\-V3\.2\-Thinking\. Under random masking, all scores inflate substantially: DeepSeek\-V3\.2 rises from 29\.0 to 74\.4 percent, and Qwen3\-14B from 8\.1 to 50\.6 percent\. The performance gap between reasoning\-enhanced and standard models is clearly visible under agentic masking but largely disappears under random masking\. #### 4\.2\.2\.Ablation Study: Effectiveness of the Self\-Contained Module The Mask\-Proof pipeline intentionally resolves cross\-references and expands user\-defined macros\. Manual inspection of all 292 masked instances in Mask\-ProofBench confirms that 291 out of 292 are fully self\-contained after this preprocessing\. To verify that our self\-contained recovery agent effectively addresses the context incompleteness problem, we compare the performance of three models \(Grok\-4\.1\-reasoning, DeepSeek\-V3\.2\-Thinking, and Grok\-4\.1\) on a non–self\-contained variant of Mask\-ProofBench\. Figure[5](https://arxiv.org/html/2606.15258#S4.F5)shows consistent performance degradation under the non–self\-contained variant of the benchmark\. Model performance drops by 11% to 23%, with reasoning\-enhanced models exhibiting larger declines, indicating that model performance strongly depends on access to the resolved proof context\. Notably, models still achieve moderate accuracy \(13%–35%\) without context resolution, suggesting that context incompleteness does not affect all masked instances\. This is because even in non–self\-contained proofs, not every masked step requires the missing original proof context for correct reconstruction\. However, the non\-negligible performance loss confirms that the self\-containedness of research\-level mathematical proofs affects a substantial portion of the benchmark, and that our dependency resolution effectively mitigates this issue\. Figure 5\.Self\-containedness ablation\.Removing dependency resolution causes 11%–23% performance degradation, confirming that context completeness is essential for valid evaluation\.Grouped bar chart comparing performance with and without self\-contained recoveryA grouped bar chart showing Avg@4 for three models with and without self\-contained recovery\. With self\-contained recovery, Grok\-4\.1\-reasoning scores 50\.9 percent, DeepSeek\-V3\.2\-Thinking 49\.2 percent, and Grok\-4\.1 24\.2 percent\. Without self\-contained recovery, performance drops to 27\.8, 35\.0, and 13\.0 percent respectively, representing declines of 11 to 23 percentage points\. #### 4\.2\.3\.Ablation Study: Effectiveness of Our Agentic\-Masking A valid instance of mask\-in\-the\-context problem must resist shortcut solutions\. If models can achieve high scores by exploiting superficial cues \(e\.g\. positional patterns, local context copying\), the evaluation fails to measure genuine reasoning\. We validate our mask instances by comparing against random step selection\. ##### Agentic Masking Enables Reliable Model Differentiation\. To evaluate the effectiveness of our agentic masking strategy, we introduce a random\-masking baseline\. The only difference between random masking baseline and Mask\-ProofBench is that the former selects mask steps uniformly at random, without any position\-selection preferences\. By comparing LLM performance on Mask\-ProofBench and this baseline, we show that our mask selection approach leads to more reliable differentiation of models based on their reasoning capability\. Figure[4](https://arxiv.org/html/2606.15258#S4.F4)shows that random masking substantially inflates Avg@4 across models\. For instance, DeepSeek\-V3\.2 rises from 29\.0% to 74\.4% \(\+45\.4%\), and Qwen3\-14B from 8\.1% to 50\.6% \(\+42\.5%\)\. Crucially, the score inflation caused by random masking is highly asymmetric\. standard models gain disproportionately more, leading to a strong compression of the reasoning advantage observed under agentic masking\. In the DeepSeek family, agentic masking yields a clear performance separation, with DeepSeek\-V3\.2\-Thinking achieving 49\.2% and DeepSeek\-V3\.2 achieving 29\.0%\. Under random masking, both models reach nearly identical performance levels \(74\.0% vs\. 74\.4%\), eliminating the gap observed under agentic masking\. This suggests that a large fraction of randomly masked steps can be completed without true multi\-step reasoning, relying instead on superficial pattern cues\. Furthermore, random masking leads to unstable and misleading model rankings, where weaker standard models can match or even surpass stronger reasoning\-enhanced models, contradicting the expected ranking based on model scale and reasoning capability\. ##### Truncate ablation\. To address concerns that later proof lines may reveal the masked step, we conduct a separate leakage audit on a fixed diagnostic subset\. For each example, we construct a truncated variant by removing the suffix context after the mask, while retaining one complete following mathematical expression to preserve local proof coherence\. On this leakage\-audit subset, Qwen3\-30B\-A3B\-Instruct changes from 23\.79% to 23\.14% Avg@4, and Qwen3\-30B\-A3B\-Thinking changes from 24\.35% to 20\.79% Avg@4\. The performance does not collapse under suffix truncation, suggesting that suffix context is not the dominant source of model performance\. We interpret this result conservatively: the audit mitigates, but does not eliminate, the possibility of residual leakage in individual cases\. ##### Expert validation of mask selection\. To validate that selected masks correspond to meaningful inference\-critical steps, we collected 63 domain\-expert annotations across eight mathematical areas\. Experts assessed formula\-level completeness, global reasoning dependency, reasoning necessity, and resistance to shortcut completion\. The mean rubric scores were consistently high across algebraic geometry, algebraic topology, PDE/analysis, classical analysis and ODEs, combinatorics, differential geometry, numerical analysis, and optimization/control, supporting that the selected masks are generally valid reasoning targets rather than superficial infilling tasks\. The detailed breakdown is provided in Appendix[B](https://arxiv.org/html/2606.15258#A2)\. ##### Shortcutability and difficulty analysis\. To clarify which masked steps are genuinely challenging versus easily predictable, and to characterize the benchmark’s effective difficulty distribution, we conducted two analyses\. First, we masked every candidate position in 30 expert\-selected representative proofs and analyzed the shortcut rate of each step type by monitoring LLMs’ solving processes alongside small\-model performance\. Figure[6](https://arxiv.org/html/2606.15258#S4.F6)\(a\) reports these rates: as hypothesized, algebraic manipulations are significantly more shortcut\-prone than lemma/theorem applications \(77\.9% vs\. 52\.9%\), and bookkeeping/closure reaches 90\.0%\. This asymmetry explains why random masking disproportionately inflates standard\-model scores in Figure[4](https://arxiv.org/html/2606.15258#S4.F4)\. Second, we evaluated three representative models \(GPT\-5\.2 High, Gemini\-3\-Flash\-preview, and Grok\-4\.1\-reasoning\) on Mask\-ProofBench and report the mean Avg@4 averaged across the three models\. Figure[6](https://arxiv.org/html/2606.15258#S4.F6)\(b\) and \(c\) show that structural transformations and lemma/theorem applications are the hardest step types \(mean Avg@4 of 36\.2% and 37\.0%\), and case analysis/reduction is the hardest proof family \(24\.3%\), compared to direct proof at 58\.1%\. Figure 6\.Shortcutability and difficulty breakdowns\. \(a\) Shortcut rates across step types, measured on 30 expert\-selected proofs; higher values indicate more shortcut\-solvable candidates\. \(b\), \(c\) Mean Avg@4 on Mask\-ProofBench by step type and proof family; lower values indicate harder reconstruction\.Three\-panel breakdown figureThe first panel shows shortcut rates by candidate category, with lemma or theorem application the lowest at 52\.9 percent and bookkeeping or closure the highest at 90\.0 percent\. The second panel shows mean Avg@4 by step type, ranging from 36\.2 percent for structural transformation to 64\.6 percent for bookkeeping or closure\. The third panel shows mean Avg@4 by proof family, ranging from 24\.3 percent for case analysis or reduction to 58\.1 percent for direct proof\. Table 1\.Source\-level breakdown of our Mask\-Proof \(Avg@4, %\) compared with the official expert evaluation \(% of total points\) on IMO\-ProofBench\-Advanced\. ### 4\.3\.Reproducibility of the Mask\-Proof Pipeline \(RQ2\) #### 4\.3\.1\.Self\-Contained Recovery Reproducibility To evaluate the reproducibility of our self\-contained module, we run the pipeline twice independently on 20 sampled proofs from Mask\-ProofBench\. Three of these proofs are intrinsically self\-contained and require no additional context\. Explicit dependencies\.For references with explicit labels \(via\\ref,\\eqref\), we directly compare the resolved label sets\. The two runs achieve 97\.2% recall: out of 36 unique labels across both runs, 35 appear in both\. The single discrepancy is an optional theorem reference included by one run but not the other\. The high BLEU scores indicate that independently recovered Additional Information blocks are nearly identical across runs, suggesting that the implicit dependency resolution process is stable rather than sensitive to stochastic variation\.Implicit dependencies\.For implicit dependencies—such as symbols with closed\-form definitions elsewhere—no ground\-truth labels exist for direct comparison\. We use textual similarity as a proxy: if two runs resolve the same implicit dependencies, they should produce similar Additional Information blocks\. The symmetric BLEU\-4 achieves a mean of 0\.900 and median of 0\.974, with all 20 proofs exceeding the 0\.3 threshold\. #### 4\.3\.2\.Mask Selection Reproducibility To evaluate the reproducibility of mask selection, we run the agentic masking module twice independently on the same 20 proofs used in Section[4\.3\.1](https://arxiv.org/html/2606.15258#S4.SS3.SSS1)\. Both runs produce 28 masks in total\. We measure reproducibility by checking how many masks from the first run appear in the second\. After normalizing each masked formula—stripping outer math delimiters \(e\.g\.,$\.\.\.$,\\\[\.\.\.\\\]\) and whitespace—we check for exact matches within the same proof: 19 out of 28 masks \(67\.9%\) are reproduced\. We manually annotate all 9 non\-reproduced masks into three categories: - •Semantically proximate \(6/9\):The two runs select different points along the same formula chain, e\.g\., an intermediate equation vs\. the final result\. - •Same proof, different sub\-step \(1/9\):Both masks address related reasoning but test different skills, e\.g\., an error bound vs\. the parameter choice that optimizes it\. - •Genuinely different \(2/9\):The masks target unrelated mathematical content\. Counting semantically proximate cases as successful reproduction, the consistency rate rises from 67\.9% to 89\.3%\. Only 2 out of 28 masks \(7\.1%\) target genuinely different content\. Taken together, these results answer RQ2 by showing that the Mask\-Proof pipeline is reproducible at both the self\-contained recovery and mask selection stages\. Independent runs produce highly consistent self\-contained contexts and largely overlapping masks, with most discrepancies reflecting minor semantic variations rather than genuinely different problem constructions\. For released runs, we record the Codex CLI version, model setting, prompt, input hash, and deterministic post\-processing scripts so that LLM\-assisted stages can be rerun and audited even though they are not bitwise deterministic\. ### 4\.4\.Consistency Between Human Experts and Mask\-ProofJudge \(RQ3\) Since our benchmark masks pure LaTeX mathematical formulas—where a single ground truth may have multiple semantically equivalent representations \(e\.g\., notation variants, unsimplified terms\)—we validate that Mask\-ProofJudge can reliably assess equivalence\. We constructed a validation set of 1273 samples randomly drawn from actual responses of four different models\(Gemini\-3\-Flash\-preview, DeepSeek\-V3\.2\-Thinking, Kimi\-K2, and Grok\-4\.1\), containing a balanced mix of correct answers and plausible but incorrect formulas\. Two domain experts independently labeled the set by checking mathematical equivalence against the ground truth\. Comparing Mask\-ProofJudge’s majority\-vote verdicts to expert consensus, GPT\-OSS\-120B\(High\) achieves96\.8% accuracy, indicating that it reliably distinguishes benign notational variation from substantive mathematical errors\. ### 4\.5\.Cross\-Dataset Generalization Ability \(RQ4\) To evaluate whether the Mask\-Proof pipeline generalizes beyond our curated benchmark, we apply it to an external dataset with existing expert evaluations, enabling direct comparison between our automated mask\-based scoring and human expert judgments\. We apply the Mask\-Proof pipeline to all 30 PB\-Advanced proofs from IMO\-ProofBench\(Luonget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib19)\), producing 35 masks across three sources: Novel \(22\), IMO2024 \(8\), and USAMO2025 \(5\)\. We evaluate seven models using both our mask\-based Avg@4 metric and the official expert scores\. Table[1](https://arxiv.org/html/2606.15258#S4.T1)presents the source\-level breakdown and ranking consistency measured by Spearman’sρ\\rho\. The mask\-based evaluation achieves Spearmanρ=0\.79\\rho=0\.79overall, reaching0\.890\.89on USAMO2025 and0\.860\.86on Novel\. This correlation is obtained entirely automatically—no human grading, no problem\-specific rubrics, and no per\-task scoring guidelines—whereas IMO\-ProofBench\(Luonget al\.,[2025](https://arxiv.org/html/2606.15258#bib.bib19)\)expert evaluation requires domain experts to design rubrics and manually grade every response\. Moreover, Mask\-ProofJudge uses GPT\-OSS\-120B High as an open\-source judge, rather than relying on frontier proprietary models or per\-response human grading\. Despite this modest resource requirement, the resulting model rankings closely track expert consensus, demonstrating that Mask\-Proof’s discriminative power comes primarily from agentic masking of key formula steps rather than expensive evaluation infrastructure\. The moderate correlation on IMO2024 \(ρ=0\.73\\rho=0\.73\) reflects a granularity mismatch: expert evaluations consider holistic proof structure and award partial credit for incomplete solutions, while mask\-based evaluation targets localized step\-level reasoning\. These two paradigms are complementary—Mask\-Proof excels at scalable, reproducible assessment of specific reasoning steps, while expert evaluation captures broader proof quality dimensions\. Overall, Mask\-Proof generalizes effectively to external proof sources and produces model rankings consistent with expert consensus, positioning it as a practical, scalable alternative for evaluating LLMs on research\-level mathematical reasoning\. ## 5\.Conclusion In this work, we presentedMask\-Proof, an automatic pipeline for generating refreshable masked proof tasks and evaluating large language models on mathematical proofs\. Mask\-Proof generatedMask\-ProofBench, a curated benchmark of 292 masked proof\-step instances spanning 35 mathematical fields\. Extensive experiments show thatMask\-Proofcan remove shortcut\-solvable instances, and evaluations ofMask\-ProofBenchand Masked IMO\-ProofBench reveal its effectiveness, reproducibility, and generalization across diverse proof sources\. ## 6\.Limitations and Ethical Considerations Limitations\.Mask\-Proof currently relies on raw LaTeX sources; extending coverage to PDFs, scanned documents, and web\-native mathematical text would require reliable mathematical OCR or multimodal parsing\. The LLM\-assisted components for dependency recovery and mask selection have been validated through reproducibility analyses and expert review, though the process is not bitwise deterministic, and our leakage and shortcutability analyses mitigate but do not eliminate the possibility of residual cues\. The LLM\-based judge \(96\.8% expert agreement\) may also inherit biases from its underlying model\. Ethical Considerations\.All source papers are drawn from arXiv under licenses permitting redistribution for research; we release only extracted proof fragments with attribution\. To mitigate benchmark contamination, we restrict our corpus to the most recent three months of submissions and recommend periodic refresh\. This work involves no human subjects\. ###### Acknowledgements\. We thank the anonymous reviewers for their constructive feedback\. This work was supported by the “111 Center” \(No\. B26023\)\. ## References - S\. Bubeck, C\. Coester, R\. Eldan, T\. Gowers, Y\. T\. Lee, A\. Lupsasca, M\. Sawhney, R\. Scherrer, M\. Sellke, B\. K\. Spears, D\. Unutmaz, K\. Weil, S\. Yin, and N\. Zhivotovskiy \(2025\)Early science acceleration experiments with gpt\-5\.External Links:2511\.16072,[Link](https://arxiv.org/abs/2511.16072)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.p1.1),[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - G\. Chen, J\. Wu, X\. Chen, W\. X\. Zhao, R\. Song, C\. Li, K\. Fan, D\. Liu, and M\. Liao \(2026\)ReForm: reflective autoformalization with prospective bounded sequence optimization\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=KfxRzCmRSX)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1)\. - W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2022\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.External Links:2211\.12588,[Link](https://arxiv.org/abs/2211.12588)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px1.p1.1)\. - K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168,[Link](https://arxiv.org/abs/2110.14168)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - B\. Gao, F\. Song, Z\. Yang, Z\. Cai, Y\. Miao, Q\. Dong, L\. Li, C\. Ma, L\. Chen, R\. Xu, Z\. Tang, B\. Wang, D\. Zan, S\. Quan, G\. Zhang, L\. Sha, Y\. Zhang, X\. Ren, T\. Liu, and B\. Chang \(2025\)Omni\-MATH: a universal olympiad level mathematic benchmark for large language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=yaqPf0KAlN)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - R\. Geirhos, J\. Jacobsen, C\. Michaelis, R\. Zemel, W\. Brendel, M\. Bethge, and F\. A\. Wichmann \(2020\)Shortcut learning in deep neural networks\.Nature Machine Intelligence2\(11\),pp\. 665–673\.External Links:[Document](https://dx.doi.org/10.1038/s42256-020-00257-z)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - E\. Glazer, E\. Erdil, T\. Besiroglu, D\. Chicharro, E\. Chen, A\. Gunning, C\. F\. Olsson, J\. Denain, A\. Ho, E\. de Oliveira Santos, O\. Järviniemi, M\. Barnett, R\. Sandler, M\. Vrzala, J\. Sevilla, Q\. Ren, E\. Pratt, L\. Levine, G\. Barkley, N\. Stewart, B\. Grechuk, T\. Grechuk, S\. V\. Enugandla, and M\. Wildon \(2024\)FrontierMath: a benchmark for evaluating advanced mathematical reasoning in AI\.External Links:2411\.04872,[Link](https://arxiv.org/abs/2411.04872)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - C\. He, R\. Luo, Y\. Bai, S\. Hu, Z\. Thai, J\. Shen, J\. Hu, X\. Han, Y\. Huang, Y\. Zhang, J\. Liu, L\. Qi, Z\. Liu, and M\. Sun \(2024\)OlympiadBench: a challenging benchmark for promoting AGI with olympiad\-level bilingual multimodal scientific problems\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 3828–3850\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.211)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InProceedings of the Neural Information Processing Systems Track on Datasets and Benchmarks,External Links:[Link](https://arxiv.org/abs/2103.03874)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou \(2024\)Large language models cannot self\-correct reasoning yet\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=IkmD3fKBPQ)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - P\. Ke, B\. Wen, A\. Feng, X\. Liu, X\. Lei, J\. Cheng, S\. Wang, A\. Zeng, Y\. Dong, H\. Wang, J\. Tang, and M\. Huang \(2024\)CritiqueLLM: towards an informative critique generation model for evaluation of large language model generation\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 13034–13054\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.acl-long.704)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - S\. Kim, J\. Shin, Y\. Cho, J\. Jang, S\. Longpre, H\. Lee, S\. Yun, S\. Shin, S\. Kim, J\. Thorne, and M\. Seo \(2024\)Prometheus: inducing fine\-grained evaluation capability in language models\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=8euJaTveKw)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2024\)Let’s verify step by step\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=v8L0pN6EOi)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - W\. Liu, Z\. Lu, X\. Hu, J\. Zhang, D\. Li, J\. Cen, H\. Cao, H\. Wang, Y\. Li, X\. Kun, D\. Li, P\. Zhang, C\. Zhang, Y\. Ren, X\. Huang, and Y\. Ma \(2025\)STORM\-BORN: a challenging mathematical derivations dataset curated via a human\-in\-the\-loop multi\-agent framework\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 23938–23958\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.findings-acl.1227)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1)\. - L\. Luo, Y\. Liu, R\. Liu, S\. Phatale, M\. Guo, H\. Lara, Y\. Li, L\. Shu, Y\. Zhu, L\. Meng, J\. Sun, and A\. Rastogi \(2024\)Improve mathematical reasoning in language models by automated process supervision\.External Links:2406\.06592,[Link](https://arxiv.org/abs/2406.06592)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - T\. Luong, D\. Hwang, H\. H\. Nguyen, G\. Ghiasi, Y\. Chervonyi, I\. Seo, J\. Kim, G\. Bingham, J\. Lee, S\. Mishra, A\. Zhai, H\. Hu, H\. Michalewski, J\. Kim, J\. Ahn, J\. Bae, X\. Song, T\. H\. Trinh, Q\. V\. Le, and J\. Jung \(2025\)Towards robust mathematical reasoning\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 35418–35442\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.emnlp-main.1794)Cited by:[Figure 1](https://arxiv.org/html/2606.15258#S1.F1),[Figure 1](https://arxiv.org/html/2606.15258#S1.F1.3.2),[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1),[2nd item](https://arxiv.org/html/2606.15258#S4.I1.i2.p1.1),[§4\.5](https://arxiv.org/html/2606.15258#S4.SS5.p2.1),[§4\.5](https://arxiv.org/html/2606.15258#S4.SS5.p3.3)\. - W\. Ma, A\. Cojocaru, N\. Kolhe, B\. Louie, R\. S\. Sharif, H\. Zhang, V\. Zhuang, M\. Zaharia, and S\. Min \(2025\)Reliable fine\-grained evaluation of natural language math proofs\.External Links:2510\.13888,[Link](https://arxiv.org/abs/2510.13888)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px1.p1.1),[§1](https://arxiv.org/html/2606.15258#S1.p1.1)\. - S\. Pandit, A\. Xu, X\. Nguyen, Y\. Ming, C\. Xiong, and S\. Joty \(2025\)Hard2Verify: a step\-level verification benchmark for open\-ended frontier math\.External Links:2510\.13744,[Link](https://arxiv.org/abs/2510.13744)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.p1.1)\. - J\. Schmitt, G\. Bérczi, J\. Dekoninck, J\. Feusi, T\. Gehrunger, R\. Appenzeller, J\. Bryan, N\. Canova, T\. de Wolff, F\. Gaia, M\. van Garrel, B\. Hashemi, D\. Holmes, A\. Iribar Lopez, V\. Jaeck, M\. Jørgensen, S\. Kelk, S\. Kuhlmann, A\. Kurpisz, C\. Meroni, I\. Metzler, M\. Möller, S\. Muñoz\-Echániz, R\. Nowak, G\. Oberdieck, D\. Platt, D\. Possamaï, G\. Ribeiro, R\. Sánchez Galán, Z\. Sun, J\. Teichmann, R\. P\. Thomas, and C\. Vial \(2025\)IMProofBench: benchmarking AI on research\-level mathematical proof generation\.External Links:2509\.26076,[Link](https://arxiv.org/abs/2509.26076)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1)\. - A\. Shankar, A\. Siad, and A\. A\. Swaminathan \(2025\)Counting integral points on symmetric varieties with applications to arithmetic statistics\.Proceedings of the London Mathematical Society130\(4\),pp\. e70039\.External Links:[Document](https://dx.doi.org/10.1112/plms.70039)Cited by:[§3\.4](https://arxiv.org/html/2606.15258#S3.SS4.p1.1),[§3\.4](https://arxiv.org/html/2606.15258#S3.SS4.p2.2)\. - Z\. Shao, Y\. Luo, C\. Lu, Z\. Z\. Ren, J\. Hu, T\. Ye, Z\. Gou, S\. Ma, and X\. Zhang \(2025\)DeepSeekMath\-V2: towards self\-verifiable mathematical reasoning\.External Links:2511\.22570,[Link](https://arxiv.org/abs/2511.22570)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - A\. S\. Thakur, K\. Choudhary, V\. S\. Ramayapally, S\. Vaidyanathan, and D\. Hupkes \(2025\)Judging the judges: evaluating alignment and vulnerabilities in LLMs\-as\-judges\.External Links:2406\.12624,[Link](https://arxiv.org/abs/2406.12624)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - T\. H\. Trinh, Y\. Wu, Q\. V\. Le, H\. He, and T\. Luong \(2024\)Solving olympiad geometry without human demonstrations\.Nature625\(7995\),pp\. 476–482\.External Links:[Document](https://dx.doi.org/10.1038/s41586-023-06747-5)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.p1.1)\. - G\. Tsoukalas, J\. Lee, J\. Jennings, J\. Xin, M\. Ding, M\. Jennings, A\. Thakur, and S\. Chaudhuri \(2024\)PutnamBench: evaluating neural theorem\-provers on the putnam mathematical competition\.InNeurIPS 2024 Track on Datasets and Benchmarks,External Links:[Link](https://openreview.net/forum?id=ChKCF75Ocd)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1)\. - G\. Tyen, H\. Mansoor, V\. Carbune, P\. Chen, and T\. Mak \(2024\)LLMs cannot find reasoning errors, but can correct them given the error location\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 13894–13908\.External Links:[Document](https://dx.doi.org/10.18653/v1/2024.findings-acl.826)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px3.p1.1)\. - J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, b\. ichter, F\. Xia, E\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35,pp\. 24824–24837\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.p1.1)\. - B\. Yang, Q\. Yang, Y\. Ma, and R\. Liu \(2025\)UTMath: math evaluation with unit test via reasoning\-to\-coding thoughts\.External Links:2411\.07240,[Link](https://arxiv.org/abs/2411.07240)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px1.p1.1)\. - L\. Yu, W\. Jiang, H\. Shi, J\. Yu, Z\. Liu, Y\. Zhang, J\. T\. Kwok, Z\. Li, A\. Weller, and W\. Liu \(2023\)MetaMath: bootstrap your own mathematical questions for large language models\.External Links:2309\.12284,[Link](https://arxiv.org/abs/2309.12284)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - J\. Zhang, C\. Petrui, K\. Nikolić, and F\. Tramèr \(2025\)RealMath: a continuous benchmark for evaluating language models on research\-level mathematics\.InNeurIPS 2025 Track on Datasets and Benchmarks,External Links:[Link](https://openreview.net/forum?id=RBssYVpQEr)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px1.p1.1)\. - C\. Zheng, Z\. Zhang, B\. Zhang, R\. Lin, K\. Lu, B\. Yu, D\. Liu, J\. Zhou, and J\. Lin \(2025\)ProcessBench: identifying process errors in mathematical reasoning\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 1009–1024\.External Links:[Document](https://dx.doi.org/10.18653/v1/2025.acl-long.50)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.p1.1)\. - K\. Zheng, J\. M\. Han, and S\. Polu \(2022\)miniF2F: a cross\-system benchmark for formal olympiad\-level mathematics\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=9ZPegFuFTFv)Cited by:[§2](https://arxiv.org/html/2606.15258#S2.SS0.SSS0.Px2.p1.1)\. - L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. P\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36,pp\. 46595–46623\.External Links:[Link](https://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. - K\. Zhu, J\. Chen, J\. Wang, N\. Z\. Gong, D\. Yang, and X\. Xie \(2024\)DyVal: dynamic evaluation of large language models for reasoning tasks\.InInternational Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=gjfOL9z5Xr)Cited by:[§1](https://arxiv.org/html/2606.15258#S1.SS0.SSS0.Px2.p1.1)\. ## Appendix AMask Rubrics\. Our mask rubrics are as follows: - •Formula\-level completeness\.Each mask must correspond to a*complete mathematical expression or formula*, rather than a fragment of an equation or a syntactic sub\-component\. Partial masking of symbols, coefficients, or operators is disallowed\. - •Global reasoning dependency\.A valid mask should not be solvable by inspecting only its immediate local context\. Correct recovery must require integrating information from the theorem statement, earlier definitions or lemmas, and the global logical structure of the proof\. - •Reasoning density and necessity\.Masked steps should represent logically critical steps \(e\.g\., non\-trivial formulas, structural identities, or key transformations\) whose absence would break the proof\. Trivial algebraic rearrangements or routine substitutions are excluded\. - •Non\-templated and non\-heuristic\.Masks must not be recoverable via pattern matching, standard proof templates, symmetry guessing, backward inference from later lines, or memorized formula instantiation\. Any mask that can be solved without genuine mathematical reasoning is considered invalid\. - •Quality\-controlled multiplicity\.For each proof, we allow between0and33masks, determined by the proof’s conceptual depth, technical sophistication, and reasoning density\. Proofs that are low\-quality, overly short, informal, or mathematically trivial may yield zero masks\. ## Appendix BAdditional Validation Details ##### Expert validation of mask positions\. Table[2](https://arxiv.org/html/2606.15258#A2.T2)summarizes the 63 domain\-expert annotations used to validate whether agentically selected masks are meaningful reasoning targets rather than superficial completion tasks\. Table 2\.Expert validation of mask positions across mathematical areas\. ##### Truncation leakage audit\. Table[3](https://arxiv.org/html/2606.15258#A2.T3)reports the suffix\-truncation audit used to test whether later proof context is the dominant source of model performance\. These results should be read as evidence that the risk is mitigated, not as a proof that leakage is impossible\. Table 3\.Suffix\-truncation leakage audit on the leakage\-audit split\. ## Appendix CData format of Mask\-Proof Samples\. ##### Data format We serialize the processed samples injsonl\. Each record contains a unique identifier and the paired question/solution: ``` { "index": 6, "arxiv_id": "...", "proof_index": 8, "problem": <<PROBLEM TO PROVE>>, "reference_solution": <<PROOF>>, "mask_text": <<PROOF CONTAINS 1 MASK>>, "mask_content": <<ORIGINAL TEXT IN THE MASK>> } ``` Heremask\_textis the masked prompt \(typically formula\-only masks\), andmask\_contentis the corresponding ground\-truth proof fragment\. ## Appendix DData Distribution Table 4\.Percentage distribution of the top arXiv subject categories of papers used in Mask\-ProofBench\. ## Appendix ECase Study Masked Proof: ###### Proof\. We prove the result forW\(K\)W\(K\); the proof forW∨\(K\)W^\{\\vee\}\(K\)is identical\. Denote the coefficients offfbya,b,c,a,b,c,anddd, and assume \(by replacingffwith aGL2\(K\)\\mathrm\{GL\}\_\{2\}\(K\)\-translate, if necessary\) thatad≠0ad\\neq 0\. First, it is easy to check that there exists a distinguished orbit \(resp\. aΔ\\Delta\-distinguished orbit\) inW\(K\)W\(K\)\. The distinguished orbit is represented by the following pair: \(A,B\)=\(\(00120−a0120−c\),\(012012b000d\)\)\.\(A,B\)=\\left\(\\begin\{pmatrix\}0&0&\\tfrac\{1\}\{2\}\\\\ 0&\-a&0\\\\ \\tfrac\{1\}\{2\}&0&\-c\\end\{pmatrix\},\\begin\{pmatrix\}0&\\tfrac\{1\}\{2\}&0\\\\ \\tfrac\{1\}\{2\}&b&0\\\\ 0&0&d\\end\{pmatrix\}\\right\)\.TheΔ\\Delta\-distinguished orbit with resolventffis represented by the following pair: \(1\)\(A,B\)=\(\(−a00001/201/2b/4ad\),\(001/20d01/20−c/4ad\)\)\.\(A,B\)=\\left\(\\begin\{pmatrix\}\-a&0&0\\\\ 0&0&1/2\\\\ 0&1/2&b/4ad\\end\{pmatrix\},\\begin\{pmatrix\}0&0&1/2\\\\ 0&d&0\\\\ 1/2&0&\-c/4ad\\end\{pmatrix\}\\right\)\.It only remains to prove uniqueness\. For the distinguished case, letLLbe the étale cubic extension ofKKcorresponding toffunder the Delone–Faddeev bijection\.SL3\(K\)\\mathrm\{SL\}\_\{3\}\(K\)\-orbits onW\(K\)∩Res−1\(f\)W\(K\)\\cap\\mathrm\{Res\}^\{\-1\}\(f\)correspond bijectively to étale quartic extensions ofKKwith resolventffby Bhargava’s parametrization\. An orbit is distinguished if and only if it corresponds toL⊕KL\\oplus K, and so there is a unique distinguished orbit\. We give a much more hands\-on proof of uniqueness of theΔ\\Delta\-distinguished orbit\. Let\(A,B\)\(A,B\)\(with coefficientsaija\_\{ij\}andbijb\_\{ij\}\) be an element in aΔ\\Delta\-distinguished orbit withRes\(A,B\)=f\\mathrm\{Res\}\(A,B\)=f\. By the definition of beingΔ\\Delta\-distinguished, there exists aℙK1⊂ℙK2\\mathbb\{P\}^\{1\}\_\{K\}\\subset\\mathbb\{P\}^\{2\}\_\{K\}such that the restriction ofAAandBBto thisℙK1\\mathbb\{P\}^\{1\}\_\{K\}give reducible binary quadratic forms\. \(Note that neither of these binary forms can be0, otherwise we would haveΔ\(A,B\)=0\\Delta\(A,B\)=0\.\) By replacing\(A,B\)\(A,B\)with anSL3\(K\)\\mathrm\{SL\}\_\{3\}\(K\)\-translate, we may assume thatℙK1\\mathbb\{P\}^\{1\}\_\{K\}is the subspace\(∗,∗,0\)\(\\ast,\\ast,0\), and we may hence assume thata12=a22=b11=b12=0a\_\{12\}=a\_\{22\}=b\_\{11\}=b\_\{12\}=0\. Sincea11a\_\{11\}andb22b\_\{22\}are nonzero, we may use anotherSL3\(K\)\\mathrm\{SL\}\_\{3\}\(K\)\-transformation to further assume thata13=b23=0a\_\{13\}=b\_\{23\}=0\. Then we have a/4=det\(A\)=−a11a232and−d/4=det\(B\)=−b22b132\.a/4=\\det\(A\)=\-a\_\{11\}a\_\{23\}^\{2\}\\quad\\text\{and\}\\quad\-d/4=\\det\(B\)=\-b\_\{22\}b\_\{13\}^\{2\}\.Thus, with anotherSL3\(K\)\\mathrm\{SL\}\_\{3\}\(K\)\-transformation, we can ensure thata11=−aa\_\{11\}=\-a,b22=db\_\{22\}=d, anda23=b13=12a\_\{23\}=b\_\{13\}=\\tfrac\{1\}\{2\}\. Finally, the coefficientsa33a\_\{33\}andb33b\_\{33\}are uniquely determined by the values ofbbandcc\. The lemma follows\. ∎
Similar Articles
MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis
MA-ProofBench is a new formal benchmark for evaluating LLMs on theorem proving in mathematical analysis, containing 200 problems across two difficulty levels. The best model, GPT-5.5, achieves only 16% on Level I and 5% on Level II, highlighting a significant gap between informal and formal reasoning.
Evaluating Research-Level Math Proofs via Strict Step-Level Verification
This paper introduces a strict step-level verification framework for evaluating research-level mathematical proofs using LLMs, addressing context poisoning and outperforming global evaluation. The approach shifts focus to deductive constraints and reveals that remaining errors are often due to pedantic hyper-rigor, exposing implicit ambiguities in benchmarks.
MaxProof: Scaling Mathematical Proof with Generative-Verifier RL and Population-Level Test-Time Scaling
MaxProof is a test-time scaling framework that enhances mathematical proof generation using a generative verifier and population-level search, achieving scores exceeding human gold-medal thresholds on IMO 2025 and USAMO 2026.
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new benchmark of 439 research-level math problems curated by mathematicians to evaluate the reasoning capabilities of frontier LLMs, highlighting significant gaps in solving advanced problems and recognizing ill-posed questions.
@rohanpaul_ai: Another great paper from Google. Shows general LLMs can solve formal math by planning proofs and checking each step. Ra…
A new Google paper introduces LEAP, an agentic framework that enables general LLMs to solve formal math problems by planning proofs and checking each step, raising performance from under 10% to 70% on the Lean IMO benchmark and solving all 2025 Putnam problems.