Evaluating the Robustness of Proof Autoformalization in Lean 4
Summary
This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.
View Cached Full Text
Cached at: 06/16/26, 11:43 AM
# Evaluating the Robustness of Proof Autoformalization in Lean 4 Source: [https://arxiv.org/abs/2606.14867](https://arxiv.org/abs/2606.14867) [View PDF](https://arxiv.org/pdf/2606.14867) > Abstract:Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4\. Several works have developed LLM\-based models for proof autoformalization\. However, existing evaluations have typically focused on translating well\-formed informal proofs from curated datasets\. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models\. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own\. We build a benchmark with both perturbations on miniF2F and MATH\-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations\. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations\. Code and data are available via[this https URL](https://github.com/ucr-rai/robust-proof-autoformalization)\. ## Submission history From: Zhouxing Shi \[[view email](https://arxiv.org/show-email/b3912353/2606.14867)\] **\[v1\]**Fri, 12 Jun 2026 18:10:21 UTC \(246 KB\)
Similar Articles
Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4
This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).
Evaluating Research-Level Math Proofs via Strict Step-Level Verification
This paper introduces a strict step-level verification framework for evaluating research-level mathematical proofs using LLMs, addressing context poisoning and outperforming global evaluation. The approach shifts focus to deductive constraints and reveals that remaining errors are often due to pedantic hyper-rigor, exposing implicit ambiguities in benchmarks.
Sorries Are Not the Hard Part: An Expert-Review Case Study of a Semi-Autonomous Formalization
This paper presents a case study of using a large language model (Claude Code) to formalize Grothendieck's vanishing theorem in the Lean theorem prover. It finds that while agents can produce verified code, they struggle with definitions and API design, emphasizing the need for expert review beyond mere compilation.
Using algebra and LLMs to verify a flight-plan bug fix in Lean
A developer uses LLMs and algebraic reformulation to formally verify a bug fix for the 2023 UK air traffic control meltdown in the Lean proof assistant, finding that LLMs are great at grinding proofs but poor at specifications.
OProver: A Unified Framework for Agentic Formal Theorem Proving
OProver is a unified framework for agentic formal theorem proving in Lean 4 that iteratively improves proof generation through training with verified proofs and compiler feedback, achieving state-of-the-art results on multiple benchmarks.