Evaluating the Robustness of Proof Autoformalization in Lean 4

arXiv cs.CL Papers

Summary

This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.

arXiv:2606.14867v1 Announce Type: new Abstract: Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4. Several works have developed LLM-based models for proof autoformalization. However, existing evaluations have typically focused on translating well-formed informal proofs from curated datasets. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own. We build a benchmark with both perturbations on miniF2F and MATH-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations. Code and data are available via https://github.com/ucr-rai/robust-proof-autoformalization.
Original Article
View Cached Full Text

Cached at: 06/16/26, 11:43 AM

# Evaluating the Robustness of Proof Autoformalization in Lean 4
Source: [https://arxiv.org/abs/2606.14867](https://arxiv.org/abs/2606.14867)
[View PDF](https://arxiv.org/pdf/2606.14867)

> Abstract:Proof autoformalization aims to translate a mathematical informal proof written in natural language into a formal proof in a formal language such as Lean~4\. Several works have developed LLM\-based models for proof autoformalization\. However, existing evaluations have typically focused on translating well\-formed informal proofs from curated datasets\. We argue that a robust proof autoformalizer must remain faithful even for informal proofs that diverge from these idealized ones, and we present the first study on the robustness of proof autoformalization models\. We formulate two categories of perturbations and evaluate robustness under each: a global perturbation paraphrases the informal proof in a different style, under which the formalization should remain consistent; a local perturbation alters a value, symbol, or proof step, possibly in a counterfactual way, and a robust formalization should faithfully reflect the perturbation rather than reverting to the original one or inferring a different one on its own\. We build a benchmark with both perturbations on miniF2F and MATH\-500, and automatically measure how stable a proof autoformalization's correctness is under global perturbations and how faithfully its output reflects local perturbations\. We evaluate seven recent models, all of which are sensitive to global perturbations and mostly fail to remain faithful under local perturbations\. Code and data are available via[this https URL](https://github.com/ucr-rai/robust-proof-autoformalization)\.

## Submission history

From: Zhouxing Shi \[[view email](https://arxiv.org/show-email/b3912353/2606.14867)\] **\[v1\]**Fri, 12 Jun 2026 18:10:21 UTC \(246 KB\)

Similar Articles

Discover and Prove: An Open-source Agentic Framework for Hard Mode Automated Theorem Proving in Lean 4

arXiv cs.CL

This paper introduces Discover and Prove (DAP), an open-source agentic framework for automated theorem proving in Lean 4 that tackles 'Hard Mode' problems where the answer must be discovered independently before formal proof construction. The work releases new Hard Mode benchmark variants and achieves state-of-the-art results while revealing a significant gap between LLM answer accuracy (>80%) and formal prover success (<10%).

Evaluating Research-Level Math Proofs via Strict Step-Level Verification

arXiv cs.AI

This paper introduces a strict step-level verification framework for evaluating research-level mathematical proofs using LLMs, addressing context poisoning and outperforming global evaluation. The approach shifts focus to deductive constraints and reveals that remaining errors are often due to pedantic hyper-rigor, exposing implicit ambiguities in benchmarks.

OProver: A Unified Framework for Agentic Formal Theorem Proving

Hugging Face Daily Papers

OProver is a unified framework for agentic formal theorem proving in Lean 4 that iteratively improves proof generation through training with verified proofs and compiler feedback, achieving state-of-the-art results on multiple benchmarks.