The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization

arXiv cs.CL Papers

Summary

This paper introduces a signal-coverage matrix that decomposes type-correctness gains in autoformalization into four strata, revealing the mechanisms behind LLM refinements and showing that headline metrics can obscure which errors are actually resolved.

arXiv:2606.28013v1 Announce Type: new Abstract: Headline type-correctness (TC\%) of LLM autoformalization has climbed from $\sim$53\% to $\sim$76\% in two years, yet this scalar conceals which errors each method resolves. We propose a signal-coverage matrix that crosses the Lean elaborator (pass/fail) with a semantic-equivalence judgment (equivalent/not), sorting every output into one of four cells: true success (TS), type-only (TO), semantic-only (SO), or both fail (BF). On ProofNet\# and MiniF2F-test with DeepSeek V4-Pro across Vanilla, Lean-Retry, Sample-Filter, and Stratified Autoformalization (SAF): (1) the +34 to +36 TS gain across the three elab-feedback methods is $\sim$64\% type-stratum recovery, with SO flat on net (87.5\% of original semantic errors rescued, 8 newly created). (2) The TO-to-TS rate is 23/61 for each method (Wilson 95\% CI [26.6\%, 50.3\%]), and this stratum-level recovery rate predicts $\Delta$TS on held-out methods to within 2/186 and renders $\Delta$TC linear in the Vanilla elab-fail rate across six (model, dataset) cells ($R^2=0.96$). (3) The two judges disagree by 26 to 37 pp on elab-feedback outputs (vs. 7 pp on Vanilla), with 30 to 56\% of symbolic-judge false negatives traceable to elaborator-forced rewrites. The persistent residual reduces to two gold-formalization errors. TC\% gains should be credited by which cell moved, not by the scalar alone.
Original Article
View Cached Full Text

Cached at: 06/29/26, 05:25 AM

# Stratifying Type and Semantic Errors in Statement Autoformalization
Source: [https://arxiv.org/html/2606.28013](https://arxiv.org/html/2606.28013)
## The Signal\-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization

###### Abstract

Headline type\-correctness \(TC%\) of LLM autoformalization has climbed from∼\\sim53% to∼\\sim76% in two years, yet this scalar conceals which errors each method resolves\. We propose a signal\-coverage matrix that crosses the Lean elaborator \(pass/fail\) with a semantic\-equivalence judgment \(equivalent/not\), sorting every output into one of four cells: true success \(TS\), type\-only \(TO\), semantic\-only \(SO\), or both fail \(BF\)\. On ProofNet\#and MiniF2F\-test with DeepSeek V4\-Pro across Vanilla, Lean\-Retry, Sample\-Filter, and Stratified Autoformalization \(SAF\): \(1\) the\+34\+34to\+36\+36TSgain across the three elab\-feedback methods is∼\\sim64% type\-stratum recovery, withSOflat on net \(87\.5% of original semantic errors rescued, 8 newly created\)\. \(2\) TheTO→\\toTSrate is23/6123/61for each method \(Wilson 95% CI \[26\.6%, 50\.3%\]\), and this stratum\-level recovery rate predictsΔ​TS\\Delta\\textsc\{TS\}on held\-out methods to within2/1862/186and rendersΔ​T​C\\Delta TClinear in the Vanilla elab\-fail rate across six \(model, dataset\) cells \(R2=0\.96R^\{2\}\{=\}0\.96\)\. \(3\) The two judges disagree by 26 to 37 pp on elab\-feedback outputs \(vs\. 7 pp on Vanilla\), with 30 to 56% of symbolic\-judge false negatives traceable to elaborator\-forced rewrites\. The persistent residual reduces to two gold\-formalization errors\. TC% gains should be credited by which cell moved, not by the scalar alone\.

Autoformalization, Statement formalization, Semantic faithfulness, Elaborator feedback, Signal\-coverage matrix, LLM\-as\-judge

## 1Introduction

Statement\-level autoformalization maps a natural\-language theorem to a Lean 4 signature with asorry\-placeholder body\. Headline type\-correctness \(TC%\) on ProofNet\-class benchmarks\(Azerbayev et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib2); Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\)has climbed from∼\\sim53% \(GPT\-4o\(OpenAI,[2024](https://arxiv.org/html/2606.28013#bib.bib11)\)\) to∼\\sim76% \(DeepSeek V4\-Pro\(DeepSeek\-AI,[2026](https://arxiv.org/html/2606.28013#bib.bib4)\)\) in two years, driven by methods that refine against the Lean elaborator\. Because TC% does not certify denotational correctness, semantic\-faithfulness \(SF%\) estimators are reported alongside it: back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and generalized tree edit distance \(GTED,Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\)\.

Headline metrics conceal mechanism\. A\+15\+15pp TC% gain hides which error stratum is repaired and which is not, and the choice of SF% judge can shift the reported value by≥25\\geq 25pp on elab\-feedback methods \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. No published account decomposes TC% gains by stratum and method\. We provide such an account\.

Existing methods cluster along three axes\.Single\-oracle refinement against the elaborator\(Wu et al\.,[2022](https://arxiv.org/html/2606.28013#bib.bib15); Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12); Guo et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib5); Lu et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib10)\)consults only the elaborator \(or a close proxy\) during refinement, and the class has plateaued at∼\\sim75% TC% on ProofNet\#regardless of topology\.Semantic\-faithfulness judgingcomprises LLM back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and symbolic GTED via the Lean LSP\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\), used as standalone SF% estimators but never calibrated against each other at scale\.Typed structured intermediate representationswere introduced for Isabelle by Draft, Sketch, and Prove\(Jiang et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib6)\), whose sketch is a textual proof skeleton rather than a typed structural object\. The broader formal\-math LM literature on premise selection, proof generation, and math pretraining\(Polu & Sutskever,[2020](https://arxiv.org/html/2606.28013#bib.bib13); Yang et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib17); Azerbayev et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib3); Lin et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib7)\)addresses different stages of the pipeline and is complementary\.

We instantiate a2×22\{\\times\}2signal\-coverage matrix \(elaborator pass/fail×\\timesjudge equivalent/not\) on DeepSeek V4\-Pro×\\timesProofNet\#\(186 problems\) under a dual\-judge protocol that pairs Claude Opus 4\.7\(Anthropic,[2026](https://arxiv.org/html/2606.28013#bib.bib1)\)with GTED atτ=0\.5\\tau\{=\}0\.5, tracking per\-problem transitions from Vanilla into Lean\-Retry, Sample\-Filter, and Stratified Autoformalization \(SAF\)\.

This paper makes three contributions:

\(1\)A diagnostic framework and the predictive regularity it yields\.A2×22\{\\times\}2signal\-coverage matrix and a4×44\{\\times\}4transition matrix decompose any TC% gain into type\-stratum recovery, semantic\-stratum turnover, and a durable residual\. Under this lens, three algorithmically distinct methods each recover23/6123/61of VanillaTO\(Wilson 95% CI \[26\.6%, 50\.3%\]\), andΔ​T​C\\Delta TCis linear in the Vanilla elab\-fail rate atR2=0\.96R^\{2\}\{=\}0\.96across six \(model, dataset\) cells \(§[2](https://arxiv.org/html/2606.28013#S2), §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2), §[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\)\.

\(2\)A calibrated dual\-judge protocol\.Opus paired with GTED atτ=0\.5\\tau\{=\}0\.5achieves94\.4%94\.4\\%to95\.0%95\.0\\%precision across benchmarks, and a four\-class taxonomy attributes 30 to 56% of GTED false negatives to elaborator\-forced structural rewriting \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\.

\(3\)SAF as a causal\-attribution probe\.Stratified Autoformalization \(SAF\) refines a typed JSON IR against the elaborator and emits Lean by a deterministic translator\. Its determinism makes the candidate’s surface form a function of the IR alone, attributing the\+37\+37pp Opus\-vs\-GTED gap to elaborator\-driven surface geometry rather than LLM sampling noise \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. SAF ties Lean\-Retry and Sample\-Filter on aggregate metrics\. Its role here is diagnostic rather than state\-of\-the\-art optimization\.

## 2The Signal\-Coverage Matrix

Under statement\-level scope \(NLstatementSS→\\toLean 4 signatureσ\\sigmawith:= by sorry\), an output that fails its specification falls into exactly one of two strata\. A type\-stratum failure meansσ\\sigmadoes not type\-check \(Unknown identifier,failed to synthesize,application type mismatch, …\), so the math may be understood but the Lean surface form is invalid\. A semantic\-stratum failure meansσ\\sigmatype\-checks but denotes a different proposition fromSS, for instance hardcodingZMod 2whenSSrefers to “any field of size 2”, usingSummable fwhenSSasserts the sequence has a limit, reversing an inequality, or dropping a hypothesis\. \(A third, implementation stratum for proof\-level autoformalization is out of scope\.\)

The two refinement signals used in practice are stratum\-specialized\. The Lean elaborator is complete on the type stratum, since every failure raises an error, and silent on the semantic stratum, since a semantically wrong but type\-correctσ\\sigmaelaborates without complaint\. Semantic judges, namely back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and GTED\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\), cover the semantic stratum but forgive the type stratum, since a non\-compiling fragment back\-translates to noisyNL\.

Crossing elaboration \(pass / fail\) with the semantic judgment \(equivalent / not\-equivalent\) partitionsnnoutputs into four cells \([Figure1](https://arxiv.org/html/2606.28013#S2.F1)\): true success \(TS\), type\-only failure \(TO\), semantic\-only failure \(SO\), and both fail \(BF\)\. These labels denote failure modes, not correctness modes\. The two off\-diagonal cells are exactly the per\-instance disagreements between elaborator and judge, and their size, sources, and dynamics across methods are this paper’s core empirical object\. Writing\|c\|\|c\|for the count in cellcc,

T​C%=\|TS\|\+\|SO\|n,S​F%=\|TS\|n,n=∑c\|c\|\.TC\\%=\\tfrac\{\|\\textsc\{TS\}\|\+\|\\textsc\{SO\}\|\}\{n\},\\quad SF\\%=\\tfrac\{\|\\textsc\{TS\}\|\}\{n\},\\quad n=\\textstyle\\sum\_\{c\}\|c\|\.\(1\)
![Refer to caption](https://arxiv.org/html/2606.28013v1/x1.png)Figure 1:Signal\-coverage matrix: elaborator \(pass/fail\)×\\timesjudge \(equivalent/not\) gives the2×22\{\\times\}2cell decomposition and TC%/SF%\.
## 3Stratified Autoformalization

Because refinement signals are stratum\-specialized \(§[2](https://arxiv.org/html/2606.28013#S2)\), SAF factors autoformalization through a typed intermediate representation \(IR\)\. The LLM emits an IR, a deterministic∼\\sim30\-line translator produces Lean from it, and on elab failure the IR is refined rather than Lean source\. This determinism is central: every refined SAF output is a function of the IR alone\. On the aggregate metrics, SAF ties Lean\-Retry and Sample\-Filter on both TC and Opus\-judged SF \(§[5](https://arxiv.org/html/2606.28013#S5)\)\. Its role in this paper is the causal\-attribution probe of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5), where the\+37\+37pp Opus\-vs\-GTED gap is attributed to elaborator\-driven surface geometry rather than to LLM sampling noise\.

The refinement loop \([Algorithm1](https://arxiv.org/html/2606.28013#alg1)\) alternates IR generation by the LLM with elaborator checking until the candidate type\-checks\. The total LLM budget isK\+1K\{\+\}1\(one extraction plusKKrefinements\), and we useK=3K\{=\}3in all experiments\.

A SAF IR is a JSON object with five fields:

\{

"imports":\["Mathlib"\],

"opens":\["Polynomial","Real"\],

"objects":\[

\{"name":"G","type":"Type\*","binding":"implicit"\},

\{"name":"","type":"GroupG","binding":"instance"\},

\{"name":"a","type":"G","binding":"explicit"\}

\],

"hypotheses":\[\{"id":"h","expression":"a\*b=b\*a"\}\],

"conclusion":"<Lean4expression\>"

\}

Algorithm 1SAF refinement loop0:NL statement

SS, retry budget

KK
0:type\-checked Lean signature, or

⊥\\bot
1:

ir←LLM\.extract​\_​IR​\(S\)\\mathrm\{ir\}\\leftarrow\\mathrm\{LLM\.extract\\\_IR\}\(S\)
2:for

k=0k=0to

KKdo

3:

σ←ir​\_​to​\_​lean​\(ir\)\\sigma\\leftarrow\\mathrm\{ir\\\_to\\\_lean\}\(\\mathrm\{ir\}\)
4:

\(ok,errors\)←LeanElaborator\.check​\(σ\)\(\\mathrm\{ok\},\\mathrm\{errors\}\)\\leftarrow\\mathrm\{LeanElaborator\.check\}\(\\sigma\)
5:if

ok\\mathrm\{ok\}then

6:return

σ\\sigma
7:endif

8:

ir←LLM\.refine​\_​IR​\(S,ir,errors\)\\mathrm\{ir\}\\leftarrow\\mathrm\{LLM\.refine\\\_IR\}\(S,\\mathrm\{ir\},\\mathrm\{errors\}\)
9:endfor

10:return

⊥\\bot

Theimportsandopensfields carry the Mathlib environment,objectsdeclares typed binders with their binding mode,hypotheseslists named premises, andconclusionholds the goal as a Lean 4 expression\. Their\_to\_leantranslator is a∼\\sim30\-line template substitution that maps each field to its Lean counterpart:importstoimportlines,openstoopenlines, eachobjectsentry to a typed binder \(\(x : T\)/\{x : T\}/\[T\]perbinding\),hypothesesto hypothesis binders, andconclusionto the theorem body ending in:= by sorry\. The translator is purely template\-based \(no LLM, no fuzzy matching, no error recovery\), and malformed IR fails fast\.

## 4Experimental Setup

#### Datasets\.

ProofNet\#\(Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\)is the type\-complex benchmark: 186 undergraduate\-textbook problems \(Rudin, Munkres, Dummit\-Foote, Artin, Axler, Herstein, Pugh, Putnam, Ireland\-Rosen, Shakarchi\) exercising Mathlib typeclass machinery \(Group,MeasurableSpace,ContinuousMap,Sylow, …\)\. MiniF2F\-test\(Zheng et al\.,[2022](https://arxiv.org/html/2606.28013#bib.bib18)\)is the type\-simple benchmark: 244 olympiad problems admitting largely primitive\-type formalizations\. The two bracket the type\-system complexity axis\.

#### Subject models\.

DeepSeek V4\-Pro \(49B active / 1\.6T total MoE parameters\) is the primary subject, with the full battery on ProofNet\#and three methods on MiniF2F\. Qwen3\.5\-Plus\(Qwen Team,[2026](https://arxiv.org/html/2606.28013#bib.bib14)\)\(480B\-class\) and MiMo\-v2\.5\-Pro\(Xiaomi MiMo Team,[2026](https://arxiv.org/html/2606.28013#bib.bib16)\)\(Xiaomi, thinking traces disabled\) serve as cross\-family direction checks\. All decoding is attemperature=0\.0except whereN=4N\{=\}4sampling is the method itself\. Gold\-judge analysis is anchored on DeepSeek V4\-Pro×\\timesProofNet\#, with cross\-model and cross\-dataset runs serving as mechanism\-direction checks at coarser \(GTED\-only\) judge resolution\.

#### Methods\.

Four methods, three at the same∼\\sim4\-LLM\-call budget\. Vanilla: one\-shot\. Lean\-Retry \(Guo et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib5)\-style\): emit Lean signature, refine by feeding elaborator errors back to the LLM,K=3K\{=\}3retries\. Sample\-Filter \(Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\-style\): sampleN=4N\{=\}4candidates, retain elab\-pass candidates, pick first\. SAF \(this paper, §[3](https://arxiv.org/html/2606.28013#S3)\): typed JSON IR refined against the elaborator, translated to Lean by a deterministic∼\\sim30\-line procedure,K=3K\{=\}3\.

#### Judges\.

A dual\-judge protocol pairs Claude Opus 4\.7 \(Anthropic, disjoint from all four subject families\) as the gold cross\-judge with GTED\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\)as the scalable symbolic proxy\. Opus compares candidate Lean against gold Lean for semantic equivalence under a STRICT prompt when the candidate elaborates and an EQUIV\-IF\-FIXED prompt when it does not, yielding 744 verdicts on DeepSeek V4\-Pro×\\timesProofNet\#×\\times\{Vanilla, Lean\-Retry, SAF, Sample\-Filter\}, 244 on DeepSeek V4\-Pro×\\timesMiniF2F×\\timesVanilla, and 988 in total\. GTED computes generalized tree edit distance on the LSP\-extracted typed operator tree, returns a similarity in\[0,1\]\[0,1\], is thresholded atτ=0\.5\\tau\{=\}0\.5, and runs on all 19 \(model, dataset, method\) combinations\.

#### Infrastructure\.

Lean 4\.19\.0 vialean\_interact, pre\-built Mathlib, ATLAS\-based\(Liu et al\.,[2025a](https://arxiv.org/html/2606.28013#bib.bib8)\)GTED LSP backend, elab timeout 60s, and JSONL resume on all pipelines\.

## 5Experimental Results and Analysis

### 5\.1Per\-method cell decomposition

![Refer to caption](https://arxiv.org/html/2606.28013v1/x2.png)Figure 2:Vanilla signal\-coverage matrix \(DeepSeek V4\-Pro×\\timesProofNet\#\)\. Rose\-outlined off\-diagonal cells mark elaborator/Opus disagreements\.For each of the 186 ProofNet\#test problems run on DeepSeek V4\-Pro under each of the four methods, we record the elaborator label \(pass / fail\) and the Opus label \(equivalent / not\-equivalent\), assigning each output to one of the four cells defined in §[2](https://arxiv.org/html/2606.28013#S2)\.[Table1](https://arxiv.org/html/2606.28013#S5.T1)reports the four\-method matrix,[Figure2](https://arxiv.org/html/2606.28013#S5.F2)visualizes Vanilla’s cell counts, and[Figure3](https://arxiv.org/html/2606.28013#S5.F3)shows the cross\-method TC / SF comparison \(computed per[Equation1](https://arxiv.org/html/2606.28013#S2.E1)\)\. We use DeepSeek V4\-Pro×\\timesProofNet\#as the running example because it is the only \(model, dataset\) cell with full four\-method Opus dual\-judging\.[AppendixE](https://arxiv.org/html/2606.28013#A5)extends the sameΔ\\DeltaTC /Δ\\DeltaSF comparison to the other five cells under GTED\. The cell\-level view is what lets us separate elaborator\-side recovery \(TO→\\toTS\) from semantic recovery \(SO→\\toTS\) rather than read both off a single aggregate rate\.

Table 1:Signal\-coverage matrix on DeepSeek V4\-Pro×\\timesProofNet\#\(n=186n\{=\}186\), four methods\. Cells are defined in[Figure1](https://arxiv.org/html/2606.28013#S2.F1), labels denote failure modes\.Vanilla decomposes as 95 / 16 / 61 / 14\. Of the 91 failures, the elaborator catches 75 \(TO\+\+BF\) and Opus catches 30 \(SO\+\+BF\), with overlap of 14\. The 77 / 186 outputs off\-diagonal \(41\.4%,SO\+\+TO\) are the per\-instance disagreements between the two signals\. The largest single recovery target for any elab\-only feedback method is the 61TOcases: semantically correct outputs that the elaborator nonetheless rejects on surface form\.

![Refer to caption](https://arxiv.org/html/2606.28013v1/x3.png)Figure 3:TC% and Opus\-judged SF% across four methods on DeepSeek V4\-Pro×\\timesProofNet\#\. The three elab\-feedback methods \(Lean\-Retry, Sample\-Filter, SAF\) cluster within∼1\\sim 1pp on both axes\.The three elab\-feedback methods add 34 to 36 problems toTS, shrinkTOby 20 to 24, shrinkBFby 7 to 10, and shrinkSOby only 5–6 from its baseline of 16, despite their algorithmic differences \(feedback vs\. sampling, raw Lean vs\. typed IR\)\. A bootstrap 95% CI \(B=10,000B\{=\}10\{,\}000,[Table9](https://arxiv.org/html/2606.28013#A2.T9)\) showsΔ\\DeltaTSandΔ\\DeltaTOsignificant for all three methods \(CI does not cross 0\), whileΔ\\DeltaSOis not significant for any\.

### 5\.2Per\-problem transitions:TSgains come mostly fromTO\-rescue

AggregateΔ\\Deltas answer which cells moved on net\. To resolve which problems moved, we join each problem’s Vanilla cell to its Lean\-Retry cell, giving the4×44\{\\times\}4transition matrix in[Figure4](https://arxiv.org/html/2606.28013#S5.F4)and the per\-cell decomposition ofΔ​\|TS\|\\Delta\|\\textsc\{TS\}\|in[Equation2](https://arxiv.org/html/2606.28013#S5.E2)\.

![Refer to caption](https://arxiv.org/html/2606.28013v1/x4.png)Figure 4:Per\-problem Vanilla→\\toLean\-Retry transition matrix\. Green column: rescue intoTS\. Orange column: newly createdSO\. Dotted diagonal: stay\-put\. Shade encodes count\.Writing\|c→c′\|\|c\{\\to\}c^\{\\prime\}\|for the number of problems in Vanilla cellccthat land in Lean\-Retry cellc′c^\{\\prime\}, the netTSgain is given by

Δ​\|TS\|\\displaystyle\\Delta\|\\textsc\{TS\}\|=\|TO→TS\|⏟23\+\|SO→TS\|⏟14\+\|BF→TS\|⏟2\\displaystyle=\\underbrace\{\|\\textsc\{TO\}\{\\to\}\\textsc\{TS\}\|\}\_\{23\}\+\\underbrace\{\|\\textsc\{SO\}\{\\to\}\\textsc\{TS\}\|\}\_\{14\}\+\\underbrace\{\|\\textsc\{BF\}\{\\to\}\\textsc\{TS\}\|\}\_\{2\}\(2\)−\|TS→¬TS\|⏟3=\+36,\\displaystyle\\quad\-\\underbrace\{\|\\textsc\{TS\}\{\\to\}\\neg\\textsc\{TS\}\|\}\_\{3\}\\;=\\;\+6,attributing23/36≈64%23/36\\approx 64\\%to type\-stratum recovery and14/36≈39%14/36\\approx 39\\%to semantic flip on the rescued side\. Per\-cell individual rescue rates are: 96\.8% of VanillaTSstayTS, 87\.5% \(14 / 16\) of VanillaSOproblems are individually rescued, and only 37\.7% \(23 / 61\) of VanillaTOare recovered, with the remainder stuck\.

### 5\.3Semantic turnover and the gold\-error floor

The aggregateSOcount drops only from 16 to 10 under Lean\-Retry, but this stability masks substantial per\-problem turnover\. 14 of 16 VanillaSOare individually rescued \(87\.5%\), while Lean\-Retry simultaneously creates 8 newSO\(4 fromBF, 3 fromTSregression, 1 fromTO\), for a net change of−6\-6: aggregate stability hides a near\-total replacement of the underlying error population\.

Of the 2 VanillaSOproblems that remainSOunder Lean\-Retry, both are gold\-formalization bugs, verified by Opus reasoning across all four methods\. The first,Ireland\-Rosen\|exercise\_1\_30, asks “Prove that1/2\+1/3\+⋯\+1/n1/2\+1/3\+\\cdots\+1/nis not an integer\.” Gold’s summand1n\+2\\frac\{1\}\{n\+2\}does not depend on the summation index, so the gold sum evaluates ton/\(n\+2\)n/\(n\+2\), a different proposition from the harmonic tail\. The candidate faithfully formalizes the harmonic tail, and Opus correctly judges candidate≠\\neqgold under all four methods \([Example1](https://arxiv.org/html/2606.28013#Thmtheorem1)\)\.

###### Example 1\(Buggy gold formalization,n=1n\{=\}1\)\.

theoremexercise\_1\_30\{n:Nat\}:

Not\(Existsfuna:Int⇒\\Rightarrow

\(Finset\.sum\(Finset\.univ:Finset\(Finn\)\)

fun\_⇒\\Rightarrow\(1:Rat\)/\(n\+2\)\)=a\)

theoremexercise\_1\_30\{n:Nat\}\(hn:2≤\\leqn\):

Not\(Existsfuna:Int⇒\\Rightarrow

\(Finset\.sum\(Finset\.Icc2n\)

funk⇒\\Rightarrow\(1:Rat\)/k\)=a\)

The companion caseDummit\-Foote\|exercise\_7\_2\_2has gold LHSp∣0p\\mid 0, trivially true in any commutative ring, so the biconditional collapses to a one\-sided statement \([AppendixF](https://arxiv.org/html/2606.28013#A6), Case D\)\. Both cases are documented with Opus traces and suggested corrections in the supplementary note\.

Across all four methods, only 1 of 186 problems is durablySO\(Ireland\-Rosen\|exercise\_1\_30\)\. Under Lean\-Retry the count rises to 2, both gold bugs\. Excluding the gold bugs, zero of 186 problems are durably semantic\-only across the three elab\-feedback methods, so theSOplateau approximates the gold\-error floor of the benchmark\.

### 5\.4A predictive regularity: stratum\-rates are method\-invariant

The aggregate gains of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)and the transition matrix of §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)naturally raise the question: do the three elab\-feedback methods recover the*same*problems at the*same*rate, or do they merely converge in aggregate? For each Vanilla cellc∈\{TS,SO,TO,BF\}c\\in\\\{\\textsc\{TS\},\\textsc\{SO\},\\textsc\{TO\},\\textsc\{BF\}\\\}and each elab\-feedback methodmm, define the recovery rate

ρc→TS​\(m\)=\|\{p:Vanilla​\(p\)=c∧m​\(p\)=TS\}\|\|\{p:Vanilla​\(p\)=c\}\|\.\\rho\_\{c\\to\\textsc\{TS\}\}\(m\)\\;=\\;\\frac\{\|\\\{p:\\text\{Vanilla\}\(p\)=c\\,\\wedge\\,m\(p\)=\\textsc\{TS\}\\\}\|\}\{\|\\\{p:\\text\{Vanilla\}\(p\)=c\\\}\|\}\.\(3\)[Table2](https://arxiv.org/html/2606.28013#S5.T2)reports the four rates for each of the three elab\-feedback methods\.

Table 2:Per\-stratum recovery rates on DeepSeek V4\-Pro×\\timesProofNet\#, dual\-judged\.ρc→TS​\(m\)\\rho\_\{c\\to\\textsc\{TS\}\}\(m\)is the share of Vanilla cellccthat lands inTSunder methodmm\.All three elab\-feedback methods recover*exactly*23 of the 61 VanillaTOproblems, despite differing in algorithm \(sequential elaborator feedback, deterministic IR refinement, parallelN=4N\{=\}4sampling\), in LLM call topology, and in candidate generation entropy\. The point estimate has a wide Wilson 95% CI of\[26\.6%,50\.3%\]\[26\.6\\%,50\.3\\%\]atn=61n\{=\}61, so we do not claim the true rate is precisely 37\.7%\. What we claim, and what the predictive regularity of[Table3](https://arxiv.org/html/2606.28013#S5.T3)rests on, is that the three methods recover statistically indistinguishable populations on the type stratum: per\-method CIs overlap entirely, and a Fisher exact test of pairwise homogeneity returnsp=1\.00p\{=\}1\.00for all three pairs\.[AppendixA](https://arxiv.org/html/2606.28013#A1)gives Wilson CIs for every cell of[Table2](https://arxiv.org/html/2606.28013#S5.T2)and discusses the limits of then=186n\{=\}186sample\. The semantic recovery rate is high and tighter than theBFrate \(the latter is small\-nnnoise, 14BFproblems total\)\. The retention rate onTSis uniformly≥92\.6%\\geq 92\.6\\%, so net regression is small\.

Taking any single elab\-feedback method’s rates and applying them as a mass\-balance rule to the Vanilla cell distribution\(\|TS\|,\|SO\|,\|TO\|,\|BF\|\)=\(95,16,61,14\)\(\|\\textsc\{TS\}\|,\|\\textsc\{SO\}\|,\|\\textsc\{TO\}\|,\|\\textsc\{BF\}\|\)=\(95,16,61,14\)yields a testable prediction for the other methods\. With Lean\-Retry’s rates,Δ​TS^=0\.377⋅61\+0\.875⋅16\+0\.143⋅14−\(1−0\.968\)⋅95=\+36\.0\\hat\{\\Delta\\textsc\{TS\}\}=0\.377\{\\cdot\}61\+0\.875\{\\cdot\}16\+0\.143\{\\cdot\}14\-\(1\-0\.968\)\{\\cdot\}95=\+36\.0\.[Table3](https://arxiv.org/html/2606.28013#S5.T3)compares predicted to observed: the calibration on Lean\-Retry predicts the other two methods’Δ​TS\\Delta\\textsc\{TS\}to within 2 of 186 problems, an order of magnitude better than the bootstrap noise band of any cell\.

Table 3:Predicted vs\. observedΔ​TS\\Delta\\textsc\{TS\}on DeepSeek V4\-Pro×\\timesProofNet\#, using stratum\-rates calibrated from Lean\-Retry\.Extending across the six \(model, dataset\) cells of the cross\-model panel, where the test is necessarily at the TC level \(Opus dual\-judging is anchored on the V4\-Pro×\\timesProofNet\#cell\), the recovery rate of Vanilla elab\-fails under Lean\-Retry fits a single linear regression in the Vanilla elab\-fail rateVfailV\_\{\\rm fail\}:

Δ​T​C^=0\.325⋅Vfail\+2\.88\(R2=0\.957\),\\widehat\{\\Delta TC\}\\;=\\;0\.325\\cdot V\_\{\\rm fail\}\\;\+\\;2\.88\\quad\(R^\{2\}=0\.957\),\(4\)with per\-cell residuals in\[−1\.0,\+1\.3\]\[\-1\.0,\+1\.3\]pp \([Table4](https://arxiv.org/html/2606.28013#S5.T4)\)\. The recovery rate of elab\-fails \(Δ​T​C/Vfail\\Delta TC/V\_\{\\rm fail\}\) is41\.2%41\.2\\%on ProofNet\#\(range\[38\.3,45\.3\]\[38\.3,45\.3\]\) and61\.8%61\.8\\%on MiniF2F \(range\[48\.4,73\.3\]\[48\.4,73\.3\]\): the same elab\-signal recovers a strictly higher share of failures on type\-simple benchmarks because type\-simple failures concentrate in the easy surface\-form band thatK=3K\{=\}3refinement resolves\.

Table 4:Cross\-cell fit of[Equation4](https://arxiv.org/html/2606.28013#S5.E4)on six \(model, dataset\) cells under Lean\-Retry\. Rows are grouped by benchmark\.Two consequences for method comparison follow\. First, a new method’s*aggregate*Δ​T​C\\Delta TCis partially determined by its target’s Vanilla cell distribution, not only by the method’s algorithmic innovations: a method evaluated only on a type\-complex benchmark will look weaker than the same method on MiniF2F, by a benchmark\-modulation factor predicted fromVfailV\_\{\\rm fail\}alone\. Second, the type\-stratum recovery rate of23/6123/61is what current elab\-feedback methods deliver, and closing the gap to the elab\-only ceiling\(\|TS\|\+\|TO\|\)/n=83\.9%\(\|\\textsc\{TS\}\|\{\+\}\|\\textsc\{TO\}\|\)/n=83\.9\\%would require recovering hardTOcases that surviveK=3K\{=\}3refinement, a population on which no current method makes progress \(§[6](https://arxiv.org/html/2606.28013#S6)\)\. The within\-2/1862/186agreement is partly self\-consistency, since the three methods already converge in aggregate \([Table1](https://arxiv.org/html/2606.28013#S5.T1)\)\.

### 5\.5Opus vs\. GTED: the surface\-form rewriting gap

On the same 186 DeepSeek V4\-Pro×\\timesProofNet\#outputs, Claude Opus 4\.7 and GTEDτ=0\.5\\tau\{=\}0\.5give different SF% numbers \([Table5](https://arxiv.org/html/2606.28013#S5.T5)\)\. SAF’s deterministic IR translator isolates the cause\.

Table 5:Opus vs\. GTED SF% on DeepSeek V4\-Pro×\\timesProofNet\#outputs\.On Vanilla the two judges agree to within\+7\+7pp\. On the three elab\-feedback methods they disagree by\+26\+26to\+37\+37pp, with Opus reading SF as rising and GTED reading it as flat or falling\. The mechanism: GTED penalizes surface\-form normalizations \(alternativeopens,IsCompact univ↔\\leftrightarrowCompactSpace,Fin​n→ℝ\\mathrm\{Fin\}\\,n\\to\\mathbb\{R\}vs\.∀i:ℕ,Icc​0 1\\forall i:\\mathbb\{N\},\\,\\mathrm\{Icc\}\\,0\\,1restatements\) that elab\-feedback adopts to satisfy the elaborator, while Opus recognizes these as semantically equivalent\.[Example2](https://arxiv.org/html/2606.28013#Thmtheorem2)is a representative instance onMunkres\|exercise\_29\_4: gold and candidate denote the sameΠ\\Pitype, but GTED returns similarity 0\.143 because the extracted operator trees place an arrow node where the other places a∀\\forallbinder\.

###### Example 2\(Surface\-form gap,n=1n\{=\}1\)\.

theoremexercise\_29\_4:

Not\(LocallyCompactSpace

\(Nat→\\toSet\.Icc\(0:Real\)1\)\)

theoremexercise\_29\_4:

Not\(LocallyCompactSpace

\(foralli:Nat,Set\.Icc\(0:Real\)1\)\)

The gap is largest for SAF \(\+37\+37pp\): with surface form fixed by the deterministic translator, divergence is the IR’s choice, not LLM sampling noise\. Additional cases \(TO→\\toTSrecoveries, gold bugs\) are catalogued in[AppendixF](https://arxiv.org/html/2606.28013#A6)\.

Table 6:GTEDτ\\taucalibration against Opus\.P=P​\(Opus equiv∣GTED≥τ\)P=P\(\\text\{Opus equiv\}\\mid\\text\{GTED\}\\geq\\tau\),RRis recall on the Opus\-equiv elab\-pass set\. Recommended rowτ=0\.5\\tau\{=\}0\.5in blue underline\. Per\-method ProofNet\#precision in[Table10](https://arxiv.org/html/2606.28013#A3.T10)\.[Table6](https://arxiv.org/html/2606.28013#S5.T6)confirms thatτ=0\.5\\tau\{=\}0\.5maintains 95\.0% pooled precision on ProofNet\#and≥93%\\geq 93\\%on every method individually \(per\-method breakdown in[Table10](https://arxiv.org/html/2606.28013#A3.T10)\)\. The threshold is stable across distributions, so the gap is a structural property of elab\-feedback’s surface rewriting rather than of judge calibration\.

The Opus\-equiv but GTED\-non\-equiv set \(n=38n\{=\}38on Vanilla,n∈\[62,76\]n\\in\[62,76\]on the three elab\-feedback methods\) is what drives the gap\. We classify each case by the type of rewrite that distinguishes the candidate from gold \(rules and per\-method audit in[AppendixD](https://arxiv.org/html/2606.28013#A4)\):

- •Notational \(N\)\.Surface rewrites that preserve definitional shape:A\-\>Bvs\.forall \_ : A, B,Real^nvs\.Fin n \-\> Real, coercion \(ZMod 2for the unique field of order 2\), alternativeopen/namespace prefixes, or cosmetic renaming\.
- •Idiomatic \(I\)\.Definitional aliases or Mathlib formulation swaps:IsConjvs\.exists g, b = g\*a\*g\-1,Summablevs\.Tendstoof partial sums,CompactSpacevs\.IsCompact Set\.univ, orLinearEquiv\-via\-Nonemptytypeclass framing\.
- •Structural \(S\)\.Binder, quantifier, or typeclass restructuring that changes the post\-elaboration term:letvs\. explicit hypothesis, instance binder vs\. hypothesis binder,\[Field\]vs\.\(h : Field\), quantifier moved inside or outside an existential,Iffdirection swap, biconditional introduced or collapsed, orsubtypevs\. predicate\-restricted set\.
- •Residual \(Z\)\.Cases that mix multiple categories or resist single\-category classification\.

Table 7:Taxonomy of Opus\-vs\-GTED disagreements on DeepSeek V4\-Pro×\\timesProofNet\#\. Counts: problems with Opus equiv∧\\wedgeGTED<0\.5<0\.5∧\\wedgeelab\-pass\. Percentages: share of each method’s false\-negative pool\.[Table7](https://arxiv.org/html/2606.28013#S5.T7)reads as follows\. On Vanilla, 38 of 95 elab\-pass outputs \(40%40\\%\) are surface\-form rewrites Opus accepts but GTED does not, distributed across all four categories\. Under each elab\-feedback method the false\-negative count grows by 24 to 38 problems\. For Lean\-Retry and Sample\-Filter the growth concentrates in category S \(16 of 24 and 21 of 24 added cases, respectively\): the elaborator pressures the LLM to restructure binders, typeclass formulations, and quantifier orderings until the output type\-checks\. SAF’s growth is more diffuse: only 9 added in S \(the smallest, because SAF’s deterministicir\_to\_leantranslator does not freely restructure binders\) but\+20\+20in Z, reflecting the multi\-category rewrites the IR\-mediated path tends to produce\. The disagreement gap is therefore not random surface variance but a measurement of elaborator\-driven structural rewriting, which is precisely the mechanism that recoversTOintoTS\(§[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\)\.

### 5\.6Cross\-dataset:SOrate stable,TOrate scales with type\-complexity

DeepSeek V4\-Pro×\\timesMiniF2F\-test×\\timesVanilla \(244 problems, Opus\-judged\) decomposes asTS=163,SO=24,TO=51,BF=6\. TheSOrate is 9\.8%, essentially unchanged from 8\.6% on type\-complex ProofNet\#\. TheTOrate, by contrast, drops from 32\.8% to 20\.9%\. Dataset type\-complexity loads the type stratum and not the semantic stratum, providing supporting evidence at Vanilla\-only resolution for the mechanism of §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\. Full MiniF2F dual\-judge analysis across the four methods is future work\.

### 5\.7Cross\-model: TC gain robust to subject model

[Figure5](https://arxiv.org/html/2606.28013#S5.F5)reports TC% for three subject models on both datasets\. Lean\-Retry≥\\geqVanilla in 6/6 cells \(meanΔ\\DeltaTC==\+10\.1\+10\.1pp\)\. SAF≥\\geqVanilla in 5/6, the one exception \(Qwen×\\timesProofNet\#,−12\.9\-12\.9pp\) being a high IR\-extraction failure rate on Qwen3\.5\-Plus, reported without ad\-hoc patching\. Under GTED\-only judging,Δ\\DeltaTC\>\>Δ\\DeltaSF holds in 6/6 cells for Lean\-Retry \(mean\+10\.1\+10\.1vs\.\+1\.3\+1\.3pp,[Table11](https://arxiv.org/html/2606.28013#A5.T11)\)\.

![Refer to caption](https://arxiv.org/html/2606.28013v1/x5.png)Figure 5:Cross\-model TC% on 3 models×\\times2 datasets×\\times3 methods\.
### 5\.8K\-saturation: refinement saturates fast and three methods converge

[Figure6](https://arxiv.org/html/2606.28013#S5.F6)plots SAF’s K\-saturation on DeepSeek V4\-Pro×\\timesProofNet\#\. TC% rises 50\.5→\\to72\.0→\\to74\.7→\\to75\.3% acrossK=0,1,2,3K\{=\}0,1,2,3\. Refinement saturates atK=1K\{=\}1\(\+21\.5\+21\.5pp\), withK=2,3K\{=\}2,3contributing\+3\.3\+3\.3pp combined\. All three methods land within 1\.0 pp at saturation \(SAF 75\.3, Lean\-Retry 75\.8, Sample\-Filter 76\.3\), so different topologies reach the same ceiling\. Per\-iteration faithfulness remains high:k∗=0k^\{\*\}\{=\}0: 89/94 \(94\.7%\),k∗=1k^\{\*\}\{=\}1: 34/40 \(85\.0%\),k∗=2k^\{\*\}\{=\}2: 5/5 \(100%\), andk∗=3k^\{\*\}\{=\}3: 1/1 \(100%\)\. Of the 46 problems that never pass, 39 \(84\.8%\) are Opus equiv\-if\-fixed \(i\.e\.,TO\)\. The 16 durablyTOacross all four methods share a common failure mode: plausible but invalid Mathlib API names that elab errors cannot prescribe \([AppendixF](https://arxiv.org/html/2606.28013#A6), Case E\)\.

![Refer to caption](https://arxiv.org/html/2606.28013v1/x6.png)Figure 6:SAF K\-saturation on DeepSeek V4\-Pro×\\timesProofNet\#\. Dashed bands: Lean\-Retry and Sample\-Filter saturation levels\.

## 6Discussion and Limitations

The \(TC≠\\neqSF\) gap itself is not new, but the per\-stratum quantification is\. On type\-complex benchmarks TC% alone is misleading: the recommended reporting unit is TC%, dual\-judge SF%, and the per\-cell decomposition \(TS/SO/TO/BF\)\. Two elab\-feedback methods compared under a single symbolic judge can differ by≥25\\geq 25pp purely from structural rewriting, so independent\-LLM verification is required for any SF% claim\.

#### Why three methods converge\.

The collapse to within∼\\sim1 pp TC% at saturation \(§[5\.8](https://arxiv.org/html/2606.28013#S5.SS8)\), the identical 37\.7%TO→\\toTSrate \(§[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\), and the shared structural\-rewriting signature in the Opus/GTED gap \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\) are three views of the same claim: atK≤3K\{\\leq\}3the elab signal is largely exhausted by the fixed∼\\sim4\-call budget, so all three refinement topologies converge to the same family of type\-checking surface forms\. The remaining axis of progress is the signal itself, not the topology on top of it\.

Per\-output semantic faithfulness is unreliable under elab\-feedback because of turnover \(87\.5% rescue offset by 8 newly created errors per method\)\. Future signals must correlate with semantic correctness rather than only distinguish elab\-pass from elab\-fail\.

SAF’s deterministic IR translation serves as a causal\-attribution probe for the surface\-form rewriting mechanism \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\) and the per\-stratum predictive regularity \(§[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\)\. Two ProofNet\#gold formalizations contain gold\-formalization errors \(Ireland\-Rosen\|exercise\_1\_30,Dummit\-Foote\|exercise\_7\_2\_2\) and are documented with suggested corrections in the supplementary note\. Benchmarks aiming to measureSObelow 1% will need gold audits, not larger LLMs alone\.

Our semantic labels rely on Claude Opus 4\.7 as a strong cross\-judge rather than expert Lean validators \(≥94%\\geq 94\\%pooled precision in §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. A small expert audit and a canonicalization\-aware symbolic metric \(e\.g\. operator\-tree matching afterwhnfreduction\) would absorb most of the N/I taxonomy categories without an LLM judge\. Both are natural follow\-ups\.

## 7Conclusion

We decompose TC% gains via a2×22\{\\times\}2signal\-coverage matrix and a4×44\{\\times\}4per\-problem transition matrix\. On DeepSeek V4\-Pro×\\timesProofNet\#, three elab\-feedback methods deliver \+34 to \+36TSvia∼\\sim64% type\-stratum recovery \([Equation2](https://arxiv.org/html/2606.28013#S5.E2)\), withSOnet\-flat but 87\.5% per\-problem turnover\. TheTO→\\toTSrate is23/6123/61for each method \(Wilson 95% CI \[26\.6%, 50\.3%\]\), cross\-method calibration predictsΔ​TS\\Delta\\textsc\{TS\}to within2/1862/186, andΔ​T​C\\Delta TC\{\}fits a linear regression in the Vanilla elab\-fail rate atR2=0\.96R^\{2\}\{=\}0\.96across six \(model, dataset\) cells\. A dual\-judge protocol \(Opus and GTED at 95% pooled precision\) shows that single\-judge SF% under\-credits elab\-feedback methods by≥25\\geq 25pp via elaborator\-forced structural rewrites\.

## References

- Anthropic \(2026\)Anthropic\.The Claude 4 model family\.[https://www\.anthropic\.com/claude](https://www.anthropic.com/claude), 2026\.Model card\.
- Azerbayev et al\. \(2023\)Azerbayev, Z\., Piotrowski, B\., Schoelkopf, H\., Ayers, E\. W\., Radev, D\., and Avigad, J\.Proofnet: Autoformalizing and formally proving undergraduate\-level mathematics\.*arXiv preprint arXiv:2302\.12433*, 2023\.
- Azerbayev et al\. \(2024\)Azerbayev, Z\., Schoelkopf, H\., Paster, K\., Dos Santos, M\., McAleer, S\., Jiang, A\. Q\., Deng, J\., Biderman, S\., and Welleck, S\.Llemma: An open language model for mathematics\.In*International Conference on Learning Representations*, 2024\.
- DeepSeek\-AI \(2026\)DeepSeek\-AI\.DeepSeek\-V4 technical report\.[https://huggingface\.co/deepseek\-ai](https://huggingface.co/deepseek-ai), 2026\.Technical report\.
- Guo et al\. \(2025\)Guo, Q\., Wang, J\., Zhang, J\., Kong, D\., Huang, X\., Xi, X\., Wang, W\., Wang, J\., Cai, X\., Zhang, S\., and Ye, W\.Autoformalizer with tool feedback\.*arXiv preprint arXiv:2510\.06857*, 2025\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Welleck, S\., Zhou, J\. P\., Li, W\., Liu, J\., Jamnik, M\., Lacroix, T\., Wu, Y\., and Lample, G\.Draft, sketch, and prove: Guiding formal theorem provers with informal proofs\.In*International Conference on Learning Representations*, 2023\.
- Lin et al\. \(2025\)Lin, H\., Sun, Z\., Welleck, S\., and Yang, Y\.Lean\-STaR: Learning to interleave thinking and proving\.In*International Conference on Learning Representations*, 2025\.
- Liu et al\. \(2025a\)Liu, X\., Bao, K\., Zhang, J\., Liu, Y\., Chen, Y\., Liu, Y\., Jiao, Y\., and Luo, T\.ATLAS: Autoformalizing theorems through lifting, augmentation, and synthesis of data\.*arXiv preprint arXiv:2502\.05567*, 2025a\.
- Liu et al\. \(2025b\)Liu, Y\., Zhu, T\., Liu, X\., Chen, Y\., Liu, Z\., Guo, Q\., Zhang, J\., Bao, K\., and Luo, T\.Generalized tree edit distance \(GTED\): A faithful evaluation metric for statement autoformalization\.In*ICML 2025 Workshop on AI for Math \(AI4Math\)*, 2025b\.arXiv:2507\.07399\.
- Lu et al\. \(2024\)Lu, J\., Wan, Y\., Liu, Z\., Huang, Y\., Xiong, J\., Liu, C\., Shen, J\., Jin, H\., Zhang, J\., Wang, H\., Yang, Z\., Tang, J\., and Guo, Z\.Process\-driven autoformalization in lean 4\.*arXiv preprint arXiv:2406\.01940*, 2024\.
- OpenAI \(2024\)OpenAI\.Hello GPT\-4o\.[https://openai\.com/index/hello\-gpt\-4o/](https://openai.com/index/hello-gpt-4o/), 2024\.
- Poiroux et al\. \(2024\)Poiroux, A\., Weiss, G\., Kunčak, V\., and Bosselut, A\.Improving autoformalization using type checking\.*arXiv preprint arXiv:2406\.07222*, 2024\.
- Polu & Sutskever \(2020\)Polu, S\. and Sutskever, I\.Generative language modeling for automated theorem proving\.*arXiv preprint arXiv:2009\.03393*, 2020\.
- Qwen Team \(2026\)Qwen Team\.Qwen3\.5 technical report\.[https://qwenlm\.github\.io](https://qwenlm.github.io/), 2026\.Technical report\.
- Wu et al\. \(2022\)Wu, Y\., Jiang, A\. Q\., Li, W\., Rabe, M\., Staats, C\., Jamnik, M\., and Szegedy, C\.Autoformalization with large language models\.*Advances in Neural Information Processing Systems*, 35:32353–32368, 2022\.
- Xiaomi MiMo Team \(2026\)Xiaomi MiMo Team\.MiMo\-V2\.5\-Pro: Reasoning model technical report\.[https://github\.com/XiaomiMiMo](https://github.com/XiaomiMiMo), 2026\.Technical report\.
- Yang et al\. \(2023\)Yang, K\., Swope, A\., Gu, A\., Chalamala, R\., Song, P\., Yu, S\., Godil, S\., Prenger, R\., and Anandkumar, A\.LeanDojo: Theorem proving with retrieval\-augmented language models\.In*Advances in Neural Information Processing Systems*, 2023\.
- Zheng et al\. \(2022\)Zheng, K\., Han, J\. M\., and Polu, S\.MiniF2F: a cross\-system benchmark for formal olympiad\-level mathematics\.In*International Conference on Learning Representations*, 2022\.doi:10\.48550/arXiv\.2109\.00110\.URL[https://arxiv\.org/abs/2109\.00110](https://arxiv.org/abs/2109.00110)\.
- Zhou et al\. \(2024\)Zhou, J\. P\., Staats, C\., Li, W\., Szegedy, C\., Weinberger, K\. Q\., and Wu, Y\.Don’t trust: Verify – grounding llm quantitative reasoning with autoformalization\.In*International Conference on Learning Representations*, 2024\.

## Appendix AWilson 95% CI on per\-stratum recovery rates

[Table8](https://arxiv.org/html/2606.28013#A1.T8)gives Wilson 95% CIs for every entry of[Table2](https://arxiv.org/html/2606.28013#S5.T2)\. Pairwise Fisher exact tests for homogeneity of theTO→\\toTSrate returnp=1\.0p\{=\}1\.0for all three pairs \(the observed counts are identical,23/6123/61for each method\)\. The semantic andBFrecovery rates have overlapping CIs across methods, consistent with a shared underlying rate plus small\-sample noise\. The wideBFCI \(±∼20\\pm\{\\sim\}20pp\) reflectsn=14n\{=\}14rather than instability of the recovery mechanism\.

Table 8:Wilson 95% CI for each entry of[Table2](https://arxiv.org/html/2606.28013#S5.T2)\.
## Appendix BBootstrap 95% CI on stratum\-shiftΔ\\Deltas

Resampling is at the problem level rather than the cell level because the four cell counts on a given problem are coupled \(each problem contributes to exactly one cell per method\), and intervals are the percentile method onB=10,000B\{=\}10\{,\}000resamples \(seed 20260512\) on DeepSeek V4\-Pro×\\timesProofNet\#\(n=186n\{=\}186\)\. The interval containment pattern is uniform across the three elab\-feedback methods:Δ\\DeltaTSandΔ\\DeltaTOare bounded away from 0 for all three,Δ\\DeltaSOCIs all straddle 0, andΔ\\DeltaBFreaches significance only on Lean\-Retry, which gives the precise numerical backing for the qualitative claim of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\.

Table 9:Bootstrap 95% CI onΔ\\Deltavs\. Vanilla \(B=10,000B\{=\}10\{,\}000\)\.Δ\\DeltaBFsignificant only for Lean\-Retry\.
## Appendix CGTEDτ\\taucalibration: per\-method precision

[Table6](https://arxiv.org/html/2606.28013#S5.T6)of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)reports the 4\-method pooled precision atτ=0\.5\\tau\{=\}0\.5\(95\.0%95\.0\\%on ProofNet\#\)\. The breakdown below shows that no single method drags the pool: per\-method precision stays at≥93\.4%\\geq 93\.4\\%atτ=0\.5\\tau\{=\}0\.5and rises monotonically withτ\\tau, so the chosen threshold is robust to which method’s elab\-pass outputs are pooled into the calibration set\.

Table 10:GTED precision at three thresholds, per\-method on DeepSeek V4\-Pro×\\timesProofNet\#elab\-pass subset\.
## Appendix DDisagreement taxonomy: rules and audit

[Table7](https://arxiv.org/html/2606.28013#S5.T7)of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)reports a four\-class classification of every Opus\-equiv but GTED<0\.5<0\.5output on DeepSeek V4\-Pro×\\timesProofNet\#, across four methods\. Classes are defined as follows:

N \(Notational\)\.The rewrite preserves the operator\-tree shape modulo Lean’s surface notation:A \-\> B↔\\leftrightarrowforall \_ : A, B,Real^n↔\\leftrightarrowFin n \-\> Real, alternativeopen/namespace prefixes, coercion \(ZMod 2as representative for “any field of order 2”\),Function\.Injective↔\\leftrightarrowInjective, and cosmetic variable renaming\. GTED penalises these because its operator tree is extracted pre\-reduction\.

I \(Idiomatic\)\.The rewrite swaps one Mathlib idiom for a definitionally equivalent or near\-equivalent one:IsConj↔\\leftrightarrowexists g, b = g\*a\*g\-1,Summable↔\\leftrightarrowTendstoof partial sums,CompactSpace↔\\leftrightarrowIsCompact Set\.univ,LocallyCompactSpace↔\\leftrightarrowNonempty\-wrappedLinearEquiv,generateFrom = sInf↔\\leftrightarrowgenerateFrom \(sInter \.\.\.\), etc\.

S \(Structural\)\.The rewrite changes term structure: binder ordering, hypothesis hoisting \(let\-binding vs\. explicit hypothesis vs\. premise\), typeclass formulation \(\[Field K\]vs\.\(h : Field K\),\[Fact p\.Prime\]vs\.Nat\.Primes,\[GCDMonoid R\]vs\. explicit gcd predicate\), quantifier reordering across an existential or universal,Iffdirection swap or biconditional collapse, implication\-vs\-biconditional substitution, subtype\-vs\-predicate\-restricted\-set, and hypothesis omission of vacuously\-true premises \(e\.g\.0 < nwhenn=0n\{=\}0admits no counterexample\)\.

Z \(Residual\)\.Cases mixing multiple categories \(e\.g\. both a notational and a structural rewrite\) or specific to NL ambiguity that do not cleanly map to one class\.

Classification rules are encoded as a regex cascade evaluated in the order I→\\toN→\\toS→\\toZ, and the first match wins\. Rules are tuned on the Opus rationale text, then frozen before the per\-method counts were tabulated\. The full classifier ships with the supplementary scriptanalysis/disagreement\_taxonomy\.py, which also dumps the per\-case rationale string and assigned label for every entry in[Table7](https://arxiv.org/html/2606.28013#S5.T7)\. Limitations of the classifier: \(i\) the Z bucket absorbs cases that mix two structural levels or that the rationale describes ambiguously, and \(ii\) borderline I vs\. N decisions \(e\.g\.deriv\[n\]vs\.iteratedDeriv n\) are made by the first matching rule rather than by adjudication\. The dominant\-class claim of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\(that S grows under elab\-feedback\) is robust to these noise sources because S’s share rises by 12 to 20 pp between Vanilla and the elab\-feedback methods, well outside any plausible reassignment of the Z residual\.

## Appendix ECross\-method TC/SF dissociation under GTED \(6 \(model, dataset\) combos\)

This table reports the GTED\-judgedΔ\\Deltavs\. Vanilla on the five \(model, dataset\) cells outside the Opus\-judged DeepSeek V4\-Pro×\\timesProofNet\#baseline of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\. The dissociation pattern reappears:Δ\\DeltaTC% is positive in 11 of 12 \(method, model, dataset\) rows \(up to\+16\.1\+16\.1pp\), whileΔ\\DeltaSF% stays within±5\\pm 5pp in 11 of 12 rows and is non\-positive in 6 of 12: elab\-feedback moves the elab axis without moving the semantic axis on every cell tested\.

Table 11:Δ\\Deltavs\. Vanilla under GTEDτ=0\.5\\tau\{=\}0\.5across the six \(model, dataset\) cells\. Left: ProofNet\#\. Right: MiniF2F\.\(a\)ProofNet\#\(type\-complex,n=186n\{=\}186\)
\(b\)MiniF2F \(type\-simple,n=244n\{=\}244\)

## Appendix FCase studies

Each case below shows the natural\-language statement, the gold formalization, and one or more candidate outputs\. Gold and candidates are reproduced in Lean 4 syntax with implicit elaboration arguments preserved\. Outcome tags use the cell labels of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\. The five cases together cover the four mechanisms identified in the body: Case A is aTO→\\toTSrecovery \(§[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\), Case B a canonical GTED false negative driven by surface\-form rewriting \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\), Cases C and D the two distinct gold\-formalization bug shapes that pin the durableSOfloor \(§[5\.3](https://arxiv.org/html/2606.28013#S5.SS3)\), and Case E a representativeTOthat survivesK=3K\{=\}3refinement across all four methods \(§[5\.8](https://arxiv.org/html/2606.28013#S5.SS8)\)\.

#### Case A:TYPE\_ONLY→\\toTRUE\_SUCCESSunder Lean\-Retry \(Dummit\-Foote\|exercise\_11\_1\_13\)\.

NL\.“As vector spaces overℚ\\mathbb\{Q\},ℝn≅ℝ\\mathbb\{R\}^\{n\}\\cong\\mathbb\{R\}, for alln∈ℤ\+n\\in\\mathbb\{Z\}^\{\+\}\.”

theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):

Nonempty\(LinearEquiv\(RingHom\.idRat\)

\(Finn→\\toReal\)Real\)

theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):

Nonempty\(LinearEquiv\(RingHom\.idRat\)

\(Real^n\)Real\)

theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):

Nonempty\(LinearEquiv\(RingHom\.idRat\)

\(Finn→\\toReal\)Real\)

Opus judges equiv to gold under both STRICT and EQUIV\-IF\-FIXED prompts\. One of 23TO→\\toTSrecoveries\.

#### Case B: GTED false negative on a surface\-form rewrite \(Munkres\|exercise\_29\_4\)\.

NL\.“Show that\[0,1\]ω\[0,1\]^\{\\omega\}is not locally compact in the uniform topology\.”

theoremexercise\_29\_4:

Not\(LocallyCompactSpace\(Nat→\\toSet\.Icc\(0:Real\)1\)\)

theoremexercise\_29\_4:

Not\(LocallyCompactSpace

\(foralli:Nat,Set\.Icc\(0:Real\)1\)\)

The two function types are the sameΠ\\Pitype with different surface notations \(A \-\> Bis itself notation forforall \_ : A, B\)\. The elaborator treats them identically\. The typed operator tree GTED extracts from each is structurally divergent \(arrow node vs\.∀\\forallbinder\), giving the 0\.143 similarity\. Canonical shape of the\+26\+26to\+37\+37pp Opus\-vs\-GTED gap on elab\-feedback outputs \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\.

#### Case C: DurableSEM\_ONLYfrom a buggy gold formalization \(Ireland\-Rosen\|exercise\_1\_30\)\.

NL\.“Prove that1/2\+1/3\+⋯\+1/n1/2\+1/3\+\\cdots\+1/nis not an integer\.”

theoremexercise\_1\_30\{n:Nat\}:

Not\(Existsfuna:Int⇒\\Rightarrow

\(Finset\.sum\(Finset\.univ:Finset\(Finn\)\)

fun\_⇒\\Rightarrow\(1:Rat\)/\(n\+2\)\)=a\)

theoremexercise\_1\_30\{n:Nat\}\(hn:2≤\\leqn\):

Not\(Existsfuna:Int⇒\\Rightarrow

\(Finset\.sum\(Finset\.Icc2n\)

funk⇒\\Rightarrow\(1:Rat\)/k\)=a\)

The gold theorem and the natural\-language statement are different propositions: the gold sum is a rationaln/\(n\+2\)n/\(n\+2\)rather than the harmonic tail\. The only problem durablySOacross all four methods on DeepSeek V4\-Pro×\\timesProofNet\#\.

#### Case D: DurableSEM\_ONLYfrom a vacuous gold biconditional \(Dummit\-Foote\|exercise\_7\_2\_2\)\.

NL\.“Letp​\(x\)=an​xn\+⋯\+a0∈R​\[x\]p\(x\)=a\_\{n\}x^\{n\}\+\\cdots\+a\_\{0\}\\in R\[x\]\. Prove thatp​\(x\)p\(x\)is a zero divisor inR​\[x\]R\[x\]iff there is a nonzerob∈Rb\\in Rwithb⋅p​\(x\)=0b\\cdot p\(x\)=0\.”

theoremexercise\_7\_2\_2\{R:Type\*\}\[RingR\]

\(p:PolynomialR\):

Iff\(Dvd\.dvdp0\)

\(Existsfunb:R⇒\\Rightarrowb≠\\neq0/\\b\*p=0\)

theoremexercise\_7\_2\_2\{R:Type\*\}\[CommRingR\]

\(p:PolynomialR\):

Iff\(Existsfunq:PolynomialR⇒\\Rightarrowq≠\\neq0/\\p\*q=0\)

\(Existsfunb:R⇒\\Rightarrowb≠\\neq0/\\b\*p=0\)

Gold’s LHS is vacuous over any commutative ring, collapsing the biconditional to its right\-hand side\. DurablySOunder Lean\-Retry and Sample\-Filter\. Under SAF the IR refinement loop fails to elaborate at all \(BF\)\. Both gold\-formalization bugs are documented with Opus reasoning traces in the supplementary note accompanying this submission\.

#### Case E: DurableTOacross all four methods \(Dummit\-Foote\|exercise\_3\_4\_5b\)\.

NL\.“Prove that quotient groups of a solvable group are solvable\.”

theoremexercise\_3\_4\_5b\{G:Type\*\}\[GroupG\]\[IsSolvableG\]

\(H:SubgroupG\)\[NormalH\]:

IsSolvable\(G/H\)

theoremsolvable\_quotient\{G:Type\*\}\[GroupG\]\[SolvableG\]

\(H:SubgroupG\)\[H\.Normal\]:

Solvable\(G/H\)

theoremsolvable\_quotient\{G:Type\*\}\[GroupG\]\[SolvableG\]

\(H:SubgroupG\)\[h:H\.Normal\]:

Solvable\(G/H\)

Opus judges both candidates equiv\-if\-fixed: the mathematical content matches gold, only the Mathlib API name is wrong\. SAF and Sample\-Filter likewise fail to elaborate atK=3K\{=\}3on this problem\. One of 16 problems that remainTOacross all four methods on DeepSeek V4\-Pro×\\timesProofNet\#: a canonical instance of the elab signal naming a symptom without surfacing the cure\.

Similar Articles

Evaluating the Robustness of Proof Autoformalization in Lean 4

arXiv cs.CL

This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.

Theory-Scale Auto-Formalization of Logics for Computer Science

arXiv cs.LG

Introduces LCS-Bench, a theory-scale benchmark for auto-formalization of logics for computer science, covering 327 textbook items over 4,076 Lean declarations. Evaluation on 14 models shows the benchmark is challenging, with SOTA models achieving only 20.1% on auto-formalization tasks.