The Signal-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization
Summary
This paper introduces a signal-coverage matrix that decomposes type-correctness gains in autoformalization into four strata, revealing the mechanisms behind LLM refinements and showing that headline metrics can obscure which errors are actually resolved.
View Cached Full Text
Cached at: 06/29/26, 05:25 AM
# Stratifying Type and Semantic Errors in Statement Autoformalization
Source: [https://arxiv.org/html/2606.28013](https://arxiv.org/html/2606.28013)
## The Signal\-Coverage Matrix: Stratifying Type and Semantic Errors in Statement Autoformalization
###### Abstract
Headline type\-correctness \(TC%\) of LLM autoformalization has climbed from∼\\sim53% to∼\\sim76% in two years, yet this scalar conceals which errors each method resolves\. We propose a signal\-coverage matrix that crosses the Lean elaborator \(pass/fail\) with a semantic\-equivalence judgment \(equivalent/not\), sorting every output into one of four cells: true success \(TS\), type\-only \(TO\), semantic\-only \(SO\), or both fail \(BF\)\. On ProofNet\#and MiniF2F\-test with DeepSeek V4\-Pro across Vanilla, Lean\-Retry, Sample\-Filter, and Stratified Autoformalization \(SAF\): \(1\) the\+34\+34to\+36\+36TSgain across the three elab\-feedback methods is∼\\sim64% type\-stratum recovery, withSOflat on net \(87\.5% of original semantic errors rescued, 8 newly created\)\. \(2\) TheTO→\\toTSrate is23/6123/61for each method \(Wilson 95% CI \[26\.6%, 50\.3%\]\), and this stratum\-level recovery rate predictsΔTS\\Delta\\textsc\{TS\}on held\-out methods to within2/1862/186and rendersΔTC\\Delta TClinear in the Vanilla elab\-fail rate across six \(model, dataset\) cells \(R2=0\.96R^\{2\}\{=\}0\.96\)\. \(3\) The two judges disagree by 26 to 37 pp on elab\-feedback outputs \(vs\. 7 pp on Vanilla\), with 30 to 56% of symbolic\-judge false negatives traceable to elaborator\-forced rewrites\. The persistent residual reduces to two gold\-formalization errors\. TC% gains should be credited by which cell moved, not by the scalar alone\.
Autoformalization, Statement formalization, Semantic faithfulness, Elaborator feedback, Signal\-coverage matrix, LLM\-as\-judge
## 1Introduction
Statement\-level autoformalization maps a natural\-language theorem to a Lean 4 signature with asorry\-placeholder body\. Headline type\-correctness \(TC%\) on ProofNet\-class benchmarks\(Azerbayev et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib2); Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\)has climbed from∼\\sim53% \(GPT\-4o\(OpenAI,[2024](https://arxiv.org/html/2606.28013#bib.bib11)\)\) to∼\\sim76% \(DeepSeek V4\-Pro\(DeepSeek\-AI,[2026](https://arxiv.org/html/2606.28013#bib.bib4)\)\) in two years, driven by methods that refine against the Lean elaborator\. Because TC% does not certify denotational correctness, semantic\-faithfulness \(SF%\) estimators are reported alongside it: back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and generalized tree edit distance \(GTED,Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\)\.
Headline metrics conceal mechanism\. A\+15\+15pp TC% gain hides which error stratum is repaired and which is not, and the choice of SF% judge can shift the reported value by≥25\\geq 25pp on elab\-feedback methods \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. No published account decomposes TC% gains by stratum and method\. We provide such an account\.
Existing methods cluster along three axes\.Single\-oracle refinement against the elaborator\(Wu et al\.,[2022](https://arxiv.org/html/2606.28013#bib.bib15); Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12); Guo et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib5); Lu et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib10)\)consults only the elaborator \(or a close proxy\) during refinement, and the class has plateaued at∼\\sim75% TC% on ProofNet\#regardless of topology\.Semantic\-faithfulness judgingcomprises LLM back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and symbolic GTED via the Lean LSP\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\), used as standalone SF% estimators but never calibrated against each other at scale\.Typed structured intermediate representationswere introduced for Isabelle by Draft, Sketch, and Prove\(Jiang et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib6)\), whose sketch is a textual proof skeleton rather than a typed structural object\. The broader formal\-math LM literature on premise selection, proof generation, and math pretraining\(Polu & Sutskever,[2020](https://arxiv.org/html/2606.28013#bib.bib13); Yang et al\.,[2023](https://arxiv.org/html/2606.28013#bib.bib17); Azerbayev et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib3); Lin et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib7)\)addresses different stages of the pipeline and is complementary\.
We instantiate a2×22\{\\times\}2signal\-coverage matrix \(elaborator pass/fail×\\timesjudge equivalent/not\) on DeepSeek V4\-Pro×\\timesProofNet\#\(186 problems\) under a dual\-judge protocol that pairs Claude Opus 4\.7\(Anthropic,[2026](https://arxiv.org/html/2606.28013#bib.bib1)\)with GTED atτ=0\.5\\tau\{=\}0\.5, tracking per\-problem transitions from Vanilla into Lean\-Retry, Sample\-Filter, and Stratified Autoformalization \(SAF\)\.
This paper makes three contributions:
\(1\)A diagnostic framework and the predictive regularity it yields\.A2×22\{\\times\}2signal\-coverage matrix and a4×44\{\\times\}4transition matrix decompose any TC% gain into type\-stratum recovery, semantic\-stratum turnover, and a durable residual\. Under this lens, three algorithmically distinct methods each recover23/6123/61of VanillaTO\(Wilson 95% CI \[26\.6%, 50\.3%\]\), andΔTC\\Delta TCis linear in the Vanilla elab\-fail rate atR2=0\.96R^\{2\}\{=\}0\.96across six \(model, dataset\) cells \(§[2](https://arxiv.org/html/2606.28013#S2), §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2), §[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\)\.
\(2\)A calibrated dual\-judge protocol\.Opus paired with GTED atτ=0\.5\\tau\{=\}0\.5achieves94\.4%94\.4\\%to95\.0%95\.0\\%precision across benchmarks, and a four\-class taxonomy attributes 30 to 56% of GTED false negatives to elaborator\-forced structural rewriting \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\.
\(3\)SAF as a causal\-attribution probe\.Stratified Autoformalization \(SAF\) refines a typed JSON IR against the elaborator and emits Lean by a deterministic translator\. Its determinism makes the candidate’s surface form a function of the IR alone, attributing the\+37\+37pp Opus\-vs\-GTED gap to elaborator\-driven surface geometry rather than LLM sampling noise \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. SAF ties Lean\-Retry and Sample\-Filter on aggregate metrics\. Its role here is diagnostic rather than state\-of\-the\-art optimization\.
## 2The Signal\-Coverage Matrix
Under statement\-level scope \(NLstatementSS→\\toLean 4 signatureσ\\sigmawith:= by sorry\), an output that fails its specification falls into exactly one of two strata\. A type\-stratum failure meansσ\\sigmadoes not type\-check \(Unknown identifier,failed to synthesize,application type mismatch, …\), so the math may be understood but the Lean surface form is invalid\. A semantic\-stratum failure meansσ\\sigmatype\-checks but denotes a different proposition fromSS, for instance hardcodingZMod 2whenSSrefers to “any field of size 2”, usingSummable fwhenSSasserts the sequence has a limit, reversing an inequality, or dropping a hypothesis\. \(A third, implementation stratum for proof\-level autoformalization is out of scope\.\)
The two refinement signals used in practice are stratum\-specialized\. The Lean elaborator is complete on the type stratum, since every failure raises an error, and silent on the semantic stratum, since a semantically wrong but type\-correctσ\\sigmaelaborates without complaint\. Semantic judges, namely back\-translation\(Zhou et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib19)\)and GTED\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\), cover the semantic stratum but forgive the type stratum, since a non\-compiling fragment back\-translates to noisyNL\.
Crossing elaboration \(pass / fail\) with the semantic judgment \(equivalent / not\-equivalent\) partitionsnnoutputs into four cells \([Figure1](https://arxiv.org/html/2606.28013#S2.F1)\): true success \(TS\), type\-only failure \(TO\), semantic\-only failure \(SO\), and both fail \(BF\)\. These labels denote failure modes, not correctness modes\. The two off\-diagonal cells are exactly the per\-instance disagreements between elaborator and judge, and their size, sources, and dynamics across methods are this paper’s core empirical object\. Writing\|c\|\|c\|for the count in cellcc,
TC%=\|TS\|\+\|SO\|n,SF%=\|TS\|n,n=∑c\|c\|\.TC\\%=\\tfrac\{\|\\textsc\{TS\}\|\+\|\\textsc\{SO\}\|\}\{n\},\\quad SF\\%=\\tfrac\{\|\\textsc\{TS\}\|\}\{n\},\\quad n=\\textstyle\\sum\_\{c\}\|c\|\.\(1\)
Figure 1:Signal\-coverage matrix: elaborator \(pass/fail\)×\\timesjudge \(equivalent/not\) gives the2×22\{\\times\}2cell decomposition and TC%/SF%\.
## 3Stratified Autoformalization
Because refinement signals are stratum\-specialized \(§[2](https://arxiv.org/html/2606.28013#S2)\), SAF factors autoformalization through a typed intermediate representation \(IR\)\. The LLM emits an IR, a deterministic∼\\sim30\-line translator produces Lean from it, and on elab failure the IR is refined rather than Lean source\. This determinism is central: every refined SAF output is a function of the IR alone\. On the aggregate metrics, SAF ties Lean\-Retry and Sample\-Filter on both TC and Opus\-judged SF \(§[5](https://arxiv.org/html/2606.28013#S5)\)\. Its role in this paper is the causal\-attribution probe of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5), where the\+37\+37pp Opus\-vs\-GTED gap is attributed to elaborator\-driven surface geometry rather than to LLM sampling noise\.
The refinement loop \([Algorithm1](https://arxiv.org/html/2606.28013#alg1)\) alternates IR generation by the LLM with elaborator checking until the candidate type\-checks\. The total LLM budget isK\+1K\{\+\}1\(one extraction plusKKrefinements\), and we useK=3K\{=\}3in all experiments\.
A SAF IR is a JSON object with five fields:
\{
"imports":\["Mathlib"\],
"opens":\["Polynomial","Real"\],
"objects":\[
\{"name":"G","type":"Type\*","binding":"implicit"\},
\{"name":"","type":"GroupG","binding":"instance"\},
\{"name":"a","type":"G","binding":"explicit"\}
\],
"hypotheses":\[\{"id":"h","expression":"a\*b=b\*a"\}\],
"conclusion":"<Lean4expression\>"
\}
Algorithm 1SAF refinement loop0:NL statement
SS, retry budget
KK
0:type\-checked Lean signature, or
⊥\\bot
1:
ir←LLM\.extract\_IR\(S\)\\mathrm\{ir\}\\leftarrow\\mathrm\{LLM\.extract\\\_IR\}\(S\)
2:for
k=0k=0to
KKdo
3:
σ←ir\_to\_lean\(ir\)\\sigma\\leftarrow\\mathrm\{ir\\\_to\\\_lean\}\(\\mathrm\{ir\}\)
4:
\(ok,errors\)←LeanElaborator\.check\(σ\)\(\\mathrm\{ok\},\\mathrm\{errors\}\)\\leftarrow\\mathrm\{LeanElaborator\.check\}\(\\sigma\)
5:if
ok\\mathrm\{ok\}then
6:return
σ\\sigma
7:endif
8:
ir←LLM\.refine\_IR\(S,ir,errors\)\\mathrm\{ir\}\\leftarrow\\mathrm\{LLM\.refine\\\_IR\}\(S,\\mathrm\{ir\},\\mathrm\{errors\}\)
9:endfor
10:return
⊥\\bot
Theimportsandopensfields carry the Mathlib environment,objectsdeclares typed binders with their binding mode,hypotheseslists named premises, andconclusionholds the goal as a Lean 4 expression\. Their\_to\_leantranslator is a∼\\sim30\-line template substitution that maps each field to its Lean counterpart:importstoimportlines,openstoopenlines, eachobjectsentry to a typed binder \(\(x : T\)/\{x : T\}/\[T\]perbinding\),hypothesesto hypothesis binders, andconclusionto the theorem body ending in:= by sorry\. The translator is purely template\-based \(no LLM, no fuzzy matching, no error recovery\), and malformed IR fails fast\.
## 4Experimental Setup
#### Datasets\.
ProofNet\#\(Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\)is the type\-complex benchmark: 186 undergraduate\-textbook problems \(Rudin, Munkres, Dummit\-Foote, Artin, Axler, Herstein, Pugh, Putnam, Ireland\-Rosen, Shakarchi\) exercising Mathlib typeclass machinery \(Group,MeasurableSpace,ContinuousMap,Sylow, …\)\. MiniF2F\-test\(Zheng et al\.,[2022](https://arxiv.org/html/2606.28013#bib.bib18)\)is the type\-simple benchmark: 244 olympiad problems admitting largely primitive\-type formalizations\. The two bracket the type\-system complexity axis\.
#### Subject models\.
DeepSeek V4\-Pro \(49B active / 1\.6T total MoE parameters\) is the primary subject, with the full battery on ProofNet\#and three methods on MiniF2F\. Qwen3\.5\-Plus\(Qwen Team,[2026](https://arxiv.org/html/2606.28013#bib.bib14)\)\(480B\-class\) and MiMo\-v2\.5\-Pro\(Xiaomi MiMo Team,[2026](https://arxiv.org/html/2606.28013#bib.bib16)\)\(Xiaomi, thinking traces disabled\) serve as cross\-family direction checks\. All decoding is attemperature=0\.0except whereN=4N\{=\}4sampling is the method itself\. Gold\-judge analysis is anchored on DeepSeek V4\-Pro×\\timesProofNet\#, with cross\-model and cross\-dataset runs serving as mechanism\-direction checks at coarser \(GTED\-only\) judge resolution\.
#### Methods\.
Four methods, three at the same∼\\sim4\-LLM\-call budget\. Vanilla: one\-shot\. Lean\-Retry \(Guo et al\.,[2025](https://arxiv.org/html/2606.28013#bib.bib5)\-style\): emit Lean signature, refine by feeding elaborator errors back to the LLM,K=3K\{=\}3retries\. Sample\-Filter \(Poiroux et al\.,[2024](https://arxiv.org/html/2606.28013#bib.bib12)\-style\): sampleN=4N\{=\}4candidates, retain elab\-pass candidates, pick first\. SAF \(this paper, §[3](https://arxiv.org/html/2606.28013#S3)\): typed JSON IR refined against the elaborator, translated to Lean by a deterministic∼\\sim30\-line procedure,K=3K\{=\}3\.
#### Judges\.
A dual\-judge protocol pairs Claude Opus 4\.7 \(Anthropic, disjoint from all four subject families\) as the gold cross\-judge with GTED\(Liu et al\.,[2025b](https://arxiv.org/html/2606.28013#bib.bib9)\)as the scalable symbolic proxy\. Opus compares candidate Lean against gold Lean for semantic equivalence under a STRICT prompt when the candidate elaborates and an EQUIV\-IF\-FIXED prompt when it does not, yielding 744 verdicts on DeepSeek V4\-Pro×\\timesProofNet\#×\\times\{Vanilla, Lean\-Retry, SAF, Sample\-Filter\}, 244 on DeepSeek V4\-Pro×\\timesMiniF2F×\\timesVanilla, and 988 in total\. GTED computes generalized tree edit distance on the LSP\-extracted typed operator tree, returns a similarity in\[0,1\]\[0,1\], is thresholded atτ=0\.5\\tau\{=\}0\.5, and runs on all 19 \(model, dataset, method\) combinations\.
#### Infrastructure\.
Lean 4\.19\.0 vialean\_interact, pre\-built Mathlib, ATLAS\-based\(Liu et al\.,[2025a](https://arxiv.org/html/2606.28013#bib.bib8)\)GTED LSP backend, elab timeout 60s, and JSONL resume on all pipelines\.
## 5Experimental Results and Analysis
### 5\.1Per\-method cell decomposition
Figure 2:Vanilla signal\-coverage matrix \(DeepSeek V4\-Pro×\\timesProofNet\#\)\. Rose\-outlined off\-diagonal cells mark elaborator/Opus disagreements\.For each of the 186 ProofNet\#test problems run on DeepSeek V4\-Pro under each of the four methods, we record the elaborator label \(pass / fail\) and the Opus label \(equivalent / not\-equivalent\), assigning each output to one of the four cells defined in §[2](https://arxiv.org/html/2606.28013#S2)\.[Table1](https://arxiv.org/html/2606.28013#S5.T1)reports the four\-method matrix,[Figure2](https://arxiv.org/html/2606.28013#S5.F2)visualizes Vanilla’s cell counts, and[Figure3](https://arxiv.org/html/2606.28013#S5.F3)shows the cross\-method TC / SF comparison \(computed per[Equation1](https://arxiv.org/html/2606.28013#S2.E1)\)\. We use DeepSeek V4\-Pro×\\timesProofNet\#as the running example because it is the only \(model, dataset\) cell with full four\-method Opus dual\-judging\.[AppendixE](https://arxiv.org/html/2606.28013#A5)extends the sameΔ\\DeltaTC /Δ\\DeltaSF comparison to the other five cells under GTED\. The cell\-level view is what lets us separate elaborator\-side recovery \(TO→\\toTS\) from semantic recovery \(SO→\\toTS\) rather than read both off a single aggregate rate\.
Table 1:Signal\-coverage matrix on DeepSeek V4\-Pro×\\timesProofNet\#\(n=186n\{=\}186\), four methods\. Cells are defined in[Figure1](https://arxiv.org/html/2606.28013#S2.F1), labels denote failure modes\.Vanilla decomposes as 95 / 16 / 61 / 14\. Of the 91 failures, the elaborator catches 75 \(TO\+\+BF\) and Opus catches 30 \(SO\+\+BF\), with overlap of 14\. The 77 / 186 outputs off\-diagonal \(41\.4%,SO\+\+TO\) are the per\-instance disagreements between the two signals\. The largest single recovery target for any elab\-only feedback method is the 61TOcases: semantically correct outputs that the elaborator nonetheless rejects on surface form\.
Figure 3:TC% and Opus\-judged SF% across four methods on DeepSeek V4\-Pro×\\timesProofNet\#\. The three elab\-feedback methods \(Lean\-Retry, Sample\-Filter, SAF\) cluster within∼1\\sim 1pp on both axes\.The three elab\-feedback methods add 34 to 36 problems toTS, shrinkTOby 20 to 24, shrinkBFby 7 to 10, and shrinkSOby only 5–6 from its baseline of 16, despite their algorithmic differences \(feedback vs\. sampling, raw Lean vs\. typed IR\)\. A bootstrap 95% CI \(B=10,000B\{=\}10\{,\}000,[Table9](https://arxiv.org/html/2606.28013#A2.T9)\) showsΔ\\DeltaTSandΔ\\DeltaTOsignificant for all three methods \(CI does not cross 0\), whileΔ\\DeltaSOis not significant for any\.
### 5\.2Per\-problem transitions:TSgains come mostly fromTO\-rescue
AggregateΔ\\Deltas answer which cells moved on net\. To resolve which problems moved, we join each problem’s Vanilla cell to its Lean\-Retry cell, giving the4×44\{\\times\}4transition matrix in[Figure4](https://arxiv.org/html/2606.28013#S5.F4)and the per\-cell decomposition ofΔ\|TS\|\\Delta\|\\textsc\{TS\}\|in[Equation2](https://arxiv.org/html/2606.28013#S5.E2)\.
Figure 4:Per\-problem Vanilla→\\toLean\-Retry transition matrix\. Green column: rescue intoTS\. Orange column: newly createdSO\. Dotted diagonal: stay\-put\. Shade encodes count\.Writing\|c→c′\|\|c\{\\to\}c^\{\\prime\}\|for the number of problems in Vanilla cellccthat land in Lean\-Retry cellc′c^\{\\prime\}, the netTSgain is given by
Δ\|TS\|\\displaystyle\\Delta\|\\textsc\{TS\}\|=\|TO→TS\|⏟23\+\|SO→TS\|⏟14\+\|BF→TS\|⏟2\\displaystyle=\\underbrace\{\|\\textsc\{TO\}\{\\to\}\\textsc\{TS\}\|\}\_\{23\}\+\\underbrace\{\|\\textsc\{SO\}\{\\to\}\\textsc\{TS\}\|\}\_\{14\}\+\\underbrace\{\|\\textsc\{BF\}\{\\to\}\\textsc\{TS\}\|\}\_\{2\}\(2\)−\|TS→¬TS\|⏟3=\+36,\\displaystyle\\quad\-\\underbrace\{\|\\textsc\{TS\}\{\\to\}\\neg\\textsc\{TS\}\|\}\_\{3\}\\;=\\;\+6,attributing23/36≈64%23/36\\approx 64\\%to type\-stratum recovery and14/36≈39%14/36\\approx 39\\%to semantic flip on the rescued side\. Per\-cell individual rescue rates are: 96\.8% of VanillaTSstayTS, 87\.5% \(14 / 16\) of VanillaSOproblems are individually rescued, and only 37\.7% \(23 / 61\) of VanillaTOare recovered, with the remainder stuck\.
### 5\.3Semantic turnover and the gold\-error floor
The aggregateSOcount drops only from 16 to 10 under Lean\-Retry, but this stability masks substantial per\-problem turnover\. 14 of 16 VanillaSOare individually rescued \(87\.5%\), while Lean\-Retry simultaneously creates 8 newSO\(4 fromBF, 3 fromTSregression, 1 fromTO\), for a net change of−6\-6: aggregate stability hides a near\-total replacement of the underlying error population\.
Of the 2 VanillaSOproblems that remainSOunder Lean\-Retry, both are gold\-formalization bugs, verified by Opus reasoning across all four methods\. The first,Ireland\-Rosen\|exercise\_1\_30, asks “Prove that1/2\+1/3\+⋯\+1/n1/2\+1/3\+\\cdots\+1/nis not an integer\.” Gold’s summand1n\+2\\frac\{1\}\{n\+2\}does not depend on the summation index, so the gold sum evaluates ton/\(n\+2\)n/\(n\+2\), a different proposition from the harmonic tail\. The candidate faithfully formalizes the harmonic tail, and Opus correctly judges candidate≠\\neqgold under all four methods \([Example1](https://arxiv.org/html/2606.28013#Thmtheorem1)\)\.
###### Example 1\(Buggy gold formalization,n=1n\{=\}1\)\.
theoremexercise\_1\_30\{n:Nat\}:
Not\(Existsfuna:Int⇒\\Rightarrow
\(Finset\.sum\(Finset\.univ:Finset\(Finn\)\)
fun\_⇒\\Rightarrow\(1:Rat\)/\(n\+2\)\)=a\)
theoremexercise\_1\_30\{n:Nat\}\(hn:2≤\\leqn\):
Not\(Existsfuna:Int⇒\\Rightarrow
\(Finset\.sum\(Finset\.Icc2n\)
funk⇒\\Rightarrow\(1:Rat\)/k\)=a\)
The companion caseDummit\-Foote\|exercise\_7\_2\_2has gold LHSp∣0p\\mid 0, trivially true in any commutative ring, so the biconditional collapses to a one\-sided statement \([AppendixF](https://arxiv.org/html/2606.28013#A6), Case D\)\. Both cases are documented with Opus traces and suggested corrections in the supplementary note\.
Across all four methods, only 1 of 186 problems is durablySO\(Ireland\-Rosen\|exercise\_1\_30\)\. Under Lean\-Retry the count rises to 2, both gold bugs\. Excluding the gold bugs, zero of 186 problems are durably semantic\-only across the three elab\-feedback methods, so theSOplateau approximates the gold\-error floor of the benchmark\.
### 5\.4A predictive regularity: stratum\-rates are method\-invariant
The aggregate gains of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)and the transition matrix of §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)naturally raise the question: do the three elab\-feedback methods recover the*same*problems at the*same*rate, or do they merely converge in aggregate? For each Vanilla cellc∈\{TS,SO,TO,BF\}c\\in\\\{\\textsc\{TS\},\\textsc\{SO\},\\textsc\{TO\},\\textsc\{BF\}\\\}and each elab\-feedback methodmm, define the recovery rate
ρc→TS\(m\)=\|\{p:Vanilla\(p\)=c∧m\(p\)=TS\}\|\|\{p:Vanilla\(p\)=c\}\|\.\\rho\_\{c\\to\\textsc\{TS\}\}\(m\)\\;=\\;\\frac\{\|\\\{p:\\text\{Vanilla\}\(p\)=c\\,\\wedge\\,m\(p\)=\\textsc\{TS\}\\\}\|\}\{\|\\\{p:\\text\{Vanilla\}\(p\)=c\\\}\|\}\.\(3\)[Table2](https://arxiv.org/html/2606.28013#S5.T2)reports the four rates for each of the three elab\-feedback methods\.
Table 2:Per\-stratum recovery rates on DeepSeek V4\-Pro×\\timesProofNet\#, dual\-judged\.ρc→TS\(m\)\\rho\_\{c\\to\\textsc\{TS\}\}\(m\)is the share of Vanilla cellccthat lands inTSunder methodmm\.All three elab\-feedback methods recover*exactly*23 of the 61 VanillaTOproblems, despite differing in algorithm \(sequential elaborator feedback, deterministic IR refinement, parallelN=4N\{=\}4sampling\), in LLM call topology, and in candidate generation entropy\. The point estimate has a wide Wilson 95% CI of\[26\.6%,50\.3%\]\[26\.6\\%,50\.3\\%\]atn=61n\{=\}61, so we do not claim the true rate is precisely 37\.7%\. What we claim, and what the predictive regularity of[Table3](https://arxiv.org/html/2606.28013#S5.T3)rests on, is that the three methods recover statistically indistinguishable populations on the type stratum: per\-method CIs overlap entirely, and a Fisher exact test of pairwise homogeneity returnsp=1\.00p\{=\}1\.00for all three pairs\.[AppendixA](https://arxiv.org/html/2606.28013#A1)gives Wilson CIs for every cell of[Table2](https://arxiv.org/html/2606.28013#S5.T2)and discusses the limits of then=186n\{=\}186sample\. The semantic recovery rate is high and tighter than theBFrate \(the latter is small\-nnnoise, 14BFproblems total\)\. The retention rate onTSis uniformly≥92\.6%\\geq 92\.6\\%, so net regression is small\.
Taking any single elab\-feedback method’s rates and applying them as a mass\-balance rule to the Vanilla cell distribution\(\|TS\|,\|SO\|,\|TO\|,\|BF\|\)=\(95,16,61,14\)\(\|\\textsc\{TS\}\|,\|\\textsc\{SO\}\|,\|\\textsc\{TO\}\|,\|\\textsc\{BF\}\|\)=\(95,16,61,14\)yields a testable prediction for the other methods\. With Lean\-Retry’s rates,ΔTS^=0\.377⋅61\+0\.875⋅16\+0\.143⋅14−\(1−0\.968\)⋅95=\+36\.0\\hat\{\\Delta\\textsc\{TS\}\}=0\.377\{\\cdot\}61\+0\.875\{\\cdot\}16\+0\.143\{\\cdot\}14\-\(1\-0\.968\)\{\\cdot\}95=\+36\.0\.[Table3](https://arxiv.org/html/2606.28013#S5.T3)compares predicted to observed: the calibration on Lean\-Retry predicts the other two methods’ΔTS\\Delta\\textsc\{TS\}to within 2 of 186 problems, an order of magnitude better than the bootstrap noise band of any cell\.
Table 3:Predicted vs\. observedΔTS\\Delta\\textsc\{TS\}on DeepSeek V4\-Pro×\\timesProofNet\#, using stratum\-rates calibrated from Lean\-Retry\.Extending across the six \(model, dataset\) cells of the cross\-model panel, where the test is necessarily at the TC level \(Opus dual\-judging is anchored on the V4\-Pro×\\timesProofNet\#cell\), the recovery rate of Vanilla elab\-fails under Lean\-Retry fits a single linear regression in the Vanilla elab\-fail rateVfailV\_\{\\rm fail\}:
ΔTC^=0\.325⋅Vfail\+2\.88\(R2=0\.957\),\\widehat\{\\Delta TC\}\\;=\\;0\.325\\cdot V\_\{\\rm fail\}\\;\+\\;2\.88\\quad\(R^\{2\}=0\.957\),\(4\)with per\-cell residuals in\[−1\.0,\+1\.3\]\[\-1\.0,\+1\.3\]pp \([Table4](https://arxiv.org/html/2606.28013#S5.T4)\)\. The recovery rate of elab\-fails \(ΔTC/Vfail\\Delta TC/V\_\{\\rm fail\}\) is41\.2%41\.2\\%on ProofNet\#\(range\[38\.3,45\.3\]\[38\.3,45\.3\]\) and61\.8%61\.8\\%on MiniF2F \(range\[48\.4,73\.3\]\[48\.4,73\.3\]\): the same elab\-signal recovers a strictly higher share of failures on type\-simple benchmarks because type\-simple failures concentrate in the easy surface\-form band thatK=3K\{=\}3refinement resolves\.
Table 4:Cross\-cell fit of[Equation4](https://arxiv.org/html/2606.28013#S5.E4)on six \(model, dataset\) cells under Lean\-Retry\. Rows are grouped by benchmark\.Two consequences for method comparison follow\. First, a new method’s*aggregate*ΔTC\\Delta TCis partially determined by its target’s Vanilla cell distribution, not only by the method’s algorithmic innovations: a method evaluated only on a type\-complex benchmark will look weaker than the same method on MiniF2F, by a benchmark\-modulation factor predicted fromVfailV\_\{\\rm fail\}alone\. Second, the type\-stratum recovery rate of23/6123/61is what current elab\-feedback methods deliver, and closing the gap to the elab\-only ceiling\(\|TS\|\+\|TO\|\)/n=83\.9%\(\|\\textsc\{TS\}\|\{\+\}\|\\textsc\{TO\}\|\)/n=83\.9\\%would require recovering hardTOcases that surviveK=3K\{=\}3refinement, a population on which no current method makes progress \(§[6](https://arxiv.org/html/2606.28013#S6)\)\. The within\-2/1862/186agreement is partly self\-consistency, since the three methods already converge in aggregate \([Table1](https://arxiv.org/html/2606.28013#S5.T1)\)\.
### 5\.5Opus vs\. GTED: the surface\-form rewriting gap
On the same 186 DeepSeek V4\-Pro×\\timesProofNet\#outputs, Claude Opus 4\.7 and GTEDτ=0\.5\\tau\{=\}0\.5give different SF% numbers \([Table5](https://arxiv.org/html/2606.28013#S5.T5)\)\. SAF’s deterministic IR translator isolates the cause\.
Table 5:Opus vs\. GTED SF% on DeepSeek V4\-Pro×\\timesProofNet\#outputs\.On Vanilla the two judges agree to within\+7\+7pp\. On the three elab\-feedback methods they disagree by\+26\+26to\+37\+37pp, with Opus reading SF as rising and GTED reading it as flat or falling\. The mechanism: GTED penalizes surface\-form normalizations \(alternativeopens,IsCompact univ↔\\leftrightarrowCompactSpace,Finn→ℝ\\mathrm\{Fin\}\\,n\\to\\mathbb\{R\}vs\.∀i:ℕ,Icc0 1\\forall i:\\mathbb\{N\},\\,\\mathrm\{Icc\}\\,0\\,1restatements\) that elab\-feedback adopts to satisfy the elaborator, while Opus recognizes these as semantically equivalent\.[Example2](https://arxiv.org/html/2606.28013#Thmtheorem2)is a representative instance onMunkres\|exercise\_29\_4: gold and candidate denote the sameΠ\\Pitype, but GTED returns similarity 0\.143 because the extracted operator trees place an arrow node where the other places a∀\\forallbinder\.
###### Example 2\(Surface\-form gap,n=1n\{=\}1\)\.
theoremexercise\_29\_4:
Not\(LocallyCompactSpace
\(Nat→\\toSet\.Icc\(0:Real\)1\)\)
theoremexercise\_29\_4:
Not\(LocallyCompactSpace
\(foralli:Nat,Set\.Icc\(0:Real\)1\)\)
The gap is largest for SAF \(\+37\+37pp\): with surface form fixed by the deterministic translator, divergence is the IR’s choice, not LLM sampling noise\. Additional cases \(TO→\\toTSrecoveries, gold bugs\) are catalogued in[AppendixF](https://arxiv.org/html/2606.28013#A6)\.
Table 6:GTEDτ\\taucalibration against Opus\.P=P\(Opus equiv∣GTED≥τ\)P=P\(\\text\{Opus equiv\}\\mid\\text\{GTED\}\\geq\\tau\),RRis recall on the Opus\-equiv elab\-pass set\. Recommended rowτ=0\.5\\tau\{=\}0\.5in blue underline\. Per\-method ProofNet\#precision in[Table10](https://arxiv.org/html/2606.28013#A3.T10)\.[Table6](https://arxiv.org/html/2606.28013#S5.T6)confirms thatτ=0\.5\\tau\{=\}0\.5maintains 95\.0% pooled precision on ProofNet\#and≥93%\\geq 93\\%on every method individually \(per\-method breakdown in[Table10](https://arxiv.org/html/2606.28013#A3.T10)\)\. The threshold is stable across distributions, so the gap is a structural property of elab\-feedback’s surface rewriting rather than of judge calibration\.
The Opus\-equiv but GTED\-non\-equiv set \(n=38n\{=\}38on Vanilla,n∈\[62,76\]n\\in\[62,76\]on the three elab\-feedback methods\) is what drives the gap\. We classify each case by the type of rewrite that distinguishes the candidate from gold \(rules and per\-method audit in[AppendixD](https://arxiv.org/html/2606.28013#A4)\):
- •Notational \(N\)\.Surface rewrites that preserve definitional shape:A\-\>Bvs\.forall \_ : A, B,Real^nvs\.Fin n \-\> Real, coercion \(ZMod 2for the unique field of order 2\), alternativeopen/namespace prefixes, or cosmetic renaming\.
- •Idiomatic \(I\)\.Definitional aliases or Mathlib formulation swaps:IsConjvs\.exists g, b = g\*a\*g\-1,Summablevs\.Tendstoof partial sums,CompactSpacevs\.IsCompact Set\.univ, orLinearEquiv\-via\-Nonemptytypeclass framing\.
- •Structural \(S\)\.Binder, quantifier, or typeclass restructuring that changes the post\-elaboration term:letvs\. explicit hypothesis, instance binder vs\. hypothesis binder,\[Field\]vs\.\(h : Field\), quantifier moved inside or outside an existential,Iffdirection swap, biconditional introduced or collapsed, orsubtypevs\. predicate\-restricted set\.
- •Residual \(Z\)\.Cases that mix multiple categories or resist single\-category classification\.
Table 7:Taxonomy of Opus\-vs\-GTED disagreements on DeepSeek V4\-Pro×\\timesProofNet\#\. Counts: problems with Opus equiv∧\\wedgeGTED<0\.5<0\.5∧\\wedgeelab\-pass\. Percentages: share of each method’s false\-negative pool\.[Table7](https://arxiv.org/html/2606.28013#S5.T7)reads as follows\. On Vanilla, 38 of 95 elab\-pass outputs \(40%40\\%\) are surface\-form rewrites Opus accepts but GTED does not, distributed across all four categories\. Under each elab\-feedback method the false\-negative count grows by 24 to 38 problems\. For Lean\-Retry and Sample\-Filter the growth concentrates in category S \(16 of 24 and 21 of 24 added cases, respectively\): the elaborator pressures the LLM to restructure binders, typeclass formulations, and quantifier orderings until the output type\-checks\. SAF’s growth is more diffuse: only 9 added in S \(the smallest, because SAF’s deterministicir\_to\_leantranslator does not freely restructure binders\) but\+20\+20in Z, reflecting the multi\-category rewrites the IR\-mediated path tends to produce\. The disagreement gap is therefore not random surface variance but a measurement of elaborator\-driven structural rewriting, which is precisely the mechanism that recoversTOintoTS\(§[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\)\.
### 5\.6Cross\-dataset:SOrate stable,TOrate scales with type\-complexity
DeepSeek V4\-Pro×\\timesMiniF2F\-test×\\timesVanilla \(244 problems, Opus\-judged\) decomposes asTS=163,SO=24,TO=51,BF=6\. TheSOrate is 9\.8%, essentially unchanged from 8\.6% on type\-complex ProofNet\#\. TheTOrate, by contrast, drops from 32\.8% to 20\.9%\. Dataset type\-complexity loads the type stratum and not the semantic stratum, providing supporting evidence at Vanilla\-only resolution for the mechanism of §[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\. Full MiniF2F dual\-judge analysis across the four methods is future work\.
### 5\.7Cross\-model: TC gain robust to subject model
[Figure5](https://arxiv.org/html/2606.28013#S5.F5)reports TC% for three subject models on both datasets\. Lean\-Retry≥\\geqVanilla in 6/6 cells \(meanΔ\\DeltaTC==\+10\.1\+10\.1pp\)\. SAF≥\\geqVanilla in 5/6, the one exception \(Qwen×\\timesProofNet\#,−12\.9\-12\.9pp\) being a high IR\-extraction failure rate on Qwen3\.5\-Plus, reported without ad\-hoc patching\. Under GTED\-only judging,Δ\\DeltaTC\>\>Δ\\DeltaSF holds in 6/6 cells for Lean\-Retry \(mean\+10\.1\+10\.1vs\.\+1\.3\+1\.3pp,[Table11](https://arxiv.org/html/2606.28013#A5.T11)\)\.
Figure 5:Cross\-model TC% on 3 models×\\times2 datasets×\\times3 methods\.
### 5\.8K\-saturation: refinement saturates fast and three methods converge
[Figure6](https://arxiv.org/html/2606.28013#S5.F6)plots SAF’s K\-saturation on DeepSeek V4\-Pro×\\timesProofNet\#\. TC% rises 50\.5→\\to72\.0→\\to74\.7→\\to75\.3% acrossK=0,1,2,3K\{=\}0,1,2,3\. Refinement saturates atK=1K\{=\}1\(\+21\.5\+21\.5pp\), withK=2,3K\{=\}2,3contributing\+3\.3\+3\.3pp combined\. All three methods land within 1\.0 pp at saturation \(SAF 75\.3, Lean\-Retry 75\.8, Sample\-Filter 76\.3\), so different topologies reach the same ceiling\. Per\-iteration faithfulness remains high:k∗=0k^\{\*\}\{=\}0: 89/94 \(94\.7%\),k∗=1k^\{\*\}\{=\}1: 34/40 \(85\.0%\),k∗=2k^\{\*\}\{=\}2: 5/5 \(100%\), andk∗=3k^\{\*\}\{=\}3: 1/1 \(100%\)\. Of the 46 problems that never pass, 39 \(84\.8%\) are Opus equiv\-if\-fixed \(i\.e\.,TO\)\. The 16 durablyTOacross all four methods share a common failure mode: plausible but invalid Mathlib API names that elab errors cannot prescribe \([AppendixF](https://arxiv.org/html/2606.28013#A6), Case E\)\.
Figure 6:SAF K\-saturation on DeepSeek V4\-Pro×\\timesProofNet\#\. Dashed bands: Lean\-Retry and Sample\-Filter saturation levels\.
## 6Discussion and Limitations
The \(TC≠\\neqSF\) gap itself is not new, but the per\-stratum quantification is\. On type\-complex benchmarks TC% alone is misleading: the recommended reporting unit is TC%, dual\-judge SF%, and the per\-cell decomposition \(TS/SO/TO/BF\)\. Two elab\-feedback methods compared under a single symbolic judge can differ by≥25\\geq 25pp purely from structural rewriting, so independent\-LLM verification is required for any SF% claim\.
#### Why three methods converge\.
The collapse to within∼\\sim1 pp TC% at saturation \(§[5\.8](https://arxiv.org/html/2606.28013#S5.SS8)\), the identical 37\.7%TO→\\toTSrate \(§[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\), and the shared structural\-rewriting signature in the Opus/GTED gap \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\) are three views of the same claim: atK≤3K\{\\leq\}3the elab signal is largely exhausted by the fixed∼\\sim4\-call budget, so all three refinement topologies converge to the same family of type\-checking surface forms\. The remaining axis of progress is the signal itself, not the topology on top of it\.
Per\-output semantic faithfulness is unreliable under elab\-feedback because of turnover \(87\.5% rescue offset by 8 newly created errors per method\)\. Future signals must correlate with semantic correctness rather than only distinguish elab\-pass from elab\-fail\.
SAF’s deterministic IR translation serves as a causal\-attribution probe for the surface\-form rewriting mechanism \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\) and the per\-stratum predictive regularity \(§[5\.4](https://arxiv.org/html/2606.28013#S5.SS4)\)\. Two ProofNet\#gold formalizations contain gold\-formalization errors \(Ireland\-Rosen\|exercise\_1\_30,Dummit\-Foote\|exercise\_7\_2\_2\) and are documented with suggested corrections in the supplementary note\. Benchmarks aiming to measureSObelow 1% will need gold audits, not larger LLMs alone\.
Our semantic labels rely on Claude Opus 4\.7 as a strong cross\-judge rather than expert Lean validators \(≥94%\\geq 94\\%pooled precision in §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\. A small expert audit and a canonicalization\-aware symbolic metric \(e\.g\. operator\-tree matching afterwhnfreduction\) would absorb most of the N/I taxonomy categories without an LLM judge\. Both are natural follow\-ups\.
## 7Conclusion
We decompose TC% gains via a2×22\{\\times\}2signal\-coverage matrix and a4×44\{\\times\}4per\-problem transition matrix\. On DeepSeek V4\-Pro×\\timesProofNet\#, three elab\-feedback methods deliver \+34 to \+36TSvia∼\\sim64% type\-stratum recovery \([Equation2](https://arxiv.org/html/2606.28013#S5.E2)\), withSOnet\-flat but 87\.5% per\-problem turnover\. TheTO→\\toTSrate is23/6123/61for each method \(Wilson 95% CI \[26\.6%, 50\.3%\]\), cross\-method calibration predictsΔTS\\Delta\\textsc\{TS\}to within2/1862/186, andΔTC\\Delta TC\{\}fits a linear regression in the Vanilla elab\-fail rate atR2=0\.96R^\{2\}\{=\}0\.96across six \(model, dataset\) cells\. A dual\-judge protocol \(Opus and GTED at 95% pooled precision\) shows that single\-judge SF% under\-credits elab\-feedback methods by≥25\\geq 25pp via elaborator\-forced structural rewrites\.
## References
- Anthropic \(2026\)Anthropic\.The Claude 4 model family\.[https://www\.anthropic\.com/claude](https://www.anthropic.com/claude), 2026\.Model card\.
- Azerbayev et al\. \(2023\)Azerbayev, Z\., Piotrowski, B\., Schoelkopf, H\., Ayers, E\. W\., Radev, D\., and Avigad, J\.Proofnet: Autoformalizing and formally proving undergraduate\-level mathematics\.*arXiv preprint arXiv:2302\.12433*, 2023\.
- Azerbayev et al\. \(2024\)Azerbayev, Z\., Schoelkopf, H\., Paster, K\., Dos Santos, M\., McAleer, S\., Jiang, A\. Q\., Deng, J\., Biderman, S\., and Welleck, S\.Llemma: An open language model for mathematics\.In*International Conference on Learning Representations*, 2024\.
- DeepSeek\-AI \(2026\)DeepSeek\-AI\.DeepSeek\-V4 technical report\.[https://huggingface\.co/deepseek\-ai](https://huggingface.co/deepseek-ai), 2026\.Technical report\.
- Guo et al\. \(2025\)Guo, Q\., Wang, J\., Zhang, J\., Kong, D\., Huang, X\., Xi, X\., Wang, W\., Wang, J\., Cai, X\., Zhang, S\., and Ye, W\.Autoformalizer with tool feedback\.*arXiv preprint arXiv:2510\.06857*, 2025\.
- Jiang et al\. \(2023\)Jiang, A\. Q\., Welleck, S\., Zhou, J\. P\., Li, W\., Liu, J\., Jamnik, M\., Lacroix, T\., Wu, Y\., and Lample, G\.Draft, sketch, and prove: Guiding formal theorem provers with informal proofs\.In*International Conference on Learning Representations*, 2023\.
- Lin et al\. \(2025\)Lin, H\., Sun, Z\., Welleck, S\., and Yang, Y\.Lean\-STaR: Learning to interleave thinking and proving\.In*International Conference on Learning Representations*, 2025\.
- Liu et al\. \(2025a\)Liu, X\., Bao, K\., Zhang, J\., Liu, Y\., Chen, Y\., Liu, Y\., Jiao, Y\., and Luo, T\.ATLAS: Autoformalizing theorems through lifting, augmentation, and synthesis of data\.*arXiv preprint arXiv:2502\.05567*, 2025a\.
- Liu et al\. \(2025b\)Liu, Y\., Zhu, T\., Liu, X\., Chen, Y\., Liu, Z\., Guo, Q\., Zhang, J\., Bao, K\., and Luo, T\.Generalized tree edit distance \(GTED\): A faithful evaluation metric for statement autoformalization\.In*ICML 2025 Workshop on AI for Math \(AI4Math\)*, 2025b\.arXiv:2507\.07399\.
- Lu et al\. \(2024\)Lu, J\., Wan, Y\., Liu, Z\., Huang, Y\., Xiong, J\., Liu, C\., Shen, J\., Jin, H\., Zhang, J\., Wang, H\., Yang, Z\., Tang, J\., and Guo, Z\.Process\-driven autoformalization in lean 4\.*arXiv preprint arXiv:2406\.01940*, 2024\.
- OpenAI \(2024\)OpenAI\.Hello GPT\-4o\.[https://openai\.com/index/hello\-gpt\-4o/](https://openai.com/index/hello-gpt-4o/), 2024\.
- Poiroux et al\. \(2024\)Poiroux, A\., Weiss, G\., Kunčak, V\., and Bosselut, A\.Improving autoformalization using type checking\.*arXiv preprint arXiv:2406\.07222*, 2024\.
- Polu & Sutskever \(2020\)Polu, S\. and Sutskever, I\.Generative language modeling for automated theorem proving\.*arXiv preprint arXiv:2009\.03393*, 2020\.
- Qwen Team \(2026\)Qwen Team\.Qwen3\.5 technical report\.[https://qwenlm\.github\.io](https://qwenlm.github.io/), 2026\.Technical report\.
- Wu et al\. \(2022\)Wu, Y\., Jiang, A\. Q\., Li, W\., Rabe, M\., Staats, C\., Jamnik, M\., and Szegedy, C\.Autoformalization with large language models\.*Advances in Neural Information Processing Systems*, 35:32353–32368, 2022\.
- Xiaomi MiMo Team \(2026\)Xiaomi MiMo Team\.MiMo\-V2\.5\-Pro: Reasoning model technical report\.[https://github\.com/XiaomiMiMo](https://github.com/XiaomiMiMo), 2026\.Technical report\.
- Yang et al\. \(2023\)Yang, K\., Swope, A\., Gu, A\., Chalamala, R\., Song, P\., Yu, S\., Godil, S\., Prenger, R\., and Anandkumar, A\.LeanDojo: Theorem proving with retrieval\-augmented language models\.In*Advances in Neural Information Processing Systems*, 2023\.
- Zheng et al\. \(2022\)Zheng, K\., Han, J\. M\., and Polu, S\.MiniF2F: a cross\-system benchmark for formal olympiad\-level mathematics\.In*International Conference on Learning Representations*, 2022\.doi:10\.48550/arXiv\.2109\.00110\.URL[https://arxiv\.org/abs/2109\.00110](https://arxiv.org/abs/2109.00110)\.
- Zhou et al\. \(2024\)Zhou, J\. P\., Staats, C\., Li, W\., Szegedy, C\., Weinberger, K\. Q\., and Wu, Y\.Don’t trust: Verify – grounding llm quantitative reasoning with autoformalization\.In*International Conference on Learning Representations*, 2024\.
## Appendix AWilson 95% CI on per\-stratum recovery rates
[Table8](https://arxiv.org/html/2606.28013#A1.T8)gives Wilson 95% CIs for every entry of[Table2](https://arxiv.org/html/2606.28013#S5.T2)\. Pairwise Fisher exact tests for homogeneity of theTO→\\toTSrate returnp=1\.0p\{=\}1\.0for all three pairs \(the observed counts are identical,23/6123/61for each method\)\. The semantic andBFrecovery rates have overlapping CIs across methods, consistent with a shared underlying rate plus small\-sample noise\. The wideBFCI \(±∼20\\pm\{\\sim\}20pp\) reflectsn=14n\{=\}14rather than instability of the recovery mechanism\.
Table 8:Wilson 95% CI for each entry of[Table2](https://arxiv.org/html/2606.28013#S5.T2)\.
## Appendix BBootstrap 95% CI on stratum\-shiftΔ\\Deltas
Resampling is at the problem level rather than the cell level because the four cell counts on a given problem are coupled \(each problem contributes to exactly one cell per method\), and intervals are the percentile method onB=10,000B\{=\}10\{,\}000resamples \(seed 20260512\) on DeepSeek V4\-Pro×\\timesProofNet\#\(n=186n\{=\}186\)\. The interval containment pattern is uniform across the three elab\-feedback methods:Δ\\DeltaTSandΔ\\DeltaTOare bounded away from 0 for all three,Δ\\DeltaSOCIs all straddle 0, andΔ\\DeltaBFreaches significance only on Lean\-Retry, which gives the precise numerical backing for the qualitative claim of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\.
Table 9:Bootstrap 95% CI onΔ\\Deltavs\. Vanilla \(B=10,000B\{=\}10\{,\}000\)\.Δ\\DeltaBFsignificant only for Lean\-Retry\.
## Appendix CGTEDτ\\taucalibration: per\-method precision
[Table6](https://arxiv.org/html/2606.28013#S5.T6)of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)reports the 4\-method pooled precision atτ=0\.5\\tau\{=\}0\.5\(95\.0%95\.0\\%on ProofNet\#\)\. The breakdown below shows that no single method drags the pool: per\-method precision stays at≥93\.4%\\geq 93\.4\\%atτ=0\.5\\tau\{=\}0\.5and rises monotonically withτ\\tau, so the chosen threshold is robust to which method’s elab\-pass outputs are pooled into the calibration set\.
Table 10:GTED precision at three thresholds, per\-method on DeepSeek V4\-Pro×\\timesProofNet\#elab\-pass subset\.
## Appendix DDisagreement taxonomy: rules and audit
[Table7](https://arxiv.org/html/2606.28013#S5.T7)of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)reports a four\-class classification of every Opus\-equiv but GTED<0\.5<0\.5output on DeepSeek V4\-Pro×\\timesProofNet\#, across four methods\. Classes are defined as follows:
N \(Notational\)\.The rewrite preserves the operator\-tree shape modulo Lean’s surface notation:A \-\> B↔\\leftrightarrowforall \_ : A, B,Real^n↔\\leftrightarrowFin n \-\> Real, alternativeopen/namespace prefixes, coercion \(ZMod 2as representative for “any field of order 2”\),Function\.Injective↔\\leftrightarrowInjective, and cosmetic variable renaming\. GTED penalises these because its operator tree is extracted pre\-reduction\.
I \(Idiomatic\)\.The rewrite swaps one Mathlib idiom for a definitionally equivalent or near\-equivalent one:IsConj↔\\leftrightarrowexists g, b = g\*a\*g\-1,Summable↔\\leftrightarrowTendstoof partial sums,CompactSpace↔\\leftrightarrowIsCompact Set\.univ,LocallyCompactSpace↔\\leftrightarrowNonempty\-wrappedLinearEquiv,generateFrom = sInf↔\\leftrightarrowgenerateFrom \(sInter \.\.\.\), etc\.
S \(Structural\)\.The rewrite changes term structure: binder ordering, hypothesis hoisting \(let\-binding vs\. explicit hypothesis vs\. premise\), typeclass formulation \(\[Field K\]vs\.\(h : Field K\),\[Fact p\.Prime\]vs\.Nat\.Primes,\[GCDMonoid R\]vs\. explicit gcd predicate\), quantifier reordering across an existential or universal,Iffdirection swap or biconditional collapse, implication\-vs\-biconditional substitution, subtype\-vs\-predicate\-restricted\-set, and hypothesis omission of vacuously\-true premises \(e\.g\.0 < nwhenn=0n\{=\}0admits no counterexample\)\.
Z \(Residual\)\.Cases mixing multiple categories \(e\.g\. both a notational and a structural rewrite\) or specific to NL ambiguity that do not cleanly map to one class\.
Classification rules are encoded as a regex cascade evaluated in the order I→\\toN→\\toS→\\toZ, and the first match wins\. Rules are tuned on the Opus rationale text, then frozen before the per\-method counts were tabulated\. The full classifier ships with the supplementary scriptanalysis/disagreement\_taxonomy\.py, which also dumps the per\-case rationale string and assigned label for every entry in[Table7](https://arxiv.org/html/2606.28013#S5.T7)\. Limitations of the classifier: \(i\) the Z bucket absorbs cases that mix two structural levels or that the rationale describes ambiguously, and \(ii\) borderline I vs\. N decisions \(e\.g\.deriv\[n\]vs\.iteratedDeriv n\) are made by the first matching rule rather than by adjudication\. The dominant\-class claim of §[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\(that S grows under elab\-feedback\) is robust to these noise sources because S’s share rises by 12 to 20 pp between Vanilla and the elab\-feedback methods, well outside any plausible reassignment of the Z residual\.
## Appendix ECross\-method TC/SF dissociation under GTED \(6 \(model, dataset\) combos\)
This table reports the GTED\-judgedΔ\\Deltavs\. Vanilla on the five \(model, dataset\) cells outside the Opus\-judged DeepSeek V4\-Pro×\\timesProofNet\#baseline of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\. The dissociation pattern reappears:Δ\\DeltaTC% is positive in 11 of 12 \(method, model, dataset\) rows \(up to\+16\.1\+16\.1pp\), whileΔ\\DeltaSF% stays within±5\\pm 5pp in 11 of 12 rows and is non\-positive in 6 of 12: elab\-feedback moves the elab axis without moving the semantic axis on every cell tested\.
Table 11:Δ\\Deltavs\. Vanilla under GTEDτ=0\.5\\tau\{=\}0\.5across the six \(model, dataset\) cells\. Left: ProofNet\#\. Right: MiniF2F\.\(a\)ProofNet\#\(type\-complex,n=186n\{=\}186\)
\(b\)MiniF2F \(type\-simple,n=244n\{=\}244\)
## Appendix FCase studies
Each case below shows the natural\-language statement, the gold formalization, and one or more candidate outputs\. Gold and candidates are reproduced in Lean 4 syntax with implicit elaboration arguments preserved\. Outcome tags use the cell labels of §[5\.1](https://arxiv.org/html/2606.28013#S5.SS1)\. The five cases together cover the four mechanisms identified in the body: Case A is aTO→\\toTSrecovery \(§[5\.2](https://arxiv.org/html/2606.28013#S5.SS2)\), Case B a canonical GTED false negative driven by surface\-form rewriting \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\), Cases C and D the two distinct gold\-formalization bug shapes that pin the durableSOfloor \(§[5\.3](https://arxiv.org/html/2606.28013#S5.SS3)\), and Case E a representativeTOthat survivesK=3K\{=\}3refinement across all four methods \(§[5\.8](https://arxiv.org/html/2606.28013#S5.SS8)\)\.
#### Case A:TYPE\_ONLY→\\toTRUE\_SUCCESSunder Lean\-Retry \(Dummit\-Foote\|exercise\_11\_1\_13\)\.
NL\.“As vector spaces overℚ\\mathbb\{Q\},ℝn≅ℝ\\mathbb\{R\}^\{n\}\\cong\\mathbb\{R\}, for alln∈ℤ\+n\\in\\mathbb\{Z\}^\{\+\}\.”
theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):
Nonempty\(LinearEquiv\(RingHom\.idRat\)
\(Finn→\\toReal\)Real\)
theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):
Nonempty\(LinearEquiv\(RingHom\.idRat\)
\(Real^n\)Real\)
theoremexercise\_11\_1\_13\{n:Nat\}\(hn:0<n\):
Nonempty\(LinearEquiv\(RingHom\.idRat\)
\(Finn→\\toReal\)Real\)
Opus judges equiv to gold under both STRICT and EQUIV\-IF\-FIXED prompts\. One of 23TO→\\toTSrecoveries\.
#### Case B: GTED false negative on a surface\-form rewrite \(Munkres\|exercise\_29\_4\)\.
NL\.“Show that\[0,1\]ω\[0,1\]^\{\\omega\}is not locally compact in the uniform topology\.”
theoremexercise\_29\_4:
Not\(LocallyCompactSpace\(Nat→\\toSet\.Icc\(0:Real\)1\)\)
theoremexercise\_29\_4:
Not\(LocallyCompactSpace
\(foralli:Nat,Set\.Icc\(0:Real\)1\)\)
The two function types are the sameΠ\\Pitype with different surface notations \(A \-\> Bis itself notation forforall \_ : A, B\)\. The elaborator treats them identically\. The typed operator tree GTED extracts from each is structurally divergent \(arrow node vs\.∀\\forallbinder\), giving the 0\.143 similarity\. Canonical shape of the\+26\+26to\+37\+37pp Opus\-vs\-GTED gap on elab\-feedback outputs \(§[5\.5](https://arxiv.org/html/2606.28013#S5.SS5)\)\.
#### Case C: DurableSEM\_ONLYfrom a buggy gold formalization \(Ireland\-Rosen\|exercise\_1\_30\)\.
NL\.“Prove that1/2\+1/3\+⋯\+1/n1/2\+1/3\+\\cdots\+1/nis not an integer\.”
theoremexercise\_1\_30\{n:Nat\}:
Not\(Existsfuna:Int⇒\\Rightarrow
\(Finset\.sum\(Finset\.univ:Finset\(Finn\)\)
fun\_⇒\\Rightarrow\(1:Rat\)/\(n\+2\)\)=a\)
theoremexercise\_1\_30\{n:Nat\}\(hn:2≤\\leqn\):
Not\(Existsfuna:Int⇒\\Rightarrow
\(Finset\.sum\(Finset\.Icc2n\)
funk⇒\\Rightarrow\(1:Rat\)/k\)=a\)
The gold theorem and the natural\-language statement are different propositions: the gold sum is a rationaln/\(n\+2\)n/\(n\+2\)rather than the harmonic tail\. The only problem durablySOacross all four methods on DeepSeek V4\-Pro×\\timesProofNet\#\.
#### Case D: DurableSEM\_ONLYfrom a vacuous gold biconditional \(Dummit\-Foote\|exercise\_7\_2\_2\)\.
NL\.“Letp\(x\)=anxn\+⋯\+a0∈R\[x\]p\(x\)=a\_\{n\}x^\{n\}\+\\cdots\+a\_\{0\}\\in R\[x\]\. Prove thatp\(x\)p\(x\)is a zero divisor inR\[x\]R\[x\]iff there is a nonzerob∈Rb\\in Rwithb⋅p\(x\)=0b\\cdot p\(x\)=0\.”
theoremexercise\_7\_2\_2\{R:Type\*\}\[RingR\]
\(p:PolynomialR\):
Iff\(Dvd\.dvdp0\)
\(Existsfunb:R⇒\\Rightarrowb≠\\neq0/\\b\*p=0\)
theoremexercise\_7\_2\_2\{R:Type\*\}\[CommRingR\]
\(p:PolynomialR\):
Iff\(Existsfunq:PolynomialR⇒\\Rightarrowq≠\\neq0/\\p\*q=0\)
\(Existsfunb:R⇒\\Rightarrowb≠\\neq0/\\b\*p=0\)
Gold’s LHS is vacuous over any commutative ring, collapsing the biconditional to its right\-hand side\. DurablySOunder Lean\-Retry and Sample\-Filter\. Under SAF the IR refinement loop fails to elaborate at all \(BF\)\. Both gold\-formalization bugs are documented with Opus reasoning traces in the supplementary note accompanying this submission\.
#### Case E: DurableTOacross all four methods \(Dummit\-Foote\|exercise\_3\_4\_5b\)\.
NL\.“Prove that quotient groups of a solvable group are solvable\.”
theoremexercise\_3\_4\_5b\{G:Type\*\}\[GroupG\]\[IsSolvableG\]
\(H:SubgroupG\)\[NormalH\]:
IsSolvable\(G/H\)
theoremsolvable\_quotient\{G:Type\*\}\[GroupG\]\[SolvableG\]
\(H:SubgroupG\)\[H\.Normal\]:
Solvable\(G/H\)
theoremsolvable\_quotient\{G:Type\*\}\[GroupG\]\[SolvableG\]
\(H:SubgroupG\)\[h:H\.Normal\]:
Solvable\(G/H\)
Opus judges both candidates equiv\-if\-fixed: the mathematical content matches gold, only the Mathlib API name is wrong\. SAF and Sample\-Filter likewise fail to elaborate atK=3K\{=\}3on this problem\. One of 16 problems that remainTOacross all four methods on DeepSeek V4\-Pro×\\timesProofNet\#: a canonical instance of the elab signal naming a symptom without surfacing the cure\.Similar Articles
Evaluating the Robustness of Proof Autoformalization in Lean 4
This paper evaluates the robustness of proof autoformalization models in Lean 4 under global and local perturbations, finding that current LLM-based models are sensitive to perturbations and often fail to faithfully reflect local changes.
Beyond the Library: An Agentic Framework for Autoformalizing Research Mathematics
Presents an agentic framework using general coding LLMs to autoformalize research-level mathematics into Lean 4 code, evaluated on Putnam problems and STOC conference papers.
Theory-Scale Auto-Formalization of Logics for Computer Science
Introduces LCS-Bench, a theory-scale benchmark for auto-formalization of logics for computer science, covering 327 textbook items over 4,076 Lean declarations. Evaluation on 14 models shows the benchmark is challenging, with SOTA models achieving only 20.1% on auto-formalization tasks.
Formal Methods Meet LLMs: Auditing, Monitoring, and Intervention for Compliance of Advanced AI Systems
This paper proposes techniques that combine formal methods (Linear Temporal Logic) with LLMs for auditing, monitoring, and intervening in AI systems to ensure compliance with behavioral constraints, showing that even small-model labelers can match frontier LLM judges in detecting violations.
From Signal Degradation to Computation Collapse: Uncovering the Two Failure Modes of LLM Quantization
Researchers identify two distinct failure modes in aggressive LLM quantization—Signal Degradation and Computation Collapse—and show that training-free fixes only remedy the former, indicating structural reconstruction is needed for ultra-low-bit models.