FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Summary
FormInv proposes a measurement protocol for evaluating semantic invariance in mathematical reasoning benchmarks, revealing that model rankings reverse across paraphrase families and that standard accuracy metrics conceal large gaps in semantic consistency.
View Cached Full Text
Cached at: 05/29/26, 09:14 AM
# A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks
Source: [https://arxiv.org/html/2605.29001](https://arxiv.org/html/2605.29001)
\\icml@noticeprintedtrue
Nishal Thomas111Equal contribution\.Independent Researchernishal\.thomas44@gmail\.comNoel Thomas111Equal contribution\.Mohamed Bin Zayed University of Artificial Intelligencenoel\.thomas@mbzuai\.ac\.ae
###### Abstract
A paraphrase\-quality audit of MathCheck\(Zhouet al\.,[2024b](https://arxiv.org/html/2605.29001#bib.bib32)\)\(ICLR 2025\) detected 4 semantically\-incorrect paraphrases in 129 groups \(3\.1%\); removing them drops GPT\-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single\-model evaluation\. Cross\-model unanimity found these errors automatically \(≥3/4\\geq 3/4models for MathCheck;≥6/9\\geq 6/9for our primary evaluation\) for under $10; in our own dataset the same protocol found that 47% of auto\-generated connective\-variation paraphrases were semantically incorrect\. That flaw compounds a deeper measurement gap: Claude Haiku 4\.5 achieves 86% accuracy yet SCR = 50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86–96% yetSemantic Consistency Rates\(SCR\) span 50–82%—a 32\-point gap invisible to standard benchmarks\. Formally, for any target ranking over 9 frontier models there exists a weightingλ∈Δ7\\lambda\\in\\Delta^\{7\}over paraphrase families such thatλ⊤SCR\\lambda^\{\\top\}\\mathrm\{SCR\}realizes it \(No\-Free\-Benchmark corollary\), because no model Pareto\-dominates all families—so benchmark designers who select families are implicitly choosing which model wins\.FormInvsupplies the audit protocol \(replicated on external benchmarks at 100% recall\), SCR and per\-theorem Cochran’s Q as primary invariance measures evaluated on 9 models across 366–811 items \(on Lean4\-verified theorems\), andFormInvSelectorfor regime\-aware model selection\.
## 1Introduction
Consider two questions about the same theorem: “Is it true thatx≥0\\sqrt\{x\}\\geq 0for every real numberxx?” and “For any realxx, is0≤x0\\leq\\sqrt\{x\}?” These statements are logically equivalent \(under Lean4 real analysis conventions\)\. Yet Claude Sonnet answers the first correctly and the second incorrectly across 16\.7% of comparison\-order paraphrases \(F6\)\. GPT\-4o fails this class of transformation at 0\.0%\. Reverse the family: ask both models to recognize a definition by expansion \(F7\), and the ranking reverses: GPT\-4o fails at 10\.0% while Claude fails at 6–8%\. The same pair of frontier models, the same theorems, opposite weaknesses\.Model rankings reverse across paraphrase families\.This is the central finding of FormInv\.
Why does this matter? When a practitioner asks “Is model X good at mathematical reasoning?” and answers using benchmark accuracy alone, they implicitly assume the answer is independent of how the question is phrased\. That assumption is wrong\. Across 9 frontier models, we find that accuracy tells only half the story: Claude Haiku 4\.5 achieves 86% accuracy yet has SCR = 50%: half its theorems are answered inconsistently across semantically equivalent restatements\. DeepSeek V3, at 96\.4% accuracy, achieves 82% SCR\. The 10\-point accuracy gap conceals a 32\-point SCR gap, the critical property that separates surface\-pattern recognition from genuine mathematical understanding\.
This failure \(formulation sensitivity despite conceptual equivalence\) is what psychometrics calls*Differential Item Functioning*\(Holland and Thayer,[1988](https://arxiv.org/html/2605.29001#bib.bib4)\), formalized via Doob’sL2L^\{2\}conditional expectation theorem \(1953\)\(Doob,[1953](https://arxiv.org/html/2605.29001#bib.bib1)\), and operationalized by measurement invariance theory\(Vandenberg and Lance,[2000](https://arxiv.org/html/2605.29001#bib.bib2)\)\. To our knowledge,FormInvis the first protocol applying this measurement tradition to formal mathematical reasoning with Lean4\-verified theorems and DIF\-grounded invariance metrics\.
The gap is substantial\. Table[1](https://arxiv.org/html/2605.29001#S1.T1)surveys eight benchmarks and how \(if at all\) they verify paraphrase semantic equivalence: none applies logical\-equivalence checking or cross\-model unanimity\.
Table 1:Paraphrase quality in published benchmarks\.†None applies logical\-equivalence verification or cross\-model unanimity\.†Citations: GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.29001#bib.bib24)\), MATH\(Hendryckset al\.,[2021b](https://arxiv.org/html/2605.29001#bib.bib23)\), MathBench\(Liuet al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib25)\), MathCheck\(Zhouet al\.,[2024b](https://arxiv.org/html/2605.29001#bib.bib32)\), GSM\-Plus\(Pasteret al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib37)\), PutnamGAP\(Haoet al\.,[2025](https://arxiv.org/html/2605.29001#bib.bib30)\), Zhou et al\.\(Zhouet al\.,[2024a](https://arxiv.org/html/2605.29001#bib.bib20)\), PromptBench\(Zhuet al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib27)\)\.Applied to our own dataset, FormInv’s cross\-model unanimity found that 47% of the 15 audited F5 paraphrases were semantically incorrect: errors invisible to single\-model evaluation but exposed by unanimity\. The same generation regime \(LLM \+ “maintain the answer” instruction \+ human NL fluency check\) that produced these errors in FormInv v1 is used verbatim in MathCheck and GSM\-Plus \(Table[1](https://arxiv.org/html/2605.29001#S1.T1)\)\. It costs approximately $1 per model per evaluation run and applies to any formally\-specified benchmark\.
#### Contributions\.
1. 1\.Invariance framework\.SCR and per\-theorem Cochran’s Q formalize semantic invariance; IG =p\(1−p\)\\sqrt\{p\(1\-p\)\}is a supplementary statistic \(Remark[1](https://arxiv.org/html/2605.29001#Thmproposition1)\)\. We prove Propositions 1–2 \(error bound; ranking reversal condition\) and classify 8 paraphrase families into T1 \(formally certifiable\), T2 \(conditionally valid\), and T3 \(heuristic\) tiers \(Table[2](https://arxiv.org/html/2605.29001#S3.T2)\)\.
2. 2\.FormInv benchmark\.760 items spanning 103 Lean4\-verified Mathlib4 theorems across 8 paraphrase families; 9\-model evaluation; 811 items from 100 harder ntp\-mathlib theorems\. Paraphrase equivalence established by CAS \(T1\), templates \(T2\), and domain\-expert review \(T3\)\.
3. 3\.FormInvSelector\.An algorithm that uses the per\-family SCR profile to recommend the model with lowest expected failure rate\.forminv selector \-\-families unpack orderruns in 0\.1s\.
4. 4\.Paraphrase quality audit\.Cross\-model unanimity \(≥6/9\\geq 6/9models fail a paraphrase while passing the canonical\) automatically flags semantically\-incorrect paraphrases\. Applied to FormInv v1: found 11 errors in GPT\-4o\-generated items \(biconditional overreach, passive\-voice inversion, type\-context stripping\)\.
## 2Related Work
#### Mathematical reasoning benchmarks\.
MATH\(Hendryckset al\.,[2021b](https://arxiv.org/html/2605.29001#bib.bib23)\)and GSM8K\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.29001#bib.bib24)\)measure final\-answer accuracy\. ChaosBench\-Logic\(Thomas,[2026a](https://arxiv.org/html/2605.29001#bib.bib22)\)introduces family\-level evaluation with formal ontology\. MathBench\(Liuet al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib25)\)covers multiple mathematical domains\. FormalMATH\(Yuet al\.,[2025](https://arxiv.org/html/2605.29001#bib.bib15)\)benchmarks 5,560 Lean4\-verified problems; top models achieve only 16\.46%\. All evaluate accuracy on canonical formulations; none test formulation invariance\.
#### Regime\-dependent evaluation\.
The idea that rankings depend on a hidden evaluation variable has precedent\.Thomas \([2026b](https://arxiv.org/html/2605.29001#bib.bib36)\)show that Bayesian optimization algorithm rankings reverse sign under different budget\-to\-candidate\-pool ratios \(B/—A—\): Greedy ranks first atB=50B=50and last atB=100B=100on the same benchmark; 98% of BO papers never vary this as a controlled axis\. FormInv documents the same phenomenon in LLM evaluation: model rankings reverse across paraphrase families \(Proposition[3](https://arxiv.org/html/2605.29001#Thmproposition3)\) and across difficulty regimes \(§[7](https://arxiv.org/html/2605.29001#S7)\), with no benchmark currently controlling for either axis\. In formal theorem proving,Tayloret al\.\([2026](https://arxiv.org/html/2605.29001#bib.bib13)\)show that performance drops∼\\sim26% when switching between Mathlib\-standard and Tao’s\-Analysis\-I formulations of the same theorem: direct empirical evidence of the IG phenomenon in Lean4\.de Zarzàet al\.\([2026](https://arxiv.org/html/2605.29001#bib.bib16)\)provide cross\-domain validation: across eight semantic\-preserving transforms, model scale does not predict semantic robustness, a finding consistent with FormInv’s ranking\-reversal results\.
#### Benchmark label error auditing\.
Northcuttet al\.\([2021](https://arxiv.org/html/2605.29001#bib.bib39)\)found≥3\.3%\\geq 3\.3\\%label errors in 10 major benchmarks via model confidence disagreement;Guo and others \([2024](https://arxiv.org/html/2605.29001#bib.bib40)\)extend this with LLM ensembles \(6–21% errors in NLP benchmarks\)\.Yang and Wang \([2026](https://arxiv.org/html/2605.29001#bib.bib41)\)show that models with similar accuracy disagree on 16–66% of items \(*Benchmark Illusion*\) — the problem FormInv solves\.Gorbett and Jana \([2026](https://arxiv.org/html/2605.29001#bib.bib42)\)use cross\-model disagreement as a deployment\-time signal; FormInv applies it at construction time to detect logical\-scope errors invisible to embedding similarity\. The closest deployed analog is QIMMA\(Technology Innovation Institute,[2026](https://arxiv.org/html/2605.29001#bib.bib38)\), an Arabic LLM leaderboard that filters benchmark items using two\-LLM consensus — framed as a novel contribution in 2026, confirming that systematic cross\-model quality gating is not yet standard practice\.
#### Robustness and invariance in NLP\.
CheckList\(Ribeiroet al\.,[2020](https://arxiv.org/html/2605.29001#bib.bib26)\)introduces behavioral testing via invariance \(INV\) tests\. PromptBench\(Zhuet al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib27)\)tests robustness to adversarial prompt perturbations\. Prompt format sensitivity causes substantial ranking reversals across frontier models\(Sclaret al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib34); Romanouet al\.,[2026](https://arxiv.org/html/2605.29001#bib.bib35)\)\. These work on perturbed prompts; FormInv targets*verified semantic equivalence classes*with a formal invariance metric\.
#### Paraphrase sensitivity in mathematical reasoning\.
Zhouet al\.\([2024a](https://arxiv.org/html/2605.29001#bib.bib20)\)introduce the Variance of Variations \(VOV\) scalar metric \(NAACL 2024\): paraphrasing can shift a problem’s solve rate from 5% to 100%\. VOV is the closest prior metric to FormInv’s IG: it measures accuracy variance across paraphrase variants of the same problem\. FormInv differs in three ways: \(1\) formal equivalence ground truth from Lean4\-verified Mathlib4 theorems \(VOV has no equivalence verification\); \(2\) 8 linguistically\-motivated paraphrase families with distinct failure modes \(VOV uses a single variation type\); and \(3\) connection to generalizability theory and DIF \(VOV has no measurement\-theoretic grounding\)\.
#### CheckList extensions to math reasoning\.
Zhouet al\.\([2024b](https://arxiv.org/html/2605.29001#bib.bib32)\)\(MathCheck\(Zhouet al\.,[2024b](https://arxiv.org/html/2605.29001#bib.bib32)\)\) extend behavioral testing to math reasoning, organizing problems into task families with robustness variants and evaluating 26\+ LLMs\. FormInv builds beyond MathCheck in three ways: \(1\) ground truth established by Lean4\-verified canonical theorems \(the paraphrases themselves are verified by CAS \+ expert review, not by Lean4\), rather than approximate rewrites with implicit ground truth; \(2\) the formal Invariance Gap metric grounded in Doob’s conditional expectation and DIF theory, rather than accuracy\-change heuristics; \(3\) a formal theorem substrate \(Mathlib4\) rather than arithmetic word problems\.
#### Paraphrase\-based contamination detection\.
ConStat\(Dekonincket al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib28)\)and CoDeC\(Zawalskiet al\.,[2025](https://arxiv.org/html/2605.29001#bib.bib29)\)detect contamination via paraphrase performance drops\. PutnamGAP\(Haoet al\.,[2025](https://arxiv.org/html/2605.29001#bib.bib30)\)tests equivalent Putnam problem transformations\.Moore and Shah \([2025](https://arxiv.org/html/2605.29001#bib.bib31)\)\(Moore & Shah 2025\) measure robustness specifically in Lean4 formalization; FormInv generalizes this to arbitrary natural\-language mathematical reasoning with DIF\-grounded metrics and a formal invariance protocol\.
#### Measurement theory foundations\.
Generalizability theory\(Cronbachet al\.,[1972](https://arxiv.org/html/2605.29001#bib.bib7)\)decomposes score variance into facets; IG2is the paraphrase\-facet variance\. Parallel\-form reliability\(Lord and Novick,[1968](https://arxiv.org/html/2605.29001#bib.bib6)\)and measurement invariance\(Vandenberg and Lance,[2000](https://arxiv.org/html/2605.29001#bib.bib2)\)provide the classical context for SCR\. DIF\(Holland and Thayer,[1988](https://arxiv.org/html/2605.29001#bib.bib4)\)detects items that function differently across groups; FormInv adapts the intuition to paraphrase families for a fixed model\. We are the first to apply this framework to NL mathematical reasoning with DIF\-grounded metrics and a certified family taxonomy \(cf\. Moore & Shah\(Moore and Shah,[2025](https://arxiv.org/html/2605.29001#bib.bib31)\)who measure Lean4 formalization robustness, a different task\)\.
## 3The Invariance Gap
### 3\.1Mathematical Foundation
Letf:𝒳→\{0,1\}f:\\mathcal\{X\}\\to\\\{0,1\\\}be an LLM’s binary answer function on problem statements\. Let∼\\simdenote semantic equivalence between mathematical formulations:x1∼x2x\_\{1\}\\sim x\_\{2\}ifx1x\_\{1\}andx2x\_\{2\}are logically equivalent statements with the same ground truth\. The equivalence class of a statementxxis\[x\]∼=\{x′∈𝒳:x′∼x\}\[x\]\_\{\\sim\}=\\\{x^\{\\prime\}\\in\\mathcal\{X\}:x^\{\\prime\}\\sim x\\\}\.
###### Definition 1\(Invariance Gap\)\.
For a modelffand equivalence class\[x\]∼\[x\]\_\{\\sim\}, the Invariance Gap is:
IG\(f,\[x\]∼\)=Varx′∼\[x\]∼\[f\(x′\)\]=∥f−𝔼\[f∣\[x\]∼\]∥L2\(\[x\]∼\)\\begin\{split\}\\mathrm\{IG\}\(f,\[x\]\_\{\\sim\}\)&=\\sqrt\{\\mathrm\{Var\}\_\{x^\{\\prime\}\\sim\[x\]\_\{\\sim\}\}\[f\(x^\{\\prime\}\)\]\}\\\\ &=\\bigl\\\|f\-\\mathbb\{E\}\[f\\mid\[x\]\_\{\\sim\}\]\\bigr\\\|\_\{L^\{2\}\(\[x\]\_\{\\sim\}\)\}\\end\{split\}\(1\)whereL2\(\[x\]∼\)L^\{2\}\(\[x\]\_\{\\sim\}\)is theL2L^\{2\}space restricted to the equivalence class\[x\]∼\[x\]\_\{\\sim\}\.
The equality states that𝔼\[f∣\[x\]∼\]\\mathbb\{E\}\[f\\mid\[x\]\_\{\\sim\}\]\(the class\-conditional mean,p=Pr\[f\(x′\)=1:x′∼x\]p=\\Pr\[f\(x^\{\\prime\}\)=1:x^\{\\prime\}\\sim x\]\) is the uniqueL2L^\{2\}\-optimal constant approximation toffon\[x\]∼\[x\]\_\{\\sim\}\(a consequence of the fact that the mean minimizes mean squared error\(Doob,[1953](https://arxiv.org/html/2605.29001#bib.bib1)\)\.IG\(f,\[x\]∼\)=0\\mathrm\{IG\}\(f,\[x\]\_\{\\sim\}\)=0if and only ifffis constant on\[x\]∼\[x\]\_\{\\sim\}, i\.e\. the model’s answer does not depend on which representative of the equivalence class is presented\.
Generalizability theory connection\.IG has a natural home in generalizability theory\(Cronbachet al\.,[1972](https://arxiv.org/html/2605.29001#bib.bib7)\):IG2\\mathrm\{IG\}^\{2\}is the paraphrase\-facet variance component in a person×\\timesitem design where “persons” are models and “paraphrase” is a random facet\. The formulation is independently motivated:Choi \([2025](https://arxiv.org/html/2605.29001#bib.bib12)\)derive the sameσ\(accuracy within paraphrase class\)\\sigma\(\\text\{accuracy within paraphrase class\}\)metric \(XParaCon\) for general QA robustness\. FormInv extends this to Lean4\-verified formal equivalence classes with DIF\-theoretic grounding\. The SCR \(Semantic Consistency Rate\) is the proportion of theorems for which the model answers*correctly*across all paraphrase forms, operationally equivalent to the parallel\-form agreement coefficient from classical test theory\(Lord and Novick,[1968](https://arxiv.org/html/2605.29001#bib.bib6)\)\. Note:SCR⊆\{IG=0\}\\mathrm\{SCR\}\\subseteq\\\{\\mathrm\{IG\}=0\\\}\. A theorem withIG=0\\mathrm\{IG\}=0has a constant model response \(possibly consistently wrong\); SCR counts only the subset with constant\-correct responses\. Since all FormInv theorems have ground truth TRUE, a constant\-FALSE response also yieldsIG=0\\mathrm\{IG\}=0but does not contribute to SCR\. This framing is more appropriate than classical DIF \(which conditions on latent ability across demographic groups\) because FormInv holds the population fixed \(one model atT=0T=0\) and varies the item, the classical*parallel\-form reliability*setting\(Lord and Novick,[1968](https://arxiv.org/html/2605.29001#bib.bib6)\)\.
###### Proposition 1\.
For binaryf∈\{0,1\}f\\in\\\{0,1\\\},IG\(f,\[x\]∼\)=p\(1−p\)\\mathrm\{IG\}\(f,\[x\]\_\{\\sim\}\)=\\sqrt\{p\(1\-p\)\}wherep=Pr\[f\(x′\)=1:x′∼x\]p=\\Pr\[f\(x^\{\\prime\}\)=1:x^\{\\prime\}\\sim x\]\. This is the standard deviation of aBernoulli\(p\)\\mathrm\{Bernoulli\}\(p\)random variable, achieving its maximum of0\.50\.5when exactly half the paraphrases are answered correctly\.
IG degeneracy and SCR primacy\.SinceIGt=pt\(1−pt\)\\mathrm\{IG\}\_\{t\}=\\sqrt\{p\_\{t\}\(1\-p\_\{t\}\)\}is a deterministic function of accuracyptp\_\{t\}\(Proposition[1](https://arxiv.org/html/2605.29001#Thmproposition1)\), any cross\-model correlation between Mean\-IG and accuracy is algebraically inevitable, not empirical\.SCRis not degenerate in this sense:SCR≤p¯\\mathrm\{SCR\}\\leq\\bar\{p\}\(Jensen\), with equality only whenpt∈\{0,1\}p\_\{t\}\\in\\\{0,1\\\}for alltt, and two models at the samep¯\\bar\{p\}can have SCR from0top¯\\bar\{p\}\. We therefore treatSCR and per\-theorem Cochran’s Q as primary measures; Mean\-IG is supplementary\.
Aggregate Mean\-IG vs\. global Doob residual\.We reportMean\-IG=1\|𝒯\|∑t∈𝒯IG\(f,\[t\]∼\)\\mathrm\{Mean\\text\{\-\}IG\}=\\frac\{1\}\{\|\\mathcal\{T\}\|\}\\sum\_\{t\\in\\mathcal\{T\}\}\\mathrm\{IG\}\(f,\[t\]\_\{\\sim\}\), the arithmetic mean of per\-theorem IGs\. The corresponding global Doob residual over the full evaluation space isRMS\-IG=1\|𝒯\|∑tIG\(f,\[t\]∼\)2≥Mean\-IG\\mathrm\{RMS\\text\{\-\}IG\}=\\sqrt\{\\frac\{1\}\{\|\\mathcal\{T\}\|\}\\sum\_\{t\}\\mathrm\{IG\}\(f,\[t\]\_\{\\sim\}\)^\{2\}\}\\geq\\mathrm\{Mean\\text\{\-\}IG\}\(Jensen\)\. Mean\-IG is the more interpretable statistic: it is the expected paraphrase standard deviation for a randomly chosen theorem\.
Statistical test\.Fork=2k=2paraphrases, testingH0:IG=0H\_\{0\}:\\mathrm\{IG\}=0uses McNemar’s test\(McNemar,[1947](https://arxiv.org/html/2605.29001#bib.bib5)\), with statisticχ2=\(b−c\)2/\(b\+c\)\\chi^\{2\}=\(b\-c\)^\{2\}/\(b\+c\)whereb,cb,ccount discordant pairs\. For the general case ofk\>2k\>2paraphrase families \(as in FormInv wherek=7k=7–88\), the correct generalization is Cochran’s Q test\(Cochran,[1950](https://arxiv.org/html/2605.29001#bib.bib9)\): a distribution\-free test of marginal homogeneity acrosskkpaired binary observations, reducing to McNemar whenk=2k=2\. Testing at the per\-theorem level with Bonferroni correction for 103 theorems sets the threshold atα/103\\alpha/103\.
### 3\.2Formal Properties of the Invariance Gap
We derive two formal results clarifying what IG measures globally and how family\-level scores relate across models\.
###### Proposition 2\(Paraphrase Error Bound\)\.
Letpt=Prx′∼\[t\]∼\[f\(x′\)=1\]p\_\{t\}=\\Pr\_\{x^\{\\prime\}\\sim\[t\]\_\{\\sim\}\}\[f\(x^\{\\prime\}\)=1\]be the model’s per\-theorem paraphrase accuracy, andIGt:=pt\(1−pt\)\\mathrm\{IG\}\_\{t\}:=\\sqrt\{p\_\{t\}\(1\-p\_\{t\}\)\}\. Then:
𝔼t\[1−pt\]≥RMS\-IG2:=𝔼t\[IGt2\],\\mathbb\{E\}\_\{t\}\[1\-p\_\{t\}\]\\;\\geq\\;\\mathrm\{RMS\\text\{\-\}IG\}^\{2\}:=\\mathbb\{E\}\_\{t\}\[\\mathrm\{IG\}\_\{t\}^\{2\}\],\(2\)with equality if and only ifpt=1p\_\{t\}=1for almost everyt∈𝒯t\\in\\mathcal\{T\}\.
###### Proof\.
Factor:1−pt=pt\(1−pt\)\+\(1−pt\)2=IGt2\+\(1−pt\)2≥IGt21\-p\_\{t\}=p\_\{t\}\(1\-p\_\{t\}\)\+\(1\-p\_\{t\}\)^\{2\}=\\mathrm\{IG\}\_\{t\}^\{2\}\+\(1\-p\_\{t\}\)^\{2\}\\geq\\mathrm\{IG\}\_\{t\}^\{2\}\. Taking expectations gives \([2](https://arxiv.org/html/2605.29001#S3.E2)\); equality holds iff\(1−pt\)2=0\(1\-p\_\{t\}\)^\{2\}=0a\.s\. ∎
#### Interpretation\.
RMS\-IG2is a lower bound on the expected paraphrase error rate: the squared Invariance Gap participates directly in the error decomposition\. Reducing IG is necessary but not sufficient for reducing paraphrase error \(a model with IG=0=0may still fail consistently ifpt≪1p\_\{t\}\\ll 1, i\.e\. it is consistently wrong, captured by SCR=0=0, not IG\)\.
###### Proposition 3\(Sufficient Condition for Ranking Reversal\)\.
Letf1,f2f\_\{1\},f\_\{2\}be two models andFi,FjF\_\{i\},F\_\{j\}two paraphrase families\. Writep¯i\(k\)\\bar\{p\}\_\{i\}^\{\(k\)\}for the mean per\-family accuracy of modelkkonFiF\_\{i\}\. Suppose all four quantities lie in\(12,1\]\(\\tfrac\{1\}\{2\},1\]\. Then:
\(p¯i\(1\)−p¯i\(2\)\)\(p¯j\(1\)−p¯j\(2\)\)<0⟹\\displaystyle\\bigl\(\\bar\{p\}\_\{i\}^\{\(1\)\}\-\\bar\{p\}\_\{i\}^\{\(2\)\}\\bigr\)\\bigl\(\\bar\{p\}\_\{j\}^\{\(1\)\}\-\\bar\{p\}\_\{j\}^\{\(2\)\}\\bigr\)<0\\;\\LongrightarrowIG\(f1,Fi\)<IG\(f2,Fi\)∧IG\(f1,Fj\)\>IG\(f2,Fj\)\.\\displaystyle\\qquad\\mathrm\{IG\}\(f\_\{1\},F\_\{i\}\)<\\mathrm\{IG\}\(f\_\{2\},F\_\{i\}\)\\;\\wedge\\;\\mathrm\{IG\}\(f\_\{1\},F\_\{j\}\)\>\\mathrm\{IG\}\(f\_\{2\},F\_\{j\}\)\.
###### Proof\.
On\(12,1\]\(\\tfrac\{1\}\{2\},1\],g\(p\)=p\(1−p\)g\(p\)=\\sqrt\{p\(1\-p\)\}is strictly decreasing \(g′\(p\)<0g^\{\\prime\}\(p\)<0\)\. The sign condition asserts the relative accuracy advantage reverses betweenFiF\_\{i\}andFjF\_\{j\}; sinceggis strictly decreasing, the IG ordering also reverses\. ∎
#### Empirical calibration\.
All per\-family accuracies satisfyp¯\>0\.77\>12\\bar\{p\}\>0\.77\>\\tfrac\{1\}\{2\}\. GPT\-4o/Claude reversal on\(F6,F7\)\(F\_\{6\},F\_\{7\}\):p¯6\(GPT\-4o\)=1\.00\>p¯6\(Claude\)=0\.833\\bar\{p\}\_\{6\}^\{\(\\text\{GPT\-4o\}\)\}=1\.00\>\\bar\{p\}\_\{6\}^\{\(\\text\{Claude\}\)\}=0\.833,p¯7\(GPT\-4o\)=0\.90<p¯7\(Claude\)=0\.94\\bar\{p\}\_\{7\}^\{\(\\text\{GPT\-4o\}\)\}=0\.90<\\bar\{p\}\_\{7\}^\{\(\\text\{Claude\}\)\}=0\.94\(sign product<0<0\); 16\.7 pp reversal onF6F\_\{6\}, 4\.0 pp onF7F\_\{7\}\.
###### Corollary 1\.
Mean\-IG and family\-conditional rankings measure distinct properties\. Proposition[2](https://arxiv.org/html/2605.29001#Thmproposition2)bounds expected paraphrase error from below; Proposition[3](https://arxiv.org/html/2605.29001#Thmproposition3)places no constraint on family\-level orderings\. A model with strictly lower Mean\-IG than a competitor may still be*less*invariant on specific families of practical interest, mandating per\-family reporting\.
###### Corollary 2\(No\-Free\-Benchmark, empirical\)\.
In our 9\-model evaluation, no model Pareto\-dominates all others on all 8 families \(verified; see per\-family Table[5](https://arxiv.org/html/2605.29001#S5.T5)\)\. Consequently, benchmark designers who select which families to include implicitly choose a weightingλ\\lambdaover families that determines the resulting model ranking\. FormInv makes this choice explicit and auditable\.
### 3\.3The 8\-Family Paraphrase Taxonomy
Analogous to how ChaosBench\-Logic’s\(Thomas,[2026a](https://arxiv.org/html/2605.29001#bib.bib22)\)27\-predicate ontology defines the evaluation space, FormInv’s 8\-family taxonomy defines the space of formulation\-equivalent transformations:
Table 2:FormInv’s 8 families\.T1: formally certifiable \(named proof rule, Lean4/FOL\)\.T2: conditional on type context\.T3: heuristic NL, human\-verified\.
### 3\.4Aggregate Metrics
Beyond per\-theorem IG, we report:
Mean\-IG=1\|𝒯\|∑t∈𝒯IG\(f,\[t\]∼\)\\displaystyle=\\frac\{1\}\{\|\\mathcal\{T\}\|\}\\sum\_\{t\\in\\mathcal\{T\}\}\\mathrm\{IG\}\(f,\[t\]\_\{\\sim\}\)\(3\)SCR\\displaystyle\\mathrm\{SCR\}=\|\{t∈𝒯:f\(x′\)=1∀x′∈\[t\]∼\}\|\|𝒯\|\\displaystyle=\\frac\{\|\\\{t\\in\\mathcal\{T\}:f\(x^\{\\prime\}\)=1\\;\\forall x^\{\\prime\}\\in\[t\]\_\{\\sim\}\\\}\|\}\{\|\\mathcal\{T\}\|\}\(4\)Hi\-IG%=\|\{t∈𝒯:IG\(f,\[t\]∼\)\>0\.10\}\|\|𝒯\|\\displaystyle=\\frac\{\|\\\{t\\in\\mathcal\{T\}:\\mathrm\{IG\}\(f,\[t\]\_\{\\sim\}\)\>0\.10\\\}\|\}\{\|\\mathcal\{T\}\|\}\(5\)
## 4FormInv Benchmark
### 4\.1Theorem Substrate
Following the same principle as ChaosBench\-Logic\(Thomas,[2026a](https://arxiv.org/html/2605.29001#bib.bib22)\), which was built using Gilpin’s dysts dynamical systems library\(Gilpin,[2021](https://arxiv.org/html/2605.29001#bib.bib21)\)as its formal substrate, we build FormInv on a curated formal mathematics substrate from two sources:
Curated Mathlib4 theorems\.We manually curated 103 theorems from Mathlib4\(The mathlib Community,[2020](https://arxiv.org/html/2605.29001#bib.bib18); Baanenet al\.,[2025](https://arxiv.org/html/2605.29001#bib.bib14)\)spanning 7 mathematical domains \(Table[3](https://arxiv.org/html/2605.29001#S4.T3)\), selected to include diverse predicate structures and sufficient complexity for F7/F8 definitional transformations\. The 103 canonical theorem statements are Lean4\-verified Mathlib4 declarations\. The 657 generated paraphrases are*not*Lean4\-verified; equivalence is established by CAS \(SymPy\) for algebraic families, template construction for surface families, and domain\-expert review for F7–F8 \(see §[4](https://arxiv.org/html/2605.29001#S4)for full protocol\)\.
Track B pilot\.On 100 harder ntp\-mathlib theorems \(811 items\), the capability\-invariance correlation weakens tor=−0\.415r=\-0\.415\(p=0\.27p=0\.27, not significant\), confirming the correlation is difficulty\-regime\-dependent\. F5 connective\-variation failures rise to∼\\sim32% and all 8 families cluster at 28–35%, showing the F5 specificity effect dissolves at high difficulty\. The two reasoning models \(DeepSeek R1, o4\-mini\) become the most invariant at high difficulty, consistent with CoT reducing surface sensitivity when theorems demand deliberation\. Full findings A–D in Appendix[K](https://arxiv.org/html/2605.29001#A11)\.
Table 3:FormInv domain distribution \(v1: 103 theorems\)\.
### 4\.2Paraphrase Generation and Verification
For each theoremtt, we generate one paraphrase per applicable family using GPT\-4o with a structured prompt \(Appendix[A](https://arxiv.org/html/2605.29001#A1)\)\. Paraphrases are cached and verified by the following protocol:
1. 1\.F1–F3 \(surface\):CAS verification where applicable \(SymPy\) \+ manual audit on 20% sample
2. 2\.F4–F6 \(semantic\):CAS verification for algebraic families; template guarantee for others
3. 3\.F7–F8 \(definitional\):Human verification \(domain expert\) on 100% of items
This verification protocol adapts ChaosBench\-Logic’s CARE adjudication pipeline\(Thomas,[2026a](https://arxiv.org/html/2605.29001#bib.bib22)\): pre\-registered protocol, LLM dual review, and human sign\-off before repair decisions\. Final dataset:760 items\(103 theorems×\\timesavg\. 7\.4 applicable families\)\.
### 4\.3Dataset Properties
The dataset satisfies key quality gates: \(1\) All paraphrases have verified ground truth \(TRUE for valid Mathlib theorems\); \(2\) Equivalence is domain\-expert confirmed for F7/F8; \(3\) CAS\-verifiable items are independently reproducible; \(4\) SHA256 hash ensures reproducibility:e668f98a\.\.\.79a1656\(full hash in dataset card\)\.
## 5Experiments
### 5\.1Setup
Models\.We evaluate 9 models spanning 5 providers and 3 capability tiers:*Frontier flagship:*GPT\-4o, Gemini 2\.5 Flash, Claude Sonnet 4\.6, DeepSeek V3;*Efficient:*GPT\-4o\-mini, Claude Haiku 4\.5;*Reasoning \(CoT\):*o4\-mini \(OpenAI\), DeepSeek R1 \(deepseek\-reasoner\);*Open\-source:*Llama 3\.3 70B \(via OpenRouter\)\. All evaluated at temperature=0 or equivalent; reasoning models use their default inference modes\.
Dataset\.Primary: 366\-item FormInv v1 \(50 theorems, 9 models\)\. Extended: 103\-theorem evaluation \(760 items, 9 models, all families\); Track B: 100 ntp\-mathlib theorems \(811 items, 9 models\), harder difficulty\.Total: 203 theorems evaluated across all 9 models\.
Caveats\.Gemini 2\.5 Flash coverage is 96% \(15 items timed out at 2048 output tokens\)\. DeepSeek R1 responses include a reasoning trace; we parse the final TRUE/FALSE verdict, not intermediate steps\. Maximum output tokens: 4096 for reasoning models, 20 for others\.
Protocol\.Zero\-shot; system prompt: “You are evaluating mathematical statements\. Answer strictly based on mathematical correctness\. Return exactly TRUE or FALSE\.” Response caching with SHA256 keys ensures reproducibility across runs\.
Baseline\.Per\-theorem within\-canonical variance \(same canonical prompt, temperature=0\) is≈0\\approx 0for all models, confirming that all observed IG is due to paraphrase sensitivity, not API stochasticity\.
### 5\.2Main Results
Table 4:Per\-model invariance profile on FormInv v1 \(50 theorems, 366 items\)\. Sorted by Mean IG \(ascending = most invariant first\)\. SCR = Semantic Consistency Rate\.†\\dagger: 96% coverage \(15 items timed out\)\.N=103 stability \(9 models\):Rankings are stable at 103 theorems \(full 9\-model evaluation; Appendix[B](https://arxiv.org/html/2605.29001#A2)\)\. DeepSeek Chat leads: SCR = 83\.5% \(↑\\uparrow1\.5pp\)\. Gemini 2\.5 Flash: SCR = 82\.5% \(↑\\uparrow2\.5pp; coverage 96\.4%\)\. GPT\-4o: SCR = 79\.6% \(↓\\downarrow2\.4pp\) — GPT\-4o is less invariant on harder theorems\. Claude Haiku: SCR = 54\.4% \(↑\\uparrow4\.4pp\), gap narrows slightly to 29pp vs\. 32pp at N=50\. All deltas are within±\\pm5pp; no model degrades severely\.
#### Finding 1: The capability\-invariance correlation: what accuracy hides\.
The most striking result in Table[4](https://arxiv.org/html/2605.29001#S5.T4)is not any individual model’s score\. It is the range\. Claude Haiku 4\.5, a deployed commercial model in the same family as Claude Sonnet, achieves SCR = 50%: half its theorems are answered inconsistently across semantically equivalent restatements\. DeepSeek V3, at 96\.4% accuracy, achieves 82% SCR\. A 10\-point accuracy gap translates to a 32\-point invariance gap\.
A practitioner who evaluates Claude Haiku on a standard benchmark and observes 86% accuracy would conclude the model is adequate for mathematical reasoning tasks\. FormInv reveals a different picture: on 50% of theorems the model “knows,” the correct answer changes depending on how the question is phrased\. This is not a small\-model quirk, it is a structural failure of semantic representation that accuracy metrics are constitutionally blind to\.
Across all 9 models, the Pearson correlation between aggregate accuracy and mean IG isr=−0\.965r=\-0\.965\(Pearson\); Spearmanρ=−0\.883\\rho=\-0\.883,p=0\.0016p=0\.0016: more accurate models are*more*semantically invariant\. Removing Claude Haiku as the highest\-leverage point yieldsr=−0\.906r=\-0\.906, still significant \(p<0\.05p<0\.05,n=8n=8\), confirming the relationship holds across the full accuracy range \(Proposition[2](https://arxiv.org/html/2605.29001#Thmproposition2)bounds this relationship formally\)\. The 22% of theorem\-model pairs withIG\>0\.10\\mathrm\{IG\}\>0\.10represent theorems where models fail on roughly half the paraphrases, a systematic blind spot that no accuracy\-only evaluation can detect\.
#### Finding 2: Models exhibit sharply complementary failure patterns\.
Across the 50\-theorem shared evaluation for four 100%\-coverage chat models \(Claude Sonnet, GPT\-4o, GPT\-4o\-mini, DeepSeek V3\), we observe86 pairwise cross\-model disagreements: items where one model answers correctly and another incorrectly on the same paraphrase\. The complementarity is structural, not incidental: F6 \(comparison order\): Claude Sonnet 16\.7%, GPT\-4o 0\.0% — a 16\-point gap on the same transformation\. F7/F8 \(definitional unpack\): GPT\-4o 10\.0%, Claude 6–8%\. F5 \(connective variation\): all models fail equally \(22\.2%\) — no comparative advantage\. This complementarity implies benchmark accuracy is systematically optimistic per family, and an ensemble would show near\-zero IG on items where individual models fail\.
Figure 1:Per\-family failure rates across three representative models \(3\-model shared evaluation\)\. F5 \(connective variation\) is the highest\-failure family for all models; F6 \(comparison order\) shows the strongest complementarity \(Claude: 16\.7%, GPT\-4o: 0\.0%\)\.
#### Finding 3: The quality audit separates benchmark errors from model errors\.
The F5 apparent failure rate \(22\.2%\) is inflated by semantically\-incorrect paraphrases\. A 30\-item human annotation study \(2 annotators,κ=0\.47\\kappa=0\.47; model consensus resolves interpretive disagreements to effectiveκ≈0\.82\\kappa\\approx 0\.82\) found 7 of 15 F5 items contain biconditional overreach\. Cross\-model unanimity flagged all 7 at 100% recall, 0 false positives on MathCheck\-GSM \(curated external benchmark\)\. Removing the 7 bad items reduces F5 failure from 22\.2% to∼\\sim11%; remaining failures are model\-specific \(see Appendix[H](https://arxiv.org/html/2605.29001#A8)\)\.
#### FALSE controls \(§[D](https://arxiv.org/html/2605.29001#A4)\)\.
We constructed 5 FALSE theorem variants \(25 items, 5 families each\)\. An always\-TRUE oracle achieves balanced accuracy=50\.0%=50\.0\\%; Haiku achieves 89%, Sonnet 95% — neither is always\-TRUE\. The SCR gap from TRUE items \(Haiku 50%, Sonnet 74%\) persists on FALSE items \(60%, 80%\), confirming SCR measures genuine invariance\. Full results in Appendix[D](https://arxiv.org/html/2605.29001#A4)\.
### 5\.3Per\-Family Analysis
Table 5:Per\-family failure rates averaged across four 100%\-coverage chat models \(Claude Sonnet, GPT\-4o, GPT\-4o\-mini, DeepSeek V3\) on the shared 50\-theorem evaluation\. Surface families \(F1–F3\) are generally easier; semantic/definitional families \(F5–F8\) are harder\. F5 connective variation is uniquely difficult across all models\.F5 \(connective variation\) fails universally\.All models fail at≥19%\\geq 19\\%; even trivially\-provable biconditionals fail with “precisely if” or “just in case” variants\. The quality audit found 7/15 F5 items contained biconditional overreach \(§[6\.3](https://arxiv.org/html/2605.29001#S6.SS3)\); cleaned rate:∼\\sim11%\.
F6 \(comparison order\) is Claude’s failure, not GPT\-4o’s\.Claude Sonnet: 16\.7%; GPT\-4o: 0\.0% — a 16\-point gap onx≥0\\sqrt\{x\}\{\\geq\}0vs0≤x0\{\\leq\}\\sqrt\{x\}\. Claude applies a fixed\-subject directionality heuristic; GPT\-4o does not\.
F7/F8 \(definitional unpacking\) is GPT\-4o’s failure\.GPT\-4o fails F7 and F8 at 10\.0% each \(Claude: 6–8%\)\. Replacing “prime” with “has exactly two positive divisors” doubles GPT\-4o’s error relative to surface families\.
F3 \(active/passive\) inverts relational predicates\.“nndivides0” \(TRUE\) becomes “0is divisible bynn” → GPT\-4o answers FALSE \(4\.4%\), reading the passive as0÷n0\\div n\.
## 6Analysis
### 6\.1Does Accuracy Predict Invariance?
The primary invariance finding is a32\-point SCR gap: Claude Haiku 4\.5 \(86% accuracy\) achieves SCR = 50% — half its theorems are answered inconsistently across semantically equivalent restatements — while DeepSeek V3 \(96\.4% accuracy\) achieves SCR = 82%\. Standard accuracy alone misses this: all nine models score “adequate” by typical benchmark thresholds, yet their SCR profiles span a1\.64×1\.64\\timesrange \(50% to 82%\)\.
Across all 9 models, Mean\-IG also correlates with accuracy:r=−0\.965r=\-0\.965\(Pearson\),ρ=−0\.883\\rho=\-0\.883\(Spearman,p=0\.0016p=0\.0016\)\. As noted in Remark[1](https://arxiv.org/html/2605.29001#Thmproposition1), this correlation is an algebraic consequence ofIG=p\(1−p\)\\mathrm\{IG\}=\\sqrt\{p\(1\-p\)\}\(Proposition[1](https://arxiv.org/html/2605.29001#Thmproposition1)\) and does not constitute an independent empirical finding\. The SCR correlation with accuracy is empirically real and not algebraically constrained:ρ=−0\.917\\rho=\-0\.917\(Spearman,p<0\.001p<0\.001\), confirmed without the Haiku anchor \(ρ=−0\.833\\rho=\-0\.833,p<0\.05p<0\.05,n=8n=8\)\.
Accuracy alone misses this: the 10\-point accuracy gap between Haiku and DeepSeek V3 conceals a 32\-point SCR gap\. Reasoning mode has no systematic effect after controlling for accuracy: residual analysis shows DeepSeek R1 and o4\-mini at±1\.4%\\pm 1\.4\\%\(Appendix[G](https://arxiv.org/html/2605.29001#A7)\)\. The per\-theorem IG distribution is bimodal: 79% have IG=0=0\(all paraphrases consistent\) and 21% have IG≈0\.42\\approx 0\.42\(fail on 1–3 families\)\. SCR and Hi\-IG% are more informative than Mean\-IG; bootstrap stability and the full bimodal figure are in Appendix[M](https://arxiv.org/html/2605.29001#A13)\.
### 6\.2Family\-Conditional Ranking Reversals
A key implication of complementary failure patterns is that*model rankings reverse across paraphrase families\.*Rank ordering by F6 \(comparison order\) failure rate places GPT\-4o and o4\-mini joint\-first \(0\.0%\), while Claude Sonnet ranks 7th of 9 \(16\.7%\)\. Rank ordering by F7 \(definitional unpack\) reverses this: Claude Sonnet ranks 3rd \(6\.0%\), while GPT\-4o drops to 7th \(10\.0%\)\.
Kendallτ\\tauacross all 28 family pairs ranges from\+\+0\.06 to\+\+0\.84 \(all positive\), confirming pair\-level reversals are real but aggregate rankings weakly agree\. A practitioner choosing between GPT\-4o and Claude Sonnet for a definitional\-reasoning task \(F7\) should prefer Claude \(6% vs\. 10% failure\); for comparison\-direction \(F6\), prefer GPT\-4o \(0% vs\. 16\.7%\)\. Full per\-family rank table in Appendix[G](https://arxiv.org/html/2605.29001#A7)\.
### 6\.3Paraphrase Quality Audit
Forensic inspection of all cached responses reveals three GPT\-4o generation failure types:\(1\) Biconditional overreach \(F5\): “if and only if” applied to one\-directional theorems, asserting a false converse \(e\.g\.,Nat\.dvd\_add: models answering FALSE are mathematically correct; the paraphrase is the error\);\(2\) Passive\-voice inversion \(F3\): “0∣n0\\mid n” becomes “n÷0n\\div 0”, an undefined expression;\(3\) Type\-context stripping \(F1, F7\): Lean4 scope \(LinearOrderedRing\) dropped, making domain\-correct answers wrong over broader type universes\. Per\-item breakdown in Appendix[H](https://arxiv.org/html/2605.29001#A8)\.
We flag 11 items across these categories and exclude them from the cleaned analysis \(detailed per\-item breakdown in Appendix[H](https://arxiv.org/html/2605.29001#A8)\)\. After exclusion, GPT\-4o exhibits the highest proportion of genuinely model\-attributable failures \(≈\\approx84% of its remaining failures are not explained by paraphrase quality, based on item\-level inspection in the forensic cache audit\), while DeepSeek and Claude have a higher fraction attributable to paraphrase drift\. These findings validate a key FormInv design principle: tracking which items fail*across models*reliably separates paraphrase quality issues \(which cause universal or near\-universal failure\) from model\-specific reasoning weaknesses\.
Key finding:The F6 vs\. F7 ranking reversal \(GPT\-4o rank 1 on F6, rank 7 on F7\) reflects distinct capability profiles: GPT\-4o handles comparison\-direction equivalences well but fails on existential reformulations; Claude is the opposite\. A five\-category failure taxonomy \(biconditional overreach, existential blindness, self\-membership blindness, domain convention conflict, type\-context stripping\) is in Appendix[I](https://arxiv.org/html/2605.29001#A9)\.
A pilot cross\-benchmark study on MMLU formal\_logic and global\_facts shows consistent per\-family SCR patterns; full results in Appendix[J](https://arxiv.org/html/2605.29001#A10)\.
## 7FormInv as General Evaluation Protocol
FormInv is designed as a supplementary metric for any formally\-specified reasoning benchmark, analogous to how CheckList\(Ribeiroet al\.,[2020](https://arxiv.org/html/2605.29001#bib.bib26)\)extended behavioral testing across NLP tasks\. Three outputs of a FormInv run are immediately actionable\.
FormInvSelector: regime\-aware model selection\.Given a target reasoning task characterized by which paraphrase families it relies on \(e\.g\., definitional reasoning and comparison\-direction tasks use F7 and F6 respectively\), FormInvSelector uses the per\-family IG profile to recommend the model with minimum expected failure rate on those families:
```
forminv selector --families unpack order
# o4-mini 2.0% gemini-2.5-flash 2.1%
# gpt-4o (5.0%)
```
This is the actionable output of the regime\-dependent invariance analysis: a practitioner selecting between GPT\-4o and Claude Sonnet for a definitional\-reasoning application should prefer Claude \(6% vs GPT\-4o 10% failure on F7, §[6\.2](https://arxiv.org/html/2605.29001#S6.SS2)\)\.*Note:*FormInvSelector requires completed FormInv evaluation; it does not predict winners before evaluation runs\.
The FormInv quality audit protocol\.A by\-product of evaluating multiple models on the same paraphrase items is a cross\-model unanimity signal that reliably separates*paraphrase errors*from*model errors*\. The protocol: if≥6\\geq 6of the 9 evaluated models answer a paraphrase incorrectly while answering the canonical theorem correctly, flag the item for expert review\. The≥6/9\\geq 6/9threshold \(two\-thirds majority\) excludes coincidental model agreement while catching systematic failures attributable to the paraphrase itself\.
We applied this protocol to FormInv v1 and identified 11 semantically\-incorrect paraphrases generated by GPT\-4o, including: two cases of*biconditional overreach*\(“exactly when” applied to one\-directional theorems\); one case of*passive\-voice inversion*\(“n is divided by zero” inverting0∣n0\\mid nton÷0n\\div 0\); and several cases of*type\-context stripping*\(Lean4 scope dropped, making domain\-correct theorems false over broader type universes\)\. In each case, the models were mathematically correct, the paraphrase was the error\.
This protocol generalizes: any benchmark that generates paraphrases programmatically can use cross\-model unanimity as an automated quality gate, reducing reliance on costly domain\-expert review\.*The quality audit is a first\-class output of running FormInv, not a side effect\.*
Availability\.pip install forminv\. Commands:forminv audit,forminv selector,forminv eval\.
## 8Conclusion
The moral: if you generate paraphrases programmatically, you do not know how many are wrong, and the rankings your benchmark produces may be wrong for the same reason\.
We applied FormInv’s automated cross\-model unanimity protocol to all 129 MathCheck\(Zhouet al\.,[2024b](https://arxiv.org/html/2605.29001#bib.bib32)\)groups\. It flagged 4 semantically\-incorrect paraphrases \(3\.1%\): unit\-stripping \(Group 25\), sub\-question redirection \(Group 27\), and two additional errors in Groups 75 and 82\. Removing them*changes the model ranking*: GPT\-4o drops from 2nd to 4th place \(\+0\.6\+0\.6pp absolute\); Claude Haiku moves from 3rd to 2nd; DeepSeek from 4th to 3rd\. GPT\-4o’s apparent advantage was an artifact — it happened to answer the broken paraphrases correctly while other models failed them, artificially inflating its SCR\. Full reversal details in Appendix[E](https://arxiv.org/html/2605.29001#A5)\. The quality audit makes FormInv a tool every benchmark builder needs\. Run it before you publish\. Either way, your benchmark is better\.Limitations:18 Lean4 proofs certify representative examples \(Appendix[F](https://arxiv.org/html/2605.29001#A6)\); full\-scale certification and equivariance defect formalization are future work\.
## References
- ProofNet: autoformalizing and formally proving undergraduate\-level mathematics\.External Links:2302\.12433Cited by:[Appendix K](https://arxiv.org/html/2605.29001#A11.p6.1)\.
- A\. Baanen, M\. R\. Ballard, J\. Commelin, B\. G\. Chen, M\. Rothgang, and D\. Testa \(2025\)Growing Mathlib: maintenance of a large scale mathematical library\.InConference on Intelligent Computer Mathematics \(CICM\),External Links:2508\.21593Cited by:[§4\.1](https://arxiv.org/html/2605.29001#S4.SS1.p2.1)\.
- M\. Choi \(2025\)RoParQ: paraphrase\-aware alignment of large language models towards robustness to paraphrased questions\.External Links:2511\.21568Cited by:[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p3.7)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.External Links:2110\.14168Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px1.p1.1)\.
- W\. G\. Cochran \(1950\)The comparison of percentages in matched samples\.Biometrika37\(3/4\),pp\. 256–266\.Cited by:[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p6.10)\.
- L\. J\. Cronbach, G\. C\. Gleser, H\. Nanda, and N\. Rajaratnam \(1972\)The dependability of behavioral measurements: theory of generalizability for scores and profiles\.Wiley\.Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px8.p1.1),[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p3.7)\.
- A\. P\. Dawid and A\. M\. Skene \(1979\)Maximum likelihood estimation of observer error\-rates using the EM algorithm\.Applied Statistics28\(1\),pp\. 20–28\.External Links:[Document](https://dx.doi.org/10.2307/2346806)Cited by:[Appendix F](https://arxiv.org/html/2605.29001#A6.SS0.SSS0.Px1.p1.17)\.
- I\. de Zarzà, J\. de Curtò, J\. Cabot, P\. Manzoni, and C\. T\. Calafate \(2026\)Semantic invariance in agentic AI\.InInternational Conference on Agents and Multi\-Agent Systems: Technologies and Applications \(AMSTA\),External Links:2603\.13173Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px2.p1.3)\.
- J\. Dekoninck, M\. N\. Müller, and M\. Vechev \(2024\)ConStat: performance\-based contamination detection in large language models\.External Links:2405\.16281Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px7.p1.1)\.
- J\. L\. Doob \(1953\)Stochastic processes\.Wiley\.Cited by:[§1](https://arxiv.org/html/2605.29001#S1.p3.1),[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p2.8)\.
- W\. Gilpin \(2021\)Chaos as an interpretable benchmark for forecasting and data\-driven modelling\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,External Links:2110\.05266Cited by:[§4\.1](https://arxiv.org/html/2605.29001#S4.SS1.p1.1)\.
- I\. Gorbett and S\. Jana \(2026\)Cross\-model disagreement as a label\-free correctness signal\.External Links:2603\.25450Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px3.p1.1)\.
- C\. Guoet al\.\(2024\)Are LLMs better than reported? detecting label errors and analyzing their effect on model performance\.External Links:2410\.18889Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Han, H\. Schoelkopf, Y\. Zhao, Z\. Qi, M\. Riddell, L\. Benson, L\. Sun, E\. Zubova, Y\. Qiao, M\. Burtell, D\. Peng, J\. Fan, Y\. Liu, B\. Wong, M\. Sailor, A\. Ni, L\. Nan, J\. Kasai, T\. Yu, R\. Zhang, S\. Joty, A\. R\. Fabbri, W\. Kryscinski, X\. V\. Lin, C\. Xiong, and D\. Radev \(2022\)FOLIO: natural language reasoning with first\-order logic\.External Links:2209\.00840Cited by:[Appendix J](https://arxiv.org/html/2605.29001#A10.p4.1)\.
- T\. Hao, T\. Wan, and S\. Zhai \(2025\)PutnamGAP: an investigation of LLM mathematical robustness via mathematically\-equivalent transformation of advanced mathematical problems\.External Links:2508\.08833Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px7.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021a\)Measuring massive multitask language understanding\.External Links:2009\.03300Cited by:[Appendix J](https://arxiv.org/html/2605.29001#A10.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021b\)Measuring mathematical problem solving with the MATH dataset\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,External Links:2103\.03874Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px1.p1.1)\.
- P\. W\. Holland and D\. T\. Thayer \(1988\)Differential item performance and the mantel\-haenszel procedure\.InTest Validity,H\. Wainer and H\. I\. Braun \(Eds\.\),pp\. 129–145\.Cited by:[§1](https://arxiv.org/html/2605.29001#S1.p3.1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px8.p1.1)\.
- J\. Hu, T\. Zhu, and S\. Welleck \(2024\)miniCTX: neural theorem proving with \(long\-\)contexts\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,External Links:2408\.03350Cited by:[Appendix K](https://arxiv.org/html/2605.29001#A11.p1.1)\.
- S\. Kimet al\.\(2025\)Large language models and the madness of crowds: measuring correlated errors across frontier models\.External Links:2411\.01539Cited by:[Appendix F](https://arxiv.org/html/2605.29001#A6.SS0.SSS0.Px1.p1.17)\.
- L\. Lamport, R\. Shostak, and M\. Pease \(1982\)The byzantine generals problem\.ACM Transactions on Programming Languages and Systems4\(3\),pp\. 382–401\.Cited by:[Appendix F](https://arxiv.org/html/2605.29001#A6.SS0.SSS0.Px1.p1.17)\.
- H\. Liu, Z\. Zheng, Y\. Qiao, H\. Duan, Z\. Fei, F\. Zhou, W\. Zhang, S\. Zhang, D\. Lin, and K\. Chen \(2024\)MathBench: evaluating the theory and application proficiency of LLMs with a hierarchical mathematics benchmark\.External Links:2405\.12209Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px1.p1.1)\.
- F\. M\. Lord and M\. R\. Novick \(1968\)Statistical theories of mental test scores\.Addison\-Wesley\.Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px8.p1.1),[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p3.7)\.
- Q\. McNemar \(1947\)Note on the sampling error of the difference between correlated proportions or percentages\.Psychometrika12\(2\),pp\. 153–157\.Cited by:[§3\.1](https://arxiv.org/html/2605.29001#S3.SS1.p6.10)\.
- H\. Moore and A\. Shah \(2025\)Evaluating autoformalization robustness via semantically similar paraphrasing\.External Links:2511\.12784Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px7.p1.1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px8.p1.1)\.
- C\. G\. Northcutt, A\. Athalye, and J\. Mueller \(2021\)Pervasive label errors in test sets destabilize machine learning benchmarks\.InAdvances in Neural Information Processing Systems \(NeurIPS\) Datasets and Benchmarks Track,External Links:2103\.14749Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px3.p1.1)\.
- K\. Paster, M\. D\. Santos, Z\. Azerbayev, and J\. Ba \(2024\)GSM\-Plus: a comprehensive benchmark for evaluating the robustness of LLMs as mathematical problem solvers\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\),External Links:2402\.19255Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1)\.
- M\. T\. Ribeiro, T\. Wu, C\. Guestrin, and S\. Singh \(2020\)Beyond Accuracy: behavioral testing of NLP models with CheckList\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 4902–4912\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.acl-main.442)Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px4.p1.1),[§7](https://arxiv.org/html/2605.29001#S7.p1.1)\.
- A\. Romanou, M\. Ibrahim, C\. Ross, C\. Shaib, K\. Oktar, S\. J\. Bell, A\. Ovalle, J\. Dodge, A\. Bosselut, K\. Sinha, and A\. Williams \(2026\)Brittlebench: quantifying LLM robustness via prompt sensitivity\.External Links:2603\.13285Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px4.p1.1)\.
- M\. Sclar, Y\. Choi, Y\. Tsvetkov, and A\. Suhr \(2024\)Quantifying language models’ sensitivity to spurious features in prompt design or: how i learned to start worrying about prompt formatting\.InInternational Conference on Learning Representations \(ICLR\),External Links:2310\.11324Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px4.p1.1)\.
- A\. K\. Taylor, J\. Zhang, E\. Ji, V\. Sahai, H\. Deng, Y\. Chen, Y\. Yuan, D\. Wu, J\. Gu, K\. Chang, N\. Peng, A\. Sahai, and W\. Wang \(2026\)TaoBench: do automated theorem prover LLMs generalize beyond Mathlib?\.External Links:2603\.12744Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px2.p1.3)\.
- Technology Innovation Institute \(2026\)Are arabic benchmarks reliable? QIMMA’s quality\-first approach to LLM evaluation\.External Links:2604\.03395Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px3.p1.1)\.
- The mathlib Community \(2020\)The Lean mathematical library\.External Links:1910\.09338Cited by:[§4\.1](https://arxiv.org/html/2605.29001#S4.SS1.p2.1)\.
- N\. Thomas \(2026a\)ChaosBench\-Logic v2: evaluating LLM logical reasoning over dynamical systems at scale\.External Links:2605\.24305Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px1.p1.1),[§3\.3](https://arxiv.org/html/2605.29001#S3.SS3.p1.1),[§4\.1](https://arxiv.org/html/2605.29001#S4.SS1.p1.1),[§4\.2](https://arxiv.org/html/2605.29001#S4.SS2.p3.1)\.
- N\. Thomas \(2026b\)Regime\-conditioned evaluation in multi\-context Bayesian optimization\.External Links:2605\.04895Cited by:[Appendix K](https://arxiv.org/html/2605.29001#A11.p2.3),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px2.p1.3)\.
- R\. J\. Vandenberg and C\. E\. Lance \(2000\)A review and synthesis of the measurement invariance literature\.Organizational Research Methods3\(1\),pp\. 4–70\.External Links:[Document](https://dx.doi.org/10.1177/109442810031002)Cited by:[§1](https://arxiv.org/html/2605.29001#S1.p3.1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px8.p1.1)\.
- Y\. Yang and W\. Y\. Wang \(2026\)The benchmark illusion: on the incoherence of machine learning evaluation\.External Links:2602\.11898Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px3.p1.1)\.
- Z\. Yu, R\. Peng, K\. Ding, Y\. Li, Z\. Peng, M\. Liu, Y\. Zhang, Z\. Yuan, H\. Xin, W\. Huang, Y\. Wen, G\. Zhang, and W\. Liu \(2025\)FormalMATH: benchmarking formal mathematical reasoning of large language models\.External Links:2505\.02735Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Zawalski, M\. Boubdir, K\. Bałazy, B\. Nushi, and P\. Ribalta \(2025\)CoDeC: detecting data contamination in LLMs via in\-context learning\.Note:ICLR 2026 PosterExternal Links:2510\.27055Cited by:[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px7.p1.1)\.
- Y\. Zhou, Y\. Zhu, D\. Antognini, Y\. Kim, and Y\. Zhang \(2024a\)Paraphrase and solve: exploring and exploiting the impact of surface form on mathematical reasoning in large language models\.InNorth American Chapter of the Association for Computational Linguistics \(NAACL\),External Links:2404\.11500Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px5.p1.1)\.
- Z\. Zhou, S\. Liu, M\. Ning, W\. Liu, J\. Wang, D\. F\. Wong, X\. Huang, Q\. Wang, and K\. Huang \(2024b\)Is your model really a good math reasoner? evaluating mathematical reasoning with Checklist\.External Links:2407\.08733Cited by:[Appendix E](https://arxiv.org/html/2605.29001#A5.p1.1),[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px6.p1.1),[§8](https://arxiv.org/html/2605.29001#S8.p2.1)\.
- K\. Zhu, J\. Wang, J\. Zhou, Z\. Wang, H\. Chen, Y\. Wang, L\. Yang, W\. Ye, N\. Z\. Gong, Y\. Zhang, and X\. Xie \(2024\)PromptBench: towards evaluating the robustness of large language models on adversarial prompts\.External Links:2306\.04528Cited by:[Table 1](https://arxiv.org/html/2605.29001#S1.T1),[§2](https://arxiv.org/html/2605.29001#S2.SS0.SSS0.Px4.p1.1)\.
Appendices Supplementary material for*FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks*
## Appendix AParaphrase Generation Prompt
All paraphrases are generated with GPT\-4o using the following structured prompt\. The family identifier is replaced with the specific family description at generation time\.
> System:You are a mathematical linguist\. Generate exactly ONE alternative NL formulation of the given theorem according to the transformation family\. The paraphrase MUST: \(1\) preserve exact logical content and truth value, \(2\) apply the specified transformation type, \(3\) be grammatically correct\. Return ONLY the paraphrase\. User:Theorem: \{canonical\_statement\} Family: \{family\_description\} Generate one paraphrase applying this transformation:
## Appendix BN=103 Stability Table
Table 6:Per\-model SCR and accuracy at N=103 theorems \(all 9 models, sorted by SCR\)\. Gemini coverage = 96\.4%; all other models≥\\geq98\.6%\.
## Appendix CSample Size Guidance for SCR Estimation
A practitioner running FormInv needs to know how many theorems to sample\. The table below shows the bootstrap 95% CI half\-width for SCR at variousNN, computed via Binomial\(N,p\)\(N,p\)simulation atp=0\.75p=0\.75\(the mean SCR across our 9 models\):
Table 7:SCR 95% CI half\-width as a function of sample size \(Binomial simulation,p=0\.75p=0\.75, 5,000 bootstraps\)\.The FormInv v1 50\-theorem evaluation detects gaps≥24\\geq 24pp at 95% confidence \(the 32\-pp Haiku/DeepSeek gap satisfies this criterion:\[38%,62%\]\[38\\%,62\\%\]and\[70%,94%\]\[70\\%,94\\%\]do not overlap\)\. For gaps<15<15pp, 100\+ theorems are required\.
## Appendix DFALSE Controls Pilot
We constructed 5 FALSE theorem variants with verified counterexamples: \(1\)*Every integer≥2\\geq 2is prime*\(FALSE; 4 is not prime\), \(2\)*x2\>0x^\{2\}\>0for all realxx*\(FALSE;02=00^\{2\}=0\), \(3\)*Ifa∣ba\{\\mid\}banda∣ca\{\\mid\}cthenb=cb=c*\(FALSE;2∣42\{\\mid\}4,2∣62\{\\mid\}6,4≠64\{\\neq\}6\), \(4\)*A∩B=A⇒A=BA\{\\cap\}B=A\\Rightarrow A=B*\(FALSE;A=\{1\},B=\{1,2\}A=\\\{1\\\},B=\\\{1,2\\\}\), \(5\)*n\!\>nn\!\>nfor alln∈ℕn\\in\\mathbb\{N\}*\(FALSE;1\!=11\!=1\)\. Each was paraphrased across 5 families \(canonical, syntactic, quantifier, comparison\-order, notation\)\.
Table 8:FALSE controls: 25 items, 5 theorems, 8 models\. An always\-TRUE oracle achieves balanced accuracy=50\.0%=50\.0\\%\(0% on FALSE items\)\. All models beat the oracle; o4\-mini and DeepSeek\-R1 achieve perfect SCR\.Note: DeepSeek V3 = deepseek\-chat; DeepSeek R1 = deepseek\-reasoner \(reasoning model\)\. Reasoning models \(o4\-mini, R1\) show zero TRUE\-bias on FALSE items\.
The hardest item across models isfalse\_004\_subset\_iff\_equal\(“A=BA\{=\}BwheneverA∩B=AA\{\\cap\}B\{=\}A”\), where multiple models answer TRUE — a shared systematic error detectable by cross\-model unanimity\. Reasoning models \(o4\-mini, DeepSeek R1\) are immune to all five FALSE theorem types\.
#### Fleiss’sκ\\kappaon 9\-model main evaluation\.
On the 366\-item primary dataset, Fleiss’sκ=0\.297\\kappa=0\.297\(“fair” by Landis–Koch criteria\)\. The lowκ\\kappareflects the TRUE\-heavy distribution \(94\.2% TRUE rate\) rather than low model reliability: when 94% of items are TRUE, models naturally agree by saying TRUE, butκ\\kappapenalizes this as chance agreement\.κ=0\.297\\kappa=0\.297\(Landis–Koch: “fair”\) reflects the TRUE\-heavy distribution \(94\.2% TRUE\); genuine disagreements on the 5\.8% of items where models split are precisely what FormInv quantifies\. The effective discrimination signal comes from the 5\.8% of items where models disagree \(86 pairwise disagreements\), which FormInv’s per\-theorem Cochran’s Q test targets directly\.
## Appendix EMathCheck External Audit
We manually inspected 40 of 129 MathCheck\(Zhouet al\.,[2024b](https://arxiv.org/html/2605.29001#bib.bib32)\)Problem Understanding \(PU\) groups\. PU paraphrases are GPT\-4\-Turbo\-generated with NL plausibility verification \(93\.02% pass rate claimed\)\. No logical equivalence check was applied\.
Errors found \(4/40 = 10%\):
Group 38 \(high, Lean4\-disproved\)\.Original:*Alex weighs 2 less than 4 times Grace*\(4×125−2=4984\\times 125\-2=498\)\. PU:*Alex’s weight is twice as much less 2 than twice the doubled weight of Grace\.*Standard parse:2×\(2×\(2×125\)\)−2=998≠4982\\times\(2\\times\(2\\times 125\)\)\-2=998\\neq 498\. Lean4 proof:mathcheck\_g38\_malformed\_paraphrase\_is\_wrong\(998≠498998\\neq 498byomega\)\.
Group 26 \(moderate, Lean4\-disproved\)\.Original:*four times as old as*\(4b4b\)\. PU:*four times more than*\(strict English:\(4\+1\)b=5b≠4b\(4\+1\)b=5b\\neq 4b\)\. Lean4 proof:mathcheck\_g26\_times\_more\_neq\_times\_as\_old\(5b≠4b5b\\neq 4bforb\>0b\>0\)\.
Groups 5 and 18:question\-scope expansion and temporal scope ambiguity \(moderate\)\.
We also ran FormInv’s automated detection protocol on 30 MathCheck groups \(4 models, canonical \+ PU\)\. The protocol flagged Groups 25 and 27 via cross\-model unanimity \(all 4 models pass canonical, all/3 fail PU\):
Group 25 \(unit\-stripping, severity: high\)\.Canonical:*0\.50\.5hours/day for 7 days*, answer = 35\. PU:*30 minutes/day*, models compute10×30×7=210010\{\\times\}30\{\\times\}7=2100without unit conversion\. All 4 models wrong\.
Group 27 \(sub\-question redirection, severity: moderate\-high\)\.Verbose PU elicits the throw distance \(1200ft\) instead of distance outside range \(200ft\)\. 3/4 models wrong \(GPT\-4o correct by format\)\.
Full 129\-group results\.On all 129 MathCheck groups \(4 models, canonical \+ PU\), FormInv flagged 4 bad paraphrases \(Groups 25, 27, 75, 82 — 3\.1%\)\.
Table 9:MathCheck 129\-group ranking change\. Removing 4 FormInv\-detected bad paraphrases reverses GPT\-4o’s rank\.Why GPT\-4o was inflated:GPT\-4o gave correct answers on Groups 27, 75, and 82 \(the broken paraphrases\), while other models failed them\. Under the broken evaluation, these “correct” answers boosted GPT\-4o’s SCR\. Removing the bad items exposes GPT\-4o’s true weaknesses on canonical problems\.
## Appendix FLean4 Paraphrase Certificates
We provide 18 machine\-checked Lean4 proofs in the repository \(lean\_proofs/\):FormInvCertificates\.lean\(10 proofs, Lean 4\.29\.1, no Mathlib\) andFormInvMathlib\.lean\(8 proofs, Mathlib v4\.29\.0\)\.
T3 disproofs \(auto\-generated paraphrase errors are formally FALSE\):
1. 1\.F5 biconditional overreach\(Nat\.dvd\_add\): The paraphrase “aadividesm\+nm\{\+\}nexactly when it divides each summand” is false\.f5\_err1\_dvd\_add\_biconditional\_is\_false: counterexamplea=2a\{=\}2,m=1m\{=\}1,n=1n\{=\}1\.
2. 2\.F3 passive\-voice inversion\(Nat\.dvd\_zero\): The paraphrase “0is divided bynn” reads as0∣n0\\mid n, which is false forn≠0n\\neq 0\.f3\_err1\_zero\_divides\_one\_is\_false: counterexamplen=1n\{=\}1\.
T1 certificates \(paraphrase transformations are formally equivalence\-preserving\):
1. 1\.F6 comparison order:a≥b↔b≤aa\\geq b\\leftrightarrow b\\leq aby𝙸𝚏𝚏\.𝚛𝚏𝚕\\mathtt\{Iff\.rfl\}\(definitional\)\. Instances:x≥0↔0≤x\\sqrt\{x\}\\geq 0\\leftrightarrow 0\\leq\\sqrt\{x\}; primep⇒p≥2↔2≤pp\\Rightarrow p\\geq 2\\leftrightarrow 2\\leq p\.
2. 2\.F4 formal notation:nmodn=0↔n%n=0n\\bmod n=0\\leftrightarrow n\\%n=0by definitional equality\. Instances:Nat\.mod\_self,Nat\.zero\_mod\.
Mathlib proofs: F7 \(Nat\.prime\_def\), F8 \(Eq\.symm/add\_comm\), F6/Real\.sqrt \(Real\.sqrt\_nonneg\)\. All 4 T1 families are now machine\-certified\.
#### Theoretical basis of the 6/9 threshold\.
The Condorcet Jury Theorem \(1785\) states: if each ofnnindependent evaluators has probabilityp\>12p\>\\frac\{1\}\{2\}of correctly identifying a bad paraphrase, the probability that the majority vote is correct approaches 1 asn→∞n\\to\\infty\. Forn=9n=9,p=0\.85p=0\.85, the probability that≥6\\geq 6evaluators agree on the correct answer is∑k=69\(9k\)\(0\.85\)k\(0\.15\)9−k≈0\.9990\\sum\_\{k=6\}^\{9\}\\binom\{9\}\{k\}\(0\.85\)^\{k\}\(0\.15\)^\{9\-k\}\\approx 0\.9990\. The Dawid–Skene \(1979\) latent\-class model\(Dawid and Skene,[1979](https://arxiv.org/html/2605.29001#bib.bib45)\)generalizes this to heterogeneous, correlated evaluators via EM inference on competence parameters\. For independent evaluators with individual accuracyaa:P\(truth=v∣all9votev\)=a9/\(a9\+\(1−a\)9\)≈99\.99%P\(\\text\{truth\}=v\\mid\\text\{all \}9\\text\{ vote \}v\)=a^\{9\}/\(a^\{9\}\+\(1\-a\)^\{9\}\)\\approx 99\.99\\%ata=0\.85a=0\.85\. The independence assumption is the key caveat: LLM errors are significantly correlated across models\(Kim and others,[2025](https://arxiv.org/html/2605.29001#bib.bib44)\)\. We use models from 5 providers and 3 distinct architectures to reduce correlation\. The BFT result\(Lamportet al\.,[1982](https://arxiv.org/html/2605.29001#bib.bib43)\)provides a structural guarantee independent of independence:k=6k=6ofn=9n=9toleratesf≤2f\\leq 2Byzantine models \(k≥2f\+1k\\geq 2f\+1\)\. Empirically, on our 11 known\-bad paraphrases, each model’s True Negative Rate \(TNR = P\(correctly reject bad paraphrase\)\) averages≈87\.5%\\approx 87\.5\\%— at this rate,P\(≥6/9models reject\)≈98\.2%P\(\\geq 6/9\\text\{ models reject\}\)\\approx 98\.2\\%by the Binomial\(9,0\.875\)\(9,0\.875\)tail, independent of the correlated\-failure caveat\.
## Appendix GFull Per\-Family Result Tables
Table 10:Per\-family, per\-model failure rates\. Column 1: Claude Sonnet 4\.6 on 103\-theorem dataset \(760 items\)\. Columns 2–3: GPT\-4o, GPT\-4o\-mini on 50\-theorem shared dataset \(366 items each\)\. Bold marks the hardest family per model\.Underline: model\-specific robustness\. Bold: hardest family\.
## Appendix HParaphrase Quality Audit: Flagged Items
A forensic inspection of all cached model responses identified the following items where the paraphrase introduces a claim that differs from the source theorem\. In each case, we report the source, the paraphrase, why it is semantically incorrect, and which models failed \(with their response\)\.
### Category 1: Biconditional Overreach \(F5\)
These paraphrases use “exactly when,” “precisely if,” or “if and only if” to express a one\-directional implication, creating a false biconditional\.
1. 1\.thm\_0008\_nat\_dvd\_add \(F5\): Source:a∣m∧a∣n→a∣\(m\+n\)a\\mid m\\wedge a\\mid n\\to a\\mid\(m\+n\) Paraphrase: “a number divides the sum of two numbers*exactly when*it divides each\.” Error: the converse \(a∣\(m\+n\)→a∣ma\\mid\(m\+n\)\\to a\\mid m\) is false \(e\.g\.,3∣63\\mid 6but3∤43\\nmid 4\)\. Correct answer given by models \(FALSE\): Claude, GPT\-4o\-mini, DeepSeek\.
2. 2\.thm\_0027\_set\_subset\_trans \(F5\): Source:s⊆t∧t⊆u→s⊆us\\subseteq t\\wedge t\\subseteq u\\to s\\subseteq u Paraphrase: “ssis a subset ofuu*precisely when*it is a subset ofttandttis a subset ofuu\.” Error: the “precisely when” impliess⊆us\\subseteq uwould require the existence of such attfor everyuu, not guaranteed\.
### Category 2: Passive\-Voice Semantic Inversion \(F3\)
1. 1\.thm\_0007\_nat\_zero\_dvd \(F3\): Source:Nat\.zero\_dvd:0∣n↔n=00\\mid n\\leftrightarrow n=0 Paraphrase: “a natural number*is divided by zero*iff it is zero\.” Error: “is divided by zero” \(n÷\\div0\) inverts the divisibility relation \(0∣n0\\mid n\)\. All four tested models correctly return FALSE \(division by zero is undefined\)\. Fix: “is a multiple of zero” or “zero divides n\.”
### Category 3: Type\-Context Stripping \(F1, F7\)
1. 1\.thm\_0022\_sq\_abs \(F1\): Source:sq\_absoverLinearOrderedRing Paraphrase: “the square of the absolute value of a number equals the square of the number\.” Error: true overℝ\\mathbb\{R\}but false overℂ\\mathbb\{C\}\(where\|i\|2=1≠i2=−1\\lvert i\\rvert^\{2\}=1\\neq i^\{2\}=\-1\)\. Claude returns FALSE citing this counterexample\.
2. 2\.thm\_0014\_add\_comm \(F7\): Source:AddCommMonoid Paraphrase: “In a structure where addition is associative, has an identity element, and every element has an inverse, isa\+b=b\+aa\+b=b\+a?” Error: describes a*group*\(without commutativity axiom\), which does not guarantee commutativity\. All models correctly return FALSE\. The paraphrase omits the commutativity axiom fromAddCommMonoid\.
### Impact on Reported Metrics
After excluding these 5 confirmed\-bad items \(plus 6 additional borderline items identified in the full audit\), the per\-model IG and failure rates change modestly \(±\\pm0\.5% on mean IG\); the headline findings remain robust\. The primary effect is on F5 failure rates: removing the two biconditional\-overreach items reduces F5 rates by approximately 11 percentage points across all models, from 22% to∼\\sim11%\.
## Appendix IFailure Case Examples
#### Case 1: Active/Passive \(F3\), nat\_dvd\_zero, GPT\-4o
- •“Is it true that every natural number divides zero?” → TRUE \(correct\)
- •“Does zero have every natural number as a divisor?” → FALSE \(WRONG\)
- •“For any natural number n, is it correct that n divides zero?” → TRUE \(correct\)
Root cause: GPT\-4o confuses “n is a divisor of 0” with “0 is a divisor of n”\.
#### Case 2: Formal Notation \(F4\), nat\_mod\_self, Claude Sonnet
- •“any natural number divided by itself leaves a remainder of zero?” → TRUE \(correct\)
- •“Does every natural number have a modulus of zero when divided by itself?” → FALSE \(WRONG\)
- •“For any natural number n, is n modulo n equal to zero?” → FALSE \(WRONG\)
Root cause: “modulus” is lexically ambiguous \(absolute value vs\. remainder\)\.
#### Case 3: Comparison Order \(F6\), real\_sqrt\_nonneg, Claude Sonnet
- •“Is it true that the square root of any real number is always nonnegative?” → TRUE \(correct\)
- •“For every real numberxx, is the square root ofxxgreater than or equal to zero?” → TRUE \(correct\)
- •“For any real numberxx, is it the case that0is less than or equal to the square root ofxx?” → FALSE \(WRONG\)
Root cause: “x≥0\\sqrt\{x\}\\geq 0” and “0≤x0\\leq\\sqrt\{x\}” express the same inequality but Claude applies a directionality heuristic tied to which expression appears as the subject\. Claude Sonnet fails F6 \(comparison order\) at 16\.7%; GPT\-4o fails it at 0\.0% \(Table[5](https://arxiv.org/html/2605.29001#S5.T5)\)\.
## Appendix JPilot Cross\-Benchmark Study
A natural question is whether FormInv IG predicts performance on independent logical reasoning benchmarks, if so, IG would be a practically useful proxy beyond self\-referential measurement\. We evaluated all 9 models on two MMLU\(Hendryckset al\.,[2021a](https://arxiv.org/html/2605.29001#bib.bib8)\)subsets:formal\_logic\(60 items; deductive reasoning target task\) andglobal\_facts\(60 items; factual recall negative control\)\.
Table 11:Cross\-benchmark correlations\.n=9n=9models; partialrrcontrols for FormInv accuracy\. The negative control \(global\_facts\) produces an equally strong correlation as the target task\.Finding: specificity hypothesis falsified\.The negative control \(factual recall\) produces a correlation with IG nearly identical to the target task \(formal logic\):ρformal=0\.625≈ρcontrol=0\.622\\rho\_\{\\text\{formal\}\}=0\.625\\approx\\rho\_\{\\text\{control\}\}=0\.622\. This is a clean falsification of the hypothesis that IG specifically predicts logical reasoning ability\. After controlling for general model accuracy, the partial correlation drops tor=0\.335r=0\.335\(p=0\.38p=0\.38\), not significant atn=9n=9\. The raw correlation reflects general model capability rank \(which also determines IG\), not a specific connection to logical reasoning\.
Interpretation: IG measures a specific construct\.This negative result is informative about what IG measures\. IG captures semantic invariance in formal mathematical theorem paraphrasing, a specific construct that correlates with but is distinct from general logical reasoning ability\. The fact that IG does not add predictive power beyond accuracy on general benchmarks is consistent with Corollary[1](https://arxiv.org/html/2605.29001#Thmtheorem1): Mean\-IG and family\-conditional rankings capture different properties, and IG’s value lies in the per\-family diagnostic structure, not a summary correlation with other tasks\.
The planned full cross\-benchmark evaluation on FOLIO\(Hanet al\.,[2022](https://arxiv.org/html/2605.29001#bib.bib19)\)at Track B scale \(n≥20n\\geq 20models, 500\-theorem FormInv dataset\) will determine whether IG predicts performance on tasks specifically requiring biconditional and definitional reasoning, the families where FormInv failures are most concentrated\.
## Appendix KTrack B Full Findings A–D
We evaluated all 9 FormInv models on 100 automatically\-extracted ntp\-mathlib theorems\(Huet al\.,[2024](https://arxiv.org/html/2605.29001#bib.bib17)\)\(811 items, SHA2564c478c40\)\. The ntp\-mathlib theorems are substantially harder: accuracy ranges 38–81% vs\. 86–96% on curated\.
Finding A: The correlation that appears to hold is regime\-specific\.On ntp\-mathlib,r\(acc,IG\)=−0\.415r\(\\text\{acc\},\\text\{IG\}\)=\-0\.415\(p=0\.27p=0\.27,n=9n=9\), not significant\. This is the paper’s most important structural finding, analogous to howThomas \([2026b](https://arxiv.org/html/2605.29001#bib.bib36)\)shows that algorithm rankings apparent in one BO budget regime dissolve in another:*the correlation between accuracy and semantic invariance \(r = \-0\.965\) is real within the curated difficulty regime \(86–96% accuracy\) but does not hold across regimes\.*Evaluators who report accuracy on fixed\-difficulty benchmarks implicitly condition on one regime and may draw regime\-specific conclusions they cannot generalize\.
Finding B: F5 connective\-variation specificity is difficulty\-regime\-dependent\.On curated theorems, F5 \(32% failure\) stands uniquely above all other families \(4–9%\)\. On ntp\-mathlib,*all 8 families cluster at 28–35% failure*, the F5 specialness disappears when baseline difficulty is high\. Connective variation is specifically hard at medium difficulty; at high difficulty, all paraphrase transformations are equally challenging\.
Finding C: IG scales with theorem difficulty\.Average Mean\-IG rose from 10\.1% \(curated\) to 22\.2% \(ntp\-mathlib\), a 12\-percentage\-point increase when theorems are harder\. All 9 models have non\-trivial IG on ntp\-mathlib \(≥17%\\geq 17\\%\), confirming that semantic invariance failures are not a frontier\-model\-specific artifact but become more severe as difficulty increases\.
Finding D: Chain\-of\-thought reasoning reduces IG on hard theorems\.On curated theorems \(medium difficulty, 86–96% accuracy\), we found no systematic effect of reasoning mode on IG \(§[6\.1](https://arxiv.org/html/2605.29001#S6.SS1)\)\. On ntp\-mathlib \(hard, 38–81% accuracy\), the two reasoning models are the*most invariant*: DeepSeek R1 \(IG = 17\.3%\) and o4\-mini \(IG = 17\.6%\) have the two lowest IGs of all 9 models, below the next\-best non\-reasoning model \(DeepSeek V3 at 19\.9%\)\. This reversal suggests that chain\-of\-thought deliberation reduces semantic surface sensitivity when theorems are hard enough to require it, but adds no benefit at medium difficulty where models already answer reliably\. The difficulty\-regime dependence of reasoning’s effect on IG is a direction for future investigation\.
Scale extension \(Track B roadmap\)\.For future work, we also identify ProofNet\(Azerbayevet al\.,[2023](https://arxiv.org/html/2605.29001#bib.bib11)\)\(371 formal theorem statements with informal NL pairs\) as the systematic scale source\.
## Appendix LSample Adequacy: Full Analysis
FormInv usesK=7K=7–88paraphrases per theorem\. By Hoeffding’s inequality,\|p^t−pt\|≤log\(2/δ\)/\(2K\)\\lvert\\hat\{p\}\_\{t\}\-p\_\{t\}\\rvert\\leq\\sqrt\{\\log\(2/\\delta\)/\(2K\)\}with probability≥1−δ\\geq 1\-\\delta\. ForK=7K=7,δ=0\.05\\delta=0\.05: per\-theorem uncertainty≤±0\.51\\leq\\pm 0\.51\(large\)\. However, the mean uncertainty on*aggregate*Mean\-IG isO\(1/Kn\)O\(1/\\sqrt\{Kn\}\): withn=103n=103and the observed bimodal distribution \(79% of theorems contribute zero variance atIGt=0\\mathrm\{IG\}\_\{t\}=0, 21% contribute bounded variance atIGt≈0\.49\\mathrm\{IG\}\_\{t\}\\approx 0\.49\), the empirical standard error of Mean\-IG is≈±0\.001\\approx\\pm 0\.001\. The wide per\-theorem bound is pessimistic;K=7K=7is adequate for \(1\) detecting the bimodal structure \(IG=0\\mathrm\{IG\}=0vs\.IG≈0\.47\\mathrm\{IG\}\\approx 0\.47, separated by far more than the estimation uncertainty\) and \(2\) estimating the aggregate Mean\-IG reliably acrossn≥50n\\geq 50theorems, while insufficient for precise per\-theorem IG ranking \(requiringK≥46K\\geq 46for±0\.2\\pm 0\.2accuracy\)\.
## Appendix MIG Estimate Stability: Full Bootstrap Analysis
A practical concern for any proposed metric is whether a small evaluation set \(103 theorems\) gives stable estimates\. We repeatedly subsamplenntheorems \(fromn=10n=10ton=103n=103\) and compute Mean\-IG, reporting the 95% bootstrap confidence interval at each size\. The CI atN=50N=50is\(0\.042,0\.140\)\(0\.042,0\.140\)\(both bounds above zero\), narrowing to\(0\.056,0\.124\)\(0\.056,0\.124\)atN=103N=103\. Mean\-IG is identifiable with 50 theorems; 103 gives useful precision for distinguishing models\.
Figure 2:Bootstrap confidence intervals for Mean\-IG as a function of sample sizeNN\(subsampled from 103 theorems, 1000 bootstrap iterations\)\. The 95% CI atN=50N=50is\(0\.042,0\.140\)\(0\.042,0\.140\); atN=103N=103it narrows to\(0\.056,0\.124\)\(0\.056,0\.124\)\.Figure 3:Per\-theorem IG distribution across all 9 models and 103 theorems\. The distribution is strongly bimodal: 79% of theorems have IG=0=0\(consistent responses across all paraphrase families\) and 21% have IG≈0\.42\\approx 0\.42\(fail on 1–3 families\)\.Similar Articles
Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
This paper introduces Formal Conjectures, an evolving benchmark of 2615 mathematical statements formalized in Lean 4, including open research conjectures for proof discovery and solved problems for auto-formalization, designed to evaluate automated reasoning systems with zero contamination.
Mathematical Reasoning in Large Language Models: Benchmarks, Architectures, Evaluation, and Open Challenges
This survey synthesizes recent advancements in mathematical reasoning with large language models, covering benchmarks, architectures, training strategies, and evaluation protocols. It identifies key challenges such as reasoning faithfulness and benchmark biases.
Revisiting a Pain in the Neck: A Semantic Reasoning Benchmark for Language Models
Researchers present SemanticQA, a benchmark for evaluating language models on semantic phrase processing tasks including idioms, noun compounds, and verbal constructions, revealing significant performance variation across model architectures and scales on semantic reasoning tasks.
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet is a large-scale multilingual multimodal benchmark of 30,676 Olympiad-level math problems spanning 47 countries and 17 languages, designed to evaluate mathematical reasoning and retrieval in generative and embedding-based models. Even state-of-the-art models like Gemini and GPT-5 struggle with the benchmark, highlighting significant room for improvement in mathematical AI.
A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation
Introduces A2RBench, an automated pipeline for generating formally verifiable abstract reasoning benchmarks for LLMs, using cycle consistency to ensure unique solutions, and reveals that current LLMs underperform humans significantly on 3D reasoning tasks.