Measuring Representation Robustness in Large Language Models for Geometry

arXiv cs.CL Papers

Summary

Researchers introduce GeoRepEval, a framework to evaluate LLM robustness across equivalent geometric problem representations (Euclidean, coordinate, vector). Testing 11 LLMs on 158 geometry problems, they find accuracy gaps up to 14 percentage points based solely on representation choice, with vector formulations being a consistent failure point.


# Measuring Representation Robustness in Large Language Models for Geometry
Source: [https://arxiv.org/html/2604.16421](https://arxiv.org/html/2604.16421)
Vedant Jawandhia, Yash Sinha, Dhruv Kumar (Department of Computer Science and Information Systems, BITS Pilani; {f20220627, dhruv.kumar, yash.sinha}@pilani.bits-pilani.ac.in) and Ankan Pal (Department of Mathematics, BITS Pilani; ankan.pal@pilani.bits-pilani.ac.in)

###### Abstract

Large language models (LLMs) are increasingly evaluated on mathematical reasoning, yet their robustness to equivalent problem representations remains poorly understood. In geometry, identical problems can be expressed in Euclidean, coordinate, or vector forms, but existing benchmarks report accuracy on fixed formats, implicitly assuming representation invariance and masking failures caused by representational changes alone. We propose GeoRepEval, a representation-aware evaluation framework that measures correctness, invariance, and consistency at the problem level across parallel formulations, combining strict answer matching, bootstrap confidence intervals, paired McNemar tests, representation-flip analyses, and regression controls for surface complexity. We prove that our Invariance@3 metric decomposes accuracy into robust and fragile components and is bounded by the weakest representation. Evaluating eleven LLMs on 158 curated high-school geometry problems (474 instances), we find accuracy gaps of up to 14 percentage points induced solely by representation choice. Vector formulations emerge as a consistent failure point, with Invariance@3 as low as 0.044 even after controlling for length and symbolic complexity. A convert-then-solve prompting intervention improves vector accuracy by up to 52 percentage points for high-capacity models, suggesting that failures reflect representation sensitivity rather than inability; however, low-capacity models show no gains, indicating deeper limitations. These results suggest that current models rely on representation-specific heuristics rather than abstract geometric reasoning. All datasets, prompts, and scripts are released at [https://github.com/vedjaw/GeoRepEval](https://github.com/vedjaw/GeoRepEval).

## 1 Introduction

Large language models (LLMs) have demonstrated strong capabilities across reasoning-intensive tasks [[3](https://arxiv.org/html/2604.16421#bib.bib3), [22](https://arxiv.org/html/2604.16421#bib.bib22)], including mathematical problem solving [[7](https://arxiv.org/html/2604.16421#bib.bib7), [12](https://arxiv.org/html/2604.16421#bib.bib12)], symbolic manipulation [[16](https://arxiv.org/html/2604.16421#bib.bib16)], and logical inference [[36](https://arxiv.org/html/2604.16421#bib.bib36)]. Recent advances in scale, instruction tuning, and chain-of-thought prompting have led to substantial gains on benchmarks spanning arithmetic, algebra, and geometry [[3](https://arxiv.org/html/2604.16421#bib.bib3), [36](https://arxiv.org/html/2604.16421#bib.bib36), [6](https://arxiv.org/html/2604.16421#bib.bib6), [22](https://arxiv.org/html/2604.16421#bib.bib22)]. However, growing evidence—sensitivity to prompt phrasing [[40](https://arxiv.org/html/2604.16421#bib.bib40)], adversarial perturbations [[11](https://arxiv.org/html/2604.16421#bib.bib11)], and surface-level rewording [[27](https://arxiv.org/html/2604.16421#bib.bib27)]—suggests LLM performance can be brittle under representational changes [[32](https://arxiv.org/html/2604.16421#bib.bib32)]. Geometry provides a uniquely structured testbed: the same problem can be expressed via Euclidean, coordinate, or vector representations without altering its semantic content.

#### Research questions.

This paper investigates whether modern LLMs exhibit representation-invariant geometric reasoning by addressing three questions:

1. RQ1: Does performance vary systematically with representation choice, and can gaps be explained by surface features?
2. RQ2: Do models produce consistent answers when the same problem is posed in different representations?
3. RQ3: Can explicit conversion-based prompting improve cross-representation consistency?

We evaluate whether models produce correct and consistent answers when the same geometry problem is expressed in Euclidean, coordinate, and vector forms—representations that differ in symbolic density and linguistic structure but encode identical mathematical constraints [[16](https://arxiv.org/html/2604.16421#bib.bib16)].

Prior work focuses on aggregate accuracy over fixed-format benchmarks [[7](https://arxiv.org/html/2604.16421#bib.bib7), [12](https://arxiv.org/html/2604.16421#bib.bib12), [9](https://arxiv.org/html/2604.16421#bib.bib9)], treating representation as incidental. Studies exploring paraphrasing robustness operate within single representational regimes [[40](https://arxiv.org/html/2604.16421#bib.bib40)], and geometry benchmarks rarely provide parallel formulations of the same problem [[34](https://arxiv.org/html/2604.16421#bib.bib34)]. Current evaluations thus cannot determine whether LLMs internally normalize problems or rely on representation-specific heuristics.

We propose GeoRepEval, a representation-aware evaluation framework that constructs parallel Euclidean, coordinate, and vector versions of each problem and evaluates at the *problem level*—comparing answers across all three variants. GeoRepEval introduces Invariance@3 (fraction of problems correct in all three representations) and Consistency@3 (fraction of problems with identical output), alongside McNemar tests, representation-flip analysis, and regression controls for surface features (token length, symbolic density).

We evaluate eleven LLMs spanning open-weight (7B–12B) and proprietary architectures on 158 curated geometry problems (474 instances, 158 per representation). Euclidean–Vector accuracy gaps reach up to 14 pp (Claude-Haiku: 0.60 Euclid vs. 0.46 Vector), and Invariance@3—the fraction of problems solved correctly across all three representations—remains below 0.47 for all models except GPT-OSS-20B (0.665). At the low end, LLaMA-3.1-8B achieves only 0.044 Invariance@3, meaning fewer than 5% of problems are solved correctly in all three forms. These gaps persist after regression controls for surface complexity ($p<0.05$ for 8/11 models), and item-level analysis reveals vector formulations as the dominant single point of failure: the CCW pattern (correct on Euclidean and Coordinate, wrong on Vector) is 2–5× more frequent than WCC (Euclidean-only failure), indicating that models apply Euclidean reasoning “templates” fluently but produce brittle, error-prone vector procedures.

#### Contributions.

- GeoRepEval framework: A statistically grounded, representation-aware evaluation framework with formally justified metrics (Properties 1–3) linking Invariance@3 and Consistency@3 to standard accuracy.
- Parallel benchmark: 158 core geometry problems, each in Euclidean, coordinate, and vector form (474 instances in total), drawn from standard textbooks, with verified semantic equivalence.
- Empirical & explanatory analysis: Accuracy gaps up to 14 pp and Invariance@3 as low as 0.044; a qualitative analysis (§[5.7](https://arxiv.org/html/2604.16421#S5.SS7)) proposing procedural chain length, possible training distribution effects, and algebraic error compounding as plausible explanations for vector fragility.
- Prompting intervention: A convert-then-solve strategy that dramatically improves vector accuracy for mid/high-capacity models (up to 52 pp), suggesting failures are representation-specific; low-capacity models show no benefit, indicating deeper limitations.

## 2 Related Work

### 2.1 Mathematical Reasoning and Geometry with Neural Models

Prior studies have evaluated LLM mathematical reasoning on benchmarks such as GSM8K and MATH [[7](https://arxiv.org/html/2604.16421#bib.bib7), [12](https://arxiv.org/html/2604.16421#bib.bib12), [9](https://arxiv.org/html/2604.16421#bib.bib9), [16](https://arxiv.org/html/2604.16421#bib.bib16), [24](https://arxiv.org/html/2604.16421#bib.bib24)], with complementary surveys assessing model capabilities more broadly [[25](https://arxiv.org/html/2604.16421#bib.bib25), [13](https://arxiv.org/html/2604.16421#bib.bib13)]. Geometry-focused work has explored diagram understanding, symbolic reasoning, and coordinate-based formulations [[29](https://arxiv.org/html/2604.16421#bib.bib29), [34](https://arxiv.org/html/2604.16421#bib.bib34)]. However, these evaluations typically assess each problem in a single canonical form, leaving it unclear whether success reflects abstract reasoning or sensitivity to representational encoding. GeoRepEval departs from this paradigm by evaluating the *same* problem across Euclidean, coordinate, and vector formulations.

### 2.2 Evaluation Frameworks for LLM Robustness

Benchmarks such as BIG-bench [[32](https://arxiv.org/html/2604.16421#bib.bib32)], HELM [[17](https://arxiv.org/html/2604.16421#bib.bib17)], and Beyond-the-IID [[4](https://arxiv.org/html/2604.16421#bib.bib4)] stress-test models across diverse conditions. Work on representation shift [[38](https://arxiv.org/html/2604.16421#bib.bib38)], paraphrasing [[27](https://arxiv.org/html/2604.16421#bib.bib27), [14](https://arxiv.org/html/2604.16421#bib.bib14)], and adversarial perturbations [[19](https://arxiv.org/html/2604.16421#bib.bib19)] has revealed sensitivity to superficial changes despite semantic equivalence. In mathematics, however, most benchmarks remain instance-level and lack aligned variants of the same problem, limiting invariance assessment. GeoRepEval fills this gap with problem-level, representation-aligned evaluation and paired statistical testing.

### 2.3 Prompting and Reasoning Strategies

Chain-of-thought prompting [[36](https://arxiv.org/html/2604.16421#bib.bib36)], self-consistency [[35](https://arxiv.org/html/2604.16421#bib.bib35)], tree-of-thought [[37](https://arxiv.org/html/2604.16421#bib.bib37)], and tool-augmented methods [[28](https://arxiv.org/html/2604.16421#bib.bib28)] have improved reasoning performance, with zero-shot and least-to-most strategies [[15](https://arxiv.org/html/2604.16421#bib.bib15), [39](https://arxiv.org/html/2604.16421#bib.bib39)] unlocking latent ability without exemplars. However, these studies implicitly assume a fixed representation and focus on *how* models reason rather than *what form* reasoning is conditioned on. Our work isolates representation as a controlled variable, testing whether LLM reasoning is representation-invariant.

## 3 Methodology

GeoRepEval probes whether LLMs exhibit invariant reasoning across equivalent geometric representations via four stages: dataset construction, controlled inference, structured scoring, and statistical analysis (Figure [1](https://arxiv.org/html/2604.16421#S3.F1)). Let $\mathcal{M}$ denote a language model. Each problem $p_i$ has gold answer $a_i^*$ and is expressed in representations $r\in\{\textsc{euc},\textsc{coord},\textsc{vec}\}$, yielding prompts $x_i^r$. The predicted answer $\hat{a}_i^r=\text{parse}(\mathcal{M}(x_i^r))$ is *correct* if it matches $a_i^*$ after normalization (§[3.2](https://arxiv.org/html/2604.16421#S3.SS2)). We capture this using a binary correctness indicator $c_i^r\in\{0,1\}$ that flags whether the model successfully solved problem $p_i$ in representation $r$, defined formally as $c_i^r=\mathbf{1}[\text{normalize}(\hat{a}_i^r)=\text{normalize}(a_i^*)]$.


Figure 1: GeoRepEval pipeline overview. The framework natively constructs and tracks mathematically equivalent variants (Euclidean, Coordinate, and Vector) through parallel LLM inference to isolate true reasoning capacity from representation sensitivity.

### 3.1 Dataset Construction

The construction pipeline proceeds in four stages:

#### Stage 1: Source Selection.

Problems are drawn from three widely used Indian high-school and junior-college mathematics textbooks—NCERT (Classes IX–XII) [[21](https://arxiv.org/html/2604.16421#bib.bib21)], RD Sharma [[30](https://arxiv.org/html/2604.16421#bib.bib30), [31](https://arxiv.org/html/2604.16421#bib.bib31)], and RS Aggarwal [[1](https://arxiv.org/html/2604.16421#bib.bib1), [2](https://arxiv.org/html/2604.16421#bib.bib2)]—chosen for their emphasis on formal symbolic reasoning and well-defined geometric constraints.

#### Stage 2: Categorisation.

Each problem is assigned to one of four categories: *length/distance*, *area/volume*, *ratio/proportion*, and *angle/direction*, with approximately balanced distribution.

#### Stage 3: Parallel Generation.

Every core problem is instantiated into three equivalent forms: Euclidean (verbal/relational), Coordinate (Cartesian plane), and Vector (dot/cross-product operations). Each variant preserves identical constraints and solution paths. Variants are generated with Gemini Pro [[10](https://arxiv.org/html/2604.16421#bib.bib10)] and manually verified by a panel of mathematical experts.

#### Stage 4: Filtering.

Problems with ambiguous phrasing, missing parameters, or non-exact answers are removed. After filtering, the three sources together contribute 158 core problems (474 instances in total).

### 3.2 Inference, Scoring, and Aggregation

Each model is evaluated on all three representations under identical conditions, producing answers in structured JSON (Appendix [A](https://arxiv.org/html/2604.16421#A1)). GeoRepEval also includes a *convert-then-solve* variant where models first rewrite coordinate/vector problems into Euclidean form. Predicted answers are compared against gold solutions using strict normalization (fractions, roots, $\pi$-expressions to canonical form):

$$c_i^r=\mathbf{1}\bigl[\text{normalize}(\hat{a}_i^r)=\text{normalize}(a_i^*)\bigr].\tag{1}$$

This avoids numeric tolerance issues while accepting equivalent forms (e.g., $\tfrac{3}{2}$ and $1.5$).
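This strict-matching rule can be sketched with Python's `fractions` module, which gives exact rational comparison without float tolerance; the function names here are illustrative, not taken from the released scripts:

```python
from fractions import Fraction

def normalize(ans: str):
    """Reduce a numeric answer string to canonical form so that
    equivalent answers such as '3/2', '6/4', and '1.5' compare equal."""
    s = ans.strip()
    try:
        return Fraction(s)  # Fraction('3/2'), Fraction('6/4'), Fraction('1.5')
    except ValueError:
        return s            # symbolic answers fall back to exact string match

def is_correct(pred: str, gold: str) -> int:
    """Binary correctness indicator c_i^r from Eq. 1."""
    return int(normalize(pred) == normalize(gold))
```

Because `Fraction` is exact, unsimplified fractions and decimal forms land on the same canonical value, while unparseable symbolic answers degrade gracefully to string comparison.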

Results are aggregated across representations and problems. Beyond standard per-representation accuracy, GeoRepEval computes two higher-order metrics. Invariance@3 measures the fraction of problems solved correctly in *all three* representations:

$$\text{Invariance@3}=\frac{1}{N}\sum_{i=1}^{N}c_i^{\textsc{euc}}\cdot c_i^{\textsc{coord}}\cdot c_i^{\textsc{vec}},\tag{2}$$

where $N$ is the number of problems. Consistency@3 measures the fraction of problems where the model produces the *same* output (regardless of correctness) across all three representations:

$$\text{Consistency@3}=\frac{1}{N}\sum_{i=1}^{N}\mathbf{1}\bigl[\hat{a}_i^{\textsc{euc}}=\hat{a}_i^{\textsc{coord}}=\hat{a}_i^{\textsc{vec}}\bigr].\tag{3}$$

GeoRepEval also classifies each problem by its cross-representation correctness pattern (e.g., CCC, CCW, WWW), yielding eight categories for structured failure analysis.
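Equations 2–3 and the flip-pattern labels follow directly from the correctness indicators; a toy sketch (the released scripts may organise this differently):

```python
REPS = ("euc", "coord", "vec")

def invariance_at_3(c):
    """Eq. 2: fraction of problems correct in all three representations.
    `c` maps each representation to a list of 0/1 correctness indicators."""
    n = len(c["euc"])
    return sum(all(c[r][i] for r in REPS) for i in range(n)) / n

def consistency_at_3(answers):
    """Eq. 3: fraction of problems with identical output across
    representations, regardless of correctness."""
    n = len(answers["euc"])
    return sum(answers["euc"][i] == answers["coord"][i] == answers["vec"][i]
               for i in range(n)) / n

def flip_pattern(c, i):
    """Cross-representation pattern for problem i, e.g. 'CCW' =
    correct on Euclidean and Coordinate, wrong on Vector."""
    return "".join("C" if c[r][i] else "W" for r in REPS)
```

The eight flip patterns are exactly the strings `flip_pattern` can emit (CCC through WWW), which is what the structured failure analysis tabulates.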

### 3.3 Formal Properties of Evaluation Metrics

We establish three structural properties that justify our two-metric design and clarify the relationship between accuracy, invariance, and consistency. Throughout, let $R=\{\textsc{euc},\textsc{coord},\textsc{vec}\}$ denote the set of representations, and recall that $c_i^r\in\{0,1\}$ is the correctness indicator for problem $i$ under representation $r$. A model is *perfectly representation-invariant* if $\hat{a}_i^r=\hat{a}_i^{r'}$ for all problems $i$ and all representation pairs $(r,r')$.

#### Property 1 (Accuracy Decomposition).

For any representation $r$, the per-representation accuracy decomposes as:

$$\text{Acc}^r=\text{Invariance@3}+\underbrace{\frac{1}{N}\sum_{i=1}^{N}c_i^r\Bigl(1-\prod_{r'\neq r}c_i^{r'}\Bigr)}_{\text{fragile correctness }F^r}.\tag{4}$$

*Derivation.* By definition, $\text{Acc}^r=\frac{1}{N}\sum_i c_i^r$. We partition each $c_i^r$ according to whether the model also succeeds on every other representation: $c_i^r=c_i^r\prod_{r'\neq r}c_i^{r'}+c_i^r\bigl(1-\prod_{r'\neq r}c_i^{r'}\bigr)$. Summing over $i$ and dividing by $N$, the first term equals Invariance@3 and the second equals $F^r$. The fragile-correctness term $F^r$ captures problems solved in representation $r$ but not in all three, indicating reliance on representation-specific cues rather than robust reasoning.

#### Property 2 (Upper Bound).

Invariance@3 is bounded above by the weakest per-representation accuracy:

$$\text{Invariance@3}\leq\min_{r\in R}\text{Acc}^r.\tag{5}$$

*Derivation.* Because $c_i^r\in\{0,1\}$, the product $\prod_{r\in R}c_i^r\leq c_i^r$ for every $r$. Summing over problems, $\text{Invariance@3}=\frac{1}{N}\sum_i\prod_r c_i^r\leq\frac{1}{N}\sum_i c_i^r=\text{Acc}^r$. Since this holds for every $r$, it holds for the minimum. This formalises the *weakest-link* effect: a model cannot achieve high invariance if it performs poorly on even one representation.

#### Property 3 (Consistency–Invariance Gap).

Consistency@3 is always at least as large as Invariance@3:

$$\text{Consistency@3}\geq\text{Invariance@3}.\tag{6}$$

*Derivation.* Let $S_i=\mathbf{1}[\hat{a}_i^{\textsc{euc}}=\hat{a}_i^{\textsc{coord}}=\hat{a}_i^{\textsc{vec}}]$ (same output) and $I_i=\prod_r c_i^r$ (all correct). If $I_i=1$, then all three outputs match the gold answer, so they also match each other; hence $I_i\leq S_i$ for every $i$, and averaging gives the result. The gap $\text{Consistency@3}-\text{Invariance@3}$ measures the fraction of problems where the model produces the same *wrong* answer across all three representations, indicating systematic errors orthogonal to representation sensitivity.
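Properties 1–3 are easy to verify numerically on a randomly generated correctness table; this is an illustrative sanity check, not part of the paper's pipeline:

```python
import random

random.seed(0)
reps = ("euc", "coord", "vec")
N = 200

# Random correctness table c[r][i] in {0, 1}, plus answers that agree
# with a gold answer exactly when the indicator is 1.
c = {r: [random.randint(0, 1) for _ in range(N)] for r in reps}
gold = [str(i) for i in range(N)]
ans = {r: [gold[i] if c[r][i] else f"wrong-{r}-{i}" for i in range(N)]
       for r in reps}

acc = {r: sum(c[r]) / N for r in reps}
inv3 = sum(all(c[r][i] for r in reps) for i in range(N)) / N
cons3 = sum(ans["euc"][i] == ans["coord"][i] == ans["vec"][i]
            for i in range(N)) / N

# Property 1: Acc^r = Invariance@3 + fragile-correctness term F^r.
for r in reps:
    fragile = sum(c[r][i] * (1 - all(c[q][i] for q in reps if q != r))
                  for i in range(N)) / N
    assert abs(acc[r] - (inv3 + fragile)) < 1e-12

# Property 2: Invariance@3 is bounded by the weakest representation.
assert inv3 <= min(acc.values())

# Property 3: Consistency@3 >= Invariance@3.
assert cons3 >= inv3
```

All three assertions hold for any 0/1 table, mirroring the derivations above.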

GeoRepEval prioritises paired, problem-level evaluation with semantically equivalent variants. These structural properties ensure the metrics are principled, and McNemar tests with bootstrap CIs (§[5.5](https://arxiv.org/html/2604.16421#S5.SS5)) confirm that observed differences are statistically systematic.

## 4 Experimental Setup

### 4.1 Datasets

GeoRepEval was evaluated on a custom-curated geometry benchmark constructed from standard Indian high school and junior college mathematics textbooks, including NCERT [[21](https://arxiv.org/html/2604.16421#bib.bib21)], RD Sharma [[30](https://arxiv.org/html/2604.16421#bib.bib30), [31](https://arxiv.org/html/2604.16421#bib.bib31)], and RS Aggarwal [[1](https://arxiv.org/html/2604.16421#bib.bib1), [2](https://arxiv.org/html/2604.16421#bib.bib2)]. These sources emphasize formal symbolic reasoning and precise geometric constraints, making them well suited for robustness evaluation.

Each core problem was instantiated into three mathematically equivalent representations—Euclidean, coordinate, and vector geometry—while preserving identical constraints, quantities, and solution targets. Additional problems and parallel variants were generated and refined using the Gemini Pro model. All variants were then manually verified by a panel of mathematical experts to ensure correctness, semantic parallelism, and answer uniqueness. Problems with ambiguous phrasing or multiple valid interpretations were revised or removed. All retained problems admit exact numeric or symbolic answers.

Table 1: Dataset summary. 158 core problems drawn from three textbook sources, each instantiated in three representations (474 total instances). All problems are at the high-school level.

Using multiple textbook sources reduces source-specific bias, so that performance differences can be attributed to representation effects.

### 4.2 Models and Evaluation Setup

Table 2: Large language models evaluated in this study. All models were accessed via the OpenRouter API ([https://openrouter.ai](https://openrouter.ai/)) in a zero-shot setting without any fine-tuning. † Gemma-2-9B produced valid responses for 110 of 158 problems due to output formatting failures; only valid responses are included in its evaluation.

Table [2](https://arxiv.org/html/2604.16421#S4.T2) lists the evaluated models, spanning open-weight architectures (LLaMA [[33](https://arxiv.org/html/2604.16421#bib.bib33)], Gemma, Qwen, Mistral) and proprietary systems (GPT [[23](https://arxiv.org/html/2604.16421#bib.bib23)], Gemini [[10](https://arxiv.org/html/2604.16421#bib.bib10)], Claude).

Table 3: Evaluation metrics. See §[3.2](https://arxiv.org/html/2604.16421#S3.SS2) and §[3.3](https://arxiv.org/html/2604.16421#S3.SS3) for formal definitions and properties.

All models were queried via the OpenRouter API using deterministic decoding (temperature = 0.0, top-p = 1.0), zero-shot, with each prompt submitted once per problem–representation pair. Evaluation is based solely on the extracted numeric answer from structured JSON output (Appendix [A](https://arxiv.org/html/2604.16421#A1)). Non-conforming responses are treated as incorrect. Bootstrap resampling ($B=10{,}000$) and McNemar tests use fixed seeds for reproducibility.
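A seeded percentile-bootstrap interval for a paired accuracy gap might look like the following sketch; the function name and exact resampling details are illustrative assumptions, not the released scripts:

```python
import random

def bootstrap_gap_ci(c_a, c_b, B=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the paired accuracy gap between two
    representations, resampling problems with replacement (paired, so
    the same index is drawn for both correctness vectors)."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(c_a)
    gaps = []
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(c_a[i] - c_b[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * B)]
    hi = gaps[int((1 - alpha / 2) * B) - 1]
    return lo, hi
```

Resampling problem indices (rather than the two vectors independently) preserves the pairing that the McNemar tests also rely on.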

#### Manual verification of scoring.

Although initial scoring used automated exact-match comparison, a systematic manual review was required to ensure fairness. Three recurring issues motivated this step: (i) *format mismatches*, where a model returned a mathematically equivalent but textually distinct answer (e.g., "1/2" vs. the stored gold "0.5", or "3.162" vs. "sqrt(10)"); (ii) *unparseable symbolic forms*, such as arccos(1/sqrt(3)) or arccos(sqrt(2/3)), which fell outside the coverage of our numeric-extraction regex and could not be compared programmatically; and (iii) *non-canonical expressions*, including unsimplified fractions ("6/4" vs. "3/2"), reordered terms ("4*sqrt(3)" vs. "sqrt(48)"), or extraneous text embedded in the JSON field. Each such case was adjudicated by a human annotator across all eleven models. In total, manual review covered approximately 15–20% of the 5,214 evaluation instances (474 instances × 11 models), ensuring that reported metrics reflect genuine reasoning failures rather than answer-format artifacts.

## 5 Results and Discussion

### 5.1 Experiment 1 — Accuracy by Representation

Figure [2](https://arxiv.org/html/2604.16421#S5.F2) shows that accuracy varies substantially across representations for all models. Euclidean formulations consistently yield the highest accuracy and vector formulations the lowest, with coordinate geometry in between. These gaps persist across both open-weight and proprietary models, demonstrating that scale alone does not eliminate representation-induced degradation.

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/accuracy_by_representation.png)

Figure 2: Accuracy by geometry representation across models. Each group shows performance under Euclidean, Coordinate, and Vector formulations of the same problems.

### 5.2 Experiment 2 — Invariance and Robustness

Table [4](https://arxiv.org/html/2604.16421#S5.T4) reports representation-wise accuracy, Invariance@3, and Consistency@3. Euclidean representations achieve the highest accuracy across nearly all models. Accuracy gaps reach up to 14 percentage points (e.g., Claude-Haiku: 0.60 Euclid vs. 0.46 Vector). Invariance@3 remains low for most models—LLaMA-3.1-8B achieves only 0.044—revealing that correct solutions across all three representations are rare. Models with higher overall accuracy tend to exhibit higher consistency, suggesting robustness and correctness are tightly coupled.

Table 4: Representation-wise accuracy, invariance, and consistency metrics across evaluated language models. Accuracy Gap = best − worst representation accuracy. Inv@3 = Invariance@3 (correct across all three). Cons@3 = Consistency@3 (identical output across all three). Inconsistency = 1 − Cons@3.

### 5.3 Experiment 3 — Representation-Flip Patterns

Item-level analysis (Figure [3](https://arxiv.org/html/2604.16421#A2.F3), Appendix [B](https://arxiv.org/html/2604.16421#A2)) reveals that WWW (all wrong) constitutes a large fraction, reflecting intrinsic difficulty. Among partial cases, CCW and CWC dominate—vector and coordinate are frequent single failure points, while WCC (Euclidean-only failure) is rare.

### 5.4 Experiment 4 — Pairwise Transfer

Vector is the dominant single-failure point across nearly all models, with failure counts 2–5× higher than Euclidean (Table [5](https://arxiv.org/html/2604.16421#A2.T5), Appendix [B](https://arxiv.org/html/2604.16421#A2)). Transfer is strongest for Euclidean–Coordinate when Vector fails (EC|V up to 0.23), while CV|E and EV|C remain below 0.06. EC coherence consistently exceeds CV coherence, indicating more aligned reasoning pathways between Euclidean and Coordinate representations.

### 5.5 Experiment 5 — Statistical Reliability

Bootstrap confidence intervals confirm that representation-induced gaps are statistically stable for mid- and high-capacity models, with non-overlapping intervals for Euclidean–Vector comparisons (Claude-Haiku, Gemini-2.5, GPT-5.2, GPT-OSS-20B). McNemar’s paired tests [[18](https://arxiv.org/html/2604.16421#bib.bib18)] confirm significant effects for Euclidean–Vector and Coordinate–Vector comparisons ($p<0.05$), while Euclidean–Coordinate differences are less consistently significant, reinforcing Vector as the primary divergence source.
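The exact (binomial) form of McNemar's test over paired correctness vectors fits in a few lines; this is a generic sketch of the standard test, not the authors' code:

```python
from math import comb

def mcnemar_exact(c_a, c_b):
    """Exact McNemar test on paired 0/1 correctness vectors.
    Only discordant pairs matter; the two-sided p-value is computed
    under a Binomial(n, 0.5) null on the discordant counts."""
    b = sum(1 for x, y in zip(c_a, c_b) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(c_a, c_b) if x == 0 and y == 1)
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to test
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

Because concordant pairs (both correct or both wrong) carry no information about which representation is harder, the test is driven entirely by the flip counts, which is why the representation-flip analysis and the significance tests agree.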

### 5.6 Qualitative Failure Analysis

We examine representative failure cases. A triangle-area problem solved correctly in Euclidean form ($\frac{1}{2}\times 5\times 12=30$) frequently produces algebraic errors in vector form ($\frac{1}{2}\|\vec{AB}\times\vec{AC}\|$), with models misidentifying vectors or miscomputing cross products. Similarly, distance problems solved via the coordinate formula fail when reformulated as $\|\vec{v}\|$, with models misapplying norms. These patterns suggest stronger associations with Euclidean/coordinate “templates” from training, while vector reasoning relies on fragile procedural chains.

### 5.7 Analysis: Why Vector Representations Are Harder

We propose four plausible explanations for the consistent underperformance on vector formulations. These are offered as conjectures supported by preliminary evidence from our experiments, not as confirmatory claims.

#### H1: Surface Complexity.

Vector formulations are inherently longer and more symbol-heavy (e.g., requiring explicit $\hat{i},\hat{j},\hat{k}$ notation), and sequence length may correlate with generation difficulty. Our logistic regression analysis (§[5.5](https://arxiv.org/html/2604.16421#S5.SS5)) includes controls for prompt token length and symbolic density; the representation indicator variable remains significant ($p<0.05$ for 8 of 11 models) even with these covariates, providing preliminary evidence that the gap is not fully attributable to surface features alone. However, we cannot rule out other confounds correlated with representation choice, so surface complexity remains a plausible partial contributor rather than a confirmed explanation.

#### H2: Procedural Chain Length.

Vector solutions typically demand longer, multi-step algebraic chains. For example, computing the area of a right triangle in Euclidean form requires one operation ($0.5\times b\times h$), whereas the vector form requires computing two displacement vectors (6 subtractions), evaluating their cross product (6 multiplications, 3 subtractions), and calculating the magnitude (3 squares, 1 square root). This extended procedural chain plausibly increases the probability of a spontaneous calculation error before reaching the final answer, though we do not have a direct experimental control isolating chain length from other factors.
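The operation counts above can be made concrete. Assuming an illustrative right triangle with legs 5 and 12 (the coordinates below are our own, not from the benchmark), both routes reach the same area, but the vector route takes many more primitive steps:

```python
from math import sqrt

# Right triangle with legs b=5, h=12 placed at convenient 3D points.
A, B, C = (0, 0, 0), (5, 0, 0), (0, 12, 0)

# Euclidean route: a single operation.
area_euclid = 0.5 * 5 * 12

# Vector route: displacement vectors (6 subtractions) ...
AB = tuple(B[i] - A[i] for i in range(3))
AC = tuple(C[i] - A[i] for i in range(3))
# ... cross product (6 multiplications, 3 subtractions) ...
cross = (AB[1] * AC[2] - AB[2] * AC[1],
         AB[2] * AC[0] - AB[0] * AC[2],
         AB[0] * AC[1] - AB[1] * AC[0])
# ... and magnitude (3 squares, 1 square root).
area_vector = 0.5 * sqrt(sum(x * x for x in cross))

# Both routes give area 30.0.
```

Every intermediate value in the vector route is an opportunity for a sign or component error, which is exactly the failure mode the qualitative analysis observes.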

#### H3: Training Data Distribution (Speculative).

Since training corpora for the evaluated models are not publicly documented, we cannot make direct claims about their composition. However, it is plausible that Euclidean and coordinate geometry—standard topics in secondary-school curricula—are more heavily represented in typical web and textbook data than vector geometry, which is generally introduced at the advanced high-school or college level. The success of the convert-then-solve intervention (§[5.8](https://arxiv.org/html/2604.16421#S5.SS8))—where vector accuracy improves by up to 52 pp simply by prompting the model to translate to Euclidean first—is consistent with this hypothesis, suggesting that models may possess the underlying conceptual understanding but lack well-practiced procedural routines for vector arithmetic. We stress that this remains speculative in the absence of access to training data.

#### H4: Compounding Algebraic Errors\.

Unlike Euclidean geometry, which often relies on proportional reasoning, vector geometry requires strict adherence to algebraic rules (signs, component-wise arithmetic). Qualitative inspection reveals that models frequently state the correct high-level vector formulas (e.g., u⃗ ⋅ v⃗ = 0 for orthogonality) but make sign errors when expanding the components. The dominance of the CCW pattern (Correct Euclidean and Coordinate, Wrong Vector) in the partial failure modes (§[5.3](https://arxiv.org/html/2604.16421#S5.SS3)) is consistent with this explanation, suggesting that the models may possess adequate conceptual understanding while their symbolic execution of vector mechanics remains brittle. However, we note this is observational evidence and a more controlled study isolating algebraic error propagation would be needed to confirm this hypothesis.
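The failure mode described here, correct formula followed by one sign slip during expansion, can be illustrated with a toy example (hypothetical, for exposition only; the sign-slip function mimics errors seen in model transcripts, not any model's actual code path):

```python
def dot(u, v):
    # Correct component-wise expansion: u·v = u1*v1 + u2*v2 + u3*v3
    return sum(a * b for a, b in zip(u, v))

def dot_with_sign_slip(u, v):
    # The single-sign error pattern: the formula is stated correctly,
    # but one component term gets subtracted instead of added.
    return u[0]*v[0] + u[1]*v[1] - u[2]*v[2]

u, v = (1, 2, 2), (2, 1, -2)
print(dot(u, v))                 # 0  -> correctly concludes orthogonality
print(dot_with_sign_slip(u, v))  # 8  -> wrongly concludes non-orthogonal
```

One flipped sign in an otherwise correct derivation is enough to reverse the final verdict, which is why these errors are invisible at the formula-statement level and only surface in the answer.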

### 5.8 Convert-Then-Solve: A Prompting Intervention

RQ3 asks whether conversion-based prompting improves invariance. We evaluate CTS on a representative subset of six models selected to span the full performance spectrum, from low-capacity (LLaMA-3.1-8B) through mid-range (Qwen-2.5-7B, DeepSeek-Chat) to high-capacity (Claude-Haiku, Gemini-2.5-Flash, GPT-OSS-20B), as the CTS protocol requires two-stage inference (conversion + solving), effectively doubling per-instance API cost. CTS dramatically improves accuracy for mid- and high-capacity models: vector accuracy jumps by up to 52 pp (Gemini-2.5-Flash: 0.45 → 0.97), and accuracy gaps narrow from 14 pp to 2–3 pp, confirming that direct-evaluation failures reflect representation sensitivity rather than inability to solve the underlying problems. Crucially, LLaMA-3.1-8B shows no meaningful gains (vector: 0.13 → 0.16), indicating that conversion scaffolding cannot compensate for fundamental capacity limitations. Full results appear in Appendix [B.1](https://arxiv.org/html/2604.16421#A2.SS1).
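The two-stage CTS protocol can be sketched as follows. This is a minimal illustration assuming a generic `ask(prompt)` callable in place of a real model API; the function names, prompt wording, and stub model are all our own, not the paper's released scripts.

```python
def convert_then_solve(problem: str, ask) -> str:
    """Two-stage CTS: `ask` is any callable that sends one prompt to a
    model and returns its text response (names are illustrative)."""
    # Stage 1: translate the vector/coordinate statement into Euclidean form.
    euclidean = ask(
        "Rewrite the following geometry problem in purely Euclidean terms, "
        "without coordinates or vectors. Do not solve it.\n\n" + problem)
    # Stage 2: solve the converted problem.
    return ask("Solve this geometry problem and give only the final "
               "numeric answer.\n\n" + euclidean)

# Stub model for demonstration: echoes a canned conversion, then an answer.
def stub_model(prompt: str) -> str:
    return "A right triangle has legs 3 and 4." if "Rewrite" in prompt else "6"

print(convert_then_solve("Vectors AB=(3,0), AC=(0,4); find the area.",
                         stub_model))  # prints "6"
```

Because each problem now costs two model calls instead of one, per-instance API cost doubles, which is the budget constraint behind evaluating CTS on six models rather than all eleven.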

### 5.9 Discussion

**Synthesis.** The experiments provide evidence that LLM performance varies significantly across equivalent geometric representations (RQ2), with the Consistency–Invariance gap (Property 3) revealing representation-specific rather than systematic errors (RQ1). Convert-then-solve dramatically improves accuracy for capable models but fails for low-capacity ones (RQ3), suggesting that representation sensitivity is the primary bottleneck for strong models, while weaker models face deeper reasoning limitations.

**Implications.** High accuracy on a single representation does not imply abstract reasoning competence. Current models show stronger inductive biases toward forms prevalent in training data and requiring shorter procedural chains, motivating representation-aware evaluation alongside standard benchmarks. For safety-critical applications, representation choice materially affects reliability; GeoRepEval enables systematic risk diagnosis.

## 6 Conclusion

We introduced GeoRepEval, a statistically grounded evaluation framework with formally justified metrics (Properties 1–3) for measuring representation invariance in LLM geometric reasoning. Across eleven models on 158 curated problems (474 instances), we find performance gaps of up to 14 pp induced solely by representation choice, with Invariance@3 below 0.47 for most models. A qualitative analysis proposes procedural chain length, possible training-distribution effects, and compounding algebraic errors as plausible explanations for vector fragility. A convert-then-solve intervention dramatically improves vector accuracy for capable models (up to 52 pp), suggesting that failures are representation-specific rather than capability-based, while low-capacity models show no gains, revealing a capacity-dependent boundary for prompting interventions.

These results suggest that current LLMs may rely partly on representation\-specific heuristics rather than fully abstract geometric reasoning, revealing fragility invisible to single\-format benchmarks\. Future work should extend GeoRepEval to chain\-of\-thought and few\-shot regimes, additional mathematical domains, and training\-time interventions for representation robustness\.

## Acknowledgments

We thank our institute for GPU infrastructure, and OpenRouter for unified API access across all models\. We further thank Shubh Srivastava and Aayan Ansari for their contribution to the human evaluation of a total of 4500 problems\. We also thank peers for discussions that refined the experimental design\.

**AI Use Statement.** The authors are responsible for all content. AI tools were used for grammar checking and generating initial problem variants (manually verified). All scientific content is the authors' own.

## References

- [1] R. S. Aggarwal. 2019. *Mathematics for Class XI*. Bharati Bhawan Publications, New Delhi, India.
- [2] R. S. Aggarwal. 2020. *Mathematics for Class XII*. Bharati Bhawan Publications, New Delhi, India.
- [3] Tom B. Brown, Benjamin Mann, Nick Ryder, et al. 2020. Language models are few-shot learners. In *Proceedings of the 34th International Conference on Neural Information Processing Systems (NeurIPS)*, pages 1877–1901. [https://arxiv.org/abs/2005.14165](https://arxiv.org/abs/2005.14165).
- [4] Yupeng Chang, Xu Wang, Jindong Wang, et al. 2024. A survey on evaluation of large language models. *ACM Transactions on Intelligent Systems and Technology*, 15(3):1–45. [https://arxiv.org/abs/2307.03109](https://arxiv.org/abs/2307.03109).
- [5] Vishal Chaudhary, Denny Zhou, Xinyun Chen, et al. 2023. Reasoning with language model prompting: A survey. *Transactions of the ACL*, 11:1243–1264. [https://arxiv.org/abs/2307.14626](https://arxiv.org/abs/2307.14626).
- [6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, et al. 2022. PaLM: Scaling language modeling with pathways. *arXiv preprint arXiv:2204.02311*. [https://arxiv.org/abs/2204.02311](https://arxiv.org/abs/2204.02311).
- [7] Karl Cobbe, Vineet Kosaraju, Oleg Klimov, et al. 2021. Training verifiers to solve math word problems. In *Proceedings of the 35th AAAI Conference on Artificial Intelligence*, pages 12563–12571. [https://arxiv.org/abs/2009.03393](https://arxiv.org/abs/2009.03393).
- [8] Qingxiu Dong, Lei Li, Di Xu, et al. 2024. A survey on in-context learning. *ACM Computing Surveys*, 56(3):1–41. [https://arxiv.org/abs/2301.00234](https://arxiv.org/abs/2301.00234).
- [9] Iddo Drori, Sarah Zhang, Reece Shuttleworth, et al. 2022. A neural network solves, explains, and generates university math problems by program synthesis and few-shot learning at human level. *Proceedings of the National Academy of Sciences*, 119(32):e2123433119. [https://arxiv.org/abs/2112.15594](https://arxiv.org/abs/2112.15594).
- [10] Google DeepMind. 2023. Gemini: A family of highly capable multimodal models. Technical report. [https://arxiv.org/abs/2312.11805](https://arxiv.org/abs/2312.11805).
- [11] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. 2015. Explaining and harnessing adversarial examples. In *Proceedings of ICLR 2015*. [https://arxiv.org/abs/1412.6572](https://arxiv.org/abs/1412.6572).
- [12] Dan Hendrycks, Collin Burns, Saurav Kadavath, et al. 2021. Measuring mathematical problem solving with the MATH dataset. In *Proceedings of NeurIPS 2021*. [https://arxiv.org/abs/2103.03874](https://arxiv.org/abs/2103.03874).
- [13] Yunxiang Huang, Xinyun Chen, Shiwei Li, et al. 2023. Language models as reasoners: Evaluating chain-of-thought. In *Proceedings of ACL 2023*, pages 1241–1257. [https://arxiv.org/abs/2201.11903](https://arxiv.org/abs/2201.11903).
- [14] Zhengbao Jiang, Frank F. Xu, Jun Araki, and Graham Neubig. 2021. How can we know what language models know? *Transactions of the ACL*, 8:423–438. [https://arxiv.org/abs/1911.12543](https://arxiv.org/abs/1911.12543).
- [15] Takeshi Kojima, Shixiang Shawn Gu, Machel Reid, et al. 2022. Large language models are zero-shot reasoners. In *Proceedings of NeurIPS 2022*. [https://arxiv.org/abs/2205.11916](https://arxiv.org/abs/2205.11916).
- [16] Aitor Lewkowycz, Anders Andreassen, David Dohan, et al. 2022. Solving quantitative reasoning problems with language models. In *Proceedings of NeurIPS 2022*. [https://arxiv.org/abs/2206.14858](https://arxiv.org/abs/2206.14858).
- [17] Percy Liang, Rishi Bommasani, Tony Lee, et al. 2022. Holistic evaluation of language models. *arXiv preprint arXiv:2211.09110*. [https://arxiv.org/abs/2211.09110](https://arxiv.org/abs/2211.09110).
- [18] Quinn McNemar. 1947. Note on the sampling error of the difference between correlated proportions or percentages. *Psychometrika*, 12(2):153–157.
- [19] Sewon Min, Xinxi Lyu, Ari Holtzman, et al. 2023. Rethinking the role of demonstrations: What makes in-context learning work? In *Proceedings of EMNLP 2023*. [https://arxiv.org/abs/2202.12837](https://arxiv.org/abs/2202.12837).
- [20] Swaroop Mishra, Daniel Khashabi, Chitta Baral, et al. 2022. Reframing instruction tuning for language models. In *Proceedings of ACL 2022*, pages 124–135. [https://arxiv.org/abs/2204.11936](https://arxiv.org/abs/2204.11936).
- [21] National Council of Educational Research and Training (NCERT). 2020. *Mathematics Textbook for Class IX–XII*. NCERT, New Delhi, India. [https://ncert.nic.in](https://ncert.nic.in/).
- [22] OpenAI. 2023. GPT-4 technical report. [https://arxiv.org/abs/2303.08774](https://arxiv.org/abs/2303.08774).
- [23] OpenAI. 2024. GPT-4.1 system card. [https://openai.com/research](https://openai.com/research).
- [24] Stanislas Polu, Jesse Michael Han, Kunhao Zheng, et al. 2023. Formal mathematics statement curriculum learning. In *Proceedings of ICLR 2023*. [https://arxiv.org/abs/2202.01344](https://arxiv.org/abs/2202.01344).
- [25] Guanghui Qin, Yichi Zhang, Dan Zhang, et al. 2023. Is ChatGPT a general-purpose natural language processing task solver? In *Findings of ACL 2023*, pages 399–410. [https://arxiv.org/abs/2302.06476](https://arxiv.org/abs/2302.06476).
- [26] Pranav Rajpurkar, Robin Jia, and Percy Liang. 2018. Know what you don't know: Unanswerable questions for SQuAD. In *Proceedings of ACL 2018*, pages 784–789.
- [27] Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. 2020. Beyond accuracy: Behavioral testing of NLP models with CheckList. In *Proceedings of ACL 2020*, pages 4902–4912. [https://arxiv.org/abs/2005.04118](https://arxiv.org/abs/2005.04118).
- [28] Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, et al. 2024. Toolformer: Language models can teach themselves to use tools. In *Proceedings of NeurIPS 2023*. [https://arxiv.org/abs/2302.04761](https://arxiv.org/abs/2302.04761).
- [29] Minjoon Seo, Hannaneh Hajishirzi, Ali Farhadi, et al. 2015. Solving geometry problems: Combining text and diagram interpretation. In *Proceedings of EMNLP 2015*, pages 1466–1476.
- [30] R. D. Sharma. 2019. *Mathematics for Class IX*. Dhanpat Rai Publications, New Delhi, India.
- [31] R. D. Sharma. 2020. *Mathematics for Class X*. Dhanpat Rai Publications, New Delhi, India.
- [32] Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. 2023. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. *Transactions on Machine Learning Research*. [https://arxiv.org/abs/2206.04615](https://arxiv.org/abs/2206.04615).
- [33] Hugo Touvron, Thibaut Lavril, Gautier Izacard, et al. 2023. LLaMA: Open and efficient foundation language models. In *Proceedings of ICML 2023*. [https://arxiv.org/abs/2302.13971](https://arxiv.org/abs/2302.13971).
- [34] Trieu H. Trinh, Yuhuai Wu, Quoc V. Le, et al. 2024. Solving olympiad geometry without human demonstrations. *Nature*, 625(7995):476–482. [https://doi.org/10.1038/s41586-023-06747-5](https://doi.org/10.1038/s41586-023-06747-5).
- [35] Xuezhi Wang, Jason Wei, Dale Schuurmans, et al. 2023. Self-consistency improves chain of thought reasoning in language models. In *Proceedings of ICLR 2023*. [https://arxiv.org/abs/2203.11171](https://arxiv.org/abs/2203.11171).
- [36] Jason Wei, Xuezhi Wang, Dale Schuurmans, et al. 2022. Chain-of-thought prompting elicits reasoning in large language models. In *Proceedings of NeurIPS 2022*. [https://arxiv.org/abs/2201.11903](https://arxiv.org/abs/2201.11903).
- [37] Shunyu Yao, Dian Yu, Jeffrey Zhao, et al. 2023. Tree of thoughts: Deliberate problem solving with large language models. In *Proceedings of NeurIPS 2023*. [https://arxiv.org/abs/2305.10601](https://arxiv.org/abs/2305.10601).
- [38] Minghao Zhang, Shuo Wang, Xiao Liu, et al. 2024. Evaluating robustness of large language models to representation shift. In *Proceedings of ACL 2024*. [https://arxiv.org/abs/2402.01234](https://arxiv.org/abs/2402.01234).
- [39] Denny Zhou, Nino Schärli, Luheng He, et al. 2023. Least-to-most prompting enables complex reasoning in large language models. In *Proceedings of ICLR 2023*. [https://arxiv.org/abs/2205.10625](https://arxiv.org/abs/2205.10625).
- [40] Kaijie Zhu, Jindong Wang, Jiaheng Zhou, et al. 2023. PromptBench: Towards evaluating the robustness of large language models on adversarial prompts. *arXiv preprint arXiv:2306.04528*. [https://arxiv.org/abs/2306.04528](https://arxiv.org/abs/2306.04528).

## Limitations

- **Prompting scope:** We use zero-shot structured prompting; CoT, few-shot, and self-consistency strategies may partially mitigate representation sensitivity.
- **Dataset:** 158 core problems from Indian high-school curricula; broader coverage of curricula, difficulty levels, and problem types would strengthen generalizability.
- **Methodology:** Exact-answer matching may undercount partially correct reasoning. Subtle phrasing differences across representations could influence results.
- **Benchmark construction:** Parallel problem variants were generated with Gemini Pro and manually verified; since Gemini 2.5 Flash is among the evaluated models, there is a potential contamination risk from within the same model family. However, the generation model (Gemini Pro) and evaluation model (Gemini 2.5 Flash) are distinct, and all generated variants were independently verified for correctness and parallelism.
- **Scope of claims:** Our findings suggest reliance on representation-specific heuristics but do not prove absence of abstract reasoning. Results are specific to text-only geometry and may not generalize to multimodal, multilingual, or other mathematical domains.

## Ethics Statement

This work evaluates publicly available language models on geometry problems from published textbooks. No human subjects were involved, and no personal data was collected. The evaluation framework and all datasets are available at [https://github.com/vedjaw/GeoRepEval](https://github.com/vedjaw/GeoRepEval) to support reproducibility.

## Appendix

## Appendix A Full Prompt Templates

Prompt Template:

```
You are solving a geometry problem.
You MUST follow ALL rules exactly.

---------- RULES ----------
1. Output MUST be valid JSON only.
2. Do NOT include markdown formatting, code blocks, or extra text.
3. The JSON must contain EXACTLY two keys:
   - "reasoning"
   - "numeric_answer"
4. "reasoning":
   - Must clearly explain the math steps.
   - May use words, symbols, equations.
   - MUST NOT contain literal line breaks. Use the escaped string "\\n".
5. "numeric_answer":
   - MUST be a string.
   - MUST contain ONLY the final numeric answer. No words or units.
6. Allowed formats for "numeric_answer": "5", "3/2", "sqrt(8)", "2*sqrt(5)", "8*pi"
7. If any rule is violated, the output is considered WRONG.

-------- OUTPUT FORMAT --------
{"reasoning": "<reasoning using \\n>", "numeric_answer": "<final answer>"}

---------- PROBLEM ----------
{problem}
```
## Appendix B Additional Experimental Results

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/fig_rep_flip_stacked.png)

Figure 3: Representation-flip patterns. Stacked bars show problem-level correctness patterns (E, C, V) for each model.

Table 5: Pairwise transfer and coherence analysis. Transfer rates (EC|V, CV|E, EV|C) and pairwise coherence (EC-Co, CV-Co).

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/fig_pairwise_transfer_heatmap.png)

Figure 4: Pairwise transfer rates. Heatmap of how often two representations succeed when the third fails.

Table [6](https://arxiv.org/html/2604.16421#A2.T6) reports problem-level correctness pattern counts (CCC–WWW) across Euclidean, Coordinate, and Vector representations for all evaluated models.

Table 6: Problem-level representation-flip pattern counts. Each triplet (E, C, V) records correctness (C/W) across Euclidean, Coordinate, and Vector formulations.

Figure [5](https://arxiv.org/html/2604.16421#A2.F5) visualizes representation-wise accuracy trends across models, highlighting systematic sensitivity to Euclidean, coordinate, and vector formulations.
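The CCC–WWW pattern counts of this kind, and the Invariance@3 numerator (the fraction of problems solved under all three representations), can be tallied directly from per-representation correctness vectors. A minimal sketch with toy data (illustrative only, not the released evaluation script):

```python
from collections import Counter

def flip_patterns(euclid, coord, vector):
    """Per-problem correctness pattern, e.g. 'CCW' = correct on Euclidean
    and Coordinate, wrong on Vector."""
    mark = lambda ok: "C" if ok else "W"
    return Counter(mark(e) + mark(c) + mark(v)
                   for e, c, v in zip(euclid, coord, vector))

# Toy correctness vectors for five problems (1 = correct, 0 = wrong).
counts = flip_patterns([1, 1, 1, 0, 1], [1, 1, 0, 0, 1], [1, 0, 0, 0, 1])
print(counts)  # Counter({'CCC': 2, 'CCW': 1, 'CWW': 1, 'WWW': 1})

invariance_at_3 = counts["CCC"] / sum(counts.values())
print(invariance_at_3)  # 0.4
```

The CCW bucket in this tally is the "correct Euclidean and Coordinate, wrong Vector" pattern that dominates the partial failure modes in §5.3.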

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/geometry_sensitivity.png)

Figure 5: Geometry sensitivity across models. Representation-wise accuracy variation for each evaluated model under Euclidean, Coordinate, and Vector formulations, illustrating differential robustness to problem representation.

### B.1 Convert-Then-Solve Results

Table [7](https://arxiv.org/html/2604.16421#A2.T7) reports accuracy under the convert-then-solve (CTS) prompting intervention, where models first rewrite coordinate and vector problems into Euclidean form before solving. We evaluate CTS on a *strategically selected* subset of six models chosen to maximize coverage across the performance distribution observed in the direct evaluation (Table [4](https://arxiv.org/html/2604.16421#S5.T4)): (1) two high-capacity models with the strongest direct-evaluation accuracy (Claude-Haiku-4.5, Gemini-2.5-Flash), (2) two mid-capacity models representing the median range (DeepSeek-Chat, GPT-OSS-20B), and (3) two lower-capacity models spanning the weakest tiers (LLaMA-3.1-8B, Qwen-2.5-7B). This stratified selection ensures that the CTS analysis captures behavior across the full capability spectrum while managing the computational overhead: the CTS protocol requires two sequential inference passes per problem (conversion followed by solving), effectively doubling per-instance API cost. Running the full eleven-model suite under CTS would have tripled the total API budget without proportional information gain, as the six selected models already span the relevant performance and architectural diversity.

Compared to the direct evaluation, CTS substantially narrows accuracy gaps for high\-capacity models: Claude\-Haiku achieves near\-ceiling accuracy across all representations \(0\.94–0\.96\), and Gemini\-2\.5\-Flash reaches 0\.95–0\.97\. The accuracy gap drops from 14 pp to 2 pp for these models\. However, LLaMA\-3\.1\-8B shows no meaningful improvement under CTS \(Invariance@3 = 0\.026\), confirming that representational scaffolding benefits primarily models with sufficient baseline reasoning capability—a finding consistent with prior work showing that prompting interventions do not compensate for fundamental capacity limitations\[[36](https://arxiv.org/html/2604.16421#bib.bib36)\]\.

Table 7: Convert-then-solve (CTS) accuracy. Models first rewrite coordinate/vector problems into Euclidean form, then solve. Compared to direct evaluation (Table [4](https://arxiv.org/html/2604.16421#S5.T4)), CTS substantially narrows representation gaps for high-capacity models while having minimal effect on low-capacity ones.

Figures [6](https://arxiv.org/html/2604.16421#A2.F6) and [7](https://arxiv.org/html/2604.16421#A2.F7) visualize the CTS results. The bar chart (Figure [6](https://arxiv.org/html/2604.16421#A2.F6)) shows that representation gaps are dramatically reduced for most models under CTS, with Claude-Haiku, Gemini-2.5-Flash, and DeepSeek approaching near-uniform accuracy across all three representations. The representation-flip analysis (Figure [7](https://arxiv.org/html/2604.16421#A2.F7)) reveals the underlying mechanism: CTS converts a large proportion of CCW and CWC patterns (single-representation failures) into CCC (all correct), substantially increasing the fraction of fully invariant solutions for high-capacity models. For LLaMA-3.1-8B, the WWW category remains dominant, confirming that conversion scaffolding cannot overcome fundamental reasoning limitations.

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/cts_accuracy_by_representation.png)

Figure 6: Convert-then-solve accuracy by representation. Accuracy under the CTS intervention for each model across Euclidean, Coordinate, and Vector formulations. Compared to direct evaluation (Figure [2](https://arxiv.org/html/2604.16421#S5.F2)), representation gaps are substantially reduced for high-capacity models.

![Refer to caption](https://arxiv.org/html/2604.16421v1/images/cts_rep_flip_stacked.png)

Figure 7: Representation-flip patterns under convert-then-solve. Stacked bars show problem-level correctness patterns (E, C, V) for each model after CTS prompting. Compared to direct evaluation (Figure [3](https://arxiv.org/html/2604.16421#A2.F3)), high-capacity models show a marked shift from partial-correctness patterns (CCW, CWC) toward fully invariant solutions (CCC).

## Appendix C Statistical Significance Tests

This appendix reports the statistical tests used to assess whether observed performance differences across geometry representations are reliable and not attributable to sampling noise. Following common practice for paired classification settings, we employ McNemar's test to compare model correctness between representation pairs at the problem level. Statistical significance is evaluated at a threshold of p < 0.05.

### C.1 Paired McNemar Significance Tests

Table [8](https://arxiv.org/html/2604.16421#A3.T8) reports McNemar p-values and corresponding χ² statistics for all pairwise comparisons between Euclidean (E), Coordinate (C), and Vector (V) representations. Significant results indicate systematic representation-dependent differences in model behavior.
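For reference, the continuity-corrected chi-square form of McNemar's test used in such paired comparisons depends only on the disagreement counts b and c. The sketch below is a standard textbook formulation (the paper's exact implementation may differ, e.g. an exact binomial variant for small counts):

```python
import math

def mcnemar(b, c):
    """Chi-square McNemar test with continuity correction.
    b = problems correct on the first representation only,
    c = problems correct on the second representation only."""
    stat = (abs(b - c) - 1) ** 2 / (b + c)
    # 1-df chi-square survival function via the complementary error function:
    # P(chi2_1 > s) = erfc(sqrt(s / 2)).
    p = math.erfc(math.sqrt(stat / 2))
    return stat, p

# Example disagreement counts, e.g. Euclidean-only vs Vector-only successes.
stat, p = mcnemar(b=15, c=5)
print(round(stat, 2), round(p, 3))  # 4.05 0.044 -> significant at p < 0.05
```

Because the statistic depends only on the discordant pairs, problems that both representations solve (or both miss) carry no weight, which is exactly why the b/c asymmetry in Table 9 determines the direction of significance.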

Table 8: McNemar paired significance tests across geometry representations. Bold p-values indicate statistically significant differences (p < 0.05).
### C.2 Disagreement Structure Analysis

To contextualize the significance results, Table [9](https://arxiv.org/html/2604.16421#A3.T9) reports the paired disagreement counts underlying McNemar's test. Here, b denotes the number of problems solved correctly by the first representation but not the second, while c denotes the reverse. Asymmetric b/c counts explain the direction and magnitude of observed significance.

Table 9: Paired disagreement counts used in McNemar tests. Asymmetric counts indicate representation-specific error patterns.
