LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

arXiv cs.AI 05/19/26, 04:00 AM Papers

Summary

Introduces LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier LLMs on structured linear algebra computation across matrix dimensions, revealing that LLM mathematical failure is structurally constrained and transitions from execution errors to computational abandonment at 4x4 scale.

arXiv:2605.16675v1 Announce Type: new Abstract: We introduce LinAlg-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of 3x3, 4x4, and 5x5 matrices. Spanning 9 task types and 660 SymPy-certified problems, the benchmark exhaustively evaluates 6,600 model outputs. Beyond binary accuracy, LinAlg-Bench introduces a three-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension. Our central finding is a sharp behavioral threshold at 4x4 scale: below it, models fail through execution errors -- sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint-consistent confabulation, and structured hallucination rather than attempting computation. This fabrication-to-abandonment transition is near-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale-emergent error types absent at 3x3 but present at 4x4 and 5x5. We further show that solution strategy rigidity is a near-perfect predictor of 5x5 determinant accuracy, document constraint-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly.

Original Article

View Cached Full Text

Cached at: 05/19/26, 06:34 AM

# LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning
Source: [https://arxiv.org/html/2605.16675](https://arxiv.org/html/2605.16675)
Shradha Agarwal Deepak Rajbhar Tariq J\. Department of Nuclear Engineering and Computer Science Missouri University of Science and Technology Rolla, Missouri sabrc@mst\.edu

###### Abstract

We introduce LinAlg\-Bench, a diagnostic benchmark evaluating 10 frontier large language models on structured linear algebra computation across a strict dimensional gradient of3×33\\times 3,4×44\\times 4, and5×55\\times 5matrices\. Spanning 9 task types and 660 SymPy\-certified problems, the benchmark exhaustively evaluates 6,600 model outputs\. Beyond binary accuracy, LinAlg\-Bench introduces a three\-stage automated forensic pipeline classifying 1,156 failures into ten primary error tags with fine\-grained subtypes, revealing that LLM mathematical failure is not random but structurally constrained by algorithm type and matrix dimension\. Our central finding is a sharp behavioral threshold at4×44\\times 4scale: below it, models fail through execution errors — sign tracking failures, arithmetic drift, and parity errors; above it, failure transitions to computational abandonment, with models fabricating responses through tool roleplay, constraint\-consistent confabulation, and structured hallucination rather than attempting computation\. This fabrication\-to\-abandonment transition is near\-universal across all model tiers and architectures, suggesting a working memory limit rather than a knowledge gap, supported by three scale\-emergent error types absent at3×33\\times 3but present at4×44\\times 4and5×55\\times 5\. We further show that solution strategy rigidity is a near\-perfect predictor of5×55\\times 5determinant accuracy, document constraint\-aware confabulation as a novel structured hallucination failure mode, and release all data, model outputs, error labels, and judge pipeline publicly\.

## 1Introduction

Do frontier large language models reason mathematically, or do they approximate the surface structure of mathematical discourse? The prevailing account of LLM mathematical failure is essentially statistical — models fail because they have seen insufficient training examples, or because the problem exceeds some general capability threshold\. This paper argues for a stronger, structurally grounded account: LLM failure under recursive computational load is not random, it follows a predictable pattern determined by algorithm family and matrix dimension, and transitions sharply from execution failure to computational abandonment at a measurable complexity threshold\.

Linear algebra provides an ideal testbed because operations are algorithmically precise, scale naturally across matrix dimensions without changing the underlying task, and decompose into a cognitive hierarchy in which each level adds exactly one computational requirement absent in the level below: matrix reading, parallel arithmetic, sequential state tracking, recursive sign management, and compositional operations built on prior levels\. A model that fails on a5×55\\times 5determinant cannot attribute the failure to unfamiliar mathematics — the algorithm is identical to the3×33\\times 3case it solved correctly\.

At3×33\\times 3, nine of ten models achieve perfect or near\-perfect determinant accuracy\. At5×55\\times 5, only three models exceed 50% on determinants, and eigenvalue accuracy collapses to near\-zero for all models\. This pattern — preserved knowledge, collapsed execution — is the signature of a working memory constraint rather than a knowledge deficit\. The mode of failure also changes categorically: at3×33\\times 3, models attempt computation and fail within it; at5×55\\times 5, models abandon computation before attempting it, fabricating responses through structured hallucination that satisfies surface mathematical constraints while being fundamentally incorrect\. This fabrication\-to\-abandonment threshold, documented across 6,600 exhaustively evaluated outputs, is the central empirical contribution of this paper\.

Models separate cleanly into three tiers defined by the dimensional boundary at which this threshold is crossed — Tier 3 models collapse at the Recursive level, Tier 2 partially withstands Recursive but collapses at the Compositional level, Tier 1 sustains Recursive but converges to near\-zero eigenvalue accuracy by5×55\\times 5\. A targeted ablation study further shows that enforcing algorithmically efficient strategies does not recover accuracy, establishing that the bottleneck operates at the level of autoregressive execution depth rather than method selection\.

LinAlg\-Bench contributes: 660 SymPy\-verified problems across 9 tasks and 3 matrix dimensions with all 6,600 outputs exhaustively evaluated; a three\-stage forensic pipeline classifying 1,156 failures into ten error tags validated against 593 human\-annotated labels; the fabrication\-to\-abandonment behavioral threshold as a novel, falsifiable empirical finding; and constraint\-aware confabulation as a structurally distinct hallucination failure mode\. All data, model outputs, error labels, and judge pipeline are publicly released\.

## 2Related Work

Mathematical reasoning evaluation has progressed from arithmetic word problems\(Cobbeet al\.,[2021](https://arxiv.org/html/2605.16675#bib.bib3)\)through competition\-level challenges\(Hendryckset al\.,[2021](https://arxiv.org/html/2605.16675#bib.bib9)\), unified multi\-domain evaluations\(Mishraet al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib21)\), to symbolic mathematics\(Lample and Charton,[2020](https://arxiv.org/html/2605.16675#bib.bib13)\)and multi\-level comprehension\(Liuet al\.,[2024](https://arxiv.org/html/2605.16675#bib.bib16)\)\. These benchmarks establish difficulty gradients but measure only final\-answer accuracy — a model failing on GSM8K and a model failing on a5×55\\times 5determinant are treated identically despite failing for categorically different reasons\. LinAlg\-Bench addresses this gap by holding the algorithm fixed and scaling only matrix dimension, isolating computational depth from mathematical knowledge — a design not available in benchmarks where difficulty and novelty are confounded\. This enables a controlled test of whether failure mode, not merely failure rate, changes with task complexity\.

Transformer models solve multi\-step compositional tasks via shallow pattern\-matching rather than genuine reasoning\(Dziriet al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib6)\), cannot reliably self\-correct arithmetic errors\(Huanget al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib10)\), and confabulate primarily through factual misgrounding\(Maynezet al\.,[2020](https://arxiv.org/html/2605.16675#bib.bib17)\)or overconfident generation\(Kadavathet al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib11)\)\. LinAlg\-Bench reveals a structurally distinct failure mode — constraint\-aware confabulation — in which models fabricate mathematically plausible responses under computational overload: eigenvalue sums match the matrix trace in 45% of fabrications, and magnitudes respect the Frobenius norm bound in 85% of cases \(n=20n\{=\}20Ungrounded\_Guess cases\)\. A benchmark evaluating only necessary conditions would systematically misclassify these fabrications as correct\. The tool roleplay collapse documented in Section[5\.3](https://arxiv.org/html/2605.16675#S5.SS3), in which models simulate invoking external tools they cannot access, represents a further structured hallucination mechanism not previously documented\.

The mechanistic interpretability literature has developed tools for localising computation within transformer architectures\(Vig,[2019](https://arxiv.org/html/2605.16675#bib.bib31); Menget al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib19); Conmyet al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib4)\)\. LinAlg\-Bench contributes at the behavioural level: the forensic taxonomy generates testable hypotheses for future mechanistic work — sign\-tracking failures predict late\-layer parity circuit involvement, and Complete\_Collapse predicts suppression of those circuits rather than their corruption\. Full mechanistic validation is reserved for follow\-up work\.

## 3Benchmark Design

LinAlg\-Bench organises 9 linear algebra tasks into five cognitive levels based on the computational demands each operation places on a language model’s processing pipeline\. This taxonomy is not a difficulty ranking but a causal decomposition: each level adds exactly one computational requirement absent in the level below \(Appendix[B](https://arxiv.org/html/2605.16675#A2)\)\. The critical boundary falls between Sequential and Recursive: determinant computation requires maintaining hierarchical parity state across nested cofactor expansions simultaneously, while rank and nullity require only independent row operations\. Eigenvalue at5×55\\times 5compounds this as a deliberate stress test — not a standard capability measurement — requiring a full5×55\\times 5determinant followed by degree\-5 polynomial root\-finding\.

Table 3\.1:Five cognitive levels of the LinAlg\-Bench taxonomy\. Each level adds exactly one computational requirement absent in the level below\. The critical boundary falls between Sequential and Recursive: row operations are independent, while cofactor expansion requires maintaining hierarchical parity state across nested submatrix expansions simultaneously\.The benchmark comprises 660 integer\-entry problems \(220 per dimension; det=50, eig=30, others=20 per model\), ground\-truth verified against SymPy symbolic computation\(Meureret al\.,[2017](https://arxiv.org/html/2605.16675#bib.bib20)\)\. Determinant and eigenvalue receive larger sample allocations \(50 and 30 questions respectively\) because they are the primary Recursive and Compositional tasks under investigation; the remaining seven subcategories usen=20n\{=\}20, sufficient to establish ceiling and floor effects at Reading, Arithmetic, and Sequential levels\. Ten frontier models spanning reasoning\-optimised, mixture\-of\-experts, and standard instruction\-tuned architectures were evaluated zero\-shot at temperature 0 via API; full model details are in Appendix[A\.4](https://arxiv.org/html/2605.16675#A1.SS4)\. Code execution and external tools were disabled — this is a deliberate synthetic stress test of unaided recursive execution under increasing depth, not a practical evaluation of tool\-use capability\. Format sensitivity was evaluated at4×44\\times 4and5×55\\times 5across three input representations \(LaTeX, Tabular, List\) for nine models; format analysis targets non\-reasoning architectures where parsing load interacts with computational depth\.

## 4Results

Reading and Arithmetic level tasks — trace, transpose, matrix\-vector, multiplication, matrix\-power — remain essentially flat across all three dimensions, confirming that basic matrix comprehension and parallel arithmetic place no meaningful load at any scale\. Sequential tasks \(rank, nullity\) show a gradual, monotonic decline consistent with increasingGaussian elimination chainlength\. The cliff concentrates entirely at the Recursive and Compositional levels\. Full per\-model, per\-subcategory accuracy including rank, nullity, trace, and transpose breakdowns is reported in Appendix[C](https://arxiv.org/html/2605.16675#A3)\(Tables C\.1–C\.3\)\.

![Refer to caption](https://arxiv.org/html/2605.16675v1/x1.png)Figure 4\.1:Accuracy trajectories across matrix dimensions for the Recursive level \(determinant, left\) and Compositional level \(eigenvalues, right\), grouped by tier\. Lines show tier mean accuracy; shaded bands show the min–max range within each tier\. Tier 1 \(blue\): OpenAI\-o1, Gemini\-3\.0\-Pro, DeepSeek\-V3, Qwen3\-235B\. Tier 2 \(green\): GPT\-5\.2, Mistral\-Large\. Tier 3 \(red\): Claude\-4\.5\-Sonnet, Qwen2\.5\-72B, Llama\-3\.3\-70B, GPT\-4o\. All proportions are exact counts from 6,600 exhaustively evaluated outputs \(2,200 per dimension; 220 per model×\\times10 models\)\.The determinant task carries the cliff signal\. At3×33\\times 3, all models achieve near\-perfect accuracy\. At4×44\\times 4, the cliff begins to manifest with the first structural collapses among lower\-tier models\. The decisive break occurs at5×55\\times 5: a 32 percentage point step\-discontinuity from4×44\\times 4to5×55\\times 5determinant accuracy — far exceeding the gradual Sequential decline — proving the bottleneck is recursive state\-tracking, not general difficulty scaling\. Eigenvalue accuracy collapses to near\-zero across all models at5×55\\times 5, consistent with its design as a deliberate stress test under maximum recursive load rather than a standard capability measurement\.

Models separate into three tiers defined by their5×55\\times 5determinant boundary\. Tier 1 \(OpenAI\-o1, Gemini\-3\.0\-Pro, Qwen3\-235B, DeepSeek\-V3\) maintains≥74\\geq 74% accuracy\. Tier 2 \(GPT\-5\.2, Mistral\-Large\) falls to 36–54%\. Tier 3 \(Claude\-4\.5\-Sonnet, Qwen2\.5\-72B, Llama\-3\.3\-70B, GPT\-4o\) collapses to 0–12%\. Critically, all three tiers converge to near\-zero eigenvalue accuracy by5×55\\times 5— the Recursive and Compositional levels represent two distinct failure thresholds, not a single continuum\. This universal collapse at the Compositional level provides stark empirical evidence for the compositionality gap\(Presset al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib27)\), where models fail on compound tasks despite mastering the underlying sub\-tasks\.

Format sensitivity is negligible at4×44\\times 4but emerges at5×55\\times 5\. The effect distinguishes parsing limits from computational limits: while Arithmetic tasks \(e\.g\., matrix\-vector\) maintain near\-perfect accuracy under standard LaTeX notation, altering the input to Tabular or List formats induces artificial parsing failures — GPT\-4o drops to 50–54% and Claude\-4\.5\-Sonnet to 59–61% under non\-LaTeX formats at5×55\\times 5, compared to 63% and 66% under LaTeX respectively\. At the Recursive level, however, format preferences diverge, and fully collapsed models fail regardless of notation — confirming that the determinant collapse is a fundamental working memory bottleneck, not a parsing artifact\. Full format sensitivity data is reported in Appendix[D](https://arxiv.org/html/2605.16675#A4)\.

## 5Forensic Error Analysis

### 5\.1Classification Methodology

Forensic error classification was applied to all 1,156 incorrect responses \(100 at3×33\\times 3, 394 at4×44\\times 4, 662 at5×55\\times 5\), following the first\-error principle: the earliest step at which computation diverges from ground truth is tagged, not the final observable symptom\. The taxonomy comprises ten primary tags applicable across all nine subcategories — execution errors \(sign\_error,arithmetic,carry\_down\_error,memory\_loss\), structural errors \(hallucination,method\_fail\), and artifacts \(input\_transcription,generation\_truncation,formatting\_mismatch,other\_unmapped\), defined in Table[5\.1](https://arxiv.org/html/2605.16675#S5.T1)— plus four eigenvalue\-specific extensions \(generation\_loop,algebraic\_precedence,false\_verification,variable\_entanglement\) that arise exclusively during characteristic polynomial expansion and root\-finding, defined in Appendix[E](https://arxiv.org/html/2605.16675#A5), Table E\.2\. All distributional analysis in Section 5 uses the ten primary tags; eigenvalue extensions are reported in Appendix[H](https://arxiv.org/html/2605.16675#A8)\.

Table 5\.1:Ten primary error tags of the LinAlg\-Bench forensic taxonomy, grouped by family\. Classification follows the first\-error principle: the earliest step at which computation diverges from ground truth is tagged\. Two prompt\-level rules precede semantic judgment: the Magnitude Rule \(\|\|wrong\|\|==\|\|correct\|\|→\\rightarrowsign\_error\) and the Truncation Precheck \(no final answer→\\rightarrowgeneration\_truncation\)\.Two rules are enforced before any semantic judgment: the Magnitude Rule \(\|\|wrong\|\|==\|\|correct\|\|→\\rightarrowsign\_error, neverarithmetic\) and the Truncation Precheck \(no final answer→\\rightarrowgeneration\_truncation, neverhallucination\)\. Classification was performed by a three\-stage automated pipeline following the LLM\-as\-a\-Judge paradigm\(Zhenget al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib34)\): the Build Judge identifies the first divergence, the Validate Judge re\-examines under adversarial framing, and the Meta\-Auditor resolves batch\-level disagreements\. To validate pipeline reliability, 593 responses were independently hand\-labeled by the authors, weighted toward5×55\\times 5cases where automated agreement is lowest\. Pipeline agreement with human labels reaches 89\.7% at5×55\\times 5and 92\.6% overall \(Appendix[E](https://arxiv.org/html/2605.16675#A5), Table E\.4\)\. The five tags accounting for 91\.5% of failures achieve consistent agreement across all dimensions; seven low\-frequency tags fall below the 70% threshold at one or more dimensions and are excluded from distributional analysis, collectively representing only 3\.6% of all classified failures\. All 6,600 outputs were evaluated exactly once at temperature 0; because this constitutes a complete enumeration of the defined problem set rather than a sample, standard errors and confidence intervals over model accuracy are not applicable\.

### 5\.2Failure Mode Shift

Failure modes shift systematically with matrix dimension\.sign\_errorleads at3×33\\times 3\(33\.0%\) but drops to 17\.2% at5×55\\times 5\. Conversely,hallucinationrises from 17\.0% to 47\.1% — a transition from execution failure to computational abandonment\.

![Refer to caption](https://arxiv.org/html/2605.16675v1/x2.png)Figure 5\.1:Error tag distribution across matrix dimensions, expressed as percentage of total failures per dimension \(n=100n\{=\}100at3×33\\times 3;n=394n\{=\}394at4×44\\times 4;n=662n\{=\}662at5×55\\times 5\)\. Stage 1 \(3×3→4×43\\times 3\\to 4\\times 4\): scale\-emergent errors first appear, marking the onset of working memory stress\. Stage 2 \(4×4→5×54\\times 4\\to 5\\times 5\): the full regime flip completes —hallucinationovertakessign\_erroras the dominant failure mode\.This collapse unfolds in two distinct stages\. First, scale\-emergent errors \(generation\_loop,memory\_loss,variable\_entanglement\) appear at4×44\\times 4, marking the onset of working memory stress as execution falters\. Second, the full regime flip completes at5×55\\times 5, wherehallucinationovertakessign\_error, driven overwhelmingly by Complete\_Collapse \(explicit abandonment\)\. Parity errors mirror this trajectory: surging at4×44\\times 4as recursive depth stresses tracking, then retreating at5×55\\times 5as models abandon the cofactor algorithm altogether \(Appendix[F](https://arxiv.org/html/2605.16675#A6)\)\. The dominance of sign errors at lower dimensions and their persistence across model tiers is consistent with findings that transformers systematically struggle with signed arithmetic operations\(Leeet al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib14)\)\. Sign errors are not uniformly distributed across model tiers: Gemini\-3\.0\-Pro contributes zero sign errors across all dimensions, while Tier 3 models account for 221 of 270 total sign errors \(81\.9%\), consistent with the tier separation observed at the Recursive level\.

### 5\.3Constraint\-Aware Confabulation and Tool Collapse

hallucinationat5×55\\times 5is not random\. When models cannot compute eigenvalues, fabricated responses are structured by partial mathematical knowledge: eigenvalue sums match the matrix trace in 45% of fabrications, and magnitudes respect the Frobenius norm bound in 85% of cases \(n=20n\{=\}20Ungrounded\_Guess cases\)\. A benchmark evaluating only necessary conditions would systematically misclassify these fabrications as correct\.

Across 205 Complete\_Collapse instances at5×55\\times 5eigenvalue, the failure follows a consistent three\-step pattern: the model correctly identifies the task and sets up the characteristic equation, generates a meta\-statement citing complexity, then fabricates eigenvalues via simulated tool invocation — citing Python, NumPy, or WolframAlpha it does not have access to\. Tool use was disabled for all models; Gemini\-3\.0\-Pro and Llama\-3\.3\-70B explicitly output the system constraint before violating it, confirming tool invocation is hallucinated rather than real\. This contrasts sharply with models explicitly trained to call external APIs\(Schicket al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib30)\); the tool\-use syntax present in pretraining data appears to be triggered as a collapse response when mathematical execution is overloaded\. This simulated tool invocation likely stems from the prevalence of program\-aided reasoning in pretraining mixtures\(Gaoet al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib7)\); when internal computation fails, the model falls back on code\-generation syntax\.

The pattern has one structurally significant exception\. DeepSeek\-V3 achieves 10% eigenvalue accuracy at5×55\\times 5\(3/30\) — the only open\-weight model to pass any5×55\\times 5eigenvalue problems — while using cofactor expansion 78% of the time, a strategy profile indistinguishable from failing models\. Forensic inspection of its 3 successful responses confirms genuine computation: DeepSeek correctly expands the full5×55\\times 5characteristic polynomial before applying bisection root\-finding, independently arriving at the same numerical strategy as OpenAI\-o1\. Its 27 remaining eigenvalue failures follow the identical tool\-roleplay collapse pattern — establishing that superior execution raises the working memory ceiling rather than eliminating it\. DeepSeek’s success is not a capability advantage in any general sense; it is a precision advantage at the specific computational boundary where other models abandon\.

Full model\-specific collapse personas are documented in Appendix[I](https://arxiv.org/html/2605.16675#A9)\.

## 6Interpretation

### 6\.1The Working Memory Account

The dimensional shift documented in Section[5](https://arxiv.org/html/2605.16675#S5)— execution errors giving way to computational abandonment — is a threshold transition, not gradual degradation\. We propose a working memory account as the most parsimonious explanation, supported by three independent lines of evidence\.

The first is the dimensional gradient\. Cofactor expansion of ann×nn\\times ndeterminant requires maintainingnnactive sign states across recursive sub\-expansions simultaneously — a demand that grows superlinearly with dimension\. The same algorithm, applied by the same models, succeeds reliably at3×33\\times 3and collapses at5×55\\times 5\. The failure cannot be attributed to a knowledge deficit: models correctly describe the algorithm at5×55\\times 5before failing to execute it\.

The second is the error regime flip\. At3×33\\times 3, 64% of failures are execution errors — models attempt computation and fail within it\. At5×55\\times 5, 47\.1% of failures arehallucination, of which 81\.7% are Complete\_Collapse \(explicit abandonment; the full Abandonment block including Silent\_Omission and Premature\_Assertion reaches 89\.4%\)\. Gradual degradation would produce monotonically increasing errors of all types; the data shows a categorical mode switch\.

The third is scale\-emergent error types\.generation\_loop,memory\_loss, andvariable\_entanglementare all absent at3×33\\times 3and appear only at4×44\\times 4and5×55\\times 5— mechanistically impossible when computational chains are short, emerging precisely at the dimensional boundary where working memory load increases\. The universality of the threshold across model tiers, architectures, and training paradigms rules out a training\-data explanation\.

sign\_errordoes not follow this pattern\. Sign error persistence across dimensions suggests a tokenisation\-level phenomenon — consistent with known transformer limitations in tracking arithmetic sign state\(Nogueiraet al\.,[2021](https://arxiv.org/html/2605.16675#bib.bib23)\)and known BPE tokenisation instabilities in numerical representations\(Brownet al\.,[2020](https://arxiv.org/html/2605.16675#bib.bib2)\)— a separate mechanism requiring separate explanation\. Architectural hypotheses generated by this account are documented as falsifiable conjectures in Appendix[L](https://arxiv.org/html/2605.16675#A12)\.

### 6\.2Algorithmic Efficiency vs\. Autoregressive Complexity

In our zero\-shot evaluations, solution strategy rigidity emerged as a near\-perfect predictor of5×55\\times 5determinant accuracy: models rigidly locked into cofactor expansion collapsed entirely, while successful models \(e\.g\., OpenAI\-o1, Gemini\-3\.0\-Pro\) employed Gaussian\-dominant strategies\. The intuitive conclusion was that operation count was the primary causal bottleneck: cofactor expansion requiresO\(n\!\)O\(n\!\)operations versusO\(n3\)O\(n^\{3\}\)for Gaussian elimination, saturating working memory through recursive parity\-tracking\.

To test whether this predictor represented the true causal bottleneck, we conducted a targeted ablation study\. Five cofactor\-dominant models \(GPT\-4o, Llama\-3\.3\-70B, Qwen2\.5\-72B, Mistral\-Large, and Claude\-4\.5\-Sonnet\) were explicitly instructed to abandon cofactor expansion and use Gaussian elimination exclusively\. If total operation count were the true bottleneck, enforcing the algorithmically efficientO\(n3\)O\(n^\{3\}\)path should have recovered accuracy\.

It did not\. As shown in Figure[6\.1](https://arxiv.org/html/2605.16675#S6.F1)\(Panel A\), accuracy remained at or near 0% for fully collapsed models, and did not meaningfully recover for partially capable models — Mistral\-Large achieved only 26\.9% on previously failed problems, and Claude\-4\.5\-Sonnet a mere 5\.3% \(n=19n\{=\}19\)\. The sole exception is DeepSeek\-V3, which achieves 74% determinant accuracy and 10% eigenvalue accuracy at5×55\\times 5— the only model to sustain meaningful performance at both the Recursive and Compositional levels while remaining cofactor\-dominant \(78%\)\. Its success stems from superior execution quality: systematic sub\-minor simplification allows it to artificially reduce the autoregressive depth of the cofactor algorithm, representing a unique, partial solution to the same underlying bottleneck and demonstrating that execution quality can partially substitute for strategy adaptation\.

Forensic trace analysis \(Figure[6\.1](https://arxiv.org/html/2605.16675#S6.F1), Panel B\) reveals the mechanism and exposes a fundamental dichotomy between algorithmic complexity and autoregressive complexity\. While Gaussian elimination minimizes total operations, it is autoregressively deep: it requires a strictly sequential chain of highly dependent fractional arithmetic \(e\.g\.,R3←R3−173R1R\_\{3\}\\leftarrow R\_\{3\}\-\\frac\{17\}\{3\}R\_\{1\}\)\. Panel B shows that the moment this fractional dependency is introduced — typically between steps 2 and 4 — execution survival plummets across all five models\.

A single hallucinated numerator or denominator early in the row\-reduction acts as a poison pill\. Because the transformer must attend to its own corrupted fractions to compute each subsequent update, the error cascades irreversibly — consistent with snowballing hallucination dynamics\(Zhanget al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib33)\)— collapsing the final diagonal product to zero or a spurious value\. This is not a failure of strategy selection — it is a failure of precision maintenance under sequential dependency\.

This structural vulnerability explains why cofactor expansion dominates at4×44\\times 4across most models — and why Tier 3 models that remain locked into cofactor at5×55\\times 5collapse entirely\. While cofactor requires vastly more absolute operations, its steps are autoregressively flat: broad, parallel, integer\-based minor calculations with no fractional carry\-forward between terms\. It is mathematically safer for a transformer to execute a high volume of independent integer steps than a low volume of deeply dependent fractional ones\. The primary bottleneck in LLM mathematical reasoning is not the volume of computation, but the model’s inability to maintain strict numerical attention across deep, sequentially dependent fractional chains without error propagation — a failure mode that process reward models were specifically designed to address\(Lightmanet al\.,[2023](https://arxiv.org/html/2605.16675#bib.bib15)\)\.

![Refer to caption](https://arxiv.org/html/2605.16675v1/x3.png)Figure 6\.1:Forced Gaussian elimination ablation\.Panel A:5×55\\times 5determinant accuracy under natural zero\-shot prompting versus forced Gaussian elimination for five cofactor\-dominant models\. Enforcing the algorithmically efficientO\(n3\)O\(n^\{3\}\)strategy yields no meaningful accuracy recovery\. Mistral\-Large shows partial improvement \(26\.9% on previously failed problems\); Claude\-4\.5\-Sonnet: 5\.3% \(n=19n\{=\}19\)\.Panel B: Percentage of responses maintaining error\-free execution at each computation step\. All models sustain error\-free execution through step 2; collapse concentrates at steps 3–4 when the first sequentially dependent fractional row operations are introduced\.

## 7Discussion

LinAlg\-Bench contributes to evaluation methodology by demonstrating that diagnostic benchmarks can generate falsifiable mechanistic hypotheses rather than merely leaderboard rankings\. A benchmark evaluating determinants at a single matrix size would classify Llama\-3\.3\-70B, GPT\-4o, and Claude\-4\.5\-Sonnet as capable — all achieve≥98\\geq 98% at3×33\\times 3— while concealing complete collapse at5×55\\times 5\. The fabrication\-to\-abandonment threshold, the two\-stage working memory account, and the constraint\-aware confabulation finding are each directly testable by independent replication and mechanistic interpretability tools\.

Several limitations constrain the scope of claims\. First, the benchmark covers linear algebra only — a domain with precise algorithmic structure and exact verifiable answers\. Whether the fabrication\-to\-abandonment threshold generalises to other mathematical domains remains an open empirical question\. Second, all models were evaluated at temperature 0 with a single run per problem; higher\-temperature majority voting may alter the threshold location, and self\-consistency sampling is a natural extension\. Third, integer\-entry matrices and exact\-match evaluation eliminate floating\-point ambiguity but exclude numerical robustness as a failure surface\. Fourth, tool availability and API\-level constraints mean that some model behaviours — particularly GPT\-5\.2’s apparent tool circumvention — cannot be definitively attributed to specific implementation choices without access to inference infrastructure\.

Three directions follow directly\. First, constraint\-aware confabulation warrants cross\-domain investigation: whether LLMs fabricate constraint\-consistent answers under computational overload in chemistry, physics, or statistics is a testable hypothesis with direct evaluation methodology implications\. Second, the working memory account generates specific mechanistic predictions testable by activation patching at the cofactor expansion steps\. Third, the ablation finding — that enforcing correct strategy does not recover accuracy — suggests that multi\-step reasoning training curricula should be evaluated on their ability to induce execution robustness, not just accuracy at fixed difficulty\.

## 8Conclusion

LinAlg\-Bench reveals that LLM mathematical failure is not random but structurally constrained — determined by algorithm family, matrix dimension, and autoregressive execution depth\. The fabrication\-to\-abandonment threshold at4×44\\times 4scale, near\-universal across model tiers and architectures, is consistent with a working memory limit rather than a knowledge gap\. Two supporting findings strengthen this account: scale\-emergent error types appear at precisely the dimensional boundary where recursive depth increases, and a forced\-strategy ablation shows that enforcing algorithmically efficient methods does not recover accuracy — the bottleneck is execution depth, not method selection\. Constraint\-aware confabulation — models fabricating eigenvalues that satisfy verifiable mathematical constraints — demonstrates that hallucination under computational overload is structured by partial knowledge rather than arbitrary\. These are falsifiable claims that follow directly from the data and invite verification by mechanistic interpretability tools and cross\-domain replication\. All data, model outputs, error labels, and judge pipeline are publicly released\.

#### Data Availability\.

## References

- The Claude 3 model family: Opus, Sonnet, Haiku\.Technical reportAnthropic\.External Links:[Link](https://anthropic.com/)Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- T\. Brown, B\. Mann, N\. Ryder, M\. Subbiah, J\. Kaplan, P\. Dhariwal, A\. Neelakantan, P\. Shyam, G\. Sastry, A\. Askell,et al\.\(2020\)Language models are few\-shot learners\.Advances in Neural Information Processing Systems33,pp\. 1877–1901\.Cited by:[§6\.1](https://arxiv.org/html/2605.16675#S6.SS1.p5.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p1.1)\.
- A\. Conmy, A\. N\. Mavor\-Parker, A\. Lynch, S\. Heimersheim, and A\. Garriga\-Alonso \(2023\)Towards automated circuit discovery for mechanistic interpretability\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p3.1)\.
- DeepSeek\-AI \(2024\)DeepSeek\-V3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- N\. Dziri, X\. Lu, M\. Sclar, X\. L\. Li, L\. Jian, B\. Y\. Lin, P\. West, C\. Bhagavatula, R\. Le Bras, J\. D\. Hwang, S\. Sanyal, S\. Welleck, X\. Ren, A\. Ettinger, Z\. Harchaoui, and Y\. Choi \(2023\)Faith and fate: limits of transformers on compositionality\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p2.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.InInternational Conference on Machine Learning,Cited by:[§5\.3](https://arxiv.org/html/2605.16675#S5.SS3.p2.1)\.
- Google DeepMind \(2024\)Gemini 1\.5: unlocking multimodal understanding across millions of tokens of context\.arXiv preprint arXiv:2403\.05530\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Kadavath, A\. Arora, S\. Basart, E\. Tang, D\. Song, and J\. Steinhardt \(2021\)Measuring mathematical problem solving with the MATH dataset\.InAdvances in Neural Information Processing Systems,Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p1.1)\.
- J\. Huang, X\. Chen, S\. Mishra, H\. S\. Zheng, A\. W\. Yu, X\. Song, and D\. Zhou \(2023\)Large language models cannot self\-correct reasoning yet\.arXiv preprint arXiv:2310\.01798\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p2.1)\.
- S\. Kadavath, T\. Conerly, A\. Askell, T\. Henighan, D\. Drain, E\. Perez, N\. Schiefer, Z\. Hatfield\-Dodds, N\. DasSarma, E\. Tran\-Johnson,et al\.\(2022\)Language models \(mostly\) know what they know\.arXiv preprint arXiv:2207\.05221\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p2.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[Appendix A](https://arxiv.org/html/2605.16675#A1.p1.1)\.
- G\. Lample and F\. Charton \(2020\)Deep learning for symbolic mathematics\.InInternational Conference on Learning Representations,Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p1.1)\.
- N\. Lee, K\. Sreenivasan, J\. D\. Lee, K\. Lee, and D\. Papailiopoulos \(2023\)Teaching arithmetic to small transformers\.arXiv preprint arXiv:2307\.03381\.Cited by:[§5\.2](https://arxiv.org/html/2605.16675#S5.SS2.p2.4)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§6\.2](https://arxiv.org/html/2605.16675#S6.SS2.p6.2)\.
- H\. Liu, Z\. Zheng, Y\. Qiao, H\. Duan, Z\. Fei, F\. Zhou, W\. Zhang, S\. Zhang, D\. Lin, and K\. Chen \(2024\)MathBench: evaluating the theory and application proficiency of LLMs with a hierarchical mathematics benchmark\.InFindings of the Association for Computational Linguistics: ACL 2024,pp\. 6884–6915\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p1.1)\.
- J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald \(2020\)On faithfulness and factuality in abstractive summarization\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics,pp\. 1906–1919\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p2.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\)Locating and editing factual associations in GPT\.Advances in Neural Information Processing Systems35\.Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p3.1)\.
- Meta AI \(2024\)The Llama 3 herd of models\.arXiv preprint arXiv:2407\.21783\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- A\. Meurer, C\. P\. Smith, M\. Paprocki, O\. Čertík, S\. B\. Kirpichev, M\. Rocklin, A\. Kumar, S\. Ivanov, J\. K\. Moore, S\. Singh,et al\.\(2017\)SymPy: symbolic computing in Python\.PeerJ Computer Science3,pp\. e103\.Cited by:[Appendix C](https://arxiv.org/html/2605.16675#A3.p1.1),[1st item](https://arxiv.org/html/2605.16675#A5.I1.i1.p1.1),[§3](https://arxiv.org/html/2605.16675#S3.p2.3)\.
- S\. Mishra, M\. Finlayson, P\. Lu, L\. Tang, S\. Chang, T\. Kwiatkowski, C\. B\. Shiue, S\. Welleck, C\. Baral, Y\. Choi,et al\.\(2022\)Lila: a unified benchmark for mathematical reasoning\.InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing,Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p1.1)\.
- Mistral AI \(2024\)Mistral large\.Technical reportMistral AI\.External Links:[Link](https://mistral.ai/)Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- R\. Nogueira, Z\. Jiang, and J\. Lin \(2021\)Investigating the limitations of transformers with simple arithmetic tasks\.arXiv preprint arXiv:2102\.13019\.Cited by:[§6\.1](https://arxiv.org/html/2605.16675#S6.SS1.p5.1)\.
- OpenAI \(2024a\)GPT\-4 technical report\.Technical reportOpenAI\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- OpenAI \(2024b\)OpenAI o1 system card\.Technical reportOpenAI\.External Links:[Link](https://openai.com/index/openai-o1-system-card)Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- OpenRouter \(2024\)OpenRouter: a unified API for LLMs\.External Links:[Link](https://openrouter.ai/)Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2022\)Measuring and narrowing the compositionality gap in language models\.arXiv preprint arXiv:2210\.03350\.Cited by:[§4](https://arxiv.org/html/2605.16675#S4.p3.3)\.
- Qwen Team \(2024\)Qwen2\.5 technical report\.arXiv preprint arXiv:2412\.15115\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- Qwen Team \(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§A\.4](https://arxiv.org/html/2605.16675#A1.SS4.SSS0.Px1.p1.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§5\.3](https://arxiv.org/html/2605.16675#S5.SS3.p2.1)\.
- J\. Vig \(2019\)A multiscale visualization of attention in the transformer model\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics: System Demonstrations,pp\. 37–42\.External Links:[Document](https://dx.doi.org/10.18653/v1/P19-3007)Cited by:[§2](https://arxiv.org/html/2605.16675#S2.p3.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. Chi, Q\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems,Vol\.35\.Cited by:[Appendix A](https://arxiv.org/html/2605.16675#A1.p1.1)\.
- M\. Zhang, O\. Press, W\. Merrill, A\. Liu, and N\. A\. Smith \(2023\)How language model hallucinations can snowball\.arXiv preprint arXiv:2305\.13534\.Cited by:[§6\.2](https://arxiv.org/html/2605.16675#S6.SS2.p5.1)\.
- L\. Zheng, W\. Chiang, Y\. Sheng, S\. Zhuang, Z\. Wu, Y\. Zhuang, Z\. Lin, Z\. Li, D\. Li, E\. Xing, H\. Zhang, J\. E\. Gonzalez, and I\. Stoica \(2023\)Judging LLM\-as\-a\-judge with MT\-Bench and chatbot arena\.InAdvances in Neural Information Processing Systems,Vol\.36\.Cited by:[§5\.1](https://arxiv.org/html/2605.16675#S5.SS1.p2.9)\.

## Appendix ABenchmark Inference Prompts

All models were evaluated zero\-shot at temperature 0\. The prompt consists of a system prompt and a user prompt, both fixed per subcategory\. A base zero\-shot chain\-of\-thought \(ZS\-CoT\) instruction was included to elicit reasoning traces\[Weiet al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib32), Kojimaet al\.,[2022](https://arxiv.org/html/2605.16675#bib.bib12)\]\(*“Show all computation steps clearly”*;*“Show your work”*\) but no few\-shot examples, method guidance, or algorithm\-specific instructions were provided\. Tool use was disabled for all models via API configuration\.

### A\.1System Prompt

The system prompt is shared across all subcategories\. Only the answer\-type directive \(underlined below\) varies:

You are a precise mathematical assistant\.Show all computation steps clearly\.Always put your final\[answer\-type\]answer inside\\boxed\{\}\.

Table A\.1:Answer\-type directive by subcategory\.
### A\.2User Prompt Template

The user prompt template is identical across all nine subcategories:

Solve this linear algebra problem\.Show your work and give the final answer\.\{problem\_text\} \{problem\_matrix\}

where\{problem\_text\}is the task instruction \(e\.g\. “Compute the determinant of matrixAA”\) and\{problem\_matrix\}is the matrix in one of three input formats described in Section[A\.3](https://arxiv.org/html/2605.16675#A1.SS3)\.

### A\.3Input Format Examples

The same3×33\{\\times\}3determinant problem is shown below in all three input formats used in the benchmark\. Format sensitivity was evaluated at4×44\{\\times\}4and5×55\{\\times\}5only; the3×33\{\\times\}3dimension is shown here for brevity\.

Table A\.2:Three input format representations of the same determinant problem\. All three encode identical numerical content; only the notation differs\.
### A\.4Model Details and Selection Rationale

Table[A\.3](https://arxiv.org/html/2605.16675#A1.T3)reports the exact model identifiers and access configuration used for all benchmark inference runs, sourced directly from the production inference code\. All models were evaluated zero\-shot at temperature 0 with tool use and external code execution disabled\. Token budgets were set to 8,192 for most subcategories and 16,384 for DeepSeek\-V3, GPT\-5\.2, and Qwen3\-235B\. Responses reaching the token ceiling without producing a final answer were classified asgeneration\_truncation, strictly separating infrastructure limits from computational collapse\.

#### Reproducibility window\.

To establish a reproducible temporal baseline, all inference across all 10 models was conducted within a strict window between 25 December 2025 and 3 May 2026\. Six models were accessed via the OpenRouter routing layer\[OpenRouter,[2024](https://arxiv.org/html/2605.16675#bib.bib26)\]: DeepSeek\-V3\[DeepSeek\-AI,[2024](https://arxiv.org/html/2605.16675#bib.bib5)\], Qwen3\-235B and Qwen2\.5\-72B\[Qwen Team,[2025](https://arxiv.org/html/2605.16675#bib.bib29),[2024](https://arxiv.org/html/2605.16675#bib.bib28)\], Mistral\-Large\[Mistral AI,[2024](https://arxiv.org/html/2605.16675#bib.bib22)\], Claude\-4\.5\-Sonnet\[Anthropic,[2024](https://arxiv.org/html/2605.16675#bib.bib1)\], and Llama\-3\.3\-70B\[Meta AI,[2024](https://arxiv.org/html/2605.16675#bib.bib18)\]\. OpenRouter does not guarantee snapshot pinning; the date window bounds this uncertainty\. Four models were accessed via direct provider APIs: OpenAI\-o1, GPT\-4o, and GPT\-5\.2 via the OpenAI API\[OpenAI,[2024a](https://arxiv.org/html/2605.16675#bib.bib24),[b](https://arxiv.org/html/2605.16675#bib.bib25)\], and Gemini\-3\.0\-Pro via the Google AI Studio GenAI SDK\[Google DeepMind,[2024](https://arxiv.org/html/2605.16675#bib.bib8)\]\.

Table A\.3:Model identifiers and inference configuration\.Note\.All model IDs copied verbatim from production inference code\. Paper name Gemini\-3\.0\-Pro refers to model IDgemini\-3\.1\-pro\-preview\. Paper name DeepSeek\-V3 refers to the base/chat MoE modeldeepseek\-v3\.2; this is not the DeepSeek\-R1 reasoning model\.

### A\.5Model Selection Rationale

The 10 models were selected to cover five independent axes of variation, enabling the benchmark to test whether failure patterns are specific to particular training regimes, architectures, or providers\.

Table A\.4:Selection rationale\. Models were selected to cover five independent axes of variation: training paradigm, architecture, provider, weight availability, and scale\. No model was selected on the basis of expected performance outcome\. Selection was finalised before any benchmark runs were conducted\.
### A\.6Excluded Models

Qwen\-2\.5\-Math\-72B was accessed via a HuggingFace Space inference endpoint—not an API\-controlled environment—and is excluded as results are not reproducible under the same inference protocol\. Claude\-3\.5\-Sonnet was registered as a standby model and replaced by Claude\-4\.5\-Sonnet before benchmark runs began; no Claude\-3\.5\-Sonnet results are included in the benchmark\.

## Appendix BBenchmark Task Descriptions

This appendix describes all nine subcategories in the LinAlg\-Bench benchmark\. For each task we show the exact instruction seen by the model, the algorithmic steps required for a correct solution, and how computational demand scales across the three matrix dimensions\. Tasks are grouped by cognitive level\.

#### Trace· Cognitive level: Reading

Instruction:Compute the trace of matrixAA\. Put your final numerical answer inside\\\\backslashboxed\{\}\.

Required steps:Sum the main diagonal entriesa11\+a22\+⋯\+anna\_\{11\}\+a\_\{22\}\+\\cdots\+a\_\{nn\}\.

Expected output:Single integer scalar\.

Primary failure at 5×\\times5:None — trace is format\-invariant and dimension\-invariant\. Serves as ceiling baseline\.

#### Transpose· Cognitive level: Reading

Instruction:Compute the transpose of matrixAA\. Put your final matrix answer inside\\\\backslashboxed\{\}\.

Required steps:Reflect all entries across the main diagonal: entry\(i,j\)\(i,j\)moves to\(j,i\)\(j,i\)\.

Expected output:n×nn\\times nmatrix\.

Primary failure at 5×\\times5:None — purely positional, no arithmetic\. Serves as ceiling baseline\.

#### Matrix\-Vector Multiplication· Cognitive level: Arithmetic

Instruction:Compute the matrix\-vector productAxAx\. Put your final vector answer inside\\\\backslashboxed\{\}\.

Required steps:For each row ofAA, compute the dot product with vectorxx: sum ofaij⋅xja\_\{ij\}\\cdot x\_\{j\}acrossjj\.

Expected output:Column vector of lengthnn\.

Primary failure at 5×\\times5:Product\_Sign\_Error and Double\_Negative\_Trap when operands are negative\.

#### Matrix Multiplication· Cognitive level: Arithmetic

Instruction:Compute the matrix productABAB, whereAAandBBare bothn×nn\\times nmatrices\. Put your final matrix answer inside\\\\backslashboxed\{\}\.

Required steps:For each output entry\(i,j\)\(i,j\), compute dot product of rowiiofAAwith columnjjofBB\.

Expected output:n×nn\\times nmatrix\.

Primary failure at 5×\\times5:Rule\_Interference and Double\_Negative\_Trap; errors compound across all 25 entries\.

#### Matrix Power· Cognitive level: Arithmetic

Instruction:ComputeAkA^\{k\}\. Put your final matrix answer inside\\\\backslashboxed\{\}\.

Required steps:ComputeA2=A⋅AA^\{2\}=A\\cdot A, thenA3=A2⋅AA^\{3\}=A^\{2\}\\cdot A, repeatingk−1k\-1multiplications\. Each multiplication requiresn2n^\{2\}dot products of lengthnn\.

Expected output:n×nn\\times nmatrix\.

Primary failure at 5×\\times5:Compounding sign errors — a single Product\_Sign\_Error in an early power corrupts all subsequent powers\.

#### Rank· Cognitive level: Sequential

Instruction:Compute the rank of matrixAA\. Put your final integer answer inside\\\\backslashboxed\{\}\.

Required steps:Perform row reduction \(Gaussian elimination\) to echelon form\. Count non\-zero pivot rows\.

Expected output:Integer in\{0,1,…,n\}\\\{0,1,\\ldots,n\\\}\.

Primary failure at 5×\\times5:Memory\_Loss — model loses earlier pivot state when deciding whether late rows are zero\.

#### Nullity· Cognitive level: Sequential

Instruction:Compute the nullity of matrixAA\. Put your final integer answer inside\\\\backslashboxed\{\}\.

Required steps:Compute rank via row reduction, then apply rank\-nullity theorem: nullity=n−rank=n\-\\text\{rank\}\.

Expected output:Integer in\{0,1,…,n\}\\\{0,1,\\ldots,n\\\}\.

Primary failure at 5×\\times5:Inherits all rank failures; additionally, off\-by\-one errors in pivot counting propagate directly\.

#### Determinant· Cognitive level: Recursive

Instruction:Compute the determinant of matrixAA\. Put your final numerical answer inside\\\\backslashboxed\{\}\.

Required steps:Cofactor expansion:det\(A\)=∑ja1j⋅\(−1\)1\+j⋅M1j\\det\(A\)=\\sum\_\{j\}a\_\{1j\}\\cdot\(\-1\)^\{1\+j\}\\cdot M\_\{1j\}whereM1jM\_\{1j\}is the\(n−1\)×\(n−1\)\(n\-1\)\\times\(n\-1\)minor\. Alternatively: Gaussian elimination to upper triangular form, then multiply diagonal\.

Expected output:Single integer scalar\.

Primary failure at 5×\\times5:Alternating\_Drift \(cofactor models lose±1\\pm 1checkerboard after 3–4 terms\); Gaussian\-dominant models maintain high accuracy\. Strategy is the primary predictor of success at 5×\\times5\.

#### Eigenvalues· Cognitive level: Compositional

Instruction:Find all eigenvalues of matrixAA\. Put your final answers inside\\\\backslashboxed\{\}\.

Required steps:Stage 1: FormA−λIA\-\\lambda I\. Stage 2: Expanddet\(A−λI\)\\det\(A\-\\lambda I\)to obtain characteristic polynomial of degreenn\. Stage 3: Find all roots of the polynomial\.

Expected output:Set ofnneigenvalues \(may be real or complex\)\.

Primary failure at 5×\\times5:Complete\_Collapse dominates at 5×\\times5 \(80\.4% of failures\): models correctly identify the task, then abandon computation with meta\-statements citing complexity, fabricating eigenvalues via simulated tool calls\.DeepSeek\-V3andOpenAI\-o1demonstrate the task is solvable with sufficient execution capacity\.

## Appendix CFull Accuracy Tables

Tables[C\.1](https://arxiv.org/html/2605.16675#A3.T1)–[C\.3](https://arxiv.org/html/2605.16675#A3.T3)report complete per\-model, per\-task accuracy across all three matrix dimensions\. All models were evaluated zero\-shot at temperature 0\. The benchmark comprises 9 tasks across 5 cognitive levels, with 220 questions per dimension \(660 total\)\. Ground truth verified using SymPy symbolic computation\[Meureret al\.,[2017](https://arxiv.org/html/2605.16675#bib.bib20)\]\. Models ordered by tier membership\.

Table C\.1:Complete accuracy table — 3×\\times3 matrices \(220 questions, 10 models\)\. Numbers show correct responses out of maximum per category\.Table C\.2:Complete accuracy table — 4×\\times4 matrices \(220 questions, 10 models\)\.Table C\.3:Complete accuracy table — 5×\\times5 matrices \(220 questions, 10 models\)\. Tier 1: OpenAI\-o1, Gemini\-3\.0\-Pro, DeepSeek\-V3, Qwen3\-235B\. Tier 2: GPT\-5\.2, Mistral\-Large\. Tier 3: Claude\-4\.5\-Sonnet, Qwen2\.5\-72B, Llama\-3\.3\-70B, GPT\-4o\.
## Appendix DFormat Sensitivity Analysis

Format sensitivity was evaluated across three input representations — LaTeX \(standard bmatrix notation\), List \(nested Python\-style list\), and Tabular \(pipe\-delimited grid\) — at4×44\{\\times\}4and5×55\{\\times\}5matrix dimensions only; the3×33\{\\times\}3dimension was excluded as pilot evaluation confirmed near\-universal ceiling performance across all formats at that scale, making format the least interesting variable to study\.n=200n\{=\}200questions per model per format across 9 subcategories\.

At4×44\{\\times\}4format sensitivity is negligible across all tiers\. At5×55\{\\times\}5it is strictly complexity\-gated:

- •Tier 1 modelsremain format\-robust\. Gemini achieves 30/30 under all formats at 5×\\times5\.
- •Tier 2 modelsshow divergent model\-specific sensitivity\. GPT\-5\.2 incurs aLaTeX penalty\(14/30 LaTeX vs 26–27/30 Tabular/List\); Mistral shows the opposite\. These are opposing effects, not systematic format bias\.
- •Tier 3 modelscollapse completely regardless of format — GPT\-4o achieves 0/30 and Llama at most 1/30 across all three formats\. The dimensional boundary, not input notation, is the binding constraint\.

The key implication: format sensitivity is asecond\-order effectthat manifests only when working memory is partially saturated \(Tier 2 at 5×\\times5\)\. When sufficient \(Tier 1\) or fully saturated \(Tier 3\), format is irrelevant\.

![Refer to caption](https://arxiv.org/html/2605.16675v1/x4.png)Figure D\.1:Format sensitivity across matrix dimensions and model tiers \(determinant subcategory,n=30n\{=\}30per model per format\)\. At4×44\{\\times\}4\(top row\) format variance is minimal\. At5×55\{\\times\}5\(bottom row\) format sensitivity emerges but is strictly complexity\-gated\. Red\-shaded panel indicates complete Tier 3 collapse\.†\\daggerOpenAI\-o1 excluded due to inference cost constraints\.Table D\.1:Format sensitivity raw correct counts — 4×\\times4 matrices \(200 questions per model per format, 9 subcategories\)\. Three sub\-columns per model: LaTeX, List, Tabular\. All counts from a fully enumerated evaluation\.†\\daggerOpenAI\-o1 excluded\.Table D\.2:Format sensitivity raw correct counts — 4×\\times4 matrices, Tier 3 models\.Table D\.3:Format sensitivity raw correct counts — 5×\\times5 matrices \(200 questions per model per format, 9 subcategories\)\. Eigenvalue row shows near\-universal zero across all models and formats, confirming eigenvalue collapse is format\-invariant\.†\\daggerOpenAI\-o1 excluded\. Format sensitivity covers 9 models×\\times3 formats×\\times200 questions = 5,400 evaluations per dimension\.

Table D\.4:Format sensitivity raw correct counts — 5×\\times5 matrices, Tier 3 models\.†\\daggerOpenAI\-o1 excluded\. Format sensitivity covers 9 models×\\times3 formats×\\times200 questions = 5,400 evaluations per dimension\.

## Appendix EJudge Pipeline Validation

The forensic classification pipeline operates in two independent stages\. Both stages useGemini 3\.1 Pro Previewattemperature 0\.0for deterministic output\.

### E\.1Build Judge \(Stage 1 — First\-Pass Classification\)

The Build Judge acts as a forensic auditor\. For each failed response:

- •For all subcategories except eigenvalues,the Build Judgeindependently computes the correct solution from scratch before reading the model response\. Gemini\-3\.0\-Pro achieves perfect accuracy on all non\-eigenvalue subcategories across all three matrix dimensions, confirming it as a reliable computation oracle for the judge role\. All judge outputs are additionally cross\-validated against the SymPy\-certified ground truth\[Meureret al\.,[2017](https://arxiv.org/html/2605.16675#bib.bib20)\]as a consistency check\. For eigenvalues, the Build Judge receives the SymPy\-certified answer directly as its reference and focuses solely on tracing where and how the model’s reasoning diverged from the correct solution path\.
- •Traces the model responsestep\-by\-stepto identify thefirst erroneous value, not just the final wrong answer\.
- •Applies theMagnitude Ruleas the primary boundary enforcer betweensign\_errorandarithmetic: if\|wrong\|=\|correct\|\\lvert\\text\{wrong\}\\rvert=\\lvert\\text\{correct\}\\rvert\(element\-wise for matrices/vectors\)→\\rightarrowsign\_error; if magnitudes differ→\\rightarrowarithmetic\.
- •RecordsSolution\_Strategy\(e\.g\., cofactor expansion vs\. Gaussian elimination\) alongside the error tag\.
- •OutputsFirst\_Error\_StepandFirst\_Error\_Descriptionfor full forensic traceability\.
- •Requires aconfidence level\(high/medium/low\) for every classification\.

### E\.2Validate Judge \(Stage 2 — Independent Verification\)

Stage 2 is not an independent classifier but anindependent verification stepdesigned to stress\-test the Build Judge’s proposed hypothesis against strict boundary rules, minimising false positives:

- •Asecond Gemini callis made with no access to the Build Judge’s reasoning or forensic observation — only the error tag and original model response are passed\.
- •Atag\-specific verification questionis injected per classification \(see Table E\.1\), enforcing a targeted diagnostic check\.
- •The Validate Judge canconfirm\(true\),correct\(false\+ corrected tag\), orflag\(needs\_review\)\.
- •Final tag cascade: if validated==falseandcorrected\_tagnon\-empty→\\rightarrowcorrected\_tagbecomesfinal\_tag; otherwise originalError\_Tagis retained\.

Overall Validate Judge agreement was 83\.0% at3×33\{\\times\}3, 83\.2% at4×44\{\\times\}4, and 82\.0% at5×55\{\\times\}5\. Per\-tag agreement rates are reported in Table E\.3\.

### E\.3Human Batch Review \(Stage 3 — Taxonomy Refinement\)

In a third stage, systematic disagreements between Build and Validate Judge outputs were reviewed manually at batch level using a structured spreadsheet audit\. Recurring disagreement patterns identified ambiguities in the taxonomy definitions, which were resolved by refining the judge prompts iteratively\. This process produced the final taxonomy — for example, deprecating the overly broadFluency\_Maskingtag in favour ofSilent\_Sign\_Flip, and splittingtranscriptionintoinput\_transcriptionandcarry\_down\_error\.

Table E\.1:Tag\-specific verification questions injected into the Validate Judge prompt\. For each primary tag assigned by the Build Judge, the Validate Judge receives the corresponding verification question and boundary rule before confirming or correcting the classification\.Table E\.2:Error tag and sub\-tag applicability by problem subcategory\. A checkmark \(✓\\checkmark\) indicates the tag can structurally occur in that subcategory; a dash \(—\) indicates structural impossibility confirmed by the pipeline taxonomy\. Ten primary tags \(rows 1–10\) apply across all subcategories with the exception oftrace, which has nosign\_error,hallucination, orcarry\_down\_errordue to its purely additive nature\. Four eigenvalue\-specific tags \(†\\dagger\) apply exclusively to theeigsubcategory, reflecting unique failure modes in characteristic polynomial expansion\.†\\daggerEigenvalue\-specific tags \(generation\_loop,algebraic\_precedence,false\_verification,variable\_entanglement\) are additional primary tags that arise exclusively during the root\-finding stage of characteristic polynomial expansion\. The benchmark abstract references ten primary tags, which reflects thebase\_primary\_tagsapplied across all subcategories; these four are additional and eigenvalue\-only\.

### E\.4Agreement Scope and Distributional Validity

Tag\-level agreement rates in Table E\.3 vary substantially across dimensions\. This variation is expected and does not undermine the pipeline’s distributional findings for three reasons\.

First, the five tags that drive the distributional analysis in Section 5 —sign\_error,hallucination,arithmetic,input\_transcription, andmethod\_fail— collectively account for91\.5% of all classified failures\. These five tags achieve the highest and most consistent agreement rates across all dimensions \(Table E\.3\), confirming that the pipeline is most reliable precisely where it matters most\.

Second, tags with<70%\{<\}70\\%agreement at any dimension \(generation\_truncation,memory\_loss,variable\_entanglement,algebraic\_precedence,other\_unmapped,generation\_loop\) are low\-frequency tags that collectively account for fewer than8\.5% of failures\. Their exclusion from distributional analysis in Section 5 is conservative by design — these tags are retained in the full taxonomy for completeness but do not drive any quantitative claims in the paper\.

Third, the3×33\{\\times\}3dimension achieves 100% human agreement, and4×44\{\\times\}4achieves 96\.7%, confirming that pipeline reliability degrades gracefully with complexity rather than collapsing\. The5×55\{\\times\}5figure of 89\.7% reflects the genuine difficulty of classifying failures at maximum recursive depth, not a systematic pipeline bias\.

Table E\.3:Per\-tag Validate Judge Agreement Rates†\\daggerThese tags fall below 70% agreement at the indicated dimension; cases at that dimension are excluded from error rate calculations in Section 5\. Together,†\\dagger,‡\\ddagger, and∗\*tags account for only 42 of 1,156 total failures \(3\.6%\) and do not affect any distributional finding reported in the paper\.

‡\\ddaggerAgreement<70%\{<\}70\\%at4×44\{\\times\}4;4×44\{\\times\}4cases excluded from error rate calculations in Section 5\.

∗\*Agreement<70%\{<\}70\\%across all dimensions; fully excluded from distributional analysis in Section 5\. The five primary tags \(hallucination,sign\_error,arithmetic,input\_transcription,method\_fail\) account for 91\.5% of all classified failures and maintain the highest agreement rates throughout\.

### E\.5Pipeline vs\. Human\-Labeled Agreement

This appendix reports pipeline agreement with independently hand\-labeled responses\. Human validation was weighted toward5×55\{\\times\}5cases where automated agreement is lowest\. Labeling was performed by the authors independently of the pipeline, with disagreements resolved by a third pass\.

Table E\.4:Pipeline agreement with human\-labeled responses\.

## Appendix FSign Error Subtype Analysis

Sign errors are classified into three structural families:Parity\(failure to apply or resolve\(−1\)i\+j\(\-1\)^\{i\+j\}in cofactor expansion — exclusive to determinant and eigenvalue subcategories\),Product\(wrong sign in a multiplication or scaling step — occurs across all arithmetic tasks\), andStandalone\(isolated sign flip orthogonal to both mechanisms — occurs across all subcategories with signed outputs\)\. Total sign errors across all dimensions: 270 \(3×\\times3: 33, 4×\\times4: 123, 5×\\times5: 114\)\.

The subtype distribution shifts systematically with matrix dimension\. At 3×\\times3 \(nn=33\), the product block dominates at 75\.8%, contributed by Product\_Sign\_Error \(48\.5%\), Double\_Negative\_Trap \(21\.2%\), and Rule\_Interference \(6\.1%\)\. The parity block is near\-absent at 3\.0%, contributed entirely by a single Parity\_Sign\_Error case; Alternating\_Drift and Cofactor\_Neglect record zero cases as the expansion sequence is too shallow to stress sustained parity tracking\. At 4×\\times4 \(nn=123\), the parity block surges to 61\.0% — the single largest shift in the dataset — contributed by Alternating\_Drift \(34\.1%\), Parity\_Sign\_Error \(13\.8%\), and Cofactor\_Neglect \(13\.0%\)\. The product block falls to 32\.5%\. At 5×\\times5 \(nn=114\), the parity block partially retreats to 47\.6% while the product block recovers to 40\.8%, reflecting working memory saturation within cofactor\-dominant models: as recursive depth exceeds reliable parity tracking capacity, failures increasingly manifest as multiplication\-step errors rather than sustained pattern tracking failures — consistent with the working memory account in Section 6\.1\.

Table F\.1:Sign error subtype×\\timesquestion\-type distribution across all three matrix dimensions\. Each non\-zero cell reports raw count with normalised failure rate in parentheses \(failures÷\\divtotal attempts: det=500, eig=300, other=1,400 pooled across 7 subcategories×\\times200 attempts each\)\. Raw counts are not directly comparable across subcategory groups due to unequal question allocation; normalised rates enable fair cross\-subcat comparison\. Dashes \(—\) denote structural impossibility — parity\-block subtypes cannot occur in non\-cofactor subcategories as these algorithms contain no\(−1\)i\+j\(\-1\)^\{i\+j\}step\. Zero \(0\) denotes an empirical zero — the error is theoretically possible but did not occur in this dataset\. Bold counts indicate observed failures\.![[Uncaptioned image]](https://arxiv.org/html/2605.16675v1/figures/TableF1.png)— = Structural zero \(algorithm constraint\) \| 0 = Empirical zero \(possible but not observed\) \| Bold = observed failure count

![Refer to caption](https://arxiv.org/html/2605.16675v1/x5.png)Figure F\.1:Sign error block composition by matrix dimension, shown as percentage of total sign errors per dimension \(nn=33, 123, 114\)\. Parity block rises from 3\.0% at 3×\\times3 to 61\.0% at 4×\\times4\. Note: the 3×\\times3 parity count reflects a single Parity\_Sign\_Error instance \(1/33 sign errors = 3\.0%\); the table in Appendix F reports this same case as 0\.2% of total determinant attempts, reflecting a different denominator\. Product block dominates at 3×\\times3 \(75\.8%\)\. Standalone block accounts for the remaining percentage and is not shown separately\.Table F\.2:Full subcat\-level breakdown of sign errors by subtype and matrix dimension\. Confirms that Parity\-family subtypes are structurally absent from non\-recursive subcategories\. ‘Other’ subcategories with zero sign errors at a given dimension are omitted for brevity\.Table F\.3:Total sign errors by model and matrix dimension\. All counts from the validated pipeline output \(sign\_error\_summary\.csv\)\. Models ordered by tier then by total sign errors descending within tier\.Table F\.4:One representative failure example per sign error subtype drawn from the validated pipeline output\.Model Wroteshows the exact erroneous expression;Correctshows the expected expression\. All examples are from 5×\\times5 matrix tasks where sign errors are most consequential\.![[Uncaptioned image]](https://arxiv.org/html/2605.16675v1/figures/TableF4.png)
## Appendix GHallucination Subtype Analysis

Hallucination failures are classified into two structural blocks:Abandonment\(model disengages from computation before completing meaningful steps — explicit meta\-statement or silent omission\) andFabrication\(model continues through the response but invents values inconsistent with the input\)\. Unlike sign errors, no hallucination subtype is structurally impossible in any subcategory — all zeros in Table 1 are empirical\. The complete absence of the abandonment block at 3×\\times3 is abehaviouralfinding, not an algorithm constraint: models at 3×\\times3 recursive depthengagewith computation rather than abandon it\. Total hallucination failures across all dimensions: 436 \(3×\\times3: 17, 4×\\times4: 107, 5×\\times5: 312\), concentrated almost entirely in eigenvalue \(237 at 5×\\times5\) and determinant \(72 at 5×\\times5\) subcategories\.

HALLUCINATION is the dominant failure mode at 4×\\times4 \(27\.2%\) and overwhelmingly so at 5×\\times5 \(47\.1%\), rising from 17\.0% at 3×\\times3 across a fully enumerated evaluation of 6,600 outputs\. It represents not failure at a specific computational step but failure to engage with computation at all\. Six subtypes fall into two blocks\. At 3×\\times3 \(nn=17\), all hallucinations are Fabrication block \(100\.0%\) — models attempt computation but invent results\. The Abandonment block is entirely absent, confirming that at 3×\\times3 recursive depth models engage with computation rather than disengage from it\. At 4×\\times4 \(nn=107\), the Abandonment block surges to 80\.4% — Complete\_Collapse alone accounts for 75\.7% of all 4×\\times4 hallucinations as eigenvalue complexity crosses the disengagement threshold\. The Fabrication block falls to 19\.6%\. At 5×\\times5 \(nn=312\), the Abandonment block reaches 89\.4%, with Complete\_Collapse contributing 205 of 237 eigenvalue failures\. The Fabrication block further retreats to 10\.6%\. The monotonic rise of Abandonment and fall of Fabrication across dimensions is the forensic signature of the fabrication\-to\-abandonment transition described in Section 5\.3\. Figure[G\.1](https://arxiv.org/html/2605.16675#A7.F1)shows the shift from Fabrication\-dominant to Abandonment\-dominant hallucination across the three matrix dimensions\.

Table G\.1:Hallucination subtype×\\timesquestion\-type distribution across all three matrix dimensions\. Each non\-zero cell reports raw count with normalised failure rate in parentheses \(failures÷\\divtotal attempts: det=500, eig=300, other=1,400 pooled across seven subcategories at 20 questions×\\times10 models each\)\. All proportions are exact counts from a fully enumerated evaluation of 6,600 outputs; no sampling uncertainty applies\. All zeros are empirical — no hallucination subtype is structurally impossible in any subcategory\.![[Uncaptioned image]](https://arxiv.org/html/2605.16675v1/figures/TableG1.png)0 = Empirical zero \(possible but not observed\) \| No structural zeros exist in this table \| Bold = observed failure count

![Refer to caption](https://arxiv.org/html/2605.16675v1/x6.png)Figure G\.1:Hallucination block composition by matrix dimension, shown as percentage of total hallucination failures per dimension \(nn=17, 107, 312\)\. The fabrication\-to\-abandonment transition is visible as a clean crossover between 3×\\times3 and 4×\\times4: Fabrication block falls from 100\.0% to 19\.6% while Abandonment block rises from 0\.0% to 80\.4%, reaching 89\.4% at 5×\\times5\. The complete absence of Abandonment at 3×\\times3 is abehaviouralfinding — models at 3×\\times3 recursive depthengagewith computation rather than abandon it\. All percentages are exact counts from a fully enumerated evaluation;nnvalues reflect total hallucination failures per dimension\.Table G\.2:Hallucination failures within the ‘other’ subcategory group by subtype and matrix dimension\. Confirms that hallucination in non\-recursive tasks is negligible — onlymatrix\_power\(nn=1\) and rank \(nn=7\) show any hallucination cases outside the determinant/eigenvalue block\.Table G\.3:Total hallucination failures by model and matrix dimension\. All counts from the validated pipeline output \(hallucination\_summary\.csv\)\. Models ordered by tier then by total hallucination count descending within tier\.Table G\.4:One representative failure example per hallucination subtype drawn from the validated pipeline output\. Model Output shows the exact language orbehaviourobserved; bold monospace indicates the precise erroneous expression or statement\. All examples are from 5×\\times5 matrix tasks where hallucination is most consequential\.![[Uncaptioned image]](https://arxiv.org/html/2605.16675v1/figures/TableG4.png)
## Appendix HError Tag Distribution

Table[H\.1](https://arxiv.org/html/2605.16675#A8.T1)reports the primary error tag distribution across all three matrix dimensions\. All 1,156 failures are drawn from a fully enumerated evaluation of 6,600 outputs \(2,200 per dimension; 220 per model×\\times10 models\); no sampling uncertainty applies\. Percentages reflect proportion of total failures per dimension, enabling cross\-dimensional comparison of error type prevalence\. Eigenvalue\-specific extensions \(GENERATION\_LOOP, ALGEBRAIC\_PRECEDENCE, FALSE\_VERIFICATION, VARIABLE\_ENTANGLEMENT\) apply exclusively to the eigenvalue subcategory\. Output\-level primary tags \(FORMATTING\_MISMATCH, GENERATION\_TRUNCATION, OTHER\_UNMAPPED\) reflect structural or format errors rather than computational failures and are excluded from distributional analysis in Section 5\.

Table H\.1:Primary error tag distribution across matrix dimensions\. Each cell reports raw count with percentage of total failures per dimension in parentheses\. All 1,156 classified failures are from a fully enumerated evaluation of 6,600 outputs \(2,200 per dimension; 220 per model×\\times10 models\); no sampling uncertainty applies\. Eigenvalue\-specific extensions apply to the eigenvalue subcategory only\. Output\-level primary tags reflect structural or format errors rather than computational failures\.Error Tag3×\\times3 \(NN=100\)4×\\times4 \(NN=394\)5×\\times5 \(NN=662\)TotalPrimary tagsSIGN\_ERROR33 \(33\.0%\)123 \(31\.2%\)114 \(17\.2%\)270HALLUCINATION17 \(17\.0%\)107 \(27\.2%\)312 \(47\.1%\)436ARITHMETIC29 \(29\.0%\)76 \(19\.3%\)118 \(17\.8%\)223METHOD\_FAIL1 \(1\.0%\)17 \(4\.3%\)41 \(6\.2%\)59INPUT\_TRANSCRIPTION8 \(8\.0%\)15 \(3\.8%\)47 \(7\.1%\)70MEMORY\_LOSS02 \(0\.5%\)7 \(1\.1%\)9CARRY\_DOWN\_ERROR2 \(2\.0%\)1 \(0\.3%\)03GENERATION\_TRUNCATION5 \(5\.0%\)5 \(1\.3%\)11 \(1\.7%\)21FORMATTING\_MISMATCH4 \(4\.0%\)15 \(3\.8%\)019OTHER\_UNMAPPED017 \(4\.3%\)4 \(0\.6%\)21Eigenvalue\-specific extensionsGENERATION\_LOOP015 \(3\.8%\)1 \(0\.2%\)16ALGEBRAIC\_PRECEDENCE0000FALSE\_VERIFICATION1 \(1\.0%\)06 \(0\.9%\)7VARIABLE\_ENTANGLEMENT01 \(0\.3%\)1 \(0\.2%\)2TOTAL100 \(100\.0%\)394 \(100\.0%\)662 \(100\.0%\)1,156Table[H\.2](https://arxiv.org/html/2605.16675#A8.T2)reports the overall failure rate per question subcategory across all three matrix dimensions\. Each cell shows the number of failures over total attempts \(failures/attempts\), with the percentage failure rate in parentheses\. Determinant and eigenvalue subcategories drive the majority of failures at all dimensions\.Eigenvalue reaches a 98\.7% failure rate at 5×\\times5— the highest of any subcategory at any dimension in the benchmark\. Trace and Transpose maintain near\-zero failure rates at all dimensions, confirming their role as ceiling\-level baselines\.

Table H\.2:Overall failure rate per question subcategory across matrix dimensions\. Each cell reports failures/total attempts with percentage failure rate\. Denominators reflect 50 questions×\\times10 models==500 for determinant, 30×\\times10==300 for eigenvalues, and 20×\\times10==200 for all other subcategories\. All counts are exact from the fully enumerated evaluation\.
## Appendix IModel Collapse Personas at 5imesimes5 Eigenvalue

All models received identical zero\-shot prompts with code execution disabled\. Eigenvalue collapse personas are characterised from 5×\\times5 eigenvalue failures only\. Personas are consistent across multiple instances per model\.DeepSeek\-V3 and OpenAI\-o1are the sole exceptions with genuine unaided computation — DeepSeek\-V3 produces 3 correct solutions before collapsing to tool\-roleplay for the remaining 27; OpenAI\-o1 produces 1 correct solution with 29 collapse failures following an authority\-appeal pattern\. All other models produce zero genuine solutions at 5×\\times5 eigenvalue\. Tiers follow the accuracy\-based classification in Section 4\.2\. Plausibility heuristic analysis \(trace vs spectral radius constraint satisfaction\) is reported in Section 5\.3 for the 20 Ungrounded\_Guess cases only; persona characterisation for other models is based on qualitative response inspection\.

Table I\.1:Model\-specific collapse personas at 5×\\times5 eigenvalue problems\.nnreflects collapse instances per model analysed for persona classification \(eigenvalue subcategory only; 5×\\times5 dimension\)\. Signature phrases are representative verbatim or near\-verbatim excerpts from model outputs\.∗DeepSeek\-V3 instance count reflects 27 collapse failures; 3 genuine successes excluded\.†OpenAI\-o1 instance count reflects 29 collapse failures; 1 genuine success excluded\.ModelTierCollapse StrategySignature Phrase / BehaviournnTier 1 — Fully withstand L4OpenAI\-o1T1Authority appeal“In practice one nearly always resorts to numerical methods — claims values pinned down via standard software”29†Gemini\-3\.0\-ProT1Prompt conflict \+ trace\-anchored fabrication“Outputs system prohibition verbatim \(without using any tool, tool use prohibited\), then produces trace\-consistent integer eigenvalues”30DeepSeek\-V3T1Genuine computation \(3\) \+ collapse \(27\)“Correctly forms characteristic polynomial; applies numerical root\-finding \(bisection\)\. 27 remaining failures follow tool\-roleplay pattern”27∗Qwen3\-235BT1Simulated execution“Use Numerical Diagonalisation \(Simulated\) — explicitly invents the concept of pretended tool execution”20Tier 2 — Partially withstand L4GPT\-5\.2T2Unexecuted code blocks“Writes syntactically correct sympy/numpy Python blocks then hallucinates the output the code would have produced”30Mistral\-LargeT2Mid\-response strategy pivot“Debates QR iteration explicitly in response text, then: Given the complexity, I decide to use a computational tool”30Tier 3 — Collapse at L4Claude\-4\.5\-SonnetT3Fabricated polynomial factorisation“Claims a computer algebra system produced a perfectly factored degree\-5 characteristic polynomial with fabricated roots”30GPT\-4oT3Hypothetical tool assumption“Let’s assume we use a computational tool — transitions directly from the characteristic matrix to a fabricated final answer”30Llama\-3\.3\-70BT3Prompt conflict \+ ungrounded collapse“Outputs prohibition verbatim then invokes NumPy anyway, ultimately collapsing to an ungrounded guess”30Qwen2\.5\-72BT3Simulated execution“Simulate the process or use known tools like WolframAlpha — claims to approximate via simulated numerical algorithm”30

Tier 1 models collapse throughauthority appeal and tool\-roleplay— they acknowledge the computational demand and simulate or invoke external computation rather than attempt it directly\. Tier 2 models showmid\-response strategy pivots— they begin with algebraic approaches then abandon mid\-computation\. Tier 3 models collapseimmediately and unconditionally— with no meaningful algebraic steps preceding the fabricated answer\. Across all tiers, the common thread isconstraint\-aware confabulation: models fabricate eigenvalues that fall within spectral radius plausibility bounds \(17/20 Ungrounded\_Guess cases satisfy the Frobenius norm constraint\), while trace satisfaction occurs in only 9/20 cases and determinant satisfaction in only 1/20 cases — confirming that models rely on superficial magnitude heuristics rather than genuine algebraic constraint satisfaction\.

∗DeepSeek\-V3 instance count reflects 27 collapse failures; 3 genuine successes excluded from persona characterisation\.

†OpenAI\-o1 instance count reflects 29 collapse failures; 1 genuine success excluded from persona characterisation\.

## Appendix JSolution Strategy Analysis

This appendix details the zero\-shot observational data that motivated the forced\-Gaussian ablation study in Section 6\.2 and Appendix K\. Solution strategy for each determinant response was classified by the Build Judge as Cofactor Expansion, Gaussian Elimination, or Hybrid \(mixed or inconsistent strategy within a single response\)\. Percentages reflect strategy classification across all 50 determinant questions per model per dimension\.

While this observational data reveals that strategy rigidity is a near\-perfect predictor of 5×\\times5 determinant accuracy in a natural zero\-shot setting, the ablation study \(Appendix K\) demonstrates that this correlation is not causal: enforcing the algorithmically efficient Gaussian strategy does not recover accuracy for collapsed models\.

Table J\.1:Solution strategy comparison across 4×\\times4 and 5×\\times5 matrix dimensions, sorted by tier\. Cofactor/Gaussian/Hybrid percentages reflect classified strategy per det response and sum to 100% per model per dimension\. Det accuracy computed over 50 questions at 5×\\times5\. Tendency: Adaptive = reduces cofactor reliance from 4×\\times4 to 5×\\times5; Rigid = holds or increases cofactor reliance; Compensating = holds cofactor reliance but succeeds via superior execution quality\.![[Uncaptioned image]](https://arxiv.org/html/2605.16675v1/figures/TableJ1.png)![Refer to caption](https://arxiv.org/html/2605.16675v1/x7.png)Figure J\.1:Determinant accuracy at 4×\\times4 and 5×\\times5 grouped by dominant solution strategy\. Models within each strategy group are ordered by 5×\\times5 accuracy\. At 4×\\times4 all strategy groups achieve broadly similar high accuracy — strategy does not yet discriminate\. At 5×\\times5 Gaussian\-dominant and hybrid models maintain high accuracy while cofactor\-dominant models collapse\. The transitional group \(GPT\-5\.2, Mistral\-Large\) adapts strategy but lacks execution quality to fully capitalise\. DeepSeek\-V3 is the sole cofactor\-dominant exception, succeeding via superior execution quality as documented in Section 6\.2\. All proportions are exact counts from a fully enumerated evaluation \(det subcategory only; 25 questions×\\times10 models at 4×\\times4, 50 questions×\\times10 models at 5×\\times5\)\. Note: While natural strategy selection strongly predicts success in this zero\-shot setting, the targeted ablation study in Section 6\.2 \(detailed in Appendix K\) demonstrates that enforcing Gaussian elimination does not rescue Tier 2 or Tier 3 models, proving the ultimate bottleneck is autoregressive execution depth\.
## Appendix KStrategy Ablation Study: Forced Gaussian Elimination

This appendix documents a targeted ablation study designed to test whether strategy selection causally explains 5×\\times5 determinant failure\. Five cofactor\-dominant models were explicitly instructed to use Gaussian elimination on problems drawn from their individual failure sets\. If operation count were the primary bottleneck, bypassingO\(n\!\)O\(n\!\)cofactor expansion in favour ofO\(n3\)O\(n^\{3\}\)Gaussian elimination should yield measurable accuracy improvements\.

### K\.1Experimental Design

#### Model selection\.

Five models were selected on the basis of cofactor dominance in their natural zero\-shot strategy profiles: GPT\-4o \(100% cofactor\), Llama\-3\.3\-70B \(100% cofactor\), Qwen2\.5\-72B \(78% cofactor\), Mistral\-Large \(98% cofactor\), and Claude\-4\.5\-Sonnet \(96% cofactor\)\. All five are Tier 2 or Tier 3 models that fail at 5×\\times5 determinants under standard evaluation conditions\.

#### Problem selection\.

Each model was evaluated on problems drawn from its own 5×\\times5 determinant failure set — problems the model answered incorrectly in the main benchmark evaluation\. Sample sizes reflect the available cofactor\-dominant failure pool per model\. Llama\-3\.3\-70B submissions contained 23 duplicate problem IDs arising from a batching error; these were excluded, yielding 30 unique problems\. Claude\-4\.5\-Sonnet was evaluated on a subset of its failure set \(nn=19 of 44 available failures\); results for this model are treated as indicative\. All other models were evaluated on their complete or near\-complete failure sets\.

#### Intervention\.

The standard zero\-shot prompt \(Appendix A\) was replaced with an explicit Gaussian elimination instruction prompt specifying a three\-step procedure: \(1\) row reduction to upper triangular form, \(2\) sign tracking across row swaps, and \(3\) determinant as the product of diagonal entries\. The full prompt text is reproduced in Section K\.2\. Tool use remained disabled\. Temperature was set to 0 for all models, consistent with the main benchmark evaluation protocol\.

### K\.2Intervention Prompt

The following prompt replaced the standard user\-turn instruction for all five models in this ablation:

Compute the determinant of the following 5×\\times5 matrix\. You must use Gaussian elimination \(row reduction\) only\. Do not use cofactor expansion\.Steps to follow:1\.Reduce the matrix to upper triangular form using elementary row operations\.2\.Track the sign: each row swap multiplies the determinant by−1\-1\.3\.The determinant equals the product of the main diagonal entries of the upper triangular matrix, multiplied by the accumulated sign from any row swaps\.Show all row operations explicitly\. Then state the final determinant\.

The system prompt from Appendix A\.1 was retained unchanged across all models\.

### K\.3Results

Table K\.1:Forced Gaussian elimination ablation results\.nnreflects unique problems per model\. Accuracy is computed over responses with a valid extracted answer; no\-response cases are excluded from the denominator and noted separately\.
### K\.4Interpretation

Four of five models show near\-zero accuracy under forced Gaussian elimination, statistically indistinguishable from their natural zero\-shot performance\. Strategy enforcement does not rescue accuracy for GPT\-4o \(2\.0%\), Qwen2\.5\-72B \(0\.0%\), Llama\-3\.3\-70B \(0\.0%\), or Claude\-4\.5\-Sonnet \(5\.3%\)\.

Mistral\-Large is the partial exception: 7 of 26 previously failed problems were solved correctly under forced Gaussian \(26\.9%\)\. This recovery is consistent with Mistral’s position at the Tier 2 boundary and its higher baseline execution quality relative to Tier 3 models\. However, a 26\.9% accuracy rate on problems the model originally failed — compared to its baseline 5×\\times5 determinant accuracy of 36% — does not constitute a meaningful recovery of capability\. Mistral continues to fail 73% of the time under the intervention\.

Execution trace analysis across all five models reveals a consistent failure mechanism: models correctly set up the Gaussian procedure but introduce arithmetic errors within the first four fractional row operations\. Because each subsequent step attends to the previous step’s output, a single corrupted fraction at step 2 or 3 propagates irreversibly through all remaining operations, typically collapsing the final diagonal product to zero or a spurious integer\. This cascade pattern — not strategy selection — is the proximate cause of failure\.

The ablation corroborates the interpretation developed in Section 6\.2: strategy choice correlates with 5×\\times5 determinant accuracy across models, but enforcing an alternative strategy does not recover accuracy\. The bottleneck is not which algorithm is selected but the ability to execute any multi\-step algorithm requiring deep, sequentially dependent fractional arithmetic without error propagation\. This reframes the working memory account \(Section 6\.1\): the LLM capacity constraint does not operate at the level of algorithm selection, but rather at the level of precision maintenance across deep, sequentially dependent fractional chains\.

## Appendix LModel Performance Profiles

This appendix documents the distinctive failure signature of each of the 10 models evaluated in LinAlg\-Bench, drawn entirely from empirical data\. Profiles report per\-level accuracy across all three dimensions, dominant failure modes, solution strategy behaviour, and the most diagnostically significant finding for each model\. No architectural speculation is included; all observations are grounded in the benchmark’s error taxonomy and forensic classifications\. Profiles are ordered by tier and then by overall 5×\\times5 accuracy within tier\.

Level accuracy values are computed as correct responses divided by maximum possible for that level and dimension\. Det = Determinant \(Recursive level\)\. Eig = Eigenvalue \(Compositional level\)\. Dominant failure reports the primary error tag at the dimension where the model first shows meaningful failure\. Full per\-model accuracy tables are in Appendix C; strategy classifications are in Appendix J; collapse personas are in Appendix I\.

### L\.1OpenAI\-o1 — Tier 1

Strategy:82\.5% Gaussian at 5×\\times5 — highest Gaussian adoption in benchmark\. Explicit bisection strategy for eigenvalue root\-finding; iteratesp\(λ\)p\(\\lambda\)at interval boundaries before converging\. One of two models to achieve any 5×\\times5 eigenvalue accuracy \(1/30\)\. Failure at 5×\\times5 eigenvalue is identical tool\-roleplay collapse pattern seen across all models, establishing that superior execution raises the working memory ceiling but does not eliminate it\.

Diagnostic signal:Perfect determinant accuracy across all dimensions\. 5×\\times5 eigenvalue failure \(96\.7%\) is pure Complete\_Collapse — the working memory threshold exists even for Tier 1\.

### L\.2Gemini\-3\.0\-Pro — Tier 1

Strategy:Massively pivots to Gaussian at 5×\\times5 \(8% cofactor vs 30% at 4×\\times4; 88% Gaussian vs 38% at 4×\\times4\)\. Strategy adaptation fully explains maintained determinant accuracy\. 0% eigenvalue at 5×\\times5 despite 100% determinant — confirms that the two tasks represent distinct failure thresholds\. Trace\-consistency check: satisfies trace constraint in 5/7 fabricated eigenvalue responses \(71%\), the highest rate in the benchmark — suggesting Gemini uses trace as primary plausibility anchor during confabulation\.

Diagnostic signal:Strongest strategy adapter in the benchmark\. Eigenvalue collapse at 5×\\times5 is complete despite perfect determinant — the most direct evidence that Recursive and Compositional are distinct thresholds\.

### L\.3DeepSeek\-V3 — Tier 1

Strategy:78% cofactor at 5×\\times5 — only Tier 1 model that does not shift to Gaussian\-dominant strategy yet maintains high determinant accuracy\. Forensic inspection reveals systematic zero\-seeking before expansion and explicit sub\-minor simplification stated in working text, effectively reducing autoregressive depth within cofactor\. Only open\-weight model to pass any 5×\\times5 eigenvalue problems \(3/30\); uses bisection root\-finding independently of OpenAI\-o1\. Degradation front\-loaded: 11\.3pp drop at 3×\\times3→\\rightarrow4×\\times4, only 6\.4pp at 4×\\times4→\\rightarrow5×\\times5 — consistent with reasoning chain depth providing graceful degradation at the R1 distillation boundary\.

Diagnostic signal:Critical test case for Section 6\.2: cofactor\-dominant yet high\-performing\. Demonstrates execution quality can partially compensate for suboptimal strategy — but only to a ceiling \(27 eigenvalue failures follow identical tool\-roleplay collapse\)\.

### L\.4Qwen3\-235B — Tier 1

Non\-monotonic determinant trajectory:96%→\\rightarrow98%→\\rightarrow90% — the only model to improve from 3×\\times3 to 4×\\times4 before declining at 5×\\times5\. Rank and nullity reach 100% at 5×\\times5, unique in the dataset, consistent with its dual\-mode thinking architecture and 36 trillion training tokens providing deep algorithm memorisation at sub\-recursive complexity levels\. Failure mode shifts from GENERATION\_LOOP at 4×\\times4 \(gets trapped testing rational roots for irrational eigenvalue polynomials\) to HALLUCINATION at 5×\\times5 — crossing the abandonment threshold one dimension later than Tier 3 models\. MoE architecture with 22B active parameters drops only 10\.9pp total from 3×\\times3 to 5×\\times5\.

Diagnostic signal:Non\-monotonic trajectory and 100% Sequential at 5×\\times5 make Qwen3 the key data point for the training pipeline hypothesis\. Failure mode transition from loop to collapse tracks the working memory account\.

### L\.5GPT\-5\.2 — Tier 2

Strategy:Gaussian\-dominant\. Large format sensitivity swing at 5×\\times5 determinant — one of two models showing strong format\-dependent accuracy variation at the Recursive level\. Apparent tool circumvention noted at 5×\\times5: Gemini\-3\.0\-Pro and GPT\-5\.2 explicitly output the no\-tool constraint before producing tool\-simulated outputs, confirming tool invocation is hallucinated rather than real\. API\-level inference constraints mean this cannot be definitively attributed to specific implementation choices without access to inference infrastructure\.

Diagnostic signal:Clearest example of Tier 2 profile: strong through Sequential, partial collapse at Recursive, complete collapse at Compositional\. Format sensitivity at Recursive level is one of the benchmark’s cleanest sensitivity signals\.

### L\.6Mistral\-Large — Tier 2

Strategy:98% cofactor at 5×\\times5 — among the most cofactor\-rigid in the benchmark\. Despite this, achieves 36% determinant at 5×\\times5, the highest among cofactor\-dominant models, consistent with higher execution quality within a suboptimal method\. Forced Gaussian ablation: 26\.9% accuracy on previously failed problems — the only model to show meaningful partial recovery, consistent with its position at the Tier 2 boundary\. Survival curve \(Figure[6\.1](https://arxiv.org/html/2605.16675#S6.F1)Panel B\) stays higher longer than all other ablation models before cascading\. MoE architecture with 41B active parameters drops 31\.4pp across dimensions — substantially worse than Qwen3\-235B at 22B active parameters despite similar architecture, suggesting training pipeline design matters more than parameter count\.

Diagnostic signal:Same MoE architecture as Qwen3\-235B but 3×\\timesthe active parameters and dramatically worse scale robustness — suggesting training pipeline design matters more than parameter count\. Forced Gaussian partial recovery is the clearest execution\-quality signal in the ablation\.

### L\.7Claude\-4\.5\-Sonnet — Tier 3

Strategy:96% cofactor at 5×\\times5\. Forced Gaussian ablation results were only 1 out of 19\. Explicitly outputs no\-tool constraint before simulating tool invocation at 5×\\times5 eigenvalue — confirming tool hallucination pattern\. Arithmetic level remains 100% at 5×\\times5 — one of three models with perfect arithmetic accuracy at maximum dimension, confirming that shallow arithmetic is intact while recursive computation collapses\.

Diagnostic signal:Most interesting Tier 3 profile because Reading Reading stays at 100% at 5×\\times5 alongside Arithmetic — suggesting occasional input transcription failure rather than general degradation\. HALLUCINATION dominates failure at 5×\\times5\.

### L\.8GPT\-4o — Tier 3

Strategy:100% cofactor at 5×\\times5\. Earliest sequential collapse in dataset — 62\.5% at 4×\\times4 Sequential, the lowest among all 10 models at that level\. Determinant reaches 0% at 5×\\times5 — complete collapse\. Forced Gaussian ablation: 2\.0% \(1/49\) — near\-zero, confirming strategy enforcement irrelevant\. Largest format sensitivity swing in the benchmark at 5×\\times5 matvec:LaTeXcompletely rescues parsing failures\. API\-level no\-response rate at 5×\\times5 was 1/49 in ablation\.

Diagnostic signal:Most severe Tier 3 degradation trajectory\. Sequential collapse at 4×\\times4 is the earliest warning signal in the benchmark — predicts determinant failure one dimension before it appears\.

### L\.9Llama\-3\.3\-70B — Tier 3

Strategy:100% cofactor at 5×\\times5\. Anomalous 4×\\times4 profile: 86% determinant accuracy despite poor sequential performance \(57\.5%\) — suggesting the model uses cofactor expansion in a way that bypasses row\-operation state tracking\. Lowest 3×\\times3 eigenvalue accuracy in the dataset \(33\.3%\), the earliest Compositional failure signal\. Forced Gaussian ablation: 0% across 30 unique problems \(23 duplicate submissions in original batch excluded\)\. Dense model with standard RLHF training — collapses earliest and most completely, with no execution quality compensation\.

Diagnostic signal:The 3×\\times3 eigenvalue weakness \(33\.3%\) is the strongest early predictor of 5×\\times5 tier membership in the dataset\. 4×\\times4 determinant anomaly \(86% despite poor sequential\) makes Llama the clearest case of cofactor\-only execution masking underlying capability limits\.

### L\.10Qwen2\.5\-72B — Tier 3

Strategy:78% cofactor at 5×\\times5 \(some Gaussian exposure\)\. Earliest determinant collapse among all models — already at 26% at 4×\\times4, compared to other Tier 3 models at 78–86%\. Arithmetic level drops to 78\.3% at 5×\\times5, the most severe arithmetic degradation in the benchmark\. Dense model with standard instruction tuning\.

Diagnostic signal:Earliest det collapse \(4×\\times4: 26%\) combined with arithmetic degradation at 5×\\times5 makes Qwen2\.5\-72B the clearest case of generalised computational overload rather than recursive\-only failure\.

### L\.11Cross\-Model Diagnostic Summary

The following table summarises the key diagnostic signal for each model — the single most informative finding for mechanistic interpretability follow\-up\.

Table L\.1:Cross\-model diagnostic summary\. All observations are empirical\. Architectural hypotheses generated by these observations are framed as falsifiable conjectures for follow\-up mechanistic work and are not confirmed claims\.

LinAlg-Bench: A Forensic Benchmark Revealing Structural Failure Modes in LLM Mathematical Reasoning

Similar Articles

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

AdvancedMathBench: A Benchmark Suite for Advanced Mathematical Proof Generation and Verification

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation

Submit Feedback

Similar Articles

ChaosBench-Logic v2: Evaluating LLM Logical Reasoning over Dynamical Systems at Scale

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

LLMEval-Logic: A Solver-Verified Chinese Benchmark for Logical Reasoning of LLMs with Adversarial Hardening

AdvancedMathBench: A Benchmark Suite for Advanced Mathematical Proof Generation and Verification

A2RBench: An Automatic Paradigm for Formally Verifiable Abstract Reasoning Benchmark Generation