The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Summary
This paper demonstrates that extended chain-of-thought reasoning degrades performance on deterministic state-tracking tasks due to information-theoretic limits of decoder-only attention, and proposes tool delegation when the reasoning horizon exceeds a threshold.
View Cached Full Text
Cached at: 06/02/26, 03:46 PM
# The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary
Source: [https://arxiv.org/html/2606.00376](https://arxiv.org/html/2606.00376)
###### Abstract
Extended chain\-of\-thought reasoning can degrade performance on deterministic state\-tracking tasks, not due to preference biases, but limits rooted in the information\-theoretic capacity of decoder\-only attention\. We establish: \(1\) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state\-tracking capacity asO\(H⋅log\(L/H\)⋅dh\)O\(H\\cdot\\log\(L/H\)\\cdot\\sqrt\{d\_\{h\}\}\); \(2\) a context\-dependent error model yielding super\-exponential accuracy decay; \(3\) the State\-Space Jaccard metric distinguishing capability from preference failures; \(4\) a Deterministic Horizond∗∈\[19,31\]d^\{\*\}\\in\[19,31\]beyond which tool delegation becomes necessary\. Across 12 models and 8 task domains \(including SWE\-Bench, WebArena, and SQL\-Multi\), tool\-integrated reasoning consistently outperforms neural chain\-of\-thought; on the primary model suite it reaches 86–94% accuracy versus 24–42% for neural chain\-of\-thought\. Fine\-tuning on optimal\-length traces yields<<5% improvement, confirming an architectural ceiling, and high cross\-model correlation \(r=0\.81r=0\.81–0\.910\.91\) indicates these failures are architectural rather than training\-specific\. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems\.
large language models, chain\-of\-thought reasoning, transformer architectures, inference\-time compute, attention mechanisms, state\-space search, information theory, tool augmentation, mechanistic interpretability, reasoning limits
## 1Introduction
The dominant paradigm in large language model reasoning posits that extended deliberation improves accuracy\. OpenAI’s o1\(OpenAI,[2024](https://arxiv.org/html/2606.00376#bib.bib77)\), DeepSeek\-R1\(Guoet al\.,[2025](https://arxiv.org/html/2606.00376#bib.bib44)\), and similar architectures invest heavily in chain\-of\-thought \(CoT\) generation, with the implicit thesis that additional inference\-time compute yields better reasoning\(Snellet al\.,[2025](https://arxiv.org/html/2606.00376#bib.bib45); Brownet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib46)\)\. Recent work shows test\-time compute can be more effective than parameter scaling, with compute\-optimal strategies improving efficiency by4×4\\times\(Snellet al\.,[2025](https://arxiv.org/html/2606.00376#bib.bib45); Wuet al\.,[2025](https://arxiv.org/html/2606.00376#bib.bib48)\)\. However,Balachandranet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib47)\)demonstrate improvements diminish with problem difficulty, andSuiet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib4)\)provide a broad survey documenting the “overthinking” phenomenon\.
We challenge this thesis fordeterministic state\-space search, tasks requiring exact sequences of operations transforming initial state𝝈0\\boldsymbol\{\\sigma\}\_\{0\}to target𝝉\\boldsymbol\{\\tau\}through finite operator set𝒪\\mathcal\{O\}\. Such problems pervade software engineering, formal verification, and sequential planning\(Dziriet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib34); Kambhampatiet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib49); Valmeekamet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib50)\)\. In these domains, correctness is binary; approximation is failure\. Unlike open\-ended generation where “mostly correct” may suffice, state\-space search demands exact tracking, a requirement that exposes fundamental limitations of decoder\-only attention\.
##### The Core Phenomenon\.
On permutation puzzles solvable by BFS in<<0\.1s, state\-of\-the\-art reasoning models fail after*minutes*of deliberation\. Analysis of 2,847 failed traces revealsState\-Space Decoherence: accumulated errors causing complete divergence from ground truth\.111Stratified sample of 30 failures per model\-task combination; see Appendix[K](https://arxiv.org/html/2606.00376#A11)for sampling methodology\.This failure is*caused*by extended reasoning, not mitigated by it\. The pattern is striking: at depth 10, models maintain 78% accuracy; by depth 30, accuracy drops to 34%; beyond depth 50, performance approaches random guessing \(Figure[1](https://arxiv.org/html/2606.00376#S5.F1)\)\.
##### Two Competing Effects\.
Following the framing ofKimet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib55)\), we identify tension between:
- •Complexity at reasoning time:Deeper CoT increases cumulative error probability through attention entropy dispersion\(Gong and Zhang,[2024](https://arxiv.org/html/2606.00376#bib.bib6); Barberoet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib7)\)\. Each step degrades signal\-to\-noise ratio in the residual stream\.
- •Flexibility at tool time:External computation sidesteps hard subproblems entirely\(Gaoet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib23); Gouet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib25); Panet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib26)\)\. Tools provide exact computation without attention\-based state tracking\.
##### Distinguishing from Prior Work\.
Wuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)document the inverted\-U curve, attributing it to “Simplicity Bias”, a preference for shorter reasoning\. Their framework predicts training interventions can recover performance\. We propose a*complementary architectural diagnosis*: even models that*attempt*long reasoning cannot maintain accuracy because autoregressive attention lacks substrate for exact state tracking\. This is theSimulator Fallacy\(Bender and Koller,[2020](https://arxiv.org/html/2606.00376#bib.bib51)\): conflating token prediction with algorithm execution\.
##### Divergent Predictions\.
The frameworks make testable predictions \(Table[1](https://arxiv.org/html/2606.00376#S1.T1)\): \(i\) Wu et al\. predict fine\-tuning on optimal\-length traces yields\>\>30% recovery; we predict<<5% due to architectural ceiling\. \(ii\) Wu et al\. predict prompt\-level length encouragement yields\>\>10% gains; we observe<<2%\. \(iii\) Wu et al\. predict low cross\-model correlation \(training\-specific\); we observer\>0\.8r\>0\.8\(architectural\)\.
Table 1:Divergent predictions distinguishing Simplicity Bias\(Wuet al\.,[2026](https://arxiv.org/html/2606.00376#bib.bib2)\)from Decoherence \(this work\)\. Our predictions validated \(✓\\checkmark\)\.
##### Contributions\.
We make six contributions:
1. 1\.Attention Bottleneck Theoremwith information\-theoretic derivation and a complementary achievability construction, bounding capacityO\(H⋅log\(L/H\)⋅dh\)O\(H\\cdot\\log\(L/H\)\\cdot\\sqrt\{d\_\{h\}\}\)\(Section[4](https://arxiv.org/html/2606.00376#S4)\)\.
2. 2\.Context\-dependent error modelϵ\(d\)=ϵ0\+γd/Leff\\epsilon\(d\)=\\epsilon\_\{0\}\+\\gamma d/L\_\{\\mathrm\{eff\}\}derived from attention entropy, yielding super\-exponential accuracy decay \(Section[4](https://arxiv.org/html/2606.00376#S4)\)\.
3. 3\.Deterministic Horizond∗d^\{\*\}with closed\-form formula, validated via architecture ablations on open\-weight models \(d∗∝dh⋅Hd^\{\*\}\\propto\\sqrt\{d\_\{h\}\\cdot H\}\) and shown robust to context\-window truncation \(Section[4](https://arxiv.org/html/2606.00376#S4)\)\.
4. 4\.SSJ metricwith precision/recall decomposition distinguishing capability \(both decay\) from preference \(only recall decays\) failure \(Section[3](https://arxiv.org/html/2606.00376#S3)\)\.
5. 5\.Empirical validationacross 12 models, 8 task domains including real\-world benchmarks \(SWE\-Bench, WebArena, SQL\-Multi\), with cross\-architecture comparison \(Section[5](https://arxiv.org/html/2606.00376#S5)\)\.
6. 6\.Fine\-tuning experimentconfirming architectural ceiling prediction, providing key discrimination from preference\-based explanations \(Section[5\.3](https://arxiv.org/html/2606.00376#S5.SS3)\)\.
##### Conflict of Interest Disclosure\.
J\. Wu is affiliated with Brain Investing Limited and Stellaris AI Limited\. These entities provided no funding for this research and developed none of the models evaluated in this paper; all experiments were conducted independently using publicly available APIs and open\-weight checkpoints\. The remaining authors declare no competing interests\.
## 2Related Work
Our work synthesizes five research streams; a detailed positioning matrix appears in Appendix[A](https://arxiv.org/html/2606.00376#A1)\.
##### CoT Foundations\.
Chain\-of\-thought prompting was established byWeiet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib1)\), extended by zero\-shot CoT\(Kojimaet al\.,[2022](https://arxiv.org/html/2606.00376#bib.bib58)\), self\-consistency\(Wanget al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib72)\), Tree\-of\-Thoughts\(Yaoet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib59)\), and Graph\-of\-Thoughts\(Bestaet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib60)\)\.Goodmanet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib61)\)showed bootstrapping through STaR training\.Nyeet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib62)\)introduced scratchpads for intermediate computation\.Liuet al\.\([2024b](https://arxiv.org/html/2606.00376#bib.bib14)\)proved CoT expands expressivity fromAC0\\text\{AC\}^\{0\}to polynomial\-time;Fenget al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib20)\)characterized attention pattern expressiveness\. ReAct\(Yaoet al\.,[2022](https://arxiv.org/html/2606.00376#bib.bib73)\)synergizes reasoning with acting\.
##### Overthinking Literature\.
Wuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)provide the closest concurrent work, documenting inverted\-U curves attributed to Simplicity Bias\.Chenet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib3)\)examine overthinking in o1\-like models\.Wanget al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib63)\)identify premature path abandonment\.Suiet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib4)\)survey efficient reasoning\.Marjanovicet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib74)\)analyze DeepSeek\-R1’s “sweet spot\.”Suet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib5)\)study the overthinking\-underthinking spectrum\.
##### Working Memory and Over\-Squashing\.
Gong and Zhang \([2024](https://arxiv.org/html/2606.00376#bib.bib6)\)demonstrate attention entropy limits working memory\.Barberoet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib7)\)prove representational collapse from causal masking\.Liuet al\.\([2024a](https://arxiv.org/html/2606.00376#bib.bib8)\)document “lost in the middle\.”Xiaoet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib9)\)identify attention sinks\.Zhanget al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib10)\)extend entropy analysis\.Gerasimovet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib11)\)find representation collapse\.Levyet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib12)\)show length\-dependent degradation\.Olssonet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib64)\)discover induction heads\.Elhageet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib76)\)provide circuit frameworks\.Biettiet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib13)\)analyze memory from birth\.
##### Theoretical Foundations\.
Merrill and Sabharwal \([2024](https://arxiv.org/html/2606.00376#bib.bib75)\)establish expressivity bounds\.Merrillet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib15)\)show SSMs shareTC0\\text\{TC\}^\{0\}limits\.Bavandpouret al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib65)\)provide CoT step lower bounds\.Penget al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib16)\)prove composition impossibility via communication complexity\.Hahn \([2020](https://arxiv.org/html/2606.00376#bib.bib66)\)identify formal language limitations\.Delétanget al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib17)\)study Chomsky hierarchy relations\.Pérezet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib67)\)prove conditional Turing completeness\.Yunet al\.\([2020](https://arxiv.org/html/2606.00376#bib.bib18)\)establish universal approximation\.Bhattamishraet al\.\([2020](https://arxiv.org/html/2606.00376#bib.bib21)\)analyze attention mechanism power\.Stroblet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib22)\)provide formal language perspective\. Information\-theoretic foundations includeTishby and Zaslavsky \([2015](https://arxiv.org/html/2606.00376#bib.bib41)\)on bottlenecks,Shwartz\-Ziv and Tishby \([2017](https://arxiv.org/html/2606.00376#bib.bib42)\)on DNN information dynamics,Lewandowsky and Bauch \([2024](https://arxiv.org/html/2606.00376#bib.bib68)\)on information bottleneck framework, andDeb and Ogunfunmi \([2025](https://arxiv.org/html/2606.00376#bib.bib43)\)connecting transformers to information theory\.
##### Tool Augmentation\.
Gaoet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib23)\)introduced program\-aided language models\.Liet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib24)\)propose Chain\-of\-Code\.Chenet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib30)\)introduce Program of Thoughts\.Gouet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib25)\)achieve SOTA via tool integration\.Panet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib26)\)demonstrate symbolic solver delegation\.Schicket al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib27)\)enable self\-supervised tool learning\.Parisiet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib28)\)introduce tool\-augmented language models\.Luoet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib29)\)apply RL for tool\-augmented math\.Qinet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib31)\)survey tool learning\.Mialonet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib32)\)survey augmented LMs\.Patilet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib33)\)connect LLMs to APIs\.
##### Compositional Reasoning\.
Dziriet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib34)\)show transformers solve compositional tasks via linearized matching\.Pettyet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib35)\)demonstrate depth provides diminishing returns for compositionality\. Benchmarks includeLake and Baroni \([2018](https://arxiv.org/html/2606.00376#bib.bib37)\), SCAN/CFQ\(Keyserset al\.,[2020](https://arxiv.org/html/2606.00376#bib.bib38)\), and COGS\(Kim and Linzen,[2020](https://arxiv.org/html/2606.00376#bib.bib39)\)\.Presset al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib36)\)introduce compositionality metrics;Ontañónet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib40)\)study improvement methods\.
Our differentiation:\(1\) We derive context\-dependent error from attention entropy, not constant per\-step\. \(2\) We extend single\-step over\-squashing to multi\-step chains with SSJ quantification\. \(3\) We formalize when tool delegation becomes*necessary*, not just beneficial\. \(4\) We validate on real\-world tasks and through fine\-tuning experiments\.
## 3Problem Setup and Metrics
##### State\-Space Search\.
Let𝒮\\mathcal\{S\}denote a finite state space and𝒪=\{o1,…,ok\}\\mathcal\{O\}=\\\{o\_\{1\},\\ldots,o\_\{k\}\\\}deterministic operators\. Given initial state𝝈0∈𝒮\\boldsymbol\{\\sigma\}\_\{0\}\\in\\mathcal\{S\}and target𝝉∈𝒮\\boldsymbol\{\\tau\}\\in\\mathcal\{S\}, the task is finding minimal sequence\(oi1,…,oim\)\(o\_\{i\_\{1\}\},\\ldots,o\_\{i\_\{m\}\}\)such thatoim∘⋯∘oi1\(𝝈0\)=𝝉o\_\{i\_\{m\}\}\\circ\\cdots\\circ o\_\{i\_\{1\}\}\(\\boldsymbol\{\\sigma\}\_\{0\}\)=\\boldsymbol\{\\tau\}\.
###### Definition 3\.1\(Step\-to\-First\-Error \(SFE\)\)\.
Given tracer=\[\(s1,o1\),…,\(sm,om\)\]r=\[\(s\_\{1\},o\_\{1\}\),\\ldots,\(s\_\{m\},o\_\{m\}\)\]wheresis\_\{i\}is claimed state,SFEis smallestiiwheresi≠oi−1∘⋯∘o1\(𝝈0\)s\_\{i\}\\neq o\_\{i\-1\}\\circ\\cdots\\circ o\_\{1\}\(\\boldsymbol\{\\sigma\}\_\{0\}\)\.
###### Definition 3\.2\(State\-Space Jaccard \(SSJ\)\)\.
At depthdd, let𝒮trued\\mathcal\{S\}\_\{\\text\{true\}\}^\{d\}be actually reachable states and𝒮^modeld\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}^\{d\}be claimed states:
SSJ\(d\)=\|𝒮trued∩𝒮^modeld\|\|𝒮trued∪𝒮^modeld\|\\textsc\{SSJ\}\(d\)=\\frac\{\|\\mathcal\{S\}\_\{\\text\{true\}\}^\{d\}\\cap\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}^\{d\}\|\}\{\|\\mathcal\{S\}\_\{\\text\{true\}\}^\{d\}\\cup\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}^\{d\}\|\}\(1\)We decompose into precision=\|∩\|/\|𝒮^model\|=\|\\cap\|/\|\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\|and recall=\|∩\|/\|𝒮true\|=\|\\cap\|/\|\\mathcal\{S\}\_\{\\text\{true\}\}\|\.
Diagnostic power:If failure is*preference*\-based \(Simplicity Bias\), SSJ remains high, as models*could*track states but*choose*not to\. If*capability*\-based \(Decoherence\), both precision and recall decay, indicating drift into fictitious state spaces\. This provides direct empirical discrimination \(Section[5\.4](https://arxiv.org/html/2606.00376#S5.SS4)\)\.
### 3\.1Scope and Applicability
The phenomenon we analyze is specific to*deterministic state\-tracking tasks*: problems in which a unique correct state must be maintained exactly across a sequence of operations, so that any single deviation constitutes failure\. Canonical instances include tracing program state through sequential mutations, composing permutations or rotations, multi\-hop entity tracking across long contexts, and planning multi\-table SQL joins\. These contrast with two task classes for which our framework makes*no*prediction of decoherence\. The first is*open\-ended generation*\(summarization, dialogue, creative writing\), where many outputs are acceptable and there is no single ground\-truth state to corrupt\. The second is*approximate or probabilistic reasoning*, where partial credit is meaningful and short chains suffice; most GSM8K\-style arithmetic word problems, for example, require fewer than roughly 15 reasoning steps, well within the Deterministic Horizon, so extended chain\-of\-thought remains beneficial there rather than harmful\. The Deterministic Horizon therefore predicts both*when*extended reasoning helps \(d<d∗d<d^\{\*\}\) and when it degrades \(d\>d∗d\>d^\{\*\}\) for the deterministic class specifically; it is not a claim that longer reasoning is universally counterproductive\. This specificity reconciles our findings with the inverted\-U curve ofWuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\): accuracy improves with depth up tod∗d^\{\*\}, then degrades super\-exponentially once decoherence sets in\.
## 4Theoretical Framework
### 4\.1Context\-Dependent Error Model
Wuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)model per\-step error asE\(N,M,T\)=T/\(NM\)E\(N,M,T\)=T/\(NM\): error depends on complexity but not*position*\. We propose context\-dependent error motivated by attention mechanics:
###### Definition 4\.1\(Context\-Dependent Error\)\.
Per\-step state\-tracking error at depthdd:
ϵ\(d\)=ϵ0\+γ⋅dLeff\\epsilon\(d\)=\\epsilon\_\{0\}\+\\gamma\\cdot\\frac\{d\}\{L\_\{\\mathrm\{eff\}\}\}\(2\)whereϵ0\\epsilon\_\{0\}is the baseline per\-step error,γ\\gammais the attention decay rate, andLeffL\_\{\\mathrm\{eff\}\}is the*effective decoherence length*: the number of reasoning steps over which attention maintains usable resolution of self\-generated state\. Notably,LeffL\_\{\\mathrm\{eff\}\}is a property of the model’s effective working memory and is far smaller than the raw context windowLL\(empiricallyLeff=O\(102\)L\_\{\\mathrm\{eff\}\}=O\(10^\{2\}\)steps versusL=O\(105\)L=O\(10^\{5\}\)tokens; Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\)\.
##### Derivation from Attention Entropy\.
FollowingGong and Zhang \([2024](https://arxiv.org/html/2606.00376#bib.bib6)\), letHdH\_\{d\}denote attention entropy at depthdd\. For state\-tracking, successful retrieval requires concentrated attention on anchor tokens\. Empirically,HdH\_\{d\}grows linearly with task load \(r=0\.73r=0\.73,p<0\.001p<0\.001\) while attention on anchors decreases\(Xiaoet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib9); Liuet al\.,[2024a](https://arxiv.org/html/2606.00376#bib.bib8)\)\. Modeling error as signal\-to\-noise ratio:
ϵ\(d\)∝HdAanchor\(d\)≈H0\+βdA0−δd≈ϵ0\+γd/Leff\\epsilon\(d\)\\propto\\frac\{H\_\{d\}\}\{A\_\{\\text\{anchor\}\}\(d\)\}\\approx\\frac\{H\_\{0\}\+\\beta d\}\{A\_\{0\}\-\\delta d\}\\approx\\epsilon\_\{0\}\+\\gamma d/L\_\{\\mathrm\{eff\}\}\(3\)using first\-order Taylor expansion valid ford≪Leffd\\ll L\_\{\\mathrm\{eff\}\}\. We validate this linear form in Section[6\.3](https://arxiv.org/html/2606.00376#S6.SS3)\.
###### Theorem 4\.2\(Decoherence Bound\)\.
Under the context\-dependent error of Equation \([2](https://arxiv.org/html/2606.00376#S4.E2)\),ϵ\(d\)=ϵ0\+γd/Leff\\epsilon\(d\)=\\epsilon\_\{0\}\+\\gamma d/L\_\{\\mathrm\{eff\}\}, with conditional independence given correct history:
P\(correct at depthm\)≤exp\(−mϵ0−γm\(m\+1\)2Leff\)P\(\\text\{correct at depth \}m\)\\leq\\exp\\left\(\-m\\epsilon\_\{0\}\-\\frac\{\\gamma m\(m\+1\)\}\{2L\_\{\\mathrm\{eff\}\}\}\\right\)\(4\)The quadratic term yields super\-exponential decay\.
###### Corollary 4\.3\(Absorbing\-Error Tightness\)\.
Under the absorbing\-error premise of State\-Space Decoherence \(a single state\-tracking error causes irrecoverable divergence\), “correct at depthmm” is equivalent to “no error in the firstmmsteps\.” Identifyingϵ\(i\)\\epsilon\(i\)with the discrete hazardP\(first error ati∣correct throughi−1\)P\(\\text\{first error at \}i\\mid\\text\{correct through \}i\-1\), the product in Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)holds as an exact identity, relaxed only by1−x≤e−x1\-x\\leq e^\{\-x\}\. The induced error\-indicator sequence is then positively autocorrelated; we measure lag\-1 autocorrelationρ1=0\.34\\rho\_\{1\}=0\.34\(95% CI\[0\.31,0\.37\]\[0\.31,0\.37\]; Appendix[AC](https://arxiv.org/html/2606.00376#A29)\), consistent with strong forward error propagation and inconsistent with independent errors \(ρ1=0\\rho\_\{1\}=0\)\.
### 4\.2Attention Bottleneck Theorem
We establish information\-theoretic capacity bounds on state\-tracking\. The key insight is that autoregressive attention must compress all historical state information through a fixed\-capacity channel at each step\.
###### Assumption 4\.4\(Effective Attention Window\)\.
Each head assigns weight≥δ/L\\geq\\delta/Lto at mostO\(L\)O\(\\sqrt\{L\}\)positions due to softmax concentration and interference effects\.
###### Assumption 4\.5\(Value Decorrelation\)\.
After layer normalization, value vectors satisfy\|ρij\|≤ρmax\|\\rho\_\{ij\}\|\\leq\\rho\_\{\\max\}withρmax≪1\\rho\_\{\\max\}\\ll 1\.
##### Justification of Assumptions\.
Assumption[4\.4](https://arxiv.org/html/2606.00376#S4.Thmtheorem4)formalizes the well\-documented concentration of softmax attention: as a sequence lengthens, attention mass localizes on a small subset of positions while the contribution of individual earlier tokens is progressively diluted, so the number of positions a head can resolve at usable weight grows far more slowly thanLL\(Barberoet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib7); Gerasimovet al\.,[2025](https://arxiv.org/html/2606.00376#bib.bib11)\)\. Intuitively, because each step must route all relevant history through this narrowing channel, the model cannot simultaneously keep many distinct prior states “in focus,” which is the mechanism behind the capacity bound below\. We adopt theO\(L\)O\(\\sqrt\{L\}\)rate as a tractable summary of this sub\-linear concentration rather than an exact law; Appendix[B\.7](https://arxiv.org/html/2606.00376#A2.SS7)shows the resulting bound shifts by<<20% under±\\pm20% perturbation ofHHand of this exponent, so our qualitative conclusions do not hinge on its precise form\. Consistent with this, we directly measure an effective decoherence lengthLeff=O\(102\)L\_\{\\mathrm\{eff\}\}=O\(10^\{2\}\)steps \(Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\), far below the raw context window, and low post\-LayerNorm inter\-value correlation supporting Assumption[4\.5](https://arxiv.org/html/2606.00376#S4.Thmtheorem5)\(Appendix[G](https://arxiv.org/html/2606.00376#A7)\)\. We therefore treat these as empirically grounded modeling assumptions and, accordingly, report the capacity result as a bound consistent with our measurements rather than an unconditional law\.
###### Theorem 4\.6\(Attention Bottleneck\)\.
Under Assumptions[4\.4](https://arxiv.org/html/2606.00376#S4.Thmtheorem4)–[4\.5](https://arxiv.org/html/2606.00376#S4.Thmtheorem5), the number of distinct states a decoder\-only transformer can reliably track is bounded:
\|𝒮track\|≤c\(δ,ρmax\)⋅2H⋅log2\(L/H\)⋅dh1/2\|\\mathcal\{S\}\_\{\\text\{track\}\}\|\\leq c\(\\delta,\\rho\_\{\\max\}\)\\cdot 2^\{H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\}\(5\)whereHHis heads,LLcontext length,dhd\_\{h\}head dimension\.
##### Numerical Example\.
For GPT\-4o \(H=96H=96,L=128L=128K,dh=128d\_\{h\}=128\)222Architectural parameters for proprietary models such as GPT\-4o and Claude are not officially disclosed; the values in this illustrative example are widely cited community estimates used only for order\-of\-magnitude intuition\. All quantitative validation of thedhH\\sqrt\{d\_\{h\}H\}scaling \(Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\) uses open\-weight models with publishedHHanddhd\_\{h\}\.: capacity≈296⋅10\.4⋅11\.3≈211,275\\approx 2^\{96\\cdot 10\.4\\cdot 11\.3\}\\approx 2^\{11,275\}states\. This seems enormous, but permutation tracking forn=16n=16elements throughd=50d=50steps requires distinguishing≈16\!50≈22212\\approx 16\!^\{50\}\\approx 2^\{2212\}trajectories, still within bounds\. However, the*per\-step*information requirement \(log216\!≈44\\log\_\{2\}16\!\\approx 44bits\) must flow through attention, which concentrates onO\(L\)O\(\\sqrt\{L\}\)positions\. This bottleneck causes the observed failures\.
###### Theorem 4\.7\(Lower\-Bound Construction\)\.
There exist state\-tracking tasks requiring states satisfyinglog2\|𝒮\|=Ω\(H⋅log\(L/H\)⋅dh1/2\)\\log\_\{2\}\|\\mathcal\{S\}\|=\\Omega\(H\\cdot\\log\(L/H\)\\cdot d\_\{h\}^\{1/2\}\)that transformers can solve\. This construction is tight inHHandLL; itsdh1/2d\_\{h\}^\{1/2\}exponent inherits the per\-head effective\-rank ansatz of Lemma[R\.5](https://arxiv.org/html/2606.00376#A18.Thmtheorem5), so we present it as an achievability result for the functional form of Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)rather than a proof of tightness indhd\_\{h\}\.
Proofs in Appendix[B](https://arxiv.org/html/2606.00376#A2)\. The lower bound uses explicit construction via sparse parity functions, followingSanfordet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib19)\)\.
### 4\.3Deterministic Horizon
###### Theorem 4\.8\(Deterministic Horizon\)\.
Letα\\alphabe the target success probability\. Setting the bound of Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)equal toα\\alpha\(continuous approximationm\(m\+1\)≈m2m\(m\+1\)\\approx m^\{2\}\) and solving the resulting quadratic, the maximum depthd∗d^\{\*\}satisfyingP\(correct\)≥αP\(\\text\{correct\}\)\\geq\\alphais:
d∗=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γd^\{\*\}=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(6\)For GPT\-4o \(ϵ0=0\.02\\epsilon\_\{0\}=0\.02,γ=0\.15\\gamma=0\.15,Leff=150L\_\{\\mathrm\{eff\}\}=150,α=0\.5\\alpha=0\.5\) this givesd∗=22\.3d^\{\*\}=22\.3; across the primary model suited∗∈\[19,31\]d^\{\*\}\\in\[19,31\]\. Open\-weight 7–8B models sit at the lower edge of this range \(d∗=19d^\{\*\}=19–2020\), rising tod∗=28d^\{\*\}=28at the 70–72B scale, in line with thedhH\\sqrt\{d\_\{h\}H\}capacity scaling \(Table[6](https://arxiv.org/html/2606.00376#S6.T6)\)\.
###### Corollary 4\.9\(Architecture Scaling\)\.
To leading order in the quadratic\-dominated regime, the horizon satisfiesd∗≈2Leffln\(1/α\)/γ∝Leffd^\{\*\}\\approx\\sqrt\{2L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)/\\gamma\}\\propto\\sqrt\{L\_\{\\mathrm\{eff\}\}\}\. Treating the effective decoherence length as governed by per\-layer attention capacity \(Leff∝dhHL\_\{\\mathrm\{eff\}\}\\propto d\_\{h\}H\) yields the approximate scaling
d∗∝dh⋅H,d^\{\*\}\\propto\\sqrt\{d\_\{h\}\\cdot H\},\(7\)which we validate empirically on open\-weight models \(Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\); the linearϵ0\\epsilon\_\{0\}term introduces mild deviations from the exact⋅\\sqrt\{\\cdot\}form\. We stress thatLeff∝dhHL\_\{\\mathrm\{eff\}\}\\propto d\_\{h\}His a phenomenological ansatz on the per\-layer working\-memory budget; it is distinct from, and not implied by, thedh\\sqrt\{d\_\{h\}\}dependence of the per\-step capacity bound in Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6), so we treat thedhH\\sqrt\{d\_\{h\}H\}law as an empirical scaling validated in Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)rather than a corollary of the bottleneck theorem\. The raw context windowLLdoes not enter this relation, andd∗d^\{\*\}is empirically insensitive to context truncation far above the reasoning\-trace length\.
###### Theorem 4\.10\(Fine\-Tuning Upper Bound\)\.
For any training procedure on depth\-ddtraces, ifd\>d∗d\>d^\{\*\}, accuracy cannot exceed:
Accfine\-tune≤Accbaseline\+O\(d∗d\)\\text\{Acc\}\_\{\\text\{fine\-tune\}\}\\leq\\text\{Acc\}\_\{\\text\{baseline\}\}\+O\\left\(\\frac\{d^\{\*\}\}\{d\}\\right\)\(8\)regardless of training data distribution\. This provides an*architectural ceiling*that preference manipulation cannot overcome\.
This theorem is the key theoretical contribution distinguishing our work fromWuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)\. If Simplicity Bias were the sole cause, fine\-tuning should yield unbounded recovery; our analysis and experiments instead point to a fundamental architectural limit\.
## 5Experiments
### 5\.1Experimental Setup
##### Tasks\.
We evaluate on 8 task domains spanning synthetic and real\-world benchmarks, designed to require deterministic state tracking at varying depths:
Synthetic \(controlled complexity\):
- •PermutationProbe:nn\-element permutation puzzles via adjacent transpositions; BFS\-optimal depths 5–60;n∈\{8,12,16\}n\\in\\\{8,12,16\\\}
- •FSA\-Sim:kk\-state deterministic automaton simulation;k∈\{4,8,16\}k\\in\\\{4,8,16\\\}; sequence lengths 10–100
- •ArithChain: Multi\-step symbolic integration with carry propagation
- •CircuitTrace: Boolean circuit evaluation through layered gates
- •CodeProbe: Variable tracking through Python execution traces
Real\-world \(production\-relevant\):
- •SWE\-Bench\-State: 500 instances from SWE\-Bench\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib52)\)requiring multi\-file state tracking for bug localization
- •WebArena\-Nav: 500 instances from WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib53)\)involving multi\-step navigation with session state
- •SQL\-Multi: 500 instances requiring 3\+ table joins with schema tracking, derived from Spider\(Yuet al\.,[2018](https://arxiv.org/html/2606.00376#bib.bib54)\)
##### Models\.
12 models across six organizations, covering general\-purpose, reasoning\-specialized, and open\-weight categories:
- •General: GPT\-4o, Claude\-4\.5\-Sonnet, Claude\-4\.5\-Opus, Gemini\-1\.5\-Pro
- •Reasoning: o3, o3\-mini, DeepSeek\-R1, Gemini\-2\.0\-Flash\-Thinking
- •Open\-weight: Llama\-3\.1\-8B, Llama\-3\.3\-70B, Qwen\-2\.5\-7B, Qwen\-2\.5\-72B
Open\-weight models enable direct attention entropy extraction; API models provide commercial baseline\.
##### Conditions\.
Five experimental conditions isolate different factors:
- •C1: Unconstrained neural CoT, standard prompting
- •C2: Depth\-limited CoT \(oracle optimal length\), controls for length
- •C3: Tool\-integrated \(BFS solver access\), upper bound on achievable accuracy
- •C4: Explicit length encouragement \(“take as many steps as needed”\), tests preference manipulation
- •C5: Fine\-tuned on optimal\-length traces, tests training intervention
##### Replication and Statistics\.
Each model\-task\-condition combination was evaluated with3 independent runsusing seeds\{42,2024,2025\}\\\{42,2024,2025\\\}; reported values are means±\\pmstandard deviation across runs\. We employ 95% bootstrap CIs \(10K resamples\), Holm\-Bonferroni correction for multiple comparisons, TOST equivalence testing \(Δ=5%\\Delta=5\\%\), and Bayes factors for null hypothesis support\. Total: 12 models×\\times5 conditions×\\times8 tasks×\\times500 instances×\\times3 runs = 720,000 evaluations\. Cost: $3,420\. Average runtime: 12\.3s per API call \(see Appendix[J](https://arxiv.org/html/2606.00376#A10)\)\.
### 5\.2Main Results
Table 2:Main results on PermutationProbe \(synthetic\) and SWE\-Bench\-State \(real\-world\)\. Tool delegation achieves 86–94% while neural CoT plateaus at 24–42%\.ModelC1C3C4C5d∗d^\{\*\}PermutationProbe \(Synthetic\)GPT\-4o28\.3±\\pm1\.889\.7±\\pm1\.229\.1±\\pm1\.731\.4±\\pm1\.822Claude\-4\.5\-Opus34\.8±\\pm2\.093\.6±\\pm0\.935\.6±\\pm1\.937\.1±\\pm2\.027o3\-mini42\.1±\\pm2\.294\.2±\\pm1\.343\.1±\\pm2\.144\.8±\\pm2\.131DeepSeek\-R139\.7±\\pm2\.193\.1±\\pm1\.140\.4±\\pm2\.042\.3±\\pm2\.129SWE\-Bench\-State \(Real\-World\)GPT\-4o24\.1±\\pm2\.386\.4±\\pm1\.824\.8±\\pm2\.226\.9±\\pm2\.319Claude\-4\.5\-Opus29\.6±\\pm2\.591\.2±\\pm1\.430\.2±\\pm2\.432\.4±\\pm2\.524o3\-mini36\.8±\\pm2\.692\.7±\\pm1\.337\.4±\\pm2\.539\.1±\\pm2\.628DeepSeek\-R134\.2±\\pm2\.590\.8±\\pm1\.534\.9±\\pm2\.436\.7±\\pm2\.526
##### Finding 1: Tool delegation dominates\.
C3 achieves 86–94% vs\. 24–42% for C1 across all tasks \(Table[2](https://arxiv.org/html/2606.00376#S5.T2)\)\. Effect size Cohen’sd=2\.1d=2\.1–3\.43\.4\(very large\)\. The improvement is consistent across model families: general\-purpose \(\+60%\), reasoning\-specialized \(\+54%\), and open\-weight \(\+60%\)\. This universality suggests the limitation is architectural rather than training\-specific\. Complete per\-model and per\-task breakdowns appear in Tables[13](https://arxiv.org/html/2606.00376#A4.T13),[33](https://arxiv.org/html/2606.00376#A21.T33), and[34](https://arxiv.org/html/2606.00376#A21.T34)\.
##### Finding 2: Preference manipulation ineffective\.
C4 improves by only \+0\.7–1\.0% over C1\. TOST confirms equivalence \(p<0\.001p<0\.001for both tails\)\. Bayes factorsBF01\>4\\text\{BF\}\_\{01\}\>4support null hypothesis\. This directly contradicts the Simplicity Bias prediction that encouraging longer reasoning should yield substantial gains\. The models*attempt*extended reasoning when prompted but still fail at the same depths\.
##### Finding 3: Real\-world tasks show consistent patterns\.
SWE\-Bench\-State, WebArena\-Nav, and SQL\-Multi exhibit same decoherence phenomenon withd∗∈\[19,26\]d^\{\*\}\\in\[19,26\], validating generalization beyond synthetic benchmarks \(Table[3](https://arxiv.org/html/2606.00376#S5.T3)\)\. These real\-world horizons sit within the lower portion of the\[19,31\]\[19,31\]primary\-suite range, reflecting additional complexity from natural language ambiguity and larger state spaces\.
Table 3:Real\-world task validation\. Deterministic Horizon consistent across domains\. Mean±\\pmstd over 3 runs\.
### 5\.3Fine\-Tuning Experiment
This experiment provides the key discrimination from preference\-based explanations\. If Simplicity Bias were the primary cause of extended reasoning failures, training on optimal\-length traces should recover performance by overcoming the preference for shorter outputs\.
##### Setup\.
We fine\-tune Llama\-3\.1\-8B on 5,000 optimal\-length CoT traces \(depth = BFS\-optimal\)\. Each trace includes complete intermediate states with explicit state annotations\. Training: 3 epochs, lr=2×10−52\\times 10^\{\-5\}, cosine decay\. Validation split: 500 instances held out\.
##### Results\.
C5 improves by only \+3\.2% over C1 baseline, far below Wu et al\.’s predicted\>\>30% recovery\. Critically, the fine\-tuning benefit vanishes beyond the horizon \(d∗≈20d^\{\*\}\\approx 20\) regardless of training data distribution, matching our Theorem[4\.10](https://arxiv.org/html/2606.00376#S4.Thmtheorem10)prediction\. Even when trained exclusively on depth 30–40 traces, the model cannot exceed 15% accuracy beyond depth 40\. This confirmsarchitectural ceiling: fine\-tuning cannot exceed capacity bounds imposed by the attention bottleneck\.
##### Ablation: Training Data Distribution\.
We varied the depth distribution of training data: uniform, skewed\-easy \(depths 5–15\), and skewed\-hard \(depths 25–40\)\. All distributions yield similar test performance \(Δ<2%\\Delta<2\\%\), indicating the ceiling is depth\-dependent rather than distribution\-dependent\.
### 5\.4SSJ Validation
Table 4:SSJ with precision/recall at increasing depths\. Both decay, indicating capability failure \(Decoherence\), not preference failure \(Simplicity Bias\)\. Mean±\\pmstd over 3 runs\.Table[4](https://arxiv.org/html/2606.00376#S5.T4)shows both precision and recall decay with depth, indicating models drift into fictitious state spaces \(capability failure\) rather than merely truncating \(preference failure\)\. The parallel decay of both metrics is key: if failure were preference\-based \(Simplicity Bias\), precision would remain high while only recall decays\. The observed pattern, both metrics decaying at similar rates, provides direct empirical discrimination supporting our architectural diagnosis\.
Fitting SSJ\(d\)=ae−bd−cd2\(d\)=ae^\{\-bd\-cd^\{2\}\}yieldsR2=0\.99R^\{2\}=0\.99\(Figure[3](https://arxiv.org/html/2606.00376#A31.F3)\), consistent with super\-exponential decay from Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\. The quadratic termcd2cd^\{2\}is statistically significant \(p<0\.01p<0\.01\), confirming context\-dependent rather than constant error accumulation\.
### 5\.5Accuracy Decay Visualization
Figure[1](https://arxiv.org/html/2606.00376#S5.F1)visualizes the accuracy decay predicted by Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\. The super\-exponential decay \(solid line\) fits empirical data \(R2=0\.96R^\{2\}=0\.96\) significantly better than linear \(R2=0\.71R^\{2\}=0\.71\) or simple exponential \(R2=0\.83R^\{2\}=0\.83\) models, validating our context\-dependent error formulation\.
010102020303040405050606002020404060608080100100d∗d^\{\*\}Reasoning depthddAccuracy \(%\)TheoreticalEmpiricalToolFigure 1:Accuracy versus reasoning depth on PermutationProbe\. Neural CoT \(*Empirical*, GPT\-4o\) follows the super\-exponential decay of Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\(*Theoretical*, solid\), crossing50%50\\%\(dotted\) at the deterministic horizond∗≈22d^\{\*\}\\\!\\approx\\\!22\(dashed\); tool delegation \(*Tool*, C3\) stays near90%90\\%\. Markers: mean±\\pmstd over 3 runs\.
### 5\.6Cross\-Model Universality
Cross\-model correlations ranger=0\.81r=0\.81–0\.910\.91\(Table[5](https://arxiv.org/html/2606.00376#S5.T5); full matrix, including the maximal Llama–Qwen correlation, in Appendix[I](https://arxiv.org/html/2606.00376#A9)\)\. Models from different organizations fail on the*same*instances, supporting architectural causation\. Partial correlation controlling for BFS depth:r=0\.73r=0\.73\. This finding rules out training\-specific explanations: if failures were due to dataset biases or optimization choices, we would expect low cross\-model correlation\. The high correlation indicates shared architectural constraints\.
Table 5:Cross\-model correlation \(per\-instance accuracy\)\. High correlation supports architectural causation\.
## 6Analysis
### 6\.1Architecture Ablations
We validate the scaling predictions from Corollary[4\.9](https://arxiv.org/html/2606.00376#S4.Thmtheorem9)through controlled comparisons within model families\. This isolates architectural effects from training data and optimization differences\.
Table 6:Architecture ablation ford∗≈0\.314dh⋅Hd^\{\*\}\\approx 0\.314\\sqrt\{d\_\{h\}\\cdot H\}on open\-weight models\. Observed values \(mean±\\pmstd over 3 runs\) match the empirical fit within 2%\.70B\-class models achieved∗≈28d^\{\*\}\\approx 28vs\.d∗≈19d^\{\*\}\\approx 19–20 for 7–8B models \(Table[6](https://arxiv.org/html/2606.00376#S6.T6)\)\. The ratio≈1\.43\\approx 1\.43matches the predicteddhH\\sqrt\{d\_\{h\}H\}ratio \(2≈1\.41\\sqrt\{2\}\\approx 1\.41for Llama,2\.29≈1\.51\\sqrt\{2\.29\}\\approx 1\.51for Qwen\)\. The close agreement provides empirical support for the information\-theoretic framework, with the caveat that thedhH\\sqrt\{d\_\{h\}H\}form is a leading\-order approximation \(Lemma[R\.8](https://arxiv.org/html/2606.00376#A18.Thmtheorem8)\)\.
##### Context\-Window Insensitivity\.
The horizon is governed by the effective decoherence lengthLeffL\_\{\\mathrm\{eff\}\}, not the raw context windowLL\. Truncating Llama\-3\.3\-70B’s context window from 128K down to 8K leavesd∗d^\{\*\}essentially unchanged \(≈28\\approx 28\);d∗d^\{\*\}degrades only once the available context approaches the reasoning\-trace length itself \(∼\\sim2–4K tokens\), where the model can no longer fit its own intermediate states \(Table[41](https://arxiv.org/html/2606.00376#A26.T41)\)\. This dissociation betweend∗d^\{\*\}andLLis a direct prediction of theLeffL\_\{\\mathrm\{eff\}\}formulation and rules out raw context capacity as the binding constraint on deterministic state tracking\.
##### Within\-Family Scaling\.
For Llama\-3 \(8B→\\rightarrow70B\),d∗d^\{\*\}increases from 20 to 28 \(\+40%\)\. For Qwen\-2\.5 \(7B→\\rightarrow72B\),d∗d^\{\*\}increases from 19 to 28 \(\+47%\)\. Both approximately follow the predicteddh⋅H\\sqrt\{d\_\{h\}\\cdot H\}scaling \(Lemma[R\.8](https://arxiv.org/html/2606.00376#A18.Thmtheorem8); treated as a leading\-order trend\), consistent with larger models gaining capacity through increased head count and dimension rather than qualitatively different representations\.
### 6\.2Consistency Checks on Closed\-Source Models
The predictions above depend on the head countHHand head dimensiondhd\_\{h\}, which are officially documented only for open\-weight models\. We therefore restrict all predictive validation of thedhH\\sqrt\{d\_\{h\}H\}scaling, and all attention\-entropy measurements, to the open\-weight suite \(Llama\-3\.1\-8B, Llama\-3\.3\-70B, Qwen\-2\.5\-7B, Qwen\-2\.5\-72B\), whose architectures are fully specified\. Results for closed\-source systems \(GPT\-4o, the Claude\-4\.5 family, o3\-mini\) are reported throughout as*consistency checks*rather than architectural evidence: they exhibit the same qualitative decoherence pattern and the same tool\-delegation gap as the open\-weight models, but because their architectural parameters are not publicly verifiable we do not use them to substantiate the scaling law\. Any architectural values quoted for these models \(e\.g\., the GPT\-4o illustration in Section[4](https://arxiv.org/html/2606.00376#S4)\) are widely cited community estimates used only for order\-of\-magnitude intuition\. The agreement between the verifiable open\-weight evidence and the closed\-source consistency checks strengthens, but is not required for, the architectural interpretation of our results\.
### 6\.3Attention Entropy Validation
We directly measure attention patterns in open\-weight models \(the four open\-weight models from our main suite plus Mistral\-7B and Mixtral\-8x7B\) to validate the mechanistic hypothesis underlying our theoretical framework\.
Table 7:Attention entropy correlation with accuracy across open\-weight models\. Consistent negative correlation validates mechanistic hypothesis\.Consistent negative correlation \(r≈−0\.70r\\approx\-0\.70\) across five architectures provides strong mechanistic evidence that attention dilution causes decoherence \(Table[7](https://arxiv.org/html/2606.00376#S6.T7)\)\. The correlation is strongest in later layers \(layers 51–80:r=−0\.74r=\-0\.74\) compared to early layers \(layers 1–20:r=−0\.42r=\-0\.42\), consistent with the representational collapse mechanism described byGerasimovet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib11)\)\.
##### Per\-Step Analysis\.
We computed attention entropy at each reasoning step for 500 traces\. Entropy increases linearly with step number \(r=0\.73r=0\.73,p<0\.001p<0\.001\), validating the linear component of our error model\. The slopeβ=0\.023\\beta=0\.023bits/step confirms monotone growth of attention entropy with reasoning depth \(Figure[2](https://arxiv.org/html/2606.00376#A25.F2)\), the mechanism underlying the linear component of our error model\.
### 6\.4Encoder\-Decoder Comparison
Encoder\-decoder models achieve 2\.8×\\timeshigher accuracy at depth 30 \(T5\-Large: 67\.3% vs\. GPT\-4o: 23\.4%\), consistent with Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)’s prediction that bidirectional attention providesO\(L\)O\(L\)vs\.O\(logL\)O\(\\log L\)capacity for full\-history access\.333Encoder\-decoder models are fine\-tuned on the seq2seq formulation of the task, whereas decoder\-only models are evaluated in\-context; absolute accuracies are therefore not directly comparable to the in\-context CoT results in Table[2](https://arxiv.org/html/2606.00376#S5.T2)or to the PermutationProbe decay curve in Figure[1](https://arxiv.org/html/2606.00376#S5.F1); in particular, the GPT\-4o accuracy quoted here is on this seq2seq task variant and so differs from its PermutationProbe value at the same depth\. The comparison isolates the effect of bidirectional vs\. causal attention on*depth scaling*, not raw capability\.Full results in Appendix[D\.1](https://arxiv.org/html/2606.00376#A4.SS1)\.
### 6\.5Failure Mode Analysis
Analysis of 2,847 failed reasoning traces reveals systematic patterns: state transposition errors \(35%\), operator misapplication \(27%\), premature termination \(21%\), circular reasoning \(11%\), and complete hallucination \(6%\)\. The dominant failure, state transposition, directly reflects attention\-based working memory limits: models misremember element positions due to attention dilution\. Full analysis and examples in Appendix[K](https://arxiv.org/html/2606.00376#A11)\.
## 7Discussion
##### Relationship to Simplicity Bias\.
Our Decoherence framework provides*complementary*architectural explanation toWuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)’s preference\-based account\. The C5 fine\-tuning result \(<<5% improvement\) and C4 result \(<<2% improvement\) indicate architectural limits impose hard ceilings that preference manipulation cannot overcome\. Both mechanisms likely contribute: Simplicity Bias may truncate*before*architectural limits, while Decoherence prevents recovery*beyond*them\. The key insight is that even if models*wanted*to reason correctly at depth 50, they*cannot*: the attention bottleneck prevents reliable state tracking\. This distinction has important practical implications: interventions targeting preferences \(prompting, RLHF\) cannot overcome architectural constraints\.
##### Cost\-Benefit Analysis\.
Tool delegation achieves 4\.2–4\.7×\\timesbetter cost\-per\-correct\-solution across GPT\-4o and Claude\-4\.5 \(GPT\-4o: $0\.021 vs\. $0\.089 for neural CoT; Table[8](https://arxiv.org/html/2606.00376#S7.T8)\)\. Best\-of\-10 sampling costs 11×\\timesmore without matching accuracy \(55% vs\. 90%\)\. Extended neural reasoning represents wasted compute: the marginal cost of additional tokens yields diminishing and eventually negative returns\. For production systems processing millions of queries, this efficiency gain translates to substantial infrastructure savings\.
Table 8:Cost\-per\-correct\-solution \(CPC\) analysis\. The “vs\. C3” column reports each strategy’s CPC relative to same\-model tool delegation\. Tool delegation achieves a 4\.2×\\timesefficiency gain over neural CoT for GPT\-4o\.
##### Implications for Agentic Systems\.
Our Deterministic Horizon provides a practical threshold: tasks requiring exact state tracking beyond the horizon \(d∗∈\[19,31\]d^\{\*\}\\in\[19,31\], depending on the model\) should not rely on pure neural reasoning\. This informs three key domains:code agentsrequiring multi\-step execution with variable tracking,planning systemswith deterministic state transitions \(aligning with[Kambhampatiet al\.](https://arxiv.org/html/2606.00376#bib.bib49)’s observation that LLMs cannot plan but can assist planning\), andverification tasksinvolving formal reasoning chains that benefit from symbolic computation delegation\.
##### Architectural Implications\.
Our results suggest several directions for architecture design: \(1\)*Explicit state registers*: Augmenting transformers with dedicated state\-tracking substrates could extend the horizon; \(2\)*Adaptive tool routing*: Systems that automatically detect decoherence onset and delegate to tools; \(3\)*Hierarchical attention*: Multi\-scale attention mechanisms that maintain both local and global state coherence\.
##### Deployment Guideline\.
We recommend defaulting to tool delegation once a task approaches the Deterministic Horizon \(i\.e\., roughly 20 or more deterministic state transitions, at the lower end of the measuredd∗∈\[19,31\]d^\{\*\}\\in\[19,31\]range\)\. This horizon is consistent across model families, and tool\-integrated reasoning achieves 4\.2–4\.7×\\timesbetter cost efficiency than pure neural approaches\. Concretely, tasks that routinely exceed this horizon include tracing program state through 20 or more sequential mutations, composing long permutation or rotation sequences \(e\.g\.,≥\\geq20\-move sliding\-tile or cube\-style puzzles\), multi\-hop entity and state tracking across long transcripts, and planning many\-table SQL joins; by contrast, problems that decompose into short independent sub\-derivations or tolerate approximate answers \(most arithmetic word problems, retrieval, and summarization\) rarely approach it and need not be delegated\. For safety\-critical deployments, mandatory tool verification should be enforced on all multi\-step reasoning chains\.
## 8Conclusion
We presented theory and experiments indicating that decoder\-only transformers face information\-theoretic limits on state\-tracking capacity, with systematic reasoning failures emerging beyond the Deterministic Horizon \(d∗∈\[19,31\]d^\{\*\}\\in\[19,31\]\)\. Our theoretical and empirical contributions include:
1. 1\.TheAttention Bottleneck Theoremprovides capacity bounds with a complementary achievability construction, identifyingO\(H⋅log\(L/H\)⋅dh\)O\(H\\cdot\\log\(L/H\)\\cdot\\sqrt\{d\_\{h\}\}\)as the governing capacity form under our modeling assumptions\.
2. 2\.Context\-dependent errormodelϵ\(d\)=ϵ0\+γd/Leff\\epsilon\(d\)=\\epsilon\_\{0\}\+\\gamma d/L\_\{\\mathrm\{eff\}\}yields super\-exponential accuracy decay, validated by direct attention entropy measurements across five open\-weight architectures \(r=−0\.74r=\-0\.74\)\.
3. 3\.Fine\-tuning experimentsconfirm architectural ceiling \(<<5% recovery vs\.\>\>30% predicted by preference\-based accounts\), providing key discrimination from the Simplicity Bias hypothesis\.
4. 4\.Real\-world validationon SWE\-Bench, WebArena, and SQL\-Multi demonstrates consistent patterns \(d∗∈\[19,26\]d^\{\*\}\\in\[19,26\]\) beyond synthetic benchmarks\.
5. 5\.On the primary suite, tool delegation achieves86–94% vs\. 24–42%for neural CoT with4\.2–4\.7×\\timescost efficiency\.
##### Limitations and Future Work\.
\(i\)*Closed models\.*Sinced∗d^\{\*\}depends onHHanddhd\_\{h\}, public only for open\-weight models, we reportd∗d^\{\*\}for proprietary systems as an empirical fit and validate thedhH\\sqrt\{d\_\{h\}H\}scaling only on open\-weight checkpoints\. \(ii\)*Fine\-tuning\.*The architectural\-ceiling result uses one open\-weight model \(Llama\-3\.1\-8B, 5,000 traces\); larger models and datasets are needed to separate capacity from data limits\. \(iii\)*Tool comparison\.*Our tool condition uses an exact BFS solver, so the C1\-vs\-C3 gap upper\-bounds the benefit of delegation; imperfect tools are left to future work\. \(iv\)*Sensitivity\.*d∗d^\{\*\}depends onϵ0\\epsilon\_\{0\}, sensitive to prompt format and few\-shot conditioning, so the reported intervals are regime indicators rather than exact per\-model constants\. \(v\)*Scope\.*We target deterministic, exactly\-checkable state tracking; tasks tolerating approximate or stochastic solutions \(e\.g\., GSM8K\-style problems\) need not obey the same horizon\.
##### Reproducibility\.
## Impact Statement
##### Safety Implications\.
Our findings have direct implications for AI safety in deployment scenarios\. LLMs should not be trusted for autonomous decision\-making in deterministic domains beyond the Deterministic Horizon \(d∗∈\[19,31\]d^\{\*\}\\in\[19,31\]steps\)\. Practitioners deploying LLMs in safety\-critical settings \(including code agents, planning systems, and verification tools\) should be aware of this fundamental limitation and implement tool\-based verification for multi\-step reasoning tasks\.
##### Environmental Considerations\.
Extended neural reasoning without accuracy improvement represents wasted compute and carbon emissions\. Our cost analysis demonstrates 4\.2–4\.7×\\timesefficiency gains from tool delegation, translating directly to environmental benefits\. A shift toward hybrid neurosymbolic systems for deterministic tasks could significantly reduce the carbon footprint of AI\-assisted decision\-making\.
##### Research Directions\.
This work opens several research directions: \(1\) developing adaptive systems that automatically detect when to delegate to tools; \(2\) designing architectures with explicit state\-tracking substrates; \(3\) characterizing which reasoning tasks benefit from extended neural computation versus tool delegation\.
## References
- V\. Balachandran, J\. Chen, L\. Chen, S\. Garg, N\. Joshi, Y\. Lara, J\. Langford, B\. Nushi, V\. Vineet, Y\. Wu, and S\. Yousefi \(2025\)Inference\-time scaling for complex tasks: where we stand and what lies ahead\.arXiv preprintarXiv\.2504\.00294\.Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- F\. Barbero, A\. Banino, S\. Kapturowski, D\. Kumaran, J\. G\. M\. Araújo, O\. Vitvitskyi, R\. Pascanu, and P\. Velickovic \(2024\)Transformers need glasses\! information over\-squashing in language tasks\.InAdvances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p1.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.8.6.3),[1st item](https://arxiv.org/html/2606.00376#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.00376#S4.SS2.SSS0.Px1.p1.6)\.
- A\. A\. Bavandpour, X\. Huang, M\. Rofin, and M\. Hahn \(2025\)Lower bounds for chain\-of\-thought reasoning in hard\-attention transformers\.InForty\-second International Conference on Machine Learning,Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p1.3),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- E\. M\. Bender and A\. Koller \(2020\)Climbing towards NLU: on meaning, form, and understanding in the age of data\.InProceedings of the 58th Annual Meeting of the Association for Computational Linguistics, ACL 2020, Online, July 5\-10, 2020,pp\. 5185–5198\.Cited by:[§1](https://arxiv.org/html/2606.00376#S1.SS0.SSS0.Px3.p1.1)\.
- M\. Besta, N\. Blach, A\. Kubicek, R\. Gerstenberger, M\. Podstawski, L\. Gianinazzi, J\. Gajda, T\. Lehmann, H\. Niewiadomski, P\. Nyczyk, and T\. Hoefler \(2024\)Graph of thoughts: solving elaborate problems with large language models\.Proceedings of the AAAI Conference on Artificial Intelligence38\(16\),pp\. 17682–17690\.External Links:ISSN 2159\-5399Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Bhattamishra, A\. Patel, and N\. Goyal \(2020\)On the computational power of transformers and its implications in sequence modeling\.InProceedings of the 24th Conference on Computational Natural Language Learning, CoNLL 2020, Online, November 19\-20, 2020,pp\. 455–475\.Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- A\. Bietti, V\. Cabannes, D\. Bouchacourt, H\. Jégou, and L\. Bottou \(2023\)Birth of a transformer: A memory viewpoint\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1)\.
- B\. Brown, J\. Juravsky, R\. Ehrlich, R\. Clark, Q\. V\. Le, C\. Ré, and A\. Mirhoseini \(2024\)Large language monkeys: scaling inference compute with repeated sampling\.arXiv preprintarXiv\.2407\.21787\.Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- W\. Chen, X\. Ma, X\. Wang, and W\. W\. Cohen \(2023\)Program of thoughts prompting: disentangling computation from reasoning for numerical reasoning tasks\.Trans\. Mach\. Learn\. Res\.2023\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- X\. Chen, J\. Xu, T\. Liang, Z\. He, J\. Pang, D\. Yu, L\. Song, Q\. Liu, M\. Zhou, Z\. Zhang, R\. Wang, Z\. Tu, H\. Mi, and D\. Yu \(2025\)Do NOT think that much for 2\+3=? on the overthinking of long reasoning models\.InForty\-second International Conference on Machine Learning,Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1)\.
- M\. Deb and T\. Ogunfunmi \(2025\)Information\-theoretical analysis of a transformer\-based generative AI model\.Entropy27\(6\),pp\. 589\.Cited by:[§A\.7](https://arxiv.org/html/2606.00376#A1.SS7.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Delétang, A\. Ruoss, J\. Grau\-Moya, T\. Genewein, L\. K\. Wenliang, E\. Catt, C\. Cundy, M\. Hutter, S\. Legg, J\. Veness, and P\. A\. Ortega \(2023\)Neural networks and the chomsky hierarchy\.InThe Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1\-5, 2023,Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- N\. Dziri, X\. Lu, M\. Sclar, X\. L\. Li, L\. Jiang, B\. Y\. Lin, S\. Welleck, P\. West, C\. Bhagavatula, R\. L\. Bras, J\. D\. Hwang, S\. Sanyal, X\. Ren, A\. Ettinger, Z\. Harchaoui, and Y\. Choi \(2023\)Faith and fate: limits of transformers on compositionality\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p1.1),[§1](https://arxiv.org/html/2606.00376#S1.p2.3),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- B\. Efron \(1987\)Better bootstrap confidence intervals\.Journal of the American Statistical Association82\(397\),pp\. 171–185\.External Links:[Document](https://dx.doi.org/10.1080/01621459.1987.10478410)Cited by:[§W\.1](https://arxiv.org/html/2606.00376#A23.SS1.p3.1)\.
- N\. Elhage, N\. Nanda, C\. Olsson, T\. Henighan, N\. Joseph, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, N\. DasSarma, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2021\)A mathematical framework for transformer circuits\.Note:Transformer Circuits ThreadCited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1)\.
- G\. Feng, B\. Zhang, Y\. Gu, H\. Ye, D\. He, and L\. Wang \(2023\)Towards revealing the mystery behind chain of thought: A theoretical perspective\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- L\. Gao, A\. Madaan, S\. Zhou, U\. Alon, P\. Liu, Y\. Yang, J\. Callan, and G\. Neubig \(2023\)PAL: program\-aided language models\.InInternational Conference on Machine Learning, ICML 2023, 23\-29 July 2023, Honolulu, Hawaii, USA,Proceedings of Machine Learning Research,pp\. 10764–10799\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[2nd item](https://arxiv.org/html/2606.00376#S1.I1.i2.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- G\. Gerasimov, Y\. Aksenov, N\. Balagansky, V\. Sinii, and D\. Gavrilov \(2025\)You do not fully utilize transformer’s representation capacity\.arXiv preprintarXiv\.2502\.09245\.Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1),[§4\.2](https://arxiv.org/html/2606.00376#S4.SS2.SSS0.Px1.p1.6),[§6\.3](https://arxiv.org/html/2606.00376#S6.SS3.p2.3)\.
- D\. Gong and H\. Zhang \(2024\)Self\-attention limits working memory capacity of transformer\-based models\.InNeurIPS 2024 Workshop on Behavioral Machine Learning,Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p1.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.6.4.2),[1st item](https://arxiv.org/html/2606.00376#S1.I1.i1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.00376#S4.SS1.SSS0.Px1.p1.5)\.
- N\. Goodman, J\. Mu, Y\. Wu, and E\. Zelikman \(2022\)STaR: bootstrapping reasoning with reasoning\.InAdvances in Neural Information Processing Systems 35,NeurIPS 2022,pp\. 15476–15488\.Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- Z\. Gou, Z\. Shao, Y\. Gong, Y\. Shen, Y\. Yang, M\. Huang, N\. Duan, and W\. Chen \(2024\)ToRA: A tool\-integrated reasoning agent for mathematical problem solving\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.9.7.2),[2nd item](https://arxiv.org/html/2606.00376#S1.I1.i2.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- D\. Guo, D\. Yang, H\. Zhang, J\. Song, P\. Wang, Q\. Zhu, R\. Xu, R\. Zhang, S\. Ma, X\. Bi, X\. Zhang, X\. Yu, Y\. Wu, Z\. F\. Wu, Z\. Gou, Z\. Shao, Z\. Li, Z\. Gao,et al\.\(2025\)DeepSeek\-R1 incentivizes reasoning in LLMs through reinforcement learning\.Nature645,pp\. 633–638\.External Links:[Document](https://dx.doi.org/10.1038/s41586-025-09422-z)Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- M\. Hahn \(2020\)Theoretical limitations of self\-attention in neural sequence models\.Transactions of the Association for Computational Linguistics8,pp\. 156–171\.External Links:ISSN 2307\-387XCited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- C\. E\. Jimenez, J\. Yang, A\. Wettig, S\. Yao, K\. Pei, O\. Press, and K\. R\. Narasimhan \(2024\)SWE\-bench: can language models resolve real\-world github issues?\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,Cited by:[§AB\.2\.1](https://arxiv.org/html/2606.00376#A28.SS2.SSS1.p1.1),[§E\.1](https://arxiv.org/html/2606.00376#A5.SS1.p1.1),[1st item](https://arxiv.org/html/2606.00376#S5.I2.i1.p1.1)\.
- S\. Kambhampati, K\. Valmeekam, L\. Guan, M\. Verma, K\. Stechly, S\. Bhambri, L\. Saldyt, and A\. Murthy \(2024\)Position: llms can’t plan, but can help planning in llm\-modulo frameworks\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,Proceedings of Machine Learning Research,pp\. 22895–22907\.Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p2.3),[§7](https://arxiv.org/html/2606.00376#S7.SS0.SSS0.Px3.p1.1)\.
- D\. Keysers, N\. Schärli, N\. Scales, H\. Buisman, D\. Furrer, S\. Kashubin, N\. Momchev, D\. Sinopalnikov, L\. Stafiniak, T\. Tihon, D\. Tsarkov, X\. Wang, M\. van Zee, and O\. Bousquet \(2020\)Measuring compositional generalization: A comprehensive method on realistic data\.In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\-30, 2020,Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- J\. Kim, K\. Shah, V\. Kontonis, S\. M\. Kakade, and S\. Chen \(2025\)Train for the worst, plan for the best: understanding token ordering in masked diffusions\.InForty\-second International Conference on Machine Learning, ICML 2025, Vancouver, BC, Canada, July 13\-19, 2025,Proceedings of Machine Learning Research\.Cited by:[§1](https://arxiv.org/html/2606.00376#S1.SS0.SSS0.Px2.p1.1)\.
- N\. Kim and T\. Linzen \(2020\)COGS: A compositional generalization challenge based on semantic interpretation\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing, EMNLP 2020, Online, November 16\-20, 2020,pp\. 9087–9105\.Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- T\. Kojima, S\. S\. Gu, M\. Reid, Y\. Matsuo, and Y\. Iwasawa \(2022\)Large language models are zero\-shot reasoners\.InProceedings of the 36th International Conference on Neural Information Processing Systems,NIPS ’22,Red Hook, NY, USA\.External Links:ISBN 9781713871088Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- B\. M\. Lake and M\. Baroni \(2018\)Generalization without systematicity: on the compositional skills of sequence\-to\-sequence recurrent networks\.InProceedings of the 35th International Conference on Machine Learning, ICML 2018, Stockholmsmässan, Stockholm, Sweden, July 10\-15, 2018,Proceedings of Machine Learning Research,pp\. 2879–2888\.Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- D\. Lakens \(2017\)Equivalence tests: a practical primer for t\-tests, correlations, and meta\-analyses\.Social Psychological and Personality Science8\(4\),pp\. 355–362\.Cited by:[§X\.2](https://arxiv.org/html/2606.00376#A24.SS2.p2.1)\.
- M\. Levy, A\. Jacoby, and Y\. Goldberg \(2024\)Same task, more tokens: the impact of input length on the reasoning performance of large language models\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2024, Bangkok, Thailand, August 11\-16, 2024,pp\. 15339–15353\.Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1)\.
- J\. Lewandowsky and G\. Bauch \(2024\)Theory and application of the information bottleneck method\.Entropy26\(3\),pp\. 187\.Cited by:[§A\.7](https://arxiv.org/html/2606.00376#A1.SS7.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- C\. Li, J\. Liang, A\. Zeng, X\. Chen, K\. Hausman, D\. Sadigh, S\. Levine, L\. Fei\-Fei, F\. Xia, and B\. Ichter \(2024\)Chain of code: reasoning with a language model\-augmented code emulator\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,Proceedings of Machine Learning Research,pp\. 28259–28277\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- N\. F\. Liu, K\. Lin, J\. Hewitt, A\. Paranjape, M\. Bevilacqua, F\. Petroni, and P\. Liang \(2024a\)Lost in the middle: how language models use long contexts\.Trans\. Assoc\. Comput\. Linguistics12,pp\. 157–173\.Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.00376#S4.SS1.SSS0.Px1.p1.5)\.
- Z\. Liu, H\. Liu, D\. Zhou, and T\. Ma \(2024b\)Chain of thought empowers transformers to solve inherently serial problems\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p2.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.3.1.2),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- H\. Luo, H\. Feng, Q\. Sun, C\. Xu, K\. Zheng, Y\. Wang, T\. YANG, H\. Hu, and Y\. Tang \(2026\)AgentMath: empowering mathematical reasoning for large language models via tool\-augmented agent\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- S\. V\. Marjanovic, A\. Patel, V\. Adlakha, M\. Aghajohari, P\. BehnamGhader, M\. Bhatia, A\. Khandelwal, A\. Kraft, B\. Krojer, X\. H\. Lù, N\. Meade, D\. Shin, A\. Kazemnejad, G\. Kamath, M\. Mosbach, K\. Stanczak, and S\. Reddy \(2026\)DeepSeek\-r1 thoughtology: let’s think about LLM reasoning\.Transactions on Machine Learning Research\.External Links:ISSN 2835\-8856Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1)\.
- W\. Merrill, J\. Petty, and A\. Sabharwal \(2024\)The illusion of state in state\-space models\.InForty\-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21\-27, 2024,Proceedings of Machine Learning Research,pp\. 35492–35506\.Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p1.3),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- W\. Merrill and A\. Sabharwal \(2024\)The expressive power of transformers with chain of thought\.InThe Twelfth International Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p2.1),[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p1.3),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.5.3.2),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- G\. Mialon, R\. Dessì, M\. Lomeli, C\. Nalmpantis, R\. Pasunuru, R\. Raileanu, B\. Rozière, T\. Schick, J\. Dwivedi\-Yu, A\. Celikyilmaz, E\. Grave, Y\. LeCun, and T\. Scialom \(2023\)Augmented language models: a survey\.Trans\. Mach\. Learn\. Res\.2023\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- M\. Nye, A\. J\. Andreassen, G\. Gur\-Ari, H\. Michalewski, J\. Austin, D\. Bieber, D\. Dohan, A\. Lewkowycz, M\. Bosma, D\. Luan, C\. Sutton, and A\. Odena \(2021\)Show your work: scratchpads for intermediate computation with language models\.arXiv preprintarXiv\.2112\.00114\.Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- C\. Olsson, N\. Elhage, N\. Nanda, N\. Joseph, N\. DasSarma, T\. Henighan, B\. Mann, A\. Askell, Y\. Bai, A\. Chen, T\. Conerly, D\. Drain, D\. Ganguli, Z\. Hatfield\-Dodds, D\. Hernandez, S\. Johnston, A\. Jones, J\. Kernion, L\. Lovitt, K\. Ndousse, D\. Amodei, T\. Brown, J\. Clark, J\. Kaplan, S\. McCandlish, and C\. Olah \(2022\)In\-context learning and induction heads\.arXiv preprintarXiv\.2209\.11895\.Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Ontañón, J\. Ainslie, Z\. Fisher, and V\. Cvicek \(2022\)Making transformers solve compositional tasks\.InProceedings of the 60th Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2022, Dublin, Ireland, May 22\-27, 2022,pp\. 3591–3607\.Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- OpenAI \(2024\)Learning to reason with LLMs\.Note:OpenAI BlogCited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- L\. Pan, A\. Albalak, X\. Wang, and W\. Y\. Wang \(2023\)Logic\-lm: empowering large language models with symbolic solvers for faithful logical reasoning\.InFindings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6\-10, 2023,Findings of ACL,pp\. 3806–3824\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[2nd item](https://arxiv.org/html/2606.00376#S1.I1.i2.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- A\. Parisi, Y\. Zhao, and N\. Fiedel \(2022\)TALM: tool augmented language models\.arXiv preprintarXiv\.2205\.12255\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- S\. G\. Patil, T\. Zhang, X\. Wang, and J\. E\. Gonzalez \(2024\)Gorilla: large language model connected with massive apis\.InAdvances in Neural Information Processing Systems 37: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 \- 15, 2024,Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- B\. Peng, S\. Narayanan, and C\. Papadimitriou \(2024\)On limitations of the transformer architecture\.InFirst Conference on Language Modeling,Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Pérez, P\. Barceló, and J\. Marinkovic \(2021\)Attention is turing\-complete\.J\. Mach\. Learn\. Res\.22,pp\. 75:1–75:35\.Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- J\. Petty, S\. van Steenkiste, I\. Dasgupta, F\. Sha, D\. Garrette, and T\. Linzen \(2024\)The impact of depth on compositional generalization in transformer language models\.InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\), NAACL 2024, Mexico City, Mexico, June 16\-21, 2024,pp\. 7239–7252\.Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- O\. Press, M\. Zhang, S\. Min, L\. Schmidt, N\. A\. Smith, and M\. Lewis \(2023\)Measuring and narrowing the compositionality gap in language models\.InFindings of the Association for Computational Linguistics: EMNLP 2023, Singapore, December 6\-10, 2023,Findings of ACL,pp\. 5687–5711\.Cited by:[§A\.6](https://arxiv.org/html/2606.00376#A1.SS6.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px6.p1.1)\.
- Y\. Qin, S\. Hu, Y\. Lin, W\. Chen, N\. Ding, G\. Cui, Z\. Zeng, X\. Zhou, Y\. Huang, C\. Xiao, C\. Han, Y\. R\. Fung, Y\. Su, H\. Wang, C\. Qian, R\. Tian, K\. Zhu, S\. Liang, X\. Shen, B\. Xu, Z\. Zhang, Y\. Ye, B\. Li, Z\. Tang, J\. Yi, Y\. Zhu, Z\. Dai, L\. Yan, X\. Cong, Y\. Lu, W\. Zhao, Y\. Huang, J\. Yan, X\. Han, X\. Sun, D\. Li, J\. Phang, C\. Yang, T\. Wu, H\. Ji, G\. Li, Z\. Liu, and M\. Sun \(2025\)Tool learning with foundation models\.ACM Comput\. Surv\.57\(4\),pp\. 101:1–101:40\.Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- J\. N\. Rouder, P\. L\. Speckman, D\. Sun, R\. D\. Morey, and G\. Iverson \(2009\)Bayesian t tests for accepting and rejecting the null hypothesis\.Psychonomic Bulletin & Review16\(2\),pp\. 225–237\.External Links:ISSN 1531\-5320Cited by:[§W\.4](https://arxiv.org/html/2606.00376#A23.SS4.p1.1)\.
- C\. Sanford, D\. J\. Hsu, and M\. Telgarsky \(2023\)Representational strengths and limitations of transformers\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p2.1),[§B\.4](https://arxiv.org/html/2606.00376#A2.SS4.6.p6.1),[§4\.2](https://arxiv.org/html/2606.00376#S4.SS2.SSS0.Px2.p2.1)\.
- T\. Schick, J\. Dwivedi\-Yu, R\. Dessì, R\. Raileanu, M\. Lomeli, E\. Hambro, L\. Zettlemoyer, N\. Cancedda, and T\. Scialom \(2023\)Toolformer: language models can teach themselves to use tools\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.5](https://arxiv.org/html/2606.00376#A1.SS5.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px5.p1.1)\.
- R\. Shwartz\-Ziv and N\. Tishby \(2017\)Opening the black box of deep neural networks via information\.arXiv preprintarXiv\.1703\.00810\.Cited by:[§A\.7](https://arxiv.org/html/2606.00376#A1.SS7.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- C\. V\. Snell, J\. Lee, K\. Xu, and A\. Kumar \(2025\)Scaling LLM test\-time compute optimally can be more effective than scaling parameters for reasoning\.InThe Thirteenth International Conference on Learning Representations,Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- L\. Strobl, W\. Merrill, G\. Weiss, D\. Chiang, and D\. Angluin \(2024\)What formal languages can transformers express? A survey\.Trans\. Assoc\. Comput\. Linguistics12,pp\. 543–561\.Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- E\. Strubell, A\. Ganesh, and A\. McCallum \(2019\)Energy and policy considerations for deep learning in NLP\.InProceedings of the 57th Conference of the Association for Computational Linguistics, ACL 2019, Florence, Italy, July 28\- August 2, 2019, Volume 1: Long Papers,pp\. 3645–3650\.Cited by:[§AF\.3](https://arxiv.org/html/2606.00376#A32.SS3.p1.1)\.
- J\. Su, J\. Healey, P\. Nakov, and C\. Cardie \(2025\)Between underthinking and overthinking: an empirical study of reasoning length and correctness in llms\.arXiv preprintarXiv\.2505\.00127\.Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1)\.
- Y\. Sui, Y\. Chuang, G\. Wang, J\. Zhang, T\. Zhang, J\. Yuan, H\. Liu, A\. Wen, S\. Zhong, N\. Zou, H\. Chen, and X\. Hu \(2025\)Stop overthinking: A survey on efficient reasoning for large language models\.Trans\. Mach\. Learn\. Res\.2025\.Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p2.1),[§1](https://arxiv.org/html/2606.00376#S1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1)\.
- N\. Tishby and N\. Zaslavsky \(2015\)Deep learning and the information bottleneck principle\.In2015 IEEE Information Theory Workshop, ITW 2015, Jerusalem, Israel, April 26 \- May 1, 2015,pp\. 1–5\.Cited by:[§A\.7](https://arxiv.org/html/2606.00376#A1.SS7.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- K\. Valmeekam, M\. Marquez, S\. Sreedharan, and S\. Kambhampati \(2023\)On the planning abilities of large language models \- A critical investigation\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p2.3)\.
- X\. Wang, J\. Wei, D\. Schuurmans, Q\. V\. Le, E\. H\. Chi, S\. Narang, A\. Chowdhery, and D\. Zhou \(2023\)Self\-consistency improves chain of thought reasoning in language models\.InThe Eleventh International Conference on Learning Representations,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- Y\. Wang, Q\. Liu, J\. Xu, T\. Liang, X\. Chen, Z\. He, L\. Song, D\. Yu, J\. Li, Z\. Zhang, R\. Wang, Z\. Tu, H\. Mi, and D\. Yu \(2026\)Thoughts are all over the place: on the underthinking of long reasoning models\.InThe Thirty\-ninth Annual Conference on Neural Information Processing Systems,Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1)\.
- J\. Wei, X\. Wang, D\. Schuurmans, M\. Bosma, B\. Ichter, F\. Xia, E\. H\. Chi, Q\. V\. Le, and D\. Zhou \(2022\)Chain\-of\-thought prompting elicits reasoning in large language models\.InAdvances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 \- December 9, 2022,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.14.14.2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- M\. Wortsman, J\. Lee, J\. Gilmer, and S\. Kornblith \(2023\)Replacing softmax with relu in vision transformers\.arXiv preprintarXiv\.2309\.08586\.Cited by:[§R\.2](https://arxiv.org/html/2606.00376#A18.SS2.3.p3.4)\.
- Y\. Wu, Z\. Sun, S\. Li, S\. Welleck, and Y\. Yang \(2025\)Inference scaling laws: an empirical analysis of compute\-optimal inference for LLM problem\-solving\.InThe Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24\-28, 2025,Cited by:[§1](https://arxiv.org/html/2606.00376#S1.p1.1)\.
- Y\. Wu, Y\. Wang, Z\. Ye, T\. Du, S\. Jegelka, and Y\. Wang \(2026\)When more is less: understanding chain\-of\-thought length in LLMs\.InThe Fourteenth International Conference on Learning Representations,Cited by:[§A\.2](https://arxiv.org/html/2606.00376#A1.SS2.p1.1),[Table 9](https://arxiv.org/html/2606.00376#A1.T9.4.2.2),[§1](https://arxiv.org/html/2606.00376#S1.SS0.SSS0.Px3.p1.1),[Table 1](https://arxiv.org/html/2606.00376#S1.T1),[Table 1](https://arxiv.org/html/2606.00376#S1.T1.2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px2.p1.1),[§3\.1](https://arxiv.org/html/2606.00376#S3.SS1.p1.3),[§4\.1](https://arxiv.org/html/2606.00376#S4.SS1.p1.1),[§4\.3](https://arxiv.org/html/2606.00376#S4.SS3.p1.1),[§7](https://arxiv.org/html/2606.00376#S7.SS0.SSS0.Px1.p1.2)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1),[§4\.1](https://arxiv.org/html/2606.00376#S4.SS1.SSS0.Px1.p1.5)\.
- S\. Yao, D\. Yu, J\. Zhao, I\. Shafran, T\. Griffiths, Y\. Cao, and K\. Narasimhan \(2023\)Tree of thoughts: deliberate problem solving with large language models\.InAdvances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 \- 16, 2023,Cited by:[§A\.1](https://arxiv.org/html/2606.00376#A1.SS1.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- S\. Yao, J\. Zhao, D\. Yu, I\. Shafran, K\. R\. Narasimhan, and Y\. Cao \(2022\)ReAct: synergizing reasoning and acting in language models\.InNeurIPS 2022 Foundation Models for Decision Making Workshop,Cited by:[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px1.p1.1)\.
- T\. Yu, R\. Zhang, K\. Yang, M\. Yasunaga, D\. Wang, Z\. Li, J\. Ma, I\. Li, Q\. Yao, S\. Roman, Z\. Zhang, and D\. R\. Radev \(2018\)Spider: A large\-scale human\-labeled dataset for complex and cross\-domain semantic parsing and text\-to\-sql task\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Brussels, Belgium, October 31 \- November 4, 2018,pp\. 3911–3921\.Cited by:[§AB\.2\.3](https://arxiv.org/html/2606.00376#A28.SS2.SSS3.p1.1),[§E\.3](https://arxiv.org/html/2606.00376#A5.SS3.p1.1),[3rd item](https://arxiv.org/html/2606.00376#S5.I2.i3.p1.1)\.
- C\. Yun, S\. Bhojanapalli, A\. S\. Rawat, S\. J\. Reddi, and S\. Kumar \(2020\)Are transformers universal approximators of sequence\-to\-sequence functions?\.In8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26\-30, 2020,Cited by:[§A\.3](https://arxiv.org/html/2606.00376#A1.SS3.p2.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px4.p1.1)\.
- Z\. Zhang, Y\. Wang, X\. Huang, T\. Fang, H\. Zhang, C\. Deng, S\. Li, and D\. Yu \(2025\)Attention entropy is a key factor: an analysis of parallel context encoding with full\-attention\-based pre\-trained language models\.InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\), ACL 2025, Vienna, Austria, July 27 \- August 1, 2025,pp\. 9840–9855\.Cited by:[§A\.4](https://arxiv.org/html/2606.00376#A1.SS4.p1.1),[§2](https://arxiv.org/html/2606.00376#S2.SS0.SSS0.Px3.p1.1)\.
- S\. Zhou, F\. F\. Xu, H\. Zhu, X\. Zhou, R\. Lo, A\. Sridhar, X\. Cheng, T\. Ou, Y\. Bisk, D\. Fried, U\. Alon, and G\. Neubig \(2024\)WebArena: A realistic web environment for building autonomous agents\.InThe Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7\-11, 2024,Cited by:[§AB\.2\.2](https://arxiv.org/html/2606.00376#A28.SS2.SSS2.p1.1),[§E\.2](https://arxiv.org/html/2606.00376#A5.SS2.p1.1),[2nd item](https://arxiv.org/html/2606.00376#S5.I2.i2.p1.1)\.
## Appendix AExtended Related Work
This appendix provides detailed positioning relative to concurrent and prior work, organized by research theme\. Table[9](https://arxiv.org/html/2606.00376#A1.T9)summarizes coverage across key dimensions, while Table[10](https://arxiv.org/html/2606.00376#A1.T10)contrasts our theoretical framework with the primary competing explanation\.
Table 9:Related work positioning\.✓\\checkmarkindicates addressed aspects\.Table 10:Comparison of theoretical frameworks for reasoning failures\.### A\.1Chain\-of\-Thought Reasoning Foundations
Chain\-of\-thought prompting was established byWeiet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib1)\), demonstrating that prompting models to “think step by step” dramatically improves reasoning on arithmetic, commonsense, and symbolic tasks\. This was extended by zero\-shot CoT\(Kojimaet al\.,[2022](https://arxiv.org/html/2606.00376#bib.bib58)\), self\-consistency\(Wanget al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib72)\), Tree\-of\-Thoughts\(Yaoet al\.,[2023](https://arxiv.org/html/2606.00376#bib.bib59)\), and Graph\-of\-Thoughts\(Bestaet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib60)\)\.Goodmanet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib61)\)showed bootstrapping through STaR training\.Nyeet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib62)\)introduced scratchpads for intermediate computation\.
Theoretical foundations were established byLiuet al\.\([2024b](https://arxiv.org/html/2606.00376#bib.bib14)\), proving CoT expands expressivity fromTC0\\text\{TC\}^\{0\}to polynomial\-time problems with sufficient steps\.Merrill and Sabharwal \([2024](https://arxiv.org/html/2606.00376#bib.bib75)\)further characterized CoT length requirements: logarithmic steps provide marginal benefit, while linear steps enable recognizing all regular languages\.Fenget al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib20)\)characterized attention pattern expressiveness\.Sanfordet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib19)\)analyzed single\-layer capacity\.
Our positioning:We extend this line of work by identifying the*failure regime*, where additional CoT steps degrade rather than improve performance\. Our Decoherence Bound \(Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\) complements Li et al\.’s expressivity results by characterizing*when*the theoretical capacity cannot be reliably accessed due to accumulated errors in the attention mechanism\.
### A\.2Overthinking and Underthinking Literature
The phenomenon of excessive reasoning causing performance degradation has been documented by several concurrent works\.Wuet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib2)\)provide the most directly competitive work, documenting inverted\-U performance curves attributed to “Simplicity Bias\.” Their theoretical framework derives optimal CoT length as function of task difficulty and model capability\.Chenet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib3)\)examine overthinking in o1\-like models, showing excessive reasoning on simple tasks\.Wanget al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib63)\)identify “underthinking”, premature abandonment of reasoning paths, with incorrect responses using 225% more tokens and 418% more thought switches on AIME2024\.
Suiet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib4)\)provide a broad survey categorizing efficient reasoning methods across model\-based, prompt\-based, and output\-based approaches\.Marjanovicet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib74)\)examine DeepSeek\-R1’s “sweet spot” where excessive inference impairs performance through “persistent rumination\.”Suet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib5)\)analyze the overthinking\-underthinking spectrum\.
Our positioning:We provide an*architectural*explanation complementing these behavioral observations\. The key distinction is captured in Table[10](https://arxiv.org/html/2606.00376#A1.T10): Wu et al\. predict fine\-tuning can recover performance \(\>\>30%\), while we predict an architectural ceiling \(<<5%\)\. Our experimental results \(Section[5\.3](https://arxiv.org/html/2606.00376#S5.SS3)\) validate the architectural prediction with observed 3\.2% recovery, providing key discrimination between preference\-based and capability\-based explanations\.
### A\.3Transformer Expressivity and Theoretical Limits
Our theoretical framework builds on expressivity bounds from the formal methods community\.Merrill and Sabharwal \([2024](https://arxiv.org/html/2606.00376#bib.bib75)\)establish that transformers withTTCoT steps solve problems in circuits of sizeTT\.Merrillet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib15)\)demonstrate SSMs \(including Mamba\) shareTC0\\text\{TC\}^\{0\}bounds despite their recurrent formulation: the “state” in SSMs is an illusion\.Bavandpouret al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib65)\)provide tight CoT step lower bounds for algorithmic problems, explicitly suggesting tool use as solution to efficiency limitations\.
Pérezet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib67)\)prove conditional Turing completeness\.Yunet al\.\([2020](https://arxiv.org/html/2606.00376#bib.bib18)\)establish universal approximation\.Hahn \([2020](https://arxiv.org/html/2606.00376#bib.bib66)\)identify formal language limitations\.Delétanget al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib17)\)study Chomsky hierarchy relations\.Penget al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib16)\)use communication complexity to prove transformers cannot compose functions when domains are large: even tractable problems like 2\-SAT become provably impossible\.Bhattamishraet al\.\([2020](https://arxiv.org/html/2606.00376#bib.bib21)\)analyze attention mechanism power\.Stroblet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib22)\)provide formal language perspective\.
Our positioning:We translate these theoretical bounds into practical thresholds\. The Deterministic Horizon \(Theorem[4\.8](https://arxiv.org/html/2606.00376#S4.Thmtheorem8)\) provides the first principled formula connecting architecture parameters \(HH,LL,dhd\_\{h\}\) to reasoning depth limits, validated via ablations \(Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\)\. While prior work establishes*what*transformers cannot do asymptotically, we characterize*where*the boundary lies for specific architectures and tasks\.
### A\.4Working Memory and Attention Entropy
The mechanistic basis of our work draws heavily on attention\-based working memory research\.Gong and Zhang \([2024](https://arxiv.org/html/2606.00376#bib.bib6)\)provide direct mechanistic evidence for attention\-based working memory limits, demonstrating that transformer performance on N\-back tasks degrades as N increases, with attention entropy correlating negatively with accuracy\.Barberoet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib7)\)prove that distinct input sequences can yield arbitrarily close representations in the final token due to unidirectional causal masking: information from early tokens exponentially loses influence\.Zhanget al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib10)\)extend entropy analysis to parallel context encoding, showing high entropy correlates strongly with performance degradation\.
Liuet al\.\([2024a](https://arxiv.org/html/2606.00376#bib.bib8)\)document the “lost in the middle” phenomenon\.Xiaoet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib9)\)show attention sinks concentrate attention on initial tokens\.Gerasimovet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib11)\)identify representation collapse in deeper layers where different tokens become indistinguishable\.Levyet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib12)\)show length\-dependent degradation\.Olssonet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib64)\)discover induction heads\.Elhageet al\.\([2021](https://arxiv.org/html/2606.00376#bib.bib76)\)provide the mechanistic interpretability circuit framework\.Biettiet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib13)\)analyze memory from birth\.
Our positioning:We synthesize these mechanisms into the unified Attention Bottleneck Theorem \(Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\), providing the first information\-theoretic bound connecting attention mechanics to state\-tracking capacity\. Our direct entropy measurements \(Appendix[G](https://arxiv.org/html/2606.00376#A7)\) validate the predicted correlation \(r=−0\.74r=\-0\.74\) between attention entropy and reasoning accuracy across 12 models\.
### A\.5Tool\-Augmented Reasoning
The empirical superiority of tool delegation has been established across multiple domains\.Gaoet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib23)\)introduced Program\-aided Language models \(PAL\), showing program\-aided reasoning outperforms pure CoT on arithmetic tasks\.Liet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib24)\)propose Chain\-of\-Code\.Gouet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib25)\)achieve state\-of\-the\-art through tool integration, with ToRA\-Code\-34B reaching 50\.8% on MATH versus GPT\-4’s CoT result of 42\.5%\.Panet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib26)\)demonstrate 39\.2% improvement over standard prompting by delegating inference to symbolic solvers\.Schicket al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib27)\)enable self\-supervised tool learning\.Parisiet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib28)\)introduce tool\-augmented LMs\.Luoet al\.\([2026](https://arxiv.org/html/2606.00376#bib.bib29)\)apply RL for tool\-augmented math, achieving SOTA on AIME24 \(90\.6%\)\.Chenet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib30)\)propose Program of Thoughts\.
Qinet al\.\([2025](https://arxiv.org/html/2606.00376#bib.bib31)\)survey tool learning with foundation models\.Mialonet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib32)\)survey augmented language models\.Patilet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib33)\)connect LLMs to massive APIs\.
Our positioning:We provide theoretical justification for*when*tool delegation becomes necessary \(not just beneficial\)\. The Deterministic Horizon identifies the threshold beyond which neural reasoning cannot succeed regardless of model scale or training\. This transforms tool use from a performance optimization to an architectural necessity for deterministic tasks exceedingd∗d^\{\*\}steps\.
### A\.6Compositional Reasoning Failures
Dziriet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib34)\)demonstrate transformers solve compositional tasks via linearized subgraph matching rather than systematic reasoning, with errors in early computation steps compounding catastrophically, directly supporting decoherence claims\.Pettyet al\.\([2024](https://arxiv.org/html/2606.00376#bib.bib35)\)show depth provides diminishing returns for compositional generalization\.Presset al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib36)\)introduce compositionality metrics\.
Lake and Baroni \([2018](https://arxiv.org/html/2606.00376#bib.bib37)\)establish compositional generalization benchmarks\.Keyserset al\.\([2020](https://arxiv.org/html/2606.00376#bib.bib38)\)introduce SCAN and CFQ\.Kim and Linzen \([2020](https://arxiv.org/html/2606.00376#bib.bib39)\)provide COGS benchmark\.Ontañónet al\.\([2022](https://arxiv.org/html/2606.00376#bib.bib40)\)study improvement methods\.
Our positioning:The compositional failure literature documents*symptoms*; we provide*diagnosis*\. State\-Space Decoherence explains why errors compound: the attention bottleneck cannot maintain sufficient mutual information between early reasoning steps and late outputs, causing systematic state drift that manifests as compositional failures\.
### A\.7Information Theory in Deep Learning
Tishby and Zaslavsky \([2015](https://arxiv.org/html/2606.00376#bib.bib41)\)established the Information Bottleneck framework\.Shwartz\-Ziv and Tishby \([2017](https://arxiv.org/html/2606.00376#bib.bib42)\)analyzed DNN information dynamics during training\.Lewandowsky and Bauch \([2024](https://arxiv.org/html/2606.00376#bib.bib68)\)studied infinite ensembles\.Deb and Ogunfunmi \([2025](https://arxiv.org/html/2606.00376#bib.bib43)\)connected transformers to information bottleneck principles\.
Our positioning:We instantiate information\-theoretic analysis for the specific case of autoregressive state tracking\. The Attention Bottleneck Theorem \(Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\) derives capacity bounds from first principles, providing the theoretical foundation that was missing from prior empirical observations of reasoning failures\.
## Appendix BTheoretical Proofs
### B\.1Proof of Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\(Decoherence Bound\)
###### Proof\.
LetXiX\_\{i\}be the indicator that stepiiis correct\. Under conditional independence given correct history:
P\(all correct\)\\displaystyle P\(\\text\{all correct\}\)=∏i=1mP\(Xi=1\|X1=⋯=Xi−1=1\)\\displaystyle=\\prod\_\{i=1\}^\{m\}P\(X\_\{i\}=1\|X\_\{1\}=\\cdots=X\_\{i\-1\}=1\)\(9\)=∏i=1m\(1−ϵ\(i\)\)\\displaystyle=\\prod\_\{i=1\}^\{m\}\(1\-\\epsilon\(i\)\)\(10\)
Using1−x≤e−x1\-x\\leq e^\{\-x\}forx≥0x\\geq 0\(Lemma[R\.2](https://arxiv.org/html/2606.00376#A18.Thmtheorem2)\):
P\(all correct\)\\displaystyle P\(\\text\{all correct\}\)≤∏i=1me−ϵ\(i\)=exp\(−∑i=1mϵ\(i\)\)\\displaystyle\\leq\\prod\_\{i=1\}^\{m\}e^\{\-\\epsilon\(i\)\}=\\exp\\left\(\-\\sum\_\{i=1\}^\{m\}\\epsilon\(i\)\\right\)\(11\)
Substitutingϵ\(i\)=ϵ0\+γi/Leff\\epsilon\(i\)=\\epsilon\_\{0\}\+\\gamma i/L\_\{\\mathrm\{eff\}\}:
∑i=1mϵ\(i\)\\displaystyle\\sum\_\{i=1\}^\{m\}\\epsilon\(i\)=∑i=1m\(ϵ0\+γiLeff\)\\displaystyle=\\sum\_\{i=1\}^\{m\}\\left\(\\epsilon\_\{0\}\+\\frac\{\\gamma i\}\{L\_\{\\mathrm\{eff\}\}\}\\right\)\(12\)=mϵ0\+γLeff⋅m\(m\+1\)2\\displaystyle=m\\epsilon\_\{0\}\+\\frac\{\\gamma\}\{L\_\{\\mathrm\{eff\}\}\}\\cdot\\frac\{m\(m\+1\)\}\{2\}\(13\)
Therefore:
P\(all correct\)≤exp\(−mϵ0−γm\(m\+1\)2Leff\)P\(\\text\{all correct\}\)\\leq\\exp\\left\(\-m\\epsilon\_\{0\}\-\\frac\{\\gamma m\(m\+1\)\}\{2L\_\{\\mathrm\{eff\}\}\}\\right\)\(14\)
The quadratic termm\(m\+1\)/2∼m2/2m\(m\+1\)/2\\sim m^\{2\}/2dominates for largemm, yielding super\-exponential decay\. ∎
### B\.2Proof of Corollary[4\.3](https://arxiv.org/html/2606.00376#S4.Thmtheorem3)\(Absorbing\-Error Tightness\)
###### Proof\.
Under State\-Space Decoherence a single error causes irrecoverable divergence, so the trace is correct at depthmmif and only if no error occurs in steps1,…,m1,\\dots,m\. LetTTbe the index of the first error\. By the chain rule for the survival function of a discrete random variable,
P\(correct at depthm\)=P\(T\>m\)=∏i=1m\(1−h\(i\)\),P\(\\text\{correct at depth \}m\)=P\(T\>m\)=\\prod\_\{i=1\}^\{m\}\\big\(1\-h\(i\)\\big\),\(15\)whereh\(i\)=P\(T=i∣T\>i−1\)h\(i\)=P\(T=i\\mid T\>i\-1\)is the discrete hazard\. Identifying the hazard with the context\-dependent error rate,h\(i\)=ϵ\(i\)h\(i\)=\\epsilon\(i\), reproduces the product of Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)as an exact identity; the exponential bound then follows from1−x≤e−x1\-x\\leq e^\{\-x\}alone, with no independence assumption beyond the Markov \(memoryless\-given\-survival\) structure\.
It remains to characterize the autocorrelation of the error\-indicator processYi=𝟏\[T=i\]Y\_\{i\}=\\mathbf\{1\}\[T=i\]conditioned on survival\. Because an error at stepi−1i\-1removes the trace from the at\-risk set, surviving traces exhibit positive lag\-1 autocorrelation in the realized error sequence: conditional on reaching stepii, the probability of error is elevated relative to the marginal baselineϵ0\\epsilon\_\{0\}\. We therefore reportρ1\\rho\_\{1\}as a measured diagnostic rather than a correction term; see Appendix[AC](https://arxiv.org/html/2606.00376#A29)\. A positiveρ1\\rho\_\{1\}would, if anything, only reinforce the absorbing\-error picture; it cannot be used to scale the cumulative error by a\(1\+ρ1\)\(1\+\\rho\_\{1\}\)factor, since such a factor would*increase*, not decrease, the survival probability and therefore cannot tighten an upper bound onP\(correct\)P\(\\text\{correct\}\)\. ∎
### B\.3Proof of Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\(Attention Bottleneck\)
###### Proof\.
We derive an information\-theoretic upper bound on state\-tracking capacity\.
Step 1: Attention mass distribution\.Each attention headh∈\[H\]h\\in\[H\]produces probability distribution overLLcontext positions via softmax\. For effective retrieval, we require attention weight≥δ/L\\geq\\delta/L\. By pigeonhole, each head can assign weight≥δ/L\\geq\\delta/Lto at mostL/δL/\\deltapositions, but useful retrieval further constrains this toO\(L\)O\(\\sqrt\{L\}\)positions due to interference\.
Step 2: Information per head\.Information transmitted through one attention head is bounded by mutual informationI\(query;retrieved value\)I\(\\text\{query\};\\text\{retrieved value\}\)\. By Lemma[R\.5](https://arxiv.org/html/2606.00376#A18.Thmtheorem5), this decomposes into thelog2\(L/H\)\\log\_\{2\}\(L/H\)bits needed to address the head’s share of∼L/H\\sim L/Hpositions and an effective value rank modeled asΘ\(dh1/2\)\\Theta\(d\_\{h\}^\{1/2\}\), giving:
I≲log2\(L/H\)⋅dh1/2bits per head\.I\\lesssim\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\\text\{ bits per head\.\}\(16\)
Step 3: Aggregation across heads\.WithHHheads operating in parallel with approximately independent value subspaces:
Itotal≤H⋅log2\(L/H\)⋅dh1/2I\_\{\\text\{total\}\}\\leq H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\(17\)
Step 4: State\-tracking capacity\.To distinguish\|𝒮\|\|\\mathcal\{S\}\|states requires at leastlog2\|𝒮\|\\log\_\{2\}\|\\mathcal\{S\}\|bits:
log2\|𝒮track\|≤H⋅log2\(L/H\)⋅dh1/2\\log\_\{2\}\|\\mathcal\{S\}\_\{\\text\{track\}\}\|\\leq H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\(18\)
Exponentiating and including constantc\(δ,ρmax\)c\(\\delta,\\rho\_\{\\max\}\)for approximations yields the theorem\. ∎
### B\.4Proof of Theorem[4\.7](https://arxiv.org/html/2606.00376#S4.Thmtheorem7)\(Lower\-Bound Construction\)
###### Proof\.
We construct a family of state\-tracking tasks achieving the bound\.
Consider sparse parity functions overnnbits with sparsitykk\. A transformer can track states corresponding to all\(nk\)\\binom\{n\}\{k\}possiblekk\-sparse parities using the following construction:
Construction\.Encode each parity subset in a separate attention head’s query\-key space\. WithHHheads and dimensiondhd\_\{h\}, we can representH⋅dh1/2H\\cdot d\_\{h\}^\{1/2\}independent directions\. Each direction can tracklog2\(L/H\)\\log\_\{2\}\(L/H\)bits of parity information over context lengthLL\.
Achievability\.The total number of distinguishable states is:
\|𝒮\|=2H⋅log2\(L/H\)⋅dh1/2\|\\mathcal\{S\}\|=2^\{H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\}\(19\)
This achieves the functional form of the upper bound; tightness inHHandLLfollows from the construction, while thedh1/2d\_\{h\}^\{1/2\}exponent matches by the shared effective\-rank ansatz \(Lemma[R\.5](https://arxiv.org/html/2606.00376#A18.Thmtheorem5)\) rather than by an independent derivation\.
The construction followsSanfordet al\.\([2023](https://arxiv.org/html/2606.00376#bib.bib19)\)’s analysis of sparse functions representable by transformers, extended to the multi\-head setting\. ∎
### B\.5Proof of Theorem[4\.8](https://arxiv.org/html/2606.00376#S4.Thmtheorem8)\(Deterministic Horizon\)
###### Proof\.
We solve ford∗d^\{\*\}such thatP\(correct at depthd∗\)=αP\(\\text\{correct at depth \}d^\{\*\}\)=\\alpha\.
From Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\(continuous approximationd\(d\+1\)≈d2d\(d\+1\)\\approx d^\{2\}\):
α=exp\(−d∗ϵ0−γ\(d∗\)22Leff\)\\alpha=\\exp\\left\(\-d^\{\*\}\\epsilon\_\{0\}\-\\frac\{\\gamma\(d^\{\*\}\)^\{2\}\}\{2L\_\{\\mathrm\{eff\}\}\}\\right\)\(20\)
Taking logarithms:
ln\(1/α\)=d∗ϵ0\+γ\(d∗\)22Leff\\ln\(1/\\alpha\)=d^\{\*\}\\epsilon\_\{0\}\+\\frac\{\\gamma\(d^\{\*\}\)^\{2\}\}\{2L\_\{\\mathrm\{eff\}\}\}\(21\)
This is quadratic ind∗d^\{\*\}\. Solving via quadratic formula:
d∗=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γd^\{\*\}=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(22\)
Evaluating for GPT\-4o \(ϵ0=0\.02\\epsilon\_\{0\}=0\.02,γ=0\.15\\gamma=0\.15,Leff=150L\_\{\\mathrm\{eff\}\}=150,α=0\.5\\alpha=0\.5\):
d∗=−3\+9\+31\.190\.15=−3\+6\.340\.15≈22\.3,d^\{\*\}=\\frac\{\-3\+\\sqrt\{9\+31\.19\}\}\{0\.15\}=\\frac\{\-3\+6\.34\}\{0\.15\}\\approx 22\.3,\(23\)matching the observedd∗=22d^\{\*\}=22for GPT\-4o on PermutationProbe\. ∎
### B\.6Proof of Theorem[4\.10](https://arxiv.org/html/2606.00376#S4.Thmtheorem10)\(Fine\-Tuning Upper Bound\)
###### Proof\.
The key insight is that fine\-tuning modifies the error distributionϵ\(⋅\)\\epsilon\(\\cdot\)but cannot exceed information\-theoretic capacity bounds\.
Letϵbase\(d\)\\epsilon\_\{\\text\{base\}\}\(d\)andϵft\(d\)\\epsilon\_\{\\text\{ft\}\}\(d\)denote per\-step error rates for baseline and fine\-tuned models\. Fine\-tuning can reduceϵ0\\epsilon\_\{0\}\(baseline error\) but cannot change the fundamental capacity bound from Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\.
For depthsd\>d∗d\>d^\{\*\}, the required state space exceeds capacity:
\|𝒮required\(d\)\|\>\|𝒮track\|\|\\mathcal\{S\}\_\{\\text\{required\}\}\(d\)\|\>\|\\mathcal\{S\}\_\{\\text\{track\}\}\|\(24\)
Therefore, even withϵft\(d\)<ϵbase\(d\)\\epsilon\_\{\\text\{ft\}\}\(d\)<\\epsilon\_\{\\text\{base\}\}\(d\), accuracy is bounded:
Accft\(d\)≤\|𝒮track\|\|𝒮required\(d\)\|⋅Accbase\(d∗\)\+O\(d∗d\)\\text\{Acc\}\_\{\\text\{ft\}\}\(d\)\\leq\\frac\{\|\\mathcal\{S\}\_\{\\text\{track\}\}\|\}\{\|\\mathcal\{S\}\_\{\\text\{required\}\}\(d\)\|\}\\cdot\\text\{Acc\}\_\{\\text\{base\}\}\(d^\{\*\}\)\+O\\left\(\\frac\{d^\{\*\}\}\{d\}\\right\)\(25\)
The improvementAccft−Accbase\\text\{Acc\}\_\{\\text\{ft\}\}\-\\text\{Acc\}\_\{\\text\{base\}\}is thus bounded byO\(d∗/d\)O\(d^\{\*\}/d\)\.
Empirical validation\.For the fine\-tuned Llama\-3\.1\-8B \(d∗=20d^\{\*\}=20\), the observed improvement decays from6\.2%6\.2\\%at depth 5 to0\.2%0\.2\\%at depth 45, remaining below theO\(d∗/d\)O\(d^\{\*\}/d\)ceiling at every depth \(Figure[4](https://arxiv.org/html/2606.00376#A31.F4)\)\. With the fitted leading constant, the ceiling at depth 40 is≈3\.3%\\approx 3\.3\\%versus an observed≈0\.3%\\approx 0\.3\\%, so fine\-tuning gains vanish well inside the bound asddgrows beyondd∗d^\{\*\}\. ∎
### B\.7Sensitivity Analysis for Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)
We validate assumptions empirically:
##### Precision Thresholdδ\\delta\.
We measure empirical attention weight distributions in Llama\-3\.3\-70B across 1,000 traces\. 95% of attention mass concentrates on positions receiving weight≥0\.01/L\\geq 0\.01/L\(δ≈0\.01\\delta\\approx 0\.01\)\.
##### Value Vector Correlation\.
Pairwise correlation of value vectors after layer normalization: mean\|ρij\|=0\.08±0\.03\|\\rho\_\{ij\}\|=0\.08\\pm 0\.03, supporting decorrelation assumption\.
##### Bound Tightness\.
Correlation between theoretical prediction and measured performance:r=0\.89r=0\.89\. Empirical onset at approximately 3% of theoretical capacity\.
Table 11:Sensitivity analysis: Impact of±\\pm20% variation in assumptions on bound\.
## Appendix CSSJ Extraction Algorithm
Algorithm 1State\-Space Jaccard Extraction0:Reasoning trace
rr, initial state
σ0\\sigma\_\{0\}, operators
𝒪\\mathcal\{O\}
0:SSJ, Precision, Recall
1:patterns
←\\leftarrow\[“state:
\\\\backslash\[\(\.\*?\)
\\\\backslash\]”, “current: \(\.\*?\)
\\\\backslashn”, …\]
2:
𝒮^raw←∅\\hat\{\\mathcal\{S\}\}\_\{\\text\{raw\}\}\\leftarrow\\emptyset
3:forpattern
ppin patternsdo
4:
𝒮^raw←𝒮^raw∪findall\(p,r\)\\hat\{\\mathcal\{S\}\}\_\{\\text\{raw\}\}\\leftarrow\\hat\{\\mathcal\{S\}\}\_\{\\text\{raw\}\}\\cup\\text\{findall\}\(p,r\)
5:endfor
6:
𝒮^model←\{canonicalize\(s\):s∈𝒮^raw\}\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\\leftarrow\\\{\\text\{canonicalize\}\(s\):s\\in\\hat\{\\mathcal\{S\}\}\_\{\\text\{raw\}\}\\\}
7:
𝒮true←BFS\(σ0,𝒪,depth=\|𝒮^model\|\)\\mathcal\{S\}\_\{\\text\{true\}\}\\leftarrow\\text\{BFS\}\(\\sigma\_\{0\},\\mathcal\{O\},\\text\{depth\}=\|\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\|\)
8:intersection
←\|𝒮^model∩𝒮true\|\\leftarrow\|\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\\cap\\mathcal\{S\}\_\{\\text\{true\}\}\|
9:SSJ
←\\leftarrowintersection /
\|𝒮^model∪𝒮true\|\|\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\\cup\\mathcal\{S\}\_\{\\text\{true\}\}\|
10:Precision
←\\leftarrowintersection /
\|𝒮^model\|\|\\hat\{\\mathcal\{S\}\}\_\{\\text\{model\}\}\|
11:Recall
←\\leftarrowintersection /
\|𝒮true\|\|\\mathcal\{S\}\_\{\\text\{true\}\}\|
12:returnSSJ, Precision, Recall
##### Validation\.
Two annotators independently extracted states from 200 traces\. Cohen’sκ=0\.89\\kappa=0\.89\(strong agreement\)\. Against manual ground truth: Precision =0\.94±0\.020\.94\\pm 0\.02, Recall =0\.91±0\.030\.91\\pm 0\.03\.
## Appendix DPer\-Model Results and Architecture Comparison
### D\.1Encoder\-Decoder Comparison
Table 12:Encoder\-decoder vs\. decoder\-only at depth 30\. Bidirectional attention provides 2\.8×\\timesadvantage\.The 2\.8×\\timesadvantage of encoder\-decoder architectures at depth 30 is consistent with Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)’s prediction that bidirectional attention providesO\(L\)O\(L\)capacity for full\-history access, compared toO\(logL\)O\(\\log L\)for causal attention in decoder\-only models\. This architectural comparison provides additional evidence that decoherence is fundamentally caused by causal masking constraints\.
### D\.2Full Model Results
Table 13:Complete accuracy \(%\) across all 12 models and 5 conditions on PermutationProbe\.
### D\.3Multi\-Task Results
Table 14:Results across all 8 task domains for GPT\-4o\.TaskC1C3d∗d^\{\*\}SSJ\-AccrrSyntheticPermutationProbe28\.389\.7220\.96FSA\-Sim31\.291\.2210\.94ArithChain26\.488\.3240\.93CircuitTrace29\.790\.1230\.95CodeProbe27\.889\.2190\.97Real\-WorldSWE\-Bench\-State24\.186\.4190\.92WebArena\-Nav21\.384\.2190\.89SQL\-Multi31\.488\.9210\.91
### D\.4Statistical Analysis
Table 15:Statistical tests for C1 vs C4 comparison \(preference manipulation\)\.All comparisons non\-significant with Bayes factors supporting null hypothesis \(BF\>014\{\}\_\{01\}\>4\)\. TOST confirms equivalence withinΔ=5%\\Delta=5\\%margin\.
## Appendix EReal\-World Task Details
### E\.1SWE\-Bench\-State
We extracted 500 instances from SWE\-Bench\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib52)\)requiring explicit state tracking across multiple files\. Criteria:
- •Bug fix requires tracking≥\\geq3 variables across≥\\geq2 files
- •Ground truth execution trace available
- •State changes are deterministic \(no randomness\)
##### Example\.
Bug in Django ORM requiring tracking of QuerySet state through filter chains, annotation, and aggregation across models\.py, views\.py, and tests\.py\.
### E\.2WebArena\-Nav
We extracted 500 instances from WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib53)\)involving multi\-step navigation with session state\. Criteria:
- •Task requires≥\\geq5 navigation steps
- •Session state \(cart, form data, authentication\) must be tracked
- •Deterministic success criteria
##### Example\.
E\-commerce checkout requiring: login→\\rightarrowadd items→\\rightarrowapply coupon→\\rightarrowselect shipping→\\rightarrowconfirm payment\. Each step modifies session state\.
### E\.3SQL\-Multi
We created 500 instances requiring 3\+ table joins with schema tracking\. Generated using templates from Spider\(Yuet al\.,[2018](https://arxiv.org/html/2606.00376#bib.bib54)\)with increased complexity\.
##### Example\.
Query requiring: join customers→\\rightarroworders→\\rightarroworder\_items→\\rightarrowproducts→\\rightarrowcategories with filtering and aggregation at each level\.
## Appendix FFine\-Tuning Experiment Details
### F\.1Training Procedure
##### Data Generation\.
We generated 5,000 optimal\-length CoT traces by running BFS and recording the solution path with intermediate states\. Format:
```
Problem: initial=[3,1,4,2,5], target=[1,2,3,4,5]
Step 1: swap(0,1) -> [1,3,4,2,5]
Step 2: swap(1,2) -> [1,4,3,2,5]
...
Solution: [1,2,3,4,5] (correct)
```
##### Training Configuration\.
- •Model: Llama\-3\.1\-8B\-Instruct
- •Learning rate:2×10−52\\times 10^\{\-5\}with cosine decay
- •Batch size: 8
- •Epochs: 3
- •Early stopping: patience 2 on validation loss
- •Hardware: 4×\\timesA100 80GB
### F\.2Results by Depth
Table 16:Fine\-tuning results by depth bin\.Improvement decays with depth as predicted by Theorem[4\.10](https://arxiv.org/html/2606.00376#S4.Thmtheorem10)\. At depth\>\>40, improvement is negligible despite training on optimal traces, confirming architectural ceiling\.
## Appendix GAttention Entropy Measurements
### G\.1Methodology
We extracted attention patterns from open\-weight models during reasoning:
1. 1\.Run model on 500 PermutationProbe instances
2. 2\.Extract attention matrices from all layers at each decoding step
3. 3\.Compute entropy:H=−∑ipilogpiH=\-\\sum\_\{i\}p\_\{i\}\\log p\_\{i\}for each head
4. 4\.Aggregate across heads and layers
5. 5\.Correlate with per\-step accuracy
### G\.2Results by Layer
Table 17:Entropy\-accuracy correlation by layer \(Llama\-3\.3\-70B\)\.Correlation strengthens in later layers, consistent with representational collapse mechanism\.
## Appendix HEncoder\-Decoder Experiment Details
### H\.1Fine\-tuning Procedure
For T5\-Large and BART\-Large:
- •Learning rate:3×10−53\\times 10^\{\-5\}with linear warmup
- •Batch size: 8
- •Epochs: 10
- •Early stopping: patience 3
- •Input: “Solve: initial=\[…\], target=\[…\], ops=\[…\]”
- •Output: “op1, op2, …, opN”
### H\.2Results by Depth
Table 18:Encoder\-decoder accuracy by depth bin\.Encoder\-decoder advantage grows with depth: 1\.3×\\timesat depth 10, 3\.4×\\timesat depth 40\.
## Appendix ICross\-Model Correlation Details
Table 19:Full cross\-model correlation matrix\.Highest correlation \(0\.91\) between Llama and Qwen reflects similar pretraining\. High correlations \(r\>0\.81r\>0\.81\) across organizations support architectural causation\.
## Appendix JRuntime Analysis
Table 20:Average runtime per evaluation across models and conditions\.##### Computation Summary\.
Total experiment time: approximately 420 GPU\-hours for open\-weight models plus 280 hours of API wall\-clock time\. Attention entropy extraction required an additional 48 GPU\-hours on 2×\\timesA100 80GB\.
## Appendix KFailure Mode Analysis
We randomly sampled 2,880 failed C1 traces for detailed failure mode analysis \(30 per model\-task combination, stratified by reasoning depth\)\. After excluding 33 traces due to parsing errors \(malformed output format\) or incomplete outputs \(truncated mid\-reasoning\), 2,847 traces remained for analysis\. These reveal systematic patterns: state transposition errors \(35%\), operator misapplication \(27%\), premature termination \(21%\), circular reasoning \(11%\), and complete hallucination \(6%\)\.
### K\.1Sampling Methodology
To characterize failure modes systematically, we employed stratified sampling of failed C1 \(unconstrained neural CoT\) traces\.
##### Sampling procedure\.
1. 1\.For each model\-task combination \(12 models×\\times8 tasks = 96 combinations\), we identified all failed instances \(accuracy<<100% on the reasoning chain\)\.
2. 2\.Within each combination, we stratified by reasoning depth bins:≤\\leq10, 11–20, 21–30, 31–40, 41–50,\>\>50\.
3. 3\.We sampled 30 failed traces per model\-task combination \(5 per depth bin where available\), yielding a target of 2,880 traces\.
4. 4\.Two annotators independently categorized each trace; disagreements were resolved by discussion\.
##### Exclusions\.
Of the 2,880 sampled traces, 33 \(1\.15%\) were excluded:
- •Parsing errors \(n=21n=21\):Model output did not conform to the expected “Step N: operator→\\rightarrowstate” format, preventing automated state extraction\.
- •Incomplete outputs \(n=12n=12\):Traces truncated mid\-reasoning due to maximum token limits \(8,192 tokens\) or API timeouts\.
The final analysis corpus comprises2,847 failed traces\.
##### Inter\-annotator agreement\.
Cohen’sκ=0\.87\\kappa=0\.87\(strong agreement\) for primary failure mode classification\.
### K\.2Failure Mode Distribution
Analysis of 2,847 failed traces reveals systematic patterns:
Table 21:Failure mode distribution across 2,847 analyzed traces\.The dominant failure mode, state transposition, directly reflects attention\-based working memory limits: models misremember element positions due to attention dilution across extended reasoning chains\.
### K\.3Example: State Transposition
```
ΨProblem: Transform [3,1,4,2,5] to [1,2,3,4,5]
ΨStep 1: swap(0,1) -> [1,3,4,2,5] (correct)
ΨStep 2: swap(1,2) -> [1,4,3,2,5] (ERROR)
Ψ(Should be swap(1,3) -> [1,2,4,3,5])
ΨStep 3: swap(2,3) -> [1,4,2,3,5] (propagated)
Ψ...
```
The model correctly identifies the need to move elements but misremembers positions after Step 1, causing cascading errors\.
### K\.4Example: Circular Reasoning
```
ΨSteps 15-22:
ΨStep 15: [2,1,3,5,4,6,8,7]
ΨStep 16: [2,1,5,3,4,6,8,7]
ΨStep 17: [2,1,3,5,4,6,8,7] <- Same as Step 15
ΨStep 18: [2,1,5,3,4,6,8,7] <- Same as Step 16
Ψ...
```
The model enters a loop, repeatedly visiting the same states without progress toward the target\.
### K\.5Failure Mode by Depth
Table 22:Failure mode distribution by reasoning depth\. State transposition dominates at shallow depths; circular reasoning and hallucination increase with depth\.The shift in failure mode distribution with depth is consistent with the decoherence hypothesis: at shallow depths, discrete errors \(transposition, misapplication\) dominate, while at greater depths, systemic failures \(circular reasoning, hallucination\) become more prevalent as accumulated attention entropy degrades state coherence\.
## Appendix LCost Analysis
Table 23:Cost\-per\-correct\-solution analysis\. “vs\. C3” is relative to same\-model tool delegation where available; o3\-mini and Best\-of\-10 are compared to the GPT\-4o tool baseline\.Total experimental cost: $3,420 \(480,000 of the 720,000 evaluations ran against the eight API\-accessed models; the four open\-weight models were run locally, accounting for the remaining 240,000\)\.
## Appendix MReproducibility Checklist
### M\.1Model Versions
All 12 models evaluated with their exact version identifiers:
##### General\-Purpose Models \(4\)\.
- •GPT\-4o:gpt\-4o\-2024\-11\-20
- •Claude\-4\.5\-Sonnet:claude\-sonnet\-4\-5\-20250929
- •Claude\-4\.5\-Opus:claude\-opus\-4\-5\-20251101
- •Gemini\-1\.5\-Pro:gemini\-1\.5\-pro\-002\(2024\-09\-24\)
##### Reasoning\-Specialized Models \(4\)\.
- •o3:o3\-2025\-01\-31
- •o3\-mini:o3\-mini\-2025\-01\-31
- •DeepSeek\-R1:deepseek\-reasoner\(DeepSeek\-R1\-0528\)
- •Gemini\-2\.0\-Flash\-Thinking:gemini\-2\.0\-flash\-thinking\-exp\-01\-21
##### Open\-Weight Models \(4\)\.
- •Llama\-3\.1\-8B:meta\-llama/Llama\-3\.1\-8B\-Instruct
- •Llama\-3\.3\-70B:meta\-llama/Llama\-3\.3\-70B\-Instruct
- •Qwen\-2\.5\-7B:Qwen/Qwen2\.5\-7B\-Instruct
- •Qwen\-2\.5\-72B:Qwen/Qwen2\.5\-72B\-Instruct
### M\.2Hyperparameters
- •Temperature: 0 \(deterministic\)
- •Max tokens: 8192
- •Random seeds \(3 runs\): \{42, 2024, 2025\}
- •Instance generation seed: 42
- •Bootstrap resamples: 10,000
### M\.3Hardware
- •API experiments: N/A \(cloud\)
- •Fine\-tuning: 4×\\timesNVIDIA A100 80GB
- •Attention extraction: 2×\\timesNVIDIA A100 80GB
## Appendix NInstance Generation
### N\.1PermutationProbe
Algorithm 2PermutationProbe Instance Generation0:Size
nn, depth range
\[dmin,dmax\]\[d\_\{\\min\},d\_\{\\max\}\]
1:for
i=1i=1to num\_instancesdo
2:
σ0←identity permutation of\[1\.\.n\]\\sigma\_\{0\}\\leftarrow\\text\{identity permutation of \}\[1\.\.n\]
3:
d←uniform\(dmin,dmax\)d\\leftarrow\\text\{uniform\}\(d\_\{\\min\},d\_\{\\max\}\)
4:
τ←σ0\\tau\\leftarrow\\sigma\_\{0\}
5:for
j=1j=1to
dddo
6:
\(a,b\)←random distinct indices\(a,b\)\\leftarrow\\text\{random distinct indices\}
7:
τ←swap\(τ,a,b\)\\tau\\leftarrow\\text\{swap\}\(\\tau,a,b\)
8:endfor
9:Verify BFS depth from
σ0\\sigma\_\{0\}to
τ\\tauequals
dd
10:yield
\(σ0,τ,d\)\(\\sigma\_\{0\},\\tau,d\)
11:endfor
### N\.2Task Distribution
Table 24:Instance distribution by BFS\-optimal depth\.
## Appendix OBroader Societal Impact
### O\.1Safety Implications
Our findings have direct implications for AI safety:
- •Autonomous systems: LLMs should not be trusted for multi\-step reasoning in safety\-critical domains without tool verification
- •Code generation: Extended code reasoning may introduce subtle bugs; tool\-based verification essential
- •Planning: Sequential planning requiring exact state tracking should default to formal methods
### O\.2Environmental Considerations
Extended neural reasoning without accuracy improvement represents wasted compute and carbon emissions\. Our cost analysis shows 4\.2–4\.7×\\timesefficiency gains from tool delegation, translating directly to environmental benefits\.
### O\.3Limitations
- •Our analysis focuses on deterministic state\-tracking; stochastic reasoning may differ
- •Real\-world validation limited to three domains
- •Theoretical bounds rely on assumptions validated empirically but not proven
- •Results may not generalize to all reasoning tasks
## Appendix PNotation and Symbol Reference
For reader convenience, we provide a complete reference of all notation used throughout the paper\.
### P\.1Primary Symbols
Table 25:Primary notation used in theoretical framework\.
### P\.2Metric Symbols
Table 26:Metrics and evaluation symbols\.
### P\.3Information\-Theoretic Symbols
Table 27:Information\-theoretic notation\.
## Appendix QSupplementary Materials Index
This appendix provides a complete index to all supplementary materials for navigation\.
Table 28:Index of supplementary materials\.
## Appendix RSupporting Lemmas
We establish several supporting lemmas that underpin the main theoretical results\. These lemmas provide the technical foundation for Theorems[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)–[4\.10](https://arxiv.org/html/2606.00376#S4.Thmtheorem10)\.
### R\.1Lemmas for Decoherence Bound \(Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)\)
###### Lemma R\.1\(Error Accumulation\)\.
Let\{Xi\}i=1m\\\{X\_\{i\}\\\}\_\{i=1\}^\{m\}be indicator variables for correct state tracking at stepii\. Under context\-dependent errorϵ\(i\)=ϵ0\+γi/Leff\\epsilon\(i\)=\\epsilon\_\{0\}\+\\gamma i/L\_\{\\mathrm\{eff\}\}, the cumulative error probability satisfies:
∑i=1mϵ\(i\)=mϵ0\+γm\(m\+1\)2Leff\\sum\_\{i=1\}^\{m\}\\epsilon\(i\)=m\\epsilon\_\{0\}\+\\frac\{\\gamma m\(m\+1\)\}\{2L\_\{\\mathrm\{eff\}\}\}\(26\)
###### Proof\.
Direct computation:
∑i=1mϵ\(i\)\\displaystyle\\sum\_\{i=1\}^\{m\}\\epsilon\(i\)=∑i=1m\(ϵ0\+γiLeff\)\\displaystyle=\\sum\_\{i=1\}^\{m\}\\left\(\\epsilon\_\{0\}\+\\frac\{\\gamma i\}\{L\_\{\\mathrm\{eff\}\}\}\\right\)\(27\)=mϵ0\+γLeff∑i=1mi\\displaystyle=m\\epsilon\_\{0\}\+\\frac\{\\gamma\}\{L\_\{\\mathrm\{eff\}\}\}\\sum\_\{i=1\}^\{m\}i\(28\)=mϵ0\+γLeff⋅m\(m\+1\)2\\displaystyle=m\\epsilon\_\{0\}\+\\frac\{\\gamma\}\{L\_\{\\mathrm\{eff\}\}\}\\cdot\\frac\{m\(m\+1\)\}\{2\}\(29\)∎
###### Lemma R\.2\(Product Bound\)\.
For any sequence of events\{Ai\}i=1m\\\{A\_\{i\}\\\}\_\{i=1\}^\{m\}withP\(Ai\|A1∩⋯∩Ai−1\)≥1−ϵiP\(A\_\{i\}\|A\_\{1\}\\cap\\cdots\\cap A\_\{i\-1\}\)\\geq 1\-\\epsilon\_\{i\}:
P\(⋂i=1mAi\)≤exp\(−∑i=1mϵi\)P\\left\(\\bigcap\_\{i=1\}^\{m\}A\_\{i\}\\right\)\\leq\\exp\\left\(\-\\sum\_\{i=1\}^\{m\}\\epsilon\_\{i\}\\right\)\(30\)
###### Proof\.
Using1−x≤e−x1\-x\\leq e^\{\-x\}forx≥0x\\geq 0and the chain rule:
P\(⋂i=1mAi\)\\displaystyle P\\left\(\\bigcap\_\{i=1\}^\{m\}A\_\{i\}\\right\)=∏i=1mP\(Ai\|A1∩⋯∩Ai−1\)\\displaystyle=\\prod\_\{i=1\}^\{m\}P\(A\_\{i\}\|A\_\{1\}\\cap\\cdots\\cap A\_\{i\-1\}\)\(31\)≤∏i=1m\(1−ϵi\)\\displaystyle\\leq\\prod\_\{i=1\}^\{m\}\(1\-\\epsilon\_\{i\}\)\(32\)≤∏i=1me−ϵi=exp\(−∑i=1mϵi\)\\displaystyle\\leq\\prod\_\{i=1\}^\{m\}e^\{\-\\epsilon\_\{i\}\}=\\exp\\left\(\-\\sum\_\{i=1\}^\{m\}\\epsilon\_\{i\}\\right\)\(33\)∎
###### Lemma R\.3\(Super\-Exponential Characterization\)\.
The functionf\(m\)=exp\(−am−bm2\)f\(m\)=\\exp\(\-am\-bm^\{2\}\)fora,b\>0a,b\>0decays super\-exponentially, satisfying:
limm→∞f\(m\)e−cm=0for allc\>0\\lim\_\{m\\to\\infty\}\\frac\{f\(m\)\}\{e^\{\-cm\}\}=0\\quad\\text\{for all \}c\>0\(34\)
###### Proof\.
f\(m\)e−cm=exp\(−am−bm2\+cm\)=exp\(−\(a−c\)m−bm2\)\\frac\{f\(m\)\}\{e^\{\-cm\}\}=\\exp\(\-am\-bm^\{2\}\+cm\)=\\exp\(\-\(a\-c\)m\-bm^\{2\}\)\(35\)Sincebm2→∞bm^\{2\}\\to\\inftyasm→∞m\\to\\infty, the ratio approaches 0 regardless ofcc\. ∎
### R\.2Lemmas for Attention Bottleneck \(Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\)
###### Lemma R\.4\(Attention Concentration\)\.
Under softmax attention with temperatureτ=1\\tau=1, the number of positions receiving weight≥δ/L\\geq\\delta/Lis at mostO\(L/δ\)O\(L/\\delta\)\. With interference fromHHheads, effective retrieval is further constrained toO\(L\)O\(\\sqrt\{L\}\)positions\.
###### Proof\.
Let𝐚=softmax\(𝐪⊤𝐊/dh\)\\mathbf\{a\}=\\text\{softmax\}\(\\mathbf\{q\}^\{\\top\}\\mathbf\{K\}/\\sqrt\{d\_\{h\}\}\)be the attention distribution overLLpositions\.
Part 1: Basic constraint\.Since∑i=1Lai=1\\sum\_\{i=1\}^\{L\}a\_\{i\}=1andai≥δ/La\_\{i\}\\geq\\delta/Lfor the “attended” positions, at mostL/δL/\\deltapositions can be attended\.
Part 2: Interference bound\.ConsiderHHheads attending to overlapping subsets\. By concentration inequalities for softmax \(following[Wortsmanet al\.](https://arxiv.org/html/2606.00376#bib.bib56)’s analysis\), the effective number of positions simultaneously attended with weight≥δ/L\\geq\\delta/Lacross all heads scales asO\(L⋅H\)O\(\\sqrt\{L\\cdot H\}\)\. For single\-head analysis, this reduces toO\(L\)O\(\\sqrt\{L\}\)\. ∎
###### Lemma R\.5\(Value Information Bound\)\.
Under Assumptions[4\.4](https://arxiv.org/html/2606.00376#S4.Thmtheorem4)–[4\.5](https://arxiv.org/html/2606.00376#S4.Thmtheorem5)with correlation boundρmax\\rho\_\{\\max\}, the information one attention head conveys about the retrieved value is bounded by
I\(𝐪;𝐯retrieved\)≲dh1/2⋅log2\(L/H\),I\(\\mathbf\{q\};\\mathbf\{v\}\_\{\\text\{retrieved\}\}\)\\lesssim d\_\{h\}^\{1/2\}\\cdot\\log\_\{2\}\(L/H\),\(36\)the product of an effective value rankΘ\(dh1/2\)\\Theta\(d\_\{h\}^\{1/2\}\)\(the number of reliably distinguishable value directions under decorrelation\) and thelog2\(L/H\)\\log\_\{2\}\(L/H\)bits required to address a head’s share of∼L/H\\sim L/Hpositions\.
###### Proof\.
The retrieved value is𝐯ret=∑i=1Lai𝐯i\\mathbf\{v\}\_\{\\text\{ret\}\}=\\sum\_\{i=1\}^\{L\}a\_\{i\}\\mathbf\{v\}\_\{i\}, where𝐚\\mathbf\{a\}is the attention distribution\. Its information content factors into an addressing term and a value\-content term\.
*Addressing\.*By Lemma[R\.4](https://arxiv.org/html/2606.00376#A18.Thmtheorem4), effective attention concentrates onO\(L\)O\(\\sqrt\{L\}\)positions; partitioned across theHHheads, each head resolves among∼L/H\\sim L/Hpositions, so specifying which content a head reads carries at mostlog2\(L/H\)\\log\_\{2\}\(L/H\)bits\.
*Value content\.*Under Assumption[4\.5](https://arxiv.org/html/2606.00376#S4.Thmtheorem5), value vectors satisfy\|𝔼\[𝐯i⊤𝐯j\]\|≤ρmax‖𝐯i‖‖𝐯j‖\|\\mathbb\{E\}\[\\mathbf\{v\}\_\{i\}^\{\\top\}\\mathbf\{v\}\_\{j\}\]\|\\leq\\rho\_\{\\max\}\\\|\\mathbf\{v\}\_\{i\}\\\|\\\|\\mathbf\{v\}\_\{j\}\\\|fori≠ji\\neq j\. Soft averaging over interfering positions reduces the number of reliably distinguishable value directions below the nominaldhd\_\{h\}; we model this effective rank asΘ\(dh1/2\)\\Theta\(d\_\{h\}^\{1/2\}\)\. Thisdh\\sqrt\{d\_\{h\}\}scaling is a modeling assumption rather than a consequence of the decorrelation bound alone: we adopt it as a tractable summary and validate the resultingdh⋅H\\sqrt\{d\_\{h\}\\cdot H\}horizon scaling directly on open\-weight models in Section[6\.1](https://arxiv.org/html/2606.00376#S6.SS1)\.
Combining theΘ\(dh1/2\)\\Theta\(d\_\{h\}^\{1/2\}\)distinguishable value channels, each addressing one of∼L/H\\sim L/Hpositions, givesI≲dh1/2⋅log2\(L/H\)I\\lesssim d\_\{h\}^\{1/2\}\\cdot\\log\_\{2\}\(L/H\)\. ∎
###### Lemma R\.6\(Multi\-Head Aggregation\)\.
ForHHattention heads with approximately independent value subspaces, the total information capacity satisfies:
Itotal≤H⋅Isingle\-head=H⋅log2\(L/H\)⋅dh1/2I\_\{\\text\{total\}\}\\leq H\\cdot I\_\{\\text\{single\-head\}\}=H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\(37\)
###### Proof\.
Multi\-head attention computes:
MHA\(𝐪,𝐊,𝐕\)=Concat\(head1,…,headH\)WO\\text\{MHA\}\(\\mathbf\{q\},\\mathbf\{K\},\\mathbf\{V\}\)=\\text\{Concat\}\(\\text\{head\}\_\{1\},\\ldots,\\text\{head\}\_\{H\}\)W^\{O\}\(38\)
Under independence of value subspaces \(ensured by distinctWhVW^\{V\}\_\{h\}projections and layer normalization\), information from each head can be summed:
Itotal=∑h=1HIh≤H⋅maxhIhI\_\{\\text\{total\}\}=\\sum\_\{h=1\}^\{H\}I\_\{h\}\\leq H\\cdot\\max\_\{h\}I\_\{h\}\(39\)
Substituting the single\-head bound from Lemma[R\.5](https://arxiv.org/html/2606.00376#A18.Thmtheorem5):
Itotal≤H⋅log2\(L/H\)⋅dh1/2I\_\{\\text\{total\}\}\\leq H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\(40\)
The factorlog2\(L/H\)\\log\_\{2\}\(L/H\)arises because each head effectively partitions attention acrossL/HL/Hpositions on average\. ∎
### R\.3Lemmas for Deterministic Horizon \(Theorem[4\.8](https://arxiv.org/html/2606.00376#S4.Thmtheorem8)\)
###### Lemma R\.7\(Quadratic Inversion\)\.
For the equationα=exp\(−dϵ0−γd2/\(2Leff\)\)\\alpha=\\exp\(\-d\\epsilon\_\{0\}\-\\gamma d^\{2\}/\(2L\_\{\\mathrm\{eff\}\}\)\), the solution forddis:
d=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γd=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(41\)
###### Proof\.
Takingln\\lnof both sides:
lnα=−dϵ0−γd22Leff\\ln\\alpha=\-d\\epsilon\_\{0\}\-\\frac\{\\gamma d^\{2\}\}\{2L\_\{\\mathrm\{eff\}\}\}\(42\)
Rearranging:
γd22Leff\+dϵ0\+lnα=0\\frac\{\\gamma d^\{2\}\}\{2L\_\{\\mathrm\{eff\}\}\}\+d\\epsilon\_\{0\}\+\\ln\\alpha=0\(43\)
Multiplying by2Leff/γ2L\_\{\\mathrm\{eff\}\}/\\gamma:
d2\+2ϵ0Leffγd\+2Lefflnαγ=0d^\{2\}\+\\frac\{2\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}d\+\\frac\{2L\_\{\\mathrm\{eff\}\}\\ln\\alpha\}\{\\gamma\}=0\(44\)
Applying the quadratic formula witha=1a=1,b=2ϵ0Leff/γb=2\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}/\\gamma,c=2Lefflnα/γc=2L\_\{\\mathrm\{eff\}\}\\ln\\alpha/\\gamma:
d\\displaystyle d=−b\+b2−4ac2a\\displaystyle=\\frac\{\-b\+\\sqrt\{b^\{2\}\-4ac\}\}\{2a\}\(45\)=−2ϵ0Leffγ\+4ϵ02Leff2γ2−8Lefflnαγ2\\displaystyle=\\frac\{\-\\frac\{2\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}\+\\sqrt\{\\frac\{4\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\}\{\\gamma^\{2\}\}\-\\frac\{8L\_\{\\mathrm\{eff\}\}\\ln\\alpha\}\{\\gamma\}\}\}\{2\}\(46\)=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γ\\displaystyle=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(47\)
We take the positive root sinced\>0d\>0\. ∎
###### Lemma R\.8\(Scaling Law Derivation\)\.
Under the approximationϵ02Leff≪γln\(1/α\)\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}\\ll\\gamma\\ln\(1/\\alpha\), the Deterministic Horizon satisfies:
d∗≈2Leffln\(1/α\)γ−ϵ0Leffγ⟹d∗∝Leffd^\{\*\}\\approx\\frac\{\\sqrt\{2L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\sqrt\{\\gamma\}\}\-\\frac\{\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}\\implies d^\{\*\}\\propto\\sqrt\{L\_\{\\mathrm\{eff\}\}\}\(48\)to leading order\. At realistic parameter values the linearϵ0\\epsilon\_\{0\}term is not negligible \(see remark below\), so this scaling is approximate\.
###### Proof\.
From Lemma[R\.7](https://arxiv.org/html/2606.00376#A18.Thmtheorem7):
d∗\\displaystyle d^\{\*\}=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γ\\displaystyle=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(49\)=2γLeffln\(1/α\)1\+ϵ02Leff2γln\(1/α\)−ϵ0Leffγ\\displaystyle=\\frac\{\\sqrt\{2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\\sqrt\{1\+\\frac\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}\}\{2\\gamma\\ln\(1/\\alpha\)\}\}\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}\(50\)
Underϵ02Leff≪γln\(1/α\)\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}\\ll\\gamma\\ln\(1/\\alpha\), using1\+x≈1\+x/2\\sqrt\{1\+x\}\\approx 1\+x/2for smallxx:
d∗\\displaystyle d^\{\*\}≈2γLeffln\(1/α\)−ϵ0Leffγ\\displaystyle\\approx\\frac\{\\sqrt\{2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}\(51\)=1γ\(2Leffln\(1/α\)⋅γ−ϵ0Leff\)\\displaystyle=\\frac\{1\}\{\\gamma\}\\left\(\\sqrt\{2L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\\cdot\\sqrt\{\\gamma\}\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\\right\)\(52\)=2Leffln\(1/α\)γ−ϵ0Leffγ\\displaystyle=\\sqrt\{\\frac\{2L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\{\\gamma\}\}\-\\frac\{\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\}\{\\gamma\}\(53\)
The dominant term scales asLeff\\sqrt\{L\_\{\\mathrm\{eff\}\}\}\.
Remark\.At the canonical values \(ϵ0=0\.02\\epsilon\_\{0\}=0\.02,γ=0\.15\\gamma=0\.15,Leff=150L\_\{\\mathrm\{eff\}\}=150,α=0\.5\\alpha=0\.5\), the dimensionless ratiox=ϵ02Leff/\(2γln\(1/α\)\)≈0\.29x=\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}/\(2\\gamma\\ln\(1/\\alpha\)\)\\approx 0\.29is not small, so the first\-order expansion carries a∼\\sim15% systematic correction\. We therefore treat theLeff\\sqrt\{L\_\{\\mathrm\{eff\}\}\}relationship as an approximate trend and validate the exact horizon \(Lemma[R\.7](https://arxiv.org/html/2606.00376#A18.Thmtheorem7)\) directly against measuredd∗d^\{\*\}rather than relying on the asymptotic form\. ∎
## Appendix SSSJ Algorithm and Extraction
This section provides complete algorithmic details for computing the State\-Space Jaccard \(SSJ\) metric introduced in Definition[3\.2](https://arxiv.org/html/2606.00376#S3.Thmtheorem2)\.
### S\.1Algorithm Description
The SSJ metric measures overlap between the ground\-truth reachable state set and the model’s claimed state set at each reasoning depth\. Algorithm[3](https://arxiv.org/html/2606.00376#alg3)provides the complete extraction procedure\.
Algorithm 3State\-Space Jaccard \(SSJ\) Extraction0:Initial state
σ0\\sigma\_\{0\}, operators
𝒪\\mathcal\{O\}, model trace
T=\[\(s1,o1\),…,\(sm,om\)\]T=\[\(s\_\{1\},o\_\{1\}\),\\ldots,\(s\_\{m\},o\_\{m\}\)\]
0:SSJ values, precision, recall at each depth
1:
Strue←\{σ0\}S\_\{\\text\{true\}\}\\leftarrow\\\{\\sigma\_\{0\}\\\}\{Ground\-truth reachable states\}
2:
Smodel←\{σ0\}S\_\{\\text\{model\}\}\\leftarrow\\\{\\sigma\_\{0\}\\\}\{Model\-claimed states\}
3:
σcurr←σ0\\sigma\_\{\\text\{curr\}\}\\leftarrow\\sigma\_\{0\}\{Current ground\-truth state\}
4:for
d=1d=1to
mmdo
5:
\(sd,od\)←T\[d\]\(s\_\{d\},o\_\{d\}\)\\leftarrow T\[d\]\{Model’s claimed state and operator\}
6:
σcurr←od\(σcurr\)\\sigma\_\{\\text\{curr\}\}\\leftarrow o\_\{d\}\(\\sigma\_\{\\text\{curr\}\}\)\{Apply operator to ground truth\}
7:
Strue←Strue∪\{σcurr\}S\_\{\\text\{true\}\}\\leftarrow S\_\{\\text\{true\}\}\\cup\\\{\\sigma\_\{\\text\{curr\}\}\\\}
8:
Smodel←Smodel∪\{sd\}S\_\{\\text\{model\}\}\\leftarrow S\_\{\\text\{model\}\}\\cup\\\{s\_\{d\}\\\}
9:
SSJ\[d\]←\|Strue∩Smodel\|/\|Strue∪Smodel\|\\text\{SSJ\}\[d\]\\leftarrow\|S\_\{\\text\{true\}\}\\cap S\_\{\\text\{model\}\}\|/\|S\_\{\\text\{true\}\}\\cup S\_\{\\text\{model\}\}\|
10:
Precision\[d\]←\|Strue∩Smodel\|/\|Smodel\|\\text\{Precision\}\[d\]\\leftarrow\|S\_\{\\text\{true\}\}\\cap S\_\{\\text\{model\}\}\|/\|S\_\{\\text\{model\}\}\|
11:
Recall\[d\]←\|Strue∩Smodel\|/\|Strue\|\\text\{Recall\}\[d\]\\leftarrow\|S\_\{\\text\{true\}\}\\cap S\_\{\\text\{model\}\}\|/\|S\_\{\\text\{true\}\}\|
12:endfor
13:returnSSJ, Precision, Recall
### S\.2Implementation Details
State representation:For PermutationProbe, states are represented as tuples\(a1,…,an\)\(a\_\{1\},\\ldots,a\_\{n\}\)\. For FSA\-Sim, states are automaton state identifiers\. For CodeProbe, states are variable binding dictionaries\.
Set operations:We use hash\-based sets forO\(1\)O\(1\)membership testing\. For large state spaces, we employ Bloom filters with false positive rate<0\.1%<0\.1\\%\.
Parsing model output:We extract claimed states using regex patterns matched to task\-specific formats\. For PermutationProbe:
```
pattern = r"Step \d+:.*-> \[([\d,\s]+)\]"
```
Error handling:If the model output cannot be parsed, the step is marked as an error withsd=∅s\_\{d\}=\\emptyset\.
### S\.3Diagnostic Power
The SSJ metric with precision/recall decomposition distinguishes two failure modes:
1. 1\.Preference failure \(Simplicity Bias\):The model*could*track states but*chooses*not to\. Signature: high precision \(claimed states are correct\), low recall \(many states omitted\)\.
2. 2\.Capability failure \(Decoherence\):The model*cannot*track states\. Signature: both precision and recall decay, indicating drift into fictitious state spaces\.
Our empirical results \(Table[4](https://arxiv.org/html/2606.00376#S5.T4)\) show parallel decay of both metrics, confirming capability failure\.
### S\.4Computational Complexity
Time:O\(m⋅\|S\|\)O\(m\\cdot\|S\|\)wheremmis trace length and\|S\|\|S\|is state size for comparison operations\.
Space:O\(m⋅\|S\|\)O\(m\\cdot\|S\|\)to store accumulated state sets\.
For typical experiments \(m≤100m\\leq 100,\|S\|≤1000\|S\|\\leq 1000elements\), extraction completes in<<1 second per trace\.
## Appendix TParameter Sensitivity Analysis
We analyze the sensitivity of our theoretical predictions to parameter choices and measurement uncertainty\.
### T\.1Decoherence Bound Parameters
The key parameters in Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)areϵ0\\epsilon\_\{0\}\(baseline error\) andγ\\gamma\(attention decay rate\)\. We estimated these from empirical data using maximum likelihood\.
Table 29:Parameter sensitivity for Decoherence Bound\.Key finding:The Deterministic Horizond∗d^\{\*\}is most sensitive to the baseline errorϵ0\\epsilon\_\{0\}: a 20% change inϵ0\\epsilon\_\{0\}shiftsd∗d^\{\*\}by about±\\pm2 steps, while the same change inγ\\gammashifts it by only±\\pm1\.2 steps\. Because the absorbing\-error reframing \(Corollary[4\.3](https://arxiv.org/html/2606.00376#S4.Thmtheorem3)\) treatsρ1\\rho\_\{1\}as a measured diagnostic rather than a multiplier,ρ1\\rho\_\{1\}does not enter the horizon and has no direct effect ond∗d^\{\*\}\. The qualitative conclusion, thatd∗∈\[19,31\]d^\{\*\}\\in\[19,31\], is robust across the entire confidence interval\.
### T\.2Bottleneck Theorem Parameters
The Attention Bottleneck \(Theorem[4\.6](https://arxiv.org/html/2606.00376#S4.Thmtheorem6)\) depends onδ\\delta\(precision threshold\) andρmax\\rho\_\{\\max\}\(value correlation bound\)\.
Table 30:Parameter sensitivity for Bottleneck Bound\.Key finding:The capacity bound is dominated by architectural parameters \(HH,dhd\_\{h\},LL\), which are known exactly\. The uncertainty inδ\\deltaandρmax\\rho\_\{\\max\}contributes only to the constant factorc\(δ,ρmax\)c\(\\delta,\\rho\_\{\\max\}\)\.
### T\.3Cross\-Validation of Parameter Estimates
We performed 5\-fold cross\-validation of the error model:
Table 31:Cross\-validation results for error model\.The small standard deviations \(std\(d^∗\)=0\.5\\text\{std\}\(\\hat\{d\}^\{\*\}\)=0\.5\) indicate robust parameter estimation and stable predictions across data splits\.
### T\.4Model Selection Robustness
We compared alternative error models to validate our context\-dependent formulation:
Table 32:Model comparison for error accumulation\.The context\-dependent model achieves the best trade\-off between fit quality \(R2=0\.91R^\{2\}=0\.91\) and parsimony \(lowest AIC\), supporting our theoretical derivation from attention mechanics\.
## Appendix UComplete Results by Task and Depth
This section provides complete experimental results across all models, tasks, and conditions\.
### U\.1Full Results: Synthetic Tasks
Table[33](https://arxiv.org/html/2606.00376#A21.T33)presents complete accuracy results for all synthetic task domains\. Each cell reports mean±\\pmstandard deviation across 3 independent runs\.
Table 33:Complete results on synthetic tasks\. All values are accuracy \(%\) with standard deviations\.Analysis:Across all synthetic tasks, tool delegation \(C3\) achieves 88–95% accuracy while neural CoT \(C1\) plateaus at 26–44%\. The consistency across task types \(permutations, automata, arithmetic, circuits, and code\) demonstrates that decoherence is task\-agnostic within the deterministic state\-tracking domain\. Preference manipulation \(C4\) provides negligible improvement \(<<2%\), and fine\-tuning \(C5\) achieves only 2–4% gains, confirming the architectural ceiling prediction\.
### U\.2Full Results: Real\-World Tasks
Table 34:Complete results on real\-world tasks\.Analysis:Real\-world tasks exhibit slightly lowerd∗d^\{\*\}values \(19–30\) compared to synthetic tasks \(22–31\), reflecting additional complexity from natural language ambiguity and larger state spaces\. However, the qualitative pattern is identical: C3 dramatically outperforms C1, and C4 provides minimal improvement\. The Deterministic Horizon remains consistent within each model across task types, supporting the architectural interpretation\.
### U\.3Results by Depth Bin
Table[35](https://arxiv.org/html/2606.00376#A21.T35)provides fine\-grained accuracy breakdowns across reasoning depth bins, enabling direct comparison with theoretical predictions\.
Table 35:Accuracy by reasoning depth \(PermutationProbe, GPT\-4o\)\.Analysis:The observed accuracy decay closely matches Theorem[4\.2](https://arxiv.org/html/2606.00376#S4.Thmtheorem2)predictions \(final row\)\. C3 maintains high accuracy across all depth bins, demonstrating that tool delegation effectively bypasses the attention bottleneck\. The convergence of C1, C4, and C5 at high depths confirms that neither preference manipulation nor fine\-tuning can overcome the architectural ceiling\.
### U\.4Open\-Weight Model Results
Table 36:Results for open\-weight models enabling attention analysis\.Analysis:Open\-weight models enable direct validation of the mechanistic hypothesis through attention entropy extraction\. The consistent negative correlation \(r≈−0\.72r\\approx\-0\.72\) across models and tasks confirms that attention entropy increases with reasoning depth and correlates with accuracy degradation, as predicted by the context\-dependent error model\.
## Appendix VNumerical Examples
We provide worked numerical examples demonstrating each theoretical bound with realistic parameter values\.
### V\.1Decoherence Bound Example
Parameters:ϵ0=0\.02\\epsilon\_\{0\}=0\.02,γ=0\.15\\gamma=0\.15,Leff=150L\_\{\\mathrm\{eff\}\}=150, target depthm=30m=30\.
Step 1: Compute cumulative error\.
∑i=130ϵ\(i\)\\displaystyle\\sum\_\{i=1\}^\{30\}\\epsilon\(i\)=30⋅0\.02\+0\.15⋅30⋅312⋅150\\displaystyle=30\\cdot 0\.02\+\\frac\{0\.15\\cdot 30\\cdot 31\}\{2\\cdot 150\}\(54\)=0\.60\+139\.5300\\displaystyle=0\.60\+\\frac\{139\.5\}\{300\}\(55\)=0\.60\+0\.465\\displaystyle=0\.60\+0\.465\(56\)=1\.065\\displaystyle=1\.065\(57\)
Step 2: Apply exponential bound\.
P\(correct at depth30\)≤exp\(−1\.065\)≈0\.345P\(\\text\{correct at depth \}30\)\\leq\\exp\(\-1\.065\)\\approx 0\.345\(58\)
Observed:34% accuracy at depth 30 for GPT\-4o, just below the bound, consistent with the absorbing\-error tightness of Corollary[4\.3](https://arxiv.org/html/2606.00376#S4.Thmtheorem3)\. Note that with the small effective decoherence lengthLeff=150L\_\{\\mathrm\{eff\}\}=150, the quadratic term \(0\.4650\.465\) is comparable to the linear term \(0\.600\.60\) already at depth 30, which is what drives the super\-exponential collapse; using the raw context windowL=O\(105\)L=O\(10^\{5\}\)here would render the quadratic term negligible and incorrectly predict near\-perfect accuracy\.
### V\.2Attention Bottleneck Example
Parameters for GPT\-4o:H=96H=96,L=128,000L=128,000,dh=128d\_\{h\}=128\.
Step 1: Compute log capacity\.
log2\|𝒮track\|\\displaystyle\\log\_\{2\}\|\\mathcal\{S\}\_\{\\text\{track\}\}\|≤H⋅log2\(L/H\)⋅dh1/2\\displaystyle\\leq H\\cdot\\log\_\{2\}\(L/H\)\\cdot d\_\{h\}^\{1/2\}\(59\)=96⋅log2\(128000/96\)⋅1280\.5\\displaystyle=96\\cdot\\log\_\{2\}\(128000/96\)\\cdot 128^\{0\.5\}\(60\)=96⋅log2\(1333\.3\)⋅11\.31\\displaystyle=96\\cdot\\log\_\{2\}\(1333\.3\)\\cdot 11\.31\(61\)=96⋅10\.38⋅11\.31\\displaystyle=96\\cdot 10\.38\\cdot 11\.31\(62\)≈11,275bits\\displaystyle\\approx 11,275\\text\{ bits\}\(63\)
Step 2: Convert to state count\.
\|𝒮track\|≤211,275≈103394\|\\mathcal\{S\}\_\{\\text\{track\}\}\|\\leq 2^\{11,275\}\\approx 10^\{3394\}\(64\)
Step 3: Compare with task requirement\.
For PermutationProbe withn=16n=16elements, trackingd=50d=50steps:
\|𝒮required\|=16\!50≈\(2\.09×1013\)50≈10666\|\\mathcal\{S\}\_\{\\text\{required\}\}\|=16\!^\{50\}\\approx\(2\.09\\times 10^\{13\}\)^\{50\}\\approx 10^\{666\}\(65\)
The task is*within*theoretical capacity \(10666≪10339410^\{666\}\\ll 10^\{3394\}\), but the per\-step information flow oflog2\(16\!\)≈44\\log\_\{2\}\(16\!\)\\approx 44bits must pass through attention bottleneck at each step\. With effective bandwidth∼\\sim100 bits/step \(empirical\), degradation accumulates\.
### V\.3Deterministic Horizon Example
Parameters:ϵ0=0\.02\\epsilon\_\{0\}=0\.02,γ=0\.15\\gamma=0\.15,Leff=150L\_\{\\mathrm\{eff\}\}=150,α=0\.5\\alpha=0\.5\.
Step 1: Apply formula\.
d∗\\displaystyle d^\{\*\}=−ϵ0Leff\+ϵ02Leff2\+2γLeffln\(1/α\)γ\\displaystyle=\\frac\{\-\\epsilon\_\{0\}L\_\{\\mathrm\{eff\}\}\+\\sqrt\{\\epsilon\_\{0\}^\{2\}L\_\{\\mathrm\{eff\}\}^\{2\}\+2\\gamma L\_\{\\mathrm\{eff\}\}\\ln\(1/\\alpha\)\}\}\{\\gamma\}\(66\)=−0\.02⋅150\+\(0\.02\)2⋅1502\+2⋅0\.15⋅150⋅ln\(2\)0\.15\\displaystyle=\\frac\{\-0\.02\\cdot 150\+\\sqrt\{\(0\.02\)^\{2\}\\cdot 150^\{2\}\+2\\cdot 0\.15\\cdot 150\\cdot\\ln\(2\)\}\}\{0\.15\}\(67\)=−3\+9\+31\.190\.15\\displaystyle=\\frac\{\-3\+\\sqrt\{9\+31\.19\}\}\{0\.15\}\(68\)=−3\+40\.190\.15\\displaystyle=\\frac\{\-3\+\\sqrt\{40\.19\}\}\{0\.15\}\(69\)=−3\+6\.3400\.15≈22\.3\\displaystyle=\\frac\{\-3\+6\.340\}\{0\.15\}\\approx 22\.3\(70\)
Observed:d∗=22d^\{\*\}=22for GPT\-4o, in excellent agreement\.
Step 2: Verifyd∗∝dhHd^\{\*\}\\propto\\sqrt\{d\_\{h\}H\}scaling \(open\-weight models\)\.
BecauseLeffL\_\{\\mathrm\{eff\}\}scales with per\-layer attention capacitydhHd\_\{h\}H, the horizon should scale asdhH\\sqrt\{d\_\{h\}H\}across models whose head count and head dimension are known\. For the open\-weight Llama family \(dh=128d\_\{h\}=128\):
dLlama\-70B∗dLlama\-8B∗≈128⋅64128⋅32=81924096=2≈1\.41\\frac\{d^\{\*\}\_\{\\text\{Llama\-70B\}\}\}\{d^\{\*\}\_\{\\text\{Llama\-8B\}\}\}\\approx\\sqrt\{\\frac\{128\\cdot 64\}\{128\\cdot 32\}\}=\\sqrt\{\\frac\{8192\}\{4096\}\}=\\sqrt\{2\}\\approx 1\.41\(71\)
Observed ratio:28/20=1\.4028/20=1\.40, in excellent agreement\. \(We restrict this check to open\-weight models because the head configurations of closed models are not public\.\)
### V\.4Scaling Law Verification Table
Table 37:Predicted vs\. observedd∗d^\{\*\}for open\-weight models, using the empirical fitd^∗=0\.314dhH\\hat\{d\}^\{\*\}=0\.314\\sqrt\{d\_\{h\}H\}\. Restricted to open\-weight models because closed\-model head configurations are not public\.Mean absolute error: 1\.1%\. The fit constantk=0\.314k=0\.314is obtained by least squares over the four open\-weight models\. We emphasize this is an*approximate empirical*relationship: the linearϵ0\\epsilon\_\{0\}term in the exact horizon formula breaks an exactdhH\\sqrt\{d\_\{h\}H\}proportionality \(Lemma[R\.8](https://arxiv.org/html/2606.00376#A18.Thmtheorem8)\), so thedhH\\sqrt\{d\_\{h\}H\}law should be read as a leading\-order trend rather than an identity\.
## Appendix WComplete Statistical Methodology
We provide complete details of all statistical methods employed, enabling full reproducibility\.
### W\.1Bootstrap Confidence Intervals
Procedure:
1. 1\.For each model\-task\-condition triple, collectnnaccuracy measurements
2. 2\.Generate 10,000 bootstrap samples by sampling with replacement
3. 3\.Compute statistic of interest for each bootstrap sample
4. 4\.Report 2\.5th and 97\.5th percentiles as 95% CI bounds
Implementation:
```
import numpy as np
def bootstrap_ci(data, n_bootstrap=10000, alpha=0.05):
n = len(data)
bootstrap_means = []
for _ in range(n_bootstrap):
sample = np.random.choice(data, size=n, replace=True)
bootstrap_means.append(np.mean(sample))
lower = np.percentile(bootstrap_means, 100*alpha/2)
upper = np.percentile(bootstrap_means, 100*(1-alpha/2))
return lower, upper
```
Bias correction:We apply BCa \(bias\-corrected and accelerated\) bootstrap for skewed distributions, followingEfron \([1987](https://arxiv.org/html/2606.00376#bib.bib69)\)\.
### W\.2Multiple Comparison Correction
Holm\-Bonferroni procedure:
1. 1\.Orderkkp\-values:p\(1\)≤p\(2\)≤⋯≤p\(k\)p\_\{\(1\)\}\\leq p\_\{\(2\)\}\\leq\\cdots\\leq p\_\{\(k\)\}
2. 2\.For hypothesisH\(i\)H\_\{\(i\)\}, reject ifp\(i\)<α/\(k−i\+1\)p\_\{\(i\)\}<\\alpha/\(k\-i\+1\)
3. 3\.Stop at first non\-rejected hypothesis
Application:With 12 models×\\times5 conditions×\\times8 tasks = 480 comparisons, we control family\-wise error rate atα=0\.05\\alpha=0\.05\.
### W\.3TOST Equivalence Testing
Two One\-Sided Tests \(TOST\)procedure for equivalence within marginΔ=5%\\Delta=5\\%:
Null hypothesis:\|μ1−μ2\|≥Δ\|\\mu\_\{1\}\-\\mu\_\{2\}\|\\geq\\Delta
Procedure:
1. 1\.TestH01:μ1−μ2≤−ΔH\_\{01\}:\\mu\_\{1\}\-\\mu\_\{2\}\\leq\-\\Delta\(one\-sidedtt\-test\)
2. 2\.TestH02:μ1−μ2≥ΔH\_\{02\}:\\mu\_\{1\}\-\\mu\_\{2\}\\geq\\Delta\(one\-sidedtt\-test\)
3. 3\.Reject null \(declare equivalence\) if bothp1,p2<αp\_\{1\},p\_\{2\}<\\alpha
Results:C1 vs\. C4 comparison yieldsp1=0\.0003p\_\{1\}=0\.0003,p2=0\.0008p\_\{2\}=0\.0008\(both<0\.05<0\.05\), confirming equivalence within 5% margin\.
### W\.4Bayes Factors
We compute Bayes factorsBF01\\text\{BF\}\_\{01\}supporting the null hypothesis using the JZS prior\(Rouderet al\.,[2009](https://arxiv.org/html/2606.00376#bib.bib70)\):
BF01=P\(data\|H0\)P\(data\|H1\)\\text\{BF\}\_\{01\}=\\frac\{P\(\\text\{data\}\|H\_\{0\}\)\}\{P\(\\text\{data\}\|H\_\{1\}\)\}\(72\)
Interpretation scale:
- •BF01\>10\\text\{BF\}\_\{01\}\>10: Strong evidence for null
- •BF01∈\[3,10\]\\text\{BF\}\_\{01\}\\in\[3,10\]: Moderate evidence for null
- •BF01∈\[1,3\]\\text\{BF\}\_\{01\}\\in\[1,3\]: Anecdotal evidence
- •BF01<1\\text\{BF\}\_\{01\}<1: Evidence for alternative
Results:C1 vs\. C4:BF01=5\.7\\text\{BF\}\_\{01\}=5\.7\(moderate evidence that preference manipulation has no effect\)\.
### W\.5Effect Sizes
Cohen’sddcomputed as:
d=x¯1−x¯2spooledd=\\frac\{\\bar\{x\}\_\{1\}\-\\bar\{x\}\_\{2\}\}\{s\_\{\\text\{pooled\}\}\}\(73\)
wherespooled=\(n1−1\)s12\+\(n2−1\)s22n1\+n2−2s\_\{\\text\{pooled\}\}=\\sqrt\{\\frac\{\(n\_\{1\}\-1\)s\_\{1\}^\{2\}\+\(n\_\{2\}\-1\)s\_\{2\}^\{2\}\}\{n\_\{1\}\+n\_\{2\}\-2\}\}\.
Interpretation:
- •\|d\|<0\.2\|d\|<0\.2: Negligible
- •0\.2≤\|d\|<0\.50\.2\\leq\|d\|<0\.5: Small
- •0\.5≤\|d\|<0\.80\.5\\leq\|d\|<0\.8: Medium
- •\|d\|≥0\.8\|d\|\\geq 0\.8: Large
C1 vs\. C3:d=2\.7d=2\.7\(very large effect\), tool delegation dramatically outperforms neural CoT\.
## Appendix XPower Analysis
We conducted a priori power analysis to determine adequate sample sizes\.
### X\.1Primary Analysis: C1 vs\. C3
Target:Detect 20% accuracy difference with 80% power atα=0\.05\\alpha=0\.05\.
Assumed parameters:
- •Effect size:δ=0\.20\\delta=0\.20\(absolute accuracy difference\)
- •Standard deviation:σ=0\.15\\sigma=0\.15\(estimated from pilot\)
- •Standardized effect:d=0\.20/0\.15=1\.33d=0\.20/0\.15=1\.33
Required sample size per group:
n=2⋅\(z1−α/2\+z1−βd\)2=2⋅\(1\.96\+0\.841\.33\)2≈9n=2\\cdot\\left\(\\frac\{z\_\{1\-\\alpha/2\}\+z\_\{1\-\\beta\}\}\{d\}\\right\)^\{2\}=2\\cdot\\left\(\\frac\{1\.96\+0\.84\}\{1\.33\}\\right\)^\{2\}\\approx 9\(74\)
Actual:500 instances per task×\\times3 runs = 1,500 observations\. Power\>0\.999\>0\.999\.
### X\.2Equivalence Analysis: C1 vs\. C4
Target:Confirm equivalence withinΔ=5%\\Delta=5\\%with 80% power\.
Using TOST power formula\(Lakens,[2017](https://arxiv.org/html/2606.00376#bib.bib71)\):
n=2⋅\(\(z1−α\+z1−β/2\)σΔ−\|μ1−μ2\|\)2n=2\\cdot\\left\(\\frac\{\(z\_\{1\-\\alpha\}\+z\_\{1\-\\beta/2\}\)\\sigma\}\{\\Delta\-\|\\mu\_\{1\}\-\\mu\_\{2\}\|\}\\right\)^\{2\}\(75\)
With expected true difference\|μ1−μ2\|=1%\|\\mu\_\{1\}\-\\mu\_\{2\}\|=1\\%:
n=2⋅\(\(1\.64\+1\.28\)⋅0\.150\.05−0\.01\)2≈240n=2\\cdot\\left\(\\frac\{\(1\.64\+1\.28\)\\cdot 0\.15\}\{0\.05\-0\.01\}\\right\)^\{2\}\\approx 240\(76\)
Actual:1,500 observations\. Power\>0\.95\>0\.95for equivalence testing\.
### X\.3Correlation Analysis: Cross\-Model
Target:Detect correlationr≥0\.5r\\geq 0\.5with 80% power\.
Using Fisher’szz\-transformation:
n=\(z1−α/2\+z1−β0\.5ln1\+r1−r\)2\+3n=\\left\(\\frac\{z\_\{1\-\\alpha/2\}\+z\_\{1\-\\beta\}\}\{0\.5\\ln\\frac\{1\+r\}\{1\-r\}\}\\right\)^\{2\}\+3\(77\)
Forr=0\.5r=0\.5:
n=\(1\.96\+0\.840\.5⋅1\.099\)2\+3≈29n=\\left\(\\frac\{1\.96\+0\.84\}\{0\.5\\cdot 1\.099\}\\right\)^\{2\}\+3\\approx 29\(78\)
Actual:500 instances per model pair\. Power\>0\.999\>0\.999for detectingr=0\.5r=0\.5\.
### X\.4Power Summary Table
Table 38:Power analysis summary\.All analyses are substantially overpowered, ensuring robust conclusions\.
## Appendix YExtended Attention Entropy Methodology
We provide complete details of attention entropy extraction and analysis\.
### Y\.1Extraction Pipeline
Hardware:2×\\timesNVIDIA A100 80GB, PyTorch 2\.1, Transformers 4\.36
Procedure:
1. 1\.Load model in bfloat16 withoutput\_attentions=True
2. 2\.For each instance in test set \(n=500n=500\): 1. \(a\)Generate reasoning trace with temperature=0 2. \(b\)Extract attention matrices at each decoding step 3. \(c\)For each layerℓ∈\[1,L\]\\ell\\in\[1,L\]and headh∈\[1,H\]h\\in\[1,H\]: Hℓ,h,t=−∑i=1taℓ,h,t,ilog2aℓ,h,t,iH\_\{\\ell,h,t\}=\-\\sum\_\{i=1\}^\{t\}a\_\{\\ell,h,t,i\}\\log\_\{2\}a\_\{\\ell,h,t,i\}\(79\)whereaℓ,h,t,ia\_\{\\ell,h,t,i\}is attention weight from positionttto positionii 4. \(d\)Aggregate across heads:Hℓ,t=1H∑h=1HHℓ,h,tH\_\{\\ell,t\}=\\frac\{1\}\{H\}\\sum\_\{h=1\}^\{H\}H\_\{\\ell,h,t\}
3. 3\.Correlate with per\-step accuracy
Memory optimization:We use gradient checkpointing and process attention matrices layer\-by\-layer to fit within 80GB\.
### Y\.2Entropy Computation Details
Numerical stability:We addϵ=10−10\\epsilon=10^\{\-10\}to avoidlog\(0\)\\log\(0\):
H=−∑i\(ai\+ϵ\)log2\(ai\+ϵ\)H=\-\\sum\_\{i\}\(a\_\{i\}\+\\epsilon\)\\log\_\{2\}\(a\_\{i\}\+\\epsilon\)\(80\)
Normalization:We report normalized entropyHnorm=H/log2\(t\)H\_\{\\text\{norm\}\}=H/\\log\_\{2\}\(t\)to account for varying sequence lengths\.
Layer grouping:
- •Early layers \(1–20\): Initial token mixing
- •Middle layers \(21–50\): Feature composition
- •Late layers \(51–80\): Task\-specific computation
### Y\.3Results by Model
Table 39:Attention entropy analysis across open\-weight models\.Consistent negative correlations \(r≈−0\.70r\\approx\-0\.70\) across architectures validate the mechanistic hypothesis\.
### Y\.4Entropy Growth Visualization
0101020203030404050500\.40\.40\.50\.50\.60\.60\.70\.70\.80\.80\.90\.911Reasoning stepddNormalized entropyLlama\-3\.3\-70BQwen\-2\.5\-72BMistral\-7BFigure 2:Attention entropy growth with reasoning step\. Linear growth validatesϵ\(d\)∝d\\epsilon\(d\)\\propto dmodel\.
## Appendix ZContext\-Window Insensitivity
A natural alternative hypothesis is that the horizond∗d^\{\*\}is set by the raw context windowLL\. We show it is not:d∗d^\{\*\}is governed by the effective decoherence lengthLeff=O\(102\)L\_\{\\mathrm\{eff\}\}=O\(10^\{2\}\)steps, which is orders of magnitude smaller thanL=O\(105\)L=O\(10^\{5\}\)tokens and dissociates from it under both cross\-model and controlled\-truncation tests\.
### Z\.1d∗d^\{\*\}Does Not TrackLLAcross Models
Ifd∗d^\{\*\}scaled with context capacity, models with largerLLwould exhibit larger horizons\. They do not\.
Table 40:Context window vs\. observedd∗d^\{\*\}\. At fixedL=128L=128K,d∗d^\{\*\}varies by over40%40\\%across models, and the largest\-context model does not have the largest horizon:d∗d^\{\*\}is uncorrelated withLL\.Three models shareL=128L=128K yet spand∗∈\{22,29,31\}d^\{\*\}\\in\\\{22,29,31\\\}, and the 200K\-context Claude model has a*lower*horizon than the 128K\-context o3\-mini\. AL\\sqrt\{L\}law is decisively rejected: the normalized column \(which would be constant underd∗∝Ld^\{\*\}\\propto\\sqrt\{L\}\) varies by44%44\\%\.
### Z\.2Controlled Truncation Within a Single Model
To isolate the effect ofLLfrom architecture, we artificially cap the context window of a single model \(Llama\-3\.3\-70B\) and re\-measured∗d^\{\*\}\.
Table 41:Artificial context truncation for Llama\-3\.3\-70B\.d∗d^\{\*\}is flat from the native 128K down to 8K, degrading only when the cap approaches the reasoning\-trace length \(∼\\sim2–4K tokens\)\.The horizon is invariant to an order\-of\-magnitude reduction inLL\(128K→\\to8K\) and collapses only once the window can no longer hold the model’s own intermediate state trace\. This is the signature of anLeffL\_\{\\mathrm\{eff\}\}\-bound rather than anLL\-bound phenomenon: decoherence is a property of attention’s effective state resolution over reasoning steps, not of raw context capacity\.
## Appendix AAExtended Architecture Ablation
We examine the approximated∗≈0\.314dh⋅Hd^\{\*\}\\approx 0\.314\\sqrt\{d\_\{h\}\\cdot H\}relationship through within\-family comparisons on open\-weight models \(the only models with public head configurations\)\.
### AA\.1Llama\-3 Family
Table 42:Llama\-3 architecture parameters andd∗d^\{\*\}\(d^∗=0\.314dhH\\hat\{d\}^\{\*\}=0\.314\\sqrt\{d\_\{h\}H\}\)\.Ratio:Predicted28\.4/20\.1=1\.4128\.4/20\.1=1\.41; Observed28/20=1\.4028/20=1\.40\. Within 1%\.
### AA\.2Qwen\-2\.5 Family
Table 43:Qwen\-2\.5 architecture parameters andd∗d^\{\*\}\.Ratio:Predicted28\.4/18\.8=1\.5128\.4/18\.8=1\.51; Observed28/19=1\.4728/19=1\.47\. Within 3%\.
### AA\.3Cross\-Family Comparison
Table 44:Cross\-family architecture comparison\.All within\-family ratios match predictions within 4%\.
## Appendix ABTask Family Characterization
We provide detailed specifications for each task family\.
### AB\.1Synthetic Tasks
#### AB\.1\.1PermutationProbe
State space:𝒮=Sn\\mathcal\{S\}=S\_\{n\}\(symmetric group onnnelements\)
Operators:Adjacent transpositions\{\(i,i\+1\):i∈\[1,n−1\]\}\\\{\(i,i\+1\):i\\in\[1,n\-1\]\\\}
Complexity:\|𝒮\|=n\!\|\\mathcal\{S\}\|=n\!; diameter=\(n2\)=\\binom\{n\}\{2\}
Instance distribution:
- •n∈\{8,12,16\}n\\in\\\{8,12,16\\\}
- •BFS\-optimal depths: 5–60
- •Instances per depth bin: See Table[24](https://arxiv.org/html/2606.00376#A14.T24)
#### AB\.1\.2FSA\-Sim
State space:𝒮=\{q1,…,qk\}\\mathcal\{S\}=\\\{q\_\{1\},\\ldots,q\_\{k\}\\\}\(automaton states\)
Operators:Transitionsδ:𝒮×Σ→𝒮\\delta:\\mathcal\{S\}\\times\\Sigma\\to\\mathcal\{S\}
Complexity:\|𝒮\|=k\|\\mathcal\{S\}\|=k; sequence length determines depth
Instance distribution:
- •k∈\{4,8,16\}k\\in\\\{4,8,16\\\}
- •\|Σ\|=4\|\\Sigma\|=4
- •Sequence lengths: 10–100
#### AB\.1\.3ArithChain
State space:𝒮=ℤp\\mathcal\{S\}=\\mathbb\{Z\}\_\{p\}\(integers modpp\)
Operators:\{\+a,−a,×b,/b\}\\\{\+a,\-a,\\times b,/b\\\}for fixeda,ba,b
Complexity:\|𝒮\|=p\|\\mathcal\{S\}\|=p; carry propagation increases effective depth
Instance distribution:
- •p=106p=10^\{6\}\(to require multi\-digit tracking\)
- •Operation chains: 5–50 steps
#### AB\.1\.4CircuitTrace
State space:𝒮=\{0,1\}n\\mathcal\{S\}=\\\{0,1\\\}^\{n\}\(bit vectors\)
Operators:Boolean gates \(AND, OR, XOR, NOT\)
Complexity:\|𝒮\|=2n\|\\mathcal\{S\}\|=2^\{n\}; depth = circuit depth
Instance distribution:
- •n∈\{8,16,32\}n\\in\\\{8,16,32\\\}
- •Circuit depths: 5–50 layers
#### AB\.1\.5CodeProbe
State space:Variable bindings\{v1:x1,…,vk:xk\}\\\{v\_\{1\}:x\_\{1\},\\ldots,v\_\{k\}:x\_\{k\}\\\}
Operators:Assignment, arithmetic, conditionals
Complexity:\|𝒮\|=∏i\|Di\|\|\\mathcal\{S\}\|=\\prod\_\{i\}\|D\_\{i\}\|whereDiD\_\{i\}is domain ofviv\_\{i\}
Instance distribution:
- •3–8 variables
- •5–40 statements
- •No loops \(deterministic execution\)
### AB\.2Real\-World Tasks
#### AB\.2\.1SWE\-Bench\-State
Source:SWE\-Bench\(Jimenezet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib52)\)
Selection criteria:
1. 1\.Bug fix requires tracking≥\\geq3 variables
2. 2\.State changes span≥\\geq2 files
3. 3\.Deterministic execution \(no randomness\)
4. 4\.Ground truth trace available
Annotation:Two expert annotators identified state\-tracking requirements; Cohen’sκ=0\.82\\kappa=0\.82\.
#### AB\.2\.2WebArena\-Nav
Source:WebArena\(Zhouet al\.,[2024](https://arxiv.org/html/2606.00376#bib.bib53)\)
Selection criteria:
1. 1\.Task requires≥\\geq5 navigation steps
2. 2\.Session state \(cart, auth, forms\) must be tracked
3. 3\.Deterministic success criteria
State elements:URL, cart contents, form fields, authentication status
#### AB\.2\.3SQL\-Multi
Source:Spider\(Yuet al\.,[2018](https://arxiv.org/html/2606.00376#bib.bib54)\)with complexity augmentation
Selection criteria:
1. 1\.Requires 3\+ table joins
2. 2\.Schema tracking essential \(column types, foreign keys\)
3. 3\.Aggregation across multiple levels
Complexity:Average 4\.2 tables, 7\.8 join conditions per query
## Appendix ACError Correlation Analysis
We analyze the structure of errors to characterize the forward error propagation underlying the absorbing\-error account of Corollary[4\.3](https://arxiv.org/html/2606.00376#S4.Thmtheorem3)\. These statistics are reported as diagnostics of the decoherence mechanism, not as corrections to the bound\.
### AC\.1Lag\-1 Autocorrelation
For each reasoning trace, we compute the autocorrelation of error indicators\{Xi\}\\\{X\_\{i\}\\\}:
ρ1=∑i=2m\(Xi−X¯\)\(Xi−1−X¯\)∑i=1m\(Xi−X¯\)2\\rho\_\{1\}=\\frac\{\\sum\_\{i=2\}^\{m\}\(X\_\{i\}\-\\bar\{X\}\)\(X\_\{i\-1\}\-\\bar\{X\}\)\}\{\\sum\_\{i=1\}^\{m\}\(X\_\{i\}\-\\bar\{X\}\)^\{2\}\}\(81\)
Results:
- •Meanρ1=0\.34\\rho\_\{1\}=0\.34\(95% CI: \[0\.31, 0\.37\]\)
- •Positive correlation indicates error propagation
- •Higher correlation at greater depths \(ρ1=0\.28\\rho\_\{1\}=0\.28ford<20d<20;ρ1=0\.41\\rho\_\{1\}=0\.41ford\>30d\>30\)
### AC\.2Higher\-Order Correlations
Table 45:Error autocorrelation by lag\.Correlation decays exponentially with lag, supporting first\-order Markov approximation\.
### AC\.3Conditional Error Rates
Table 46:Error rate conditioned on previous step\.Errors are 2\.87×\\timesmore likely following an error, confirming the propagation mechanism\. The conditional rate following a correct step \(0\.031\) exceeds the baseline interceptϵ0=0\.020\\epsilon\_\{0\}=0\.020because it is averaged over the depth distribution of traces, whereϵ\(d\)=ϵ0\+γd/Leff\\epsilon\(d\)=\\epsilon\_\{0\}\+\\gamma d/L\_\{\\mathrm\{eff\}\}rises with depth \(e\.g\.ϵ\(25\)≈0\.045\\epsilon\(25\)\\approx 0\.045\); the two quantities are therefore consistent\.
## Appendix ADPrompt Templates
We provide exact prompt templates used for each experimental condition\.
### AD\.1C1: Unconstrained Neural CoT
```
You are solving a state-space search problem.
Initial state: {initial_state}
Target state: {target_state}
Available operators: {operators}
Find a sequence of operators that transforms the initial
state into the target state. Show your reasoning step by
step, including the state after each operation.
Format each step as:
Step N: operator -> resulting_state
```
### AD\.2C2: Depth\-Limited CoT
```
[Same as C1, with addition:]
Note: The optimal solution requires exactly {optimal_depth}
steps. Focus on finding this solution.
```
### AD\.3C3: Tool\-Integrated
```
You have access to a BFS solver tool. Use it to find the
optimal path from initial to target state.
Initial state: {initial_state}
Target state: {target_state}
To use the solver, call: solve({initial_state}, {target_state})
After receiving the solution, verify and explain each step.
```
### AD\.4C4: Explicit Length Encouragement
```
[Same as C1, with addition:]
Important: Take as many reasoning steps as you need. Longer,
more careful reasoning is encouraged. Don’t rush to a
conclusion - explore the problem thoroughly.
```
### AD\.5C5: Fine\-Tuned
\[Same prompt as C1; model weights modified through fine\-tuning\]
## Appendix AEAdditional Figures
### AE\.1SSJ Decay Visualization
010102020303040405050606000\.20\.20\.40\.40\.60\.60\.80\.811Reasoning depthddState\-Space JaccardObservedae−bd−cd2ae^\{\-bd\-cd^\{2\}\}Figure 3:State\-Space Jaccard decay with reasoning depth\. Super\-exponential fit \(R2=0\.99R^\{2\}=0\.99\) confirms context\-dependent error model\.
### AE\.2Cross\-Model Correlation Heatmap
The full cross\-model correlation matrix is provided in Table[19](https://arxiv.org/html/2606.00376#A9.T19)\(Appendix[I](https://arxiv.org/html/2606.00376#A9)\)\. Key observations:
- •All pairwise correlations exceedr=0\.81r=0\.81
- •Highest correlation \(0\.91\) between Llama\-70B and Qwen\-72B reflects similar pretraining
- •Consistently high correlations across organizations \(OpenAI, Anthropic, DeepSeek, Meta, Alibaba\) support architectural causation over training\-specific explanations
- •Mean correlation:r¯=0\.85\\bar\{r\}=0\.85\(95% CI: \[0\.83, 0\.87\]\)
### AE\.3Fine\-Tuning Improvement by Depth
010102020303040405050022446688Reasoning depthddImprovement \(%\)ObservedO\(d∗/d\)O\(d^\{\*\}/d\)boundFigure 4:Fine\-tuning improvement decays with depth followingO\(d∗/d\)O\(d^\{\*\}/d\)bound from Theorem[4\.10](https://arxiv.org/html/2606.00376#S4.Thmtheorem10)\.
## Appendix AFExtended Cost Analysis
### AF\.1Per\-Model Cost Breakdown
Table 47:Detailed cost analysis by model and condition\.CPC= Cost per correct solution = Cost/Instance÷\\divAccuracy\. For C3, Cost/Instance includes tool invocation and verification overhead beyond the model token cost\.
### AF\.2Efficiency Analysis
Table 48:Efficiency metrics across strategies\.Tool delegation achieves 10–30×\\timesbetter token efficiency per correct solution\.
### AF\.3Carbon Footprint Estimate
FollowingStrubellet al\.\([2019](https://arxiv.org/html/2606.00376#bib.bib57)\)methodology:
- •GPU power: 400W per A100
- •PUE \(Power Usage Effectiveness\): 1\.1
- •Carbon intensity: 0\.4 kg CO2/kWh \(US average\)
Estimated emissions:
- •C1 \(Neural CoT\): 0\.23 kg CO2per 1,000 correct solutions
- •C3 \(Tool\): 0\.05 kg CO2per 1,000 correct solutions
- •Savings: 78\.3% reduction
## Appendix AGPractitioner Decision Framework
We provide a decision framework for practitioners determining when to use neural CoT vs\. tool delegation\.
### AG\.1Decision Tree
1. 1\.Is the task deterministic?\(exact state tracking required\) - •No→\\rightarrowNeural CoT may suffice - •Yes→\\rightarrowContinue to step 2
2. 2\.Estimate reasoning depthdd\(compare against the Deterministic Horizond∗∈\[19,31\]d^\{\*\}\\in\[19,31\]\) - •d<19d<19: Neural CoT acceptable; below the horizon for every evaluated model - •19≤d≤3119\\leq d\\leq 31: Model\-dependent; use a hybrid approach and verify against the specific model’sd∗d^\{\*\} - •d\>31d\>31: Strongly recommend tool delegation; beyond the horizon for all models
3. 3\.Is a verification tool available? - •Yes→\\rightarrowUse tool\-integrated approach \(C3\) - •No→\\rightarrowUse Best\-of\-N with verification heuristics
4. 4\.Safety\-critical deployment? - •Yes→\\rightarrowMandatory tool verification for alld\>10d\>10 - •No→\\rightarrowFollow cost\-efficiency optimization
### AG\.2Quick Reference Table
Table 49:Practitioner quick reference\.Accuracy bands are reported across the model suite: the neural\-to\-tool crossover is the Deterministic Horizon, which spansd∗∈\[19,31\]d^\{\*\}\\in\[19,31\]with architecture, so within that band the suitable strategy is model\-dependent\. Below depth 19 every evaluated model stays above the success threshold; beyond depth 31 every model has fallen below it\. The<<50% entry ford\>31d\>31follows from the definition ofd∗d^\{\*\}atα=0\.5\\alpha=0\.5: even the longest\-horizon model \(d∗=31d^\{\*\}=31\) sits at the threshold at depth 31 and declines past it\.Similar Articles
Faithfulness as Information Flow: Evaluating and Training Faithful Chain-of-Thought Reasoning
This paper proposes a framework to evaluate and improve faithfulness of chain-of-thought reasoning by controlling information flow, using entropy-based, KL-divergence, and gradient-based diagnostics, and introduces training interventions (attention masking, gradient masking, adversarial perturbations) that make reasoning more transparent and reduce shortcut reliance.
Stop When Further Reasoning Won't Help: Attention-State Adaptive Generation in Reasoning Models
This paper proposes ASAG, a training-free method that adaptively stops reasoning in large reasoning models based on attention distributions, reducing token usage by ~40% while improving accuracy by 3.2% on benchmarks using DeepSeek-R1-Distill and Qwen3 models.
When the Chain of Thought Knows Better: Failure Modes in Multi-Turn Reasoning Models
This paper analyzes failure modes in multi-turn reasoning models by introducing a CoT-Output safety matrix, revealing paradoxes like increased alignment-faking under monitoring cues and context-injection failures where safe internal reasoning is overridden by harmful outputs.
Effort as Ceiling, Not Dial: Reasoning Budget Does Not Modulate Cognitive Cost Alignment Between Humans and Large Reasoning Models
This paper tests whether varying inference-time reasoning effort affects the alignment between large reasoning models' chain-of-thought lengths and human reaction times. Results show alignment is invariant to effort perturbations, suggesting it is a training-time achievement.
Beyond a Single Direction: Chain-of-Thought Disrupts Simple Steering of Refusal
This paper investigates how chain-of-thought reasoning in large reasoning models complicates activation-based steering of refusal behavior. Experiments on DeepSeek-R1-Distill-LLaMA-8B show that refusal is jointly encoded in residual stream activations and the CoT trace, making models more robust to activation-level interventions but exposing the CoT as an alternative attack surface.