The Architecture of Errors: From Universal Impossibility to Patch-Local LLM Reliability
Summary
This paper argues that universal LLM reliability is impossible, but within operationally bounded patches (e.g., legal review, medical RAG), failures are sparse and repetitive, making reliability a local catalogue-discovery problem. It formalizes this with propositions and a corollary, relocating rather than dissolving the difficulty of long-context generation.
# From Universal Impossibility to Patch-Local LLM Reliability
Source: [https://arxiv.org/html/2605.30628](https://arxiv.org/html/2605.30628)
## The Architecture of Errors: From Universal Impossibility to Patch\-Local LLM Reliability
Mikhail L. Arbuzov (Independent Researcher, mike.arbuzov54@gmail.com), Lee Mosbacker (Independent Researcher, lee.mosbacker@gmail.com), Sisong Bei (Independent Researcher, qurining@gmail.com), Ziwei Dong (Independent Researcher, ziwei.dong@alumni.emory.edu), Dmitri Kalaev (Independent Researcher, kalaevdr@gmail.com), Alexey Shvets (Palo Alto Networks, ashvets@paloaltonetworks.com)
###### Abstract
Universal LLM reliability is not a finite-library problem: across all possible tasks, tools, schemas, knowledge sources, and evaluator expectations, new intervention-distinguishable failure modes can appear without bound, so no finite intervention dictionary can guarantee bounded residual error for every such mode. But deployed systems do not operate over the whole universe. They operate inside operationally bounded patches (legal review, medical RAG, code repair, customer-support agents, contract extraction) with recurring tasks, schemas, tools, and evaluator expectations. Within such patches, empirical evidence suggests failures are sparse, repetitive, and concentrated in a small recurring catalogue, so reliability becomes a local catalogue-discovery and intervention-coverage problem rather than an exponential token-length problem. We formalize this transition with two propositions and one corollary. Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1) is the worst-case-mode-wise negative result: no finite intervention dictionary covers every distinguishable failure mode of an unbounded domain. Corollary [1](https://arxiv.org/html/2605.30628#Thmcorollary1) is the inverse-discovery implication: the logarithmic upper bound on mode discovery cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is the positive patch-local result: under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget grows polylogarithmically in sequence length and becomes domain-constant once the patch catalogue saturates. The framework relocates rather than dissolves long-context difficulty: where the number of hard decisions itself grows with task length, reliability remains hard; the contribution is to identify the on-axis intervention rather than to make those regimes easy.
## 1 Introduction
The standard worry about long-context autoregressive generation is exponential. If every token has independent error probability $e$, then after $n$ tokens the chance of a fully correct output is $(1-e)^{n}$, which collapses to zero for any nontrivial $e$ and large enough $n$ (LeCun, [2023](https://arxiv.org/html/2605.30628#bib.bib28); Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11)). Earlier work in this series (Arbuzov et al., [2025](https://arxiv.org/html/2605.30628#bib.bib1)) argued that this worry is misplaced because errors are not distributed uniformly across tokens: only $5$–$10\%$ of tokens are “key” (genuinely dependent on long-range context (Fang et al., [2025](https://arxiv.org/html/2605.30628#bib.bib12))), while the remaining majority become nearly deterministic once enough context accumulates. Replacing $(1-e)^{n}$ with the two-rate model $P(\text{correct})=(1-e_{\text{key}})^{k}(1-e_{\text{non}})^{n-k}$, with $k\ll n$ scaling sublinearly in $n$, recovers the observed long-context coherence and converts the reliability question from “how does $n$ grow?” to “how does $k$ grow?”.
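To make the contrast between the uniform model and the two-rate model concrete, the following minimal sketch evaluates both expressions side by side. The error rates and the square-root growth of $k$ are illustrative assumptions chosen only to show the shape of the effect; they are not values reported in the paper.

```python
import math


def naive_reliability(n: int, e: float) -> float:
    """P(correct) under one uniform per-token error rate: (1 - e)^n."""
    return (1.0 - e) ** n


def two_rate_reliability(n: int, k: int, e_key: float, e_non: float) -> float:
    """P(correct) = (1 - e_key)^k * (1 - e_non)^(n - k)."""
    if not 0 <= k <= n:
        raise ValueError("k must satisfy 0 <= k <= n")
    return (1.0 - e_key) ** k * (1.0 - e_non) ** (n - k)


if __name__ == "__main__":
    # Assumed, illustrative rates (not measurements).
    e_key, e_non = 0.01, 1e-6
    for n in (1_000, 10_000, 100_000):
        k = round(math.sqrt(n))  # assumed sublinear growth of key tokens
        print(n, k, naive_reliability(n, e_key),
              two_rate_reliability(n, k, e_key, e_non))
```

Under these assumed values the uniform model collapses quickly as $n$ grows, while the two-rate value is governed almost entirely by how $k$ grows.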
#### From sparse tokens to recurring patterns\.
This paper takes the next step. Sparsity tells us *where* errors live; the natural follow-up is *what* they are. Recent failure-mode atlases supply a striking empirical answer: errors are not only sparse, they are also *repetitive*. ErrorAtlas (Ashury-Tahan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib3)), covering 83 models across 35 datasets, organises observed failures into 17 named categories sorted by prevalence, with a long-tailed distribution heavily concentrated in the head. In code, two error types (`AssertionError` and `NameError`) cover 86.35% of failures on HumanEval across 14 LLMs, replicated across 23 models on HumanEval Pro and MBPP Pro (Wen et al., [2024](https://arxiv.org/html/2605.30628#bib.bib61)). In math, MWPES-300K (Sun et al., [2025](https://arxiv.org/html/2605.30628#bib.bib49)) categorises 304,865 errors from 15 LLMs across four math word-problem datasets and reports that dataset characteristics shape error patterns systematically. The same picture appears in multi-hop QA (Zhang et al., [2026](https://arxiv.org/html/2605.30628#bib.bib67)), agentic tool use (Cemri et al., [2025](https://arxiv.org/html/2605.30628#bib.bib4)), and retrieval-augmented generation (Wood and Forbes, [2024](https://arxiv.org/html/2605.30628#bib.bib62)). These observations suggest a third layer in the reliability architecture, complementing key-token sparsity (Layer 1) and within-key stratified-manifold structure (Layer 2 in Arbuzov et al. [2025](https://arxiv.org/html/2605.30628#bib.bib1)). Within the small set of key tokens, only a fraction $\beta$ produce *hard* failures, and inside bounded deployment patches those hard failures appear to cluster into a finite or effectively capped catalogue of recurring modes whose size grows much more slowly than the number of observed events.
#### Contributions\.
The central contribution is a shift in the reliability object\. Universal LLM reliability is not a finite\-library problem, but patch\-local reliability can be treated as catalogue discovery and intervention coverage\. We formalise this transition with two propositions and one corollary\.
Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1) is the worst-case-mode-wise negative result: if an unbounded domain keeps producing intervention-distinguishable failures, no finite intervention dictionary can guarantee bounded residual error for every mode of the domain. Corollary [1](https://arxiv.org/html/2605.30628#Thmcorollary1) is the inverse-discovery implication of the logarithmic upper bound: the bound cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events, so open-domain tail discovery has rapidly diminishing returns. Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is the matching positive result inside a fixed deployment patch. Under log active-mode exposure and head-heavy coverage, a sufficient per-hard-decision intervention budget satisfies $m\geq\lceil|C_{\text{eff}}|^{1-\varepsilon/e_{\text{hard}}}\rceil$, with a doubly-logarithmic sequence-length rate as the optimistic pre-cap special case and a domain-constant budget once the patch catalogue saturates. The sequence-level analogue is strictly tighter and approaches full-catalogue coverage as $k$ grows; we surface that asymmetry rather than bury it.
Around this transition the paper does four supporting things. A three-layer framework, comprising sparsity ($\alpha$), hard-token stratification ($\beta$), and patch-local mode catalogue ($|C_{D}|$), separates *where* errors occur, *what* recurring forms they take, and *which* capability interventions address them (§[3](https://arxiv.org/html/2605.30628#S3)). Logarithmic mode discovery is stated as an empirical postulate, not a theorem, calibrated against ErrorAtlas, HumanEval, and MWPES with $\sigma\in[0.87,1.85]$ across anchors and $\sigma\approx 1.85$ carried as a conservative planning value. Section [4](https://arxiv.org/html/2605.30628#S4) synthesises evidence for failure clustering, cluster-selective interventions, and sublinear length scaling, drawing on $\sim$60 prior published results including the six-axis capability-elimination harvest of 28 quantitatively-anchored citations stratified into Patterns A/B/C (Appendix [B](https://arxiv.org/html/2605.30628#A2)). Finally, the most-cited steep-decay counter-evidence (Appendix [C](https://arxiv.org/html/2605.30628#A3)) is re-audited and shown to decay primarily over task-structure variables, including compositional graph size, fact count, log-time horizon, capacity threshold, and evidence scope, rather than raw token length.
## 2 Related Work
#### Failure\-mode taxonomies\.
ErrorAtlas (Ashury-Tahan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib3)), MWPES-300K (Sun et al., [2025](https://arxiv.org/html/2605.30628#bib.bib49)), the HumanEval categorisation of Wen et al. ([2024](https://arxiv.org/html/2605.30628#bib.bib61)), and the RFMDataset (Guo et al., [2025](https://arxiv.org/html/2605.30628#bib.bib19)) together suggest that, at any given corpus size, the number of distinct named failure modes is small (typically 8–20) and the cumulative coverage of the top modes is high. Domain-specific taxonomies for multi-hop QA (Zhang et al., [2026](https://arxiv.org/html/2605.30628#bib.bib67)), multi-agent systems (Cemri et al., [2025](https://arxiv.org/html/2605.30628#bib.bib4)), and tool-augmented agents (Yao et al., [2024](https://arxiv.org/html/2605.30628#bib.bib64)) report the same pattern.
#### Targeted interventions\.
Each named cluster has been the subject of focused intervention research. Python execution closes most of the arithmetic-error cluster (Gao et al., [2023a](https://arxiv.org/html/2605.30628#bib.bib13); Chen et al., [2023](https://arxiv.org/html/2605.30628#bib.bib5); Gou et al., [2024b](https://arxiv.org/html/2605.30628#bib.bib18)); constrained decoding eliminates format violations (Suresh et al., [2025](https://arxiv.org/html/2605.30628#bib.bib50); Wang et al., [2025b](https://arxiv.org/html/2605.30628#bib.bib56); Dong et al., [2025](https://arxiv.org/html/2605.30628#bib.bib10)); execution feedback resolves most code-logic errors (Shinn et al., [2023](https://arxiv.org/html/2605.30628#bib.bib48); Huang et al., [2023](https://arxiv.org/html/2605.30628#bib.bib21)); process reward models reduce step-level reasoning errors (Wang et al., [2024b](https://arxiv.org/html/2605.30628#bib.bib58); Lightman et al., [2024](https://arxiv.org/html/2605.30628#bib.bib32)); RAG strongly reduces hallucinations (Wood and Forbes, [2024](https://arxiv.org/html/2605.30628#bib.bib62)); preference optimisation curbs over-refusal (Karaman et al., [2024](https://arxiv.org/html/2605.30628#bib.bib23)); structured uncertainty fixes most tool-call failures (Suri et al., [2025](https://arxiv.org/html/2605.30628#bib.bib51)). Two structural observations recur: each intervention is *cluster-selective* (residuals belong to a different named cluster, not to the targeted one), and *additivity is approximate but not perfect* (Patel et al., [2026](https://arxiv.org/html/2605.30628#bib.bib42); Le, [2026](https://arxiv.org/html/2605.30628#bib.bib27)).
#### Length\-decay benchmarks\.
A parallel literature measures the *shape* of the reliability decay curve. Mild-decay results (Loong (Wang et al., [2024a](https://arxiv.org/html/2605.30628#bib.bib57)), GSM-$\infty$ (Zhou et al., [2025](https://arxiv.org/html/2605.30628#bib.bib68)), RULER (Hsieh et al., [2024](https://arxiv.org/html/2605.30628#bib.bib20)), anchor-based LLMs (Pang et al., [2024](https://arxiv.org/html/2605.30628#bib.bib40))) report log-linear, sigmoidal, or threshold-like decay inconsistent with smooth $(1-\varepsilon)^{n}$. Steep-decay results (Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11); Kuratov et al., [2024](https://arxiv.org/html/2605.30628#bib.bib25); Kwa et al., [2025](https://arxiv.org/html/2605.30628#bib.bib26); Wan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib54)) are sometimes cited as exponential-compounding evidence. The apparent tension is reduced on careful reading: each of the steep-decay results we re-audit decays over a variable distinct from raw token length, which Appendix [C](https://arxiv.org/html/2605.30628#A3) documents.
#### Gap\.
What is missing is a quantitative bridge between the small, repeating taxonomy catalogue and a polylog intervention budget\. We provide that bridge in §[3](https://arxiv.org/html/2605.30628#S3)\.
## 3 Theoretical Framework
#### Four levels of $|C|$.
Before any proposition, a definitional setup\. LLM error analysis routinely conflates four objects that the rest of this paper needs to keep apart:
- L1: failure events. Concrete LLM errors observed in a benchmark; the raw data behind any taxonomy.
- L2: empirical taxonomy categories. Researcher-chosen labels grouping L1 events. ErrorAtlas’s 17 categories, MWPES’s top-4 classes, and HumanEval’s `AssertionError`/`NameError` partition all live here. L2 is observable but depends on the taxonomer’s resolution choices.
- L3: latent failure modes. The unobserved underlying clusters that L2 approximates. L3 is the quantity Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1) is morally about, but it cannot be measured directly; we treat L2 as a noisy proxy for L3.
- L4: capability axes / interventions. The engineering unit: a Python interpreter, a constrained decoder, a retrieval-augmented generator. L4 is coarser than L2 in practice; one capability axis typically targets several L2 categories at once (§[4](https://arxiv.org/html/2605.30628#S4), Claim B).
Throughout this section, $|C|$ refers to the L2 count: that is what published taxonomies report. Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1) is therefore *engineering*-relevant because L4 is coarser than L2 (so a small intervention library can sweep across many categories at once), and *epistemically* conditional because L2 is a noisy proxy for the actual L3 mode catalogue whose faithfulness has not been independently measured. Tagging the formal claims at the right level keeps the framework honest where casual error-talk slips.
#### Roadmap\.
The framework separates three questions that are often conflated. First, *where* do errors occur? Only a sparse subset of key decisions. Second, *what* kinds of errors recur? A local catalogue of failure modes. Third, *what fixes them*? A smaller library of capability interventions. The two propositions of §[3.4](https://arxiv.org/html/2605.30628#S3.SS4) formalise the transition from universal impossibility to patch-local tractability: there is no finite intervention dictionary that covers all possible deployments, but inside a fixed deployment patch a sufficient library is small and slowly growing.
### 3.1 $\beta$-stratification of key tokens
The two-rate model of Arbuzov et al. ([2025](https://arxiv.org/html/2605.30628#bib.bib1)) distinguishes $k$ key tokens (error rate $e_{\text{key}}$) from $n-k$ non-key tokens ($e_{\text{non}}\ll e_{\text{key}}$). Empirical atlases suggest that even within the key-token class errors are not uniformly distributed: most key tokens are “decisions” for which the model has stable representations; only a fraction concentrate the actual failures.
###### Definition 1 (Hard fraction).
Partition the $k$ key tokens of a sequence into easy and hard subsets,

$$k_{\text{hard}}=\beta k,\qquad k_{\text{easy}}=(1-\beta)k,\qquad \beta\in(0,1),$$

where hard key tokens have an elevated error rate $e_{\text{hard}}$ corresponding to manifold-transition decisions in the sense of Arbuzov et al. ([2025](https://arxiv.org/html/2605.30628#bib.bib1), §3.2), and easy key tokens have $e_{\text{easy}}\approx e_{\text{non}}$.

Under Definition [1](https://arxiv.org/html/2605.30628#Thmdefinition1), the composed sequence-level reliability becomes

$$P(\text{correct})\;=\;(1-e_{\text{hard}})^{\beta k}\,(1-e_{\text{easy}})^{(1-\beta)k}\,(1-e_{\text{non}})^{n-k}.\tag{1}$$

We do not assume iid token errors: the three rates summarise *conditional per-decision failure hazards* after conditioning on prior decisions being correct, and grouping by stratum yields the multiplicative survival expression. Because $e_{\text{easy}}\approx e_{\text{non}}$ once context accumulates, the failure rate is dominated by the $\beta k$ hard tokens, and we treat $e_{\text{hard}}$ as the load-bearing quantity. The parameter $\beta$ is *latent*: failure-mode atlases report $\Pr(\text{category}\mid\text{error occurred})$, which does not identify $\Pr(\text{hard}\mid\text{key-token decision})$. We carry $\beta$ symbolically.
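The claim that the hard stratum dominates Eq. (1) is easiest to see in log space, where the three factors become additive terms. The sketch below, with assumed illustrative parameter values that do not come from the paper, returns the per-stratum log-survival contributions and the resulting probability.

```python
import math


def log_survival_terms(n, k, beta, e_hard, e_easy, e_non):
    """Return the (hard, easy, non-key) log-survival terms of Eq. (1)."""
    hard = beta * k * math.log1p(-e_hard)
    easy = (1.0 - beta) * k * math.log1p(-e_easy)
    non_key = (n - k) * math.log1p(-e_non)
    return hard, easy, non_key


def p_correct(n, k, beta, e_hard, e_easy, e_non):
    """Sequence-level reliability of Eq. (1), computed via log space."""
    return math.exp(sum(log_survival_terms(n, k, beta, e_hard, e_easy, e_non)))


if __name__ == "__main__":
    # Assumed, illustrative values: 5% key tokens, 20% of those hard.
    terms = log_survival_terms(n=10_000, k=500, beta=0.2,
                               e_hard=0.05, e_easy=1e-5, e_non=1e-5)
    print("log-survival (hard, easy, non-key):", terms)
    print("P(correct):", math.exp(sum(terms)))
```

With these assumed rates the hard term is orders of magnitude larger in absolute value than the other two combined, which is the sense in which $e_{\text{hard}}$ is load-bearing.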
### 3.2 Empirical postulate: logarithmic mode discovery
Stratification names a target, the $\beta k$ hard-token decisions, but does not yet tell us how many distinct ways those hard decisions can fail. The empirical answer from Section [1](https://arxiv.org/html/2605.30628#S1) is that failures repeat: a small number of recurring patterns covers most of the mass. The next question is how the catalogue’s size grows with the number of observed failures, because that is the quantity an intervention library has to keep up with.
Two candidates exist in the type–token literature: Heaps’ law (power-law, $|C|\approx K\cdot k_{\text{hard}}^{b}$ with $b\in[0.4,0.6]$ (Manning et al., [2008](https://arxiv.org/html/2605.30628#bib.bib36))), and logarithmic discovery. Zipfian rank-frequency does not imply logarithmic growth. Heaps’ law is the correct type–token consequence of Zipf-distributed events, and it is power-law, not logarithmic. We therefore state logarithmic mode discovery as an empirical postulate, defensible by direct measurement of error taxonomies, not as a theorem.
Two sample-size variables matter, and they are easy to conflate. Let $h(n)=\beta k(n)$ denote the number of hard-token decisions in a single sequence of length $n$; let $T$ denote the number of observed hard-failure events in a corpus used to discover the domain catalogue. These are not the same object: $h(n)$ controls per-sequence exposure, $T$ controls empirical discovery. We write $C_{D}$ for the full reachable catalogue inside domain patch $D$, $C_{\text{seen},D}(T)$ for the subset discovered after $T$ sampled failures, and $C_{\text{active},D}(n)$ for the subset a single sequence of length $n$ can activate. We write $\ln T$ and $\ln h(n)$ for clarity; small-sample bounds may be read with $\ln(1+T)$ and $\ln(1+h(n))$, and discrete counts such as $\beta k$ are interpreted either as expected values or as $\lfloor\beta k\rfloor$ asymptotically. These reformulations do not affect any rate claim below.
###### Postulate 1 (Patch-indexed catalogue discovery).
Within a fixed application domain $D$, the number of named recurring failure modes discovered after $T$ sampled hard-failure events is bounded by

$$|C_{\text{seen},D}(T)|\;\leq\;\min\bigl(A_{D}+\sigma_{D}\ln T,\ |C_{D}|\bigr),\qquad \sigma_{D}>0,\ A_{D}\geq 0.$$

The cap $|C_{D}|$ is the domain-imposed ceiling. The postulate concerns catalogue *discovery*, not the number of hard decisions inside a single sequence.
###### Assumption 2 (Per-sequence active-mode exposure).
For a single sequence of length $n$ in domain $D$,

$$|C_{\text{active},D}(n)|\;\leq\;\min\bigl(A^{\prime}_{D}+\sigma^{\prime}_{D}\ln h(n),\ |C_{D}|\bigr).$$

Primed constants are distinct from those of Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1): corpus discovery and per-sequence activation need not share the same rate.
The two bounds answer different questions, and the rest of the paper picks the right one at each point. Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2)’s sequence-length scaling rate is governed by $C_{\text{active},D}(n)$ via Assumption [2](https://arxiv.org/html/2605.30628#Thmdefinition2): how much of the catalogue can a single sequence touch? Full domain-library budgeting, the practical engineering target for a deployed system, uses $C_{D}$: how large does the library need to be to cover the domain’s reachable failure modes? The cap $|C_{D}|$ enters both: once enough has been sampled (or activated) to saturate the domain ceiling, further growth stops and the budget becomes a domain-constant function of $|C_{D}|$ alone.
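The two capped-logarithmic bounds share one functional form and differ only in which count they take as input. The sketch below encodes that shared form using the small-sample reading $\ln(1+\cdot)$ mentioned earlier; the function names, the callable `k_of_n`, and all numeric arguments are hypothetical choices for illustration.

```python
import math


def capped_log_bound(count, offset, rate, cap):
    """min(offset + rate * ln(1 + count), cap), the shared bound shape."""
    if count < 0:
        raise ValueError("count must be non-negative")
    return min(offset + rate * math.log1p(count), cap)


def discovered_modes_bound(num_failures, a_d, sigma_d, catalogue_size):
    """Postulate 1: bound on |C_seen,D(T)| after T observed hard failures."""
    return capped_log_bound(num_failures, a_d, sigma_d, catalogue_size)


def active_modes_bound(n, k_of_n, beta, a_prime, sigma_prime, catalogue_size):
    """Assumption 2: bound on |C_active,D(n)| with h(n) = beta * k(n)."""
    hard_decisions = beta * k_of_n(n)
    return capped_log_bound(hard_decisions, a_prime, sigma_prime, catalogue_size)


if __name__ == "__main__":
    # Assumed values: cap of 12 modes, sigma = 1.85, k(n) = sqrt(n), beta = 0.2.
    for t in (10, 100, 1_000, 10**6):
        print("T =", t, "->", discovered_modes_bound(t, 0.0, 1.85, 12))
    print(active_modes_bound(10_000, math.sqrt, 0.2, 0.0, 1.0, 12))
```

Once the logarithmic term exceeds the cap, both functions return the cap unchanged, which mirrors the saturation regime described above.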
#### Empirical calibration\.
Existing LLM error taxonomies provide endpoint anchors for small named catalogues at observed corpus scales. ErrorAtlas, HumanEval-style code taxonomies (Wen et al., [2024](https://arxiv.org/html/2605.30628#bib.bib61)), and MWPES (Sun et al., [2025](https://arxiv.org/html/2605.30628#bib.bib49)) place $|C|$ roughly in the $8$–$20$ range across corpora ranging from approximately $10^{4}$ to $3\times 10^{5}$ observed failures, yielding $\sigma\in[0.87,1.85]$ under the simple $A=0$ logarithmic calibration. We use $\sigma\approx 1.85$ as a conservative planning value, not as a fitted law. These endpoint counts are not discovery curves: the missing test is a subsample-vs-discovered-modes measurement within a fixed deployment patch. Full calibration details are in Appendix [E](https://arxiv.org/html/2605.30628#A5).
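Under the $A=0$ calibration, a single endpoint $(|C|, T)$ gives $\sigma = |C| / \ln T$. The sketch below applies that one-point formula. The specific endpoint pairs are assumptions picked from the ranges quoted above purely for illustration; the pairs actually used for the calibration are in the appendix and may differ.

```python
import math


def sigma_from_endpoint(num_categories, num_failures, offset=0.0):
    """One-point calibration: solve |C| = offset + sigma * ln(T) for sigma."""
    if num_failures <= 1:
        raise ValueError("num_failures must exceed 1 so that ln(T) > 0")
    return (num_categories - offset) / math.log(num_failures)


if __name__ == "__main__":
    # Hypothetical (|C|, T) endpoints drawn from the quoted ranges.
    for categories, failures in ((8, 1e4), (17, 1e4), (20, 3e5)):
        print(categories, failures, round(sigma_from_endpoint(categories, failures), 2))
```

For these assumed pairs, 8 categories at $10^{4}$ failures gives a value near 0.87 and 17 categories at $10^{4}$ failures gives a value near 1.85, which is consistent with the endpoints of the quoted interval without establishing that these were the anchors used.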
#### Capabilities are coarser than error categories\.
The category count $|C|$ is conservative for engineering because deployed interventions operate at the capability level, not the label level. A Python interpreter can remove execution-error components across arithmetic, unit conversion, counting, list manipulation, and date arithmetic; a constrained decoder can eliminate several structural output failures at once. The formal coverage problem is therefore closer to weighted set cover over failure mass than one-intervention-per-category counting. The ranked-category model used in Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is a tractable proxy; the full capability harvest appears in Appendix [B](https://arxiv.org/html/2605.30628#A2).
#### L2 $\to$ L4 as weighted set cover.
Formally, let each L4 intervention $I_{j}$ cover a subset $S_{j}\subseteq C_{D}$ of the L2 mode catalogue, with per-mode residual-reduction weight $r_{ij}\in[0,1]$ giving the fraction of mode $i$’s hard-error mass that $I_{j}$ removes, and let $p_{i}$ denote the hard-error mass of mode $i$ (as in §[3.3](https://arxiv.org/html/2605.30628#S3.SS3)). The deployment problem chooses a library $\mathcal{I}$ of size at most $m$ to maximise covered hard-error mass:

$$\max_{\mathcal{I}:|\mathcal{I}|\leq m}\;\sum_{i\in C_{D}}p_{i}\,\max_{j:I_{j}\in\mathcal{I}}r_{ij}.$$

Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) below uses the analytically tractable binary special case $r_{ij}\in\{0,1\}$ in which the $m$-th-ranked intervention covers the $m$-th-ranked mode (one mode per intervention, ordered by marginal covered mass). In real deployments one L4 capability typically covers several L2 categories simultaneously, so the ranked-category budget of Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is a conservative proxy for the true set-cover optimum: the empirically achieved library size for a given residual target may be a constant factor smaller than the formula predicts. Overlap, additivity, and prompt-channel interference are treated empirically in the capability-harvest of Appendix [B](https://arxiv.org/html/2605.30628#A2) (“Additivity and its limits”), not as a formal set-cover optimisation.
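The paper does not solve this objective as an optimisation problem, but a greedy heuristic is one natural way to explore it. The sketch below is such a heuristic under stated assumptions: the function names, the toy masses `p`, and the toy matrix `r` (rows are modes, columns are interventions) are all hypothetical. Because the objective is a monotone submodular function of the chosen library, greedy selection is a standard approximation strategy, although it is not guaranteed to be optimal.

```python
def covered_mass(p, r, library):
    """Objective: sum_i p[i] * max_{j in library} r[i][j] (0 if library is empty)."""
    return sum(
        p_i * max((r_i[j] for j in library), default=0.0)
        for p_i, r_i in zip(p, r)
    )


def greedy_library(p, r, m):
    """Pick up to m interventions, each time adding the best marginal gain."""
    num_interventions = len(r[0]) if r else 0
    chosen = []
    for _ in range(min(m, num_interventions)):
        remaining = [j for j in range(num_interventions) if j not in chosen]
        best = max(remaining, key=lambda j: covered_mass(p, r, chosen + [j]))
        chosen.append(best)
    return chosen


if __name__ == "__main__":
    # Toy example (assumed): three modes, three interventions.
    p = [0.4, 0.35, 0.25]            # hard-error mass per mode
    r = [[1.0, 0.0, 0.0],            # mode 0 is removed by intervention 0
         [0.0, 1.0, 0.0],            # mode 1 is removed by intervention 1
         [0.0, 1.0, 1.0]]            # mode 2 is removed by interventions 1 and 2
    library = greedy_library(p, r, m=2)
    print(library, covered_mass(p, r, library))
```

In the toy example, intervention 1 sweeps two modes at once, so a library of two interventions covers all three modes; this is the sense in which a ranked one-mode-per-intervention budget is conservative.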
#### Operational patch\.
A deployment patch $D$ is more than a topic label. It is an operational tuple fixing, over a chosen time window, the components that determine which failure modes are reachable at all:

$$D=(\mathcal{X},\,\mathcal{S},\,\mathcal{U},\,\mathcal{R},\,\mathcal{E},\,\mathcal{P},\,H,\,\tau),$$

where $\mathcal{X}$ is the task input distribution, $\mathcal{S}$ the input/output schema family, $\mathcal{U}$ the user or client class, $\mathcal{R}$ the retrieval corpus or knowledge source, $\mathcal{E}$ the evaluator or acceptance criterion, $\mathcal{P}$ the policy and safety constraints, $H$ the workflow horizon, and $\tau$ the time window over which these components are fixed. A *patch shift* occurs when one or more of these components changes enough to alter the reachable failure catalogue $C_{D}$. Legal review on contracts from a fixed jurisdiction is a different patch from legal review across mixed jurisdictions; medical RAG over a curated knowledge source is a different patch from medical RAG over arbitrary web content. The patch-local claims in this paper apply within a single, operationally fixed tuple of this form, not to “the domain” in any looser sense.
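One way to make the tuple operational is to record it as an immutable configuration object and flag which components differ between two deployments. The sketch below is a hypothetical encoding: the class name, field names, and string identifiers are assumptions, and a changed field is only a prompt to re-examine the catalogue, since a patch shift is defined by a change in the reachable catalogue rather than by any change in configuration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class DeploymentPatch:
    """Hypothetical record of the operational tuple D."""
    input_distribution: str  # X: task input distribution
    schema_family: str       # S: input/output schema family
    user_class: str          # U: user or client class
    retrieval_corpus: str    # R: retrieval corpus or knowledge source
    evaluator: str           # E: evaluator or acceptance criterion
    policy: str              # P: policy and safety constraints
    horizon: int             # H: workflow horizon
    time_window: str         # tau: window over which the above are fixed


def changed_components(old: DeploymentPatch, new: DeploymentPatch) -> list:
    """Names of tuple components that differ; candidates for a patch shift."""
    return [f.name for f in fields(old)
            if getattr(old, f.name) != getattr(new, f.name)]


if __name__ == "__main__":
    curated = DeploymentPatch("clinical-questions", "qa-json", "clinicians",
                              "curated-guidelines", "expert-rubric",
                              "medical-safety", 3, "2026-Q1")
    open_web = replace(curated, retrieval_corpus="open-web")
    print(changed_components(curated, open_web))
```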
#### Domain patches cap the engineering problem\.
A fixed deployment patch restricts the task family, schemas, tools, workflows, and evaluator expectations. Prior work on localized representations (Park et al., [2024](https://arxiv.org/html/2605.30628#bib.bib41); Li and Sarwate, [2025](https://arxiv.org/html/2605.30628#bib.bib30)), cross-domain performance variation (Wang et al., [2024c](https://arxiv.org/html/2605.30628#bib.bib60)), and long-tail knowledge (Mallen et al., [2023](https://arxiv.org/html/2605.30628#bib.bib35); Kandpal et al., [2023](https://arxiv.org/html/2605.30628#bib.bib22)) supports the weaker claim that model behaviour is strongly domain-dependent; it does not directly measure $|C_{D}|$. We therefore treat the patch’s reachable catalogue as finite or effectively capped as a modelling assumption motivated by domain heterogeneity, not as a theorem derived from boundedness. The full evidence breakdown is in Appendix [D](https://arxiv.org/html/2605.30628#A4).
###### Corollary 1 (Inverse Discovery Cost).
Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1) caps the discovered catalogue from above. Reading the bound in the inverse direction, the corpus must contain at least

$$T\;\geq\;\exp\!\left(\frac{q-A_{D}}{\sigma_{D}}\right)\tag{2}$$

observed hard failures before that cap can accommodate $q$ distinct discovered modes. Equivalently, the minimum sample budget compatible with the bound scales as $T_{\min}(q)=\exp((q-A_{D})/\sigma_{D})$, so accommodating $\Delta q$ further modes raises that minimum budget by a factor $\exp(\Delta q/\sigma_{D})$.
#### Tightness reading\.
If the empirical discovery curve is approximately tight against the bound at the observed corpus scales, the inequality becomes an approximate equality, $T(q)\approx\exp((q-A_{D})/\sigma_{D})$. At the conservative calibration $\sigma_{D}\approx 1.85$, five additional discovered modes cost roughly $15\times$ more failures, and ten cost roughly $220\times$. The lower bound ([2](https://arxiv.org/html/2605.30628#S3.E2)) holds unconditionally; the multiplicative-cost reading needs tightness.
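The multiplicative-cost figures follow directly from the factor $\exp(\Delta q/\sigma_{D})$. A minimal sketch of that arithmetic, with function names chosen here for illustration, is:

```python
import math


def min_failures_for_modes(q, a_d, sigma_d):
    """T_min(q) = exp((q - A_D) / sigma_D), the inverse reading of the bound."""
    return math.exp((q - a_d) / sigma_d)


def cost_multiplier(delta_q, sigma_d):
    """Factor by which T_min grows when delta_q further modes are accommodated."""
    return math.exp(delta_q / sigma_d)


if __name__ == "__main__":
    sigma = 1.85  # the conservative planning value quoted in the text
    for delta_q in (1, 5, 10):
        print(delta_q, round(cost_multiplier(delta_q, sigma), 1))
```

The multiplier depends only on $\Delta q$ and $\sigma_{D}$, not on the starting catalogue size $q$ or on $A_{D}$, because both cancel in the ratio $T_{\min}(q+\Delta q)/T_{\min}(q)$.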
#### Engineering meaning\.
The claim is about *newly distinguishable* tail modes, not ordinary failures, which remain common inside already-discovered modes. Combined with the head-heavy coverage model of §[3.3](https://arxiv.org/html/2605.30628#S3.SS3), the picture is one of asymmetric returns: the head is cheap to find and absorbs most error mass; the tail is expensive to discover and removes less residual error per mode. Full derivation, the Heaps polynomial variant, the saturation regime, and a mode-mediated capability-gain sub-corollary are in Appendix [A.6](https://arxiv.org/html/2605.30628#A1.SS6).
### 3.3 Coverage by a targeted intervention library
With the catalogue defined, the next question is how much residual error a library of $m$ targeted interventions actually closes off. Let $p_{1}\geq p_{2}\geq\dots\geq p_{|C|}$ denote the ranked empirical hard-error masses of the local failure modes, with $\sum_{i}p_{i}=1$. The exact cumulative coverage of the top-$m$ library is the empirical step function

$$F_{\text{emp}}(m)\;=\;\sum_{i=1}^{m}p_{i},\qquad m\in\{0,1,\ldots,|C|\},\tag{3}$$

with $F_{\text{emp}}(0)=0$, $F_{\text{emp}}(1)=p_{1}$, and $F_{\text{emp}}(|C|)=1$ by construction; $F_{\text{emp}}$ handles every edge case directly and requires no patching. For closed-form analysis we use the continuum log-head approximation

$$F_{\log}(m;|C|)\;=\;\min\!\left(1,\ \frac{\ln m}{\ln|C|}\right),\qquad m\geq 2,\ |C|\geq 2,\tag{4}$$

as a *planning approximation* to $F_{\text{emp}}$, not as the true distribution. For our ErrorAtlas anchor ($|C|=17, m=5$), $F_{\log}(5;17)\approx 56.8\%$; doubling the library to $m=10$ pushes it to $F_{\log}(10;17)\approx 81.3\%$. A Zipf-1 reference for the same anchor gives $H_{5}/H_{17}\approx 66.4\%$, concentrating more head mass than $F_{\log}$, so the log form is conservative relative to Zipf. Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) below is therefore a closed-form planning approximation to empirical ranked coverage, not a distribution-free theorem; the qualitative polylog conclusion is what survives across cumulative-coverage families (Appendix [A.5](https://arxiv.org/html/2605.30628#A1.SS5)). Empirical anchoring of $F_{\text{emp}}$ to a measured cumulative-coverage curve in any single patch remains an explicit falsifiability test.
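The three coverage figures quoted for the ErrorAtlas anchor can be reproduced in a few lines. The sketch below implements Eq. (3), Eq. (4), and the Zipf-1 reference as a ratio of harmonic numbers; the function names are illustrative, and the Zipf masses are a synthetic reference distribution rather than measured per-category frequencies.

```python
import math


def f_emp(masses, m):
    """Eq. (3): cumulative mass of the m largest modes (masses sum to 1)."""
    ranked = sorted(masses, reverse=True)
    return sum(ranked[:m])


def f_log(m, catalogue_size):
    """Eq. (4): log-head planning approximation, defined for m, |C| >= 2."""
    if m < 2 or catalogue_size < 2:
        raise ValueError("F_log requires m >= 2 and |C| >= 2")
    return min(1.0, math.log(m) / math.log(catalogue_size))


def harmonic(n):
    """n-th harmonic number H_n."""
    return sum(1.0 / i for i in range(1, n + 1))


def zipf_coverage(m, catalogue_size):
    """Top-m coverage under Zipf-1 masses p_i proportional to 1/i."""
    return harmonic(m) / harmonic(catalogue_size)


if __name__ == "__main__":
    print(round(f_log(5, 17), 3), round(f_log(10, 17), 3),
          round(zipf_coverage(5, 17), 3))
```

Passing explicit Zipf-1 masses to `f_emp` gives the same value as `zipf_coverage`, which is a quick consistency check between the step function and the closed-form reference.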
One caveat carries through: $F_{\log}$ tracks ranked L2 categories covered, not L4 capability axes deployed. Under the set-cover model of §[3.2](https://arxiv.org/html/2605.30628#S3.SS2), where one capability often sweeps several L2 categories, residual error at a given $m$ is generally lower than ([4](https://arxiv.org/html/2605.30628#S3.E4)) predicts; the bound is valid but loose, and empirically achieved library sizes may be a constant factor smaller than the formula predicts. For $|C|=1$, residual error is fully covered or fully uncovered by definition and neither $F_{\text{emp}}$ nor $F_{\log}$ is informative. After deploying the top-$m$ library, the residual per-hard-token error rate is

$$e_{\text{res}}(m)\;=\;\bigl(1-F_{\log}(m;|C|)\bigr)\,e_{\text{hard}}\tag{5}$$

under the log-head approximation, or the corresponding $F_{\text{emp}}$ expression if the per-mode masses are known directly.
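Setting $e_{\text{res}}(m)\leq\varepsilon$ in Eq. (5) and solving for $m$ gives $\ln m\geq(1-\varepsilon/e_{\text{hard}})\ln|C|$, that is, $m\geq|C|^{1-\varepsilon/e_{\text{hard}}}$, which has the same form as the budget stated in the Contributions. The sketch below checks that reading numerically. The error rate and tolerance are assumed illustrative values, and the plain category count is used where the paper's budget writes $|C_{\text{eff}}|$.

```python
import math


def residual_error(m, catalogue_size, e_hard):
    """Eq. (5): residual per-hard-token error under the log-head approximation."""
    coverage = min(1.0, math.log(m) / math.log(catalogue_size))
    return (1.0 - coverage) * e_hard


def sufficient_budget(catalogue_size, e_hard, eps):
    """Smallest integer m >= 2 with m >= |C|^(1 - eps / e_hard); 0 if none needed."""
    if eps >= e_hard:
        return 0  # the target is already met without any intervention
    raw = catalogue_size ** (1.0 - eps / e_hard)
    return max(2, math.ceil(raw))  # F_log is only defined for m >= 2


if __name__ == "__main__":
    # Assumed values: 17 categories, e_hard = 0.10, target residual 0.02.
    m = sufficient_budget(17, e_hard=0.10, eps=0.02)
    print(m, residual_error(m, 17, 0.10), residual_error(m - 1, 17, 0.10))
```

For these assumed values the budget meets the target while one fewer intervention does not, so the ceiling in the formula is doing real work rather than rounding noise.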
### 3.4 From universal impossibility to patch-local reliability
The finite\-catalogue claim is patch\-local\. It does not say that all possible LLM failures can be covered by a universal list of fixes\. Across all possible tasks, tools, schemas, knowledge sources, workflows, and evaluator expectations, new intervention\-distinguishable failures can keep appearing\. In that setting, a finite intervention dictionary is not a well\-posed reliability target\.
The positive result begins only after a deployment patch $D$ has been fixed. Inside a patch, the task family, schemas, tools, evaluator expectations, and admissible workflows recur. The engineering question then changes from “Can one library cover all possible LLM failures?” to “How large must the local library be to cover enough of the reachable failure catalogue $C_{D}$?”
###### Proposition 1 (No Universal Finite Intervention Dictionary).
Fix a residual-error tolerance $\varepsilon$. If a domain $D$ contains an infinite sequence of failures where each new failure is not covered by any finite intervention dictionary covering all earlier failures, then the $\varepsilon$-resolution failure catalogue $C_{D}^{\varepsilon}$ is infinite. Consequently, no finite intervention dictionary can guarantee residual error below $\varepsilon$ for every intervention-distinguishable mode in $D$.
#### Metric\.
This is a worst\-case, mode\-wise guarantee, not a claim about expected residual error under a fixed probability distribution\. Under a distributional metric, an infinite uncovered tail may still have arbitrarily small total mass; the proposition rules out only the worst\-case\-mode\-coverage reading of universal reliability, which is the reading the engineering literature implicitly assumes when it asks for “a fixed list of interventions that covers LLM use\.”
#### Proof sketch\.
Each new failure is intervention-distinguishable from the failures before it. The sequence therefore generates infinitely many distinct intervention modes, and any finite dictionary must miss some later mode. The full proof, including the precise definition of mode-level coverage, is in Appendix [A.1](https://arxiv.org/html/2605.30628#A1.SS1).
#### Engineering meaning\.
Universal reliability is not a finite-library problem. A fixed list of interventions cannot cover open-ended LLM use in general. The rest of the paper is therefore about *patch-local* reliability, not universal reliability. Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1) guards against a misreading: we are not claiming that a fixed list of $\approx 50$ named patterns covers LLM use in general.
The positive result composes $\beta$-stratification (Definition [1](https://arxiv.org/html/2605.30628#Thmdefinition1)), the active-mode bound (Assumption [2](https://arxiv.org/html/2605.30628#Thmdefinition2)), and the log-coverage form (Eq. [4](https://arxiv.org/html/2605.30628#S3.E4)) into a local budget statement. The result is conditional engineering math, not an unconditional theorem about LLMs: it depends on the log-coverage approximation, the active-mode exposure assumption, and the patch hypothesis that $C_{D}$ is finite or effectively capped. Appendix [A.2](https://arxiv.org/html/2605.30628#A1.SS2) states these conditions explicitly before the derivation.
###### Proposition 2 (Patch-Local Sufficient Intervention Budget).
Fix a deployment patch $D$. Let $\varepsilon\in(0,e_{\text{hard}})$ be the target per-hard-decision residual error rate. Under the log-head approximation $F_{\log}$ of §[3.3](https://arxiv.org/html/2605.30628#S3.SS3), a library covering the dominant local modes is sufficient for the target once
$$m \;\geq\; \left\lceil |C_{\text{eff}}|^{\,1-\varepsilon/e_{\text{hard}}} \right\rceil, \tag{6}$$
where $C_{\text{eff}}$ is either the active catalogue $C_{\text{active},D}(n)$ touched by a single sequence, or the full reachable catalogue $C_{D}$ of the deployment patch. If $C_{\text{active},D}(n)$ grows logarithmically with the number of hard decisions and $k(n)=\Theta(\log n)$, then the pre-cap sufficient budget grows doubly-logarithmically in sequence length. Once the patch catalogue saturates, the sufficient budget becomes domain-constant in $|C_{D}|$. This is a sufficient, model-implied budget under the stated approximation, not a proof of the true minimal intervention library.
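The budget in (6) is straightforward to evaluate once a catalogue size and a target are chosen. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function name `sufficient_budget`, the small rounding tolerance, and the clamping of the result to the range $[2, |C_{\text{eff}}|]$ (the range on which Eq. (4) is stated) are all choices made here.

```python
import math


def sufficient_budget(c_eff: int, eps: float, e_hard: float) -> int:
    """Sufficient library size from Eq. (6): ceil(|C_eff| ** (1 - eps / e_hard)).

    c_eff  : size of the effective catalogue (active or full reachable)
    eps    : target per-hard-decision residual error, with 0 < eps < e_hard
    e_hard : baseline per-hard-decision error rate (caller-supplied assumption)
    """
    if c_eff < 2:
        raise ValueError("the log-head form needs |C_eff| >= 2")
    if not 0.0 < eps < e_hard:
        raise ValueError("Proposition 2 assumes 0 < eps < e_hard")
    raw = c_eff ** (1.0 - eps / e_hard)
    # Subtract a tiny tolerance so floating-point noise on exact powers
    # (e.g. 16 ** 0.5) does not bump the ceiling up by one.
    m = math.ceil(raw - 1e-9)
    # Eq. (4) is stated for m >= 2, and a library never needs more than |C_eff| entries.
    return max(2, min(m, c_eff))
```

For example, `sufficient_budget(17, 0.02, 0.1)` evaluates $\lceil 17^{0.8} \rceil$; the ratio $\varepsilon/e_{\text{hard}} = 0.2$ in that call is illustrative and does not come from the paper.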
#### Proof sketch\.
The library removes cumulative hard-error mass $F(m;|C|)$. The residual hard-token error rate is therefore $(1-F)\,e_{\text{hard}}$. Requiring this residual to be at most $\varepsilon$ rearranges, after substituting the log-coverage form, to the stated bound. The full step-by-step derivation, including the pre-cap and cap regimes, is in Appendix [A.2](https://arxiv.org/html/2605.30628#A1.SS2).
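The rearrangement in the proof sketch can be written out line by line. The steps below are a sketch of the algebra under the log-head form (4), in the regime $m \le |C_{\text{eff}}|$ where the $\min$ is inactive; they are not a substitute for the full derivation in Appendix A.2.

```latex
\begin{align*}
\bigl(1 - F_{\log}(m;|C_{\text{eff}}|)\bigr)\, e_{\text{hard}} &\le \varepsilon \\
1 - \frac{\ln m}{\ln |C_{\text{eff}}|} &\le \frac{\varepsilon}{e_{\text{hard}}} \\
\ln m &\ge \Bigl(1 - \frac{\varepsilon}{e_{\text{hard}}}\Bigr) \ln |C_{\text{eff}}| \\
m &\ge |C_{\text{eff}}|^{\,1 - \varepsilon/e_{\text{hard}}}
\end{align*}
```

As a worked instance, take the ErrorAtlas anchor size $|C_{\text{eff}}| = 17$ and an illustrative ratio $\varepsilon/e_{\text{hard}} = 0.2$ (an assumption made here, not a value from the paper). Then $17^{0.8} \approx 9.65$, so $m \ge 10$, which agrees with $F_{\log}(10;17) \approx 81.3\% \ge 80\%$.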
#### Engineering meaning\.
Once the patch is fixed, the problem changes\. The relevant engineering question is no longer whether arbitrary future failures exist, but how quickly the reachable local catalogue is discovered and how much hard\-error mass is removed by the top interventions\. The intervention budget for per\-hard\-token reliability is small and slowly growing, domain\-constant in the cap regime, with the exact size determined by the local discovery and rank\-coverage curves rather than by a universal prior\.
#### Sequence\-level caveat\.
Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) bounds residual error per hard decision. A one-shot sequence-level target is strictly more demanding because many hard decisions occur in one output: as $k$ grows, the allowable residual per hard decision shrinks, and the required library approaches full-catalogue coverage. In some regimes the non-hard-token error mass alone already exceeds the sequence-level budget, so hard-token interventions cannot meet the SLA by themselves. The full three-regime analysis (and the per-hard-token tolerance $\tau_{\text{seq}}$ that converts a sequence target into the Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) bound) is in Appendix [A.3](https://arxiv.org/html/2605.30628#A1.SS3).
A common engineering rule of thumb, “cover the local head with on the order of tens of interventions,” is best understood as a per-hard-decision planning prior, not a derived constant from the formal model. The per-hard-decision residual-error reduction implied by $F_{\log}$ at typical $|C|$ and head-heavy mass is large; the sequence-level analogue requires correspondingly more interventions and tighter coverage of the tail.
We call these statements propositions rather than theorems because their force is conditional: they formalise the consequences of the paper’s modelling assumptions rather than deriving a universal law of LLM reliability. The doubly-logarithmic rate is the optimistic special case under logarithmic active-mode exposure; Appendix [A.5](https://arxiv.org/html/2605.30628#A1.SS5) gives the Heaps and saturation variants. Appendix [A.4](https://arxiv.org/html/2605.30628#A1.SS4) expands on the conditional reading.
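The tightening described in the sequence-level caveat can be illustrated with a simple conversion. The sketch below assumes that the $k$ hard decisions in one output fail independently with a common residual rate and that non-hard-token error is ignored; both are simplifying assumptions made here, and the paper's own $\tau_{\text{seq}}$ in Appendix A.3 may be defined differently. The names `per_decision_tolerance` and `coverage_needed` are hypothetical.

```python
def per_decision_tolerance(delta_seq: float, k: int) -> float:
    """Largest per-hard-decision residual tau such that 1 - (1 - tau)**k <= delta_seq.

    Assumes independent, identically distributed residual failures across
    the k hard decisions (an assumption of this sketch).
    """
    if not 0.0 < delta_seq < 1.0 or k < 1:
        raise ValueError("need 0 < delta_seq < 1 and k >= 1")
    return 1.0 - (1.0 - delta_seq) ** (1.0 / k)


def coverage_needed(tau: float, e_hard: float) -> float:
    """Coverage F required so that (1 - F) * e_hard <= tau, floored at 0."""
    return max(0.0, 1.0 - tau / e_hard)


if __name__ == "__main__":
    # Illustrative inputs only: a 10% sequence-level failure budget and e_hard = 0.05.
    for k in (1, 10, 100, 1000):
        tau = per_decision_tolerance(0.10, k)
        print(f"k={k:5d}  tau={tau:.6f}  coverage needed={coverage_needed(tau, 0.05):.4f}")
```

Under these assumptions the required coverage rises toward 1 as $k$ grows, which is the qualitative behaviour the caveat describes.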
## 4 Empirical Evidence
We collect evidence for three load\-bearing claims: errors cluster into a small recurring set \(A\); each cluster is addressable by one targeted intervention \(B\); reliability decays sublinearly with output length \(C\)\.
#### Claim A: Failure\-mode clustering\.
The strongest single result is ErrorAtlas (Ashury-Tahan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib3)): 83 models $\times$ 35 datasets, $\gtrsim 10^{4}$ failures, 17 named categories with a long-tailed, head-concentrated prevalence ordering. Domain-specific Paretos reproduce the shape inside each domain: math (MWPES top-4 categories dominate per model with strong cross-model overlap (Sun et al., [2025](https://arxiv.org/html/2605.30628#bib.bib49))), code (AssertionError + NameError = 86.35% on HumanEval, replicated across 23 models (Wen et al., [2024](https://arxiv.org/html/2605.30628#bib.bib61))), multi-hop QA (a single dominant “missing-evidence” mode (Zhang et al., [2026](https://arxiv.org/html/2605.30628#bib.bib67))), agentic tool use (under 20 recurring modes across two studies (Cemri et al., [2025](https://arxiv.org/html/2605.30628#bib.bib4); Yao et al., [2024](https://arxiv.org/html/2605.30628#bib.bib64))), and RAG (a handful of distinguishable hallucination types (Wood and Forbes, [2024](https://arxiv.org/html/2605.30628#bib.bib62))). The catalogue is also *stable* across models: ErrorAtlas reports a fixed category hierarchy across 83 models; RFMDataset (Guo et al., [2025](https://arxiv.org/html/2605.30628#bib.bib19)) finds “strikingly similar failure mode distributions” across ten advanced reasoning models; EDIT (Dai et al., [2025](https://arxiv.org/html/2605.30628#bib.bib8)) measures a $\approx 4.7\%$ key-step fraction that holds model to model. Schaeffer et al. ([2025](https://arxiv.org/html/2605.30628#bib.bib45)) provide the closest theoretical anchor, proving that the observed aggregate power-law scaling of LLM evaluations requires the per-problem success-rate distribution to be heavy-tailed near $p=0$, structurally related to Postulate 1.
#### Claim B: Cluster\-selective capability interventions\.
A dedicated harvest yields 28 capability-elimination citations across six independent axes (arithmetic, code execution, format/structure, perception/grounding, knowledge/RAG, verification), each axis independently confirmed by between three and nine citations. The full table and the three structural patterns (Pattern A by-construction, B strong-empirical-with-class-shift, C moderate-with-shift) appear in Appendix [B](https://arxiv.org/html/2605.30628#A2).
Arithmetic is the cleanest case: PAL (Gao et al., [2023a](https://arxiv.org/html/2605.30628#bib.bib13)) lifts GSM-Hard from 20.1% to 61.5%, and the residuals are problem-comprehension errors rather than arithmetic ones. Format violations admit something stronger still: constrained decoding zeroes invalid-token probability by construction (Suresh et al., [2025](https://arxiv.org/html/2605.30628#bib.bib50); Wang et al., [2025b](https://arxiv.org/html/2605.30628#bib.bib56); Dong et al., [2025](https://arxiv.org/html/2605.30628#bib.bib10)). Code-logic errors yield to execution feedback (Reflexion + AgentCoder push HumanEval pass@1 from 80% to 96.3%, with residuals in spec-misinterpretation (Shinn et al., [2023](https://arxiv.org/html/2605.30628#bib.bib48); Huang et al., [2023](https://arxiv.org/html/2605.30628#bib.bib21))); reasoning steps to process supervision (Math-Shepherd: Mistral-7B MATH 28.6% $\rightarrow$ 43.5% (Wang et al., [2024b](https://arxiv.org/html/2605.30628#bib.bib58))); hallucinations to RAG, where Acurai (Wood and Forbes, [2024](https://arxiv.org/html/2605.30628#bib.bib62)) reports 100% elimination on RAGTruth, a benchmark-conditional empirical result rather than a general by-construction guarantee. Two further clusters complete the picture: POROver lifts the not-overrefused rate from 57.6% to 82.1% (Karaman et al., [2024](https://arxiv.org/html/2605.30628#bib.bib23)); SAGE-Agent, targeting tool-call clarification, raises When2Call accuracy from 36.5% to 65.2% (Suri et al., [2025](https://arxiv.org/html/2605.30628#bib.bib51)).
The class-shift signature, in which post-intervention residuals belong to structurally different classes, is the empirical content of the cluster-selectivity property that underwrites Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2)’s composition step. Capabilities are coarser than error classes: a single Python interpreter removes execution-error components from five named clusters; constrained decoding eliminates format and the structural component of “missing required element” jointly. A negative control sharpens the selectivity claim. DebugBench (Tian et al., [2024](https://arxiv.org/html/2605.30628#bib.bib52)) reports that execution feedback works for syntax and reference errors and is explicitly “unhelpful for logic errors,” exactly where the framework predicts capability provisioning fails.
#### Claim C: Sublinear length scaling\.
Direct evidence for sublinear (rather than exponential) decay shows up across very different measurement designs. Loong (Wang et al., [2024a](https://arxiv.org/html/2605.30628#bib.bib57)) reports GPT-4o on Chain-of-Reasoning declining $81.6\%\rightarrow 32.9\%$ across the 10K $\rightarrow$ >200K context bins; the decline is log-linear over $\log L$ and is not well explained by smooth independent per-token exponential decay. GSM-$\infty$ (Zhou et al., [2025](https://arxiv.org/html/2605.30628#bib.bib68)) attacks the independent-error model from a different angle: exponentially more inference compute yields only *linear* AUC gains, and DeepSeek-R1 maintains $10\%+$ accuracy at 130 reasoning operations where $(1-\varepsilon)^{130}$ for any reasonable $\varepsilon$ predicts near-zero. Press et al. ([2023](https://arxiv.org/html/2605.30628#bib.bib43)) find a $\approx 40\%$ compositionality gap that is approximately *constant* across the GPT-3 family, direct evidence against per-step independent compounding (which would predict the gap to grow with chain length).
Structural correlates of the same phenomenon appear elsewhere. Pang et al. ([2024](https://arxiv.org/html/2605.30628#bib.bib40))’s Anchor LLMs achieve $\approx 99\%$ K/V reduction with only minor accuracy compromise: the effective number of distinct attention-key vectors a model needs is far smaller than $n$. Think-Prune-Train (Costello et al., [2025](https://arxiv.org/html/2605.30628#bib.bib7)) raises Pass@1 while Pass@20 plateaus, the manifold-transition signature. RULER (Hsieh et al., [2024](https://arxiv.org/html/2605.30628#bib.bib20)) finds threshold behaviour inconsistent with smooth $(1-\varepsilon)^{n}$. METR (Kwa et al., [2025](https://arxiv.org/html/2605.30628#bib.bib26)) fits logistic-in-log-length, which is mathematically sublinear in length itself.
Self-consistency (Wang et al., [2023](https://arxiv.org/html/2605.30628#bib.bib59)) is consistent with but does not *prove* clustered structure: Condorcet aggregation of iid Bernoulli chains can yield similar gain magnitudes, so we treat it as suggestive rather than diagnostic. Prominent steep-decay counter-evidence (Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11); Kuratov et al., [2024](https://arxiv.org/html/2605.30628#bib.bib25); Wan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib54); Karpinska et al., [2024](https://arxiv.org/html/2605.30628#bib.bib24)) decays over variables distinct from raw token length; per-paper re-audits showing the relocation pattern are in Appendix [C](https://arxiv.org/html/2605.30628#A3).
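The independent-error baseline invoked in the GSM-$\infty$ comparison is easy to evaluate directly. The sketch below is illustrative: the function names and the sample $\varepsilon$ values are assumptions made here, and the text does not state which values of $\varepsilon$ it regards as reasonable.

```python
def independent_success(eps: float, steps: int) -> float:
    """Success probability if each of `steps` operations fails independently at rate eps."""
    return (1.0 - eps) ** steps


def max_per_step_error(target: float, steps: int) -> float:
    """Largest independent per-step error rate that still leaves `target` overall success."""
    return 1.0 - target ** (1.0 / steps)


if __name__ == "__main__":
    # Sample per-operation error rates; illustrative, not taken from the cited papers.
    for eps in (0.01, 0.02, 0.05):
        print(f"eps={eps:.2f}: (1 - eps)^130 = {independent_success(eps, 130):.4f}")
    print(f"eps implied by 10% success at 130 steps: {max_per_step_error(0.10, 130):.4f}")
```

Under this baseline, 10% accuracy at 130 operations corresponds to a per-operation error rate of roughly 1.8%; larger per-operation rates push the independent-error prediction toward zero, so the strength of the comparison depends on the per-operation rate one takes as the baseline.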
## 5 Practical Implications
#### Reliability engineering is local patch coverage\.
Within a fixed domain patch, Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) makes reliability a small-catalogue engineering problem rather than an asymptotic scaling problem. A team should initially budget for a library on the order of tens of interventions, then refine from the local mode-discovery curve $C_{\text{seen},D}(T)$ and the empirical rank-frequency distribution. As a rule of thumb consistent with the head-mass figures in ErrorAtlas, HumanEval, and MWPES, $\approx 50$ named interventions cover the bulk of the per-hard-token failure distribution in many measured domains. This is a planning prior calibrated to current taxonomies, not a universal constant; the local mode-discovery curve sets the actual library size for any given deployment. The same base model in cardiology RAG, legal contract drafting, and code review yields three different libraries because $C_{D}$, $A_{D}$, $\sigma_{D}$ all change with the deployment domain.
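Both quantities named above, the mode-discovery curve and the empirical rank-frequency coverage, can be computed from a time-ordered log of labelled failures. The following is a minimal sketch under the assumption that each failure event has already been assigned a mode label; the function names and the toy labels in the usage lines are hypothetical and carry no empirical weight.

```python
from collections import Counter
from typing import Hashable, Sequence


def discovery_curve(labels: Sequence[Hashable]) -> list[int]:
    """C_seen(T): number of distinct failure modes among the first T events."""
    seen: set[Hashable] = set()
    curve: list[int] = []
    for label in labels:
        seen.add(label)
        curve.append(len(seen))
    return curve


def ranked_coverage(labels: Sequence[Hashable]) -> list[float]:
    """F_emp(m) for m = 1..|C|: share of failures covered by the top-m modes."""
    counts = sorted(Counter(labels).values(), reverse=True)
    total = sum(counts)
    covered = 0
    out: list[float] = []
    for count in counts:
        covered += count
        out.append(covered / total)
    return out


def library_size_for(labels: Sequence[Hashable], target: float) -> int:
    """Smallest m whose top-m modes cover at least `target` of observed failures."""
    for m, coverage in enumerate(ranked_coverage(labels), start=1):
        if coverage >= target:
            return m
    raise ValueError("target coverage is not reachable from the observed labels")


if __name__ == "__main__":
    # Toy, made-up failure log for illustration only.
    log = ["format", "format", "arith", "format", "retrieval", "arith", "format", "schema"]
    print(discovery_curve(log))
    print(ranked_coverage(log))
    print(library_size_for(log, 0.75))
```

A discovery curve that is still climbing at the end of the log suggests the local catalogue has not yet saturated, in which case a library size read off `ranked_coverage` should be treated as provisional.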
#### Per\-hard\-token vs\. sequence\-level targets matter for SLA design\.
Per-hard-token residual error, the right metric for continuous-correction systems, has a polylog budget. Sequence-level failure probability, the right metric for one-shot systems, is strictly more demanding and approaches full-catalogue coverage as $k$ grows. Production deployments should design SLAs to match the actual cost structure, not read the per-token result as the production SLA.
#### Library design as engineering, not asymptotic theory\.
An intervention library can be assembled module-by-module, in any order, as long as modules target distinct failure classes. Layer-separated interventions (constrained decoding + retrieval + process supervision + tool call) compose approximately additively in our data; the Le ([2026](https://arxiv.org/html/2605.30628#bib.bib27)) caveat applies only to interventions sharing a prompt channel. Library construction is therefore highly parallelisable.
#### Capability libraries are coarser than error\-class libraries\.
Five named math classes (arithmetic, units, counting, list manipulation, date arithmetic) collapse under one Python interpreter; three code classes (SyntaxError, NameError, most TypeError) collapse under one execution-feedback loop; format violations and the structural component of “missing required element” (together more than $20\%$ of ErrorAtlas) collapse under one constrained-decoder deployment. A team budgeting for $\approx 50$ interventions per domain can expect to need fewer than 50 capability-axis interventions while covering $\approx 50$ named clusters. The six axes of Appendix [B](https://arxiv.org/html/2605.30628#A2) are closer to the right unit of engineering accounting than the twelve clusters.
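The collapse of named clusters onto a smaller number of capabilities can be illustrated as a set-cover computation. The sketch below uses a greedy heuristic, which is a standard approximation for set cover and is not claimed to be the procedure of §3.2. The capability-to-cluster mapping is transcribed from the paragraph above, with “most TypeError” simplified to the whole class; the identifiers are hypothetical.

```python
def greedy_capability_cover(
    capabilities: dict[str, set[str]], clusters: set[str]
) -> tuple[list[str], set[str]]:
    """Greedy set cover: repeatedly pick the capability that covers the most
    still-uncovered clusters. Returns (chosen capabilities, clusters left uncovered)."""
    uncovered = set(clusters)
    chosen: list[str] = []
    if not capabilities:
        return chosen, uncovered
    while uncovered:
        best = max(capabilities, key=lambda name: len(capabilities[name] & uncovered))
        gain = capabilities[best] & uncovered
        if not gain:
            break  # no listed capability addresses the remaining clusters
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered


# Mapping transcribed from the text; "TypeError" stands in for "most TypeError".
CAPABILITIES = {
    "python_interpreter": {"arithmetic", "units", "counting", "list manipulation", "date arithmetic"},
    "execution_feedback": {"SyntaxError", "NameError", "TypeError"},
    "constrained_decoder": {"format violation", "missing required element (structural)"},
}

if __name__ == "__main__":
    all_clusters = set().union(*CAPABILITIES.values())
    chosen, uncovered = greedy_capability_cover(CAPABILITIES, all_clusters)
    print(f"{len(all_clusters)} named clusters covered by {len(chosen)} capabilities: {chosen}")
```

With this mapping, ten named clusters are covered by three capabilities, which is the sense in which capability libraries are coarser than error-class libraries.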
## 6 Discussion
#### By\-construction elimination as a boundary case\.
A subset of the capability-elimination harvest deserves separate accounting. Seven of the 28 citations (Appendix [B](https://arxiv.org/html/2605.30628#A2); constrained decoders (Suresh et al., [2025](https://arxiv.org/html/2605.30628#bib.bib50); Dong et al., [2025](https://arxiv.org/html/2605.30628#bib.bib10); Li et al., [2026](https://arxiv.org/html/2605.30628#bib.bib29); Zhang et al., [2023](https://arxiv.org/html/2605.30628#bib.bib66); OpenAI, [2024](https://arxiv.org/html/2605.30628#bib.bib38)), proof kernels (Ren et al., [2025](https://arxiv.org/html/2605.30628#bib.bib44)), static syntax checks) achieve residual error rate equal to zero *by mathematical construction*, not statistically. Constrained decoders set $P(\text{invalid token})=0$ at every generation step; the class of grammar-violating outputs is mathematically empty. In this regime the polylog bound of Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is loose: covered clusters contribute exactly zero to residual error, not a small positive quantity. The result strengthens, rather than refutes, the framework. And Pattern A applies precisely to structural/verifiable classes (format, syntax, schema, invalid proofs), which sit at the head of the heavy-tailed cluster frequency distribution: the strongest mechanism lands exactly where the catalogue is densest.
#### Counter\-evidence relocates, not dissolves\.
Five prominent steep-decay papers (Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11); Kuratov et al., [2024](https://arxiv.org/html/2605.30628#bib.bib25); Kwa et al., [2025](https://arxiv.org/html/2605.30628#bib.bib26); Wan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib54); Karpinska et al., [2024](https://arxiv.org/html/2605.30628#bib.bib24)) are routinely cited as exponential-compounding evidence. Every one decays over a variable distinct from raw token length: compositional graph size (Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11)), number of supporting facts (Kuratov et al., [2024](https://arxiv.org/html/2605.30628#bib.bib25)), log-time horizon (Kwa et al., [2025](https://arxiv.org/html/2605.30628#bib.bib26)), capacity threshold (Wan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib54)), evidence scope (Karpinska et al., [2024](https://arxiv.org/html/2605.30628#bib.bib24)). The apparent rapid decay is, in each case, in a quantity our framework already concentrates the action in ($k_{\text{hard}}$ and $|C|$), not in raw sequence length: a relocation, not a dissolution. These regimes remain hard; the framework’s value is directing intervention toward capability provisioning along the actual decay axis rather than toward context-window expansion that does not help. Appendix [C](https://arxiv.org/html/2605.30628#A3) walks through each paper.
#### Limitations\.
Postulate 1 is empirical, not derived; a domain with genuinely Heaps-power-law mode discovery would invalidate the doubly-logarithmic special case while preserving the qualitative polylog conclusion. The coverage form of §[3.3](https://arxiv.org/html/2605.30628#S3.SS3) is an empirical best-fit; readers should not interpret the numerical match against anchors as theoretical confirmation. Inter-cluster additivity is approximate; Le ([2026](https://arxiv.org/html/2605.30628#bib.bib27))’s caveat on prompt-channel interference tightens the bound in shared-channel settings. Domain narrowness: Postulate 1’s empirical anchor rests on three taxonomies (ErrorAtlas, HumanEval, MWPES), all published 2025–2026 and covering general/code/math; the framework is untested on agentic workflows, long-running scientific reasoning, and multi-turn tool use over millions of tokens, precisely the regimes where $|C|$ may grow faster than logarithmically. The $\sigma\approx 1.85$ estimate is a single-point calibration, not a discovery-curve fit; a subsample-vs-distinct-modes plot has not been published for any LLM error taxonomy at this writing. $\beta$ is latent and we report no empirical range. Taxonomy granularity (the L2–L3 gap) is unmeasured. Patch-shift between deployments changes $A_{D}$, $\sigma_{D}$, $\beta_{D}$, $|C_{D}|$ simultaneously; libraries calibrated against one patch under-cover the next. Finally, the framework *relocates* the difficulty of long-context reliability rather than resolving it: where $k_{\text{hard}}$ grows with task length (adversarial compositional structure, multi-hop chains, long agent horizons), reliability remains hard. The contribution is to name the on-axis intervention, not to make those regimes easy.
A falsifiability test the framework passes: DebugBench (Tian et al., [2024](https://arxiv.org/html/2605.30628#bib.bib52)) explicitly catalogues which classes execution-feedback can and cannot repair, and the empirical split (works for syntax/reference; “unhelpful for logic errors”) is exactly what the framework’s selectivity claim predicts.
## 7 Conclusion
LLM reliability is often framed as an asymptotic scaling problem. Universal reliability is not a finite-library problem (Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1)): an unbounded domain that keeps producing intervention-distinguishable failures cannot be covered by any finite dictionary that guarantees mode-wise residual error below tolerance. The contribution of this paper is to show that, once an operationally bounded deployment patch is fixed, reliability becomes a local engineering problem. Prior work in this line (Arbuzov et al., [2025](https://arxiv.org/html/2605.30628#bib.bib1)) argued that the standard asymptotic framing rests on a false uniformity assumption: errors concentrate at $\approx 5$–$10\%$ of tokens. This paper takes the next step: within that sparse set, errors are not only concentrated but also *repetitive*, clustering into a finite catalogue whose size grows logarithmically with observed failures under the empirical postulate of §[3.2](https://arxiv.org/html/2605.30628#S3.SS2), or as a small power under the conservative Heaps alternative (Appendix [A.5](https://arxiv.org/html/2605.30628#A1.SS5)). The conditional consequence (Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2)) is that a sufficient per-hard-decision intervention budget scales polylogarithmically in sequence length within a fixed domain patch under the log active-mode-exposure assumption, and becomes a domain-constant once the patch ceiling $|C_{D}|$ is reached. Sequence-level reliability targets are strictly tighter. Available evidence is consistent with libraries on the order of tens of interventions covering the head of the per-hard-decision failure distribution in many fixed domains. The exact number is patch-indexed and should be estimated from local discovery and rank-coverage curves, not assumed from cross-domain priors.
This reframes the question from “can we bound the growth of $n$-token error?” to “have we catalogued enough failure modes *inside the deployment patch*?” The latter is finite and addressable once an operationally bounded patch has been fixed.
The same logarithmic discovery postulate carries an inverse interpretation (Corollary [1](https://arxiv.org/html/2605.30628#Thmcorollary1), Appendix [A.6](https://arxiv.org/html/2605.30628#A1.SS6)): the logarithmic upper bound cannot accommodate linearly more distinct tail modes without exponentially more observed hard-failure events. Combined with head-heavy failure mass, this explains why generic frontier post-training can face diminishing reliability returns in open-ended deployment settings. The high-mass head is discovered early, while the tail becomes increasingly expensive to find and contributes less marginal residual-error reduction. The framework therefore does not say frontier scaling is useless. It says frontier-only reliability is economically misaligned with fixed deployment patches: once $D$ is known, local adaptation, tools, validators, retrieval, constrained decoding, and process supervision can target recurring failure mass directly, where generic post-training would have to rediscover the same local repairs indirectly across an open-ended task universe.
The open empirical question this paper invites is direct measurement of the mode-rate constant $\sigma$ on new domains (agentic, scientific, code-in-production), and a corresponding sequence-level validation that measures $S_{\text{base}}$, $\beta$, and active-mode exposure directly in deployed systems. The conceptual follow-on goes in a different direction. This paper names the engineering object: a patch-local failure catalogue and the intervention budget that covers its head. It does not say how production systems actually accumulate and govern that library over time. That question, how the deployment-time scaffold (instructions, tools, retrieval, memory, orchestration, governance) learns to cover the local failure topology, is the subject of follow-on work in this line. Reliability engineering, in that frame, is governance of the scaffold rather than scaling of the weights.
## References
- Arbuzov et al. [2025] Mikhail L. Arbuzov, Sisong Bei, Ziwei Dong, Dmitri Kalaev, and Alexey Shvets. Beyond exponential decay: Rethinking error accumulation in large language models. *arXiv preprint arXiv:2505.24187v2*, 2025. Posted 06 May 2026; CC BY 4.0 license.
- Asai et al. [2024] Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi. Self-RAG: Learning to retrieve, generate, and critique through self-reflection. In *International Conference on Learning Representations (ICLR)*, 2024.
- Ashury-Tahan et al. [2026] Shir Ashury-Tahan, Yifan Mai, Elron Bandel, Michal Shmueli-Scheuer, and Leshem Choshen. ErrorMap and ErrorAtlas: Charting the failure landscape of large language models. *arXiv preprint*, 2026.
- Cemri et al. [2025] Mert Cemri, Melissa Z. Pan, Shuyi Yang, Lakshya A. Agrawal, Bhavya Chopra, Rishabh Tiwari, Kurt Keutzer, Aditya Parameswaran, Dan Klein, Kannan Ramchandran, Matei Zaharia, Joseph E. Gonzalez, and Ion Stoica. Why do multi-agent LLM systems fail? *arXiv preprint*, 2025. Introduces the MAST taxonomy (Multi-Agent System Failure Taxonomy).
- Chen et al. [2023] Wenhu Chen, Xueguang Ma, Xinyi Wang, and William W. Cohen. Program of thoughts prompting: Disentangling computation from reasoning for numerical reasoning tasks. *Transactions on Machine Learning Research (TMLR)*, 2023.
- Cheng et al. [2024] Kanzhi Cheng, Qiushi Sun, Yougang Chu, Fangzhi Xu, Yantao Li, Jianbing Zhang, and Zhiyong Wu. SeeClick: Harnessing GUI grounding for advanced visual GUI agents. In *Proceedings of ACL*, 2024.
- Costello et al. [2025] Caia Costello, Simon Guo, Anna Goldie, and Azalia Mirhoseini. Think, prune, train, improve: Scaling reasoning without scaling models. *arXiv preprint*, 2025.
- Dai et al. [2025] Chengwei Dai, Kun Li, Wei Zhou, and Songlin Hu. Capture the key in reasoning to enhance CoT distillation generalization. In *Proceedings of ACL*, 2025. Earlier arXiv version titled “Beyond Imitation: Learning Key Reasoning Steps from Dual Chain-of-Thoughts in Reasoning Distillation”.
- Dantart [2026] Alex Dantart. Reliability by design: Quantifying and eliminating fabrication risk in LLMs. From generative to consultative AI: A comparative analysis in the legal domain and lessons for high-stakes knowledge bases. *arXiv preprint*, 2026.
- Dong et al. [2025] Yixin Dong, Charlie F. Ruan, Yaxing Cai, Ruihang Lai, Ziyi Xu, Yilong Zhao, and Tianqi Chen. XGrammar: Flexible and efficient structured generation engine for large language models. In *Proceedings of MLSys*, 2025.
- Dziri et al. [2023] Nouha Dziri, Ximing Lu, Melanie Sclar, Xiang Lorraine Li, Liwei Jiang, Bill Yuchen Lin, Peter West, Chandra Bhagavatula, Ronan Le Bras, Jena D. Hwang, Soumya Sanyal, Sean Welleck, Xiang Ren, Allyson Ettinger, Zaid Harchaoui, and Yejin Choi. Faith and fate: Limits of transformers on compositionality. In *Advances in Neural Information Processing Systems*, volume 36, 2023.
- Fang et al. [2025] Lizhe Fang, Yifei Wang, Zhaoyang Liu, Chenyang Zhang, Stefanie Jegelka, Jianfeng Gao, Bolin Ding, and Yisen Wang. What is wrong with perplexity for long-context language modeling? In *International Conference on Learning Representations (ICLR)*, 2025.
- Gao et al. [2023a] Luyu Gao, Aman Madaan, Shuyan Zhou, Uri Alon, Pengfei Liu, Yiming Yang, Jamie Callan, and Graham Neubig. PAL: Program-aided language models. In *Proceedings of ICML*, 2023a.
- Gao et al. [2023b] Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. Enabling large language models to generate text with citations. In *Proceedings of EMNLP*, 2023b. Introduces the ALCE benchmark.
- Goodell et al. [2025] Alex J. Goodell, Simon N. Chu, Dara Rouholiman, and Larry F. Chu. Large language model agents can use tools to perform clinical calculations. *npj Digital Medicine*, 8(1):163, 2025. doi:10.1038/s41746-025-01475-8.
- Gou et al. [2025] Boyu Gou, Ruohan Wang, Boyuan Zheng, Yanan Xie, Cheng Chang, Yiheng Shu, Huan Sun, and Yu Su. Navigating the digital world as humans do: Universal visual grounding for GUI agents. In *International Conference on Learning Representations (ICLR)*, 2025. ICLR 2025 Oral. Introduces the UGround model.
- Gou et al. [2024a] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Nan Duan, and Weizhu Chen. CRITIC: Large language models can self-correct with tool-interactive critiquing. In *International Conference on Learning Representations (ICLR)*, 2024a.
- Gou et al. [2024b] Zhibin Gou, Zhihong Shao, Yeyun Gong, Yelong Shen, Yujiu Yang, Minlie Huang, Nan Duan, and Weizhu Chen. ToRA: A tool-integrated reasoning agent for mathematical problem solving. In *International Conference on Learning Representations (ICLR)*, 2024b.
- Guo et al. [2025] Dadi Guo, Jiayu Liu, Zhiyuan Fan, Zhitao He, Haoran Li, Yuxin Li, Yumeng Wang, and Yi R. Fung. Mathematical proof as a litmus test: Revealing failure modes of advanced large reasoning models. *arXiv preprint*, 2025. Introduces the RFMDataset (Reveal Failure Modes).
- Hsieh et al. [2024] Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. RULER: What’s the real context size of your long-context language models? In *Conference on Language Modeling (COLM)*, 2024.
- Huang et al. [2023] Dong Huang, Qingwen Bu, Jie M. Zhang, Michael Luck, and Heming Cui. AgentCoder: Multi-agent-based code generation with iterative testing and optimisation. *arXiv preprint*, 2023.
- Kandpal et al. [2023] Nikhil Kandpal, Haikang Deng, Adam Roberts, Eric Wallace, and Colin Raffel. Large language models struggle to learn long-tail knowledge. In *Proceedings of ICML*, 2023.
- Karaman et al. [2024] Batuhan K. Karaman, Ishmam Zabir, Alon Benhaim, Vishrav Chaudhary, Mert R. Sabuncu, and Xia Song. POROver: Improving safety and reducing overrefusal in large language models with overgeneration and preference optimization. *arXiv preprint*, 2024.
- Karpinska et al. [2024] Marzena Karpinska, Katherine Thai, Kyle Lo, Tanya Goyal, and Mohit Iyyer. One thousand and one pairs: A “novel” challenge for long-context language models. In *Proceedings of EMNLP*, 2024.
- Kuratov et al. [2024] Yuri Kuratov, Aydar Bulatov, Pavel Anokhin, Ivan Rodkin, Dmitry Sorokin, Artyom Sorokin, and Mikhail Burtsev. BABILong: Testing the limits of LLMs with long context reasoning-in-a-haystack. In *NeurIPS Datasets and Benchmarks Track*, 2024.
- Kwa et al. [2025] Thomas Kwa, Ben West, Joel Becker, Amy Deng, Katharyn Garcia, Max Hasin, Sami Jawhar, Megan Kinniment, Nate Rush, Sydney Von Arx, Ryan Bloom, Thomas Broadley, Haoxing Du, Brian Goodrich, Nikola Jurkovic, Luke Harold Miles, Seraphina Nix, Tao Lin, Neev Parikh, David Rein, Lucas Jun Koba Sato, Hjalmar Wijk, Daniel M. Ziegler, Elizabeth Barnes, and Lawrence Chan. Measuring AI ability to complete long software tasks. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2025. METR technical report.
- Le [2026] Yifan Le. Schema key wording as an instruction channel in structured generation under constrained decoding. *arXiv preprint*, 2026.
- LeCun [2023] Yann LeCun. Auto-regressive LLMs are doomed. Tweet, March 26, 2023. [https://x.com/ylecun/status/1640122342570336267](https://x.com/ylecun/status/1640122342570336267), 2023.
- Li et al\. \[2026\]Linzhang Li, Yixin Dong, Guanjie Wang, Ziyi Xu, Alexander Jiang, and Tianqi Chen\.XGrammar\-2: Efficient dynamic structured generation engine for agentic LLMs\.*arXiv preprint*, 2026\.
- Li and Sarwate \[2025\]Xin Li and Anand D\. Sarwate\.Unraveling the localized latents: Learning stratified manifold structures in LLM embedding space with sparse mixture\-of\-experts\.*arXiv preprint*, 2025\.
- Li et al\. \[2022\]Yujia Li, David H\. Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, Thomas Hubert, Peter Choy, Cyprien de Masson d’Autume, Igor Babuschkin, Xinyun Chen, Po\-Sen Huang, Johannes Welbl, Sven Gowal, Alexey Cherepanov, James Molloy, Daniel J\. Mankowitz, Esme Sutherland Robson, Pushmeet Kohli, Nando de Freitas, Koray Kavukcuoglu, and Oriol Vinyals\.Competition\-level code generation with AlphaCode\.*Science*, 378\(6624\):1092–1097, 2022\.doi:10\.1126/science\.abq1158\.
- Lightman et al\. \[2024\]Hunter Lightman, Vineet Kosaraju, Yura Burda, Harri Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe\.Let’s verify step by step\.In*International Conference on Learning Representations \(ICLR\)*, 2024\.
- Liu et al\. \[2023\]Fangyu Liu, Julian Martin Eisenschlos, Francesco Piccinno, Syrine Krichene, Chenxi Pang, Kenton Lee, Mandar Joshi, Wenhu Chen, Nigel Collier, and Yasemin Altun\.DePlot: One\-shot visual language reasoning by plot\-to\-table translation\.In*Findings of ACL*, 2023\.
- Lu et al\. \[2024\]Yadong Lu, Jianwei Yang, Yelong Shen, and Ahmed Awadallah\.OmniParser for pure vision based GUI agent\.*arXiv preprint*, 2024\.
- Mallen et al\. \[2023\]Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi\.When not to trust language models: Investigating effectiveness of parametric and non\-parametric memories\.In*Proceedings of ACL*, 2023\.Introduces the PopQA benchmark\.
- Manning et al\. \[2008\]Christopher D\. Manning, Prabhakar Raghavan, and Hinrich Schütze\.*Introduction to Information Retrieval*\.Cambridge University Press, 2008\.Section on Heaps’ law:[https://nlp\.stanford\.edu/IR\-book/html/htmledition/heaps\-law\-estimating\-the\-number\-of\-terms\-1\.html](https://nlp.stanford.edu/IR-book/html/htmledition/heaps-law-estimating-the-number-of-terms-1.html)\.
- Niwa and Iso \[2024\]Ayana Niwa and Hayate Iso\.AmbigNLG: Addressing task ambiguity in instruction for NLG\.In*Proceedings of EMNLP*, 2024\.
- OpenAI \[2024\]OpenAI\.Introducing structured outputs in the API\.OpenAI Engineering Blog\.[https://openai\.com/index/introducing\-structured\-outputs\-in\-the\-api/](https://openai.com/index/introducing-structured-outputs-in-the-api/), 2024\.Published August 6, 2024\.
- OpenAI \[2025\]OpenAI\.GPT\-5 system card\.[https://cdn\.openai\.com/gpt\-5\-system\-card\.pdf](https://cdn.openai.com/gpt-5-system-card.pdf), 2025\.Published August 7, 2025\.
- Pang et al\. \[2024\]Jianhui Pang, Fanghua Ye, Derek F\. Wong, Xun He, Wenxiang Chen, and Longyue Wang\.Anchor\-based large language models\.In*Findings of ACL*, 2024\.
- Park et al\. \[2024\]Kiho Park, Yo Joong Choe, and Victor Veitch\.The linear representation hypothesis and the geometry of large language models\.In*Proceedings of ICML*, 2024\.
- Patel et al\. \[2026\]Khush Patel, Siva Surendira, Jithin George, and Shreyas Kapale\.The six sigma agent: Achieving enterprise\-grade reliability in LLM systems through consensus\-driven decomposed execution\.*arXiv preprint*, 2026\.
- Press et al\. \[2023\]Ofir Press, Muru Zhang, Sewon Min, Ludwig Schmidt, Noah A\. Smith, and Mike Lewis\.Measuring and narrowing the compositionality gap in language models\.In*Findings of EMNLP*, 2023\.
- Ren et al\. \[2025\]Z\. Z\. Ren, Zhihong Shao, Junxiao Song, Huajian Xin, Haocheng Wang, Wanjia Zhao, Liyue Zhang, Zhe Fu, Qihao Zhu, Dejian Yang, Z\. F\. Wu, Zhibin Gou, Shirong Ma, Hongxuan Tang, Yuxuan Liu, Wenjun Gao, Daya Guo, and Chong Ruan\.DeepSeek\-Prover\-V2: Advancing formal mathematical reasoning via reinforcement learning for subgoal decomposition\.*arXiv preprint*, 2025\.
- Schaeffer et al\. \[2025\]Rylan Schaeffer, Joshua Kazdan, John Hughes, Jordan Juravsky, Sara Price, Aengus Lynch, Erik Jones, Robert Kirk, Azalia Mirhoseini, and Sanmi Koyejo\.How do large language monkeys get their power \(laws\)?In*International Conference on Machine Learning \(ICML\)*, 2025\.
- Shang et al\. \[2024\]Yu Shang, Yu Li, Keyu Zhao, Likai Ma, Jiahe Liu, Fengli Xu, and Yong Li\.AgentSquare: Automatic LLM agent search in modular design space\.*arXiv preprint*, 2024\.
- Shi et al\. \[2024\]Yuling Shi, Songsong Wang, Chengcheng Wan, Min Wang, and Xiaodong Gu\.From code to correctness: Closing the last mile of code generation with hierarchical debugging\.*arXiv preprint*, 2024\.Introduces the MGDebugger method\.
- Shinn et al\. \[2023\]Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao\.Reflexion: Language agents with verbal reinforcement learning\.In*Advances in Neural Information Processing Systems*, volume 36, 2023\.
- Sun et al\. \[2025\]Yuhong Sun, Zhangyue Yin, Xuanjing Huang, Xipeng Qiu, and Hui Zhao\.Error classification of large language models on math word problems: A dynamically adaptive framework\.In*arXiv preprint*, 2025\.Introduces the MWPES\-300K dataset \(304,865 error samples from 15 LLMs across four MWP datasets\)\.
- Suresh et al\. \[2025\]Tarun Suresh, Debangshu Banerjee, Shubham Ugare, Sasa Misailovic, and Gagandeep Singh\.DINGO: Constrained inference for diffusion LLMs\.*arXiv preprint*, 2025\.
- Suri et al\. \[2025\]Manan Suri, Puneet Mathur, Nedim Lipka, Franck Dernoncourt, Ryan A\. Rossi, and Dinesh Manocha\.Structured uncertainty guided clarification for LLM agents\.*arXiv preprint*, 2025\.Introduces the SAGE\-Agent method\.
- Tian et al\. \[2024\]Runchu Tian, Yining Ye, Yujia Qin, Xin Cong, Yankai Lin, Yinxu Pan, Yesai Wu, Haotian Hui, Weichuan Liu, Zhiyuan Liu, and Maosong Sun\.DebugBench: Evaluating debugging capability of large language models\.In*Findings of ACL*, 2024\.
- Wada et al\. \[2025\]Akihiko Wada, Yuya Tanaka, Mitsuo Nishizawa, Akira Yamamoto, Toshiaki Akashi, Akifumi Hagiwara, Yayoi Hayakawa, Junko Kikuta, Keigo Shimoji, Katsuhiro Sano, Koji Kamagata, Atsushi Nakanishi, and Shigeki Aoki\.Retrieval\-augmented generation elevates local LLM quality in radiology contrast media consultation\.*npj Digital Medicine*, 8\(1\):395, 2025\.doi:10\.1038/s41746\-025\-01802\-z\.
- Wan et al\. \[2026\]Kaiyang Wan, Lang Gao, Honglin Mu, Preslav Nakov, Yuxia Wang, and Xiuying Chen\.A Fano\-style accuracy upper bound for LLM single\-pass reasoning in multi\-hop QA\.In*International Conference on Learning Representations \(ICLR\)*, 2026\.
- Wang et al\. \[2025a\]Benlu Wang, Iris Xia, Yifan Zhang, Junda Wang, Feiyun Ouyang, Shuo Han, Arman Cohan, Hong Yu, and Zonghai Yao\.From scores to steps: Diagnosing and improving LLM performance in evidence\-based medical calculations\.In*Proceedings of EMNLP*, 2025a\.Introduces the MedRaC benchmark\.
- Wang et al\. \[2025b\]Darren Yow\-Bang Wang, Zhengyuan Shen, Soumya Smruti Mishra, Zhichao Xu, Yifei Teng, and Haibo Ding\.SLOT: Structuring the output of large language models\.*arXiv preprint*, 2025b\.SLOT = Structured LLM Output Transformer\.
- Wang et al\. \[2024a\]Minzheng Wang, Longze Chen, Cheng Fu, Shengyi Liao, Xinghua Zhang, Bingli Wu, Haiyang Yu, Nan Xu, Lei Zhang, Run Luo, Yunshui Li, Min Yang, Fei Huang, and Yongbin Li\.Leave no document behind: Benchmarking long\-context LLMs with extended multi\-doc QA\.In*Proceedings of EMNLP*, 2024a\.
- Wang et al\. \[2024b\]Peiyi Wang, Lei Li, Zhihong Shao, R\. X\. Xu, Damai Dai, Yifei Li, Deli Chen, Y\. Wu, and Zhifang Sui\.Math\-Shepherd: Verify and reinforce LLMs step\-by\-step without human annotations\.In*Proceedings of ACL*, 2024b\.
- Wang et al\. \[2023\]Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc Le, Ed Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou\.Self\-consistency improves chain of thought reasoning in language models\.In*International Conference on Learning Representations \(ICLR\)*, 2023\.
- Wang et al\. \[2024c\]Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, Tianle Li, Max Ku, Kai Wang, Alex Zhuang, Rongqi Fan, Xiang Yue, and Wenhu Chen\.MMLU\-Pro: A more robust and challenging multi\-task language understanding benchmark\.In*NeurIPS Datasets and Benchmarks Track \(Spotlight\)*, 2024c\.
- Wen et al\. \[2024\]Hao Wen, Yueheng Zhu, Chao Liu, Xiaoxue Ren, Weiwei Du, and Meng Yan\.Fixing function\-level code generation errors for foundation large language models\.*arXiv preprint*, 2024\.Introduces the LlmFix method\. Reports AssertionError 63\.64% \+ NameError 22\.71% = 86\.35% of HumanEval failures across 14 LLMs\.
- Wood and Forbes \[2024\]Michael C\. Wood and Adam A\. Forbes\.100% elimination of hallucinations on RAGTruth for GPT\-4 and GPT\-3\.5 Turbo\.*arXiv preprint*, 2024\.Introduces the Acurai method\.
- Xie et al\. \[2025\]Tianbao Xie, Jiaqi Deng, Xiaochuan Li, Junlin Yang, Haoyuan Wu, Jixuan Chen, Wenjing Hu, Xinyuan Wang, Yuhui Xu, Zekun Wang, Yiheng Xu, Junli Wang, Doyen Sahoo, Tao Yu, and Caiming Xiong\.Scaling computer\-use grounding via user interface decomposition and synthesis\.In*Advances in Neural Information Processing Systems \(NeurIPS\)*, 2025\.NeurIPS 2025 Spotlight\. Introduces the Jedi grounding dataset and OSWorld\-G benchmark\.
- Yao et al\. \[2024\]Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan\.τ\\tau\-bench: A benchmark for tool\-agent\-user interaction in real\-world domains\.*arXiv preprint*, 2024\.
- Zakka et al\. \[2024\]Cyril Zakka, Rohan Shad, Akash Chaurasia, Alex R\. Dalal, Jennifer L\. Kim, Michael Moor, Robyn Fong, Curran Phillips, Kevin Alexander, Euan Ashley, Jack Boyd, Kathleen Boyd, Karen Hirsch, Curtis Langlotz, Rita Lee, Joanna Melia, Joanna Nelson, Karim Sallam, Stacey Tullis, Melissa Ann Vogelsong, John Patrick Cunningham, and William Hiesinger\.Almanac — retrieval\-augmented language models for clinical medicine\.*NEJM AI*, 1\(2\):AIoa2300068, 2024\.doi:10\.1056/AIoa2300068\.
- Zhang et al\. \[2023\]Kexun Zhang, Hongqiao Chen, Lei Li, and William Wang\.Don’t fine\-tune, decode: Syntax error\-free tool use via constrained decoding\.*arXiv preprint*, 2023\.Introduces the ToolDec method\.
- Zhang et al\. \[2026\]Meiru Zhang, Zaiqiao Meng, and Nigel Collier\.Failure modes in multi\-hop QA: The weakest link effect and the recognition bottleneck\.*arXiv preprint*, 2026\.
- Zhou et al\. \[2025\]Yang Zhou, Hongyi Liu, Zhuoyan Chen, Yuandong Tian, and Beidi Chen\.GSM\-Infinite: How do your LLMs behave over infinitely increasing context length and reasoning complexity?*arXiv preprint*, 2025\.
## Appendix A Formal Proofs, Derivations, and Sensitivity Analysis
### A.1 Proof of Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1): No Universal Finite Intervention Dictionary
Fix a residual-error tolerance $\varepsilon$. Say that an intervention *covers* a failure mode if it reduces the residual error of that mode below $\varepsilon$. Coverage is mode-level: an intervention that covers a mode covers every failure event in that mode. The two conditions “$D$ is intervention-unbounded” and “$|C_D^{\varepsilon}| = \infty$” are equivalent under mode-level coverage; an unbounded witnessing sequence is constructed by taking one representative per mode, and an infinite catalogue forces the existence of such a sequence.
Let $D$ be a domain. Call $D$ *intervention-unbounded* if it contains an infinite sequence of failures $f_1, f_2, f_3, \ldots$ such that each new $f_j$ is not covered by any finite intervention dictionary that covers all earlier failures $\{f_1, \ldots, f_{j-1}\}$.
#### Step 1: assume the opposite\.
Suppose, for contradiction, that $C_D^{\varepsilon}$ is finite. Then there are only finitely many intervention-distinguishable failure modes in $D$. Write them as $C_D^{\varepsilon} = \{c_1, \ldots, c_M\}$ for some finite $M$.
#### Step 2: what finiteness means\.
If there are only $M$ intervention-distinguishable modes, then the infinite sequence $f_1, f_2, f_3, \ldots$ can introduce at most $M$ new modes, so beyond some index every later failure must belong to one of the already-seen modes.
#### Step 3: same mode means same intervention class\.
Modes are defined at intervention resolution $\varepsilon$. If a later failure belongs to the same mode as an earlier failure, then the intervention dictionary that covers the earlier representative of that mode also covers the later failure below residual tolerance $\varepsilon$.
#### Step 4: contradiction\.
Intervention-unboundedness says exactly the opposite: each new $f_j$ is not covered by any finite dictionary that covers $\{f_1, \ldots, f_{j-1}\}$. Therefore $f_j$ cannot belong to any earlier intervention mode, so each $f_j$ introduces a new intervention-distinguishable mode. The sequence $f_1, f_2, f_3, \ldots$ induces infinitely many such modes, contradicting the assumption that $C_D^{\varepsilon}$ is finite. Hence $|C_D^{\varepsilon}| = \infty$.
#### Step 5: no finite dictionary covers every mode of the domain\.
Suppose, again for contradiction, that some finite intervention dictionary $\mathcal{I}$ covers every mode of $D$. Then $\mathcal{I}$ covers every finite prefix $\{f_1, \ldots, f_{j-1}\}$ for every $j$. By intervention-unboundedness, any dictionary covering that prefix fails to cover $f_j$. Therefore $\mathcal{I}$ does not cover $f_j$, contradicting the assumption that $\mathcal{I}$ covers every mode of $D$. ∎
### A.2 Derivation of Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2): Patch-Local Sufficient Intervention Budget
#### Conditions used\.
Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) gives a sufficient, model-implied budget under the following four assumptions:
1. Coverage model. The cumulative hard-error mass covered by the top $m$ modes is approximated by the log-head form $F_{\log}(m; |C|) = \min\!\left(1,\ \frac{\ln m}{\ln |C|}\right)$ on the declared domain $m \geq 2$, $|C| \geq 2$. We write $F$ for $F_{\log}$ throughout this appendix unless otherwise noted.
2. Non-trivial, attainable target. The residual target satisfies $\varepsilon \in (0, e_{\text{hard}})$. If $\varepsilon \geq e_{\text{hard}}$ no intervention is needed; if $\varepsilon \leq 0$ the target is unattainable unless all residual hard-token error is eliminated.
3. Patch-local catalogue. The result applies only after a deployment patch $D$ has been fixed and its reachable catalogue is modelled as finite or effectively capped.
4. Sequence-length claim. The doubly-logarithmic rate further requires Assumption [2](https://arxiv.org/html/2605.30628#Thmdefinition2) and $k(n) = \Theta(\log n)$. Without these, Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) still gives a catalogue-size budget but not the same sequence-length scaling.
#### Step 1: define the target\.
Let $e_{\text{hard}}$ be the baseline hard-token error rate and $\varepsilon \in (0, e_{\text{hard}})$ the target after intervention.
#### Step 2: define the effective catalogue\.
For per-sequence scaling, set $C_{\text{eff}} = C_{\text{active},D}(n)$: how many failure modes can a single sequence of length $n$ activate? For full deployment-library budgeting, set $C_{\text{eff}} = C_D$: how large must the library be to cover the recurring failure modes reachable inside $D$?
#### Step 3: cumulative coverage\.
Let the ranked local failure modes have hard-error masses $p_1 \geq p_2 \geq \cdots \geq p_{|C_{\text{eff}}|}$ with $\sum_i p_i = 1$. A library covering the top $m$ modes removes cumulative hard-error mass $F(m; |C_{\text{eff}}|) = \sum_{i=1}^{m} p_i$, approximated by the log-coverage form of §[3.3](https://arxiv.org/html/2605.30628#S3.SS3).
#### Step 4: residual after intervention\.
The uncovered mass fraction is $1 - F(m; |C_{\text{eff}}|)$, so the residual per-hard-token error rate is $e_{\text{res}}(m) = \bigl(1 - F(m; |C_{\text{eff}}|)\bigr)\, e_{\text{hard}}$.
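As a rough illustration of Steps 3 and 4, the sketch below computes the log-head coverage $F_{\log}$, an empirical top-$m$ coverage from a list of per-mode masses, and the resulting residual rate. The function names and the sample masses are hypothetical and are not taken from the paper.

```python
import math


def log_head_coverage(m: int, catalogue_size: int) -> float:
    """Log-head form F_log(m; |C|) = min(1, ln m / ln |C|), for m >= 2 and |C| >= 2."""
    if m < 2 or catalogue_size < 2:
        raise ValueError("the log-head form is declared only for m >= 2 and |C| >= 2")
    return min(1.0, math.log(m) / math.log(catalogue_size))


def empirical_coverage(masses: list[float], m: int) -> float:
    """Share of hard-error mass carried by the m largest modes (masses are normalised here)."""
    ranked = sorted(masses, reverse=True)
    return sum(ranked[:m]) / sum(ranked)


def residual_rate(coverage: float, e_hard: float) -> float:
    """Residual per-hard-token error rate e_res = (1 - F) * e_hard."""
    return (1.0 - coverage) * e_hard


# Hypothetical head-heavy patch with five modes (illustrative numbers only).
masses = [0.50, 0.25, 0.15, 0.07, 0.03]
top2 = empirical_coverage(masses, 2)          # measured coverage of the top 2 modes
model2 = log_head_coverage(2, len(masses))    # log-head approximation for the same m
print(top2, model2, residual_rate(top2, e_hard=0.10))
```

Comparing `empirical_coverage` against `log_head_coverage` on measured masses is one simple way to check whether the coverage assumption is reasonable for a given patch.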
#### Step 5: impose the target\.
Requiring $e_{\text{res}}(m) \leq \varepsilon$ and dividing by $e_{\text{hard}} > 0$ gives

$$1 - F(m; |C_{\text{eff}}|) \;\leq\; \frac{\varepsilon}{e_{\text{hard}}}, \qquad \text{i.e.,} \qquad F(m; |C_{\text{eff}}|) \;\geq\; 1 - \frac{\varepsilon}{e_{\text{hard}}}.$$
#### Step 6: substitute the log\-coverage form\.
Since $\varepsilon < e_{\text{hard}}$, the required coverage $1 - \varepsilon/e_{\text{hard}} \in (0,1)$, so the cap in $F$ is inactive before saturation. Using $F = \ln m / \ln |C_{\text{eff}}|$:

$$\frac{\ln m}{\ln |C_{\text{eff}}|} \;\geq\; 1 - \frac{\varepsilon}{e_{\text{hard}}}.$$
#### Step 7: solve formm\.
Multiplying by $\ln |C_{\text{eff}}| > 0$ and exponentiating,

$$m \;\geq\; \exp\!\left[\left(1 - \frac{\varepsilon}{e_{\text{hard}}}\right) \ln |C_{\text{eff}}|\right] \;=\; |C_{\text{eff}}|^{1 - \varepsilon/e_{\text{hard}}}.$$

Taking the ceiling (since $m$ is integer-valued) yields $m \geq \lceil |C_{\text{eff}}|^{1 - \varepsilon/e_{\text{hard}}} \rceil$, the bound stated in Eq. ([6](https://arxiv.org/html/2605.30628#S3.E6)).
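The closed form of Step 7 is easy to evaluate directly. The following is a minimal sketch, assuming the log-head coverage model holds; the function name `sufficient_budget` and the example inputs are hypothetical.

```python
import math


def sufficient_budget(catalogue_size: int, eps: float, e_hard: float) -> int:
    """Sufficient library size m = ceil(|C_eff| ** (1 - eps / e_hard))."""
    if catalogue_size < 2:
        raise ValueError("the log-head form requires |C_eff| >= 2")
    if not 0.0 < eps < e_hard:
        raise ValueError("the target must satisfy 0 < eps < e_hard")
    exponent = 1.0 - eps / e_hard
    # Floating-point powers can land marginally above an exact integer,
    # in which case the ceiling errs on the conservative (larger) side.
    return math.ceil(catalogue_size ** exponent)


# Hypothetical patch: 17 modes, baseline hard-token error 10%, target 1%.
print(sufficient_budget(17, eps=0.01, e_hard=0.10))
```

A looser target lowers the exponent and shrinks the budget: halving the baseline error (`eps = e_hard / 2`) needs only about the square root of the catalogue size.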
#### Step 8: sequence\-length rate\.
For per-sequence scaling, set $C_{\text{eff}} = C_{\text{active},D}(n)$. Assumption [2](https://arxiv.org/html/2605.30628#Thmdefinition2) bounds $|C_{\text{active},D}(n)| \leq \min\bigl(A'_D + \sigma'_D \ln h(n),\, |C_D|\bigr)$. In the pre-cap regime, i.e., while $A'_D + \sigma'_D \ln h(n) < |C_D|$, the ceiling has not been reached and $|C_{\text{active},D}(n)| = O(\ln h(n))$. With $h(n) = \beta k(n)$ and $k(n) = \Theta(\log n)$, we get $h(n) = \Theta(\log n)$ and $\ln h(n) = \Theta(\log\log n)$, hence

$$|C_{\text{active},D}(n)| \;=\; O(\log\log n),$$

and substituting into the budget,

$$m \;=\; O\!\bigl((\log\log n)^{1 - \varepsilon/e_{\text{hard}}}\bigr).$$

This is the doubly-logarithmic pre-cap special case.
#### Step 9: cap regime\.
Once active-mode discovery saturates the patch catalogue, $|C_{\text{active},D}(n)| = |C_D|$, so $m \geq \lceil |C_D|^{1 - \varepsilon/e_{\text{hard}}} \rceil$, which no longer depends on $n$. The intervention budget is then domain-constant. ∎
### A.3 Sequence-Level Reliability Derivation
Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) gives a per-hard-token residual target. Production systems often care about a stricter target: the probability that the entire sequence is correct.
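The pre-cap and cap regimes of Steps 8 and 9 can be sketched numerically as follows. Every constant here ($A'_D$, $\sigma'_D$, $\beta$, the scale factor in $k(n)$, and $|C_D|$) is a hypothetical placeholder chosen only to make the two regimes visible; none is a measured value.

```python
import math


def active_mode_bound(n, a_prime, sigma_prime, beta, k_scale, patch_size):
    """min(A'_D + sigma'_D * ln h(n), |C_D|), with h(n) = beta * k(n) and k(n) = k_scale * ln n."""
    h = beta * k_scale * math.log(n)
    if h <= 1.0:
        raise ValueError("this sketch assumes h(n) > 1 so that ln h(n) is positive")
    return min(a_prime + sigma_prime * math.log(h), float(patch_size))


def budget_for_length(n, eps, e_hard, **patch):
    """Budget ceil(bound ** (1 - eps / e_hard)) evaluated at the active-mode bound."""
    exponent = 1.0 - eps / e_hard
    return math.ceil(active_mode_bound(n, **patch) ** exponent)


# Hypothetical patch parameters (assumptions for illustration only).
patch = dict(a_prime=5.0, sigma_prime=1.85, beta=0.5, k_scale=10.0, patch_size=17)
for n in (10**3, 10**6, 10**100, 10**200):
    print(n if n < 10**7 else "very large n", budget_for_length(n, 0.01, 0.10, **patch))
```

Under these placeholder values the bound only reaches the cap at astronomically large $n$, after which the budget stops changing; that slow approach is what the doubly-logarithmic rate describes.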
Starting from the composed reliability of Eq. ([1](https://arxiv.org/html/2605.30628#S3.E1)), and writing the post-intervention hard-token rate as $e_{\text{res}}$,

$$P(\text{correct}) \;=\; (1 - e_{\text{res}})^{\beta k}\,(1 - e_{\text{easy}})^{(1-\beta)k}\,(1 - e_{\text{non}})^{n-k}.$$

Define the non-hard-token survival factor

$$S_{\text{base}} \;=\; (1 - e_{\text{easy}})^{(1-\beta)k}\,(1 - e_{\text{non}})^{n-k}, \tag{7}$$

so that $P(\text{correct}) = (1 - e_{\text{res}})^{\beta k}\, S_{\text{base}}$. The sequence-level target $P(\text{correct}) \geq 1 - \varepsilon_{\text{seq}}$ becomes

$$(1 - e_{\text{res}})^{\beta k}\, S_{\text{base}} \;\geq\; 1 - \varepsilon_{\text{seq}}, \qquad \text{i.e.,} \qquad (1 - e_{\text{res}})^{\beta k} \;\geq\; \frac{1 - \varepsilon_{\text{seq}}}{S_{\text{base}}}.$$

Three regimes follow.
#### Regime \(i\): non\-hard\-token errors already violate the target\.
If $S_{\text{base}} < 1 - \varepsilon_{\text{seq}}$, then even setting $e_{\text{res}} = 0$ cannot meet the target, because the maximum possible survival after eliminating all hard-token failures is only $S_{\text{base}}$. Hard-token interventions alone cannot meet the sequence-level SLA. This is the route by which the exponential-in-$n$ concern re-enters and should be diagnosed before any catalogue-budgeting exercise.
#### Regime \(ii\): hard\-token residual error determines feasibility\.
If $S_{\text{base}} \geq 1 - \varepsilon_{\text{seq}}$, the target may be feasible. Taking the $(1/(\beta k))$-th power of the rearranged inequality and isolating $e_{\text{res}}$,

$$e_{\text{res}} \;\leq\; 1 - \left(\frac{1 - \varepsilon_{\text{seq}}}{S_{\text{base}}}\right)^{1/(\beta k)} \;=:\; \tau_{\text{seq}}.$$

Applying Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) with $\varepsilon$ replaced by $\tau_{\text{seq}}$,

$$m \;\geq\; \left\lceil |C_{\text{eff}}|^{1 - \tau_{\text{seq}}/e_{\text{hard}}} \right\rceil.$$

As the sequence-level target becomes stricter ($\varepsilon_{\text{seq}}$ shrinks toward the feasibility boundary $1 - S_{\text{base}}$), $\tau_{\text{seq}} \to 0$ and the exponent $1 - \tau_{\text{seq}}/e_{\text{hard}} \to 1$, so $m \to |C_{\text{eff}}|$. Strict one-shot sequence-level reliability pushes the system toward full-catalogue coverage.
#### Regime \(iii\): baseline hard\-token error is already acceptable\.
If $\tau_{\text{seq}} \geq e_{\text{hard}}$, the baseline hard-token rate already satisfies the sequence target (no intervention is required because $e_{\text{res}}(0) = e_{\text{hard}}$ from Eq. ([5](https://arxiv.org/html/2605.30628#S3.E5))).
#### Conclusion\.
Per\-hard\-token reliability is easier than sequence\-level reliability\. A library that gives a large reduction in residual hard\-token error may still be insufficient for one\-shot sequence\-level guarantees when many hard decisions occur in a single output\. This is why the main paper treats the “tens of interventions” rule of thumb as a per\-hard\-decision planning prior, not a one\-shot sequence\-level SLA\.
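The three regimes can be turned into a small diagnostic. This is a sketch under the independence structure of the composed-reliability formula above; the function name, the regime labels, and all example rates are hypothetical.

```python
import math


def sequence_budget(catalogue_size, e_hard, e_easy, e_non, beta, k, n, eps_seq):
    """Classify the regime and return (label, budget) for a sequence-level target."""
    # Survival factor of all non-hard tokens, Eq. (7).
    s_base = (1.0 - e_easy) ** ((1.0 - beta) * k) * (1.0 - e_non) ** (n - k)
    target = 1.0 - eps_seq
    if s_base < target:
        # Regime (i): even e_res = 0 cannot reach the target.
        return ("infeasible", None)
    # Largest tolerable residual hard-token rate.
    tau_seq = 1.0 - (target / s_base) ** (1.0 / (beta * k))
    if tau_seq >= e_hard:
        # Regime (iii): the baseline already meets the target.
        return ("baseline_sufficient", 0)
    # Regime (ii): apply the budget formula with eps replaced by tau_seq.
    return ("intervene", math.ceil(catalogue_size ** (1.0 - tau_seq / e_hard)))


# Hypothetical example: 10 hard decisions and a 99% whole-sequence target.
print(sequence_budget(17, e_hard=0.10, e_easy=0.0, e_non=0.0,
                      beta=0.5, k=20, n=100, eps_seq=0.01))
```

With ten hard decisions and a 1% sequence-level error target, the tolerable residual rate is about 0.1% per hard token, and the required library is the full 17-mode catalogue, which matches the observation that strict one-shot targets push toward full coverage.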
### A.4 Why these are propositions rather than unconditional theorems
Proposition [1](https://arxiv.org/html/2605.30628#Thmproposition1) is a definitional impossibility result: once intervention-unboundedness is assumed, an infinite intervention-resolution catalogue follows. Its role is not to prove that every unbounded domain necessarily has infinite failure modes, but to show that open-ended domains cannot be assumed to admit finite dictionaries.
Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) is conditional engineering math. It does not prove that LLM failures universally obey logarithmic mode discovery. It proves that *if* a bounded patch has a finite or effectively capped reachable catalogue, *and if* cumulative intervention coverage follows the stated head-heavy form, *then* a sufficient per-hard-decision intervention budget grows slowly and becomes constant after catalogue saturation. It does not prove that the true minimal intervention library has the same scaling.
The empirical burden therefore lies not in the algebra but in measuring, for each deployment patch, the local discovery curve $C_{\text{seen},D}(T)$, the per-sequence activation $C_{\text{active},D}(n)$, the cumulative coverage $F(m; |C_D|)$, the hard-token fraction $\beta_D$, and the baseline hard-token rate $e_{\text{hard}}$. This is why the paper frames the results as a reliability-engineering scaffold rather than as universal theorems about LLM behaviour.
### A.5 Heaps Power-Law Variant and Cluster-Count Sensitivity
A reader who prefers to derive cluster-count growth from standard Heaps’ law rather than from Postulate 1 obtains a qualitatively similar result. Let $|C|(k_{\text{hard}}) \approx K \cdot k_{\text{hard}}^{b}$ with $b \in (0,1)$. Canonical fits give $b \approx 0.5$ for natural-language vocabularies; failure-mode taxonomies plateau much more sharply (ErrorAtlas stabilises at $|C| = 17$ across $10^{4}{+}$ failures), implying a small effective $b \approx 0.24$–$0.32$ under a crude no-intercept endpoint read (not a fitted discovery exponent) in our setting. Composing with $k = \Theta(\log n)$, we get $|C| = O((\log n)^{b})$, and Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2) becomes

$$m = O\!\left((\log n)^{b \cdot (1 - \varepsilon/e_{\text{hard}})}\right).$$

This is still polylogarithmic in $n$ for any $b \in (0,1)$ and any $\varepsilon < e_{\text{hard}}$. The paper’s qualitative claim survives either choice of cluster-count law.
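The appendix does not spell out how the endpoint read is computed. One assumed reconstruction is to set $K = 1$ in the Heaps form and solve $|C| = T^{b}$ for $b$ at a single corpus scale $T$. The sketch below does exactly that; the function name is hypothetical, and the corpus scales are illustrative readings of “$10^{4}{+}$”.

```python
import math


def endpoint_heaps_exponent(num_modes: int, num_failures: float) -> float:
    """Crude no-intercept read: assume |C| = T ** b (K = 1), so b = ln|C| / ln T."""
    if num_modes < 2 or num_failures <= 1:
        raise ValueError("need at least 2 modes and more than 1 observed failure")
    return math.log(num_modes) / math.log(num_failures)


# 17 modes read off at two assumed corpus scales.
for scale in (1e4, 1e5):
    print(scale, round(endpoint_heaps_exponent(17, scale), 3))
```

Under this assumption the read falls inside the quoted range for corpus scales between roughly $10^{4}$ and $10^{5}$, which illustrates why a single endpoint cannot pin down a discovery exponent.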
#### Symbolic\-form sensitivity across candidate laws\.
The polylog conclusion depends on which cluster-count law one accepts; available evidence is consistent with multiple candidates because no subsample-discovery curve has been published for any LLM failure-mode taxonomy at this writing. We therefore report symbolic rates rather than fitted constants. With $h(n) = \beta k(n)$:
- Logarithmic: $|C_{\text{active},D}(n)| = O(\log h(n))$. If $k(n) = \Theta(\log n)$, then $m = O\bigl((\log\log n)^{1 - \varepsilon/e_{\text{hard}}}\bigr)$.
- Heaps: $|C_{\text{active},D}(n)| = O\bigl(h(n)^{b}\bigr)$ with $b \in (0,1)$. If $k(n) = \Theta(\log n)$, then $m = O\bigl((\log n)^{b\,(1 - \varepsilon/e_{\text{hard}})}\bigr)$.
- Saturating: $|C_{\text{active},D}(n)| \leq |C_D|$. Then $m = O\bigl(|C_D|^{1 - \varepsilon/e_{\text{hard}}}\bigr)$, constant in $n$ once the patch ceiling is reached.

The qualitative conclusion is robust under the $k(n) = \Theta(\log n)$ regime: $m$ grows more slowly than any positive power of $n$ under every candidate, and the directional claim (“a small library covers the head of the failure distribution in the per-hard-token regime”) survives. Only the rate shifts: doubly-logarithmic under logarithmic discovery, $(\log n)^{b}$ with small $b$ under Heaps, constant in the cap regime. The doubly-logarithmic rate is the optimistic special case. If $k(n)$ grows as a positive power of $n$, the Heaps variant inherits that power and the polylog-in-$n$ language fails along that axis; the framework’s intervention prescription still applies, but its asymptotic-rate framing does not.
Numerical constants require a measured discovery curve $C_{\text{seen},D}(T)$ or $C_{\text{active},D}(n)$. Existing taxonomies provide endpoint category counts at a single corpus scale (ErrorAtlas at $|C| = 17$ for $\approx 10^{4}$ failures), not discovery curves. The subsample-discovery measurement remains the explicit empirical test that would either tighten the postulate or force a fall-back to the Heaps variant. Until that measurement exists, the framework’s headline rate should be read as *polylogarithmic in the pre-cap regime, domain-constant in the cap regime*, with the specific exponent flagged as a falsifiability test rather than a fitted prediction.
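To see how slowly each candidate rate moves with $n$, the sketch below evaluates the three symbolic forms with every hidden constant set to 1. That normalisation, together with the values of $\beta$, the scale factor in $k(n)$, $b$, and $|C_D|$, is an assumption made purely for illustration, so only the trend across $n$ is meaningful, not the absolute numbers.

```python
import math


def budget_rates(n, beta, k_scale, b, patch_size, eps, e_hard):
    """Symbolic budget rates under the three candidate laws, all constants set to 1."""
    exponent = 1.0 - eps / e_hard
    h = beta * k_scale * math.log(n)   # h(n) = beta * k(n), with k(n) = k_scale * ln n
    if h <= 1.0:
        raise ValueError("this sketch assumes h(n) > 1")
    return {
        "logarithmic": math.log(h) ** exponent,
        "heaps": (h ** b) ** exponent,
        "saturating": float(patch_size) ** exponent,
    }


# Hypothetical parameters; n spans six orders of magnitude.
for n in (10**6, 10**12):
    print(n, budget_rates(n, beta=0.5, k_scale=10.0, b=0.3,
                          patch_size=17, eps=0.01, e_hard=0.10))
```

Squaring $n$ changes the logarithmic and Heaps rates only modestly and leaves the saturating rate untouched, which is the directional point of the sensitivity analysis.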
### A.6 Inverse Discovery Cost
Corollary [1](https://arxiv.org/html/2605.30628#Thmcorollary1) in the main text inverts the upper-bound discovery postulate into a sample-budget lower bound on novel-mode discovery. This appendix gives the algebra, the numerical anchors, the tightness assumption that converts the lower bound into an approximate inverse cost, sensitivity to the Heaps cluster-count alternative of §[A.5](https://arxiv.org/html/2605.30628#A1.SS5), the saturation regime, and a separate mode-mediated gain sub-corollary that connects discovery cost to broad capability proxies.
#### Setup\.
Let $q(T) = |C_{\text{seen},D}(T)|$ be the number of distinct failure modes discovered in patch $D$ after $T$ observed hard-failure events. Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1) is an upper bound,

$$q(T) \;\leq\; A_D + \sigma_D \ln T, \qquad \sigma_D > 0,$$

on the discovered catalogue under the logarithmic upper-bound assumption.
#### Inversion \(lower bound onTT\)\.
The right-hand side of the upper-bound postulate is monotonically increasing in $T$. Inverting it yields a *lower bound* on the $T$ required for the cap to accommodate $q$ discovered modes: if $q>A_{D}$ distinct modes have been discovered after $T$ events, then

$$T\;\geq\;\exp\!\left(\frac{q-A_{D}}{\sigma_{D}}\right).$$

This is the rigorous direction of the corollary. The cap cannot accommodate $q$ until the sample budget has grown exponentially in $q$.
#### Tightness assumption\.
The lower bound above follows from Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1) alone. The stronger reading is that observed hard failures at scale $T(q)\approx\exp((q-A_{D})/\sigma_{D})$ actually deliver $q$ discovered modes, and that each additional $\Delta q$ modes raises the sample budget by a factor of approximately $\exp(\Delta q/\sigma_{D})$. This stronger reading requires an additional assumption that the empirical discovery curve is approximately *tight* against the bound at the relevant corpus scales. Without that assumption, the appendix gives a lower bound on $T$ only. With it,

$$\frac{T(q+\Delta q)}{T(q)}\;\approx\;\exp\!\left(\frac{\Delta q}{\sigma_{D}}\right).$$
#### Numerical anchors under tightness\.
At the conservative calibration $\sigma_{D}\approx 1.85$ (§[3.2](https://arxiv.org/html/2605.30628#S3.SS2)), $\exp(5/1.85)\approx 14.9$ and $\exp(10/1.85)\approx 223$: five extra modes need roughly $15\times$ more observed hard failures, ten extra modes need roughly $220\times$ more. These are tightness-conditional anchors, not unconditional consequences of Postulate [1](https://arxiv.org/html/2605.30628#Thmpostulate1). The subsample-discovery measurement of §[3.2](https://arxiv.org/html/2605.30628#S3.SS2) is precisely the test of whether tightness holds.
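The anchors can be reproduced with a few lines of Python. This is a minimal sketch, not the authors' code: the function names are illustrative, and both functions assume the logarithmic postulate with the tightness assumption stated above.

```python
import math


def sample_budget_factor(delta_q: float, sigma_d: float) -> float:
    """Multiplicative growth in observed hard failures for delta_q extra modes.

    Assumes the discovery curve is tight against q(T) = A_D + sigma_D * ln T.
    """
    if sigma_d <= 0:
        raise ValueError("sigma_d must be positive")
    return math.exp(delta_q / sigma_d)


def min_events(q: float, a_d: float, sigma_d: float) -> float:
    """Lower bound T >= exp((q - A_D) / sigma_D) obtained by inverting the postulate."""
    if sigma_d <= 0:
        raise ValueError("sigma_d must be positive")
    return math.exp((q - a_d) / sigma_d)


SIGMA_D = 1.85  # conservative calibration quoted in the text

for extra_modes in (5, 10):
    factor = sample_budget_factor(extra_modes, SIGMA_D)
    print(f"{extra_modes} extra modes -> about {factor:.0f}x more hard failures")
```

Because the factor depends only on $\Delta q/\sigma_{D}$, a smaller measured $\sigma_{D}$ makes every additional mode more expensive, which is why the conservative (largest) calibration gives the mildest anchors.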
#### What this is and is not\.
The corollary describes*new distinct\-mode discovery*\. Ordinary failures inside already\-discovered modes may remain common and cheap to observe; the exponential cost lives on the category\-novelty axis\. Conflating ordinary failure rate with novel\-mode arrival rate would over\-claim the result\.
#### Heaps alternative\.
Under the Heaps cluster-count law of §[A.5](https://arxiv.org/html/2605.30628#A1.SS5), $q(T)=KT^{b}$ with $b\in(0,1)$, the corresponding inverse cost is polynomial rather than exponential:

$$T(q)\;=\;(q/K)^{1/b}.$$

The exponential inverse-cost reading is specific to the logarithmic upper bound; the broader qualitative claim that tail discovery has diminishing returns survives under any concave discovery curve. Sensitivity is therefore: exponential under logarithmic-and-tight, polynomial under Heaps, undefined past the patch ceiling.
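The contrast between the two inverse costs can be sketched numerically. The values of `K` and `b` below are hypothetical placeholders chosen only for illustration, since the text reports no fitted Heaps constants; the function names are likewise illustrative.

```python
import math


def heaps_inverse_cost(q: float, k: float, b: float) -> float:
    """Events needed to reach q modes under q(T) = K * T**b, with 0 < b < 1."""
    if not 0.0 < b < 1.0:
        raise ValueError("b must lie strictly between 0 and 1")
    return (q / k) ** (1.0 / b)


def heaps_budget_factor(q: float, delta_q: float, b: float) -> float:
    """T(q + delta_q) / T(q) under Heaps; K cancels, but the factor depends on q."""
    return ((q + delta_q) / q) ** (1.0 / b)


def log_budget_factor(delta_q: float, sigma_d: float) -> float:
    """T(q + delta_q) / T(q) under the tight logarithmic law; independent of q."""
    return math.exp(delta_q / sigma_d)


K_HYPOTHETICAL = 1.0   # assumption, not a measured constant
B_HYPOTHETICAL = 0.3   # assumption, not a measured constant

for q in (10, 20, 40):
    heaps = heaps_budget_factor(q, 5, B_HYPOTHETICAL)
    log_law = log_budget_factor(5, 1.85)
    print(f"q={q}: Heaps factor {heaps:.2f}, logarithmic factor {log_law:.2f}")
```

The design point is that the logarithmic law charges the same multiplicative price for five extra modes at every catalogue size, whereas the Heaps price falls as $q$ grows, which is the sense in which its inverse cost is only polynomial.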
#### Saturation regime\.
Once the ceiling $q(T)\leq|C_{D}|$ has been reached, that is, $q(T)=|C_{D}|$, discovery stops; the inversion applies only in the pre-cap regime. Inside saturated patches the corollary’s multiplicative-cost reading is vacuous because no novel modes remain to discover, which is itself a property of the patch and not a failure of the corollary.
#### Mode\-mediated capability gain \(sub\-corollary\)\.
Suppose broad capability or reliability gain $G$ inside the patch is approximately linear in the number of useful discovered modes, $G(q)=G_{0}+\gamma q$ for some $\gamma>0$. Composing with the logarithmic upper bound at tightness,

$$G(T)\;\approx\;G_{0}+\gamma A_{D}+\gamma\sigma_{D}\ln T,$$

so $G$ grows logarithmically in observed hard-failure exposure under tightness. Inverting,

$$T(G)\;\approx\;\exp\!\left(\frac{G-G_{0}-\gamma A_{D}}{\gamma\sigma_{D}}\right).$$

A linear gain in mode-mediated broad reliability therefore corresponds to exponential growth in observed hard-failure exposure under the postulate at tightness. We deliberately phrase $G$ as a *mode-mediated capability/reliability* proxy, not as “intelligence”: the corollary does not say frontier scaling is useless or that intelligence requires exponential data in any general sense. It says, more narrowly, that for fixed deployment reliability where improvement is mediated by discovering new useful modes, generic open-domain training pays a heavy data tax relative to direct patch-local measurement and intervention.
#### Engineering reading\.
The corollary explains, without invoking new mechanisms, two empirical signals: (a) why generic post-training shows diminishing reliability returns once a domain’s head modes are covered, and (b) why patch-local measurement combined with targeted tools, retrieval, validators, constrained decoding, and process supervision often outperforms more frontier-scale data on the deployment SLA. Frontier scaling and patch-local engineering solve different problems: scaling improves the substrate; patch-local engineering removes recurring deployment failure mass. $\square$
## Appendix B Full Failure-Mode Taxonomy and the Capability-Elimination Harvest
The intervention literature provides at least one targeted countermeasure for each named cluster of §[4](https://arxiv.org/html/2605.30628#S4)\. Two organising granularities exist: at the capability level \(six axes\) the cluster\-selectivity property is most clearly visible; at the error\-class level \(twelve named clusters\) the evidence aligns with the taxonomies of §[4](https://arxiv.org/html/2605.30628#S4)but is finer\-grained than the underlying capability mechanisms\.
#### Six capability axes and three structural patterns\.
A dedicated harvest yields 28 quantitatively-anchored citations across six independent capability axes: Arithmetic (Python/symbolic execution), Code Execution (REPL/sandbox feedback), Format/Structure (constrained decoding, FSMs, grammar engines), Perception/Grounding (visual grounding for GUI, charts, tables), Knowledge/RAG (dense retrieval and citation grounding), Verification (proof checkers, learned verifiers, classifier rerouting, process supervision). Each axis is independently confirmed by between three and nine citations. We stratify the 28 by kind of evidence into three patterns:
Pattern A: hard guarantees (by-construction). Seven citations achieve residual error rate equal to zero *by construction*, restricted strictly to structural/verifiable classes: constrained decoders set $P(\text{invalid token})=0$ at every step [Suresh et al., [2025](https://arxiv.org/html/2605.30628#bib.bib50), Zhang et al., [2023](https://arxiv.org/html/2605.30628#bib.bib66), Dong et al., [2025](https://arxiv.org/html/2605.30628#bib.bib10), Li et al., [2026](https://arxiv.org/html/2605.30628#bib.bib29), OpenAI, [2024](https://arxiv.org/html/2605.30628#bib.bib38)]; static syntax checks reject programs with a `SyntaxError` before execution [Wen et al., [2024](https://arxiv.org/html/2605.30628#bib.bib61)]; proof kernels reject any output failing type-checking [Ren et al., [2025](https://arxiv.org/html/2605.30628#bib.bib44)]. The class of grammar-violating outputs is mathematically empty under these mechanisms.
Pattern B: strong empirical reductions with class-shift signature. Roughly fourteen citations report empirical reductions of 80–100% in a named error class, with the post-intervention failure log dominated by structurally different residual classes. Program-of-Thoughts on GSM8K [Chen et al., [2023](https://arxiv.org/html/2605.30628#bib.bib5)]: calculation errors drop from 30% of failures to 0%, residuals are 62% reasoning + 36% misunderstanding. OpenMedCalc [Goodell et al., [2025](https://arxiv.org/html/2605.30628#bib.bib15)]: “only interpretation errors were identified.” Acurai [Wood and Forbes, [2024](https://arxiv.org/html/2605.30628#bib.bib62)]: 100% hallucination elimination on RAGTruth (95% CI 91–100%), strong empirical rather than by-construction.
Pattern C: moderate reductions (60–80%) with residuals shifting outside the target class. Legal RAG [Dantart, [2026](https://arxiv.org/html/2605.30628#bib.bib9)]: fabricated citations >30% → <0.2%. GPT-5 SimpleQA with web access [OpenAI, [2025](https://arxiv.org/html/2605.30628#bib.bib39)]: 47% → 9.6% (inter-condition, not within-condition). CRITIC [Gou et al., [2024a](https://arxiv.org/html/2605.30628#bib.bib17)]: toxic generation −79.2%.
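The by-construction guarantee of Pattern A can be illustrated with a toy logit mask. This is a minimal sketch of the general masking idea, not the mechanism of any cited system: the vocabulary, the logits, and the allowed-token set are all hypothetical, and a real decoder would derive the allowed set from a grammar or finite-state machine at each step.

```python
import math


def constrained_distribution(logits: dict[str, float], allowed: set[str]) -> dict[str, float]:
    """Renormalise next-token probabilities over the allowed tokens only.

    Tokens outside `allowed` receive probability exactly 0.0, so an invalid
    token can never be sampled at this step.
    """
    valid = {tok: value for tok, value in logits.items() if tok in allowed}
    if not valid:
        raise ValueError("the constraint allows no token in the vocabulary")
    peak = max(valid.values())  # subtract the max for numerical stability
    weights = {tok: math.exp(value - peak) for tok, value in valid.items()}
    total = sum(weights.values())
    return {tok: weights.get(tok, 0.0) / total for tok in logits}


# Hypothetical step: a JSON grammar that has just read "{" permits a key quote or "}".
toy_logits = {'"': 1.2, "}": 0.3, "[": 2.5, "hello": 3.1}
probs = constrained_distribution(toy_logits, allowed={'"', "}"})
print(probs)
```

Note that the two tokens the unconstrained model prefers most are the ones that get zeroed out: the guarantee concerns structure only, and says nothing about whether the remaining valid continuation is semantically right.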
#### The 28 citations, organised by axis and pattern\.
#### Capabilities are coarser than error classes\.
Several entries in the harvest reveal “two-for-one” reductions where one capability addresses multiple named clusters: a single Python interpreter removes the execution-error component of arithmetic, unit conversion, simple counting, list manipulation, and date arithmetic; constrained decoding eliminates by construction both format violations and the structural component of “missing required element”; code execution feedback strongly reduces `SyntaxError`, `NameError`, and most `TypeError` together; RAG strongly reduces factual hallucinations, fabricated citations, and outdated information jointly. The practical capability library required is therefore smaller than the count of named error categories.
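The many-to-one relationship can be written down as a small lookup table. The sketch below only encodes the four examples named in the paragraph above; the dictionary keys and function names are illustrative labels, not identifiers from the paper.

```python
# Capability -> error classes it addresses, taken from the examples in the text.
CAPABILITY_COVERAGE = {
    "python_interpreter": [
        "arithmetic",
        "unit conversion",
        "simple counting",
        "list manipulation",
        "date arithmetic",
    ],
    "constrained_decoding": [
        "format violations",
        "missing required element (structural)",
    ],
    "code_execution_feedback": ["SyntaxError", "NameError", "TypeError (most)"],
    "retrieval_rag": [
        "factual hallucinations",
        "fabricated citations",
        "outdated information",
    ],
}


def library_sizes(coverage: dict[str, list[str]]) -> tuple[int, int]:
    """Return (number of capabilities, number of distinct error classes covered)."""
    classes = {name for names in coverage.values() for name in names}
    return len(coverage), len(classes)


def capabilities_for(error_class: str, coverage: dict[str, list[str]]) -> list[str]:
    """Inverse lookup: which capabilities address a given error class."""
    return [cap for cap, names in coverage.items() if error_class in names]


n_capabilities, n_classes = library_sizes(CAPABILITY_COVERAGE)
print(n_capabilities, "capabilities cover", n_classes, "named error classes")
```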
#### The twelve named clusters\.
Eight of twelve categories have a strong citation with double\-digit percentage\-point improvement; three \(B, C, I\) lack a clean single\-cluster ablation; one \(F\) has consistent medium\-strength evidence\. Category I \(missing required elements\) splits mechanistically into I\.a \(structural, absorbed by D’s constrained decoders\) and I\.b \(semantic, absorbed by G/H\)\. Categories B \(units\) and C \(counting\) remain technical gaps: B is naturally folded into A; C is the smallest residual gap\.
#### Additivity and its limits\.
Patel et al. [[2026](https://arxiv.org/html/2605.30628#bib.bib42)] demonstrate that stacking interventions targeting orthogonal failure modes can produce large compound reliability gains: their parallel-consensus framework yields a $14{,}700\times$ improvement over a single-pass baseline. This is evidence for compound gains from decomposition plus consensus aggregation under their specific parallel-voting regime, rather than a direct demonstration that heterogeneous cluster-targeted interventions compose additively without voting. Shang et al. [[2024](https://arxiv.org/html/2605.30628#bib.bib46)] similarly show supra-additive gains when modules target distinct failure modes. Le [[2026](https://arxiv.org/html/2605.30628#bib.bib27)] reports that schema-level and prompt-level instructions interact *non-additively* when sharing a prompt channel: interventions on orthogonal processing layers (decoding constraint vs. retrieval vs. training-signal vs. inference-time tool call) compose near-additively, while those sharing a channel may interfere.
#### The irreducible\-semantic residual\.
Of the 17 ErrorAtlas categories, 13 are addressed under Patterns A, B, or C by one of the six capability axes; four are not: residuals of inappropriate refusal beyond preference optimisation, specification misinterpretation not closed by clarification, problem-decomposition reasoning bottlenecks, and a “user wanted something different” semantic remainder. These are classes where the failure is in choosing what to do, not executing it. Proposition [2](https://arxiv.org/html/2605.30628#Thmproposition2)’s prediction of $O(\log)$ rather than $O(0)$ residual reliability reflects exactly this irreducible core. The framework does not claim 100% coverage of all failures; it claims polylog-bounded *capability-eliminable* failures, with the named semantic residual as the floor.
## Appendix C Counter-Evidence Re-Audits
Five prominent papers are routinely cited as evidence that LLM reliability decays steeply with length [Dziri et al., [2023](https://arxiv.org/html/2605.30628#bib.bib11), Kuratov et al., [2024](https://arxiv.org/html/2605.30628#bib.bib25), Kwa et al., [2025](https://arxiv.org/html/2605.30628#bib.bib26), Wan et al., [2026](https://arxiv.org/html/2605.30628#bib.bib54), Karpinska et al., [2024](https://arxiv.org/html/2605.30628#bib.bib24)]. A careful re-reading shows that *every one* of these papers reports decay over a variable distinct from raw token length $n$. Identifying the decay axis is not the same as dissolving the practical concern: where compositional graph size, fact count, or evidence scope grow with problem length, the framework predicts steep failure curves. The contribution is to identify which interventions help (capability provisioning along the actual decay axis) and which do not.
#### Dziri et al\. \[[2023](https://arxiv.org/html/2605.30628#bib.bib11)\], Faith and Fate\.
GPT-4 multi-digit multiplication accuracy drops from 59% (3-digit) to 4% (4-digit) zero-shot, with the authors theorising “probability of incorrect predictions converges exponentially to $\approx 1$ for abstract compositional tasks.” The decay variable is *compositional graph size $N$*, not raw token length: the $3\times 3$ graph has on the order of $d^{2}$ partial products plus carries, and multi-digit multiplication is engineered so that $k_{\text{hard}}\approx N$, with every node in the computation graph a hard decision. Compared with natural-language tasks, where $k_{\text{hard}}\ll n$, Dziri’s regime is the *boundary case* in which our framework reduces to their result. A clean $(1-\varepsilon)^{N}$ exponential cannot simultaneously reproduce 59% at 3-digit and 4% at 4-digit for any single per-node $\varepsilon$: the observed drop is locally steeper than per-node iid exponential, consistent with a finite catalogue of failure modes exhausting as $N$ grows. *Where the framework concedes*: in adversarial compositional tasks engineered so $k_{\text{hard}}\approx N$, the framework reduces to Dziri’s regime and does not relieve it; the relocation is informational, not magical.
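The incompatibility with a single per-node error rate can be checked directly. The sketch below assumes, purely for illustration, that the graph size is $N=d^{2}$ (partial products only, carries ignored); the node-count model and the function names are assumptions of this example, not figures from Dziri et al.

```python
import math

ACC_3_DIGIT = 0.59  # reported zero-shot accuracy, 3-digit multiplication
ACC_4_DIGIT = 0.04  # reported zero-shot accuracy, 4-digit multiplication


def per_node_error(accuracy: float, nodes: int) -> float:
    """Solve (1 - eps) ** nodes == accuracy for eps."""
    return 1.0 - accuracy ** (1.0 / nodes)


def predicted_accuracy(eps: float, nodes: int) -> float:
    """Accuracy of an iid per-node model with error rate eps."""
    return (1.0 - eps) ** nodes


# ASSUMPTION: N = d**2 nodes for d-digit by d-digit multiplication.
n3, n4 = 3 ** 2, 4 ** 2

eps = per_node_error(ACC_3_DIGIT, n3)
pred_4_digit = predicted_accuracy(eps, n4)

# Graph growth a clean exponential would need in order to hit both data points.
required_growth = math.log(ACC_4_DIGIT) / math.log(ACC_3_DIGIT)
assumed_growth = n4 / n3

print(f"eps fitted at 3-digit: {eps:.4f}")
print(f"iid prediction at 4-digit: {pred_4_digit:.2f} (observed {ACC_4_DIGIT})")
print(f"growth needed: {required_growth:.1f}x, growth assumed: {assumed_growth:.2f}x")
```

Under this node count the iid model fitted at 3 digits predicts a far higher 4-digit accuracy than the 4% observed; a single $\varepsilon$ would fit both points only if the graph grew about sixfold between 3 and 4 digits, so the conclusion is insensitive to moderate changes in how nodes are counted.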
#### Kuratov et al\. \[[2024](https://arxiv.org/html/2605.30628#bib.bib25)\]\.
The abstract reports “performance declines sharply with increased reasoning complexity” and models “effectively utilise only 10–20% of the context.” Read in detail, BABILong varies two axes: context length $n$ (0K to 10M tokens) and number of supporting facts $k$ (QA1 $=1$ fact to QA3 $=3$ facts). The sharp decline is in $k$, not $n$. For QA1 (single-fact), most models “perform well up to 4,000 tokens”: a plateau, not exponential decay. The famous RAG-flat-across-length result (60% on single-fact QA *independent of context length*) is the cleanest possible demonstration that when relevant evidence is in window, length does not matter. That Recurrent Memory Transformers maintain performance to 50M tokens further confirms that effective $k$ is determined by architecture, not raw $n$. The framework’s concession is narrow but real: for multi-hop tasks whose required fact count grows with task complexity, BABILong’s sharp $k$-axis decay is exactly what the framework predicts, not a counter-example.
#### Kwa et al\. \[[2025](https://arxiv.org/html/2605.30628#bib.bib26)\]\(METR\)\.
Per-model success fits a logistic in $\log(\text{human-task-duration})$: $S(t)=\sigma(\beta(\log t-\log h))$. Logistic-in-log-length is *mathematically sublinear* in length itself. The 80%-horizon being 4–6× shorter than the 50%-horizon is a steep within-model cliff but compatible by construction with the polylog result: a cliff at a specific capacity threshold is a manifold-transition signature. The famous exponential (capability doubling every seven months) is an *inter-model* claim about how the horizon $h$ moves across generations, orthogonal to within-model decay shape. An honest qualifier: the within-model logistic-in-log cliff is steep on a practical scale. Calling it “sublinear in length” is technically correct but engineering-useful only for systems designed to operate well below the 50% horizon.
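The relation between the slope $\beta$ and the gap between the two horizons can be sketched as follows. Two conventions here are assumptions of this example rather than statements from Kwa et al.: natural logarithms are used, and the sign is arranged so that $\beta>0$ makes success fall as task duration $t$ exceeds the 50% horizon $h$.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


def success_probability(t: float, h: float, beta: float) -> float:
    """Logistic in log duration; equals 0.5 at t == h and falls as t grows (beta > 0)."""
    return sigmoid(beta * (math.log(h) - math.log(t)))


def horizon(p: float, h: float, beta: float) -> float:
    """Task duration at which the success probability equals p."""
    logit = math.log(p / (1.0 - p))
    return h * math.exp(-logit / beta)


def horizon_ratio(beta: float, p: float = 0.8) -> float:
    """How many times shorter the p-horizon is than the 50%-horizon."""
    return horizon(0.5, 1.0, beta) / horizon(p, 1.0, beta)


def beta_for_ratio(ratio: float, p: float = 0.8) -> float:
    """Slope implied by a given 50%-to-p horizon ratio."""
    return math.log(p / (1.0 - p)) / math.log(ratio)


for ratio in (4.0, 6.0):
    print(f"ratio {ratio:.0f}x -> beta about {beta_for_ratio(ratio):.2f}")
```

Under these conventions the 4–6× gap quoted above corresponds to slopes of roughly 1.0 down to 0.77 per unit of natural-log duration; with a different log base the numerical slope changes but the horizon ratio does not.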
#### Wan et al\. \[[2026](https://arxiv.org/html/2605.30628#bib.bib54)\], Fano\-style upper bound\.
This paper theorises super-linear information-demand growth and identifies an “accuracy cliff” at capacity overflow: “when the task’s information demand surpasses the model’s output capacity, performance does not degrade gracefully but instead collapses sharply.” The behavioural prediction is a *threshold*, not smooth $(1-\varepsilon)^{n}$. A cliff at a specific capacity threshold is exactly the manifold-transition behaviour our framework predicts at the boundary between covered and uncovered modes; Wan et al.’s mechanism (capacity overflow) is one specific cause, complementary to ours. *Where the framework concedes*: Wan documents a mechanism by which $|C|$ effectively exceeds the model’s coverage in a single forward pass; Pattern A interventions (constrained decoding, formal verification) are the kind of response the framework expects but does not automatically supply.
#### Karpinska et al\. \[[2024](https://arxiv.org/html/2605.30628#bib.bib24)\]\.
GPT-4o achieves 55.8% pair accuracy across 1,001 minimally-different true/false claim pairs about long fictional books (mean length 127K tokens). The paper reports performance by *evidence scope*, not by context length: 59.8% on sentence-level retrieval, 47.6% on passage-level, 41.6% on global reasoning. No per-context-length curve is reported. The decay axis is the number of evidence pieces that must be integrated, a $k$-axis quantity rather than an $n$-axis one. NoCha is therefore not counter-evidence to a sublinear-in-$n$ claim. Evidence-scope decay is exactly the $k$-axis observation the framework predicts cannot be addressed by scaling raw context length; it requires retrieval or process supervision along the actual decay axis. The relocation is informational, not magical.
#### Unifying observation\.
Each of the steep-decay counter-papers we re-audit decays over a variable other than $n$: compositional graph size, fact count, log-time horizon, capacity threshold, or evidence scope. The apparent rapid decay is, in each case, in a quantity our framework already concentrates the action in ($k_{\text{hard}}$ and $|C|$), not in raw sequence length. This is a relocation, not a dissolution. Where $k_{\text{hard}}$ grows with task length (adversarial compositional structure, multi-hop fact chains, long horizons that force more decisions), reliability remains hard; the framework’s value is directing intervention toward capability provisioning along the actual decay axis rather than toward context-window or compute-budget expansion that does not help.
## Appendix D Patch Evidence Detail
The body’s claim that domain patches cap the engineering problem rests on a triangulation of indirect evidence. None of the items below *measures* the patch ceiling $|C_{D}|$ directly; they support the weaker claim that model behaviour is strongly domain-dependent and that operational neighbourhoods occupy bounded regions of the model’s behavioural space.
The purpose of this evidence is motivational. It supports patch-indexing of $\sigma_{D}$, $A_{D}$, $\beta_{D}$, $|C_{D}|$, but the finite (or effectively capped) reachable catalogue remains an empirical modelling assumption to be measured per deployment, not derived from any of the rows above.
## Appendix E Empirical Calibration Detail
Three published LLM error taxonomies anchor the mode-rate parameter $\sigma$ in the body’s $\sigma\in[0.87,1.85]$ range. The simple calibration uses $|C|\approx A+\sigma\ln T$ with $A=0$; this is a deliberately conservative readout that ignores any positive intercept and uses endpoint counts rather than discovery curves.
These are endpoint counts, not discovery curves: they tell us *how many* named categories a taxonomer assigned at a single corpus scale, not how the count grew with $T$. They do not prove logarithmic mode discovery. Their role in this paper is twofold: they motivate the empirical postulate of §[3.2](https://arxiv.org/html/2605.30628#S3.SS2), and they identify the explicit empirical test that would either tighten the postulate or move the analysis to the Heaps variant of Appendix [A.5](https://arxiv.org/html/2605.30628#A1.SS5): repeatedly subsample failures from a fixed deployment patch $D$ and plot discovered modes $|C_{\text{seen},D}(T)|$ against $T$.
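The subsample-discovery measurement can be sketched in a few lines. This is one possible implementation under stated assumptions, not the authors' protocol: it presumes each failure in the log already carries a mode label, the synthetic head-heavy log below is invented for illustration, and the fit is ordinary least squares of $q$ on $\ln T$.

```python
import math
import random


def discovery_curve(labels, checkpoints, repeats=200, seed=0):
    """Mean number of distinct modes seen in the first T events of a random ordering."""
    rng = random.Random(seed)
    marks = sorted(checkpoints)
    if not marks or marks[0] < 1 or marks[-1] > len(labels):
        raise ValueError("checkpoints must lie between 1 and len(labels)")
    totals = [0] * len(marks)
    for _ in range(repeats):
        order = list(labels)
        rng.shuffle(order)
        seen = set()
        for i, t in enumerate(marks):
            start = marks[i - 1] if i else 0
            seen.update(order[start:t])  # add events observed since the last checkpoint
            totals[i] += len(seen)
    return [total / repeats for total in totals]


def fit_log_law(checkpoints, counts):
    """Least-squares fit of q(T) = A + sigma * ln T; returns (A, sigma)."""
    xs = [math.log(t) for t in checkpoints]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(counts) / len(counts)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    sigma = sxy / sxx
    return mean_y - sigma * mean_x, sigma


# SYNTHETIC, hypothetical failure log: 12 modes with head-heavy frequencies.
failure_log = [f"mode_{m}" for m in range(12) for _ in range(max(1, 400 // 2 ** m))]
checkpoints = [10, 30, 100, 300, 800]

curve = discovery_curve(failure_log, checkpoints)
a_hat, sigma_hat = fit_log_law(checkpoints, curve)
print("mean discovered modes:", [round(c, 2) for c in curve])
print(f"fitted A = {a_hat:.2f}, sigma = {sigma_hat:.2f}")
```

To compare against the Heaps variant, the same curve could be refit by regressing $\ln q$ on $\ln T$ and the two residuals compared; the choice of checkpoints and the number of shuffles are tuning decisions left open here.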