ERRORQUAKE: Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models

arXiv cs.LG Papers

Summary

The paper introduces Errorquake-10k, a benchmark for evaluating error severity in open-weight LLMs, showing that models with matched accuracy can have vastly different error severity distributions, and argues that severity should be reported alongside accuracy.

arXiv:2606.05170v1 Announce Type: new Abstract: At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution -- a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0-4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index (b, the Gutenberg-Richter upper-tail slope) with 95% bootstrap confidence intervals. Headline: across the 210 model pairs, 85 have disjoint 95% b confidence intervals at matched accuracy (|Delta epsilon| < 0.05) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at epsilon = 0.586 and Delta b = 0.47. A 519-item three-rater human validation study confirms measurement reliability (ICC(2,k=3) = 0.85), validates the LLM-judge ranking (rho = 0.89), and confirms the dense-model scaling correlation on human data (rho_s = -0.86). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant (I(b; model | epsilon) = 1.56 bits; 64.5% of cross-model b variance is unexplained by epsilon). A severity mechanism taxonomy (kappa = 0.83) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%) -- and this composition differs by model size (p < 0.0001). Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.

Cached at: 06/05/26, 08:08 AM

# Heavy-Tailed Error Severity Distributions in Open-Weight Large Language Models
Source: [https://arxiv.org/html/2606.05170](https://arxiv.org/html/2606.05170)
###### Abstract

At matched accuracy, open-weight LLMs differ substantially in the shape of their error severity distribution, a difference invisible to the scalar error rate. Hallucination benchmarks report a single error count and treat all errors as equivalent, yet a wrong date and a fabricated court ruling differ by orders of magnitude. We introduce Errorquake-10k, a 10,000-query benchmark scoring each response on a continuous 0–4 severity scale across 8 domains and 5 difficulty tiers, and we fit per-model severity distributions for 21 open-weight models. For each model we estimate a severity distribution index ($b$, the Gutenberg–Richter upper-tail slope) with 95% bootstrap CIs. Headline: across the 210 model pairs, 85 have disjoint 95% $b$ CIs at matched accuracy ($|\Delta\varepsilon| < 0.05$) on human-consensus scoring, e.g. deepseek-v3.2 vs. ministral-14b at $\varepsilon = 0.586$ and $\Delta b = 0.47$. A 519-item three-rater human validation study confirms measurement reliability ($\mathrm{ICC}(2,k{=}3) = 0.85$), validates the LLM-judge ranking ($\rho = 0.89$), and confirms the dense-model scaling correlation on human data ($\rho_s = -0.86$). We prove a Non-Reducibility Theorem showing that severity profile and error rate are informationally non-redundant ($I(b;\,\text{model} \mid \varepsilon) = 1.56$ bits; 64.5% of cross-model $b$ variance is unexplained by $\varepsilon$). A severity mechanism taxonomy ($\kappa = 0.83$) reveals that error type shifts categorically with severity: low-severity errors are retrievals (71%); high-severity errors are fabrications (39%), and this composition differs by model size ($p < 0.0001$). *Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.*

## 1 Introduction

Standard hallucination benchmarks (Lin et al., [2022b](https://arxiv.org/html/2606.05170#bib.bib19); Li et al., [2023](https://arxiv.org/html/2606.05170#bib.bib16)) report a single number, the error rate $\varepsilon$, and treat all errors as equivalent. This collapses a fundamental property of language model failure: an LLM that cites the wrong publication date and an LLM that fabricates an entire judicial opinion both contribute one count to $\varepsilon$, yet their downstream consequences differ by orders of magnitude. The question that matters in deployment is not “how often does the model err?” but “how badly?”. Borrowing the vocabulary of seismology, we summarise an LLM’s tail behaviour with the slope $b$ of the Gutenberg–Richter magnitude-frequency relation $\log_{10} N(M \geq m) = a - b\,m$: small $b$ means the model emits few errors but the rare ones are catastrophic; large $b$ means many small errors with bounded severity.

#### Central finding: a matched\-accuracy discriminator\.

Our headline is a pairwise discrimination result: across the 210 pairs in our 21-model catalog, 85 have disjoint 95% $b$ confidence intervals at matched accuracy on human-consensus scoring ($|\Delta\varepsilon| < 0.05$; 31 on LLM-judge scoring alone). The clearest example is deepseek-v3.2 vs. ministral-14b at $\varepsilon = 0.586$ and $\Delta b = 0.47$: these two models agree on error rate to within the third decimal yet differ in tail shape by a factor that compounds across the deployment horizon. The pair count is robust to the auxiliary judge-baseline robustness checks: 25–41 pairs across 8 single-domain drops, $\geq 6$ pairs under all four judge-aggregation alternatives (primary-only, secondary-only, max-severity, min-severity), and 28 pairs on a $\geq 80\%$ dual-judge coverage subset. This is a per-pair result, not a cross-model regression; it does not depend on a scaling law or on the marginal $\varepsilon$-$b$ relationship, both of which we report separately as sensitivity analyses. *Severity distribution should be reported alongside accuracy; it carries discriminative information that the error rate cannot.*

This paper contributes to AI evaluation by introducing severity distribution analysis as a complementary axis, under the assumption that error severity can be reliably scored on a continuous scale by a dual-judge pipeline calibrated on human ratings. The claim applies to open-weight instruction-tuned models at 3–37B active parameters; it does not yet cover proprietary frontier models, reasoning models, or models below 3B (§[8](https://arxiv.org/html/2606.05170#S8)). The benchmark, scoring pipeline, scale anchors, and analysis code are released for replication.

#### Contributions\.

- C1 (theory): A Non-Reducibility Theorem proving that severity profile and error rate are informationally non-redundant, with a Resolution Bound connecting measurement reliability to discriminative power (§[5](https://arxiv.org/html/2606.05170#S5)).
- C2 (measurement): A 9-level severity scale validated by a 519-item, 3-rater human study ($\mathrm{ICC}(2,k{=}3) = 0.85$, overcall $= 13.7\%$), with human-$b$ vs. judge-$b$ rank correlation $\rho = 0.89$ across 15 models (§[2](https://arxiv.org/html/2606.05170#S2), §[2](https://arxiv.org/html/2606.05170#S2.SS0.SSS0.Px4)).
- C3 (headline): $b$ is a matched-accuracy discriminator: 85/210 model pairs have disjoint 95% $b$ CIs at $|\Delta\varepsilon| < 0.05$ on human-consensus scoring, with auxiliary LLM-judge robustness under cross-domain jackknife (25–41 pairs), four judge-aggregation rules, and a dual-coverage subset (§[4.1](https://arxiv.org/html/2606.05170#S4.SS1)).
- C4 (taxonomy): A severity mechanism taxonomy (6 categories, $\kappa = 0.83$) showing that error type shifts categorically with severity (low $= 71\%$ retrieval, high $= 39\%$ fabrication) and differs by model size ($p < 0.0001$) (§[4.7](https://arxiv.org/html/2606.05170#S4.SS7)).
- C5 (scaling): Dense-model scaling correlation $\rho_s = -0.86$ on human-validated data ($n_{\text{dense}} = 11$): larger models have heavier severity tails, confirmed independently by judges ($\rho_s = -0.56$) and human raters (§[4.3](https://arxiv.org/html/2606.05170#S4.SS3)).
- C6: Distribution characterisation (17/21 non-exponential), pre-registered micro-error prediction (rank-significant, magnitude fails), and deployment risk table (§[4.2](https://arxiv.org/html/2606.05170#S4.SS2), §[4.4](https://arxiv.org/html/2606.05170#S4.SS4), §[4.7](https://arxiv.org/html/2606.05170#S4.SS7.SSS0.Px2)).
- C7 (resource): Errorquake-10k benchmark, 10,000 queries $\times$ 21 models, scoring toolkit, Croissant metadata, and HuggingFace release.

## 2 Method

#### Error\-severity scale\.

We score each model response on a continuous 0.0–4.0 scale quantised to 0.5 increments (9 distinct levels). A score of 0.0 denotes a correct response; 0.5–1.0 mark imprecisions that preserve the gist; 1.5–2.0 denote moderate factual errors; 2.5–3.0 denote substantial errors that mislead a typical reader; 3.5–4.0 denote fabrication (confident assertion of invented information). Appendix [K](https://arxiv.org/html/2606.05170#A11) reproduces the full rubric with three worked examples per anchor. Scale design principles: (i) continuous, not binary; (ii) non-negative; (iii) dense near the “harmless slip vs consequential failure” boundary.

#### Query benchmark\.

Errorquake-10k comprises 10,000 queries: 1,250 per domain across 8 domains (BIO, LAW, HIST, GEO, SCI, TECH, FIN, CULT), with each domain stratified into five difficulty tiers T1–T5 of 100 queries each. Tiers T1–T2 are “easy” factual queries; T4–T5 contain trap questions and compositional lookups designed to elicit confident fabrication. A tier-calibration audit flagged $\sim 6\%$ of cells as mis-tiered and these were regenerated (Appendix [E](https://arxiv.org/html/2606.05170#A5)).

#### Dual\-judge scoring\.

Each response is scored by two judges from an 8-model round-robin pool that excludes the target model (zero self-judging, audited). The final score is the mean of the two judges when they agree within 1.0, and otherwise the median-of-three with a tiebreaker. Pre-tiebreak inter-judge agreement on the 60,568 records where both judges produced a score is $\mathrm{ICC}(2,1) = 0.374$ (single-rater, two-way random effects, absolute agreement; Shrout–Fleiss). *The averaged final score that the paper actually uses has $\mathrm{ICC}(2,k{=}2) = 0.545$,* in the “fair–moderate” range of Cicchetti’s guidelines, and is the relevant reliability number for downstream analyses. Linear Cohen’s $\kappa = 0.285$ and quadratic $\kappa = 0.374$ are reported for completeness, though Cohen’s $\kappa$ is depressed by the 9-level scale’s low chance agreement. The secondary judge call failed on a non-random subset of small-model records, so per-model agreement comparisons are biased; the per-model breakdown is in Appendix [R](https://arxiv.org/html/2606.05170#A18). A 340-item manual audit additionally finds that 33.5% of judge score-2.0 verdicts are overcalls (S2, §[4.6](https://arxiv.org/html/2606.05170#S4.SS6)). All inference uses an open-access inference API hosting the target models on third-party GPU infrastructure; prompts and model version strings are in Appendix [J](https://arxiv.org/html/2606.05170#A10).
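The aggregation rule described above can be sketched in a few lines. This is a minimal illustration, not the released scoring pipeline: the function name `final_score` is hypothetical, "agree within 1.0" is assumed to be inclusive, and the third score is assumed to come from the tiebreaker.

```python
from statistics import mean, median

AGREEMENT_TOLERANCE = 1.0  # from the text: judges "agree within 1.0"


def final_score(primary, secondary, tiebreaker=None):
    """Combine two judge scores; fall back to median-of-three on disagreement.

    Assumption: the tolerance is inclusive and the tiebreaker supplies the
    third score. The sketch does not re-quantise the mean onto the 0.5 grid.
    """
    if abs(primary - secondary) <= AGREEMENT_TOLERANCE:
        return mean([primary, secondary])
    if tiebreaker is None:
        raise ValueError("judges disagree by more than 1.0; tiebreaker needed")
    return median([primary, secondary, tiebreaker])


agreeing = final_score(1.0, 2.0)          # judges within tolerance
disputed = final_score(0.5, 3.0, 2.5)     # judges 2.5 apart, tiebreaker used
```

Under these assumptions the agreeing pair averages to 1.5 and the disputed triple resolves to its median, 2.5.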

#### Human validation\.

Three expert raters independently scored a stratified sample of 519 items ($\sim 35$ per model $\times$ 15 models, 5 severity bands) on the same 9-point scale, blind to model identity and judge scores. Inter-rater reliability is excellent: $\mathrm{ICC}(2,k{=}3) = 0.85$ (95% CI $[0.83, 0.87]$), $\mathrm{ICC}(2,1) = 0.66$. Pairwise quadratic $\kappa$ ranges 0.65–0.66. Per-domain $\mathrm{ICC}(2,k{=}3)$ is consistent across all 8 domains (0.82–0.92). Human overcall rate at score 2.0 is 13.7% (vs. 33.5% for LLM judges).

Human-derived $b$ values span $[0.72, 1.27]$, matching the judge range $[0.57, 1.31]$, with human-vs-judge rank correlation $\rho = 0.89$ ($p < 0.001$) across 15 models. Each rater’s item-level Spearman with the judge is 0.77–0.80. The dense-model scaling correlation is $\rho_s = -0.86$ on human data, *stronger* than the judge-based $-0.56$, confirming the scaling finding independently. Raters also classified each error into the severity mechanism taxonomy (§[4.7](https://arxiv.org/html/2606.05170#S4.SS7)), achieving Fleiss $\kappa = 0.83$.

#### Distribution fitting\.

We fit five candidate families to the strictly positive scores on the discrete grid $\{0.5, 1.0, \ldots, 4.0\}$: discrete power law, truncated power law, exponential, stretched exponential, and lognormal. Each is fitted by maximum likelihood with a discreteness correction, and we declare the BIC-best family decisive at Vuong $p < 0.05$ (Vuong, [1989](https://arxiv.org/html/2606.05170#bib.bib23)) or $\Delta\text{BIC} > 6$ (cf. Clauset et al., [2009](https://arxiv.org/html/2606.05170#bib.bib5)). No model in our catalog is best-fit by a pure power law.
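To make the BIC comparison concrete, the sketch below fits only two of the five families (exponential and power law) to binned counts on the eight positive grid points and applies the $\Delta\text{BIC} > 6$ rule. It is a simplified stand-in, not the paper's fitting code: both families are assumed to be truncated to the eight bins, each has one free parameter found by ternary search, Vuong's test is omitted, and the counts are invented for illustration.

```python
import math

GRID_BINS = 8  # positive grid {0.5, 1.0, ..., 4.0} mapped to bins k = 1..8


def _log_likelihood(counts, log_weight):
    """Log-likelihood of bin counts when p_k is proportional to exp(log_weight(k))."""
    log_w = [log_weight(k) for k in range(1, GRID_BINS + 1)]
    top = max(log_w)
    log_z = top + math.log(sum(math.exp(w - top) for w in log_w))
    return sum(c * (w - log_z) for c, w in zip(counts, log_w))


def _maximise(f, lo, hi, iters=200):
    """Ternary search for the maximum of a concave function on [lo, hi]."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3
        m2 = hi - (hi - lo) / 3
        if f(m1) < f(m2):
            lo = m1
        else:
            hi = m2
    x = (lo + hi) / 2
    return x, f(x)


def bic_by_family(counts):
    """Return {family: BIC} for two one-parameter families (illustrative subset)."""
    n = sum(counts)
    families = {
        "exponential": lambda theta: (lambda k: -theta * k),
        "power_law": lambda theta: (lambda k: -theta * math.log(k)),
    }
    result = {}
    for name, make in families.items():
        _, ll = _maximise(lambda t: _log_likelihood(counts, make(t)), 0.0, 20.0)
        result[name] = 1 * math.log(n) - 2 * ll  # one free parameter each
    return result


# Invented counts that halve from one severity bin to the next.
counts = [512, 256, 128, 64, 32, 16, 8, 4]
bic = bic_by_family(counts)
best = min(bic, key=bic.get)
decisive = max(bic.values()) - min(bic.values()) > 6
```

Because both log-likelihoods are concave in their single parameter, a simple ternary search is enough here; a full replication would need the remaining three families and the discreteness correction mentioned in the text.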

#### Severity distribution index ($b$) estimation.

The Gutenberg–Richter magnitude-frequency relation (Gutenberg and Richter, [1944](https://arxiv.org/html/2606.05170#bib.bib11)) $\log_{10} N(M \geq m) = a - b\,(m - m_{\min})$ models the count of events at or above severity $m$. We estimate $b$ by maximum likelihood on grid-quantised positive scores using the Aki formula (Aki, [1965](https://arxiv.org/html/2606.05170#bib.bib1)) with a discreteness correction: $\hat{b} = \log_{10}\mathrm{e} / (\bar{m} - m_{\min} + \delta/2)$ for bin width $\delta = 0.5$, where $\bar{m}$ is the mean of observations at or above $m_{\min}$. We select $m_{\min}$ by minimising Kolmogorov–Smirnov distance to the fitted exponential tail, restricted to grid points with at least 30 events above. Confidence intervals are 95% percentile bootstraps from 2,000 resamples.
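The estimator and its bootstrap interval can be sketched as follows. The formula is the one quoted above; everything else is an assumption made for illustration: the function names are hypothetical, $m_{\min}$ is passed in rather than selected by the KS criterion, and the bootstrap resamples only the tail with $m_{\min}$ held fixed, which may differ from the released analysis code.

```python
import math
import random

BIN_WIDTH = 0.5  # delta in the text


def aki_b(scores, m_min, delta=BIN_WIDTH):
    """Aki-style estimate: log10(e) / (mean(tail) - m_min + delta / 2)."""
    tail = [s for s in scores if s >= m_min]
    if not tail:
        raise ValueError("no observations at or above m_min")
    m_bar = sum(tail) / len(tail)
    return math.log10(math.e) / (m_bar - m_min + delta / 2)


def bootstrap_ci(scores, m_min, n_resamples=2000, level=0.95, seed=0):
    """Percentile bootstrap CI for b (assumption: m_min fixed across resamples)."""
    rng = random.Random(seed)
    tail = [s for s in scores if s >= m_min]
    estimates = sorted(
        aki_b(rng.choices(tail, k=len(tail)), m_min) for _ in range(n_resamples)
    )
    lo_idx = int((1 - level) / 2 * n_resamples)
    hi_idx = int((1 + level) / 2 * n_resamples) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Toy scores for illustration only; 0.0 (correct) is excluded by m_min.
toy_scores = [0.0, 0.5, 0.5, 1.0, 1.5]
b_hat = aki_b(toy_scores, m_min=0.5)
ci_low, ci_high = bootstrap_ci(toy_scores, m_min=0.5)
```

For the toy scores the tail mean is 0.875, so the denominator is $0.875 - 0.5 + 0.25 = 0.625$ and $\hat{b} = \log_{10}\mathrm{e} / 0.625 \approx 0.695$.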

## 3 Experimental setup

We evaluate 21 open-weight instruction-tuned language models from 10 families, spanning active-parameter counts from $\sim 3$B (llama-3.2-3b, phi-3.5-mini) to $\sim 37$B active of $\sim 671$B total (deepseek-v3.1, v3.2). The catalog includes 14 dense and 7 MoE models; the full list with version strings is in Appendix [D](https://arxiv.org/html/2606.05170#A4). Each model is evaluated on all 10,000 Errorquake-10k queries with greedy decoding at a 500-token budget, via an open-access inference API hosting the target models on third-party GPU infrastructure. Reasoning models and three rate-limit-exhausted models are excluded; see §[8](https://arxiv.org/html/2606.05170#S8). Pre-registered criteria and observed outcomes are summarised in [Table 1](https://arxiv.org/html/2606.05170#S3.T1); we report all verdicts including failures.

Table 1: Pre-registered criteria and outcomes. Exp. 5’s null was rejected in the *opposite* direction from intuition. † S2 misses the threshold by 0.003 and is reported as “borderline fail”. ‡ Four models are BIC-best-fit by exponential, but 17/21 are non-exponential and 17/21 are Vuong-decisive; taken together, every model shows tail-shape structure under at least one criterion.

## 4 Experiments

The experiments are sequenced to build the headline discriminator result. §[4.1](https://arxiv.org/html/2606.05170#S4.SS1) opens with the matched-accuracy headline: 85 of $\binom{21}{2} = 210$ model pairs have disjoint $b$ confidence intervals on human-consensus scoring (31 on the LLM-judge baseline), with judge-baseline jackknife and aggregation robustness. §[4.2](https://arxiv.org/html/2606.05170#S4.SS2) then establishes that severity distributions exist and carry non-trivial tail structure. §[4.4](https://arxiv.org/html/2606.05170#S4.SS4) reports the pre-registered micro-error-to-catastrophe prediction (rank-significant, magnitude-calibration fails); §[4.5](https://arxiv.org/html/2606.05170#S4.SS5) shows domain variation; §[4.3](https://arxiv.org/html/2606.05170#S4.SS3) reports a marginal dense-model scaling correlation as a sensitivity observation only. §[4.6](https://arxiv.org/html/2606.05170#S4.SS6) collects judge-robustness checks that apply across the headline and sensitivities.

### 4.1 Matched-accuracy discriminator (Exp. 2, headline)

This is our headline result. Across the $\binom{21}{2} = 210$ model pairs, 85 have disjoint 95% CIs on $b$ at matched accuracy ($|\Delta\varepsilon| < 0.05$) on human-consensus scoring: a $2.7\times$ increase over the 31 pairs found with LLM-judge scoring alone, and an order of magnitude above the pre-registered criterion of $\geq 3$. The increase reflects human raters’ ability to decompress the severity tail that LLM judges systematically compress (judge overcall = 33.5%; human overcall = 13.7%). Concretely, deepseek-v3.2 ($\varepsilon = 0.586$, $b = 0.655$) and ministral-14b ($\varepsilon = 0.586$, $b = 1.122$) have identical accuracy but a $b$ gap of 0.467 with non-overlapping 95% confidence intervals. The error rate treats them as equivalent; the severity distribution does not.
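The pair count reduces to a simple filter over all model pairs: keep a pair when the error rates are within 0.05 of each other and the two $b$ intervals do not overlap. The sketch below shows that logic on invented inputs; the model names and interval bounds are hypothetical placeholders, not values from the paper, and the function name is an assumption.

```python
from itertools import combinations
from math import comb


def disjoint_ci_pairs(models, eps_tol=0.05):
    """models maps name -> (error_rate, b_ci_low, b_ci_high)."""
    hits = []
    for (name_a, (eps_a, lo_a, hi_a)), (name_b, (eps_b, lo_b, hi_b)) in combinations(
        models.items(), 2
    ):
        matched = abs(eps_a - eps_b) < eps_tol      # matched accuracy
        disjoint = hi_a < lo_b or hi_b < lo_a       # intervals do not overlap
        if matched and disjoint:
            hits.append((name_a, name_b))
    return hits


# Hypothetical catalog for illustration only.
catalog = {
    "model_a": (0.586, 0.60, 0.71),
    "model_b": (0.586, 1.05, 1.19),
    "model_c": (0.590, 0.68, 0.95),
    "model_d": (0.700, 1.30, 1.40),
}
pairs = disjoint_ci_pairs(catalog)
total_pairs_for_21_models = comb(21, 2)
```

In this toy catalog two of the six pairs qualify; `model_d` never qualifies because its error rate is more than 0.05 away from every other model, regardless of how distinct its interval is.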

#### Cross\-domain jackknife\.

Leaving each of the 8 domains out in turn and recounting disjoint-CI pairs on the remaining 3,500 queries per model gives $[25, 41]$ pairs, all well above the pre-registered criterion of $\geq 3$ (full table in Appendix [F](https://arxiv.org/html/2606.05170#A6)). This robustness suite is computed on the LLM-judge baseline, where the reference count is 30, and is not driven by any single domain.

#### Judge\-aggregation robustness\.

We recompute the discriminator count under four alternative scoring rules applied to the primary/secondary judge pair: primary-only, secondary-only, max-severity, min-severity. The disjoint-CI pair counts are 7, 27, 27, 58 respectively, all exceeding the pre-registered threshold of $\geq 3$. On the subset of 15 models with $\geq 80\%$ dual-judge coverage, the LLM-judge baseline gives 28 pairs and the alternatives give $[6, 24]$ (Appendix [G](https://arxiv.org/html/2606.05170#A7)).

#### Family\-native cross\-check\.

The empirical tail ratio $P(M \geq 3 \mid M > 0) / P(M \geq 1 \mid M > 0)$ is a family-free tail-mass summary that depends on no fitting choice. Using matched-accuracy pairs with a $|\Delta\text{tail\_ratio}| > 0.005$ threshold gives 49 pairs; tightening to $> 0.010$ gives 14; at $> 0.015$ it gives 4. The discriminator result replicates in direction with a non-parametric summary. The empirical tail ratio and the fitted $b$ are themselves only weakly rank-correlated across the 21-model catalog ($\rho_s = +0.13$), which we report honestly: $b$ captures upper-tail slope while the tail ratio captures upper-tail mass, and these diverge in the presence of bulk-distribution differences.
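Because both probabilities condition on the same event $M > 0$, the tail ratio simplifies to a ratio of two counts. A minimal sketch, with a hypothetical function name and toy scores:

```python
def tail_ratio(scores, high=3.0, low=1.0):
    """P(M >= high | M > 0) / P(M >= low | M > 0) as a ratio of counts."""
    positive = [s for s in scores if s > 0]
    n_low = sum(1 for s in positive if s >= low)
    if n_low == 0:
        raise ValueError("no errors at or above the lower threshold")
    n_high = sum(1 for s in positive if s >= high)
    return n_high / n_low  # the shared denominator len(positive) cancels


# Toy scores for illustration only.
ratio = tail_ratio([0.0, 0.5, 1.0, 1.5, 3.0, 3.5])
```

Here four errors are at or above 1.0 and two of them are at or above 3.0, so the ratio is 0.5.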

#### Exceedance threshold sweep \(answers Q4\)\.

Tightening the minimum tail-support requirement from $T = 30$ exceedances up to $T = 200$ drops the surviving model count from 21 to 9 but keeps the discriminator above the pre-registered threshold at every step: $30 \to 22 \to 13 \to 10$ disjoint-CI pairs on the LLM-judge baseline as $T$ increases (Appendix [P](https://arxiv.org/html/2606.05170#A16)).

*Takeaway:* the severity distribution carries model-discriminative information that $\varepsilon$ alone cannot express. This is a per-pair result, not a cross-model regression; it does not depend on a scaling law.

### 4.2 Distribution characterisation (Exp. 1)

Across 21 models, the best-fit family by BIC is stretched exponential (13), truncated power law (4), exponential (4). *Zero* models are best fit by a clean power law or lognormal on the 10K judge baseline. Vuong’s test against the runner-up declares the best fit decisive at $p < 0.05$ (or $\Delta\text{BIC} > 6$) for 17/21 models. The magnitude-frequency curves and BIC heatmap appear in [Figure 1](https://arxiv.org/html/2606.05170#S4.F1) (full grid in Appendix [C](https://arxiv.org/html/2606.05170#A3)). *Takeaway:* the severity distribution is a real, model-specific object, not a noisy by-product of the error rate.

#### Operational meaning of “heavy\-tailed”\.

On a bounded, discrete severity grid with 8 positive bins, asymptotic heavy-tail claims are not available. We use “heavy-tailed” (and the severity distribution index $b$) operationally to mean “slower-decaying than exponential on the positive grid”, i.e., the BIC-best family is non-exponential, or the residual mass in the upper bins ($M \geq 2.5$) is systematically larger than an exponential fit predicts. Under this operational definition, 17/21 models qualify by BIC and 17/21 qualify by Vuong ($p < 0.05$ versus runner-up). The four exponential best-fits (deepseek-v3.1, mistral-small-4-119b, gemma-3-27b, mistral-small-24b) are still informative because their absolute decay rates differ by a factor of $\sim 2$ across models, which $b$ captures.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x1.png)

Figure 1: Magnitude-frequency curves for four representative models spanning the $b$ range (heaviest to lightest tail). Dashed red: Gutenberg–Richter fit. All 21 models and the $\Delta$BIC heatmap are in Appendix [C](https://arxiv.org/html/2606.05170#A3).

### 4.3 Scaling correlation (Exp. 5)

Larger dense models have heavier severity tails. On LLM-judge scoring, the Spearman correlation between $\log_{10}$(active parameters) and the upper-tail $b$ is $\rho_s = -0.562$ ($p = 0.006$, $n_{\text{dense}} = 14$; [Figure 2](https://arxiv.org/html/2606.05170#S4.F2)). On human-validated scoring, the correlation strengthens: $\rho_s = -0.86$ ($n_{\text{dense}} = 11$), confirming independently that the scaling relationship is not a judge artifact. The human-validated correlation is stronger because human raters decompress the severity tail that judges compress, amplifying the between-model differences (see §[2](https://arxiv.org/html/2606.05170#S2.SS0.SSS0.Px4)).

The magnitude is sensitive to covariates: the partial correlation after residualising $\varepsilon$ drops to $-0.20$ ($p = 0.31$) on judge data; under fixed $m_{\min} = 0.5$ the sign flips to $+0.79$ ($p = 0.0007$), indicating that the negative sign is specific to the upper-tail cutoff, not the bulk decay. *Interpretation:* larger dense models commit fewer small slips (steeper bulk) but a relatively higher fraction of catastrophic fabrications (shallower upper tail). The paper’s headline applies to the upper-tail slope only. See §[4.6](https://arxiv.org/html/2606.05170#S4.SS6) for full robustness numbers.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x2.png)

Figure 2: $b$ vs. active parameter count. Dense (blue) shows a monotone downward trend. Reported as a sensitivity observation; the headline is §[4.1](https://arxiv.org/html/2606.05170#S4.SS1).

#### Scope.

Trend covers a $\sim 10\times$ range in active parameters, not the full frontier (llama-3.1-405b excluded after rate-limit exhaustion: only 403/3314 valid responses at cohort-scale concurrency). The MoE subset ($n = 7$) shows the same sign as dense ($\rho_s = -0.60$, $p = 0.16$) but cannot independently support the claim. Exp. 4 (§[4.5](https://arxiv.org/html/2606.05170#S4.SS5)) shows that domain-level $b$ is model-idiosyncratic, so this is *whole-model* tail shape, not any particular domain. Bootstrap, permutation, Bayesian, per-tier, and verbosity-controlled checks all preserve the qualitative sign (Appendix [S](https://arxiv.org/html/2606.05170#A19)).

### 4.4 Predicting catastrophes from micro-errors (Exp. 3)

Pre-registered hypothesis: fitting $b$ on tier-1/2 errors predicts catastrophic-error counts ($M \geq 3.0$) on tiers 4/5 via Gutenberg–Richter extrapolation. Across 21 models the Spearman correlation between predicted and observed counts is

$$\rho_s = 0.443, \quad p = 0.044, \quad \text{Kendall}~\tau = 0.325.$$

At a relaxed threshold $M \geq 2.5$, $\rho_s = 0.637$ ($p = 0.002$). The result lands in the WEAK band of the pre-registered schedule ([Figure 3](https://arxiv.org/html/2606.05170#A3.F3)): rank prediction is statistically significant but absolute-rate prediction fails. Only 4/21 ($M \geq 3$) and 1/21 ($M \geq 2.5$) of models land within $1.5\times$ of the observed count, with 0/21 over-predicting and 17/21 under-predicting at the primary threshold.
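The extrapolation step follows directly from the relation $\log_{10} N(M \geq m) = a - b\,(m - m_{\min})$: given a count at $m_{\min}$ and a slope $b$, the predicted count at a higher threshold is that count scaled by $10^{-b(m - m_{\min})}$. The sketch below shows this arithmetic and one possible reading of the "within $1.5\times$" check. How the paper anchors the count when moving from easy to hard tiers is not specified in this section, so the anchor is left as an input; all names and numbers are hypothetical.

```python
def predicted_exceedances(n_at_m_min, b, m, m_min):
    """Gutenberg-Richter extrapolation: N(M >= m) = N(M >= m_min) * 10^(-b (m - m_min))."""
    return n_at_m_min * 10 ** (-b * (m - m_min))


def within_factor(predicted, observed, factor=1.5):
    """True when predicted / observed lies in [1 / factor, factor] (assumed definition)."""
    if observed <= 0:
        raise ValueError("observed count must be positive")
    ratio = predicted / observed
    return 1 / factor <= ratio <= factor


# Illustrative numbers only: 1,000 errors at or above m_min = 1.0, slope b = 1.0.
pred = predicted_exceedances(n_at_m_min=1000, b=1.0, m=3.0, m_min=1.0)
hit = within_factor(pred, observed=14)    # ratio about 0.71
miss = within_factor(pred, observed=20)   # ratio 0.5, an under-prediction
```

With these inputs the prediction is about 10 events. An observed count of 20 falls outside the $1.5\times$ band on the under-prediction side, the direction the text reports for 17 of 21 models.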

#### Mechanistic explanation\.

Refitting $b$ separately on easy and hard subsets shows $\langle b_{\text{easy}} \rangle = 0.921$ and $\langle b_{\text{hard}} \rangle = 0.962$ ($\Delta = -0.041$, std 0.216). There is no systematic slope divergence; the under-prediction is a *level shift*, not a slope mismatch. Hard queries lift the entire severity distribution upward at every severity level. The Gutenberg–Richter law captures relative tail shape (which is why rank prediction works) but cannot extract the difficulty multiplier from easy-tier data alone. This partial-signal finding is consistent with Exp. 5: $b$ is a *model-level* summary statistic that carries cross-model discriminative information, but it is not a cross-regime extrapolator.

### 4.5 Domain variation (Exp. 4)

A Friedman test rejects equal $b$-value across the 8 domains ($\chi^2 = 15.94$, $p = 0.026$, $n = 21$). However, Kendall’s coefficient of concordance is $W = 0.108$, meaning that models *disagree* on which domains have heavy tails. BIO (mean $b = 0.849$) and FIN (0.841) are the heaviest-tailed domains on average; LAW (1.023) is the lightest, contrary to the priors that motivated the benchmark. Domain effects exist but are model-idiosyncratic (heatmap in Appendix [C](https://arxiv.org/html/2606.05170#A3)). This is why the headline negative-scaling result in §[4.3](https://arxiv.org/html/2606.05170#S4.SS3) is computed on model-aggregated $b$ values rather than per-domain slopes: there is no stable domain ranking to average across.

### 4.6 Training-pipeline pairs and sensitivity summary

Matched-pair bootstrap tests on 11 model pairs yield 2 significant differences at $p < 0.05$ (llama-3.2-3b vs. llama-3.1-8b, $\Delta b = -0.53$, $p < 0.001$; and qwen2.5-7b vs. llama-3.1-8b, $\Delta b = -0.32$, $p = 0.021$); the DeepSeek v3.1 $\to$ v3.2 version bump is *not* significant ($p = 0.87$). The full pair table is Appendix [O](https://arxiv.org/html/2606.05170#A15), and within-generation and version-bump comparisons appear in [Figures 7](https://arxiv.org/html/2606.05170#A15.F7) and [8](https://arxiv.org/html/2606.05170#A15.F8) (Appendix [U](https://arxiv.org/html/2606.05170#A21)).

Three pre-registered sensitivity checks were executed. S1 (scale coarsening) *fails*: collapsing the 9-point scale to 7 points or 5 levels drops the rank correlation to 0.43 and 0.16, so the full grid is load-bearing. S2 (overcall correction) *narrowly fails* at $\rho_s = 0.847$, only 0.003 below threshold. S3 (subsample stability) *passes* with median coefficient of variation 0.143. Full figures are in Appendix [U](https://arxiv.org/html/2606.05170#A21).

#### S5: $m_{\min}$ sensitivity.

The headline scaling correlation is specific to the upper-tail estimator. Under our default model-specific KS-selector $\rho_s = -0.562$ ($p = 0.006$); under fixed $m_{\min} = 1.5$ $\rho_s = +0.837$ ($p = 0.0002$); under Clauset-style $\geq 100$ exceedances (which selects $m_{\min} = 0.5$ for every model) $\rho_s = +0.793$ ($p = 0.0007$). *The sign flips.* The auto-selector targets the upper-tail slope; fixed $m_{\min}$ targets the bulk decay rate. Both are real: larger dense models commit fewer small slips (steeper bulk) and a relatively higher fraction of catastrophic fabrications (shallower upper tail). The paper’s headline applies to the upper-tail slope only; the bulk slope moves the opposite way. Full table: Appendix [T](https://arxiv.org/html/2606.05170#A20).

#### Deployment implications\.

[Figure 9](https://arxiv.org/html/2606.05170#A17.F9) (Appendix [Q](https://arxiv.org/html/2606.05170#A17)) translates the empirical severity distributions into expected catastrophic ($M \geq 3.0$) and severe ($M \geq 2.5$) events per million queries under i.i.d. deployment. Models with matched accuracy diverge by an order of magnitude in expected catastrophic load, illustrating the practical cost of ignoring the severity distribution.
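Under the i.i.d. assumption stated above, the per-million figure is just the empirical exceedance rate scaled by $10^6$. A minimal sketch with a hypothetical function name and toy scores (zeros are correct responses and count toward the denominator):

```python
def events_per_million(scores, threshold):
    """Expected events with severity >= threshold per 1,000,000 i.i.d. queries."""
    if not scores:
        raise ValueError("scores must be non-empty")
    exceed = sum(1 for s in scores if s >= threshold)
    return 1_000_000 * exceed / len(scores)


# Toy sample: ten responses, six of them correct (score 0.0).
toy = [0.0] * 6 + [0.5, 1.0, 2.5, 3.5]
catastrophic = events_per_million(toy, 3.0)
severe = events_per_million(toy, 2.5)
```

In the toy sample one response in ten is catastrophic and two in ten are severe, giving 100,000 and 200,000 events per million respectively.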

### 4.7 Severity mechanism taxonomy

To understand *what* $b$ captures mechanistically, three expert raters classified each error item in the 519-item validation study into one of six top-level mechanism categories (Fleiss $\kappa = 0.83$): retrieval (35.9%), generation/fabrication (21.6%), amplification (20.1%), reasoning (10.7%), format (7.2%), and metacognitive failure (4.4%).

#### Severity–mechanism coupling\.

Error mechanism shifts categorically with severity level. At low severity (0.5–1.0), 71% of errors are retrievals and 0% are fabrications. At mid severity (1.0–2.0), 47% are retrievals and 29% are amplifications. At high severity (2.0–4.0), only 14% are retrievals and 39% are fabrications. *What makes an error severe is not degree but kind:* moving up the severity scale shifts the mechanism from factual retrieval failure to confident content fabrication.

#### Size–mechanism coupling and deployment\.

Mechanism profiles also differ significantly by model size \(χ2\\chi^\{2\}test,p<0\.0001p<0\.0001\): small models \(33–99B\) show44\.7%44\.7\\%retrieval errors, while large models \(≥24\\geq 24B\) show29\.2%29\.2\\%fabrication errors\. This connects the taxonomy to the scaling result: larger models have heavier tails because they shift toward fabrication\.[Figure˜9](https://arxiv.org/html/2606.05170#A17.F9)\(Appendix[Q](https://arxiv.org/html/2606.05170#A17)\) translates the empirical severity distributions into expected catastrophic \(M≥3\.0M\\geq 3\.0\) and severe \(M≥2\.5M\\geq 2\.5\) events per million queries\. Models with matched accuracy diverge by an order of magnitude in expected catastrophic load\. The mechanism taxonomy gives this concrete meaning: the catastrophic events are predominantly fabrications \(39%39\\%of high\-severity items\), not retrieval errors\.

## 5 Theoretical Framework

We establish two formal results underpinning the empirical analysis\.

###### Theorem 1 (Non-Reducibility of Severity Profile).

Let $\mathcal{M}$ be a set of models, each with error rate $\varepsilon_{M} = P(S_{M} > 0)$ and severity distribution $F_{M}(s) = P(S_{M} \leq s \mid S_{M} > 0)$. (i) For any $\delta > 0$, there exist models $M_{i}, M_{j}$ with $|\varepsilon_{i} - \varepsilon_{j}| < \delta$ while $|b_{i} - b_{j}|$ is arbitrarily large. (ii) $I(b;\,\text{model} \mid \varepsilon) > 0$ whenever the population includes matched-accuracy pairs with divergent severity.

*Proof sketch.* Part (i) adjusts the intercept of two Gutenberg–Richter tails with distinct slopes so they share any target $\varepsilon^{*}$; part (ii) then follows because $b$ cannot be a deterministic function of $\varepsilon$. Full proofs are in Appendix [A](https://arxiv.org/html/2606.05170#A1). *Empirically:* $I(b;\,\text{model} \mid \varepsilon) = 1.56$ bits on our 21-model catalog (5-bin discretisation), and only $35.5\%$ of cross-model $b$ variance is explained by $\varepsilon$ ($R^{2} = 0.356$).

###### Proposition 2 (Resolution Bound).

The standard error of $\hat{b}$ from the Aki MLE satisfies $\mathrm{SE}(\hat{b}) \geq b/\sqrt{n_{\geq m_{\min}} \cdot r}$, where $r = \mathrm{ICC}(2,k)$ is the score reliability. Two models are distinguishable at $\alpha = 0.05$, power $= 0.80$ when $|b_{1} - b_{2}| \geq 2.80 \cdot \sqrt{2}\,\mathrm{SE}(\hat{b})$, where the factor $\sqrt{2}$ comes from differencing two independent estimates (Appendix [A](https://arxiv.org/html/2606.05170#A1)). On our data: median $\mathrm{SE} = 0.064$, minimum detectable $\Delta b = 0.253$; observed range $= [0.57, 1.31]$ ($0.74$ spread), confirming adequate power despite moderate ICC.

*Proof sketch.* The Aki MLE has variance proportional to $1/n_{\geq m_{\min}}$; score reliability $r$ shrinks the effective sample size to $n_{\geq m_{\min}} r$, giving the stated lower bound and the two-sample threshold. Appendix [A](https://arxiv.org/html/2606.05170#A1) gives the derivation.
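As a worked check of the resolution arithmetic, the sketch below evaluates the standard-error lower bound and the two-sample detection threshold, using the $\sqrt{2}$ factor from the Appendix A derivation. The function names are hypothetical; the only inputs taken from the text are $\alpha = 0.05$, power $0.80$, and the median $\mathrm{SE} = 0.064$.

```python
from math import sqrt
from statistics import NormalDist


def se_lower_bound(b: float, n_tail: int, reliability: float) -> float:
    """Lower bound b / sqrt(n_tail * r) on the standard error of the Aki estimate."""
    return b / sqrt(n_tail * reliability)


def min_detectable_delta_b(se: float, alpha: float = 0.05, power: float = 0.80) -> float:
    """Smallest |b1 - b2| detectable by a two-sided level-alpha test at the given power.

    Assumes two independent estimates with the same standard error `se`,
    so the standard error of the difference is sqrt(2) * se.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = z.inv_cdf(power)           # about 0.84 for power = 0.80
    return (z_alpha + z_beta) * sqrt(2) * se


# Median SE reported in the text; the result is close to the quoted 0.253.
print(min_detectable_delta_b(0.064))
```

The small difference from the quoted value comes from rounding $z_{\alpha/2} + z_{\beta}$ to $2.80$ before multiplying.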

## 6 Related Work

#### Binary hallucination benchmarks miss severity.

TruthfulQA (Lin et al., [2022b](https://arxiv.org/html/2606.05170#bib.bib19)), HaluEval (Li et al., [2023](https://arxiv.org/html/2606.05170#bib.bib16)), FaithDial (Dziri et al., [2022](https://arxiv.org/html/2606.05170#bib.bib8)), and the survey of Ji et al. ([2023](https://arxiv.org/html/2606.05170#bib.bib13)) establish the dominant factuality protocol: score each response as correct/incorrect and report an aggregate error rate. What they do not report is the *distribution* of error severity. Our empirical claim is that matched accuracy can still conceal an order-of-magnitude difference in catastrophic event rate (§[4.1](https://arxiv.org/html/2606.05170#S4.SS1)), and that scaling can improve accuracy while worsening the residual severity tail (§[4.3](https://arxiv.org/html/2606.05170#S4.SS3)).

#### Severity\-aware evaluation\.

Asgari et al. ([2025](https://arxiv.org/html/2606.05170#bib.bib2)) argue that hallucination severity is informative beyond the error count, but they use three ordinal bins and histogram summaries. We extend this line with a 9-level, $0.5$-spaced scale, a parametric tail index ($b$) anchored to the Gutenberg–Richter law, and a 21-model benchmark large enough to test matched-accuracy discrimination and scale trends. Domain- or modality-specific studies (Dahl et al., [2024](https://arxiv.org/html/2606.05170#bib.bib7); Colelough et al., [2025](https://arxiv.org/html/2606.05170#bib.bib6); Chang et al., [2025](https://arxiv.org/html/2606.05170#bib.bib4); Zuo and Jiang, [2024](https://arxiv.org/html/2606.05170#bib.bib25); Pandit et al., [2025](https://arxiv.org/html/2606.05170#bib.bib20); Seth et al., [2024](https://arxiv.org/html/2606.05170#bib.bib21); Atwany et al., [2025](https://arxiv.org/html/2606.05170#bib.bib3); Li et al., [2024](https://arxiv.org/html/2606.05170#bib.bib17)) are complementary: they refine type taxonomies within one domain, whereas we target cross-model tail-shape comparison across eight general domains.

#### Calibration, heavy tails, and scaling\.

Confidence calibration asks whether a model knows when it is wrong (Guo et al., [2017](https://arxiv.org/html/2606.05170#bib.bib10); Lin et al., [2022a](https://arxiv.org/html/2606.05170#bib.bib18); Kadavath et al., [2022](https://arxiv.org/html/2606.05170#bib.bib14)); we ask how severe the errors are when it is wrong. The statistical machinery comes from heavy-tail modeling (Taleb, [2020](https://arxiv.org/html/2606.05170#bib.bib22); Clauset et al., [2009](https://arxiv.org/html/2606.05170#bib.bib5)), while the baseline intuition that larger models are better comes from scaling-law work (Kaplan et al., [2020](https://arxiv.org/html/2606.05170#bib.bib15); Hoffmann et al., [2022](https://arxiv.org/html/2606.05170#bib.bib12); Wei et al., [2022](https://arxiv.org/html/2606.05170#bib.bib24)) and MoE parameterization (Fedus et al., [2022](https://arxiv.org/html/2606.05170#bib.bib9)). Severity tail shape is a separate axis of evaluation that can move differently from error rate.

## 7 Discussion

#### What the paper establishes.

Three evidence pathways converge. First, 85 matched-accuracy model pairs have disjoint $95\%$ $b$ intervals on human-consensus scoring, showing that severity distribution carries discriminative information invisible to $\varepsilon$. Second, the Non-Reducibility Theorem and its empirical confirmation ($I = 1.56$ bits; $R^{2} = 0.356$) show this is not a redundant restatement of error rate. Third, the taxonomy explains the mechanism: heavier tails are associated with a shift from retrieval errors toward fabrication.
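The counting rule behind the first pathway can be sketched as follows. This is a minimal illustration, not the released toolkit: the `Fit` record, the function name, and the four example fits are hypothetical, while the matching tolerance $|\Delta\varepsilon| < 0.05$ and the disjoint-interval criterion are taken from the abstract.

```python
from itertools import combinations
from typing import List, NamedTuple, Tuple


class Fit(NamedTuple):
    name: str
    eps: float   # error rate
    b_lo: float  # lower end of the 95% bootstrap CI for b
    b_hi: float  # upper end of the 95% bootstrap CI for b


def disjoint_matched_pairs(fits: List[Fit], max_delta_eps: float = 0.05) -> List[Tuple[str, str]]:
    """Return model pairs with matched accuracy and non-overlapping b intervals."""
    pairs = []
    for x, y in combinations(fits, 2):
        matched = abs(x.eps - y.eps) < max_delta_eps
        disjoint = x.b_hi < y.b_lo or y.b_hi < x.b_lo
        if matched and disjoint:
            pairs.append((x.name, y.name))
    return pairs


# Hypothetical fits for illustration only.
fits = [
    Fit("m1", 0.58, 0.60, 0.75),
    Fit("m2", 0.59, 0.90, 1.10),
    Fit("m3", 0.60, 0.70, 0.95),
    Fit("m4", 0.40, 1.20, 1.30),
]
print(disjoint_matched_pairs(fits))
```

For a 21-model catalog the loop visits $\binom{21}{2} = 210$ pairs, which is the denominator quoted in the headline.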

#### Interpretation\.

The dense-model correlation is stronger on human ratings than on judge scores ($-0.86$ vs. $-0.56$), and the $m_{\min}$ sweep shows why: larger models have steeper *bulk* decay but shallower *upper-tail* decay. Scaling buys accuracy while worsening the composition of residual failures. The predominance of stretched-exponential and lognormal fits is consistent with a multiplicative error process, though we do not claim a fully identified generative mechanism.

## 8 Limitations and misuse

#### Judge overcalling and scale resolution.

A 340-item manual audit classifies $33.5\%$ of judge score-$2.0$ verdicts as overcalls; verbose responses are overcalled more (Appendix [L](https://arxiv.org/html/2606.05170#A12)). The headline survives bootstrap overcall correction ($\rho_{s} = 0.847$, S2, just below the $0.85$ threshold). Sensitivity S1 also shows that collapsing the 9-point severity grid to 7 or 5 levels destroys the $b$ ranking ($\rho_{s} = 0.43$ and $0.16$, respectively); practitioners must use the full grid.

#### Model coverage and scope\.

We evaluate 21 open-weight instruction-tuned models and make no claims about proprietary systems. Frontier-dense coverage is constrained: llama-3.1-405b-instruct, gpt-oss-120b, and minimax-m2.5 were excluded from the main analysis after rate-limit exhaustion. We re-attempted llama-3.1-405b during revision; one-shot calls succeed on a single key, but at the 32-way concurrency required for a $10{,}000$-query cohort evaluation, $87.5\%$ of requests return 429 or 403 ($403/3314$ valid responses, $12.2\%$ success). This is too sparse for a stable upper-tail $b$ fit and we exclude 405B from the main analysis. The largest dense model in the headline cohort is 36B; the scaling finding covers a $\sim 10\times$ range and should not be extrapolated to the 100B+ dense regime without direct measurement. Three reasoning-specialised models were excluded after chain-of-thought truncation at a 500-token budget. Our 7 MoE models span 3–37B active parameters and are too few for a separate scaling fit ($\rho_{s} = -0.595$, $p = 0.159$, $n = 7$: same sign, not significant).

#### Human validation scope\.

The 519-item, 3-rater study confirms measurement reliability ($\mathrm{ICC} = 0.85$) and judge validity ($\rho = 0.89$), but covers ${\sim}35$ items per model, which is adequate for ICC and ranking but limited for per-model $b$-value precision. The full $186{,}521$-item human-consensus scoring (Appendix [B](https://arxiv.org/html/2606.05170#A2)) provides higher precision.

#### Prediction and misuse\.

Experiment 3 fails its primary pre-registered criterion ($\rho_{s} \geq 0.75$ at $M \geq 3$); only the moderate rank signal $\rho_{s} = 0.443$ ($p = 0.044$) holds. Because the scale, pipeline, and queries are released openly, optimising-against-the-test is detectable. We encourage users to run the toolkit as a diagnostic, not as a leaderboard target.

#### Broader impacts\.

Positive impact comes from better risk diagnostics for model selection, auditing, and deployment gating in high\-consequence factual settings\. Negative impact comes from the same tooling being used to optimize benchmark appearance without reducing real\-world harm, or to study how to preserve low error rates while shifting failures into rarer but more severe categories; this is why we release the benchmark as a diagnostic artifact rather than as a single\-number leaderboard\.

#### LLM\-generated queries\.

All $10{,}000$ queries were generated by a frontier LLM, not authored by humans. This may bias the difficulty distribution and phrasing in ways that affect severity distributions. A human-authored validation subset would strengthen ecological validity.

## 9 Conclusion

At matched accuracy, open-weight LLMs differ in tail shape in ways the error rate cannot see. In our 21-model catalog, 85 of 210 pairs have disjoint $95\%$ $b$ confidence intervals on human-consensus scoring; the theorem and mutual information analysis show that this is genuinely new information, not a restatement of $\varepsilon$; and the taxonomy shows that the tail shift corresponds to a change from retrieval errors toward fabrication. Our operational recommendation is simple: *report $b$ alongside $\varepsilon$ whenever you report an error rate.*

## References

- Aki [1965] Keiiti Aki. Maximum likelihood estimate of $b$ in the formula $\log n = a - bm$ and its confidence limits. *Bulletin of the Earthquake Research Institute*, 43:237–239, 1965.
- Asgari et al. [2025] Pardis Asgari et al. Beyond accuracy: Measuring the severity of LLM hallucinations. *Findings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP Findings)*, 2025.
- Atwany et al. [2025] Hanin Atwany, Abdul Waheed, Rita Singh, Monojit Choudhury, and Bhiksha Raj. Lost in transcription, found in distribution shift: Demystifying hallucination in speech foundation models. In *Findings of the Association for Computational Linguistics: ACL 2025*, 2025. Introduces the Hallucination Error Rate (HER) metric.
- Chang et al. [2025] Aofei Chang, Le Huang, Parminder Bhatia, Taha Kass-Hout, Fenglong Ma, and Cao Xiao. MedHEval: Benchmarking hallucinations and mitigation strategies in medical large vision-language models. *arXiv preprint arXiv:2503.02157*, 2025.
- Clauset et al. [2009] Aaron Clauset, Cosma Rohilla Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. *SIAM Review*, 51(4):661–703, 2009.
- Colelough et al. [2025] Brandon Colelough, Davis Bartels, and Dina Demner-Fushman. Overview of the ClinIQLink 2025 shared task on medical question-answering. In *Proceedings of the 24th BioNLP Workshop (ACL 2025)*, 2025.
- Dahl et al. [2024] Matthew Dahl, Varun Magesh, Mirac Suzgun, and Daniel E. Ho. DAHL: Domain-specific hallucination decomposition for legal LLMs. In *Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2024.
- Dziri et al. [2022] Nouha Dziri, Ehsan Kamalloo, Sivan Milton, Osmar Zaiane, Mo Yu, Edoardo Ponti, and Siva Reddy. FaithDial: A faithful benchmark for information-seeking dialogue. *Transactions of the Association for Computational Linguistics (TACL)*, 10:1473–1490, 2022.
- Fedus et al. [2022] William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. *Journal of Machine Learning Research*, 23(120):1–39, 2022.
- Guo et al. [2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In *Proceedings of the 34th International Conference on Machine Learning (ICML)*, pages 1321–1330, 2017.
- Gutenberg and Richter [1944] B. Gutenberg and C. F. Richter. Frequency of earthquakes in California. *Bulletin of the Seismological Society of America*, 34(4):185–188, 1944.
- Hoffmann et al. [2022] Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, et al. Training compute-optimal large language models. In *Advances in Neural Information Processing Systems (NeurIPS)*, 2022.
- Ji et al. [2023] Ziwei Ji, Nayeon Lee, Rita Frieske, Tiezheng Yu, Dan Su, Yan Xu, Etsuko Ishii, Yejin Bang, Andrea Madotto, and Pascale Fung. Survey of hallucination in natural language generation. *ACM Computing Surveys*, 55(12):1–38, 2023.
- Kadavath et al. [2022] Saurav Kadavath, Tom Conerly, Amanda Askell, Tom Henighan, Dawn Drain, Ethan Perez, Nicholas Schiefer, et al. Language models (mostly) know what they know. *arXiv preprint arXiv:2207.05221*, 2022.
- Kaplan et al. [2020] Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei. Scaling laws for neural language models. *arXiv preprint arXiv:2001.08361*, 2020.
- Li et al. [2023] Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In *Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, pages 6449–6464, 2023.
- Li et al. [2024] Junyi Li et al. HaluEval 2.0: Updated hallucination evaluation benchmark for large language models. In *Findings of the Association for Computational Linguistics (ACL Findings)*, 2024.
- Lin et al. [2022a] Stephanie Lin, Jacob Hilton, and Owain Evans. Teaching models to express their uncertainty in words. *Transactions on Machine Learning Research*, 2022a.
- Lin et al. [2022b] Stephanie Lin, Jacob Hilton, and Owain Evans. TruthfulQA: Measuring how models mimic human falsehoods. In *Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (ACL)*, pages 3214–3252, 2022b.
- Pandit et al. [2025] Shrey Pandit, Jiawei Xu, Junyuan Hong, Zhangyang Wang, Tianlong Chen, Kaidi Xu, and Ying Ding. MedHallu: A comprehensive benchmark for detecting medical hallucinations in large language models. In *Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing (EMNLP)*, 2025.
- Seth et al. [2024] Ashish Seth, Dinesh Manocha, and Chirag Agarwal. Towards a systematic evaluation of hallucinations in large-vision language models (HALLUCINOGEN). *arXiv preprint arXiv:2412.20622*, 2024.
- Taleb [2020] Nassim Nicholas Taleb. *Statistical Consequences of Fat Tails: Real World Preasymptotics, Epistemology, and Applications*. STEM Academic Press, 2020.
- Vuong [1989] Quang H. Vuong. Likelihood ratio tests for model selection and non-nested hypotheses. *Econometrica*, 57(2):307–333, 1989.
- Wei et al. [2022] Jason Wei, Yi Tay, Rishi Bommasani, Colin Raffel, Barret Zoph, Sebastian Borgeaud, et al. Emergent abilities of large language models. *Transactions on Machine Learning Research*, 2022.
- Zuo and Jiang [2024] Kaiwen Zuo and Yirui Jiang. MedHallBench: A new benchmark for assessing hallucination in medical large language models. *arXiv preprint arXiv:2412.18947*, 2024.

## Appendix A Proof details

###### Proof of Theorem [1](https://arxiv.org/html/2606.05170#Thmtheorem1).

For part (i), fix a positive severity grid $\mathcal{S} = \{m_{\min}, m_{\min}+\delta, \ldots, s_{\max}\}$ and define a normalized Gutenberg–Richter tail $q_{b}(s) \propto 10^{-b(s-m_{\min})}$ on $\mathcal{S}$. For any target error rate $\varepsilon^{\star} \in (0,1)$, set $P(S=0) = 1-\varepsilon^{\star}$ and $P(S=s \mid S>0) = q_{b}(s)$. Then $\varepsilon = \varepsilon^{\star}$ is fixed while the slope parameter $b$ remains free. Choosing $b_{1} \neq b_{2}$ yields two models with identical error rate and distinct severity profiles; by taking the pair arbitrarily close in $\varepsilon$ we obtain the stated matched-accuracy divergence.

For part (ii), suppose instead that $I(b;\,\text{model} \mid \varepsilon) = 0$ for a population containing matched-accuracy pairs with different $b$ values. Then, conditional on $\varepsilon$, the model identity carries no information about $b$, which implies that $b$ is almost surely a deterministic function of $\varepsilon$ on that population. Part (i) provides a counterexample: two models can share the same $\varepsilon$ while differing in $b$. Hence the conditional mutual information must be strictly positive. ∎
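The construction in part (i) can be made concrete with a few lines of code. The sketch below builds two severity distributions on the $0.5$-spaced grid that share the same error rate but use different tail slopes; the function names and the specific values $b = 1.3$, $b = 0.6$, and $\varepsilon^{\star} = 0.4$ are illustrative assumptions, not fitted quantities.

```python
from typing import Dict


def gr_severity_model(b: float, eps: float, m_min: float = 0.5,
                      step: float = 0.5, s_max: float = 4.0) -> Dict[float, float]:
    """Probability mass function with P(S=0) = 1 - eps and a normalized
    Gutenberg-Richter tail q_b(s) proportional to 10**(-b * (s - m_min))."""
    n_levels = int(round((s_max - m_min) / step)) + 1
    grid = [m_min + i * step for i in range(n_levels)]
    weights = [10 ** (-b * (s - m_min)) for s in grid]
    total = sum(weights)
    pmf = {0.0: 1.0 - eps}
    for s, w in zip(grid, weights):
        pmf[s] = eps * w / total
    return pmf


def error_rate(pmf: Dict[float, float]) -> float:
    """P(S > 0)."""
    return sum(p for s, p in pmf.items() if s > 0)


def tail_mass(pmf: Dict[float, float], threshold: float) -> float:
    """P(S >= threshold)."""
    return sum(p for s, p in pmf.items() if s >= threshold)


light = gr_severity_model(b=1.3, eps=0.4)  # steep slope, light tail
heavy = gr_severity_model(b=0.6, eps=0.4)  # shallow slope, heavy tail

print(error_rate(light), error_rate(heavy))          # equal up to rounding
print(tail_mass(light, 3.0), tail_mass(heavy, 3.0))  # heavy tail carries more mass
```

Because the tail is normalized before being scaled by $\varepsilon^{\star}$, changing $b$ redistributes mass among the positive severity levels without changing the total error probability.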

###### Proof of Proposition [2](https://arxiv.org/html/2606.05170#Thmtheorem2).

For the Aki estimator on exceedances above $m_{\min}$, $\hat{b} = \log_{10}\mathrm{e}/(\bar{m} - m_{\min} + \delta/2)$, standard delta-method calculations give asymptotic variance $\operatorname{Var}(\hat{b}) \approx b^{2}/n_{\geq m_{\min}}$ when the severity scores are measured without annotation noise. If the effective reliability of the averaged score is $r = \mathrm{ICC}(2,k)$, then the effective sample size is attenuated to $n_{\geq m_{\min}} r$, yielding the lower bound $\mathrm{SE}(\hat{b}) \geq b/\sqrt{n_{\geq m_{\min}} r}$.

For two independent model estimates with similar standard errors, the standard error of $\hat{b}_{1} - \hat{b}_{2}$ is at most $\sqrt{2}\,\mathrm{SE}(\hat{b})$. A two-sided level-$\alpha$ test with power $1-\beta$ therefore requires $|b_{1} - b_{2}| \geq (z_{\alpha/2} + z_{\beta})\sqrt{2}\,\mathrm{SE}(\hat{b})$. Substituting $\alpha = 0.05$ and $1-\beta = 0.80$ gives $z_{\alpha/2} + z_{\beta} = 1.96 + 0.84 = 2.80$, the constant used in the proposition. ∎
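The Aki estimator used in the proof above, together with a percentile bootstrap interval, can be sketched in plain Python. This is a minimal illustration under stated assumptions: the function names and the two score lists are hypothetical, the grid spacing is taken as $\delta = 0.5$, and `n_boot = 2000` mirrors the resample count quoted for Table 3. It is not the released toolkit.

```python
import math
import random
from typing import List, Sequence, Tuple


def aki_b(scores: Sequence[float], m_min: float, step: float = 0.5) -> float:
    """Aki estimate b = log10(e) / (mean(m) - m_min + step/2) on scores >= m_min."""
    tail = [s for s in scores if s >= m_min]
    if not tail:
        raise ValueError("no exceedances at or above m_min")
    mean_m = sum(tail) / len(tail)
    return math.log10(math.e) / (mean_m - m_min + step / 2)


def bootstrap_ci(scores: Sequence[float], m_min: float, n_boot: int = 2000,
                 level: float = 0.95, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap interval for b, resampling the exceedances."""
    rng = random.Random(seed)
    tail: List[float] = [s for s in scores if s >= m_min]
    estimates = sorted(
        aki_b(rng.choices(tail, k=len(tail)), m_min) for _ in range(n_boot)
    )
    lo_idx = int((1 - level) / 2 * n_boot)
    hi_idx = int((1 + level) / 2 * n_boot) - 1
    return estimates[lo_idx], estimates[hi_idx]


# Hypothetical exceedance samples: the second has more mass at high severity.
light_tail = [1.0] * 60 + [1.5] * 25 + [2.0] * 10 + [2.5] * 5
heavy_tail = [1.0] * 40 + [1.5] * 25 + [2.0] * 15 + [2.5] * 10 + [3.0] * 6 + [3.5] * 4

for name, sample in (("light", light_tail), ("heavy", heavy_tail)):
    print(name, aki_b(sample, m_min=1.0), bootstrap_ci(sample, m_min=1.0))
```

A smaller $b$ corresponds to a heavier upper tail, so the second sample should yield the lower estimate.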

## Appendix B 4K vs 10K scale-up comparison

This appendix accompanies the v6 scale-up from 4,000 to 10,000 queries (Errorquake-10k). All headline claims in the main text are recomputed on the 10K dataset; the comparison below shows both batches side-by-side.

Table 2: 4K-vs-10K comparison of the LLM-judge baseline metrics.

#### B1: Hierarchical bootstrap (judge-noise-aware).

Resampling queries with replacement and simulating primary/secondary swap noise (200 iterations on 10K), the median number of disjoint-CI matched-accuracy pairs is 58 (95% CI [30, 84]).

#### B2: Fixed-$m_{\min}$ discriminator counts.

#### B3: Model\-agnostic tail\-slope estimators\.

Log-linear regression over the $\{2.5, 3.0, 3.5, 4.0\}$ upper-bin counts gives 68 matched-accuracy pairs with $|\Delta b_{\text{ll}}| > 0.15$. The empirical tail ratio $P(M \geq 3)/P(M \geq 1)$ gives 45 pairs with $|\Delta\,\text{tail\_ratio}| > 0.01$.
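Both model-agnostic estimators are easy to sketch. The code below is an illustration under assumptions: it regresses $\log_{10}$ of the per-bin counts (not cumulative counts) on severity, which is one reading of "upper-bin counts", and the function names and example data are hypothetical.

```python
import math
from typing import Sequence


def tail_ratio(scores: Sequence[float], hi: float = 3.0, lo: float = 1.0) -> float:
    """Empirical P(M >= hi) / P(M >= lo)."""
    n_lo = sum(1 for s in scores if s >= lo)
    n_hi = sum(1 for s in scores if s >= hi)
    if n_lo == 0:
        raise ValueError("no scores at or above the lower threshold")
    return n_hi / n_lo


def loglinear_b(scores: Sequence[float],
                bins: Sequence[float] = (2.5, 3.0, 3.5, 4.0)) -> float:
    """Minus the OLS slope of log10(count) against severity over the upper bins.

    Assumption: counts are per-bin counts; empty bins are skipped.
    """
    xs, ys = [], []
    for m in bins:
        count = sum(1 for s in scores if s == m)
        if count > 0:
            xs.append(m)
            ys.append(math.log10(count))
    if len(xs) < 2:
        raise ValueError("need at least two non-empty bins")
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return -slope


# Hypothetical counts that fall by a factor of 10 per 0.5 severity step.
example = [2.5] * 1000 + [3.0] * 100 + [3.5] * 10 + [4.0] * 1
print(loglinear_b(example))
print(tail_ratio([1.0] * 90 + [3.0] * 10))
```

Neither estimator depends on choosing $m_{\min}$ or a distribution family, which is why they serve as a robustness check on the Aki-based $b$.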

#### B4: Binomial catastrophic\-rate test\.

Fisher's exact test on per-model counts at $M \geq 3.0$, applied to each matched-accuracy pair and then BH-FDR corrected, yields 58 significant pairs at $q < 0.05$; at $M \geq 2.5$, 76 pairs are significant.
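The two steps of this test, a two-sided Fisher exact test per pair followed by Benjamini-Hochberg adjustment, can be sketched with the standard library alone. The function names and the example counts are hypothetical; the two-sided rule used here (summing all tables no more probable than the observed one) is a common convention and is assumed rather than taken from the paper.

```python
from math import comb
from typing import List, Sequence


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Hypergeometric probability of x events in the first row.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lo, hi = max(0, col1 - row2), min(row1, col1)
    p_obs = prob(a)
    p = sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))
    return min(1.0, p)


def benjamini_hochberg(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical catastrophic counts (events, non-events) for three matched pairs.
pairs = [((40, 9960), (12, 9988)), ((20, 9980), (18, 9982)), ((35, 9965), (15, 9985))]
p_vals = [fisher_exact_two_sided(x[0], x[1], y[0], y[1]) for x, y in pairs]
q_vals = benjamini_hochberg(p_vals)
print(sum(q < 0.05 for q in q_vals), "pairs significant at q < 0.05")
```

Exact integer arithmetic in `comb` avoids overflow for tables with $10{,}000$ queries per model, at the cost of some speed.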

#### B5: Judge leniency\.

Kruskal–Wallis $H = 96286.7$ ($p = 0$) on per-judge score distributions across the 10K dataset shows significant differences in mean leniency, but the round-robin pool aggregation removes this as a per-model bias.

## Appendix C Supplementary figures (Exp. 1, 3, and 4)

[Figure 3](https://arxiv.org/html/2606.05170#A3.F3) is the prediction calibration plot for Experiment 3 (two panels: the pre-registered $M \geq 3.0$ threshold and the exploratory $M \geq 2.5$ threshold). [Figure 4](https://arxiv.org/html/2606.05170#A3.F4) gives the full 21-model magnitude-frequency grid (small multiples) that [Figure 1](https://arxiv.org/html/2606.05170#S4.F1) summarises with four representative models. [Figure 5](https://arxiv.org/html/2606.05170#A3.F5) is the BIC heatmap showing $\Delta$BIC between each candidate distribution family and the BIC-best for each model (stars mark the selected family). [Figure 6](https://arxiv.org/html/2606.05170#A3.F6) is the $21 \times 8$ model-by-domain $b$ heatmap.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x3.png)

Figure 3: Predicted vs. observed catastrophic counts per model for Experiment 3. Left: pre-registered $M \geq 3.0$ threshold, $\rho_{s} = 0.443$. Right: exploratory $M \geq 2.5$ threshold, $\rho_{s} = 0.637$. Solid line is identity, dashed lines mark $\pm 1.5\times$. Only $4/21$ models land within $1.5\times$ at $M \geq 3.0$; $0/21$ over-predict.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x4.png)

Figure 4: All 21 models sorted by $b$ (heaviest-tailed at top). Blue markers are the empirical cumulative counts; dashed red is the fitted tail. Compact form of [Figure 1](https://arxiv.org/html/2606.05170#S4.F1).

![Refer to caption](https://arxiv.org/html/2606.05170v1/x5.png)

Figure 5: $\Delta$BIC between each distribution family and the best-fit family for each of the 21 models, capped at 60. Stars mark the selected family.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x6.png)

Figure 6: $b$ by model $\times$ domain. Red = heavier tail. Per-model rows are sorted by mean $b$. Kendall $W = 0.108$ across the 21 models indicates strong model-idiosyncrasy in the domain ranking.

## Appendix D Full 21-model results table

[Table 3](https://arxiv.org/html/2606.05170#A4.T3) reports each model's $b$ with $95\%$ bootstrap confidence interval, the selected $m_{\min}$, the number of events at or above $m_{\min}$, the total error count, the error rate, and the BIC-best distribution family. Rows are sorted by $b$ (heaviest tail first). Active-parameter counts and architecture labels (dense vs MoE) appear in Appendix [N](https://arxiv.org/html/2606.05170#A14).

Table 3: Full 21-model $b$ table. Bootstrap CIs from $n = 2000$ resamples.

## Appendix E Query benchmark construction

Errorquake-10k was generated by a dense frontier model prompted to produce stratified queries in eight domains and five difficulty tiers, then verified by an independent dual-judge audit that flagged tier-miscalibrated cells. Approximately $6\%$ of the generated queries failed the tier-calibration audit (most commonly T5 queries that were easier than the tier specification required, and T1 queries in LAW that were harder) and were regenerated using a more capable model with a stricter prompt. Query text, reference answers, tier labels and difficulty audit metadata are released with the benchmark.

## Appendix F Exp. 2 cross-domain jackknife (judge baseline)

Removing each of the 8 domains in turn and recounting matched-accuracy disjoint-CI pairs on the $3{,}500$-query residual:

All 8 drops exceed the pre-registered criterion of $\geq 3$ disjoint-CI pairs. The judge-baseline discriminator is not driven by any single domain.

## Appendix G Exp. 2 judge-aggregation robustness (judge baseline)

Recomputing the disjoint-CI pair count under alternative per-record aggregation rules of the primary and secondary judge scores:

Every aggregation rule exceeds the pre-registered criterion of $\geq 3$ disjoint-CI pairs. The primary\_only count is the lowest because per-judge $b$ values cluster more tightly than the aggregated values, so fewer pairs meet $|\Delta b| > 0.15$; the pairs that do qualify, however, still have disjoint CIs.

## Appendix H Judge LOO ablation (full)

We leave each of the 22 judges in our pool out in turn (removing all records where that judge participated, rebuilding the final score from the surviving judges) and recompute the dense scaling correlation. The *sign* is preserved in $22/22$ drops; the $p < 0.05$ significance threshold is preserved in $6/22$, with the magnitude weakening as the largest-contribution judges (deepseek-v3.2, qwen3-next-80b, eurollm-9b) are removed. Full per-judge rows are in results/analysis/judge\_loo\_ablation.json. The *sign stability* is consistent with the headline discriminator (§[4.1](https://arxiv.org/html/2606.05170#S4.SS1)), which passes under all judge drops; the *magnitude instability* is the reason we demote the scaling correlation to a sensitivity observation (§[4.3](https://arxiv.org/html/2606.05170#S4.SS3)).

## Appendix I Extended human validation (100-item pilot)

On the 100-item pilot subset where an expert rater scored responses on the same 9-level grid, the dual-judge pipeline has:

Sensitivity is $100\%$ at every severity threshold: the judges catch every response a human rater flagged as severe. The low PPV ($0.18$ at $M \geq 2.5$) reflects the overcall problem documented in Appendix [L](https://arxiv.org/html/2606.05170#A12): the judges flag $\sim 5\times$ as many items as severe as a human would, so absolute severity-count estimates are inflated. The *ranking-of-models* headline discriminator of §[4.1](https://arxiv.org/html/2606.05170#S4.SS1) is unaffected because overcalling is not specific to particular models (Appendix [L](https://arxiv.org/html/2606.05170#A12) shows per-model overcall rates range from $15\%$ to $55\%$ around the $33.5\%$ pooled rate). Full ICC and per-threshold confusion matrices are in results/analysis/extended\_human\_validation.json. The pilot covers only 3 models, so human-only scaling verification at $n \geq 5$ dense models is not yet feasible; this is flagged as a limitation in §[8](https://arxiv.org/html/2606.05170#S8).
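The per-threshold sensitivity and PPV discussed above follow from a simple confusion count, treating the human score as ground truth. The sketch below is illustrative: the function name and the five paired scores are hypothetical and do not come from the pilot data.

```python
from typing import Sequence, Tuple


def sensitivity_ppv(human: Sequence[float], judge: Sequence[float],
                    threshold: float) -> Tuple[float, float]:
    """Sensitivity and positive predictive value of the judge at a severity threshold.

    An item is 'positive' when its score is >= threshold; the human score is
    treated as the reference label.
    """
    tp = sum(1 for h, j in zip(human, judge) if h >= threshold and j >= threshold)
    fn = sum(1 for h, j in zip(human, judge) if h >= threshold and j < threshold)
    fp = sum(1 for h, j in zip(human, judge) if h < threshold and j >= threshold)
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sens, ppv


# Hypothetical paired scores: the judge catches both severe items but also
# flags two items the human rated below the threshold.
human_scores = [3.0, 0.5, 1.0, 0.0, 2.5]
judge_scores = [3.0, 2.5, 3.0, 0.0, 2.5]
print(sensitivity_ppv(human_scores, judge_scores, threshold=2.5))
```

A PPV near $0.18$ with perfect sensitivity means the judge flags roughly $1/0.18 \approx 5.6$ items for every one the human rater would, consistent with the $\sim 5\times$ inflation described in the text.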

## Appendix J Scoring and judge prompts

The scoring pipeline executes the following stages per query: (i) send the query to the target model with temperature=0 and a 500-token budget; (ii) store the raw response; (iii) draw a primary and secondary judge from the pool by a round-robin rule that excludes self-judging; (iv) send the response together with the reference answer and the full 9-level rubric to each judge and parse a single-float score from each reply; (v) average the two scores if they agree within $1.0$, else invoke a tiebreaker judge and take the median. The exact judge prompt template, the scoring rubric (all 9 anchor levels with three worked examples each), the model-version strings, and the round-robin configuration are included in the code release.
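Stage (v), the aggregation rule, can be sketched as a small function. This is a hedged illustration rather than the released implementation: the name `aggregate_scores`, the callable `tiebreaker` argument, and the choice to treat a gap of exactly $1.0$ as agreement are all assumptions.

```python
from statistics import median
from typing import Callable, Optional


def aggregate_scores(primary: float, secondary: float,
                     tiebreaker: Optional[Callable[[], float]] = None,
                     tolerance: float = 1.0) -> float:
    """Combine two judge scores into one severity score.

    If the scores agree within `tolerance` (assumed inclusive), return their
    mean. Otherwise call `tiebreaker` for a third score and return the median
    of the three.
    """
    if abs(primary - secondary) <= tolerance:
        return (primary + secondary) / 2
    if tiebreaker is None:
        raise ValueError("judges disagree and no tiebreaker was supplied")
    return median([primary, secondary, tiebreaker()])


# Agreement: the mean is used and the tiebreaker is never called.
print(aggregate_scores(1.0, 2.0))
# Disagreement: a hypothetical third judge returns 2.5 and the median is used.
print(aggregate_scores(0.5, 3.0, tiebreaker=lambda: 2.5))
```

Passing the tiebreaker as a callable keeps the third judge call lazy, so it is only issued when the first two scores disagree. Note that a mean of two grid scores can fall between anchors (for example $1.25$); how the pipeline maps such values back to the grid is not specified here.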

## Appendix K Error-severity scale anchors

The 9-level continuous severity scale has anchors at $\{0.0, 0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}$ with the following semantic labels:

0.0 (correct). The response answers the query accurately and completely.

0.5 (trivial imprecision). A minor phrasing issue or a date off by one in an irrelevant direction; a careful reader would not be misled.

1.0 (minor imprecision). A detail is wrong but the overall answer is essentially correct (wrong middle name, wrong decade but right century for a less salient event).

1.5 (moderate imprecision with possible misdirection). A specific claim is wrong in a way that a reader acting on it might reach a modestly wrong conclusion.

2.0 (moderate error). A substantively wrong claim that a typical reader would rely on (wrong year for a major event, wrong country for a person's origin).

2.5 (substantial error). Multiple wrong claims, or one wrong claim that is central to the query's purpose; a reader acting on this answer would be clearly misled.

3.0 (major error). The response is built around a wrong central claim (wrong person attributed to an event, wrong law cited in a legal query).

3.5 (minor fabrication). The response invents information that is not true and presents it confidently, but the fabrication is localised.

4.0 (major fabrication). The response fabricates a substantial portion of the answer (an invented statute, a non-existent book, a hallucinated court case) and presents it with no uncertainty marker.

Each level has three worked examples released with the benchmark. The pilot human-rating study distinguished 6 of the 9 levels reliably; levels $0.5$, $1.5$, $2.5$, $3.5$ were used less often by the human rater and less often still by the LLM judges (Appendix [L](https://arxiv.org/html/2606.05170#A12)).

## Appendix L Overcall diagnostic (clearest examples)

We manually classified 340 score-$2.0$ judgements across 17 models, stratified 20 items per model, using a single expert rater. Each item was placed into one of three categories: *genuine* (the judge's $2.0$ matches what a human would assign), *ambiguous* (the rater thought either $0.5$/$1.0$ or $2.0$ was defensible), or *overcall* (the judge marked $2.0$ for a response that a human would score $0.0$ or $0.5$).

Overall: $163/340$ genuine ($47.9\%$), $63/340$ ambiguous ($18.5\%$), $114/340$ overcall ($33.5\%$). Per-model overcall rates are reported in [Figure 12](https://arxiv.org/html/2606.05170#A21.F12). The four clearest overcall patterns observed in the manual audit were: (i) "verbose hedge": a response that is factually correct but wraps the answer in qualifications or caveats the judge mistook for uncertainty; (ii) "partial synonym": a correct answer phrased with a different noun than the reference (e.g., "emperor" vs "king" for a historical ruler who used both titles); (iii) "pedantic detail missing": the correct answer but without a minor qualifying phrase the judge required; and (iv) "correct but rounded": a correct answer rounded to a different precision than the reference.

## Appendix M Experiment 3 per-model breakdown

[Figure 3](https://arxiv.org/html/2606.05170#A3.F3) aggregates the prediction results; the per-model breakdown (predicted vs observed catastrophic counts at $M \geq 3.0$ and $M \geq 2.5$, the ratio, and whether each model lands within $1.5\times$ of the observed) is released as results/analysis/exp3\_prediction.json. The mechanistic diagnostic, which refits $b$ separately on easy and hard tiers, is in exp3\_diagnostic.json; mean $\Delta b_{\text{easy}-\text{hard}} = -0.041$ (std $0.216$), $10/21$ models have easy steeper and $11/21$ have hard steeper.

## Appendix NLeave\-one\-out scaling robustness

[Table˜4](https://arxiv.org/html/2606.05170#A14.T4)gives the per\-drop values from the leave\-one\-out robustness check on the Exp\. 5 headline correlation\. Dropping each of the1414dense models in turn and recomputing the Spearman correlation betweenlog10⁡\(active params\)\\log\_\{10\}\(\\text\{active params\}\)andbbon the remaining1313models, the correlation stays negative and significant in all1414drops\. The worst\-casepp\-value \(p=0\.0263p=0\.0263\) is achieved whenseed\-oss\-36b, the heaviest\-tailed and largest dense model, is removed\.

Table 4: Leave-one-out Spearman correlations on the Exp. 5 headline. All $14$ drops preserve sign and significance. Baseline $\rho_s = -0.562$, $p = 0.006$ ($n = 14$).

## Appendix O Training-pipeline pair comparisons

[Table 5](https://arxiv.org/html/2606.05170#A15.T5) lists the $11$ matched pairs tested. For each pair we estimate $b$ for the two models separately, fit at a shared $m_{\min}$, and report the bootstrap-resampled difference distribution.
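A bootstrap comparison of this kind might look like the sketch below. It rests on stated assumptions rather than on the released code: the estimator is the Aki maximum-likelihood $b$ with Utsu's binning correction (this appendix does not say which estimator is used), the bin width of $0.5$ assumes the $9$-level grid spans $0$ to $4$ in equal steps, and the function names and toy severity lists are hypothetical.

```python
import math
import random

BIN_WIDTH = 0.5  # assumed spacing of the 9-level 0-4 severity grid


def aki_utsu_b(tail, m_min, bin_width=BIN_WIDTH):
    """Maximum-likelihood b for binned magnitudes (Aki estimator, Utsu correction)."""
    mean_m = sum(tail) / len(tail)
    return math.log10(math.e) / (mean_m - (m_min - bin_width / 2))


def bootstrap_delta_b(sev_a, sev_b, m_min, n_boot=1000, seed=0):
    """Percentile 95% interval for b_A - b_B at a shared m_min."""
    rng = random.Random(seed)
    tail_a = [m for m in sev_a if m >= m_min]
    tail_b = [m for m in sev_b if m >= m_min]
    deltas = []
    for _ in range(n_boot):
        # Resample each model's tail with replacement, then refit b.
        res_a = rng.choices(tail_a, k=len(tail_a))
        res_b = rng.choices(tail_b, k=len(tail_b))
        deltas.append(aki_utsu_b(res_a, m_min) - aki_utsu_b(res_b, m_min))
    deltas.sort()
    return deltas[round(0.025 * n_boot)], deltas[round(0.975 * n_boot) - 1]


# Toy severity samples (NOT the paper's data): A is steep, B is heavy-tailed.
model_a = [0.5] * 80 + [1.0] * 15 + [1.5] * 5
model_b = [0.5] * 30 + [1.0] * 30 + [2.0] * 20 + [3.0] * 20

lo, hi = bootstrap_delta_b(model_a, model_b, m_min=0.5)
lo_same, hi_same = bootstrap_delta_b(model_a, model_a, m_min=0.5)
```

An interval that excludes zero (as for the two toy models) indicates a detectable difference in tail shape; comparing a sample against itself yields an interval that straddles zero.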

Table 5: Training-pipeline pair tests. $2/11$ pairs reach $p < 0.05$; notably, the DeepSeek version bump v3.1 $\to$ v3.2 does *not* significantly change the severity distribution.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x7.png)

Figure 7: Gemma-2 vs. Gemma-3 at 27B (generation upgrade). Within-family generation comparison; $\Delta b$ is small and not statistically significant (Table [5](https://arxiv.org/html/2606.05170#A15.T5)).

![Refer to caption](https://arxiv.org/html/2606.05170v1/x8.png)

Figure 8: DeepSeek v3.1 vs. v3.2 (minor version bump). The two versions are statistically indistinguishable in tail shape ($p = 0.872$, Table [5](https://arxiv.org/html/2606.05170#A15.T5)).

## Appendix P Exceedance threshold sweep (Q4)

This appendix answers Q4 (how many exceedances are required for a stable $b$ fit, and does the headline discriminator survive a stricter requirement). For each minimum-exceedance threshold $T \in \{30, 50, 75, 100, 150, 200\}$, we exclude models whose $m_{\min}$-selected tail contains fewer than $T$ events (insufficient statistical support), refit per-model $b$ on the survivors, and recount both the Exp. 2 disjoint-CI discriminator pairs and the Exp. 5 dense scaling correlation plus partial correlation.
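The filtering and recounting steps of the sweep can be illustrated as follows. This is a hedged sketch: the record layout (`n_tail`, `eps`, `ci`), the function names, and the four toy models are all hypothetical, the confidence intervals are taken as given rather than refit, and only the disjoint-CI pair count at matched accuracy ($|\Delta\varepsilon| < 0.05$) is reproduced, not the scaling or partial correlations.

```python
from itertools import combinations

THRESHOLDS = [30, 50, 75, 100, 150, 200]  # T values used in the sweep


def matched_disjoint_pairs(models, eps_tol=0.05):
    """Count pairs with similar error rate whose b intervals do not overlap."""
    count = 0
    for a, b in combinations(models.values(), 2):
        matched = abs(a["eps"] - b["eps"]) < eps_tol
        disjoint = a["ci"][1] < b["ci"][0] or b["ci"][1] < a["ci"][0]
        count += matched and disjoint
    return count


def sweep(models, thresholds=THRESHOLDS):
    """For each T, keep models with >= T tail events and recount the pairs."""
    result = {}
    for t in thresholds:
        kept = {name: m for name, m in models.items() if m["n_tail"] >= t}
        result[t] = (len(kept), matched_disjoint_pairs(kept))
    return result


# Toy catalogue (NOT the paper's data): tail size, error rate, 95% CI on b.
toy_models = {
    "model-a": {"n_tail": 220, "eps": 0.58, "ci": (0.90, 1.05)},
    "model-b": {"n_tail": 210, "eps": 0.60, "ci": (0.55, 0.70)},
    "model-c": {"n_tail": 80, "eps": 0.59, "ci": (1.10, 1.30)},
    "model-d": {"n_tail": 40, "eps": 0.75, "ci": (0.50, 0.60)},
}

summary = sweep(toy_models)  # maps T -> (survivors, discriminator pairs)
```

In the toy catalogue the pair count falls from three to one as the threshold tightens, which mirrors the kind of shrinkage the sweep is designed to measure.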

Table 6: Exceedance threshold sweep. At every tested threshold the Exp. 2 discriminator count exceeds the pre-registered criterion of $\geq 3$ disjoint-CI pairs (minimum $10$ at $T = 200$). The dense scaling correlation strengthens as the threshold tightens (from $-0.562$ to $-0.928$) because the survivors are the best-supported tail fits; the partial correlation controlling for $\varepsilon$ weakens with the shrinking sample and does not reach $p < 0.05$ at any threshold, consistent with the underpowered regime.

## Appendix Q Deployment table (judge baseline, full)

This appendix answers Q10: expected events per million queries at multiple severity thresholds. [Table 7](https://arxiv.org/html/2606.05170#A17.T7) gives the per-model counts at $M \geq \{1.5, 2.0, 2.5, 3.0, 3.5, 4.0\}$ scaled to per-million-query rates, computed directly from the empirical LLM-judge baseline evaluation on the 4K subset (no extrapolation, no fit). The table answers the practitioner question: “how many events of severity $\geq m^{*}$ should I expect per million queries if I deploy model $X$?”

Table 7: Expected event counts per $1{,}000{,}000$ queries at six severity thresholds, computed from the $10{,}000$-query empirical evaluation. Wilson $95\%$ binomial CIs and raw counts are in `results/analysis/deployment_table.json`. Rows sorted by $\log_{10}(\text{active params})$. No fit, no extrapolation. [Figure 9](https://arxiv.org/html/2606.05170#A17.F9) visualises the $M \geq 2.5$ and $M \geq 3.0$ columns. `gemma-3-4b` and `llama-3.2-3b` have the highest catastrophic rates ($\sim 53{,}000$–$63{,}000$ per million at $M \geq 3.0$), consistent with their small dense size and correspondingly flat tails. At the other end, `qwen3-next-80b` and `llama-4-maverick` have $\sim 3{,}000$ catastrophic events per million, a $\sim 20\times$ spread. Fitted $b$ values provide a distribution-shape summary but are not a calibrated extrapolator (§[4.4](https://arxiv.org/html/2606.05170#S4.SS4)).
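The per-million scaling and the Wilson interval mentioned in the caption are both simple to compute. The sketch below shows one way to do it; the helper names are hypothetical, and the count of $30$ events in $10{,}000$ queries is an illustrative figure chosen to land near the $\sim 3{,}000$ per million range, not a value taken from the released JSON.

```python
import math


def per_million(count, n_queries):
    """Scale an observed event count to a rate per one million queries."""
    return count * 1_000_000 / n_queries


def wilson_ci(count, n_queries, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = count / n_queries
    denom = 1 + z * z / n_queries
    centre = (p + z * z / (2 * n_queries)) / denom
    half = z * math.sqrt(p * (1 - p) / n_queries
                         + z * z / (4 * n_queries ** 2)) / denom
    return centre - half, centre + half


# Illustrative count (NOT from the paper): 30 events at M >= 3.0 in 10,000 queries.
events, n = 30, 10_000
rate = per_million(events, n)
lo, hi = wilson_ci(events, n)
lo_per_million, hi_per_million = lo * 1_000_000, hi * 1_000_000
```

For this illustrative count the point estimate is $3{,}000$ per million with an interval of roughly $2{,}100$ to $4{,}300$ per million, which shows how wide the uncertainty remains for rare events even at $10{,}000$ queries.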

![Refer to caption](https://arxiv.org/html/2606.05170v1/x9.png)

Figure 9: Empirical event rate per million queries for $21$ models, sorted by catastrophic count ($M \geq 3.0$, red bars) with severe ($M \geq 2.5$, orange bars) for comparison.

## Appendix R Inter-judge agreement (per-model)

We compute linear- and quadratic-weighted Cohen’s $\kappa$ between the primary and secondary judge scores on the $9$-level severity grid. The pooled values across $60{,}568$ dual-scored records are $\kappa_{\text{lin}} = 0.285$ (“fair” on the Landis–Koch scale) and $\kappa_{\text{quad}} = 0.374$ (“fair–moderate”). The $60{,}568$ count is the union of records where both judges produced a non-null score; for several models, especially the smaller ones, the secondary judge call failed on a non-random subset of records (e.g., `phi-3.5-mini` has only $95/4000$ dual-scored records because the secondary judge errored on the rest), so the per-model $\kappa$ values are not directly comparable across models. We nonetheless report them in `results/analysis/judge_agreement.json` for transparency.

The pooled $\kappa$ is lower than the typical $\kappa > 0.6$ target for high-stakes evaluation. Two structural factors contribute: (i) our $9$-level scale produces lower chance agreement than the $3$- or $5$-level scales used in most LLM-as-judge studies, and (ii) the response-style confound documented in Appendix [L](https://arxiv.org/html/2606.05170#A12) introduces systematic disagreement on verbose, hedged outputs. The headline scaling correlation is computed on the *final* score (the mean of the two judges, with tiebreak when needed), so judge disagreement is averaged out per-record before the $b$ fit; the leave-one-out robustness check ($14/14$, §[4.3](https://arxiv.org/html/2606.05170#S4.SS3)) demonstrates that the final-score noise is small enough not to undermine the headline.
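For readers who want to reproduce the agreement statistic, weighted Cohen’s $\kappa$ can be computed as in the sketch below. It uses the standard definition $\kappa_w = 1 - \sum_{ij} w_{ij} O_{ij} / \sum_{ij} w_{ij} E_{ij}$ with disagreement weights $w_{ij} = (|i-j|/(k-1))^{p}$, where $p = 1$ gives the linear and $p = 2$ the quadratic variant. The grid of $0$ to $4$ in steps of $0.5$ is an assumption about how the $9$ levels are spaced, and the function name and toy scores are hypothetical.

```python
def weighted_kappa(scores_a, scores_b, levels, power=1):
    """Weighted Cohen's kappa; power=1 is linear, power=2 is quadratic."""
    index = {level: i for i, level in enumerate(levels)}
    k, n = len(levels), len(scores_a)

    # Observed joint proportions of (judge A level, judge B level).
    observed = [[0.0] * k for _ in range(k)]
    for a, b in zip(scores_a, scores_b):
        observed[index[a]][index[b]] += 1 / n

    row = [sum(r) for r in observed]
    col = [sum(observed[i][j] for i in range(k)) for j in range(k)]

    num = den = 0.0
    for i in range(k):
        for j in range(k):
            weight = (abs(i - j) / (k - 1)) ** power
            num += weight * observed[i][j]    # observed disagreement
            den += weight * row[i] * col[j]   # chance-expected disagreement
    return 1 - num / den


GRID = [i * 0.5 for i in range(9)]  # assumed 9-level grid: 0.0, 0.5, ..., 4.0

# Toy judge scores (NOT the paper's data): three exact matches, three near misses.
primary = [0.0, 0.5, 1.0, 2.0, 3.0, 4.0]
secondary = [0.5, 0.5, 1.5, 2.0, 2.5, 4.0]

kappa_lin = weighted_kappa(primary, secondary, GRID, power=1)
kappa_quad = weighted_kappa(primary, secondary, GRID, power=2)
```

Because quadratic weights penalise one-step disagreements less than linear weights do, the quadratic value is the higher of the two on these toy scores, which is the same ordering as the pooled values reported above.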

## Appendix S Per-tier scaling decomposition (judge baseline)

[Table 8](https://arxiv.org/html/2606.05170#A19.T8) reports the within-tier scaling correlation $\rho_s(\log_{10}(\text{active params}), b_{\text{tier}})$ on the $14$ dense models, with $b$ refit on the $\sim 200$–$500$ within-tier errors per model. The aggregate (across all tiers) correlation is $\rho_s = -0.562$ ($p = 0.006$). Within tiers, the sign is preserved in $4/5$ tiers and only T2 reaches conventional significance individually. T4 is the one tier where the correlation flattens to $\rho_s \approx 0$. The aggregate effect benefits from the larger per-model error counts ($\sim 2000$ vs $\sim 400$), which reduce per-fit noise and let the underlying signal emerge. The per-tier sign-preservation is the relevant robustness statement; the aggregate $p$-value is the relevant magnitude statement.

Table 8: Per-tier scaling correlation on the $14$ dense models. Sign preserved in $4/5$ tiers; T2 individually significant. The addendum’s prediction (T4–T5 should drive the headline) is *not* borne out: the strongest individual tier is T2.

## Appendix T S5: $m_{\min}$ sensitivity (judge baseline)

[Table 9](https://arxiv.org/html/2606.05170#A20.T9) reports each dense model’s $b$ under three $m_{\min}$ strategies: (a) our default model-specific KS-selector, (b) fixed $m_{\min} = 1.5$, and (c) the Clauset-style criterion of $\geq 100$ exceedances (which selects $m_{\min} = 0.5$ for every model in the catalog). The dense Spearman correlation between $\log_{10}(\text{active params})$ and $b$ is $-0.562$ under (a), $+0.837$ under (b), and $+0.793$ under (c). The interpretation of this sign flip is in §[4.6](https://arxiv.org/html/2606.05170#S4.SS6).
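To make the dependence on $m_{\min}$ concrete, the sketch below refits $b$ on one toy error sample at two fixed cutoffs. It is illustrative only: the Aki maximum-likelihood estimator with Utsu's binning correction is an assumed stand-in for the paper's fitting procedure, the bin width of $0.5$ assumes an evenly spaced $9$-level grid, the KS-selector of strategy (a) is not implemented, and the sample is invented.

```python
import math


def fit_b(severities, m_min, bin_width=0.5):
    """Return (exceedance count, b) for errors with severity >= m_min.

    Uses the Aki estimator with Utsu's half-bin correction (an assumption).
    """
    tail = [m for m in severities if m >= m_min]
    mean_m = sum(tail) / len(tail)
    b = math.log10(math.e) / (mean_m - (m_min - bin_width / 2))
    return len(tail), b


# Toy error severities (NOT the paper's data).
errors = ([0.5] * 50 + [1.0] * 25 + [1.5] * 12
          + [2.0] * 6 + [2.5] * 4 + [3.0] * 3)

# Refit at the two fixed cutoffs discussed above.
fits = {m_min: fit_b(errors, m_min) for m_min in (0.5, 1.5)}
n_bulk, b_bulk = fits[0.5]
n_tail, b_tail = fits[1.5]
```

Raising the cutoff from $0.5$ to $1.5$ leaves only a quarter of the toy sample in the fit and shifts the estimate, so a bulk-oriented cutoff and a tail-oriented cutoff are measuring different parts of the distribution.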

| model | $b_{\text{def}}$ | $b_{m=1.5}$ | $b_{m=0.5}$ | $n_{\geq 0.5}$ |
|---|---|---|---|---|
| llama-3.2-3b | 1.046 | 0.469 | 0.290 | 1750 |
| phi-3.5-mini | 1.309 | 0.530 | 0.298 | 1126 |
| gemma-3-4b | 0.979 | 0.439 | 0.243 | 1497 |
| qwen2.5-7b | 1.257 | 0.517 | 0.299 | 1247 |
| llama-3.1-8b | 1.001 | 0.737 | 0.566 | 2429 |
| eurollm-9b | 1.067 | 0.526 | 0.297 | 1356 |
| solar-10.7b | 0.905 | 0.711 | 0.562 | 2571 |
| gemma-3-12b | 0.938 | 0.742 | 0.555 | 2521 |
| ministral-14b | 1.122 | 0.795 | 0.534 | 2340 |
| mistral-medium-3 | 0.906 | 0.767 | 0.434 | 2334 |
| mistral-small-24b | 0.999 | 0.834 | 0.639 | 2265 |
| gemma-2-27b | 0.619 | 0.786 | 0.619 | 2665 |
| gemma-3-27b | 0.956 | 0.760 | 0.610 | 2500 |
| seed-oss-36b | 0.574 | 0.781 | 0.574 | 2260 |
| dense Spearman vs. $\log_{10}(\text{params})$ | $-0.562$ | $+0.837$ | $+0.793$ | |
| $p$-value | 0.006 | 0.0002 | 0.0007 | |

Table 9: S5: dense $b$ under three $m_{\min}$ strategies. The default model-specific KS-selector targets the upper tail; fixed $m_{\min} = 1.5$ and the Clauset-style $\geq 100$-exceedances rule both target the bulk of the positive distribution. The sign of the scaling correlation flips between the two regimes: larger dense models have heavier upper tails but lighter bulk decay rates.

## Appendix U Sensitivity figures

![Refer to caption](https://arxiv.org/html/2606.05170v1/x10.png)

Figure 10: S1: $b$ under scale coarsening. The 9-point grid is load-bearing; both coarsenings fall well below the $\rho > 0.85$ stability threshold.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x11.png)

Figure 11: S2: ranking stability under overcall correction. Mean Spearman $\rho = 0.847$ across $50$ bootstrap trials, narrowly below the pre-registered $0.85$ threshold.

![Refer to caption](https://arxiv.org/html/2606.05170v1/x12.png)

Figure 12: Per-model overcall breakdown on the $340$-item human-rated subset (Appendix [L](https://arxiv.org/html/2606.05170#A12)).

## NeurIPS Paper Checklist

1. Claims
   - Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope?
   - Answer: [Yes]
   - Justification: The abstract and introduction summarize the matched-accuracy discriminator result, the theory contribution, the human-validation study, and the scope limits on model coverage. Limitations and scope restrictions are also stated explicitly in the introduction and the Limitations section.
2. Limitations
   - Question: Does the paper discuss the limitations of the work performed by the authors?
   - Answer: [Yes]
   - Justification: Section [8](https://arxiv.org/html/2606.05170#S8) discusses judge overcalling, the load-bearing nature of the 9-point scale, model-coverage gaps, limited human-sample size, failed predictive calibration, benchmark misuse, and the fact that queries are LLM-generated.
3. Theory assumptions and proofs
   - Question: For each theoretical result, does the paper provide the full set of assumptions and a complete (and correct) proof?
   - Answer: [Yes]
   - Justification: Theorem [1](https://arxiv.org/html/2606.05170#Thmtheorem1) and Proposition [2](https://arxiv.org/html/2606.05170#Thmtheorem2) are stated in Section [5](https://arxiv.org/html/2606.05170#S5), and complete proofs are provided in Appendix [A](https://arxiv.org/html/2606.05170#A1). The main text includes proof sketches for intuition and the appendix provides the formal derivations.
4. Experimental result reproducibility
   - Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims and/or conclusions of the paper (regardless of whether the code and data are provided or not)?
   - Answer: [Yes]
   - Justification: Sections [2](https://arxiv.org/html/2606.05170#S2) and [3](https://arxiv.org/html/2606.05170#S3), together with the appendices and released artifact, describe the benchmark construction, model catalog, scoring pipeline, fitting procedure, and statistical analyses used for the main claims. The released artifact also includes scripts and a reproduction guide for rerunning the analyses from saved data.
5. Open access to data and code
   - Question: Does the paper provide open access to the data and code, with sufficient instructions to faithfully reproduce the main experimental results, as described in supplemental material?
   - Answer: [Yes]
   - Justification: The released artifact includes the code, saved analysis outputs, released data files, Croissant metadata, and reproduction instructions. The paper states that the benchmark, scoring toolkit, human-validation data, and analysis code are released for replication.
6. Experimental setting/details
   - Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen, type of optimizer) necessary to understand the results?
   - Answer: [Yes]
   - Justification: The paper specifies the 10,000-query benchmark design, domain/tier stratification, model catalog, decoding setup, severity scale, dual-judge resolution pipeline, human-validation protocol, candidate distribution families, and bootstrap procedures. Additional tables and prompts are included in the appendix and supplemental files.
7. Experiment statistical significance
   - Question: Does the paper report error bars suitably and correctly defined or other appropriate information about the statistical significance of the experiments?
   - Answer: [Yes]
   - Justification: The paper reports bootstrap confidence intervals for $b$, p-values for correlation and goodness-of-fit tests, mutual-information decomposition, and additional robustness analyses. The bootstrap setup and significance thresholds are described in the paper and supplemental scripts.
8. Experiments compute resources
   - Question: For each experiment, does the paper provide sufficient information on the computer resources (type of compute workers, memory, time of execution) needed to reproduce the experiments?
   - Answer: [No]
   - Justification: The paper states that inference was run through an open-access third-party API hosting the target models, but it does not provide exact GPU types, memory, or wall-clock runtime for every experiment. We chose not to over-claim compute reproducibility where the provider abstraction hides the underlying hardware.
9. Code of ethics
   - Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics https://neurips.cc/public/EthicsGuidelines?
   - Answer: [Yes]
   - Justification: The paper discusses misuse risks and limitations, and the released artifact is intended for auditing and evaluation rather than unsafe deployment. We are not aware of any aspect of the work that conflicts with the NeurIPS Code of Ethics.
10. Broader impacts
    - Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work performed?
    - Answer: [Yes]
    - Justification: Section [8](https://arxiv.org/html/2606.05170#S8) includes a Broader impacts paragraph discussing positive impacts for risk auditing and deployment gating, as well as negative impacts from benchmark gaming or optimizing for better-looking but still dangerous failure profiles.
11. Safeguards
    - Question: Does the paper describe safeguards that have been put in place for responsible release of data or models that have a high risk for misuse (e.g., pre-trained language models, image generators, or scraped datasets)?
    - Answer: [N/A]
    - Justification: The paper releases an evaluation benchmark, scores, and analysis code, not a new generative model or scraped multimedia dataset with elevated release risk. We therefore view the high-risk-model safeguard question as not directly applicable to this artifact.
12. Licenses for existing assets
    - Question: Are the creators or original owners of assets (e.g., code, data, models), used in the paper, properly credited and are the license and terms of use explicitly mentioned and properly respected?
    - Answer: [No]
    - Justification: The paper cites the upstream benchmarks and model families used in the evaluation, and the released artifact includes licenses for the released ERRORQUAKE code/data. However, the paper does not enumerate every upstream model or provider license directly in the manuscript, so we answer conservatively.
13. New assets
    - Question: Are new assets introduced in the paper well documented and is the documentation provided alongside the assets?
    - Answer: [Yes]
    - Justification: The benchmark release includes a datasheet, Croissant metadata, reproduction instructions, and scripts for the reported analyses. The paper and supplemental material document the query structure, severity scale, scoring fields, and released outputs.
14. Crowdsourcing and research with human subjects
    - Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants and screenshots, if applicable, as well as details about compensation (if any)?
    - Answer: [No]
    - Justification: The released artifact includes the full rating instructions used in the human-validation study. However, the paper does not currently document participant compensation or an equivalent statement that compensation was not applicable.
15. Institutional review board (IRB) approvals or equivalent for research with human subjects
    - Question: Does the paper describe potential risks incurred by study participants, whether such risks were disclosed to the subjects, and whether Institutional Review Board (IRB) approvals (or an equivalent approval/review based on the requirements of your country or institution) were obtained?
    - Answer: [No]
    - Justification: The paper reports a human-rating study but does not include an IRB-approval statement or a fuller discussion of participant-risk disclosure. We therefore answer conservatively rather than implying review documentation that is not present in the released materials.
16. Declaration of LLM usage
    - Question: Does the paper describe the usage of LLMs if it is an important, original, or non-standard component of the core methods in this research? Note that if the LLM is used only for writing, editing, or formatting purposes and does *not* impact the core methodology, scientific rigor, or originality of the research, declaration is not required.
    - Answer: [Yes]
    - Justification: The paper explicitly states that the benchmark queries were generated by a frontier LLM and that response scoring uses an LLM-judge pipeline with human validation. These components are described in Sections [2](https://arxiv.org/html/2606.05170#S2), [3](https://arxiv.org/html/2606.05170#S3), and [8](https://arxiv.org/html/2606.05170#S8).
