Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization
Summary
This paper extends optimal transport-based hallucination detection to all decoder layers in NMT and abstractive summarization, finding that detection is concentrated in early layers and that the geometric signal transfers poorly to summarization due to faithfulness failures not detectable via attention concentration.
# Layer-Resolved Optimal Transport for Hallucination Detection in NMT and Abstractive Summarization
Source: [https://arxiv.org/html/2606.13216](https://arxiv.org/html/2606.13216)
###### Abstract
Optimal transport (OT) has been shown to detect hallucinations in neural machine translation (NMT) by measuring the geometric distance between cross-attention distributions and a reference distribution, without any supervision Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)). We extend this analysis to all six decoder layers of the Fairseq DE-EN model ($N=3{,}414$), showing that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection is concentrated in layers L1–L4 with L5 anti-predictive for subtler types, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step. We further evaluate whether the geometric signal transfers to abstractive summarization faithfulness detection: our unsupervised OT detector on AggreFact Tang et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib2)) ($N=1{,}116$) achieves 57.2%/57.6% balanced accuracy on CNN/XSum – above chance but substantially below supervised MiniCheck-Flan-T5-L Tang et al. ([2024](https://arxiv.org/html/2606.13216#bib.bib3)) (69.9%/74.3%). This gap is principled: unlike NMT hallucinations, unfaithful summaries can attend correctly to source tokens while misrepresenting their content, a failure mode invisible to concentration-based OT metrics by construction. Structural experiments on T5-base Raffel et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib4)) confirm consistent decoder organisation across depth, with Layer 3 showing peak concentration and Layer 12 being most critical for generation quality. Together, the results establish OT on cross-attention as a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention.
Keywords: Optimal Transport, Hallucination Detection, Neural Machine Translation, Abstractive Summarization, Cross-Attention Analysis
## 1 Introduction
Transformer models Vaswani et al. ([2017](https://arxiv.org/html/2606.13216#bib.bib5)) have achieved strong performance in abstractive summarization, yet their internal attention mechanisms remain poorly understood. A practical concern is faithfulness: models can generate summaries that are fluent but factually inconsistent with the source, a failure mode closely related to hallucination in neural machine translation (NMT). Recent work Maynez et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib6)); Kryściński et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib7)) has shown that faithfulness failures are common even in state-of-the-art summarization systems.
Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)) demonstrated that hallucinations in NMT produce cross-attention distributions that are geometrically detached from the source, and that this detachment is measurable via Wasserstein-1 ($W_1$) distance. Their fully unsupervised detector outperformed all prior model-based approaches and was competitive with external models trained on millions of samples for quality estimation and cross-lingual sentence similarity. However, their analysis operates on a single aggregate attention distribution from the final decoder layer, leaving open how the hallucination signal distributes across individual layers, how different hallucination types relate to different detectors, and whether the geometric intuition transfers beyond NMT.
We address these open questions in two parts. First, we extend the original NMT analysis to all six decoder layers of the Fairseq DE-EN model, introducing routing consistency as an additional detector and characterising the layer-resolved geometry of each hallucination type. Second, we ask whether the geometric signal transfers to abstractive summarization faithfulness detection, using the T5 architecture Raffel et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib4)) as a testbed and evaluating on the AggreFact benchmark Tang et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib2)).
Our contributions are:
1. A layer-resolved analysis of the Fairseq DE-EN hallucination corpus of Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)), extending their aggregate last-layer result to all six decoder layers and introducing routing consistency as an additional unsupervised detector. We show that Wass-to-Unif and Wass-to-Data are complementary detectors specialised across hallucination types, that detection performance is concentrated in layers L1–L4, and that hallucinated translations lack the exploratory attention phase present in correct translations from the first decoding step.
2. The first application of OT-based hallucination detection to abstractive summarization, evaluated on the AggreFact benchmark Tang et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib2)) against supervised baselines including MiniCheck Tang et al. ([2024](https://arxiv.org/html/2606.13216#bib.bib3)).
3. A theoretical account of why the NMT-to-summarization transfer is partial, grounding the empirical gap in the distinction between retrieval failure and content misuse, and calibrating it against the gradient of detectability observed across NMT hallucination types.
4. A structural analysis of T5-base cross-attention geometry across all 12 decoder layers via OT metrics, revealing consistent architectural organisation confirmed by leave-one-out ablation and convergent with the layer structure identified in the Fairseq model.
## 2 Background
### 2.1 OT-Based Hallucination Detection in NMT
Given two discrete probability distributions $\mu$ and $\nu$ over positions $\{1,\ldots,S\}$, the Wasserstein-1 distance is defined as:

$$W_1(\mu,\nu) \;=\; \inf_{\gamma \in \Gamma(\mu,\nu)} \int_{\mathbb{R}\times\mathbb{R}} |x-y| \; d\gamma(x,y), \tag{1}$$

where $\Gamma(\mu,\nu)$ is the set of all joint distributions (transport plans) with marginals $\mu$ and $\nu$, and $|x-y|$ is the ground metric on token positions. Intuitively, $W_1$ measures the minimum “work” needed to rearrange one distribution into the other, making it sensitive to the spatial structure of attention mass in a way that entropy-based measures are not. For discrete distributions on a 1D grid, $W_1$ reduces to the area between cumulative distribution functions, enabling efficient exact computation Peyré and Cuturi ([2019](https://arxiv.org/html/2606.13216#bib.bib8)).
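The CDF identity mentioned above can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation: the function name `w1_cdf` and the `spacing` parameter are hypothetical, and unit spacing between adjacent token positions is an assumption, since the text does not state whether positions are rescaled.

```python
from itertools import accumulate


def w1_cdf(mu, nu, spacing=1.0):
    """Exact W1 between two distributions on the same evenly spaced 1D grid.

    `spacing` is the assumed distance between adjacent positions.
    """
    if len(mu) != len(nu):
        raise ValueError("both distributions must share the same support")
    cdf_mu = list(accumulate(mu))
    cdf_nu = list(accumulate(nu))
    # Area between the two step-function CDFs. The last term vanishes
    # (up to rounding) because both CDFs end at 1.
    return spacing * sum(abs(a - b) for a, b in zip(cdf_mu, cdf_nu))


# Moving all mass from position 1 to position 3 costs two unit steps.
print(w1_cdf([1.0, 0.0, 0.0], [0.0, 0.0, 1.0]))
# A point mass at the first of four positions versus the uniform distribution.
print(w1_cdf([1.0, 0.0, 0.0, 0.0], [0.25, 0.25, 0.25, 0.25]))
```

Because the computation is a single pass over the two cumulative sums, it is exact and needs no entropic regularisation.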
Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)) proposed treating each cross-attention distribution as a point in the space of probability measures over source positions, and measuring its concentration via $W_1$ distance to the uniform distribution $\mathbf{u}=(1/S,\ldots,1/S)^{\top}$:

$$c^{(\ell,t)} = W_1\!\left(\pi^{(\ell,t)},\,\mathbf{u}\right), \tag{2}$$

where $\pi^{(\ell,t)}$ is the head-averaged cross-attention distribution at decoder layer $\ell$ and generation step $t$. Low concentration – attention mass spread uniformly across source positions – flags potential hallucination. Their per-example score aggregates this signal as the layer-median mean of $c^{(\ell,t)}$, and their Wass-to-Unif (WTU) and Wass-to-Data (WTD) detectors are complementary: WTU captures absolute concentration while WTD measures distributional similarity to a reference set of confirmed-correct translations.
### 2.2 Faithfulness in Summarisation
Faithfulness failures in abstractive summarisation differ fundamentally from NMT hallucinations Maynez et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib6)). Maynez et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib6)) distinguish *intrinsic* hallucinations, where generated content contradicts the source, from *extrinsic* ones, where content cannot be verified from the source. NMT hallucinations are predominantly intrinsic and severe – the decoder ignores source content almost entirely. Abstractive summarisation failures are often extrinsic: the model attends correctly to source tokens but infers or distorts beyond what is licensed by the evidence. This distinction is central to understanding why OT transfer is partial: the signal that works in NMT (source disengagement) is simply not the dominant failure mode in abstractive summarisation.
### 2.3 Attention as an Interpretability Signal
Cross-attention distributions in encoder–decoder Transformers have been used as a proxy for source–target alignment Vaswani et al. ([2017](https://arxiv.org/html/2606.13216#bib.bib5)). More recently, mechanistic interpretability work has identified functional specialisation across decoder layers. OT provides a principled, label-free way to characterise this structure via the geometry of attention distributions, without requiring probing classifiers or task-specific supervision.
## 3 Methodology
### 3.1 Attention Extraction
For each source–output pair we extract the full cross-attention tensors from the decoder. For layer $\ell$, the raw tensor has shape $(H, T_{\text{tgt}}, S)$, where $H$ is the number of heads, $T_{\text{tgt}}$ the output length, and $S$ the source length. Writing $\alpha^{(h,\ell,t)}$ for the attention distribution of head $h$ at generation step $t$, we average over heads:

$$\pi^{(\ell,t)} = \frac{1}{H}\sum_{h=1}^{H}\alpha^{(h,\ell,t)} \in \Delta^{S-1}, \tag{3}$$

following Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)).
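As a minimal sketch of the head-averaging step, assuming the attention for one layer is available as a nested list indexed `alpha[h][t][s]` (a hypothetical layout; real frameworks return tensors), the average can be written as:

```python
def head_average(alpha):
    """Average cross-attention over heads: alpha[h][t][s] -> pi[t][s]."""
    num_heads = len(alpha)
    num_steps = len(alpha[0])
    src_len = len(alpha[0][0])
    return [
        [sum(alpha[h][t][s] for h in range(num_heads)) / num_heads
         for s in range(src_len)]
        for t in range(num_steps)
    ]


# Two heads, one decoding step, two source positions (toy values).
alpha = [
    [[1.0, 0.0]],  # head 0 attends only to source position 0
    [[0.0, 1.0]],  # head 1 attends only to source position 1
]
print(head_average(alpha))
```

Since each head's row is a probability distribution, the averaged row also sums to one and lies on the simplex.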
### 3.2 OT Metrics
#### Wass-to-Unif (WTU).
The per-layer concentration score is the mean $W_1$ distance to the uniform distribution over decoding steps:

$$s^{(\ell)}_{\text{WTU}} = \frac{1}{T}\sum_{t=1}^{T} W_1\!\left(\pi^{(\ell,t)},\mathbf{u}\right). \tag{4}$$

The aggregate per-example score averages over layers: low scores flag potential hallucination.
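The per-layer score in Eq. (4) can be sketched as follows. The function names are hypothetical, and dividing by the source length $S$ (so that adjacent positions are $1/S$ apart and sentences of different lengths are comparable) is an assumption not stated in the text.

```python
from itertools import accumulate


def w1_to_uniform(pi):
    """W1 between pi and the uniform distribution on the same grid.

    Assumption: adjacent positions are 1/S apart, which keeps the score
    below 0.5 whatever the source length S.
    """
    src_len = len(pi)
    cdf = accumulate(pi)
    # The uniform CDF at the (i+1)-th position is (i+1)/S.
    return sum(abs(c - (i + 1) / src_len) for i, c in enumerate(cdf)) / src_len


def wtu_layer_score(layer_attn):
    """Eq. (4): mean W1-to-uniform over decoding steps for one layer."""
    return sum(w1_to_uniform(pi) for pi in layer_attn) / len(layer_attn)


layer_attn = [
    [0.25, 0.25, 0.25, 0.25],  # fully diffuse step
    [1.0, 0.0, 0.0, 0.0],      # fully peaked step
]
print(wtu_layer_score(layer_attn))
```

A diffuse step contributes zero and a peaked step contributes a large value, so the layer score is an average measure of how far attention sits from uniform.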
#### Step-to-step OT.
To measure how dynamically the decoder repositions its source attention during generation, we compute the $W_1$ distance between attention distributions at consecutive steps within each layer:

$$[\mathbf{S}]_{\ell,t} = W_1\!\left(\pi^{(\ell,t)},\,\pi^{(\ell,t+1)}\right),\quad t=1,\ldots,T-1. \tag{5}$$

The per-layer mean step-OT summarises the average attention shift. A model that scans the source dynamically produces high step-OT; one that locks onto fixed positions produces low step-OT.
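A minimal sketch of Eq. (5) for a single layer, assuming unit spacing between source positions and hypothetical function names:

```python
from itertools import accumulate


def w1_cdf(mu, nu):
    """Exact W1 on a shared unit-spaced 1D grid via the CDF formula."""
    return sum(abs(a - b) for a, b in zip(accumulate(mu), accumulate(nu)))


def step_ot(layer_attn):
    """Eq. (5): W1 between attention at consecutive decoding steps."""
    return [w1_cdf(layer_attn[t], layer_attn[t + 1])
            for t in range(len(layer_attn) - 1)]


# A decoder that scans the source left to right, one position per step.
scanning = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# A decoder that stays locked on the first position.
static = [[1.0, 0.0, 0.0]] * 3

print(step_ot(scanning))
print(step_ot(static))
```

The scanning example yields a non-zero shift at every step while the static one yields zeros, which is the contrast the step-OT metric is designed to expose.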
#### Layer-pairwise OT.
To characterise routing similarity across decoder depth, we compute $W_1$ between step-averaged attention distributions at each pair of layers:

$$[\mathbf{D}]_{\ell,\ell'} = W_1\!\left(\bar{\pi}^{(\ell)},\,\bar{\pi}^{(\ell')}\right),\quad \bar{\pi}^{(\ell)} = \frac{1}{T}\sum_{t=1}^{T}\pi^{(\ell,t)}, \tag{6}$$

producing a symmetric $L\times L$ distance matrix per example. Low $[\mathbf{D}]_{\ell,\ell'}$ indicates that two layers attend to similar source positions on average; high values indicate functionally distinct routing.
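The matrix in Eq. (6) can be sketched as below, assuming attention is stored as a nested list `attn[layer][step][source_position]` (a hypothetical layout) and unit spacing between positions:

```python
from itertools import accumulate


def w1_cdf(mu, nu):
    """Exact W1 on a shared unit-spaced 1D grid via the CDF formula."""
    return sum(abs(a - b) for a, b in zip(accumulate(mu), accumulate(nu)))


def step_average(layer_attn):
    """Average a layer's attention over decoding steps."""
    num_steps = len(layer_attn)
    src_len = len(layer_attn[0])
    return [sum(pi[s] for pi in layer_attn) / num_steps for s in range(src_len)]


def layer_pairwise_ot(attn):
    """Eq. (6): L x L matrix of W1 distances between step-averaged layers."""
    means = [step_average(layer) for layer in attn]
    return [[w1_cdf(a, b) for b in means] for a in means]


attn = [
    [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]],  # layer 0 attends to the first token
    [[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]],  # layer 1 attends to the last token
]
for row in layer_pairwise_ot(attn):
    print(row)
```

The diagonal is zero and the matrix is symmetric because $W_1$ is a metric, so in practice only the upper triangle needs computing.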
#### Wass-to-Data (WTD).
For each test sentence we retrieve $k=4$ nearest neighbours from a reference set of confirmed-correct sentences (filtered by length proximity $\delta=0.1$) and compute the mean $W_1$ distance to their step-averaged attention distributions.
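One possible reading of this procedure is sketched below. Several details are assumptions rather than facts from the text: "nearest" is taken to mean smallest $W_1$ among length-filtered references, the length filter is interpreted as a relative difference of at most $\delta$, and token positions are mapped to midpoints in $[0,1]$ so that distributions of different lengths can be compared. All function names are hypothetical.

```python
def w1_points(xs, ps, ys, qs):
    """Exact 1D W1 between two discrete distributions on arbitrary supports."""
    events = sorted([(x, p, 0.0) for x, p in zip(xs, ps)]
                    + [(y, 0.0, q) for y, q in zip(ys, qs)])
    total, cdf_p, cdf_q, prev = 0.0, 0.0, 0.0, events[0][0]
    for pos, dp, dq in events:
        # Area between the two CDFs on the interval [prev, pos].
        total += abs(cdf_p - cdf_q) * (pos - prev)
        cdf_p, cdf_q, prev = cdf_p + dp, cdf_q + dq, pos
    return total


def grid(src_len):
    """Assumed mapping of token positions to midpoints in [0, 1]."""
    return [(i + 0.5) / src_len for i in range(src_len)]


def wtd_score(test_dist, reference_dists, k=4, delta=0.1):
    """Mean W1 to the k closest length-compatible reference distributions."""
    src_len = len(test_dist)
    candidates = [r for r in reference_dists
                  if abs(len(r) - src_len) <= delta * src_len]
    if not candidates:
        raise ValueError("no reference passes the length filter")
    dists = sorted(w1_points(grid(src_len), test_dist, grid(len(r)), r)
                   for r in candidates)
    nearest = dists[:k]
    return sum(nearest) / len(nearest)


references = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.0], [0.25, 0.25, 0.25, 0.25]]
print(wtd_score([0.5, 0.5], references, k=2))
```

In this toy call the four-token reference is excluded by the length filter, and the score averages the two smallest remaining distances.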
#### Routing Consistency (RC).
Let $\hat{\jmath}(\ell,t)=\arg\max_{j}\alpha^{(\ell,t)}_{j}$ be the argmax source position at step $t$. The routing entropy at layer $\ell$ is $H(\ell)=-\sum_{j}\hat{p}^{(\ell)}_{j}\log\hat{p}^{(\ell)}_{j}$, where $\hat{p}^{(\ell)}_{j}=T^{-1}\sum_{t}\mathbf{1}[\hat{\jmath}(\ell,t)=j]$. The per-example RC score is $-L^{-1}\sum_{\ell}H(\ell)$, negated so that higher values correspond to more consistent, focused routing. RC captures a complementary aspect of attention geometry: not how concentrated the distribution is, but how consistently the decoder returns to the same source position across steps.
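The RC definition can be sketched directly from these formulas. This is an illustration under assumptions: attention is stored as `attn[layer][step][source_position]`, the logarithm is natural (the base is not stated in the text), and ties in the argmax are broken by taking the first position.

```python
import math
from collections import Counter


def routing_entropy(layer_attn):
    """Entropy of the empirical distribution of argmax source positions."""
    num_steps = len(layer_attn)
    # First maximal index per step; tie-breaking is an assumption.
    argmax_positions = [max(range(len(pi)), key=pi.__getitem__)
                        for pi in layer_attn]
    counts = Counter(argmax_positions)
    return -sum((c / num_steps) * math.log(c / num_steps)
                for c in counts.values())


def routing_consistency(attn):
    """RC score: negated mean routing entropy over layers."""
    return -sum(routing_entropy(layer) for layer in attn) / len(attn)


# One layer that always routes to position 0, one that alternates.
locked = [[0.9, 0.1], [0.8, 0.2], [0.7, 0.3], [0.6, 0.4]]
alternating = [[0.9, 0.1], [0.1, 0.9], [0.9, 0.1], [0.1, 0.9]]

print(routing_consistency([locked]))
print(routing_consistency([alternating]))
```

A decoder that always picks the same source position has zero entropy and therefore RC equal to zero, the maximum possible value; any routing diversity pushes RC negative.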
All $W_1$ distances are computed exactly via the CDF formula Peyré and Cuturi ([2019](https://arxiv.org/html/2606.13216#bib.bib8)), avoiding Sinkhorn regularisation error.
### 3.3 Datasets
#### NMT (Fairseq DE-EN).
We use the annotated corpus of Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)): 3,414 DE$\to$EN translations with binary labels for five hallucination categories. Following their taxonomy, we define the hallucination group as sentences positive for any of full-unsupport (129), strong-unsupport (164), or repetitions (87) – yielding 324 hallucinated sentences – and the confirmed-correct group as 2,882 sentences where all five label columns are zero.
#### Summarisation (AggreFact).
We evaluate on the AggreFact benchmark Tang et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib2)), using test splits for CNN/DailyMail ($N=558$) and XSum ($N=558$). The primary supervised baseline is MiniCheck-Flan-T5-L ([https://huggingface.co/lytang/MiniCheck-Flan-T5-Large](https://huggingface.co/lytang/MiniCheck-Flan-T5-Large); 0.8B parameters), which achieves 69.9%/74.3% balanced accuracy on CNN/XSum. Balanced accuracy (BAcc), the arithmetic mean of sensitivity and specificity, is used as the primary metric to account for class imbalance (CNN: 89.8% faithful). Structural experiments use T5-base ([https://huggingface.co/google-t5/t5-base](https://huggingface.co/google-t5/t5-base); $N=100$ CNN/DailyMail examples for concentration profiling; $N=50$ for quality-group comparisons).
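To illustrate why balanced accuracy is preferred under class imbalance, the sketch below computes it for a toy dataset that is 90% faithful (hypothetical labels chosen to resemble the CNN split's imbalance, not real data). Label 1 denotes "faithful" here, which is an arbitrary choice.

```python
def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of sensitivity and specificity for binary labels."""
    tp = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, y_pred) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, y_pred) if y == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


y_true = [1] * 9 + [0]   # nine faithful summaries, one unfaithful
y_pred = [1] * 10        # trivial detector: always predict "faithful"

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy)                     # looks strong
print(balanced_accuracy(y_true, y_pred))  # chance level
```

The trivial detector scores 0.9 on plain accuracy but only 0.5 on balanced accuracy, so 50% is the meaningful chance baseline regardless of class proportions.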
## 4 Experiments I: Revisiting Translation
### 4.1 Layer Structure and Concentration Profile
Figure [1](https://arxiv.org/html/2606.13216#S4.F1) shows the mean pairwise $W_1$ distance matrix $\mathbf{D}$ averaged over all 3,414 sentences. The matrix reveals three distinct functional regimes. L0 is moderately distant from all other layers, with no strong affinity to any particular group. L1 acts as a bridge between regimes: its row is dark toward both L0 and the central block, indicating it shares routing behaviour with both neighbours rather than belonging cleanly to either. Layers L2–L4 form a tight cluster, with pairwise distances among themselves substantially lower than to any other layer, suggesting shared or functionally redundant source-routing behaviour. Finally, L5 is structurally isolated: the maximum pairwise distance in the entire matrix occurs between L2 and L5 (mean $W_1 \approx 0.13$, yellow cell), and every entry in L5’s row is markedly brighter than the interior of the L2–L4 block. This is consistent with L5’s role as the final decoder layer, which must commit to specific source tokens immediately before generation and therefore implements qualitatively different routing from the preceding layers.

Figure 1: Mean pairwise $W_1$ distance matrix $\mathbf{D}$ averaged over all 3,414 sentences (Fairseq DE-EN). Three functional regimes are visible: L0 (transitional), L1–L4 (tight cluster, with L1 bridging to L0), and L5 (isolated; maximum distance $\approx 0.13$ from L2).

Figure [2](https://arxiv.org/html/2606.13216#S4.F2) shows the mean concentration $s_{\mathrm{WTU}}^{(\ell)}$ per layer. The profile is broadly monotone increasing from L1 (0.21) through L5 (0.33), with L0 sitting slightly above L1 at 0.25 – a minor non-monotonicity at the decoder entry. The sharpest single jump occurs between L4 (0.26) and L5 (0.33), consistent with L5’s structural isolation: it routes differently from all other layers and does so by attending more selectively than any of them.

Figure 2: Mean $W_1$ concentration $s_{\mathrm{WTU}}^{(\ell)}$ per decoder layer (Fairseq DE-EN, $N=3{,}414$). The profile increases broadly from L1 (0.21) to L5 (0.33), with the sharpest jump at the final layer. L0 sits slightly above L1, producing a minor non-monotonicity at the decoder entry.

Finding 1. The six Fairseq decoder layers organise into three functional regimes: L0 is transitional; L2–L4 form a tight cluster with shared routing behaviour, with L1 bridging L0 and this cluster; and L5 is structurally isolated as the most concentrated and routing-divergent layer in the network.
### 4.2 Replication and Hallucination Separation
Table 1: Mann-Whitney $U$ test: hallucinated vs. confirmed-correct translations (Fairseq DE-EN, $N_{\text{hall}}=324$, $N_{\text{corr}}=2{,}882$). All differences significant at $p<0.001$. The correct group excludes 208 sentences with non-hallucination error labels.

Table [1](https://arxiv.org/html/2606.13216#S4.T1) confirms the geometric signal identified by Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)) with exact $W_1$ distances computed at each individual decoder layer. Hallucinated translations exhibit significantly higher mean concentration (0.296 vs. 0.249, $p<0.001$, Mann-Whitney $U$): attention mass is more tightly focused on a sparse set of source tokens throughout generation. They also show significantly lower step-to-step OT (0.082 vs. 0.124, $p<0.001$): a hallucinating decoder locks onto fixed source positions and stays there, while a correctly translating model dynamically scans as it generates. This *static attention* signature – not reported in the original paper, which did not compute step-resolved OT – provides a complementary characterisation of the failure mode beyond concentration alone.
The remaining metrics reinforce this picture. Lower standard deviation of concentration (0.090 vs. 0.104) confirms that hallucinated attention is not only more peaked on average but also less variable across steps. Lower mean layer OT (0.055 vs. 0.059) indicates that hallucinated sentences exhibit less divergence across decoder layers, consistent with a model that has committed to an output independently of source content and does so uniformly across the stack.
Figure [3](https://arxiv.org/html/2606.13216#S4.F3) shows the four most interpretable metrics as boxplots. The mean concentration panel shows the strongest visual separation: the hallucination box sits almost entirely above the correct box with minimal interquartile overlap. The step-OT panel shows the inverse pattern with comparable clarity. Final-layer concentration and mean layer OT show significant but weaker separation, consistent with their smaller absolute differences in Table [1](https://arxiv.org/html/2606.13216#S4.T1).

Figure 3: OT metrics for hallucinated vs. confirmed-correct translations (Fairseq DE-EN, $N=324$ hallucinated, $N=2{,}882$ correct; outliers suppressed). Left to right: mean concentration, final-layer concentration, mean layer OT, mean step OT. All differences significant at $p<0.001$ (Mann-Whitney $U$).

Finding 2. Hallucinated translations are distinguished from correct translations on every OT metric (all $p<0.001$). The two most discriminative signals are higher mean concentration ($+0.047$) and lower step-to-step OT ($-0.042$), jointly characterising hallucination as static, focused attention that locks onto irrelevant source positions from the first decoding step.
### 4.3 Layer-Resolved Detection Performance
Figure [4](https://arxiv.org/html/2606.13216#S4.F4) reports WTU AUROC by decoder layer and hallucination type. Results are strongly type-dependent. For full-unsupport hallucinations, WTU is a reliable detector: AUROC peaks at L2 (0.946) and remains above 0.93 for L1–L4, before dropping to 0.750 at L5. L0 is near-random (0.584), confirming that the earliest layer carries almost no concentration-based hallucination signal. For strong-unsupport and repetitions, performance is substantially lower throughout (maximum per-layer AUROC 0.672 and 0.667 respectively).
A notable inversion appears at L5: WTU AUROC falls below chance for strong-unsupport (0.461) and repetitions (0.374), meaning final-layer concentration is actively *anti-predictive* for these types. This reflects the layer’s structural role: L5 is always the most concentrated layer regardless of translation quality (mean WTU 0.33 corpus-wide), so its concentration provides no discriminative signal for hallucination types that retain some source routing.
Figure 4: AUROC of the WTU detector by decoder layer and hallucination type (Fairseq DE-EN). The RdYlGn colormap spans $[0.4, 1.0]$; values below 0.5 (red) indicate anti-predictive layers. Full-unsupport is strongly detectable at L1–L4; strong-unsupport and repetitions are weakly detectable throughout; L5 is anti-predictive for the two harder types.
### 4.4 Routing Consistency as a New Detector
RC outperforms WTU for full-unsupport both in aggregate (0.957 vs. 0.937) and at its best layer, L2 (0.955 vs. 0.946), making it the single strongest individual detector in our analysis (Table [2](https://arxiv.org/html/2606.13216#S4.T2)). The explanation is mechanistic: full-unsupport hallucinations have RC near zero at every layer – the decoder routes to the same source position at every step – while confirmed-correct translations dip to $-1.77$ at L2, reflecting highly diverse source scanning. It is not merely that attention is concentrated on a wrong token; it is that the *same* token receives maximum attention at every step regardless of what is being generated. For repetitions, both WTU and RC are near chance (0.568 and 0.508), confirming that oscillatory hallucinations have a qualitatively different attention signature that neither metric captures reliably.

Table 2: Routing Consistency (RC) detector AUROC by decoder layer and hallucination type (Fairseq DE-EN). Higher is better.

Figure 5: Mean routing consistency $\mathrm{RC}(\ell)$ per decoder layer, split by hallucination type and confirmed-correct translations (Fairseq DE-EN). Full-unsupport hallucinations (red, left panel) have RC near zero at every layer – they route to a single source position throughout decoding – while correct translations (blue) dip sharply at L2 ($-1.77$), reflecting diverse source routing. Strong-unsupport shows partial separation; repetitions are indistinguishable.

Table 3: Aggregate WTD vs. WTU AUROC by hallucination type (Fairseq DE-EN). Bold indicates the stronger detector per type.

WTD and WTU are complementary across hallucination types: WTU dominates for full-unsupport (0.937 vs. WTD 0.801) where absolute concentration is the discriminative signal, while WTD dominates for strong-unsupport (0.770 vs. 0.629) and repetitions (0.790 vs. 0.568), where shape similarity to reference distributions carries more information than concentration alone. Figure [6](https://arxiv.org/html/2606.13216#S4.F6) shows the per-layer picture. For strong-unsupport, WTD sits consistently above WTU across L0–L4 (0.72–0.79 vs. 0.60–0.67); at L5, WTU collapses below chance (0.46) while WTD holds steady at $\sim 0.72$. The repetitions panel shows the most extreme divergence: WTD is flat across all six layers (0.79–0.81), while WTU collapses catastrophically at L4–L5, reaching 0.37 at L5 – actively misleading. This complementarity provides empirical grounding for the Wass-Combo design of Guerreiro et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib1)): the two detectors are specialised at the level of individual layers and hallucination types in consistent and interpretable ways.

Figure 6: Per-layer AUROC for WTD (orange) and WTU (blue) by hallucination type (Fairseq DE-EN). WTU dominates for full-unsupport at all layers; WTD dominates for strong-unsupport and repetitions and is robust to the L5 collapse that afflicts WTU. The dashed line marks the random baseline (0.5).

Finding 3. WTU and RC are strong detectors for full-unsupport hallucinations (AUROC 0.937 and 0.957 respectively), peaking at L2. WTD is the stronger detector for strong-unsupport and repetitions (AUROC 0.770 and 0.790), robust across all layers including L5 where WTU becomes anti-predictive. The three detectors are complementary: WTU and RC capture absolute concentration and routing diversity; WTD captures distributional shape relative to correct translations.
### 4.5 Generation Dynamics
Normal decoding follows a consistent two-phase trajectory across all six decoder layers. In the *exploration phase* (roughly the first 25% of the sequence), attention shifts rapidly between source positions and concentration is low – the model is actively scanning the source. In the *commitment phase* (the remaining 75%), attention shifts decay and concentration rises as the model progressively locks onto specific source tokens.
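One simple way to quantify the two phases is to compare the mean of a per-step signal over the first quarter of the sequence with its mean over the rest. The sketch below is illustrative only: `phase_means` is a hypothetical helper, the 25% cut is taken from the rough figure in the text, and the concentration values are invented.

```python
def phase_means(per_step_values, explore_fraction=0.25):
    """Mean of a per-step signal over the early and late parts of a sequence."""
    num_steps = len(per_step_values)
    if num_steps < 2:
        raise ValueError("need at least two decoding steps")
    # Keep at least one step in each phase so both means are defined.
    cut = min(max(1, round(explore_fraction * num_steps)), num_steps - 1)
    early = per_step_values[:cut]
    late = per_step_values[cut:]
    return sum(early) / len(early), sum(late) / len(late)


# Toy per-step concentration: low while exploring, higher once committed.
two_phase = [0.1, 0.1, 0.3, 0.3, 0.3, 0.3, 0.3, 0.3]
# Toy trajectory that is concentrated from the very first step.
flat = [0.3] * 8

print(phase_means(two_phase))
print(phase_means(flat))
```

A large gap between the two means indicates a pronounced exploratory phase; near-equal means indicate a trajectory that starts already concentrated.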
Hallucinated translations lack this structure entirely. Layer trajectories are compressed together from the very first decoding step, with no pronounced early dip and little fan-out between layers: the model begins in a statically concentrated state, skipping the exploratory phase before generation properly starts. This confirms that the attention pathology underlying hallucination is present from step 1, with a practical implication: early-step OT scores could in principle support online detection before the full output is generated.

Figure 7: Mean concentration (left; panel a, concentration trajectory) and step-to-step $W_1$ distance (right; panel b, step-to-step OT trajectory) across relative decoding position, averaged over all 3,414 sentences (Fairseq DE-EN). Both reveal a two-phase dynamic: early exploration (high shift, low concentration) followed by commitment (low shift, rising concentration).

Figure 8: Concentration trajectory split by quality group (Fairseq DE-EN). Correct translations (left; panel a) exhibit the two-phase structure with pronounced early dip and layer fan-out. Hallucinated translations (right; panel b) show compressed, elevated trajectories from step 1 with no exploratory phase.

Finding 4. Normal decoding follows a two-phase trajectory: an early exploratory phase ($\sim$ first 25% of the sequence) followed by a commitment phase of rising concentration and decaying attention shift. Hallucinated translations lack the exploratory phase entirely, beginning in a statically concentrated state from step 1 – consistent with the elevated mean concentration and suppressed step-OT observed in Section 4.2 and the near-zero routing consistency observed in Section 4.4.

Main takeaway. OT strongly detects the failure mode where the decoder detaches from the source. Detection power degrades as source detachment becomes less severe – from full-unsupport (AUROC 0.957 with RC) through strong-unsupport (0.672, best single-layer WTU) to repetitions (0.568, aggregate WTU) – tracking the degree to which source routing is disrupted.
## 5 Experiments II: Transfer to Summarisation
### 5.1 Faithfulness Detection
Table [4](https://arxiv.org/html/2606.13216#S5.T4) reports AggreFact results. The unsupervised OT detector (T5-base) reaches 57.4% BAcc on average – above the 50% random baseline but 14.7 points below supervised MiniCheck-Flan-T5-L. Flan-T5-large yields a marginally different profile (55.6%/61.4%), with a higher XSum score consistent with stronger encoders producing more discriminative concentration signals on abstractive data.
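The averages and the gap quoted above can be checked by direct arithmetic from the per-dataset figures given in the abstract, assuming the averages are unweighted means over CNN and XSum (both test splits have $N=558$):

$$\frac{57.2 + 57.6}{2} = 57.4, \qquad \frac{69.9 + 74.3}{2} = 72.1, \qquad 72.1 - 57.4 = 14.7.$$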
Table 4: Faithfulness detection on AggreFact Tang et al. ([2023](https://arxiv.org/html/2606.13216#bib.bib2)) (Balanced Accuracy, %). Threshold chosen by sweep on the evaluation set.

A cross-dataset logistic regression (OT features from CNN, tested on XSum) achieves 44.2% BAcc – below chance – confirming that OT features do not generalise across extractive and abstractive regimes Maynez et al. ([2020](https://arxiv.org/html/2606.13216#bib.bib6)).
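The threshold sweep mentioned in the Table 4 caption might look like the following sketch. The details are assumptions: candidate thresholds are taken to be the observed scores, an example is predicted positive when its score is at or above the threshold, both classes are assumed to be present, and all names and values are hypothetical.

```python
def balanced_accuracy(labels, preds):
    """Mean of sensitivity and specificity; assumes both classes are present."""
    pos = [p for y, p in zip(labels, preds) if y == 1]
    neg = [p for y, p in zip(labels, preds) if y == 0]
    sensitivity = sum(pos) / len(pos)
    specificity = sum(1 - p for p in neg) / len(neg)
    return 0.5 * (sensitivity + specificity)


def sweep_threshold(scores, labels):
    """Return the (threshold, BAcc) pair maximising balanced accuracy."""
    best_thr, best_bacc = None, 0.0
    for thr in sorted(set(scores)):
        preds = [1 if s >= thr else 0 for s in scores]
        bacc = balanced_accuracy(labels, preds)
        if bacc > best_bacc:
            best_thr, best_bacc = thr, bacc
    return best_thr, best_bacc


scores = [0.10, 0.40, 0.35, 0.80, 0.90]  # toy detector scores
labels = [0, 0, 1, 1, 1]                 # toy binary labels
print(sweep_threshold(scores, labels))
```

Because the threshold is selected on the same data used for evaluation, the resulting figure is best read as an optimistic estimate of what a fixed, pre-chosen threshold would achieve.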
### 5.2 Layer-wise Concentration Profile
The cross-attention concentration profile is non-monotonic (Figure [9](https://arxiv.org/html/2606.13216#S5.F9), Table [5](https://arxiv.org/html/2606.13216#S5.T5)): Layer 3 is the global peak (median $W_1=0.1612$), with partial relaxation in L4 and secondary elevations at L5, L7, L9. Layer 12 returns to low concentration (median $W_1=0.0713$).

Figure 9: Cross-attention concentration $c(\ell)$ by decoder layer (T5-base, $N=100$). Shaded: IQR. Blue: median. L3 is the global peak.

Table 5: Per-layer $W_1(\pi,\mathbf{u})$ for T5-base (median, min, max). Global peak marked $\star$.

Despite this pronounced layer structure, HIGH and LOW ROUGE-L groups are indistinguishable at every layer (Figure [10](https://arxiv.org/html/2606.13216#S5.F10); Mann-Whitney $U$, $p>0.60$ at all layers): concentration magnitude does not predict summarisation quality. Pairwise inter-layer distances confirm that L3 is maximally distant from all other layers (median $D(L1,L3)\approx 0.080$), while L9–L11 form a tight cluster (distances $\approx 0.009$–$0.032$), indicating functional redundancy.

Figure 10: Concentration $c(\ell)$ by ROUGE-L quality group (T5-base, $N=100$). HIGH (green) and LOW (red) are statistically indistinguishable at every layer ($p>0.60$). Both exhibit the L3 peak.
### 5.3 Layer Ablation
Leave-one-out ablation (Table [6](https://arxiv.org/html/2606.13216#S5.T6)) on the T5-base baseline ($\text{R-L}=24.94$) converges on the same exceptional layers. L12 is the single most critical layer ($\Delta\text{R-L}=-0.96$); L5, L7, L11 yield mild improvements when ablated ($+0.59$ to $+0.88$), consistent with the OT clustering of L9–L11. Cumulative ablation reveals a critical threshold:

$$\Delta\mathrm{ROUGE\text{-}L} \;=\; \begin{cases} -0.14 & \text{ablate L1} \\ -0.07 & \text{ablate L1--L2} \\ -4.47 & \text{ablate L1--L3} \end{cases} \tag{7}$$

The sharp collapse at L1–L3 establishes these layers as a jointly indispensable early context-encoding block, directly reinforcing the L3 concentration peak.

Table 6: Selected leave-one-out ablation results (T5-base, $N=50$). $\Delta$R-L $=$ ablated $-$ baseline ($24.94$).

Figure 11: ROUGE-L distribution for T5-base ($N=100$). Left: histogram (median $=0.212$). Right: sorted scores; HIGH = green, LOW = red.

#### Summary.
The OT detector transfers to summarisation with above-chance performance; the gap to supervised methods is systematic. The concentration profile and ablation findings are mutually consistent: L3 and L12 are structurally and functionally exceptional, while L9–L11 are redundant. OT features do not generalise across extractive/abstractive regimes.
## 6 Discussion
### 6\.1 Retrieval Failure vs\. Content Misuse
The NMT success of Guerreiro et al\. \([2023](https://arxiv.org/html/2606.13216#bib.bib1)\) rests on a geometric signal: hallucinating decoders concentrate attention on irrelevant source tokens \(punctuation, EOS\), producing anomalously large $W_1(\pi^{(\ell,t)}, \mathbf{u})$\. In abstractive summarisation, the failure mode is different: an unfaithful summary may still exhibit concentrated, correctly\-targeted attention – the error occurs downstream of retrieval, in how the model processes what it attends to\. By construction, the OT score detects retrieval failure only; it is blind to content misuse\.
The dataset\-level pattern in Table [4](https://arxiv.org/html/2606.13216#S5.T4) is consistent with this\. CNN/DailyMail summaries are largely extractive, so unfaithful ones may genuinely exhibit more diffuse attention; XSum is highly abstractive, making the OT signal noisier even for faithful summaries\. The Fairseq results quantify the spectrum directly: WTU AUROC degrades from 0\.937 \(full\-unsupport, pure retrieval failure\) to 0\.629 \(strong\-unsupport, partial source engagement\)\. The AggreFact result \(57\.4% average BAcc\) falls below even the strong\-unsupport figure, placing abstractive faithfulness failures at the far end of this spectrum\.
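Because the paragraph above sets AUROC figures beside a balanced\-accuracy figure, it may help to see how the two metrics are derived from the same detector scores\. The Python sketch below is illustrative only: the function names are hypothetical, labels use 1 for hallucinated or unfaithful outputs, and the decision threshold for balanced accuracy is an input because this section does not state how it was chosen\.

```python
from typing import Sequence


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Threshold-free AUROC: probability that a positive outscores a negative,
    counting ties as one half (the Mann-Whitney formulation)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(
    scores: Sequence[float], labels: Sequence[int], threshold: float
) -> float:
    """Mean of recall on positives and recall on negatives at one threshold."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    tpr = sum(s >= threshold for s in pos) / len(pos)
    tnr = sum(s < threshold for s in neg) / len(neg)
    return 0.5 * (tpr + tnr)
```

Both metrics equal 0\.5 for an uninformative detector, but AUROC summarises ranking quality over all thresholds whereas balanced accuracy reflects a single operating point, so the two are comparable in direction rather than in exact magnitude\.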
### 6\.2 OT as an Interpretability Tool
Despite limited quality prediction, OT metrics provide a principled lens on Transformer architecture\. The L3 concentration peak, its maximum inter\-layer routing divergence, and the functional redundancy of L9–L11 form a coherent picture of T5\-base decoder organisation – independently confirmed by ablation: OT\-anomalous layers \(L3, L12\) are ablation\-critical; OT\-redundant layers \(L9–L11\) are ablation\-neutral or beneficial\. That the structural pivot is L2 in the six\-layer Fairseq model and L3 in twelve\-layer T5\-base suggests OT localises functionally exceptional layers consistently across architectures\.
A LoRA analysis shows only 0\.132% of T5\-base parameters are adjusted during task adaptation, confirming that the identified attention patterns are intrinsic to the pre\-trained architecture, not task\-specific artifacts\.
## 7 Conclusion
We extended the OT\-based detector of Guerreiro et al\. \([2023](https://arxiv.org/html/2606.13216#bib.bib1)\) to abstractive summarisation, evaluating on 1,116 AggreFact examples Tang et al\. \([2023](https://arxiv.org/html/2606.13216#bib.bib2)\)\. The unsupervised detector achieves 57\.4% average BAcc – above chance but substantially below supervised MiniCheck Tang et al\. \([2024](https://arxiv.org/html/2606.13216#bib.bib3)\) \(72\.1%\)\. The gap is principled: OT on cross\-attention detects retrieval failure but not content misuse, and abstractive faithfulness failures are predominantly of the latter type Maynez et al\. \([2020](https://arxiv.org/html/2606.13216#bib.bib6)\)\.
Structural analysis of T5\-base Raffel et al\. \([2020](https://arxiv.org/html/2606.13216#bib.bib4)\) reveals consistent decoder organisation confirmed across independent methods: Layer 3 is the most selective and routing\-distinct; Layers L1–L3 form an indispensable early context\-encoding block \(Eq\. [7](https://arxiv.org/html/2606.13216#S5.E7)\); Layer 12 is the single most critical for generation quality\. Complementary analysis on the Fairseq DE\-EN corpus achieves AUROC 0\.946 for fully\-detached hallucinations \(RC: 0\.957\), with detection concentrated in L1–L4 and L5 anti\-predictive for subtler types\. WTU and WTD are complementary detectors specialised by hallucination type in consistent, interpretable ways\.
Together, the results support a unified picture: OT on cross\-attention is a reliable detector when the failure mode is source disengagement, a principled interpretability tool regardless of task, and fundamentally limited when faithfulness failures occur downstream of attention\.
Future work should explore token\-level \(non\-head\-averaged\) OT signals, larger instruction\-tuned models where attention detachment may be a cleaner failure signal, and combinations with NLI\-based detectors Kryściński et al\. \([2020](https://arxiv.org/html/2606.13216#bib.bib7)\)\.
## References
- Anonymous \(2026\)\. Code implementation and analytics\. External Links: [Link](https://anonymous.4open.science/r/Layer_Resolved_Optimal_Transport)\.
- N\. M\. Guerreiro, P\. Colombo, P\. Piantanida, and A\. F\. T\. Martins \(2023\)\. Optimal transport for unsupervised hallucination detection in neural machine translation\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL 2023\)\. External Links: [Link](https://arxiv.org/abs/2212.09631)\.
- W\. Kryściński, B\. McCann, C\. Xiong, and R\. Socher \(2020\)\. Evaluating the factual consistency of abstractive text summarization\. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP 2020\)\. External Links: [Link](https://arxiv.org/abs/1910.12840)\.
- J\. Maynez, S\. Narayan, B\. Bohnet, and R\. McDonald \(2020\)\. On faithfulness and factuality in abstractive summarization\. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics \(ACL 2020\)\. External Links: [Link](https://arxiv.org/abs/2005.00661)\.
- G\. Peyré and M\. Cuturi \(2019\)\. Computational optimal transport\. Foundations and Trends in Machine Learning 11\(5–6\), pp\. 355–607\. External Links: [Link](https://arxiv.org/abs/1803.00567)\.
- C\. Raffel, N\. Shazeer, A\. Roberts, K\. Lee, S\. Narang, M\. Matena, Y\. Zhou, W\. Li, and P\. J\. Liu \(2020\)\. Exploring the limits of transfer learning with a unified text\-to\-text transformer\. Journal of Machine Learning Research 21\(140\), pp\. 1–67\. External Links: [Link](https://arxiv.org/abs/1910.10683)\.
- L\. Tang, T\. Goyal, A\. R\. Fabbri, P\. Laban, J\. Xu, S\. Yavuz, W\. Kryściński, J\. F\. Rousseau, and G\. Durrett \(2023\)\. Understanding factual errors in summarization: errors, summarizers, datasets, error detectors\. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics \(ACL 2023\)\. External Links: [Link](https://arxiv.org/abs/2205.12854)\.
- L\. Tang, P\. Laban, and G\. Durrett \(2024\)\. MiniCheck: efficient fact\-checking of LLMs on grounding documents\. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing \(EMNLP 2024\)\. External Links: [Link](https://arxiv.org/abs/2404.10774)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)\. Attention is all you need\. In Advances in Neural Information Processing Systems \(NeurIPS\), Vol\. 30\. External Links: [Link](https://arxiv.org/abs/1706.03762)\.