Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection

arXiv cs.CL Papers

Summary

This paper analyzes hallucination detection in LLMs, proposing a max-pooling approach that improves efficiency by eliminating costly semantic consistency computations while maintaining competitive performance.

arXiv:2605.08863v1 Announce Type: new

Cached at: 05/12/26, 07:04 AM

# Max-pooling Network Revisited: Analyzing the Role of Semantic Probability in Multiple Instance Learning for Hallucination Detection
Source: [https://arxiv.org/html/2605.08863](https://arxiv.org/html/2605.08863)
Issei Sato (sato@g.ecc.u-tokyo.ac.jp), Department of Computer Science, The University of Tokyo

###### Abstract

Hallucination detection has become increasingly important for improving the reliability of large language models \(LLMs\)\. Recently, hybrid approaches such as HaMI, which combine semantic consistency with internal model states via Multiple Instance Learning \(MIL\), have achieved state\-of\-the\-art performance\. However, these methods incur substantial computational overhead due to repeated sampling and costly semantic similarity computations\. In this work, we first provide a theoretical analysis of HaMI in terms of decision margins, revealing that scaling internal states with semantic consistency leads to an enlarged decision margin\. Motivated by this insight, we revisit classical sentence classification models from a margin enlargement perspective, aggregating token\-level features via max pooling and directly estimating sentence scores using a lightweight MLP\. Without requiring semantic consistency computations, our approach achieves substantial efficiency improvements while maintaining competitive performance with state\-of\-the\-art baselines through adaptive aggregation of internal feature representations\.

## 1 Introduction

Despite the strong performance of Large Language Models \(LLMs\), their practical deployment is hindered by hallucinations\[[11](https://arxiv.org/html/2605.08863#bib.bib34),[13](https://arxiv.org/html/2605.08863#bib.bib11)\]\. This issue is particularly critical in high\-stakes domains such as healthcare, where incorrect recommendations can have serious consequences\. Prior work has shown that models may generate contextually inconsistent or incorrect information\[[19](https://arxiv.org/html/2605.08863#bib.bib13)\], and may even risk exposing sensitive training data through prompting\[[7](https://arxiv.org/html/2605.08863#bib.bib12)\]\. Ensuring output reliability is therefore essential, making hallucination detection a key challenge\. Existing approaches to hallucination detection fall into three main categories\. First, uncertainty\-based methods estimate confidence from model probabilities or from the semantic diversity of multiple sampled responses, as in Semantic Entropy\[[9](https://arxiv.org/html/2605.08863#bib.bib7)\], which accounts for semantic equivalence\. Second, retrieval\-based approaches query external knowledge sources or LLMs to verify factual consistency\[[28](https://arxiv.org/html/2605.08863#bib.bib4)\]\. Third, representation\-based methods exploit internal hidden states\[[3](https://arxiv.org/html/2605.08863#bib.bib35),[10](https://arxiv.org/html/2605.08863#bib.bib5)\], often training classifiers on fixed token positions such as the final token\[[25](https://arxiv.org/html/2605.08863#bib.bib6)\]\. A recent hybrid method, HaMI\[[23](https://arxiv.org/html/2605.08863#bib.bib2)\], formulates hallucination detection as a multiple instance learning \(MIL\) problem and improves robustness through adaptive token selection and semantic\-probability weighting\.

A major limitation of HaMI is its reliance on external models for estimating semantic uncertainty, which introduces substantial computational overhead and latency\. Frequent API calls not only increase economic costs but also raise privacy concerns and hinder real\-time deployment\. In practical large\-scale or interactive settings, minimizing inference latency is therefore critical\.

Despite its empirical success, it remains unclear why semantic\-probability weighting improves detection performance\. To better understand this issue, we first analyze the mechanism of HaMI from a theoretical perspective\. Our analysis shows that weighting by semantic uncertainty effectively increases the classification margin at the logit level\. Motivated by this observation, we propose a detection framework that operates entirely on the model’s internal states\. Rather than scaling individual tokens based on external uncertainty signals, our approach enlarges the margin through structured aggregation of token\-level features, capturing sentence\-level representations more effectively\.

Specifically, we revisit the classical sentence classification architecture by incorporating a lightweight feature transformation layer followed by feature-wise max pooling \[[16](https://arxiv.org/html/2605.08863#bib.bib26)\]. This design is motivated by the view that hallucination-related signals may be sparse across tokens and can be better preserved by max pooling. As summarized in Table [1](https://arxiv.org/html/2605.08863#S1.T1), our method operates directly on internal hidden states rather than relying on external uncertainty signals. Token-level representations are first transformed using an MLP, after which feature-wise max pooling aggregates the most informative signals across tokens. This process leads to a clearer separation between faithful and hallucinated responses while maintaining high computational efficiency.

Our contributions are summarized as follows:

- We provide a theoretical analysis of HaMI, showing that semantic-probability weighting improves performance by enlarging the classification margin (Theorem [1](https://arxiv.org/html/2605.08863#Thmtheorem1)).
- We analyze max-pooling-based networks from a theoretical perspective, focusing on their effect on margin and generalization (Theorems [2](https://arxiv.org/html/2605.08863#Thmtheorem2) and [3](https://arxiv.org/html/2605.08863#Thmtheorem3), and Proposition [1](https://arxiv.org/html/2605.08863#Thmproposition1)).
- We propose a fully self-contained detection framework that achieves comparable performance while being significantly faster (Sections [5.2](https://arxiv.org/html/2605.08863#S5.SS2) and [5.3](https://arxiv.org/html/2605.08863#S5.SS3)).

Table 1: Comparison of architectural designs between HaMI and our study.

## 2 Preliminaries

### 2.1 Problem Settings

Hallucination refers to the phenomenon in which responses contradict factual reality, typically arising when LLMs lack sufficient internal knowledge or fail to properly use their internal knowledge\[[29](https://arxiv.org/html/2605.08863#bib.bib10)\]\.

Let $D=\{(q_i,a_i)\}_{i=1}^{N}$ denote a QA dataset, where $q_i$ and $a_i$ represent a question and its ground-truth answer, respectively. Given a question $q$, an LLM produces a response sequence $\mathcal{A}=(t_1,t_2,\dots,t_n)$.

For each generated token $t_i$, we extract the hidden representation $\mathbf{h}_i\in\mathbb{R}^{d}$ from an intermediate layer. The factual correctness of the generated response $\mathcal{A}$ is evaluated by an external evaluator LLM, which assigns a binary label $y\in\{-1,1\}$ indicating whether the response is factually correct. Following the convention stated in Section 4.1, $y=1$ marks a hallucinated response.

### 2.2 Semantic Probability

Following the Semantic Entropy framework \[[9](https://arxiv.org/html/2605.08863#bib.bib7)\], we quantify uncertainty by analyzing the semantic diversity of stochastic samples $\mathcal{G}=\{\mathcal{A}_1,\dots,\mathcal{A}_K\}$ for each question $q$.

#### Semantic Clustering

To group linguistically diverse but semantically equivalent responses, we use an external LLM to evaluate the relation $R_{ij}:=R(\mathcal{A}_i,\mathcal{A}_j)\in\{E,C,N\}$ (Entailment, Contradiction, Neutral). We define equivalence as:

$$\mathcal{A}_i\equiv\mathcal{A}_j\iff(R_{ij},R_{ji})\in\{(E,E),(E,N),(N,E)\}$$

The set $\mathcal{G}$ is thus partitioned into semantic clusters $\{\mathcal{C}_1,\dots,\mathcal{C}_s\}$.

#### Uncertainty Metrics

A cluster’s probability is the normalized sum of its sequences’ likelihoods:

$$P_{\mathcal{C}}=\frac{\sum_{\mathcal{A}\in\mathcal{C}}P(\mathcal{A}\mid q)}{\sum_{k=1}^{K}P(\mathcal{A}_k\mid q)}.$$

Based on this, Semantic Probability \[[23](https://arxiv.org/html/2605.08863#bib.bib2)\] is defined as:

$$P^{(i)}_{\text{sem}}=P_{\mathcal{C}(i)},\quad\text{where }\mathcal{C}(i)\text{ is the cluster containing }\mathcal{A}_i.$$
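
The clustering rule and the two probabilities above can be illustrated with a minimal sketch. It assumes the pairwise relations $R_{ij}$ have already been obtained from the external LLM and are passed in as a matrix of labels, and it uses a greedy pass that compares each response with the first member of every existing cluster. That greedy strategy, the function name `semantic_probabilities`, and the toy inputs are assumptions for illustration, not details taken from the paper.

```python
from typing import List, Sequence, Tuple

# Pairs (R_ij, R_ji) that count as semantic equivalence in Section 2.2.
EQUIVALENT = {("E", "E"), ("E", "N"), ("N", "E")}


def semantic_probabilities(
    relations: Sequence[Sequence[str]],
    likelihoods: Sequence[float],
) -> Tuple[List[List[int]], List[float]]:
    """Cluster K responses and return (clusters, P_sem per response).

    relations[i][j] is the label in {"E", "C", "N"} for R(A_i, A_j);
    likelihoods[i] is P(A_i | q).
    """
    clusters: List[List[int]] = []
    for i in range(len(likelihoods)):
        for cluster in clusters:
            rep = cluster[0]  # assumption: compare with the first member only
            if (relations[rep][i], relations[i][rep]) in EQUIVALENT:
                cluster.append(i)
                break
        else:
            clusters.append([i])  # no equivalent cluster found: open a new one

    total = sum(likelihoods)
    p_sem = [0.0] * len(likelihoods)
    for cluster in clusters:
        # Cluster probability: normalized sum of member likelihoods.
        p_cluster = sum(likelihoods[i] for i in cluster) / total
        for i in cluster:
            p_sem[i] = p_cluster
    return clusters, p_sem


# Toy example: responses 0 and 1 entail each other; 2 and 3 are (E, N).
relations = [
    ["E", "E", "C", "C"],
    ["E", "E", "C", "C"],
    ["C", "C", "E", "E"],
    ["C", "C", "N", "E"],
]
clusters, p_sem = semantic_probabilities(relations, [0.4, 0.3, 0.2, 0.1])
```

In this toy case the four responses fall into two clusters, so every response inherits the probability mass of its cluster rather than its own likelihood.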

### 2.3 Hallucination Detection via Multiple Instance Learning

We formulate hallucination detection within the Multiple Instance Learning (MIL) framework \[[6](https://arxiv.org/html/2605.08863#bib.bib19),[12](https://arxiv.org/html/2605.08863#bib.bib3),[20](https://arxiv.org/html/2605.08863#bib.bib32)\]. A response is treated as a bag $\mathbf{B}=\{\mathbf{h}_{\mathbf{B},1},\dots,\mathbf{h}_{\mathbf{B},T_{\mathbf{B}}}\}$ with a label $y_{\mathbf{B}}$. Let $\mathcal{S}$ be the set of bags in the dataset. We define the subsets of positive and negative samples as follows:

$$\mathcal{S}_{pos}=\{\mathbf{B}\in\mathcal{S}\mid y_{\mathbf{B}}=1\},\quad\mathcal{S}_{neg}=\{\mathbf{B}\in\mathcal{S}\mid y_{\mathbf{B}}=-1\}$$

To ensure permutation invariance, the scoring function $S(\mathbf{B})$ must be symmetric in the instances of the bag.

#### Aggregation Strategies

1. Instance-based Aggregation: Each token $\mathbf{h}_{\mathbf{B},i}$ is first mapped to a scalar $s_{\mathbf{B},i}=f_1(\mathbf{h}_{\mathbf{B},i})$. The response-level score is then obtained through a pooling operator $g_1$: $S(\mathbf{B})=g_1(\{f_1(\mathbf{h}_{\mathbf{B},i})\}_{i=1}^{T_{\mathbf{B}}})$.
2. Embedding-based Aggregation: We first aggregate token-level hidden states into a bag-level embedding using a pooling operator $g_2$. The final score is then computed with a scoring function $f_2$: $S(\mathbf{B})=f_2(g_2(\{\mathbf{h}_{\mathbf{B},i}\}_{i=1}^{T_{\mathbf{B}}}))$.
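
The two aggregation strategies differ only in the order of scoring and pooling, which a small sketch can make concrete. The linear scorer, the choice of max pooling for $g_1$ and mean pooling for $g_2$, and the toy bag are hypothetical stand-ins chosen for illustration; the paper does not prescribe them at this point.

```python
import numpy as np


def instance_based_score(H, f1, g1):
    """S(B) = g1({f1(h_i)}): score every token, then pool the scalar scores."""
    scores = np.array([f1(h) for h in H])
    return float(g1(scores))


def embedding_based_score(H, g2, f2):
    """S(B) = f2(g2({h_i})): pool tokens into one embedding, then score it."""
    return float(f2(g2(H)))


# Hypothetical toy components (not trained models).
w = np.array([0.5, -1.0, 2.0])


def linear(h):
    return float(w @ h)


def max_over_scores(s):
    return s.max()


def mean_over_tokens(H):
    return H.mean(axis=0)


# A bag of three 3-dimensional "hidden states".
H = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.0, 1.0]])

s_inst = instance_based_score(H, linear, max_over_scores)   # max of token scores
s_emb = embedding_based_score(H, mean_over_tokens, linear)  # score of mean token

# Both pooling operators are symmetric, so token order does not matter.
s_inst_perm = instance_based_score(H[::-1], linear, max_over_scores)
s_emb_perm = embedding_based_score(H[::-1], mean_over_tokens, linear)
```

Because max and mean are symmetric functions, reversing the token order leaves both scores unchanged, which is the permutation-invariance requirement stated in Section 2.3.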

## 3 Analysis of HaMI \[[23](https://arxiv.org/html/2605.08863#bib.bib2)\]

### 3.1 HaMI Architecture Formulation

The HaMI architecture is illustrated in Figure [1](https://arxiv.org/html/2605.08863#A1.F1). Let a bag be denoted as $\mathbf{B}=\{\mathbf{h}_{\mathbf{B},i}\}_{i=1}^{T_{\mathbf{B}}}$, where $\mathbf{h}_{\mathbf{B},i}\in\mathbb{R}^{d}$. Each instance is transformed into a logit $z_{\mathbf{B},i}$ through a neural network $f$:

$$z_{\mathbf{B},i}=f(\mathbf{h}_{\mathbf{B},i})=w^{\top}\mathrm{ReLU}(\mathrm{BN}(\mathbf{x}_{\mathbf{B},i}+\mathbf{b}_1))+b_2,$$

where $\mathbf{x}_{\mathbf{B},i}=\mathbf{W}\mathbf{h}_{\mathbf{B},i}$, $\mathrm{BN}(\cdot)$ denotes Batch Normalization, and $\mathbf{b}_1,b_2$ are bias terms.

In HaMI, instance representations are scaled by the semantic probability $P_{\mathrm{sem}}^{\mathbf{B}}$:

$$\tilde{\mathbf{h}}_{\mathbf{B},i}=(1+\lambda P_{\mathrm{sem}}^{\mathbf{B}})\mathbf{h}_{\mathbf{B},i},$$

where $\lambda>0$. We denote the scaled logit as $\tilde{z}_{\mathbf{B},i}=f(\tilde{\mathbf{h}}_{\mathbf{B},i})$. The bag-level score $S(\mathbf{B})$ is the average of the top-$k$ instance scores in the bag, where $\mathcal{K}_{\mathbf{B}}$ denotes the index set of these top-$k$ instances.

#### Margin Formulation

We define the logit-space margin over a dataset of bags $\mathcal{S}$:

$$M_{\text{hami}}(\lambda)=\frac{1}{|\mathcal{S}|}\sum_{\mathbf{B}\in\mathcal{S}}y_{\mathbf{B}}\tilde{Z}_{\mathbf{B}},\quad\tilde{Z}_{\mathbf{B}}=\frac{1}{k}\sum_{i\in\mathcal{K}_{\mathbf{B}}}\tilde{z}_{\mathbf{B},i}.$$

We use logit-space margins rather than sigmoid probabilities, since margin analysis concerns the signed distance from the decision boundary. Although the sigmoid is monotone, it is not distance-preserving: for large $|z|$, changes in $z$ yield negligible changes in $\sigma(z)$, so probability space may underestimate margins. We therefore evaluate margins in logit space, prior to the nonlinear compression by the sigmoid.
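
The scoring pipeline of this subsection (scale, project, normalize, activate, take the top-$k$ mean, then average signed bag logits) can be sketched in NumPy. This is a minimal illustration under stated assumptions: Batch Normalization is applied in inference mode with given running statistics and no epsilon term, and the function names and toy values are hypothetical rather than taken from the HaMI implementation.

```python
import numpy as np


def instance_logits(H, W, b1, mu, sigma, gamma, beta, w, b2, p_sem=0.0, lam=1.0):
    """Per-token logits z_i = w^T ReLU(BN(W h_i + b1)) + b2 for a scaled bag.

    H has shape (T, d); W has shape (D, d). Setting p_sem=0 gives the
    unscaled logits.
    """
    H_scaled = (1.0 + lam * p_sem) * H          # semantic-probability scaling
    X = H_scaled @ W.T                          # (T, D), rows are x_i = W h_i
    bn = gamma * (X + b1 - mu) / sigma + beta   # inference-time BN (assumed form)
    return np.maximum(bn, 0.0) @ w + b2         # (T,)


def bag_logit(z, k):
    """Average of the k largest instance logits in the bag."""
    k = min(k, len(z))
    return float(np.sort(z)[-k:].mean())


def logit_margin(bag_logits, labels):
    """Empirical logit-space margin: mean of y_B * Z_B over bags."""
    return float(np.mean(np.asarray(labels) * np.asarray(bag_logits)))


# Toy check with identity projection and trivial BN statistics.
H = np.array([[1.0, -1.0]])
z = instance_logits(
    H, W=np.eye(2), b1=np.zeros(2), mu=np.zeros(2), sigma=np.ones(2),
    gamma=np.ones(2), beta=np.zeros(2), w=np.ones(2), b2=0.0,
    p_sem=0.5, lam=2.0,
)
top2 = bag_logit(np.array([0.1, 2.0, -1.0, 3.0]), k=2)
margin = logit_margin([2.5, -1.5], [1, -1])
```

Working on logits rather than on `sigmoid(z)` keeps the margin a linear function of the bag logits, which is what the analysis in Section 3.2 relies on.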

### 3.2 Margin Enhancement via Asymmetric Scaling

We analyze the effect of the scaling factor $p_{\mathbf{B}}=\lambda P_{\mathrm{sem}}^{\mathbf{B}}$. The pre-normalization feature with scaling during the forward pass is $(1+p_{\mathbf{B}})\mathbf{x}_{\mathbf{B},i}+\mathbf{b}_1$. During inference, the running mean $\mu_j$ and standard deviation $\sigma_j$ are fixed. The $j$-th dimension of the BN output is decomposed as:

$$\begin{aligned}\mathrm{BN}((1+p_{\mathbf{B}})\mathbf{x}_{\mathbf{B},i}+\mathbf{b}_1)_j&=\gamma_j\frac{(1+p_{\mathbf{B}})x_{\mathbf{B},i,j}+b_{1,j}-\mu_j}{\sigma_j}+\beta_j\\&=\mathrm{BN}(\mathbf{x}_{\mathbf{B},i}+\mathbf{b}_1)_j+p_{\mathbf{B}}\gamma_j\frac{x_{\mathbf{B},i,j}}{\sigma_j}.\end{aligned}$$

Let $\mathcal{A}_{\mathbf{B},i}$ denote the active set of indices where the ReLU activation is strictly positive. Assuming the active set remains invariant under the scaling $p_{\mathbf{B}}$, the exact scaled logit $\tilde{z}_{\mathbf{B},i}$ becomes:

$$\tilde{z}_{\mathbf{B},i}=\sum_{j\in\mathcal{A}_{\mathbf{B},i}}w_j\left(\mathrm{BN}(\mathbf{x}_{\mathbf{B},i}+\mathbf{b}_1)_j+p_{\mathbf{B}}\gamma_j\frac{x_{\mathbf{B},i,j}}{\sigma_j}\right)+b_2=z_{\mathbf{B},i}-p_{\mathbf{B}}C(\mathbf{x}_{\mathbf{B},i}),$$

where $C(\mathbf{x}_{\mathbf{B},i})=-\frac{\partial\tilde{z}_{\mathbf{B},i}}{\partial p_{\mathbf{B}}}=-\sum_{j\in\mathcal{A}_{\mathbf{B},i}}w_j\gamma_j\frac{x_{\mathbf{B},i,j}}{\sigma_j}$. For clarity, we present the derivation assuming a locally invariant active set. However, using the property that ReLU networks are continuous piecewise-affine (CPWA) maps \[[2](https://arxiv.org/html/2605.08863#bib.bib30)\], we can generalize this sensitivity analysis via path integration. This generalization ensures that our formulation $Z_B(p_B)=Z_B(0)-p_B\bar{C}_B^{\mathrm{int}}$ remains mathematically rigorous even under dynamic changes in activation patterns, as derived in Appendix [C](https://arxiv.org/html/2605.08863#A3). Empirically, we observe the following three properties (Appendix [G](https://arxiv.org/html/2605.08863#A7)):

$$\bar{C}_{\mathbf{B}}>0,\quad\text{where}\quad\bar{C}_{\mathbf{B}}=\frac{1}{k}\sum_{i\in\mathcal{K}_{\mathbf{B}}}C(\mathbf{x}_{\mathbf{B},i})\tag{1}$$

$$\mathbb{E}_{neg}[P_{\mathrm{sem}}^{\mathbf{B}}]>\mathbb{E}_{pos}[P_{\mathrm{sem}}^{\mathbf{B}}]\tag{2}$$

$$\mathbb{E}_{neg}[\bar{C}_{\mathbf{B}}]>\mathbb{E}_{pos}[\bar{C}_{\mathbf{B}}].\tag{3}$$

These properties can be understood from the following perspectives:

- Logit Suppression Effect: The positivity in Equation ([1](https://arxiv.org/html/2605.08863#S3.E1)) indicates that the scaling operation inherently acts as a penalty: it suppresses the overall logits.
- Semantic Consistency: Equation ([2](https://arxiv.org/html/2605.08863#S3.E2)) reflects the intuition that models produce more semantically consistent outputs for non-hallucinated instances.
- Input-Scale Sensitivity: Driven by the classification loss during training, the model adaptively adjusts $\bar{C}_{\mathbf{B}}$. It learns to increase $\bar{C}_{\mathbf{B}}$ for negative instances to amplify logit suppression, while decreasing it for positive instances to prevent undesired attenuation. This adaptive behavior naturally leads to Equation ([3](https://arxiv.org/html/2605.08863#S3.E3)).
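
The identity $\tilde{z}=z-p\,C(\mathbf{x})$ from the derivation in this subsection can be checked numerically on a single instance. The sketch below works directly with a pre-activation vector $\mathbf{x}$ and uses hand-picked, hypothetical BN statistics and weights, chosen so that the ReLU active set is the same with and without scaling; it is a sanity check of the algebra, not a reproduction of the paper's experiments.

```python
import numpy as np


def bn_relu_logit(x, b1, mu, sigma, gamma, beta, w, b2, p=0.0):
    """Return (logit, active mask) for the pre-activation x scaled by (1 + p)."""
    bn = gamma * ((1.0 + p) * x + b1 - mu) / sigma + beta  # inference-time BN
    return float(w @ np.maximum(bn, 0.0) + b2), bn > 0


def sensitivity(x, sigma, gamma, w, active):
    """C(x) = -sum over active j of w_j * gamma_j * x_j / sigma_j."""
    return float(-np.sum((w * gamma * x / sigma)[active]))


# Hypothetical hand-picked values (3 hidden units).
x = np.array([1.0, -2.0, 3.0])
b1 = np.array([0.5, 0.0, -1.0])
mu = np.array([0.5, 0.0, 0.0])
sigma = np.array([2.0, 1.0, 0.5])
gamma = np.array([1.0, 1.0, 2.0])
beta = np.zeros(3)
w = np.array([1.0, 1.0, -1.0])
b2 = 0.1
p = 0.3  # plays the role of lambda * P_sem

z, active = bn_relu_logit(x, b1, mu, sigma, gamma, beta, w, b2)
z_scaled, active_scaled = bn_relu_logit(x, b1, mu, sigma, gamma, beta, w, b2, p=p)
C = sensitivity(x, sigma, gamma, w, active)
```

With these values the active set is unchanged by the scaling, `C` is positive, and the scaled logit equals `z - p * C`, matching the logit-suppression reading of Equation (1).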

Motivated by these observations, we define the ratio of the expected product of the scaling factor and logit sensitivity between the negative and positive classes as $\gamma$ (not to be confused with the BN scale parameters $\gamma_j$):

$$\gamma=\frac{\mathbb{E}_{neg}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]}{\mathbb{E}_{pos}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]}.$$

The empirical validation of this property and the observed range of $\gamma$ are provided in Appendix [G.4](https://arxiv.org/html/2605.08863#A7.SS4).

###### Theorem 1 (Margin Enhancement Logic via Semantic Probability Weighting).

Weighting the input instances by their semantic probabilities strictly increases the empirical expected margin $\mathbb{E}_{\mathbf{B}}[M_{\text{hami}}]$, provided that the dataset class ratio satisfies the following condition:

$$\frac{|\mathcal{S}_{neg}|}{|\mathcal{S}_{pos}|}>\frac{1}{\gamma}.\tag{4}$$

#### Interpretation\.

Theorem [1](https://arxiv.org/html/2605.08863#Thmtheorem1) formally justifies the mechanism of HaMI. Since the scaling operation inherently suppresses logits, it introduces a trade-off: it expands margins for negative bags while compressing them for positive ones. The parameter $\gamma$ characterizes the relative magnitude of this effect, representing the ratio of the expected margin gain in negative bags to the penalty in positive bags. Consequently, Equation ([4](https://arxiv.org/html/2605.08863#S3.E4)) implies that as long as the dataset contains a sufficient proportion of negative bags to leverage this sensitivity differential, the overall expected margin will increase. This widened margin establishes a more robust decision boundary for hallucination detection.
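
The class-ratio condition of Theorem 1 can be traced with a short calculation. The following is a sketch rather than the paper's proof. It assumes the invariant-active-set form $\tilde{Z}_{\mathbf{B}}=Z_{\mathbf{B}}-\lambda P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$, that the top-$k$ index set $\mathcal{K}_{\mathbf{B}}$ is unchanged by the scaling, that $\mathbb{E}_{neg}$ and $\mathbb{E}_{pos}$ are class-wise empirical averages, and that $\mathbb{E}_{pos}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]>0$.

$$\begin{aligned}
M_{\text{hami}}(\lambda)-M_{\text{hami}}(0)
&=-\frac{\lambda}{|\mathcal{S}|}\sum_{\mathbf{B}\in\mathcal{S}}y_{\mathbf{B}}\,P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}\\
&=\frac{\lambda}{|\mathcal{S}|}\Big(|\mathcal{S}_{neg}|\,\mathbb{E}_{neg}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]-|\mathcal{S}_{pos}|\,\mathbb{E}_{pos}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]\Big).
\end{aligned}$$

Since $\lambda>0$, the right-hand side is positive exactly when $|\mathcal{S}_{neg}|\,\mathbb{E}_{neg}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]>|\mathcal{S}_{pos}|\,\mathbb{E}_{pos}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]$, which rearranges to $|\mathcal{S}_{neg}|/|\mathcal{S}_{pos}|>1/\gamma$. Under these assumptions, a larger $\gamma$ therefore tolerates a smaller share of negative bags.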

## 4 Max Pooling Network Revisited \[[16](https://arxiv.org/html/2605.08863#bib.bib26)\]

HaMI increases the margin by adjusting the token\-level scale\. However, since it scores tokens independently, it may not fully capture sentence\-level coherence\. To address this limitation, we revisit the classical sentence classification model\[[16](https://arxiv.org/html/2605.08863#bib.bib26)\]\. While our formulation is not a direct instantiation, it is inspired by the same underlying intuition\. Our theoretical analysis indicates that such models can effectively enlarge the margin while incorporating holistic sentence\-level information\.

### 4.1 Model Architecture

The model’s architecture is shown in Figure [2](https://arxiv.org/html/2605.08863#A1.F2). It treats each LLM token’s hidden state as an instance and computes the hallucination probability for the entire bag.

Let a bag be $\mathbf{B}=\{\mathbf{h}_1,\mathbf{h}_2,\dots,\mathbf{h}_{T_{\mathbf{B}}}\}$, where $\mathbf{h}_i\in\mathbb{R}^{d}$ represents the hidden state vector of the $i$-th token. First, we obtain instance features $\mathbf{u}_i\in\mathbb{R}^{D}$ through a feature extraction layer $f_\phi$ that maps the states to a lower-dimensional space:

$$\mathbf{u}_i=\text{ReLU}(\mathbf{x}_i),\quad\text{where}\quad\mathbf{x}_i=W^{\top}\mathbf{h}_i$$

Next, a max-pooling operation is applied to all instances within the bag, selecting the maximum value for each dimension to extract the bag representation $\mathbf{v}\in\mathbb{R}^{D}$:

$$v_j=\max_{i=1,\dots,T_{\mathbf{B}}}(u_i)_j\quad\text{for }j=1,\dots,D.$$

The final hallucination score $S(\mathbf{B})$ is calculated by the classification layer as follows:

$$z_{\mathbf{B}}=w^{\top}\mathbf{v},\quad S(\mathbf{B})=\sigma(z_{\mathbf{B}}),$$

where $\sigma$ denotes the sigmoid function, and $S\in[0,1]$ represents the predicted probability that the bag contains a hallucination. To train the model, we employ the logistic loss. For a given set of $N$ responses, let $y_{\mathbf{B}}\in\{-1,1\}$ denote the ground-truth label, where $y_{\mathbf{B}}=1$ indicates a hallucination. The loss function is defined as:

$$\mathcal{L}_{\mathbf{B}}=\log\left(1+\exp\left(-y_{\mathbf{B}}z_{\mathbf{B}}(\theta)\right)\right).$$

By minimizing this objective function, the model learns to assign higher scores to responses containing hallucinations while maintaining lower scores for factually correct outputs. Moreover, we analyze the first-order margin-expansion dynamics of this model under logistic-loss updates. Our analysis shows that the bag-wise margin expansion is governed by the squared gradient norm, which allows us to compare max pooling and mean pooling in sparse MIL regimes.

While it is possible to apply max pooling directly to the raw $d$-dimensional hidden states, we introduce a feature-extraction layer $f_\phi$ to project these states into a more specialized latent space. We provide a comparative analysis of these configurations in a subsequent section.
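
The forward pass and loss of Section 4.1 fit in a few lines of NumPy. The sketch below follows the equations of this section (ReLU feature layer, feature-wise max pooling, linear classifier, logistic loss); the function names, the bias-free layers, and the toy bag are assumptions made for illustration, and a practical implementation would use an autodiff framework for training.

```python
import numpy as np


def max_pool_logit(H, W, w):
    """Bag logit z_B = w^T v, with v_j = max_i ReLU(W^T h_i)_j.

    H has shape (T, d), W has shape (d, D), w has shape (D,).
    """
    U = np.maximum(H @ W, 0.0)   # (T, D) instance features u_i
    v = U.max(axis=0)            # (D,) feature-wise max pooling over tokens
    return float(w @ v)


def hallucination_score(z):
    """S(B) = sigmoid(z_B): predicted probability of a hallucination."""
    return 1.0 / (1.0 + np.exp(-z))


def logistic_loss(z, y):
    """L_B = log(1 + exp(-y * z)) for a label y in {-1, +1}."""
    return float(np.log1p(np.exp(-y * z)))


# Toy bag with T=3 tokens and d=2; identity projection, so D=2.
H = np.array([[1.0, -1.0],
              [0.0, 2.0],
              [-3.0, 0.5]])
W = np.eye(2)
w = np.array([1.0, 0.5])

z = max_pool_logit(H, W, w)          # v = [1.0, 2.0], so z = 2.0
score = hallucination_score(z)
loss_if_hallucinated = logistic_loss(z, +1)
loss_if_faithful = logistic_loss(z, -1)
```

Each pooled feature comes from a single token (here token 1 for the first feature and token 2 for the second), so different tokens can contribute different features to the bag representation.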

### 4.2 Margin Analysis

First, we establish that for any model utilizing a pooling operator, the margin expands during training\.

###### Theorem 2 (Expected Bag-wise Margin Expansion).

Let $\mathcal{M}(\theta)=\mathbb{E}_{\mathbf{B}}[m_{\mathbf{B}}(\theta)]$ be the expected bag-wise margin. Then, each sampled bag’s own gradient update with a sufficiently small $\eta>0$ increases the expected bag-wise margin as follows:

$$\mathcal{M}(\theta_{t+1})=\mathcal{M}(\theta_t)+\eta\,\mathbb{E}_{\mathbf{B}}\left[\frac{\|\nabla_\theta z_{\mathbf{B}}(\theta_t)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta_t))}\right]+O(\eta^{2}).$$

#### Gradient Norm Drives Margin Expansion\.

The increase of the margin is governed by $\|\nabla_\theta z_B(\theta)\|^{2}$. In particular, larger gradient norms directly lead to larger margin increases. However, the expansion in Theorem [2](https://arxiv.org/html/2605.08863#Thmtheorem2) is local in $\eta$: for a large learning rate, the $O(\eta^{2})$ remainder can dominate the positive first-order term. In Appendix [D](https://arxiv.org/html/2605.08863#A4), we show that under a $\beta$-smoothness assumption on the bag-wise margin, this problem can be controlled by an explicit step-size condition, yielding positive bag-wise self-margin expansion for sufficiently small but quantifiable $\eta$.
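
The first-order term in Theorem 2 follows from a one-line Taylor expansion. The sketch below assumes that the bag-wise margin is $m_{\mathbf{B}}(\theta)=y_{\mathbf{B}}z_{\mathbf{B}}(\theta)$ (a definition not spelled out in this section), that $z_{\mathbf{B}}$ is differentiable at $\theta_t$, and that the update is a plain gradient step on the bag's own logistic loss $\mathcal{L}_{\mathbf{B}}=\log(1+\exp(-m_{\mathbf{B}}))$.

$$\begin{aligned}
\nabla_\theta\mathcal{L}_{\mathbf{B}}(\theta_t)&=-\frac{\nabla_\theta m_{\mathbf{B}}(\theta_t)}{1+\exp(m_{\mathbf{B}}(\theta_t))},\\
\theta_{t+1}&=\theta_t-\eta\nabla_\theta\mathcal{L}_{\mathbf{B}}(\theta_t)=\theta_t+\eta\,\frac{\nabla_\theta m_{\mathbf{B}}(\theta_t)}{1+\exp(m_{\mathbf{B}}(\theta_t))},\\
m_{\mathbf{B}}(\theta_{t+1})&=m_{\mathbf{B}}(\theta_t)+\nabla_\theta m_{\mathbf{B}}(\theta_t)^{\top}(\theta_{t+1}-\theta_t)+O(\eta^{2})\\
&=m_{\mathbf{B}}(\theta_t)+\eta\,\frac{\|\nabla_\theta z_{\mathbf{B}}(\theta_t)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta_t))}+O(\eta^{2}).
\end{aligned}$$

The last step uses $\|\nabla_\theta m_{\mathbf{B}}\|^{2}=y_{\mathbf{B}}^{2}\|\nabla_\theta z_{\mathbf{B}}\|^{2}=\|\nabla_\theta z_{\mathbf{B}}\|^{2}$, since $y_{\mathbf{B}}\in\{-1,1\}$. Taking the expectation over bags, each with its own update, recovers the stated expression.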

#### Motivation: Hallucination Detection as Sparse MIL\.

Hallucination detection is a sparse MIL problem: a hallucination is typically triggered by a small fraction of erroneous tokens within a vast, factually correct context \[[21](https://arxiv.org/html/2605.08863#bib.bib28),[23](https://arxiv.org/html/2605.08863#bib.bib2)\]. While mean pooling captures overall semantic context, it theoretically dilutes critical hallucination signals by averaging them with numerous normal tokens \[[5](https://arxiv.org/html/2605.08863#bib.bib37)\], hindering precise error localization. Conversely, max pooling provides a more direct gradient path by routing updates exclusively through the most salient instances \[[5](https://arxiv.org/html/2605.08863#bib.bib37)\], thereby mitigating signal attenuation. To formalize this structural advantage of max pooling in sparse scenarios, we introduce the following assumption.

###### Assumption 1 (Sparse MIL Structure).

For fixed $\theta=(W,w)$, let $\mathbf{g}_{\mathbf{B},i,j}:=\mathbf{h}_{\mathbf{B},i}\mathbf{1}\{W_j^{\top}\mathbf{h}_{\mathbf{B},i}>0\}$. For each bag $\mathbf{B}$, let $S_{\mathbf{B},j}\subseteq\{1,\dots,T_{\mathbf{B}}\}$ be the set of informative instances ($s_{\mathbf{B},j}:=|S_{\mathbf{B},j}|$), $s_{\mathbf{B}}:=\max_j s_{\mathbf{B},j}$, and $J_{\mathbf{B}}:=\{j:s_{\mathbf{B},j}\geq 1\}$. We assume:

1. Sparsity: $i\notin S_{\mathbf{B},j}\implies u_{\mathbf{B},i,j}=0$ and $\mathbf{g}_{\mathbf{B},i,j}=\mathbf{0}$.
2. Boundedness: For all $j\in J_{\mathbf{B}}$ and $i\in S_{\mathbf{B},j}$, there exist positive constants such that $u_{\mathbf{B},i,j}\in[\underline{u},\overline{u}]$ and $\|\mathbf{g}_{\mathbf{B},i,j}\|_2\in[\underline{g},\overline{g}]$.

###### Theorem 3 (Gradient Norm Ratio in Sparse MIL).

Under Assumption [1](https://arxiv.org/html/2605.08863#Thmassumption1), for any positive bag $\mathbf{B}$, the ratio of the gradient norms for max pooling and mean pooling satisfies:

$$\frac{\|\nabla_\theta z_{\mathbf{B}}^{\max}\|^{2}}{\|\nabla_\theta z_{\mathbf{B}}^{\mathrm{mean}}\|^{2}}=\Omega\left(\left(\frac{T_{\mathbf{B}}}{s_{\mathbf{B}}}\right)^{2}\right).$$

#### Gradient Dynamics and Margin Expansion during Training\.

Theorem [2](https://arxiv.org/html/2605.08863#Thmtheorem2) indicates that the instantaneous growth of the bag-wise margin is proportional to the squared gradient norm $\|\nabla_\theta z_B(\theta)\|^{2}$, scaled by $(1+\exp(m_{\mathbf{B}}(\theta)))^{-1}$. At a given margin level, the expansion force is determined by the gradient magnitude. In sparse MIL regimes ($s_{\mathbf{B}}\ll T_{\mathbf{B}}$), Theorem [3](https://arxiv.org/html/2605.08863#Thmtheorem3) shows that this force is amplified for max pooling by a factor of $\Omega((T_{\mathbf{B}}/s_{\mathbf{B}})^{2})$ relative to mean pooling. Thus, max pooling effectively overcomes signal dilution from negative instances, leading to faster and larger margin expansion.
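
The $(T_{\mathbf{B}}/s_{\mathbf{B}})^{2}$ scaling can be observed on a constructed bag. The sketch below computes the analytic gradients of $z_{\mathbf{B}}$ with respect to $(W,w)$ for max and mean pooling on a hypothetical bag in which $s=2$ identical informative tokens sit among $T=100$ tokens whose features are all switched off by the ReLU. The bag, the identity projection, and the function name are illustrative assumptions; in this idealized construction the ratio equals $(T/s)^{2}$ exactly, whereas the theorem only states a lower-bound order.

```python
import numpy as np


def grad_sq_norm(H, W, w, pooling):
    """Squared norm of the gradient of z = w^T pool(ReLU(H W)) w.r.t. (W, w)."""
    X = H @ W                          # (T, D) pre-activations W_j^T h_i
    U = np.maximum(X, 0.0)             # (T, D) instance features
    active = (X > 0).astype(float)     # ReLU gates 1{W_j^T h_i > 0}
    T, D = U.shape
    if pooling == "max":
        v = U.max(axis=0)
        top = U.argmax(axis=0)         # winning token for each feature
        # Only the winning token of each feature receives gradient.
        grad_W = np.stack(
            [w[j] * active[top[j], j] * H[top[j]] for j in range(D)], axis=1
        )
    elif pooling == "mean":
        v = U.mean(axis=0)
        # Every active token contributes, each down-weighted by 1/T.
        grad_W = (H.T @ active) * w / T
    else:
        raise ValueError("pooling must be 'max' or 'mean'")
    # The gradient with respect to w is the pooled vector v itself.
    return float((grad_W ** 2).sum() + (v ** 2).sum())


T, s = 100, 2
h_star = np.array([1.0, 2.0, 3.0, 4.0])   # informative token (all features on)
H = np.tile(-h_star, (T, 1))              # uninformative tokens: ReLU output 0
H[:s] = h_star
W = np.eye(4)
w = np.ones(4)

ratio = grad_sq_norm(H, W, w, "max") / grad_sq_norm(H, W, w, "mean")
```

Here mean pooling shrinks both gradient blocks by the factor $s/T$, so the squared norm shrinks by $(s/T)^{2}$, which is the dilution effect described in the motivation paragraph.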

### 4.3 Rademacher Complexity Analysis

We analyze Rademacher complexity\[[4](https://arxiv.org/html/2605.08863#bib.bib33)\]to compare generalization with a baseline model\.

#### Setup\.

Let $\mathcal{S}=\{\mathbf{B}_i\}_{i=1}^{n}$ be a set of bags, where $\mathbf{B}_i=\{h_{i,t}\}_{t=1}^{T_i}\subset\mathbb{R}^{d}$, $\|h_{i,t}\|_2\leq R$, and $T=\max_i T_i$. We compare two max-pooling architectures: the model $\mathcal{F}_{\mathrm{base}}$ that pools raw input features directly, and the model $\mathcal{F}_{\mathrm{feat}}$ that transforms instances into a latent space before pooling.

With ReLU activation $a(x)=\max\{0,x\}$, instance-wise max pooling $\rho$, and weight bounds $\|W_j\|_2\leq B_1$ and $\|w\|_2\leq B_2$, their hypothesis classes are defined as:

$$\begin{aligned}\mathcal{F}_{\mathrm{feat}}&=\left\{w^{\top}\rho\big(\{a(W_j h_t)\}_{j=1,\dots,D;\;t=1,\dots,T}\big)\right\},\\\mathcal{F}_{\mathrm{base}}&=\left\{w^{\top}a\big(W\,\rho(\mathbf{B})\big)\right\}.\end{aligned}$$

###### Proposition 1 (Rademacher Complexity Bounds).

Under the setup above, the empirical Rademacher complexities are bounded as follows:

$$\hat{\mathcal{R}}_S(\mathcal{F}_{\mathrm{feat}})\;\leq\;\frac{2\sqrt{2}\,RB_1B_2\sqrt{DT}}{\sqrt{n}},\quad\hat{\mathcal{R}}_S(\mathcal{F}_{\mathrm{base}})\;\leq\;\frac{2RB_1B_2\sqrt{Dd}}{\sqrt{n}}.$$

#### Discussion\.

These bounds reveal a critical difference in capacity control. The baseline $\mathcal{F}_{\mathrm{base}}$ scales as $\mathcal{O}(\sqrt{Dd/n})$, suffering from the curse of dimensionality. In contrast, $\mathcal{F}_{\mathrm{feat}}$ scales as $\mathcal{O}(\sqrt{DT/n})$, completely removing the dependence on the input dimension $d$. Thus, placing the feature extraction layer before max pooling is crucial for generalization: it replaces $d$ with the maximum bag size $T$, yielding a significantly tighter bound when $T\ll d$.
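
To get a feel for the two bounds in Proposition 1, they can be evaluated side by side. The values of $R$, $B_1$, $B_2$, $D$, $T$, $d$, and $n$ below are purely illustrative assumptions, not settings reported in the paper, and the function name is hypothetical.

```python
import math


def rademacher_bounds(R, B1, B2, D, T, d, n):
    """Evaluate the two upper bounds of Proposition 1 as (feat, base)."""
    feat = 2.0 * math.sqrt(2.0) * R * B1 * B2 * math.sqrt(D * T) / math.sqrt(n)
    base = 2.0 * R * B1 * B2 * math.sqrt(D * d) / math.sqrt(n)
    return feat, base


# Illustrative values only: short bags (T=64) and a wide hidden state (d=4096).
feat, base = rademacher_bounds(R=1.0, B1=1.0, B2=1.0, D=256, T=64, d=4096, n=4000)
ratio = feat / base   # the shared factors cancel, leaving sqrt(2 * T / d)
```

The ratio of the two bounds is $\sqrt{2T/d}$, so with the constants as stated the bound for $\mathcal{F}_{\mathrm{feat}}$ is the smaller one whenever $2T<d$, consistent with the $T\ll d$ regime discussed above.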

## 5 Experiments

In this section, we present three main findings:

(i) Sections [5.2](https://arxiv.org/html/2605.08863#S5.SS2) and [5.3](https://arxiv.org/html/2605.08863#S5.SS3): A simple architecture with feature extraction followed by max pooling is competitive with HaMI while achieving over $10{,}000\times$ faster inference.

(ii) Section [5.4](https://arxiv.org/html/2605.08863#S5.SS4): The feature extraction layer is crucial, especially for max pooling.

(iii) Section [5.5](https://arxiv.org/html/2605.08863#S5.SS5): Semantic scaling enlarges the classification margin over the original HaMI, and max pooling consistently yields larger margins than mean pooling.

### 5.1 Experimental Setup

#### Models and Datasets\.

We employ open-source LLMs from the LLaMA \[[8](https://arxiv.org/html/2605.08863#bib.bib24)\] and Mistral \[[14](https://arxiv.org/html/2605.08863#bib.bib25)\] families, and evaluate our method using LLaMA-3.1-8B, Mistral-Nemo-12B, and LLaMA-3.3-70B (4-bit quantized), extracting hidden states from responses sampled at a temperature of 0.5. We conducted experiments on four benchmarks: TriviaQA \[[15](https://arxiv.org/html/2605.08863#bib.bib15)\], SQuAD \[[26](https://arxiv.org/html/2605.08863#bib.bib16)\], Natural Questions (NQ) \[[18](https://arxiv.org/html/2605.08863#bib.bib17)\], and BioASQ \[[17](https://arxiv.org/html/2605.08863#bib.bib18)\]. We used 4,000 samples for training and 10,000 for evaluation per dataset, except for BioASQ, for which we used 2,300 training samples and 6,000 evaluation samples due to size limits.

#### Labeling and Refinement\.

We used a two-stage evaluation protocol with GPT-5 mini \[[24](https://arxiv.org/html/2605.08863#bib.bib23)\] serving as LLM-as-a-Judge. Initially, responses were compared against ground-truth answers. To ensure high fidelity, only incorrect responses underwent a second evaluation for factual consistency without ground-truth access. Samples with inconsistent labels were discarded. For semantic clustering, GPT-4o mini \[[1](https://arxiv.org/html/2605.08863#bib.bib22)\] grouped five independent responses per query to characterize the predictive distribution.
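The two-stage refinement can be sketched as a small decision rule. This is an illustrative sketch only: the function name `refine_label`, the boolean encoding of the judge verdicts, and the convention that label 1 means "hallucinated" are assumptions, and the actual judge calls are not shown.

```python
from typing import Optional


def refine_label(first_pass_correct: bool,
                 second_pass_correct: Optional[bool]) -> Optional[int]:
    """Combine two judge verdicts into a final label.

    Returns 0 (factual), 1 (hallucinated), or None (sample discarded).
    The label encoding is an assumption made for this sketch.
    """
    if first_pass_correct:
        # Judged correct against the ground truth: no second stage needed.
        return 0
    if second_pass_correct is None:
        raise ValueError("incorrect responses require a second judgment")
    if second_pass_correct:
        # The two stages disagree, so the sample is dropped.
        return None
    # Both stages agree that the response is incorrect.
    return 1


# Hypothetical verdicts: (stage 1 with ground truth, stage 2 without).
verdicts = [(True, None), (False, False), (False, True)]
labels = [refine_label(a, b) for a, b in verdicts]
kept = [y for y in labels if y is not None]
```

Only responses that fail the first stage incur a second judge call, so the number of extra calls scales with the error rate rather than with the dataset size.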

#### Implementation\.

For each layer, a lightweight MLP detector was trained for 100 epochs (batch size: 128), with the best checkpoint selected via validation AUC. After refinement, the final dataset comprised approximately 3,500 training, 4,000 validation, and 5,000 test samples. Data distributions and hyperparameter settings are detailed in Appendices [F](https://arxiv.org/html/2605.08863#A6) and [A.1](https://arxiv.org/html/2605.08863#A1.SS1).
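Checkpoint selection by validation AUC can be illustrated as follows. This is a minimal sketch under stated assumptions: `auroc` is a plain pairwise implementation written for this example (not the authors' evaluation code), and the per-epoch validation scores are hypothetical.

```python
def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties contribute 0.5. Quadratic in the number of samples, which is
    acceptable for a small illustration.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("both classes must be present")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def select_best_epoch(val_scores_per_epoch, val_labels):
    """Return (epoch index, AUROC) of the best-scoring epoch."""
    aucs = [auroc(s, val_labels) for s in val_scores_per_epoch]
    best = max(range(len(aucs)), key=aucs.__getitem__)
    return best, aucs[best]


# Hypothetical validation scores for two epochs on four samples.
val_labels = [1, 0, 1, 0]
val_scores_per_epoch = [
    [0.6, 0.7, 0.4, 0.3],  # epoch 0
    [0.9, 0.2, 0.8, 0.1],  # epoch 1
]
best_epoch, best_auc = select_best_epoch(val_scores_per_epoch, val_labels)
```

In practice the model weights saved at `best_epoch` would be the ones evaluated on the test split.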

### 5.2 Classification Performance

Results in Table [2](https://arxiv.org/html/2605.08863#S5.T2) show that embedding-based methods outperform unweighted HaMI. This advantage stems from aggregating information in a high-dimensional space ($\mathbb{R}^{D}$), which minimizes information loss and preserves diverse hallucination-related representations better than instance-based methods that compress data into a single dimension before pooling.

Notably, max pooling consistently outperforms mean pooling across all models and datasets. Quantitatively, max pooling achieves an average AUROC improvement of approximately 1.5–2.5 points over mean pooling. For instance, on Mistral-Nemo-Instruct (12B) with TriviaQA, max pooling improves the AUROC from 0.929 to 0.945. On LLaMA-3.1-8B (NQ), the improvement is even more pronounced, rising from 0.792 to 0.814.

Furthermore, our max-pooling method achieves performance comparable to, or in several cases exceeding, HaMI (SP). Specifically, on the LLaMA-3.3-70B model, max pooling achieves the best results across all datasets. A significant margin is observed on TriviaQA, where it reaches an AUROC of 0.945, surpassing HaMI (SP) by 3.1 points. This empirical superiority of simple max pooling supports the view that high-dimensional max-aggregation is effective at capturing hallucination signals.

Table 2: AUC comparison across methods. Best and second-best results are highlighted in red and blue, respectively. Results are averaged over five runs. Standard errors are reported in Appendix [E.1](https://arxiv.org/html/2605.08863#A5.SS1).

### 5.3 Efficiency Comparison

To evaluate computational efficiency, we compared the inference throughput (samples/sec) of each method using approximately 4,000 samples. Table [3](https://arxiv.org/html/2605.08863#S5.T3) shows that embedding-based mean and max pooling consistently outperform instance-based HaMI across all configurations. Specifically, our max-pooling strategy is up to 1.6$\times$ faster than the original HaMI. Furthermore, max pooling is more than 10,000$\times$ faster than HaMI (SP) across all datasets. These results highlight that embedding-based aggregation is not only highly efficient but also well suited to large-scale hallucination detection, where high-throughput inference is essential.

Table 3: Inference throughput comparison (samples/sec). The best results are highlighted in red, and the second-best results are shown in blue.

### 5.4 Significance of the Feature Extraction Layer

To verify the necessity of the feature extraction layer prior to pooling, we compared our architecture against a baseline that applied pooling directly to the raw hidden states ($\mathbb{R}^{d}$). In the latter case, the MLP was placed after the pooling operation. As shown in Table [4](https://arxiv.org/html/2605.08863#S5.T4), placing the feature extraction layer before pooling consistently improves max pooling AUROC by 2–4 points, whereas mean-pooling performance remains largely unchanged regardless of the layer position.

This gain empirically validates the theoretical benefit of the bottleneck $D$ discussed in Section [4.3](https://arxiv.org/html/2605.08863#S4.SS3). Applying max pooling directly to high-dimensional raw states ($\mathbb{R}^{d}$) makes the model susceptible to the curse of dimensionality ($\mathcal{O}(\sqrt{Dd/n})$), as the pooling operation cannot effectively filter noise in the original feature space. By inserting the feature extraction layer first, we reduce the complexity to $\mathcal{O}(\sqrt{DT/n})$. These results confirm that the feature extraction layer is particularly important for max pooling, because it transforms raw representations into a noise-robust space before aggregation.
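The difference between the two orderings can be seen on a toy bag. The sketch below is illustrative only: it uses a single ReLU feature and a hand-picked two-token bag (all values hypothetical) and omits bias terms and normalization.

```python
def relu(x):
    return max(0.0, x)


def matvec(W, h):
    """Multiply a D x d matrix (list of rows) by a d-vector."""
    return [sum(w_ij * h_j for w_ij, h_j in zip(row, h)) for row in W]


def max_pool(rows):
    """Coordinate-wise maximum over a list of equal-length vectors."""
    return [max(col) for col in zip(*rows)]


def score_feat(w, W, bag):
    """Feature extraction per token, then max pooling (as in F_feat)."""
    feats = [[relu(v) for v in matvec(W, h)] for h in bag]  # T x D
    pooled = max_pool(feats)                                # D
    return sum(w_j * p_j for w_j, p_j in zip(w, pooled))


def score_base(w, W, bag):
    """Max pooling on raw hidden states, then the feature layer (as in F_base)."""
    pooled = max_pool(bag)                                  # d
    feats = [relu(v) for v in matvec(W, pooled)]            # D
    return sum(w_j * f_j for w_j, f_j in zip(w, feats))


# Hypothetical bag: T = 2 tokens with d = 2 dimensions, one feature (D = 1).
bag = [[1.0, -1.0], [-1.0, 1.0]]
W = [[1.0, -1.0]]  # responds to the pattern "first coordinate high, second low"
w = [1.0]

s_feat = score_feat(w, W, bag)
s_base = score_base(w, W, bag)
```

In this example the feature fires on the first token when extraction precedes pooling, whereas pooling the raw states first combines coordinates from different tokens into a vector that matches no single token, and the pattern is lost.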

Table 4: AUC comparison based on the input to the pooling operation. We compared applying pooling to raw hidden states versus the extracted features. Bold values indicate the best performance.

### 5.5 Margin Results

We compare margins after 100 training epochs (Table [5](https://arxiv.org/html/2605.08863#S5.T5)), measured at the layer that achieves the best AUC for each method. Since fundamental structural differences preclude direct numerical comparison between HaMI and embedding-based methods, we focus on margin gains within each category.

HaMI (SP) significantly widens the margin over the original HaMI, with gains of 30–60% across most models and a notable 144.0% improvement on LLaMA-3.1-8B (SQuAD). This supports our theoretical insight that semantic weighting effectively isolates relevant information from noise.

Similarly, among embedding-based methods, max pooling consistently outperforms mean pooling by roughly 10–20%, with a 20.9% increase on LLaMA-3.1-8B (NQ). These results confirm that max-aggregation better captures salient features for margin enlargement than its mean-pooling counterpart.

Table 5: Margin comparison across different methods. Results are averaged over five runs with different random seeds. Standard errors are reported in Appendix [E.1](https://arxiv.org/html/2605.08863#A5.SS1).

## 6 Conclusion

In this study, we addressed the significant computational overhead and operational challenges inherent in existing hallucination detection methods, such as HaMI \[[23](https://arxiv.org/html/2605.08863#bib.bib2)\], which rely on external LLM calls. We proposed an efficient, self-contained framework that leverages the internal feature representations of the generator model.

Our method eliminates substantial API costs and data privacy risks by using only hidden states. Experimental results demonstrate a speedup of over 10,000$\times$ compared to HaMI (SP), significantly enhancing the feasibility of real-time deployment.

Beyond efficiency, we established a rigorous theoretical foundation for our architecture. We proved that integrating a feature-extraction layer with max pooling tightens generalization bounds by constraining the Rademacher complexity. Furthermore, our analysis revealed that max-aggregation inherently enlarges the decision margin in sparse MIL settings, providing a theoretical justification for its superior classification performance without the need for external weighting.

Empirical evaluations across diverse QA benchmarks confirm that our approach maintains competitive performance with HaMI while offering significant improvements in efficiency\. However, our evaluation is limited to QA tasks, and its effectiveness for longer\-form generation tasks remains to be explored\. Extending the framework to these settings is an important direction for future work\.

## References

- \[1\] J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkat, et al. (2023) GPT-4 technical report. arXiv:2303.08774. [Link](https://arxiv.org/abs/2303.08774)
- \[2\] R. Arora, A. Basu, P. Mianjy, and A. Mukherjee (2018) Understanding deep neural networks with rectified linear units. arXiv:1611.01491. [Link](https://arxiv.org/abs/1611.01491)
- \[3\] A. Azaria and T. Mitchell (2023) The internal state of an llm knows when it's lying. arXiv:2304.13734. [Link](https://arxiv.org/abs/2304.13734)
- \[4\] P. L. Bartlett and S. Mendelson (2003) Rademacher and gaussian complexities: risk bounds and structural results. J. Mach. Learn. Res. 3, pp. 463–482. ISSN 1532-4435.
- \[5\] Y. Boureau, J. Ponce, and Y. LeCun (2010) A theoretical analysis of feature pooling in visual recognition. In Proceedings of the 27th International Conference on Machine Learning, ICML'10, Madison, WI, USA, pp. 111–118. ISBN 9781605589077.
- \[6\] M. Carbonneau, V. Cheplygina, E. Granger, and G. Gagnon (2018) Multiple instance learning: a survey of problem characteristics and applications. Pattern Recognition 77, pp. 329–353. ISSN 0031-3203. [Link](https://dx.doi.org/10.1016/j.patcog.2017.10.009)
- \[7\] N. Carlini, F. Tramer, E. Wallace, M. Jagielski, A. Herbert-Voss, K. Lee, A. Roberts, T. Brown, D. Song, U. Erlingsson, A. Oprea, and C. Raffel (2021) Extracting training data from large language models. arXiv:2012.07805. [Link](https://arxiv.org/abs/2012.07805)
- \[8\] A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fan, et al. (2024) The llama 3 herd of models. arXiv preprint arXiv:2407.21783.
- \[9\] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal (2024) Detecting hallucinations in large language models using semantic entropy. Nature 630(8017), pp. 625–630.
- \[10\] C. Gao, H. Chen, C. Xiao, Z. Chen, Z. Liu, and M. Sun (2025) H-neurons: on the existence, impact, and origin of hallucination-associated neurons in llms. arXiv:2512.01797. [Link](https://arxiv.org/abs/2512.01797)
- \[11\] L. Huang, W. Yu, W. Ma, W. Zhong, Z. Feng, H. Wang, Q. Chen, W. Peng, X. Feng, B. Qin, and T. Liu (2025) A survey on hallucination in large language models: principles, taxonomy, challenges, and open questions. ACM Trans. Inf. Syst. 43(2). ISSN 1046-8188. [Link](https://doi.org/10.1145/3703155)
- \[12\] M. Ilse, J. M. Tomczak, and M. Welling (2018) Attention-based deep multiple instance learning. arXiv:1802.04712. [Link](https://arxiv.org/abs/1802.04712)
- \[13\] Z. Ji, N. Lee, R. Frieske, T. Yu, D. Su, Y. Xu, E. Ishii, Y. J. Bang, A. Madotto, and P. Fung (2023) Survey of hallucination in natural language generation. ACM Computing Surveys 55(12), pp. 1–38.
- \[14\] A. Q. Jiang, A. Sablayrolles, A. Mensch, C. Bamford, D. S. Chaplot, D. de las Casas, F. Bressand, G. Lengyel, G. Lample, L. Saulnier, et al. (2023) Mistral 7b. arXiv preprint arXiv:2310.06825.
- \[15\] M. Joshi, E. Choi, D. S. Weld, and L. Zettlemoyer (2017) TriviaQA: a large scale distantly supervised challenge dataset for reading comprehension. arXiv:1705.03551. [Link](https://arxiv.org/abs/1705.03551)
- \[16\] Y. Kim (2014) Convolutional neural networks for sentence classification. arXiv:1408.5882. [Link](https://arxiv.org/abs/1408.5882)
- \[17\] A. Krithara, A. Nentidis, K. Bougiatiotis, and G. Paliouras (2023) BioASQ-qa: a manually curated corpus for biomedical question answering. Scientific Data 10, pp. 170. [Link](https://doi.org/10.1038/s41597-023-02068-4)
- \[18\] T. Kwiatkowski, J. Palomaki, O. Redfield, M. Collins, A. Parikh, C. Alberti, D. Epstein, I. Polosukhin, J. Devlin, K. Lee, K. Toutanova, L. Jones, M. Kelcey, M. Chang, A. M. Dai, J. Uszkoreit, Q. Le, and S. Petrov (2019) Natural questions: a benchmark for question answering research. Transactions of the Association for Computational Linguistics 7, pp. 452–466. [Link](https://aclanthology.org/Q19-1026/)
- \[19\] S. Longpre, K. Perisetla, A. Chen, N. Ramesh, C. DuBois, and S. Singh (2022) Entity-based knowledge conflicts in question answering. arXiv:2109.05052. [Link](https://arxiv.org/abs/2109.05052)
- \[20\] O. Maron and T. Lozano-Pérez (1997) A framework for multiple-instance learning. In Advances in Neural Information Processing Systems, M. Jordan, M. Kearns, and S. Solla (Eds.), Vol. 10. [Link](https://proceedings.neurips.cc/paper_files/paper/1997/file/82965d4ed8150294d4330ace00821d77-Paper.pdf)
- \[21\] J. Maynez, S. Narayan, B. Bohnet, and R. McDonald (2020) On faithfulness and factuality in abstractive summarization. arXiv:2005.00661. [Link](https://arxiv.org/abs/2005.00661)
- \[22\] G. Montúfar, R. Pascanu, K. Cho, and Y. Bengio (2014) On the number of linear regions of deep neural networks. arXiv:1402.1869. [Link](https://arxiv.org/abs/1402.1869)
- \[23\] M. Niu, H. Haddadi, and G. Pang (2025) Robust hallucination detection in llms via adaptive token selection. arXiv:2504.07863. [Link](https://arxiv.org/abs/2504.07863)
- \[24\] OpenAI (2026) GPT-5 mini. Note: [https://platform.openai.com/](https://platform.openai.com/)
- \[25\] H. Orgad, M. Toker, Z. Gekhman, R. Reichart, I. Szpektor, H. Kotek, and Y. Belinkov (2025) LLMs know more than they show: on the intrinsic representation of llm hallucinations. arXiv:2410.02707. [Link](https://arxiv.org/abs/2410.02707)
- \[26\] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang (2016) SQuAD: 100,000+ questions for machine comprehension of text. arXiv:1606.05250. [Link](https://arxiv.org/abs/1606.05250)
- \[27\] M. Sundararajan, A. Taly, and Q. Yan (2017) Axiomatic attribution for deep networks. arXiv:1703.01365. [Link](https://arxiv.org/abs/1703.01365)
- \[28\] H. Tan, F. Sun, S. Liu, D. Su, Q. Cao, X. Chen, J. Wang, X. Cai, Y. Wang, H. Shen, and X. Cheng (2025) Too consistent to detect: a study of self-consistent errors in llms. arXiv:2505.17656. [Link](https://arxiv.org/abs/2505.17656)
- \[29\] S. Zheng, J. Huang, and K. C. Chang (2023) Why does chatgpt fall short in providing truthful answers?. arXiv:2304.10513. [Link](https://arxiv.org/abs/2304.10513)

## Appendix A Implementation Details

### A.1 Setup

#### Model Architecture and Environment\.

All models, including the original HaMI and our proposed embedding-based pooling strategies, use a two-layer MLP architecture. For HaMI, we follow the implementation of \[[23](https://arxiv.org/html/2605.08863#bib.bib2)\], where hidden states are first projected to a 256-dimensional space, followed by batch normalization and a ReLU activation, before producing instance-level logits. For our embedding-based models (mean and max pooling), a similar transformation is applied to each hidden state before aggregating them across the token dimension. All experiments were implemented using PyTorch 2.5.1 and Transformers 5.5.3.

#### Hyperparameters and Training\.

The hyperparameters were tailored for each method to ensure optimal convergence. For HaMI, we used the Adam optimizer with a learning rate of $1\times 10^{-3}$, a weight decay of $5\times 10^{-4}$, and a scaling factor $\lambda=1$. For our max-pooling model, we employed a learning rate of $2\times 10^{-4}$ and a weight decay of $5\times 10^{-3}$. The bottleneck dimension $D$ was set to 256, and the aggregation threshold $k$ for HaMI was 10%.

All training was conducted on a single NVIDIA A100 GPU for 100 epochs with a batch size of 128, selecting the best checkpoint based on the peak validation AUROC. The total training time excluding data generation and labeling was approximately 140 seconds for LLaMA-3.1-8B, 155 seconds for Mistral-Nemo-12B, and 220 seconds for LLaMA-3.3-70B. In all experiments, we adopt prompts from \[[9](https://arxiv.org/html/2605.08863#bib.bib7)\], following the setup of \[[23](https://arxiv.org/html/2605.08863#bib.bib2)\]. We generated five responses for each QA pair without providing any additional context.

#### Prompts

The generation prompt is shown in Prompt 1.

Prompt 1: Generation

    Answer the following question in a single but complete sentence only.
    Question: {question}
    Answer:

Next, to assign labels to the generated answers, we prompt GPT-5 mini. The labeling prompt is shown in Prompt 2.

Prompt 2: Labeling

    We are assessing the quality of answers to the following question: {question}
    The proposed answer is: {predicted_answer}
    Based on the context of question and your own knowledge, is the proposed answer correct?
    Please think carefully and respond only with yes or no.
    Response:

We employ GPT-4o mini as a natural language inference (NLI) judge to assess the semantic relationship between responses. The detailed formulation is shown in Prompt 3.

Prompt 3: Entailment

    Context: Question asked to an AI: '{question}'
    Answer A: {text1}
    Answer B: {text2}
    Determine if Answer A entails Answer B.
    Respond with only one word: entailment,contradiction, or neutral.
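One plausible way to turn pairwise entailment judgments into semantic clusters is sketched below. The details are assumptions rather than the authors' confirmed procedure: responses are merged when entailment holds in both directions (the criterion used in semantic-entropy work such as \[[9](https://arxiv.org/html/2605.08863#bib.bib7)\]), cluster probability is taken as the fraction of sampled responses in the cluster, and `entails` is a hypothetical callable standing in for the judge call with Prompt 3.

```python
def cluster_by_entailment(responses, entails):
    """Greedy clustering: join the first cluster whose representative
    entails, and is entailed by, the new response."""
    clusters = []
    for r in responses:
        for c in clusters:
            rep = c[0]
            if entails(rep, r) and entails(r, rep):
                c.append(r)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([r])
    return clusters


def semantic_probabilities(clusters):
    """Empirical probability of each cluster among the sampled responses."""
    total = sum(len(c) for c in clusters)
    return [len(c) / total for c in clusters]


def toy_entails(a, b):
    """Stand-in for the LLM judge: entailment iff the normalized strings match."""
    return a.lower().strip(".") == b.lower().strip(".")


# Five hypothetical sampled responses for one query.
responses = ["Paris", "paris.", "Lyon", "Paris", "Lyon"]
clusters = cluster_by_entailment(responses, toy_entails)
probs = semantic_probabilities(clusters)
```

With a real judge, each `entails` call is an external LLM request, which is the source of the overhead that the max-pooling approach avoids.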

### A.2 HaMI Architecture

The architecture of HaMI is illustrated in Figure [1](https://arxiv.org/html/2605.08863#A1.F1). The model operates on an instance-based MIL framework, where each instance in a bag is first scaled by its semantic probability $P_{\mathrm{sem}}$. These scaled representations are then processed through a two-layer MLP $f$ to compute individual instance logits. To determine the bag-level score, HaMI employs a TopK pooling strategy, which selects and averages the most discriminative instance scores.
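The TopK pooling step can be sketched in a few lines. This is an illustration under assumptions, not the HaMI implementation: "most discriminative" is read as "highest-scoring", the 10% ratio from Appendix A.1 is converted to a count by rounding with a minimum of one instance, and the instance scores are hypothetical.

```python
def topk_bag_score(instance_scores, ratio=0.10):
    """Average of the top-k instance scores, with k a fraction of the bag size.

    The rounding rule and the minimum of one instance are assumptions.
    """
    k = max(1, round(ratio * len(instance_scores)))
    top = sorted(instance_scores, reverse=True)[:k]
    return sum(top) / k


# Hypothetical instance logits for a bag of 20 tokens, so k = 2.
scores = [float(i) for i in range(20)]
bag_score = topk_bag_score(scores)

# A short bag falls back to k = 1, i.e. plain max over instance scores.
short_bag_score = topk_bag_score([0.2, 0.9, 0.1])
```

Note that pooling here acts on scalar instance scores, in contrast to the embedding-based models, which pool $D$-dimensional features before scoring.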

![HaMI architecture](https://arxiv.org/html/2605.08863v1/Data/arch/hami_arch.png)

Figure 1: HaMI Architecture

### A.3 Max Pooling Architecture

We model each response as a bag of token-level hidden states, as illustrated in Figure [2](https://arxiv.org/html/2605.08863#A1.F2). Each state is projected into a lower-dimensional space, and max pooling is used to extract a bag representation by selecting the most salient features. A linear classifier then produces a hallucination probability.

The model is trained with a binary classification objective\. Theoretically, this design promotes margin expansion during optimization, helping distinguish hallucinated and factual responses\.
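A forward pass through this architecture can be sketched in plain Python. This is a minimal sketch under assumptions: batch normalization is omitted, the weights and the three-token bag are hypothetical, and the helper name `max_pool_forward` is introduced only for this example.

```python
import math


def max_pool_forward(bag, W, b, w, c):
    """Project each token state, apply ReLU, max-pool over tokens, then classify.

    bag: T x d token hidden states; W: D x d projection; b: D biases;
    w: D classifier weights; c: scalar classifier bias.
    Returns (probability, pooled features, index of the winning token per feature).
    """
    feats = [
        [max(0.0, sum(w_i * h_i for w_i, h_i in zip(row, h)) + b_j)
         for row, b_j in zip(W, b)]
        for h in bag
    ]                                                   # T x D
    pooled = [max(col) for col in zip(*feats)]          # D
    winners = [max(range(len(bag)), key=lambda t, j=j: feats[t][j])
               for j in range(len(W))]
    logit = sum(w_j * p_j for w_j, p_j in zip(w, pooled)) + c
    prob = 1.0 / (1.0 + math.exp(-logit))               # sigmoid
    return prob, pooled, winners


# Hypothetical bag with T = 3 tokens, d = 2 dimensions, D = 3 features.
bag = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b = [0.0, 0.0, 0.0]
w = [1.0, -1.0, 0.5]
c = -1.0

prob, pooled, winners = max_pool_forward(bag, W, b, w, c)
```

The `winners` list shows that different features may select different tokens, so the bag representation aggregates the most salient evidence per feature rather than committing to a single token.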

![Classical max pooling network](https://arxiv.org/html/2605.08863v1/Data/arch/max_arch.png)

Figure 2: Classical Max Pooling Network

## Appendix B Proofs

### B.1 Proof of Theorem [1](https://arxiv.org/html/2605.08863#Thmtheorem1)

Given the exact decomposition $\tilde{z}_{\mathbf{B},i}=z_{\mathbf{B},i}-\lambda P_{\mathrm{sem}}^{\mathbf{B}}C(\mathbf{x}_{\mathbf{B},i})$, the change in the bag-level logit, $\Delta Z_{\mathbf{B}}=\tilde{Z}_{\mathbf{B}}-Z_{\mathbf{B}}$, is exactly:

$$
\Delta Z_{\mathbf{B}}=-\lambda P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}.
$$

The total margin change $\Delta M_{hami}=M_{hami}(\lambda)-M_{hami}(0)$ is the sum of contributions from the positive and negative bags:

$$
\begin{aligned}
\Delta M_{hami} &= \frac{1}{|\mathcal{S}|}\left(\sum_{\mathbf{B}\in\mathcal{S}_{pos}}(+1)\Delta Z_{\mathbf{B}}+\sum_{\mathbf{B}\in\mathcal{S}_{neg}}(-1)\Delta Z_{\mathbf{B}}\right)\\
&= \frac{\lambda}{|\mathcal{S}|}\left(\sum_{\mathbf{B}\in\mathcal{S}_{neg}}P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}-\sum_{\mathbf{B}\in\mathcal{S}_{pos}}P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}\right).
\end{aligned}
$$

By expressing the sums as empirical averages over the respective subsets, the overall expected margin change over the dataset $\mathcal{S}$ is exactly:

$$
\begin{aligned}
\mathbb{E}_{\mathbf{B}\in\mathcal{S}}[\Delta M_{hami}] &= \lambda\left(\frac{|\mathcal{S}_{neg}|}{|\mathcal{S}|}\mathbb{E}_{\mathbf{B}\in\mathcal{S}_{neg}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]-\frac{|\mathcal{S}_{pos}|}{|\mathcal{S}|}\mathbb{E}_{\mathbf{B}\in\mathcal{S}_{pos}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]\right)\\
&= \lambda\,\mathbb{E}_{\mathbf{B}\in\mathcal{S}_{pos}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]\,\frac{|\mathcal{S}_{pos}|}{|\mathcal{S}|}\left(\gamma\frac{|\mathcal{S}_{neg}|}{|\mathcal{S}_{pos}|}-1\right),
\end{aligned}
$$

where $\mathbb{E}_{\mathbf{B}\in\mathcal{S}_{neg}}$ and $\mathbb{E}_{\mathbf{B}\in\mathcal{S}_{pos}}$ denote the empirical expectations. From this expression, it follows that the expected margin increases if and only if:

$$
\frac{|\mathcal{S}_{neg}|}{|\mathcal{S}_{pos}|}>\frac{1}{\gamma}.
$$

Thus, asymmetric scaling selectively deepens the negative logits on average, increasing the expected logit-space margin.
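The identity above can be checked numerically on a toy dataset. The sketch below is illustrative: the bags are hypothetical triples of label, semantic probability, and mean value $\bar{C}_{\mathbf{B}}$, and $\gamma$ is taken to be the ratio of the negative-bag to the positive-bag expectation of $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$, which is what the factorization in the last step implies.

```python
def margin_change_direct(bags, lam):
    """Average of y_B * Delta Z_B with Delta Z_B = -lam * P_sem * C_bar."""
    return sum(y * (-lam * p * c) for y, p, c in bags) / len(bags)


def margin_change_closed_form(bags, lam):
    """Closed form: lam * E_pos * |S_pos|/|S| * (gamma * |S_neg|/|S_pos| - 1)."""
    pos = [p * c for y, p, c in bags if y == +1]
    neg = [p * c for y, p, c in bags if y == -1]
    e_pos = sum(pos) / len(pos)
    e_neg = sum(neg) / len(neg)
    gamma = e_neg / e_pos  # assumed definition, implied by the factorization
    n = len(bags)
    return lam * e_pos * (len(pos) / n) * (gamma * len(neg) / len(pos) - 1)


# Hypothetical bags: (label y_B, P_sem, C_bar). Positive bags are hallucinated
# responses and tend to have lower semantic probability.
bags = [
    (+1, 0.2, 1.0),
    (+1, 0.4, 0.5),
    (-1, 0.9, 1.0),
    (-1, 0.8, 1.0),
    (-1, 0.7, 1.0),
]
lam = 1.0

direct = margin_change_direct(bags, lam)
closed = margin_change_closed_form(bags, lam)
```

Here $|\mathcal{S}_{neg}|/|\mathcal{S}_{pos}|=1.5$ and $\gamma=4$, so the condition $|\mathcal{S}_{neg}|/|\mathcal{S}_{pos}|>1/\gamma$ holds and both computations give a positive margin change.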

### B.2 Proof of Bag-level Margin Dynamics under Logistic Loss

###### Lemma 4 (Bag-level Margin Dynamics).

For a sampled bag $\mathbf{B}$, the logistic-loss update induced by $\mathbf{B}$ produces a positive first-order change in its own signed margin $m_{\mathbf{B}}=y_{\mathbf{B}}z_{\mathbf{B}}$ whenever $\nabla_{\theta}z_{\mathbf{B}}\neq 0$. The magnitude of this increase is determined by the squared gradient norm scaled by the current margin's confidence.

This lemma characterizes the local fitness of the model to the bag $\mathbf{B}$. It shows that the logistic loss naturally induces a gradient flow that pushes the model's parameters in a direction that enlarges the margin of the current sample. Importantly, it establishes that the magnitude of this margin increase is directly proportional to the squared norm of the gradient.

#### Proof\.

We first derive the first-order change in the bag-level margin. Let $\mathcal{L}_{\mathbf{B}}(\theta)=\log(1+\exp(-y_{\mathbf{B}}z_{\mathbf{B}}(\theta)))$ be the logistic loss for a bag $\mathbf{B}$. Its gradient with respect to $\theta=\{W,w\}$ is:

$$
\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)=\frac{\partial\mathcal{L}_{\mathbf{B}}}{\partial z_{\mathbf{B}}}\nabla_{\theta}z_{\mathbf{B}}(\theta)=-\frac{y_{\mathbf{B}}}{1+\exp(y_{\mathbf{B}}z_{\mathbf{B}}(\theta))}\nabla_{\theta}z_{\mathbf{B}}(\theta).
$$

Using the SGD update rule $\theta_{t+1}=\theta_{t}-\eta\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta_{t})$, the parameter shift is:

$$
\theta_{t+1}-\theta_{t}=\eta\frac{y_{\mathbf{B}}}{1+\exp(m_{\mathbf{B}}(\theta_{t}))}\nabla_{\theta}z_{\mathbf{B}}(\theta_{t}),
$$

where $m_{\mathbf{B}}(\theta)=y_{\mathbf{B}}z_{\mathbf{B}}(\theta)$. For a sufficiently small learning rate $\eta$, the first-order Taylor expansion of $z_{\mathbf{B}}(\theta_{t+1})$ around $\theta_{t}$ yields:

$$
\begin{aligned}
z_{\mathbf{B}}(\theta_{t+1}) &= z_{\mathbf{B}}(\theta_{t})+\nabla_{\theta}z_{\mathbf{B}}(\theta_{t})^{\top}(\theta_{t+1}-\theta_{t})+O(\eta^{2})\\
&= z_{\mathbf{B}}(\theta_{t})+\eta\frac{y_{\mathbf{B}}}{1+\exp(m_{\mathbf{B}}(\theta_{t}))}\|\nabla_{\theta}z_{\mathbf{B}}(\theta_{t})\|^{2}+O(\eta^{2}).
\end{aligned}
$$

Multiplying both sides by $y_{\mathbf{B}}$ and using $y_{\mathbf{B}}^{2}=1$, we obtain the margin increase:

$$
\Delta m_{\mathbf{B}}(\theta_{t})=\eta\frac{\|\nabla_{\theta}z_{\mathbf{B}}(\theta_{t})\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta_{t}))}+O(\eta^{2}).
$$

Since $\eta>0$ and $\|\nabla_{\theta}z_{\mathbf{B}}\|^{2}\geq 0$, the margin increases to first order as long as the gradient is non-zero.

The squared gradient norm $\|\nabla_{\theta}z_{\mathbf{B}}\|^{2}$ can be decomposed into contributions from the classification weights $w$ and feature extraction weights $W$. Since $\theta$ consists of $\{w_{j},W_{j}\}_{j=1}^{D}$, we have:

$$
\|\nabla_{\theta}z_{\mathbf{B}}\|^{2}=\sum_{j=1}^{D}\left(\left|\frac{\partial z_{\mathbf{B}}}{\partial w_{j}}\right|^{2}+\left\|\nabla_{W_{j}}z_{\mathbf{B}}\right\|_{2}^{2}\right)=\sum_{j=1}^{D}\gamma_{B,j}.
$$

For max pooling, where $z_{B,j}=\max_{i}a(W_{j}^{\top}h_{B,i})$, the output depends only on the maximally activated instance $i_{B,j}^{*}$. The derivatives are:

$$
\frac{\partial z_{B}}{\partial w_{j}}=u_{B,i_{B,j}^{*},j},\quad\frac{\partial z_{B}}{\partial W_{j}}=w_{j}\,h_{B,i^{*}_{B,j}}\,\mathbf{1}\{W_{j}^{\top}h_{B,i^{*}_{B,j}}>0\}.
$$

Thus, $\gamma_{B,j}^{\max}=u_{B,i_{B,j}^{*},j}^{2}+w_{j}^{2}\,\mathbf{1}\{W_{j}^{\top}h_{B,i^{*}_{B,j}}>0\}\,\|h_{B,i^{*}_{B,j}}\|^{2}$.

For mean pooling, where $z_{B,j}=\frac{1}{T_{\mathbf{B}}}\sum_{i=1}^{T_{\mathbf{B}}}a(W_{j}^{\top}h_{B,i})$, the derivatives are averaged over the bag:

$$
\frac{\partial z_{B}}{\partial w_{j}}=\frac{1}{T_{\mathbf{B}}}\sum_{i=1}^{T_{\mathbf{B}}}u_{B,i,j},\quad\frac{\partial z_{B}}{\partial W_{j}}=\frac{w_{j}}{T_{\mathbf{B}}}\sum_{i=1}^{T_{\mathbf{B}}}h_{B,i}\,\mathbf{1}\{W_{j}^{\top}h_{B,i}>0\}.
$$

This yields

$$
\gamma_{B,j}^{\mathrm{mean}}=\left(\frac{1}{T_{\mathbf{B}}}\sum_{i=1}^{T_{\mathbf{B}}}u_{B,i,j}\right)^{2}+\frac{w_{j}^{2}}{T_{\mathbf{B}}^{2}}\left\|\sum_{i=1}^{T_{\mathbf{B}}}h_{B,i}\,\mathbf{1}\{W_{j}^{\top}h_{B,i}>0\}\right\|_{2}^{2}.
$$

Combining these with the margin update equation completes the proof\.
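The per-channel quantities $\gamma_{B,j}^{\max}$ and $\gamma_{B,j}^{\mathrm{mean}}$ can be evaluated directly on a toy bag. The following is a minimal sketch, assuming a ReLU activation and a single channel; the bag `H`, the row `W_j`, and the scalar `w_j` are hypothetical values chosen for illustration, not taken from the paper.

```python
from typing import List

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equally sized vectors."""
    return sum(x * y for x, y in zip(a, b))


def gamma_max(H: List[Vector], W_j: Vector, w_j: float) -> float:
    """Squared gradient contribution of channel j under max pooling."""
    pre = [dot(W_j, h) for h in H]                      # pre-activations W_j^T h_i
    i_star = max(range(len(H)), key=lambda i: pre[i])   # maximally activated instance
    u = max(0.0, pre[i_star])                           # ReLU activation u_{B,i*,j}
    active = 1.0 if pre[i_star] > 0 else 0.0            # indicator 1{W_j^T h_{i*} > 0}
    return u ** 2 + w_j ** 2 * active * dot(H[i_star], H[i_star])


def gamma_mean(H: List[Vector], W_j: Vector, w_j: float) -> float:
    """Squared gradient contribution of channel j under mean pooling."""
    T = len(H)
    pre = [dot(W_j, h) for h in H]
    mean_u = sum(max(0.0, p) for p in pre) / T
    # Sum of the instances whose ReLU is active, coordinate by coordinate.
    g = [sum(h[k] for h, p in zip(H, pre) if p > 0) for k in range(len(W_j))]
    return mean_u ** 2 + (w_j ** 2 / T ** 2) * dot(g, g)


# Hypothetical bag: four instances, only the first activates the channel.
H = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
W_j, w_j = [1.0, 0.0], 2.0

g_max = gamma_max(H, W_j, w_j)    # 5.0
g_mean = gamma_mean(H, W_j, w_j)  # 0.3125
print(g_max, g_mean, g_max / g_mean)
```

In this toy case a single active instance out of four gives a ratio of 16, that is $(T_{\mathbf{B}}/s)^2$ with $T_{\mathbf{B}}=4$ and $s=1$.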

### B.3 Proof of Theorem [2](https://arxiv.org/html/2605.08863#Thmtheorem2)

Let $\mathcal{M}_{\eta}(\theta):=\mathbb{E}_{\mathbf{B}\sim\mathcal{D}}[m_{\mathbf{B}}(\theta-\eta\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta))]$. For a fixed bag, let $\theta_{\eta}:=\theta-\eta\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)$. The function $z_{\mathbf{B}}(\theta)$ is continuous piecewise-affine, and is therefore differentiable almost everywhere with respect to the parameters $\theta$. The derivative with respect to $\eta$ at $\eta=0$ is:

$$\left.\frac{\mathrm{d}}{\mathrm{d}\eta}m_{\mathbf{B}}(\theta_{\eta})\right|_{\eta=0}=-\nabla_{\theta}m_{\mathbf{B}}(\theta)^{\top}\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta).$$

Substituting the gradient $\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)=-\frac{y_{\mathbf{B}}}{1+\exp(m_{\mathbf{B}}(\theta))}\nabla_{\theta}z_{\mathbf{B}}(\theta)$:

$$\begin{aligned}
-\nabla_{\theta}m_{\mathbf{B}}(\theta)^{\top}\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)
&=-(y_{\mathbf{B}}\nabla_{\theta}z_{\mathbf{B}}(\theta))^{\top}\left(-\frac{y_{\mathbf{B}}}{1+\exp(m_{\mathbf{B}}(\theta))}\nabla_{\theta}z_{\mathbf{B}}(\theta)\right)\\
&=\frac{\|\nabla_{\theta}z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))}.
\end{aligned}$$

By interchanging the derivative and expectation:

$$\left.\frac{\mathrm{d}}{\mathrm{d}\eta}\mathcal{M}_{\eta}(\theta)\right|_{\eta=0}=\mathbb{E}_{\mathbf{B}}\left[\frac{\|\nabla_{\theta}z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))}\right]\geq 0.$$

The Taylor expansion around $\eta=0$ gives:

$$\mathcal{M}_{\eta}(\theta)=\mathcal{M}_{0}(\theta)+\eta\,\mathbb{E}_{\mathbf{B}}\left[\frac{\|\nabla_{\theta}z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))}\right]+O(\eta^{2}).$$

For sufficiently small $\eta>0$, $\mathcal{M}_{\eta}(\theta)\geq\mathcal{M}_{0}(\theta)$, confirming that a single SGD update increases the expected bag-wise margin.
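The first-order margin gain can be checked numerically on a single bag. The sketch below assumes a max-pooling model $z_{\mathbf{B}}=\sum_j w_j\max_i \mathrm{ReLU}(W_j^{\top}h_i)$ with logistic loss and margin $m_{\mathbf{B}}=y_{\mathbf{B}}z_{\mathbf{B}}$; all numerical values and function names are hypothetical and serve only to illustrate the identity, not to reproduce the paper's experiments.

```python
import math
from typing import List, Tuple

Vector = List[float]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def bag_logit_and_grads(H: List[Vector], W: List[Vector], w: Vector
                        ) -> Tuple[float, Vector, List[Vector]]:
    """Return z_B and its gradients w.r.t. w and each row W_j (max pooling + ReLU)."""
    z, grad_w, grad_W = 0.0, [], []
    for W_j, w_j in zip(W, w):
        pre = [dot(W_j, h) for h in H]
        i_star = max(range(len(H)), key=lambda i: pre[i])
        u = max(0.0, pre[i_star])
        z += w_j * u
        grad_w.append(u)  # dz/dw_j = u_{B,i*,j}
        if pre[i_star] > 0:
            grad_W.append([w_j * h_k for h_k in H[i_star]])  # dz/dW_j = w_j h_{i*}
        else:
            grad_W.append([0.0] * len(W_j))
    return z, grad_w, grad_W


def margin_after_step(H: List[Vector], y: float, W: List[Vector], w: Vector,
                      eta: float) -> Tuple[float, float, float]:
    """Margin before and after one SGD step, plus the first-order predicted gain."""
    z, grad_w, grad_W = bag_logit_and_grads(H, W, w)
    m = y * z
    coeff = y / (1.0 + math.exp(m))  # theta - eta * grad L = theta + eta * coeff * grad z
    w_new = [w_j + eta * coeff * g for w_j, g in zip(w, grad_w)]
    W_new = [[W_jk + eta * coeff * g for W_jk, g in zip(W_j, g_j)]
             for W_j, g_j in zip(W, grad_W)]
    z_new, _, _ = bag_logit_and_grads(H, W_new, w_new)
    sq_norm = sum(g * g for g in grad_w) + sum(g * g for row in grad_W for g in row)
    predicted_gain = eta * sq_norm / (1.0 + math.exp(m))
    return m, y * z_new, predicted_gain


# Hypothetical bag of three 2-d instances, two channels, positive label.
H = [[1.0, 0.5], [0.2, -1.0], [-0.3, 0.8]]
W = [[0.5, -0.2], [-0.1, 0.4]]
w = [0.3, -0.6]
m_old, m_new, gain = margin_after_step(H, y=1.0, W=W, w=w, eta=1e-3)
print(m_new - m_old, gain)
```

For a small step size the observed gain `m_new - m_old` and the first-order prediction agree up to an $O(\eta^2)$ remainder, as long as the step does not change which instance attains the maximum.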

### B.4 Proof of Theorem [3](https://arxiv.org/html/2605.08863#Thmtheorem3)

Let $J_{\mathbf{B}}:=\{j:s_{\mathbf{B},j}\geq 1\}$ be the set of active channels. For $j\notin J_{\mathbf{B}}$, the gradient contribution $\gamma_{\mathbf{B},j}$ vanishes by Assumption [1](https://arxiv.org/html/2605.08863#Thmassumption1). Thus, the squared norms decompose as $\sum_{j\in J_{\mathbf{B}}}\gamma_{\mathbf{B},j}$.

#### 1\. Max Pooling Lower Bound\.

For $j\in J_{\mathbf{B}}$, let $i^{*}\in S_{\mathbf{B},j}$ be the maximizing instance. By Assumption [1](https://arxiv.org/html/2605.08863#Thmassumption1), we have:

$$\gamma_{\mathbf{B},j}^{\max}=u_{\mathbf{B},i^{*},j}^{2}+w_{j}^{2}\|\mathbf{g}_{\mathbf{B},i^{*},j}\|_{2}^{2}\geq c_{1}(1+w_{j}^{2}),$$

where $c_{1}:=\min\{\underline{u}^{2},\underline{g}^{2}\}$. Summing over $J_{\mathbf{B}}$ yields $\|\nabla_{\theta}z_{\mathbf{B}}^{\max}\|^{2}\geq c_{1}\sum_{j\in J_{\mathbf{B}}}(1+w_{j}^{2})$.

#### 2\. Mean Pooling Upper Bound\.

For mean pooling, only the instances $i\in S_{\mathbf{B},j}$ contribute non-zero terms (with $s_{\mathbf{B},j}\leq s_{\mathbf{B}}$), so the triangle inequality and Assumption [1](https://arxiv.org/html/2605.08863#Thmassumption1) give:

$$\gamma_{\mathbf{B},j}^{\mathrm{mean}}=\left(\frac{\sum_{i\in S_{\mathbf{B},j}}u_{i,j}}{T_{\mathbf{B}}}\right)^{2}+\frac{w_{j}^{2}}{T_{\mathbf{B}}^{2}}\left\|\sum_{i\in S_{\mathbf{B},j}}\mathbf{g}_{i,j}\right\|_{2}^{2}\leq\frac{s_{\mathbf{B}}^{2}}{T_{\mathbf{B}}^{2}}c_{2}(1+w_{j}^{2}),$$

where $c_{2}:=\max\{\overline{u}^{2},\overline{g}^{2}\}$. Thus, $\|\nabla_{\theta}z_{\mathbf{B}}^{\mathrm{mean}}\|^{2}\leq c_{2}\left(\frac{s_{\mathbf{B}}}{T_{\mathbf{B}}}\right)^{2}\sum_{j\in J_{\mathbf{B}}}(1+w_{j}^{2})$.

#### 3\. Conclusion\.

Combining these bounds, we obtain:

$$\frac{\|\nabla_{\theta}z_{\mathbf{B}}^{\max}\|^{2}}{\|\nabla_{\theta}z_{\mathbf{B}}^{\mathrm{mean}}\|^{2}}\geq\frac{c_{1}}{c_{2}}\left(\frac{T_{\mathbf{B}}}{s_{\mathbf{B}}}\right)^{2}=\Omega\left(\left(\frac{T_{\mathbf{B}}}{s_{\mathbf{B}}}\right)^{2}\right).$$
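As a rough numerical illustration (the values are hypothetical and not taken from the paper), consider a bag of $T_{\mathbf{B}}=50$ tokens in which at most $s_{\mathbf{B}}=2$ instances are active per channel. The bound then reads:

$$\frac{\|\nabla_{\theta}z_{\mathbf{B}}^{\max}\|^{2}}{\|\nabla_{\theta}z_{\mathbf{B}}^{\mathrm{mean}}\|^{2}}\geq\frac{c_{1}}{c_{2}}\left(\frac{50}{2}\right)^{2}=625\,\frac{c_{1}}{c_{2}}.$$

Under Assumption 1, the lower bound on the gradient-norm ratio therefore grows quadratically as the active fraction $s_{\mathbf{B}}/T_{\mathbf{B}}$ shrinks.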

### B.5 Proof of Proposition [1](https://arxiv.org/html/2605.08863#Thmproposition1)

We provide the derivations for the Rademacher complexity bounds of both the feature-extraction-based model ($\mathcal{F}_{\mathrm{feat}}$) and the baseline model ($\mathcal{F}_{\mathrm{base}}$).

#### 1. Bound for the Model with Feature Extraction Layer ($\mathcal{F}_{\mathrm{feat}}$)

We first bound the empirical Rademacher complexity of the hypothesis class:

$$\mathcal{F}_{\mathrm{feat}}=\left\{f_{\theta}(\mathbf{B})=w^{\top}\rho\big(\{a(W_{j}h_{t})\}_{j,t}\big)\;\middle|\;\|W_{j}\|_{2}\leq B_{1},\ \|w\|_{2}\leq B_{2}\right\}.$$

Applying the Cauchy-Schwarz inequality to separate the classification weights $w$, we have:

$$\hat{\mathcal{R}}_{S}(\mathcal{F}_{\mathrm{feat}})\leq\frac{B_{2}}{n}\mathbb{E}_{\sigma}\left[\sup_{\|W_{j}\|_{2}\leq B_{1}}\left\|\sum_{i=1}^{n}\sigma_{i}\rho(\mathbf{B}_{i})\right\|_{2}\right].$$

Since the norm constraint $\|W_{j}\|_{2}\leq B_{1}$ applies independently to each row $j\in\{1,\dots,D\}$, the supremum of the $L_{2}$ norm factors into independent suprema across the $D$ dimensions:

$$\sup_{W}\left\|\sum_{i=1}^{n}\sigma_{i}\rho(\mathbf{B}_{i})\right\|_{2}\leq\sqrt{D}\sup_{\|w_{1}\|_{2}\leq B_{1}}\left|\sum_{i=1}^{n}\sigma_{i}\max_{1\leq t\leq T_{i}}a(w_{1}h_{i,t})\right|.$$

By introducing a factor of 2 to handle the absolute value, and sequentially applying Maurer's vector contraction inequality for the max operator (which introduces independent Rademacher variables $\sigma_{i,t}\in\{-1,+1\}$ for each instance $t$) and the Ledoux-Talagrand contraction inequality for the ReLU function $a(\cdot)$, we obtain:

$$\begin{aligned}
\hat{\mathcal{R}}_{S}(\mathcal{F}_{\mathrm{feat}})
&\leq\frac{2\sqrt{2}B_{2}\sqrt{D}}{n}\mathbb{E}_{\sigma}\left[\sup_{\|w_{1}\|_{2}\leq B_{1}}\sum_{i=1}^{n}\sum_{t=1}^{T_{i}}\sigma_{i,t}w_{1}h_{i,t}\right]\\
&\leq\frac{2\sqrt{2}B_{1}B_{2}\sqrt{D}}{n}\mathbb{E}_{\sigma}\left[\left\|\sum_{i=1}^{n}\sum_{t=1}^{T_{i}}\sigma_{i,t}h_{i,t}\right\|_{2}\right]\\
&\leq\frac{2\sqrt{2}RB_{1}B_{2}\sqrt{DT}}{\sqrt{n}},
\end{aligned}$$

where $T=\max_{i}T_{i}$. This confirms the $\mathcal{O}(\sqrt{DT/n})$ scaling, where the dependence on the input dimension $d$ is removed.

#### 2. Bound for the Baseline Model ($\mathcal{F}_{\mathrm{base}}$)

For the baseline class $\mathcal{F}_{\mathrm{base}}$, max pooling $\rho(\mathbf{B}_{i})$ is performed directly on the $d$-dimensional input space. Let $z_{i}=\rho(\mathbf{B}_{i})\in\mathbb{R}^{d}$. Under the assumption $\|h_{i,t}\|_{2}\leq R$, which implies $|h_{i,t,\ell}|\leq R$ for each coordinate, the norm of the pooled vector is bounded by:

$$\|z_{i}\|_{2}=\sqrt{\sum_{k=1}^{d}\left(\max_{t}|h_{i,t,k}|\right)^{2}}\leq\sqrt{dR^{2}}=R\sqrt{d}.$$

Treating this as a standard two-layer neural network with weight constraints $B_{1},B_{2}$ acting on $\{z_{i}\}$, we apply standard Rademacher complexity results:

$$\begin{aligned}
\hat{\mathcal{R}}_{S}(\mathcal{F}_{\mathrm{base}})
&\leq\frac{2B_{1}B_{2}}{n}\mathbb{E}_{\sigma}\left\|\sum_{i=1}^{n}\sigma_{i}z_{i}\right\|_{2}\\
&\leq\frac{2B_{1}B_{2}}{\sqrt{n}}\max_{i}\|z_{i}\|_{2}\leq\frac{2RB_{1}B_{2}\sqrt{Dd}}{\sqrt{n}}.
\end{aligned}$$

This result illustrates that without the feature extraction bottleneck, the complexity explicitly depends on the ambient dimension $d$, suffering from the curse of dimensionality.
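To get a feel for how the two bounds compare, they can be evaluated for concrete constants. The sketch below simply plugs numbers into the two closed-form expressions derived in this proof; the values of `R`, `B1`, `B2`, `D`, `T`, `d`, and `n` are hypothetical and chosen only to illustrate the regime where the hidden-state dimension is much larger than the bag length.

```python
import math


def bound_feat(R: float, B1: float, B2: float, D: int, T: int, n: int) -> float:
    """Upper bound 2*sqrt(2)*R*B1*B2*sqrt(D*T)/sqrt(n) for the feature-extraction model."""
    return 2.0 * math.sqrt(2.0) * R * B1 * B2 * math.sqrt(D * T) / math.sqrt(n)


def bound_base(R: float, B1: float, B2: float, D: int, d: int, n: int) -> float:
    """Upper bound 2*R*B1*B2*sqrt(D*d)/sqrt(n) for the baseline model."""
    return 2.0 * R * B1 * B2 * math.sqrt(D * d) / math.sqrt(n)


# Hypothetical setting: 4096-d hidden states, bags of at most 50 tokens.
R, B1, B2 = 1.0, 1.0, 1.0
D, T, d, n = 64, 50, 4096, 10_000

feat = bound_feat(R, B1, B2, D, T, n)
base = bound_base(R, B1, B2, D, d, n)
print(feat, base, feat / base)
```

The ratio of the two bounds is $\sqrt{2T/d}$, so the feature-extraction bound is the smaller one whenever $2T<d$.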

## Appendix C Generalization to Non-invariant Active Set in HaMI

In the analysis of ReLU networks, the fact that input perturbations cause changes in activation patterns poses a significant challenge\. In this section, we show that the entire network is a Continuous Piecewise\-Affine \(CPWA\) map, mathematically guaranteeing the validity of sensitivity analysis through path integration\.

### C.1 Properties and Compositions of Piecewise-Affine Maps

Here, we summarize the properties regarding the composition of piecewise\-affine maps defined on a finite number of polyhedral regions\.

#### Definition: Piecewise\-Affine Map

A function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}$ is defined as piecewise-affine if there exists a partition of the input space $R_{1},\ldots,R_{M}$ such that on each region $R_{i}$ it can be described using $A_{i}\in\mathbb{R}^{m\times d}$ and $b_{i}\in\mathbb{R}^{m}$ as follows:

$$F(x)=A_{i}x+b_{i}\quad(\forall x\in R_{i}).$$

#### Primary Properties of Composition

The following properties hold for the composition of affine maps and piecewise\-affine maps\. Furthermore, since the composition of continuous functions is continuous, if each component is continuous, the composite function is also a continuous piecewise\-affine \(CPWA\) map\.

1. (Affine) $\circ$ (Affine) is an affine map.
2. (Affine) $\circ$ (PWA) is a piecewise-affine map.
3. (PWA) $\circ$ (Affine) is a piecewise-affine map.
4. (PWA) $\circ$ (PWA) is a piecewise-affine map.

#### Mathematical Proofs

- Proof of 1: Let $H(x):=Ax+b$ and $G(y):=Cy+d$. Then $(G\circ H)(x)=C(Ax+b)+d=(CA)x+(Cb+d)$, which is an affine map.
- Proof of 2: Let $F$ be a PWA map that equals $A_{i}x+b_{i}$ on region $R_{i}$, and let $G(y)=Cy+d$ be affine. Then $(G\circ F)(x)=C(A_{i}x+b_{i})+d=(CA_{i})x+(Cb_{i}+d)$ on $R_{i}$, making it a PWA map on the same regions $R_{i}$.
- Proof of 3: Let $G$ be a PWA map that equals $C_{j}y+d_{j}$ on region $S_{j}$, and let $H(x)=Ax+b$ be affine. The inverse image $T_{j}=H^{-1}(S_{j})$ is also a polyhedral region, and on each $T_{j}$, the map becomes $(C_{j}A)x+(C_{j}b+d_{j})$, thus it is a PWA map.
- Proof of 4: Let the regions for $F$ and $G$ be $R_{i}$ and $S_{j}$, respectively. Defining the region where the composite function is affine as $T_{ij}=R_{i}\cap\{x:A_{i}x+b_{i}\in S_{j}\}$, this forms a polyhedral region as the intersection of polyhedral regions. On each $T_{ij}$, it can be written as $C_{j}(A_{i}x+b_{i})+d_{j}$, meaning it is a PWA map.

### C.2 Analysis of Continuous Piecewise-Affine Properties in ReLU Networks \[[2](https://arxiv.org/html/2605.08863#bib.bib30)\]

A function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}^{m}$ is defined as continuous piecewise-affine (CPWA) if it is continuous, the input space is partitioned into a finite number of polyhedral regions $R_{1},\ldots,R_{L}$, and on each region $R_{\ell}$ it can be described as an affine map $F(x)=A_{\ell}x+b_{\ell}$ using $A_{\ell}\in\mathbb{R}^{m\times d}$ and $b_{\ell}\in\mathbb{R}^{m}$.

In a ReLU network, each region $R_{\ell}$ corresponds to a region where the activation pattern of all ReLU units is fixed. That is, within a certain region, the on/off state of each unit is invariant, and the ReLU can be viewed as a linear transformation $Du$ using a diagonal mask matrix $D=\operatorname{diag}(\mathbb{1}_{\{u>0\}})$ \[[22](https://arxiv.org/html/2605.08863#bib.bib29)\]. For example, in a two-layer ReLU network $F(x)=A_{2}\operatorname{ReLU}(A_{1}x+b_{1})+b_{2}$, since $D_{1}=\operatorname{diag}(\mathbb{1}_{\{A_{1}x+b_{1}>0\}})$ is fixed within a specific activation region,

$$F(x)=A_{2}D_{1}(A_{1}x+b_{1})+b_{2}=A_{2}D_{1}A_{1}x+(A_{2}D_{1}b_{1}+b_{2}),$$

which reduces to an affine map $A^{\prime}x+c$.

Generalizing this to an $L$-layer deep ReLU network, let the output of each layer be $h_{0}=x$, $h_{\ell}=\operatorname{ReLU}(A_{\ell}h_{\ell-1}+b_{\ell})$ for $\ell=1,\ldots,L-1$, and the final output be $F(x)=A_{L}h_{L-1}+b_{L}$. Within a region where a certain activation pattern $D=(D_{1},\ldots,D_{L-1})$ is fixed, it can be written as follows:

$$F(x)=\left(A_{L}D_{L-1}A_{L-1}\cdots D_{1}A_{1}\right)x+c(D).$$

Here, $c(D)$ is a constant term that depends only on the activation pattern and is independent of the input $x$ within the region:

$$c(D)=b_{L}+A_{L}D_{L-1}b_{L-1}+A_{L}D_{L-1}A_{L-1}D_{L-2}b_{L-2}+\cdots+A_{L}D_{L-1}\cdots A_{2}D_{1}b_{1}.$$

Thus, a ReLU network partitions the input space into numerous polyhedral regions and possesses the structure of a CPWA function that behaves locally linearly within each region.
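The local affine form can be verified on a small two-layer network. The following is a minimal sketch in plain Python, assuming hypothetical weights `A1`, `b1`, `A2`, `b2`; it extracts the mask $D_1$ at a point, builds $A_2D_1A_1$ and $A_2D_1b_1+b_2$, and compares the result with the ordinary forward pass.

```python
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(A: Matrix, x: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matmul(A: Matrix, B: Matrix) -> Matrix:
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def forward(x: Vector, A1: Matrix, b1: Vector, A2: Matrix, b2: Vector) -> Vector:
    """Two-layer ReLU network F(x) = A2 ReLU(A1 x + b1) + b2."""
    pre = [p + b for p, b in zip(matvec(A1, x), b1)]
    hidden = [max(0.0, p) for p in pre]
    return [o + b for o, b in zip(matvec(A2, hidden), b2)]


def local_affine(x: Vector, A1: Matrix, b1: Vector, A2: Matrix, b2: Vector
                 ) -> Tuple[Matrix, Vector]:
    """Return (A', c) such that F(y) = A' y + c on the activation region containing x."""
    pre = [p + b for p, b in zip(matvec(A1, x), b1)]
    mask = [1.0 if p > 0 else 0.0 for p in pre]           # diagonal of D1
    A2D = [[a * m for a, m in zip(row, mask)] for row in A2]  # A2 D1
    A_eff = matmul(A2D, A1)                                # A2 D1 A1
    c = [v + b for v, b in zip(matvec(A2D, b1), b2)]       # A2 D1 b1 + b2
    return A_eff, c


# Hypothetical weights: 2 inputs, 3 hidden units, 1 output.
A1 = [[1.0, 2.0], [-1.0, 1.0], [0.5, 0.5]]
b1 = [0.1, 0.0, -1.0]
A2 = [[1.0, -1.0, 2.0]]
b2 = [0.5]

x0 = [1.0, -0.5]
A_eff, c = local_affine(x0, A1, b1, A2, b2)
x_near = [1.01, -0.5]  # small perturbation that keeps the same activation pattern
affine_out = [v + ci for v, ci in zip(matvec(A_eff, x_near), c)]
print(forward(x_near, A1, b1, A2, b2), affine_out)
```

The affine map recovered at `x0` reproduces the network output at `x_near` only because both points share the same activation pattern; once a perturbation flips a unit, a different pair $(A^{\prime},c)$ applies.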

### C.3 Sensitivity Analysis via Path Integration of Continuous Piecewise-Affine Functions \[[27](https://arxiv.org/html/2605.08863#bib.bib31)\]

Let the function $F:\mathbb{R}^{d}\rightarrow\mathbb{R}$ be a continuous piecewise-affine (CPWA) function described as $F(x)=a_{m}^{\top}x+b_{m}$ on a finite number of polyhedral regions $R_{1},\ldots,R_{M}$. As discussed previously, deep ReLU networks also reduce to this form. In this subsection, we derive the change in output for a finite perturbation from an input $x_{0}\in\mathbb{R}^{d}$ to $x_{1}=x_{0}+\Delta x$ as the integral of local sensitivities along a path.

First, we define a linear path $\gamma(t):=x_{0}+t\Delta x$ (where $t\in[0,1]$) connecting the two points, and consider a single-variable function $g(t):=F(\gamma(t))$ along this path. Even if $F$ is a multivariable CPWA function, $g(t)$ becomes an affine map with respect to $t$ on each interval, thus forming a continuous piecewise-linear (CPWL) function on $[0,1]$. Specifically, when the path is within a region $R_{m}$,

$$g(t)=a_{m}^{\top}(x_{0}+t\Delta x)+b_{m}=(a_{m}^{\top}\Delta x)t+(a_{m}^{\top}x_{0}+b_{m}),$$

and its slope changes each time the path enters a different region. Therefore, letting the points where the path crosses region boundaries be $0=\tau_{0}<\tau_{1}<\cdots<\tau_{M}=1$, $g(t)$ can be written as $g(t)=\alpha_{m}t+\beta_{m}$ in each interval $(\tau_{m-1},\tau_{m})$. Applying the fundamental theorem of calculus in each interval and summing over all intervals yields:

$$g(1)-g(0)=\sum_{m=1}^{M}[g(\tau_{m})-g(\tau_{m-1})]=\sum_{m=1}^{M}\int_{\tau_{m-1}}^{\tau_{m}}g^{\prime}(t)\,\mathrm{d}t=\int_{0}^{1}g^{\prime}(t)\,\mathrm{d}t.$$

Since the breakpoints $\tau_{m}$ are finite in number (measure zero), they do not affect the integral value. Here, by the chain rule, at points where $F$ is differentiable along $\gamma(t)$, $g^{\prime}(t)=\nabla F(\gamma(t))^{\top}\gamma^{\prime}(t)=\nabla F(x_{0}+t\Delta x)^{\top}\Delta x$, finally leading to the following equation:

$$F(x_{1})-F(x_{0})=\int_{0}^{1}\nabla F(x_{0}+t\Delta x)^{\top}\Delta x\,\mathrm{d}t.$$

Viewing this result in the form of a finite sum gives $F(x_{1})-F(x_{0})=\sum_{m=1}^{M}(\tau_{m}-\tau_{m-1})a_{m}^{\top}\Delta x$, which explicitly shows the process where the local sensitivity $a_{m}^{\top}\Delta x$ obtained in each region accumulates according to the residence time.
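The path-integral identity can be checked numerically on a small CPWA function. The sketch below uses a hypothetical function `F` built from two ReLU terms and a midpoint rule along the straight path from `x0` to `x1`; neither the function nor the quadrature rule is taken from the paper.

```python
from typing import List

Vector = List[float]


def F(x: Vector) -> float:
    """Hypothetical CPWA function: ReLU(x0 + x1) - 2 * ReLU(x0 - x1)."""
    return max(0.0, x[0] + x[1]) - 2.0 * max(0.0, x[0] - x[1])


def grad_F(x: Vector) -> Vector:
    """Gradient of F wherever it is differentiable (away from the two kinks)."""
    g = [0.0, 0.0]
    if x[0] + x[1] > 0:
        g[0] += 1.0
        g[1] += 1.0
    if x[0] - x[1] > 0:
        g[0] -= 2.0
        g[1] += 2.0
    return g


def path_integral(x0: Vector, x1: Vector, n_steps: int = 1000) -> float:
    """Midpoint-rule approximation of the integral of grad F(x0 + t dx)^T dx over [0, 1]."""
    dx = [b - a for a, b in zip(x0, x1)]
    total = 0.0
    for k in range(n_steps):
        t = (k + 0.5) / n_steps
        x = [a + t * d for a, d in zip(x0, dx)]
        g = grad_F(x)
        total += sum(gi * di for gi, di in zip(g, dx)) / n_steps
    return total


x0, x1 = [-1.0, 0.3], [2.0, -0.5]
print(F(x1) - F(x0), path_integral(x0, x1))
```

The path crosses both kinks of `F`, so the two printed values agree only up to a discretization error of order $1/N$ that comes from the cells containing a breakpoint.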

### C.4 Application to HaMI: Semantic Scaling and Integral Sensitivity Analysis

In this section, we apply the theoretical foundations established so far to the semantic scaling of HaMI at inference time. First, we show that the bag-level logit score $Z_{\mathbf{B}}(p)=\operatorname{TopK\text{-}Mean}_{i}f((1+p)h_{\mathbf{B},i})$ obtained when scaling the hidden state as $h(p)=(1+p)h$ preserves the CPWA property.

To this end, we prove that the aggregation operation $\operatorname{TopKSum}$ is a CPWA map. For two continuous functions $f$ and $g$, their maximum can be written as $\max\{f(x),g(x)\}=\frac{1}{2}(f(x)+g(x)+|f(x)-g(x)|)$. Because the sum, difference, absolute value, and composition of continuous functions are continuous, the maximum of a finite number of continuous functions is also continuous. The sum of the top $k$ elements for a vector $u\in\mathbb{R}^{T}$ can be defined using the linear form $a_{S}^{\top}u$ for the index set $S\subseteq\{1,\ldots,T\}$ with $|S|=k$, where $a_{S}\in\{0,1\}^{T}$ is the indicator vector of $S$, as:

$$\operatorname{TopKSum}_{k}(u)=\max_{S:|S|=k}a_{S}^{\top}u.$$

This is an operation that takes the maximum of a finite number of linear functions (max-affine); it is affine on each region $R_{S}=\{u:a_{S}^{\top}u\geq a_{S^{\prime}}^{\top}u,\ \forall S^{\prime}\}$ and continuous overall. Therefore, $\operatorname{TopKSum}$ and its scalar multiple, $\operatorname{TopK\text{-}Mean}$, are CPWA maps.
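The max-affine characterization of the top-$k$ sum can be checked by brute force on a small vector. This is a minimal sketch with hypothetical function names; enumerating all subsets is exponential in general and is used here only to make the definition concrete.

```python
from itertools import combinations
from typing import List


def topk_sum_sorted(u: List[float], k: int) -> float:
    """Usual implementation: sort in descending order and add the k largest entries."""
    return sum(sorted(u, reverse=True)[:k])


def topk_sum_max_affine(u: List[float], k: int) -> float:
    """Max over all index sets S with |S| = k of the linear form a_S^T u."""
    return max(sum(u[i] for i in S) for S in combinations(range(len(u)), k))


u = [0.3, -1.2, 2.5, 0.9, 1.1]
print(topk_sum_sorted(u, 2), topk_sum_max_affine(u, 2))
```

Both functions return the sum of the two largest entries, 2.5 and 1.1, which illustrates that the top-$k$ sum is the pointwise maximum of finitely many linear functions of `u`.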

From the above discussion, even if the model $f$ contains ReLU, Batch Normalization, and various pooling operations, their composition $Z_{\mathbf{B}}(p)$ becomes a single-variable continuous piecewise-linear (CPWL) function with respect to $p$. Based on this property, applying the sensitivity analysis from the previous section allows the change in output to be described as follows:

$$\begin{aligned}
Z_{\mathbf{B}}(p_{\mathbf{B}})-Z_{\mathbf{B}}(0)&=\int_{0}^{p_{\mathbf{B}}}Z_{\mathbf{B}}^{\prime}(p)\,\mathrm{d}p,\\
Z_{\mathbf{B}}(p_{\mathbf{B}})&=Z_{\mathbf{B}}(0)-p_{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}(p_{\mathbf{B}}).
\end{aligned}$$

Here, $\bar{C}_{\mathbf{B}}^{\mathrm{int}}(p_{\mathbf{B}}):=-\frac{1}{p_{\mathbf{B}}}\int_{0}^{p_{\mathbf{B}}}Z_{\mathbf{B}}^{\prime}(p)\,\mathrm{d}p$ is the average cost term calculated across changes in the active set. Furthermore, by setting $p_{\mathbf{B}}=\lambda P_{\mathrm{sem}}^{\mathbf{B}}$, the general formula for HaMI is obtained:

$$Z_{\mathbf{B}}(\lambda P_{\mathrm{sem}}^{\mathbf{B}})=Z_{\mathbf{B}}(0)-\lambda P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}(\lambda P_{\mathrm{sem}}^{\mathbf{B}}).$$

This formulation maintains mathematical consistency even under dynamic situations where the activation pattern changes.

#### Numerical Approximation:

In our empirical evaluations, we set $\lambda$ to $1$ and approximate the integral by discretizing the interval $p\in[0,p_{\mathbf{B}}]$ into $N=1000$ uniform steps. For each step $k$, we compute the local gradient $Z_{\mathbf{B}}^{\prime}(p_{k})$. This high-resolution discretization provides an accurate numerical approximation of the integrated sensitivity, even when the active set changes frequently.
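The discretization can be sketched as follows. This is an illustrative stand-in rather than the authors' implementation: the token scorer `score` is a hypothetical one-hidden-layer ReLU network with made-up weights, the local gradient is estimated by a central finite difference (an assumption; an autograd framework would be the more usual choice), and the step midpoints are used as evaluation points.

```python
from typing import List

Vector = List[float]


def score(h: Vector, p: float) -> float:
    """Hypothetical token scorer f((1 + p) h): one hidden ReLU layer with fixed weights."""
    A = [[1.0, -0.5], [-0.8, 0.6]]
    b = [-0.2, 0.1]
    v = [0.7, -1.3]
    x = [(1.0 + p) * hk for hk in h]
    hidden = [max(0.0, sum(a * xk for a, xk in zip(row, x)) + bj)
              for row, bj in zip(A, b)]
    return sum(vj * hj for vj, hj in zip(v, hidden))


def bag_logit(bag: List[Vector], p: float, k: int = 3) -> float:
    """Z_B(p): mean of the top-k token scores after scaling hidden states by (1 + p)."""
    scores = sorted((score(h, p) for h in bag), reverse=True)
    return sum(scores[:k]) / k


def integrated_cost(bag: List[Vector], p_B: float, n_steps: int = 1000,
                    eps: float = 1e-6) -> float:
    """Approximate C_int = -(1/p_B) * integral of Z'(p) over [0, p_B] with n_steps midpoints."""
    total = 0.0
    for s in range(n_steps):
        p = (s + 0.5) * p_B / n_steps
        # Central finite difference as a stand-in for the local gradient Z'(p_k).
        total += (bag_logit(bag, p + eps) - bag_logit(bag, p - eps)) / (2.0 * eps)
    return -total / n_steps


# Hypothetical bag of four 2-d hidden states and a semantic probability of 0.8.
bag = [[0.5, 0.2], [-0.3, 0.9], [1.0, -1.0], [0.15, 0.0]]
p_B = 0.8
C_int = integrated_cost(bag, p_B)
print(bag_logit(bag, p_B), bag_logit(bag, 0.0) - p_B * C_int)
```

In this toy bag one hidden unit switches on partway through the interval, so the slope of $Z_{\mathbf{B}}(p)$ changes; the two printed values still agree closely, which is the identity $Z_{\mathbf{B}}(p_{\mathbf{B}})=Z_{\mathbf{B}}(0)-p_{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}(p_{\mathbf{B}})$ under a changing active set.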

### C.5 Distribution of Integrated Sensitivity $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$

We analyzed the sensitivity distributions in Figures [3](https://arxiv.org/html/2605.08863#A3.F3)–[4](https://arxiv.org/html/2605.08863#A3.F4). All values were calculated with $\lambda=1$ and using the integrated cost $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ via numerical approximation with $N=1000$ steps. Across all datasets and models, negative bags exhibit a clear rightward shift, indicating that the model learns sensitivity to the negative subspace. This bias leads semantic scaling to prioritize negative instances, consistently across training and test splits. Expected values are provided in Section [C.7](https://arxiv.org/html/2605.08863#A3.SS7).

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/trivia_qa/train/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/trivia_qa/train/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/trivia_qa/train/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/squad/train/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/squad/train/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/squad/train/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/nq/train/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/nq/train/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/nq/train/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/bioasq/train/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/bioasq/train/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/bioasq/train/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 3: Empirical distribution of $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ in train data (Non-invariant case).

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/trivia_qa/validation/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/trivia_qa/validation/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/trivia_qa/validation/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/squad/validation/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/squad/validation/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/squad/validation/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/nq/validation/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/nq/validation/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/nq/validation/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/bioasq/validation/layer_07/gradients_layer_07_neg_dzdp_pavg_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/bioasq/validation/layer_05/gradients_layer_05_neg_dzdp_pavg_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/bioasq/validation/layer_06/gradients_layer_06_neg_dzdp_pavg_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 4: Empirical distribution of $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ in test data (Non-invariant case).

### C.6 Distribution of $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}$

We computed the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ for each bag (Figures [5](https://arxiv.org/html/2605.08863#A3.F5)–[6](https://arxiv.org/html/2605.08863#A3.F6)). All values were calculated with $\lambda=1$ and using the integrated cost $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ via numerical approximation with $N=1000$ steps.

Although individual bag distributions exhibit some overlap, the negative class demonstrates a higher density at larger values. The class-wise expectations satisfy $\mathbb{E}_{\mathrm{neg}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}]>\mathbb{E}_{\mathrm{pos}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}]$ across all evaluated datasets and splits. This is consistent with the observation that both $P_{\mathrm{sem}}^{\mathbf{B}}$ and $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ are larger for negative bags, leading to a higher joint expectation. The expected values for each condition are provided in Section [C.7](https://arxiv.org/html/2605.08863#A3.SS7).

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/trivia_qa/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/trivia_qa/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/trivia_qa/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/squad/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/squad/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/squad/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/nq/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/nq/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/nq/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/bioasq/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/bioasq/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/bioasq/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 5: Distribution of the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ in the training data.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/trivia_qa/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/trivia_qa/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/trivia_qa/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/squad/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/squad/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/squad/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/nq/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/nq/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/nq/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_8b/bioasq/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_pavg_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/mistral_12b/bioasq/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_pavg_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradient_non_invariant/llama3_70b/bioasq/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_pavg_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 6: Distribution of the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ in the test data.

### C.7 Summary of Expected Values via Path Integration

We summarize the expected values for the positive ($\mathbb{E}_{\mathrm{pos}}$) and negative ($\mathbb{E}_{\mathrm{neg}}$) classes. All sensitivity values were calculated with $\lambda=1$, using the integrated cost $\bar{C}_{\mathbf{B}}^{\mathrm{int}}$ obtained via numerical approximation with $N=1000$ steps. Crucially, the results show that even under these non-invariant conditions, the empirical ratios consistently satisfy the theoretical requirement for margin enlargement derived in Equation ([4](https://arxiv.org/html/2605.08863#S3.E4)).

Notably, for the joint product $\mathbb{E}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}^{\mathrm{int}}]$, the threshold $1/\gamma$ consistently falls within the range of approximately $0.37$ to $0.52$ across all evaluated models. This sizable quantitative gap supports the robustness of the margin-widening effect. Since the expected penalty incurred by positive instances is roughly half or less of the expected gain for negative instances, the overall expected margin expands as long as the proportion of negative samples exceeds this threshold.

Comparing these values with the empirical label distributions in Table [19](https://arxiv.org/html/2605.08863#A6.T19), it is evident that the condition in Equation ([4](https://arxiv.org/html/2605.08863#S3.E4)) is strictly satisfied across all datasets, including those with a higher proportion of positive samples such as SQuAD and NQ. This empirical evidence demonstrates that HaMI effectively suppresses negative instances while safely preserving the signal of positive instances, thereby establishing a more robust decision boundary for hallucination detection.
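As a rough illustration of how such a comparison could be carried out, the following Python sketch computes class-wise expectations of the joint product from per-bag values and compares the resulting threshold with the class ratio. It assumes, based on the surrounding text rather than on Equation (4) itself, that $\gamma$ is the ratio $\mathbb{E}_{\mathrm{neg}}/\mathbb{E}_{\mathrm{pos}}$ and that the condition compares $1/\gamma$ with $|\mathcal{S}_{neg}|/|\mathcal{S}_{pos}|$. The function name, the label convention (+1 for positive, -1 for negative bags), and the toy numbers are hypothetical and are not results from the paper.

```python
import numpy as np

def margin_enlargement_check(joint_product, labels):
    """Compare the threshold 1/gamma with the class ratio |S_neg| / |S_pos|.

    joint_product: per-bag values of P_sem * C_int (hypothetical inputs).
    labels: +1 for positive bags, -1 for negative bags (assumed convention).
    """
    joint_product = np.asarray(joint_product, dtype=float)
    labels = np.asarray(labels)
    pos = joint_product[labels == 1]
    neg = joint_product[labels == -1]

    e_pos = float(pos.mean())            # class-wise expectation, positive bags
    e_neg = float(neg.mean())            # class-wise expectation, negative bags
    threshold = e_pos / e_neg            # assumed to correspond to 1/gamma
    class_ratio = len(neg) / len(pos)    # |S_neg| / |S_pos|
    return {
        "e_pos": e_pos,
        "e_neg": e_neg,
        "threshold": threshold,
        "class_ratio": class_ratio,
        "satisfied": bool(class_ratio > threshold),
    }

# Toy values only; these are not measurements from the paper.
toy_values = [0.2, 0.3, 0.25, 0.6, 0.5, 0.7, 0.6]
toy_labels = [1, 1, 1, -1, -1, -1, -1]
result = margin_enlargement_check(toy_values, toy_labels)
print(result)
```

Under these assumptions, a dataset with relatively few negative bags can still satisfy the condition, provided its class ratio stays above the threshold.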

Table 6: Empirical expected values for LLaMA-3.1-8B (Train) under non-invariant conditions.

Table 7: Empirical expected values for LLaMA-3.1-8B (Test) under non-invariant conditions.

Table 8: Empirical expected values for Mistral-12B (Train) under non-invariant conditions.

Table 9: Empirical expected values for Mistral-12B (Test) under non-invariant conditions.

Table 10: Empirical expected values for LLaMA-3.3-70B (Train) under non-invariant conditions.

Table 11: Empirical expected values for LLaMA-3.3-70B (Test) under non-invariant conditions.

## Appendix D Bag-wise Margin Expansion under $\beta$-Smoothness

In the proof of Theorem [2](https://arxiv.org/html/2605.08863#Thmtheorem2), the expected margin increase relies on a first-order Taylor approximation. However, this approximation is only local in the learning rate $\eta$. The second-order remainder term $O(\eta^{2})$ may dominate the positive first-order term, potentially preventing us from guaranteeing a margin increase. In this section, we introduce a $\beta$-smoothness assumption to explicitly bound this remainder, yielding an upper bound on $\eta$ that ensures a positive expected margin increase.

###### Assumption 2 ($\beta$-Smoothness of the Margin).

For any sampled bag $\mathbf{B}$, the margin function $m_{\mathbf{B}}(\theta)$ is $\beta$-smooth with respect to the parameters $\theta$. That is, its gradient is Lipschitz continuous with a constant $\beta>0$, implying that for any parameter $\theta$ and update step $\Delta\theta$:

$$
m_{\mathbf{B}}(\theta+\Delta\theta) \geq m_{\mathbf{B}}(\theta) + \nabla_{\theta} m_{\mathbf{B}}(\theta)^{\top}\Delta\theta - \frac{\beta}{2}\|\Delta\theta\|^{2}.
$$

Under the online SGD update rule with logistic loss, the parameter step induced by a sampled bag $\mathbf{B}$ is $\Delta\theta = -\eta\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)$. Letting $\theta_{\eta} := \theta + \Delta\theta$ denote the updated parameter, substituting this into the $\beta$-smoothness inequality yields:

$$
m_{\mathbf{B}}(\theta_{\eta}) - m_{\mathbf{B}}(\theta) \geq -\eta\,\nabla_{\theta} m_{\mathbf{B}}(\theta)^{\top}\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta) - \frac{\beta\eta^{2}}{2}\|\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)\|^{2}.
$$

Recall from Appendix [B.3](https://arxiv.org/html/2605.08863#A2.SS3) that $\nabla_{\theta} m_{\mathbf{B}}(\theta) = y_{\mathbf{B}}\nabla_{\theta} z_{\mathbf{B}}(\theta)$ and $\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta) = -\frac{y_{\mathbf{B}}}{1+\exp(m_{\mathbf{B}}(\theta))}\nabla_{\theta} z_{\mathbf{B}}(\theta)$. Since $y_{\mathbf{B}} \in \{-1, 1\}$, the squared norm of the loss gradient simplifies to:

$$
\|\nabla_{\theta}\mathcal{L}_{\mathbf{B}}(\theta)\|^{2} = \frac{\|\nabla_{\theta} z_{\mathbf{B}}(\theta)\|^{2}}{(1+\exp(m_{\mathbf{B}}(\theta)))^{2}}.
$$

Substituting the gradient terms back into the inequality gives:

$$
\begin{aligned}
m_{\mathbf{B}}(\theta_{\eta}) - m_{\mathbf{B}}(\theta)
&\geq \eta\,\frac{\|\nabla_{\theta} z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))} - \frac{\beta\eta^{2}}{2}\,\frac{\|\nabla_{\theta} z_{\mathbf{B}}(\theta)\|^{2}}{(1+\exp(m_{\mathbf{B}}(\theta)))^{2}} \\
&= \eta\,\frac{\|\nabla_{\theta} z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))}\left(1 - \frac{\beta\eta}{2(1+\exp(m_{\mathbf{B}}(\theta)))}\right).
\end{aligned}
$$

Taking the expectation over the data distribution $\mathbf{B}\sim\mathcal{D}$, the expected margin change from a single-bag update is bounded by:

$$
\mathbb{E}_{\mathbf{B}}[m_{\mathbf{B}}(\theta_{\eta}) - m_{\mathbf{B}}(\theta)] \geq \mathbb{E}_{\mathbf{B}}\left[\eta\,\frac{\|\nabla_{\theta} z_{\mathbf{B}}(\theta)\|^{2}}{1+\exp(m_{\mathbf{B}}(\theta))}\left(1 - \frac{\beta\eta}{2(1+\exp(m_{\mathbf{B}}(\theta)))}\right)\right].
$$

To guarantee a positive margin expansion ($\mathbb{E}_{\mathbf{B}}[m_{\mathbf{B}}(\theta_{\eta}) - m_{\mathbf{B}}(\theta)] > 0$) whenever the gradient is non-zero, it suffices that the multiplicative factor inside the parentheses be positive for every bag. This holds when:

$$
\eta < \frac{2(1+\exp(m_{\mathbf{B}}(\theta)))}{\beta}.
$$

Since $\exp(m_{\mathbf{B}}(\theta)) > 0$, we have $1+\exp(m_{\mathbf{B}}(\theta)) > 1$. Therefore, a sufficient condition for the learning rate to guarantee margin expansion, independent of the current margin value, is:

$$
\eta \leq \frac{2}{\beta}.
$$

This result confirms that under a bounded step size determined by the smoothness of the network, the first-order gradient dynamics faithfully drive the bag-wise margin expansion.
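The step-size condition can be checked numerically. The sketch below evaluates the per-bag lower bound $\eta\,\frac{\|\nabla_{\theta} z_{\mathbf{B}}\|^{2}}{1+\exp(m_{\mathbf{B}})}\bigl(1 - \frac{\beta\eta}{2(1+\exp(m_{\mathbf{B}}))}\bigr)$ over a range of margins. The function name and the values chosen for $\beta$, the squared gradient norm, and the margin grid are hypothetical and serve only to illustrate the algebra.

```python
import numpy as np

def margin_lower_bound(eta, beta, grad_z_sq, margin):
    """Per-bag lower bound on m_B(theta_eta) - m_B(theta) under beta-smoothness."""
    s = 1.0 + np.exp(margin)  # 1 + exp(m_B(theta)), always greater than 1
    return eta * grad_z_sq / s * (1.0 - beta * eta / (2.0 * s))

beta = 4.0                    # hypothetical smoothness constant
eta_safe = 2.0 / beta         # sufficient step size eta <= 2 / beta
margins = np.linspace(-10.0, 10.0, 41)

# With eta = 2 / beta the bound stays positive for every margin on the grid.
safe_bounds = margin_lower_bound(eta_safe, beta, grad_z_sq=1.0, margin=margins)
all_positive = bool(np.all(safe_bounds > 0))
print(all_positive)

# At m = 0 the margin-dependent limit is 2 (1 + exp(0)) / beta = 1.0,
# so eta = 1.5 violates it and the bound becomes negative.
too_large = float(margin_lower_bound(1.5, beta, grad_z_sq=1.0, margin=0.0))
print(too_large)
```

A negative bound does not by itself imply that the margin shrinks; it only means the inequality no longer guarantees an increase.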

## Appendix E Detailed Experimental Results

### E.1 The Standard Deviations

The standard deviations in our experiments are summarized in Table [12](https://arxiv.org/html/2605.08863#A5.T12) and Table [13](https://arxiv.org/html/2605.08863#A5.T13). In token-level contrastive approaches such as HaMI, we observe that noise can propagate and lead to variability in the final AUC. In contrast, incorporating semantic consistency for weighting improves stability by leveraging information from the entire sentence. Furthermore, methods based on pooling sentence-level representations exhibit highly stable training behavior.

Table 12: Comparison of AUC standard deviations across different methods.

Table 13: Comparison of margin standard deviations across different methods.

### E.2 Extended Comparisons of Pooling Strategies

We also evaluated attention pooling and gated attention pooling using the Attention-based Deep Multiple Instance Learning framework [[12](https://arxiv.org/html/2605.08863#bib.bib3)]. The bag score is computed as

$$
S(\mathbf{B}) = g\left(\sum_{i=1}^{T_{\mathbf{B}}} a_{\mathbf{B},i}\,\mathbf{h}_{\mathbf{B},i}\right).
$$

Here, the attention weight $a_{\mathbf{B},i}$ is defined as:

$$
a_{\mathbf{B},i} = \frac{\exp\left(w^{\top}\tanh(\mathbf{V}\mathbf{h}_{\mathbf{B},i})\right)}{\sum_{j=1}^{T_{\mathbf{B}}}\exp\left(w^{\top}\tanh(\mathbf{V}\mathbf{h}_{\mathbf{B},j})\right)}.
$$

The gated attention weight $a_{\mathbf{B},i}$ is defined as:

$$
a_{\mathbf{B},i} = \frac{\exp\left(w^{\top}\bigl(\tanh(\mathbf{V}\mathbf{h}_{\mathbf{B},i})\odot\sigma(\mathbf{U}\mathbf{h}_{\mathbf{B},i})\bigr)\right)}{\sum_{j=1}^{T_{\mathbf{B}}}\exp\left(w^{\top}\bigl(\tanh(\mathbf{V}\mathbf{h}_{\mathbf{B},j})\odot\sigma(\mathbf{U}\mathbf{h}_{\mathbf{B},j})\bigr)\right)},
$$

where $\mathbf{V}\in\mathbb{R}^{L\times d}$ and $\mathbf{U}\in\mathbb{R}^{L\times d}$ are learnable weight matrices, $w\in\mathbb{R}^{L\times 1}$ is a projection vector, $\odot$ denotes element-wise multiplication, and $\sigma$ is the sigmoid function. In our experiments, the hidden dimension of the attention mechanism was set to $L=256$.
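The pooling operators compared in this section can be sketched in a few lines of NumPy. The snippet below is an illustrative sketch rather than the implementation used in the experiments: it computes the attention and gated attention weights exactly as written in the formulas, alongside max and mean pooling for comparison. The scoring function $g$ is omitted, and the function names, the toy dimensions, and the random initialization are assumptions.

```python
import numpy as np

def softmax(x):
    x = x - np.max(x)                 # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def attention_pool(H, V, w):
    """H: (T, d) token features, V: (L, d), w: (L,). Returns (pooled, weights)."""
    scores = np.tanh(H @ V.T) @ w     # w^T tanh(V h_i) for each token, shape (T,)
    a = softmax(scores)
    return a @ H, a                   # weighted sum of token features, shape (d,)

def gated_attention_pool(H, V, U, w):
    """Same as attention_pool, with an element-wise sigmoid gate from U."""
    gated = np.tanh(H @ V.T) * sigmoid(H @ U.T)   # shape (T, L)
    a = softmax(gated @ w)
    return a @ H, a

def max_pool(H):
    return H.max(axis=0)              # feature-wise maximum over tokens

def mean_pool(H):
    return H.mean(axis=0)

rng = np.random.default_rng(0)
T, d, L = 12, 64, 256                 # toy sizes; L = 256 follows the text
H = rng.normal(size=(T, d))
V = rng.normal(scale=0.05, size=(L, d))
U = rng.normal(scale=0.05, size=(L, d))
w = rng.normal(scale=0.05, size=L)

pooled_att, a_att = attention_pool(H, V, w)
pooled_gated, a_gated = gated_attention_pool(H, V, U, w)
print(pooled_att.shape, pooled_gated.shape, max_pool(H).shape, mean_pool(H).shape)
```

Max and mean pooling require no parameters and no per-token matrix products with $\mathbf{V}$ or $\mathbf{U}$, which is one plausible source of the throughput differences reported in this section.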

In terms of detection performance, as shown in Table [14](https://arxiv.org/html/2605.08863#A5.T14), both max-pooling and attention-based methods achieve strong results. However, regarding computational efficiency, Table [15](https://arxiv.org/html/2605.08863#A5.T15) reveals a substantial gap between simple pooling and attention-based mechanisms.

Specifically, max pooling and mean pooling are significantly faster than attention-based methods, demonstrating a clear advantage in inference speed. For instance, on the LLaMA-3.3-70B model, max pooling achieves a throughput approximately $1.7\times$ to $1.8\times$ higher than Gated Attention Pooling. On the Mistral-Nemo-Instruct (12B) model across the NQ and BioASQ datasets, max pooling maintains a processing speed that is roughly $60$–$65\%$ faster than Gated Attention.

While Attention Pooling shows competitive AUC scores, the computational overhead is non-negligible, especially in high-throughput scenarios. Overall, max pooling achieves the best balance, delivering top performance in terms of both AUC and inference speed, often providing the highest throughput among all methods.

Table 14: Performance (AUC) comparison between different pooling methods. The best results are highlighted in red, and the second-best results are shown in blue. Results are averaged over five runs.

Table 15: Throughput (samples/sec) comparison across models and datasets. Best results are highlighted in red, and second-best in blue.

### E.3 Impact of Feature Dimension $D$

In this section, we evaluate how the feature dimension $D$ of the extraction layer affects the model's performance across different datasets. Tables [16](https://arxiv.org/html/2605.08863#A5.T16), [17](https://arxiv.org/html/2605.08863#A5.T17), and [18](https://arxiv.org/html/2605.08863#A5.T18) summarize the AUC scores for LLaMA-3.1-8B, Mistral-Nemo-12B, and LLaMA-3.3-70B, respectively.

The experimental results consistently demonstrate that increasing the dimension $D$ leads to a significant improvement in AUC compared to the instance-based approach ($D=1$). While $D=1$ represents a collapsed representation of the input, higher values of $D$ allow the model to preserve and use more granular token-level features. This suggests that a higher-dimensional feature space enables the extraction layer to learn more expressive sentence representations by effectively capturing the nuances of the token-wise distribution.

Furthermore, we observe a trend of diminishing returns: performance gains are most substantial when increasing $D$ from 1 to 16, after which the AUC scores begin to plateau. Across all three models, saturation typically occurs around $D=128$ or $D=256$. This indicates that while capturing token-level information is crucial for robust performance, a relatively low-dimensional projection is sufficient to encode the necessary characteristics for the task.
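To make the role of $D$ concrete, the following sketch shows one way a max-pooling scorer with a $D$-dimensional extraction layer could be wired up. It is not the architecture used in the experiments: it assumes a linear extraction layer, feature-wise max pooling over tokens, and a one-hidden-layer ReLU MLP, all with randomly initialized (untrained) weights. The function names and sizes are hypothetical.

```python
import numpy as np

def init_params(d, D, hidden, rng):
    """Randomly initialized weights for an assumed extraction layer and MLP."""
    return {
        "W_e": rng.normal(scale=0.02, size=(d, D)),       # extraction layer
        "b_e": np.zeros(D),
        "W_1": rng.normal(scale=0.02, size=(D, hidden)),  # MLP hidden layer
        "b_1": np.zeros(hidden),
        "w_2": rng.normal(scale=0.02, size=hidden),       # MLP output layer
        "b_2": 0.0,
    }

def sentence_score(H, p):
    """H: (T, d) token hidden states. Returns a scalar sentence-level score."""
    feats = H @ p["W_e"] + p["b_e"]          # (T, D) token-level features
    pooled = feats.max(axis=0)               # (D,) max pooling over tokens
    hidden = np.maximum(0.0, pooled @ p["W_1"] + p["b_1"])  # ReLU
    return float(hidden @ p["w_2"] + p["b_2"])

rng = np.random.default_rng(0)
T, d = 10, 512                               # toy sequence length and width
H = rng.normal(size=(T, d))

scores = {}
for D in (1, 16, 128):
    params = init_params(d, D, hidden=32, rng=rng)
    scores[D] = sentence_score(H, params)
print(scores)
```

In this sketch, $D=1$ reduces each token to a single scalar before pooling, so the maximum selects one token per sentence, whereas for larger $D$ each feature can take its maximum at a different token.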

Table 16: Impact of feature dimension ($D$) on AUC for LLaMA-3.1-8B (max pooling).

Table 17: Impact of feature dimension ($D$) on AUC for Mistral-Nemo-Instruct (12B) (max pooling).

Table 18: Impact of feature dimension ($D$) on AUC for LLaMA-3.3-Instruct-70B (max pooling).

## Appendix F Data Distribution

### F.1 Label Distribution

Table [19](https://arxiv.org/html/2605.08863#A6.T19) presents the distribution of positive and negative samples in the training and test sets across all datasets. The labels were assigned based on responses from the Llama 3.1-8B model and subsequently annotated using GPT-5 mini.

TriviaQA contains a disproportionately large number of negative samples, which may indicate that hallucination detection is relatively easier on this dataset\. In contrast, SQuAD and NQ exhibit a higher proportion of positive samples, which may make them more challenging for hallucination detection\. We conducted all experiments under these distributions\.

Table 19: Number of positive and negative samples and the class ratio $|\mathcal{S}_{neg}|/|\mathcal{S}_{pos}|$.

### F.2 Token Length Distribution

Figure [7](https://arxiv.org/html/2605.08863#A6.F7) illustrates the token length distributions, where the horizontal axis represents the number of tokens and the vertical axis represents the sample count. Since the model was specifically prompted to generate concise responses, the distributions are heavily concentrated within the range of fewer than $20$ tokens. These observations confirm that the experiments were conducted using a dataset characterized by short-form text generation.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/trivia_qa/train/a.png)\(a\)TriviaQA
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/squad/train/a.png)\(b\)SQuAD
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/nq/train/a.png)\(c\)NQ
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/bioasq/train/a.png)\(d\)BioASQ
\(1\) Train Split distributions across datasets

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/trivia_qa/validation/a.png)\(e\)TriviaQA
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/squad/validation/a.png)\(f\)SQuAD
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/nq/validation/a.png)\(g\)NQ
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/token_outputs/bioasq/validation/a.png)\(h\)BioASQ
\(2\) Test Split distributions across datasets

Figure 7:Detailed comparison of token length distributions for Training and Evaluation sets\.

## Appendix G Empirical Observations in HaMI

### G.1 Distribution of Semantic Probability

Figures [8](https://arxiv.org/html/2605.08863#A7.F8) (train) and [9](https://arxiv.org/html/2605.08863#A7.F9) (test) show the empirical distribution of semantic probabilities $P_{\mathrm{sem}}^{\mathbf{B}}$. Across all datasets and models, negative bags exhibit a higher density in high-probability regions than positive bags, confirming the relationship $\mathbb{E}_{\mathrm{neg}}[P_{\mathrm{sem}}^{\mathbf{B}}] > \mathbb{E}_{\mathrm{pos}}[P_{\mathrm{sem}}^{\mathbf{B}}]$. The expected values for each condition are provided in Appendix [G.4](https://arxiv.org/html/2605.08863#A7.SS4). The AUC performance is summarized in Table [20](https://arxiv.org/html/2605.08863#A7.T20); $P_{\mathrm{sem}}^{\mathbf{B}}$ consistently achieves high discriminative power.
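One way to quantify how well a scalar such as $P_{\mathrm{sem}}^{\mathbf{B}}$ separates the two classes is the rank-based form of AUC: the probability that a randomly drawn negative bag has a higher value than a randomly drawn positive bag, with ties counted as one half. The sketch below implements this definition in NumPy. Whether Table 20 was computed in exactly this way is an assumption, and the toy values are hypothetical.

```python
import numpy as np

def pairwise_auc(higher_class_scores, lower_class_scores):
    """Fraction of cross-class pairs ranked correctly (ties count as 0.5)."""
    a = np.asarray(higher_class_scores, dtype=float)[:, None]  # shape (n, 1)
    b = np.asarray(lower_class_scores, dtype=float)[None, :]   # shape (1, m)
    return float(np.mean((a > b) + 0.5 * (a == b)))

# Toy semantic probabilities; these are not values from the paper.
p_sem_neg = [0.9, 0.8, 0.95, 0.6]   # negative bags, expected to score higher
p_sem_pos = [0.3, 0.7, 0.5]         # positive bags
auc = pairwise_auc(p_sem_neg, p_sem_pos)
print(auc)
```

A value of 0.5 corresponds to fully overlapping distributions, and a value of 1.0 to perfect separation.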

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/train/layer_07/gradients_layer_07_prob_distribution.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/train/layer_05/gradients_layer_05_prob_distribution.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/train/layer_06/gradients_layer_06_prob_distribution.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/train/layer_07/gradients_layer_07_prob_distribution.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/train/layer_05/gradients_layer_05_prob_distribution.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/train/layer_06/gradients_layer_06_prob_distribution.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/train/layer_07/gradients_layer_07_prob_distribution.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/train/layer_05/gradients_layer_05_prob_distribution.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/train/layer_06/gradients_layer_06_prob_distribution.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/train/layer_07/gradients_layer_07_prob_distribution.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/train/layer_05/gradients_layer_05_prob_distribution.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/train/layer_06/gradients_layer_06_prob_distribution.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 8: Empirical distribution of semantic probabilities $P_{\mathrm{sem}}^{\mathbf{B}}$ in the training data.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/validation/layer_07/gradients_layer_07_prob_distribution.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/validation/layer_05/gradients_layer_05_prob_distribution.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/validation/layer_06/gradients_layer_06_prob_distribution.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/validation/layer_07/gradients_layer_07_prob_distribution.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/validation/layer_05/gradients_layer_05_prob_distribution.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/validation/layer_06/gradients_layer_06_prob_distribution.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/validation/layer_07/gradients_layer_07_prob_distribution.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/validation/layer_05/gradients_layer_05_prob_distribution.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/validation/layer_06/gradients_layer_06_prob_distribution.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/validation/layer_07/gradients_layer_07_prob_distribution.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/validation/layer_05/gradients_layer_05_prob_distribution.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/validation/layer_06/gradients_layer_06_prob_distribution.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 9: Empirical distribution of semantic probabilities $P_{\mathrm{sem}}^{\mathbf{B}}$ in the test data.

Table 20: AUC scores for Semantic Probability as a discriminator across datasets and models.

### G.2 Distribution of $\bar{C}_{\mathbf{B}}$

To validate $\mathbb{E}_{\mathrm{neg}}[\bar{C}_{\mathbf{B}}] > \mathbb{E}_{\mathrm{pos}}[\bar{C}_{\mathbf{B}}]$, we analyzed the sensitivity distributions in Figures [10](https://arxiv.org/html/2605.08863#A7.F10)–[11](https://arxiv.org/html/2605.08863#A7.F11) and Appendix [G.4](https://arxiv.org/html/2605.08863#A7.SS4). Across all datasets, negative bags exhibit a distinct rightward shift, confirming that the network actively acquires an inherent sensitivity to the negative subspace during optimization to enhance discriminative performance. This robust, learned bias ensures that semantic scaling naturally prioritizes negative instances. Consistency across data splits further substantiates that this disparity is a fundamental property attained through learning.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/train/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/train/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/train/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/train/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/train/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/train/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/train/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/train/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/train/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/train/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/train/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/train/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 10: Empirical distribution of the bag-level sensitivity to input scaling $\bar{C}_{\mathbf{B}}$ in the training data.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/validation/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/validation/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/validation/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/validation/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/validation/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/validation/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/validation/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/validation/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/validation/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/validation/layer_07/gradients_layer_07_neg_dzdp_topk_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/validation/layer_05/gradients_layer_05_neg_dzdp_topk_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/validation/layer_06/gradients_layer_06_neg_dzdp_topk_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 11: Empirical distribution of the bag-level sensitivity to input scaling $\bar{C}_{\mathbf{B}}$ in the test data.

### G.3 Distribution of $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$

We computed the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$ for each bag. Figures [12](https://arxiv.org/html/2605.08863#A7.F12) and [13](https://arxiv.org/html/2605.08863#A7.F13) illustrate the empirical distributions of this product for the training and test sets, respectively. Although the distributions of individual bags naturally exhibit some overlap, the negative class consistently demonstrates a higher density at larger values, and the class-wise expected values satisfy the inequality $\mathbb{E}_{\mathrm{neg}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}] > \mathbb{E}_{\mathrm{pos}}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]$ across all evaluated datasets and both data splits. The expected values for each condition are provided in Appendix [G.4](https://arxiv.org/html/2605.08863#A7.SS4).

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/train/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/train/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/train/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 12: Distribution of the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$ in train data.

![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/trivia_qa/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(a\)TriviaQA \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/trivia_qa/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(b\)TriviaQA \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/trivia_qa/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(c\)TriviaQA \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/squad/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(d\)SQuAD \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/squad/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(e\)SQuAD \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/squad/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(f\)SQuAD \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/nq/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(g\)NQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/nq/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(h\)NQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/nq/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(i\)NQ \(Llama\-3\.3\-70B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_8b/bioasq/validation/layer_07/gradients_layer_07_prob_weighted_neg_dzdp_topk_mean.png)\(j\)BioASQ \(Llama\-3\.1\-8B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/mistral_12b/bioasq/validation/layer_05/gradients_layer_05_prob_weighted_neg_dzdp_topk_mean.png)\(k\)BioASQ \(Mistral\-12B\)
![Refer to caption](https://arxiv.org/html/2605.08863v1/Data/gradients/llama3_70b/bioasq/validation/layer_06/gradients_layer_06_prob_weighted_neg_dzdp_topk_mean.png)\(l\)BioASQ \(Llama\-3\.3\-70B\)

Figure 13: Distribution of the joint product $P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$ in test data.

### G.4 Summary of Expected Values

We summarize the expected values for the positive ($\mathbb{E}_{\mathrm{pos}}$) and negative ($\mathbb{E}_{\mathrm{neg}}$) classes across all evaluated datasets. Tables [21](https://arxiv.org/html/2605.08863#A7.T21) to [26](https://arxiv.org/html/2605.08863#A7.T26) report the expectations of semantic probability ($P_{\mathrm{sem}}^{\mathbf{B}}$), sensitivity ($\bar{C}_{\mathbf{B}}$), and their joint product ($P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}$) for LLaMA-3.1-8B, Mistral-12B, and LLaMA-3.3-70B, evaluated on both training and test splits.

Across all models, the joint expectation $\mathbb{E}[P_{\mathrm{sem}}^{\mathbf{B}}\bar{C}_{\mathbf{B}}]$ yields a consistent threshold $1/\gamma$ in the range of approximately $0.37$ to $0.50$. This quantitative gap provides strong evidence for the robustness of the margin-widening effect: the expected penalty from positive instances remains substantially smaller than the gain from negative ones, ensuring reliable margin expansion whenever the proportion of negative samples exceeds this threshold.
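As a rough illustration of the condition stated above, the check amounts to comparing the fraction of negative bags with the threshold $1/\gamma$. The sketch below takes $1/\gamma$ as a given input, because its exact definition is given in the main text and is not reproduced in this appendix; the function name `negative_fraction_exceeds`, the label convention (0 = negative bag), and the toy labels are assumptions introduced here.

```python
import numpy as np

def negative_fraction_exceeds(labels, inv_gamma):
    """Return (negative fraction, whether it exceeds the threshold 1/gamma).

    Assumed convention (not taken from the paper): labels == 0 marks
    negative bags. inv_gamma is supplied by the caller.
    """
    labels = np.asarray(labels)
    neg_fraction = float(np.mean(labels == 0))  # share of negative bags
    return neg_fraction, neg_fraction > inv_gamma

# Toy labels purely for illustration; they are not data from the paper.
labels = [0, 0, 0, 1, 1]
neg_fraction, ok = negative_fraction_exceeds(labels, inv_gamma=0.50)
print(neg_fraction, ok)
```

The threshold value 0.50 used here is simply the upper end of the range reported above, which makes it the most demanding case for the toy labels.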

Importantly, these findings extend beyond the idealized invariant setting. As shown in Section [C](https://arxiv.org/html/2605.08863#A3), the same condition holds even under the more general non-invariant regime, where the active set varies along the path. Despite this added complexity, the empirical ratios consistently satisfy Equation ([4](https://arxiv.org/html/2605.08863#S3.E4)), demonstrating that the underlying mechanism is stable under realistic conditions.

Comparing these values with the empirical label distributions in Table [19](https://arxiv.org/html/2605.08863#A6.T19), we observe that the condition in Equation ([4](https://arxiv.org/html/2605.08863#S3.E4)) is strictly satisfied across all datasets, including those with a higher proportion of positive samples such as SQuAD and NQ. This provides strong empirical support that HaMI suppresses negative instances while preserving positive signals, resulting in a more robust decision boundary for hallucination detection.

Table 21: Empirical expected values and sensitivity ratios for LLaMA-3.1-8B (Train).

Table 22: Empirical expected values and sensitivity ratios for LLaMA-3.1-8B (Test).

Table 23: Empirical expected values and sensitivity ratios for Mistral-12B (Train).

Table 24: Empirical expected values and sensitivity ratios for Mistral-12B (Test).

Table 25: Empirical expected values and sensitivity ratios for LLaMA-3.3-70B (Train).

Table 26: Empirical expected values and sensitivity ratios for LLaMA-3.3-70B (Test).
