MM++: Unsupervised Scale-Invariant Multilayer OOD Detection via Top-K Gated Feature Fusion

arXiv cs.LG 06/17/26, 04:00 AM Papers
ood-detection unsupervised-learning deep-learning scale-invariant feature-fusion mahalanobis arxiv
Summary
MM++ is a fully unsupervised, post-hoc framework for out-of-distribution detection that fuses discriminative intermediate layers via top-K gated feature fusion and uses a regularized tied covariance matrix for scale-invariant distance estimation.
arXiv:2606.17352v1 Announce Type: new Abstract: We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit-Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.
# Unsupervised Scale-Invariant Multilayer OOD Detection via Top-𝐾 Gated Feature Fusion
Source: [https://arxiv.org/html/2606.17352](https://arxiv.org/html/2606.17352)
Rahim Hossain, Md Tawheedul Islam Bhuian, Md Farhan Shadiq, Kyoung-Don Kang
School of Computing, State University of New York at Binghamton
{rhossain, mislambhuian, mshadiq, kang}@binghamton.edu

###### Abstract

We introduce MM++ (Multilayer Mahalanobis++), a fully unsupervised, strictly post-hoc, and scale-invariant framework for Out-of-Distribution (OOD) detection. To address the trade-off between scale invariance and hierarchical expressivity, MM++ constructs a principled joint feature space. It first identifies discriminative intermediate layers by measuring entropy density drops, which mark the boundaries of sharp semantic compression. By fusing these selected layers with the terminal representation, the framework captures latent cross-layer correlations while mitigating early-layer noise. Crucially, a Ledoit–Wolf regularized tied covariance matrix stabilizes this unified space, enabling reliable distance estimation. Requiring no auxiliary OOD data, classifier fine-tuning, or architectural modifications, MM++ delivers robust performance across distinct architectures for both near- and far-OOD detection.

## 1 Introduction

Deep neural networks (DNNs) have achieved remarkable success but often exhibit overconfident predictions on out-of-distribution (OOD) inputs when deployed in open-world environments. This behavior poses significant risks in safety-critical applications such as medical diagnosis and autonomous driving \[[48](https://arxiv.org/html/2606.17352#bib.bib22),[28](https://arxiv.org/html/2606.17352#bib.bib21)\]. Consequently, reliable OOD detection remains a fundamental requirement for trustworthy DNN deployment.

A key driver of this overconfidence is a geometric phenomenon known as neural collapse \[[32](https://arxiv.org/html/2606.17352#bib.bib20)\]. During training, DNNs compress in-distribution (ID) representations into highly structured, low-dimensional class centroids, suppressing intra-class variability. While improving classification accuracy, this terminal compression reduces representational diversity, causing OOD samples to project near ID class structures and yield overconfident predictions.

To mitigate this without retraining, post-hoc methods often leverage intermediate representations \[[19](https://arxiv.org/html/2606.17352#bib.bib56),[35](https://arxiv.org/html/2606.17352#bib.bib48),[40](https://arxiv.org/html/2606.17352#bib.bib58),[31](https://arxiv.org/html/2606.17352#bib.bib52)\]. Mahalanobis-based approaches \[[19](https://arxiv.org/html/2606.17352#bib.bib56)\] model feature activations as class-conditional Gaussian distributions. To address feature scale sensitivity, Mahalanobis++ \[[31](https://arxiv.org/html/2606.17352#bib.bib52)\] introduces scale invariance via unit hypersphere projection prior to distance computation. However, operating solely on the terminally compressed penultimate layer limits its ability to capture mid-level structural cues critical for detecting near-OOD samples.

Multilayer methods like Mahalanobis \[[19](https://arxiv.org/html/2606.17352#bib.bib56)\] and X-Mahalanobis \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\] incorporate intermediate layers but inherently treat layer-wise representations as independent marginals, relying on the additive fusion of individual scores. This mathematical simplification discards cross-layer conditional dependencies. Consequently, they are less able to detect hierarchical inconsistencies, where an anomaly mimics ID features at individual layers but violates the expected evolutionary trajectory between them.

Furthermore, aggregating these independent scores requires regressing fusion weights using proxy OOD validation sets or classifier fine-tuning. This violates the strictly post-hoc assumption, biasing the detector toward the specific geometry of auxiliary anomalies.

A naive integration of Mahalanobis++'s scale invariance with X-Mahalanobis's multi-layer extraction remains limited if the underlying issue of marginal score aggregation persists. Robust multi-layer OOD detection requires modeling the unified joint distribution of the feature hierarchy, driven by intrinsic ID geometry, instead of additive score fusion.

To address this, we propose MM++ (Multilayer Mahalanobis++), a fully unsupervised, scale-invariant, and strictly post-hoc framework. MM++ eschews ad hoc layer weighting and score addition. Instead, it extends hyperspherical normalization to intermediate representations and introduces a Top-$K$ information gating mechanism to systematically select the most informative layers.

Specifically, we quantify layer-wise informativeness using covariance entropy estimated via Ledoit–Wolf shrinkage \[[18](https://arxiv.org/html/2606.17352#bib.bib10)\]. We derive an entropy density drop ($\Delta_l$) to identify layers undergoing the sharpest semantic compression. The penultimate layer is included as an anchor, while the top $K-1$ intermediate layers are selected based on $\Delta_l$, ensuring focus on representations with the most discriminative structural transitions.

The selected $\ell_2$-normalized features are concatenated into a unified representation, on which a single joint Mahalanobis++ distance is computed. By modeling cross-layer dependencies through a shared precision matrix, estimated via Ledoit–Wolf shrinkage, MM++ effectively penalizes samples that exhibit anomalous hierarchical trajectories. Conceptually, this fusion mechanism generalizes across heterogeneous architectures, as it operates on normalized intermediate features without requiring explicit layer-wise weighting or architecture-specific design choices (see Figure [1](https://arxiv.org/html/2606.17352#S1.F1)).

![Refer to caption](https://arxiv.org/html/2606.17352v1/x1.png)

Figure 1: Overview of the MM++ Framework. Unsupervised, scale-invariant multi-layer OOD detection via top-$K$ gated feature fusion (illustrated with $K=2$, ConvNeXt-T on ImageNet-LT as ID, and ImageNet-C as OOD). Left (Pipeline): MM++ first identifies top-$K$ layers by leveraging entropy density drops to capture maximum cross-layer compression, anchoring the penultimate layer. These intermediate and terminal features are concatenated into a unified representation $\phi(x)$, stabilized by a well-conditioned, shrunken tied covariance matrix ($\hat{\Sigma}_{\mathcal{K},\text{shrink}}$). Center (Feature Space): In this unified space, ID samples tightly cluster around established class centroids (yielding small Mahalanobis++ distances, $d_M^2$), while OOD samples are explicitly pushed to the periphery. Right (Distributions): Consequently, MM++ significantly tightens the ID density and isolates the OOD distribution. By pushing OOD samples into a negative tail, it achieves substantially higher separability and reduces ID-OOD overlap compared to state-of-the-art baselines.

Contributions.

MM++: A Unified Multilayer Framework. We propose a fully unsupervised, strictly post-hoc framework that shifts the multi-layer paradigm from additive score aggregation to joint feature-space modeling. MM++ integrates (i) a layer selection mechanism using covariance entropy and entropy density drops; (ii) architecture-independent concatenation of $\ell_2$-normalized features anchored at the penultimate layer; and (iii) a single joint Mahalanobis++ estimator with Ledoit–Wolf-regularized precision to explicitly capture cross-layer dependencies. It requires no proxy OOD data or fine-tuning, and introduces only a single hyperparameter ($K$).

Empirical Validation. We evaluate MM++ across global attention (ViTs), hierarchical attention (Swin), and convolutional (ConvNeXt) backbones. Unlike specialized methods such as X-Mahalanobis \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\], which are tailored to Transformers and balanced data, MM++ offers a more universal solution. It delivers consistently robust performance on ViTs, while extending state-of-the-art multilayer OOD detection to convolutional and hierarchical paradigms. Furthermore, in the long-tailed ImageNet-LT regime, MM++ consistently outperforms baselines on challenging near-OOD benchmarks (ImageNet-V2, -C, -ES, -R), demonstrating higher resilience to both architectural variance and class imbalance.

## 2 Related Work

OOD detection has evolved from simple output-based heuristics to approaches that analyze the geometric structure of deep representations \[[48](https://arxiv.org/html/2606.17352#bib.bib22),[28](https://arxiv.org/html/2606.17352#bib.bib21)\]. We categorize prior work into five directions, focusing on post-hoc methods most relevant to our setting. We exclude training-based approaches, such as \[[13](https://arxiv.org/html/2606.17352#bib.bib43),[37](https://arxiv.org/html/2606.17352#bib.bib26),[6](https://arxiv.org/html/2606.17352#bib.bib1),[15](https://arxiv.org/html/2606.17352#bib.bib45),[30](https://arxiv.org/html/2606.17352#bib.bib25),[41](https://arxiv.org/html/2606.17352#bib.bib46),[27](https://arxiv.org/html/2606.17352#bib.bib47),[34](https://arxiv.org/html/2606.17352#bib.bib23),[8](https://arxiv.org/html/2606.17352#bib.bib24)\], as they require additional training and are orthogonal to the strictly post-hoc regime considered in this work.

Output- and Logit-Based Methods. Early post-hoc methods rely on final network outputs. \[[12](https://arxiv.org/html/2606.17352#bib.bib54)\] showed that maximum softmax probability is typically lower for OOD samples. ODIN \[[20](https://arxiv.org/html/2606.17352#bib.bib55)\] improves separability via temperature scaling and input perturbation, while energy-based scores \[[23](https://arxiv.org/html/2606.17352#bib.bib53)\] reinterpret logits through the lens of energy-based models. Despite their efficiency, these approaches are constrained by their reliance on the low-dimensional outputs of the final linear classification layer, which discard rich intermediate representations and limit access to geometric structures, such as those associated with neural collapse, that can be informative for robust OOD detection.

Feature-Based and Geometric Methods. To overcome this limitation, feature-based approaches operate in the pre-logit space. A seminal method is the Mahalanobis distance \[[19](https://arxiv.org/html/2606.17352#bib.bib56)\], which models features as class-conditional Gaussians with shared covariance and measures distances to class centers. Extensions include non-parametric methods such as $k$-nearest neighbors \[[40](https://arxiv.org/html/2606.17352#bib.bib58)\] and relative Mahalanobis variants \[[35](https://arxiv.org/html/2606.17352#bib.bib48)\]. These approaches have also been applied beyond vision, including language modeling \[[16](https://arxiv.org/html/2606.17352#bib.bib42)\] and medical imaging \[[2](https://arxiv.org/html/2606.17352#bib.bib41),[44](https://arxiv.org/html/2606.17352#bib.bib40)\]. However, their effectiveness depends on simplifying distributional assumptions, such as Gaussianity, that are often misaligned with the complex, anisotropic feature geometries observed in modern architectures \[[3](https://arxiv.org/html/2606.17352#bib.bib30),[31](https://arxiv.org/html/2606.17352#bib.bib52)\].

Scale Invariance and Feature Rectification. Modern architectures often produce feature embeddings with substantial variation in magnitude, which can distort distance-based metrics. ReAct \[[39](https://arxiv.org/html/2606.17352#bib.bib57)\] addresses this by truncating extreme activations. More recently, Mahalanobis++ \[[31](https://arxiv.org/html/2606.17352#bib.bib52)\] introduces scale invariance by projecting features onto a unit hypersphere prior to distance computation. This improves robustness by reducing norm-induced bias, but remains restricted to the penultimate layer.

Multilayer and Hierarchical Detection. Deep networks encode information hierarchically, from low-level features to high-level semantics. Motivated by this, multilayer approaches aim to detect OOD signals across depth \[[19](https://arxiv.org/html/2606.17352#bib.bib56),[22](https://arxiv.org/html/2606.17352#bib.bib17),[40](https://arxiv.org/html/2606.17352#bib.bib58)\]. In particular, X-Mahalanobis \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\] attempts to capture distributional shifts by densely aggregating layer-wise distance scores. While conceptually unsupervised, it relies on parameter-efficient fine-tuning of a task-specific linear head, rendering it not strictly post-hoc in practice. Moreover, its dense aggregation scheme via variance-based weighting may assign non-negligible weights to early layers, allowing high-variance spatial noise to propagate and interfere with the deeper semantic representations essential for reliable OOD detection.

Covariance Entropy and Sparsification. Recent theory highlights the information bottleneck principle \[[42](https://arxiv.org/html/2606.17352#bib.bib13)\] and neural collapse \[[32](https://arxiv.org/html/2606.17352#bib.bib20)\], where deep features concentrate into low-dimensional subspaces as semantic information becomes increasingly aligned and nuisance variation is progressively suppressed. The structure of these representations can be quantified via the entropy of the covariance eigenspectrum \[[9](https://arxiv.org/html/2606.17352#bib.bib14),[17](https://arxiv.org/html/2606.17352#bib.bib15),[1](https://arxiv.org/html/2606.17352#bib.bib16)\]. Layers exhibiting sharp decreases in this entropy, indicating rapid semantic compression, may correspond to critical transition points in the hierarchical representation. However, standard soft-weighting schemes based on raw entropy values do not induce sparsity across layers and therefore retain contributions from less compressed representations. This may introduce residual early-layer variability into the aggregated representation \[[29](https://arxiv.org/html/2606.17352#bib.bib3),[38](https://arxiv.org/html/2606.17352#bib.bib2)\]. More fundamentally, such continuous weighting obscures discrete structural changes in representation geometry, making it difficult to localize compression boundaries. Motivated by this observation, we instead seek a discrete criterion that identifies abrupt changes in representational complexity. Specifically, we define entropy density drops to detect sharp transitions in covariance structure, and use these as indicators for selecting a small subset of semantically compressed layers. This enables selective Top-$K$ gating that emphasizes post-collapse representations while suppressing high-variance early-layer signals.

Connection to Our Work. MM++ builds upon these directions to reconcile the inherent trade-off between hierarchical expressivity and terminal compression in a strictly post-hoc setting. It extends the scale-invariant $\ell_2$-normalization of Mahalanobis++ to a multi-layer framework, replacing heuristic aggregation, such as the variance-based weighting used by X-Mahalanobis \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\], with a principled Top-$K$ information gating mechanism. Specifically, we analyze the ID feature spectrum via entropy density drops to pinpoint the boundaries of sharp semantic compression. We anchor the terminal layer, select the $K-1$ most informative intermediate stages, and fuse features into a unified representation space via architecture-independent concatenation. This joint space is stabilized by a Ledoit–Wolf regularized tied covariance matrix, allowing a single Mahalanobis++ distance to capture vital cross-layer geometric interactions while suppressing high-variance noise from early layers or under-represented classes. Consequently, MM++ achieves robust near- and far-OOD detection across heterogeneous architectures and long-tailed regimes without requiring proxy OOD data or classifier fine-tuning.

## 3 Method

### 3.1 Preliminaries

Let $\mathcal{D}_{\text{train}}=\{(x_i,y_i)\}_{i=1}^{N}$ be the in-distribution training set, where $y_i\in\{1,\ldots,C\}$ and $N_c$ denotes the number of samples in class $c$. A pre-trained deep network $\mathcal{F}=g\circ h$ decomposes into a feature extractor $h$ and a classifier head $g$. We treat the feature extractor $h$ as an $L$-layer network. Thus, layer $L$ represents the terminal feature space of $h$, which serves as the penultimate layer of the overall network $\mathcal{F}$. At inference, we extract intermediate feature maps $h_l(x)\in\mathbb{R}^{D_l}$ from $L$ designated layers $l=1,\ldots,L$.

To eliminate feature-magnitude variations across layers and architectures, we $\ell_2$-normalize every representation onto the unit hypersphere:

$$\tilde{h}_l(x)=h_l(x)/\|h_l(x)\|_2. \tag{1}$$

For each layer $l\in\{1,\ldots,L\}$, we estimate class-conditional means and a tied covariance matrix from the normalized training features:

$$\hat{\mu}_{c,l}=\frac{1}{N_c}\sum_{i:\,y_i=c}\tilde{h}_l(x_i),\qquad \hat{\Sigma}_l=\frac{1}{N}\sum_{c=1}^{C}\sum_{i:\,y_i=c}\bigl(\tilde{h}_l(x_i)-\hat{\mu}_{c,l}\bigr)\bigl(\tilde{h}_l(x_i)-\hat{\mu}_{c,l}\bigr)^{\top}. \tag{2}$$

Because $\hat{\Sigma}_l$ can be rank-deficient or ill-conditioned when $D_l\gg N_c$, we apply Ledoit–Wolf shrinkage \[[18](https://arxiv.org/html/2606.17352#bib.bib10)\] to obtain a well-conditioned covariance matrix:

$$\hat{\Sigma}_{l,\text{shrink}}=(1-\gamma)\,\hat{\Sigma}_l+\gamma\,\frac{\operatorname{Tr}(\hat{\Sigma}_l)}{D_l}\,\mathbf{I}, \tag{3}$$

where $\gamma\in[0,1]$ is not a hyperparameter but is derived analytically by minimizing the expected mean squared error (Frobenius norm) using only ID data \[[18](https://arxiv.org/html/2606.17352#bib.bib10)\]. Subsequently, the corresponding precision matrix, $\hat{\Sigma}^{-1}_{l,\text{shrink}}$, is computed.
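The per-layer calibration in Eqs. (1) to (3) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names are hypothetical, the features are synthetic stand-ins for one layer, and the shrinkage intensity uses the standard closed-form Ledoit–Wolf estimate applied to class-centered features (the paper does not say which implementation it relies on).

```python
import numpy as np

def l2_normalize(H, eps=1e-12):
    """Eq. (1): project each row of H (n_samples x D_l) onto the unit hypersphere."""
    return H / np.maximum(np.linalg.norm(H, axis=1, keepdims=True), eps)

def center_by_class(H, y):
    """Return per-class means and the features centered on their own class mean."""
    classes = np.unique(y)
    means = np.stack([H[y == c].mean(axis=0) for c in classes])
    return means, H - means[np.searchsorted(classes, y)]

def ledoit_wolf_gamma(Xc):
    """Closed-form Ledoit-Wolf shrinkage intensity for rows that are already centered."""
    n, p = Xc.shape
    S = Xc.T @ Xc / n
    mu = np.trace(S) / p
    delta = np.sum((S - mu * np.eye(p)) ** 2) / p               # distance of S from the target
    X2 = Xc ** 2
    beta = (np.sum(X2.T @ X2) / n - np.sum(S ** 2)) / (p * n)   # sampling noise of S
    beta = min(max(beta, 0.0), delta)                            # keeps gamma inside [0, 1]
    return 0.0 if delta == 0.0 else beta / delta

def shrunk_tied_covariance(H, y):
    """Eqs. (2)-(3) for one layer: tied covariance of normalized features, then shrinkage."""
    means, Xc = center_by_class(l2_normalize(H), y)
    S = Xc.T @ Xc / len(Xc)
    gamma = ledoit_wolf_gamma(Xc)
    D = S.shape[0]
    return means, (1.0 - gamma) * S + gamma * np.trace(S) / D * np.eye(D), gamma

# Synthetic stand-in for one layer: 60 samples, 128 dims (more dims than samples),
# with feature norms that vary by two orders of magnitude.
rng = np.random.default_rng(0)
H = rng.normal(size=(60, 128)) * rng.uniform(0.1, 10.0, size=(60, 1))
y = np.repeat(np.arange(3), 20)
means, S_shrink, gamma = shrunk_tied_covariance(H, y)
precision = np.linalg.inv(S_shrink)   # invertible because shrinkage lifts the zero eigenvalues
```

With more dimensions than samples the raw tied covariance is singular, so the inversion on the last line only succeeds because of the shrinkage step.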

### 3.2 Intra-Class Covariance Entropy

We characterize the representational richness of each layer $l$ in the encoder through the Shannon entropy of the normalized, well-conditioned eigenvalue spectrum of $\hat{\Sigma}_{l,\text{shrink}}$. Let $\{\lambda_i^{(l)}\}_{i=1}^{D_l}$ be the eigenvalues of the shrunk covariance matrix $\hat{\Sigma}_{l,\text{shrink}}$. Because $\hat{\Sigma}_{l,\text{shrink}}$ is positive semi-definite, all $\lambda_i^{(l)}\geq 0$. To ensure numerical stability, we set $\lambda_i^{(l)}=\max(\lambda_i^{(l)},\epsilon)$ before normalization, where $\epsilon$ is a small positive constant (e.g., $10^{-8}$). We then define the normalized eigenspectrum and the intra-class covariance entropy as:

$$\bar{\lambda}_i^{(l)}=\lambda_i^{(l)}\Big/\sum_{j=1}^{D_l}\lambda_j^{(l)},\quad H_l=-\sum_{i=1}^{D_l}\bar{\lambda}_i^{(l)}\ln\bar{\lambda}_i^{(l)}. \tag{4}$$
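As a rough sketch of Eq. (4), the entropy can be computed from the eigenvalues of a symmetric covariance matrix. The function name `covariance_entropy` and the two toy matrices are illustrative assumptions, chosen only to show the two extremes of the measure.

```python
import numpy as np

def covariance_entropy(cov, eps=1e-8):
    """Eq. (4): Shannon entropy of the clamped, normalized eigenvalue spectrum."""
    lam = np.linalg.eigvalsh(cov)        # real eigenvalues of a symmetric matrix
    lam = np.maximum(lam, eps)           # numerical floor, as described in the text
    lam_bar = lam / lam.sum()
    return float(-(lam_bar * np.log(lam_bar)).sum())

D = 16
# Flat spectrum: variance spread evenly over all D directions (entropy is ln D).
H_flat = covariance_entropy(np.eye(D))
# Collapsed spectrum: essentially all variance in a single direction (entropy near 0).
H_collapsed = covariance_entropy(np.diag([1.0] + [0.0] * (D - 1)))
```

The entropy therefore ranges from about 0 for a fully collapsed layer up to $\ln D_l$ for an isotropic one, which is why dividing by $D_l$ is needed before comparing layers of different widths.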

### 3.3 Layer Selection via Entropy Density Drops

We define the entropy density of layer $l$, the Shannon entropy per feature dimension, as $\rho_l=\frac{H_l}{D_l}$. This formulation yields a scale-invariant metric of representational richness that naturally accounts for varying layer widths $D_l$. To quantify the stage-wise feature compression across the network, we define the entropy density drop between consecutive layers as the difference in their entropy densities:

$$\Delta_l=\rho_{l-1}-\rho_l=\frac{H_{l-1}}{D_{l-1}}-\frac{H_l}{D_l},\quad l=2,\ldots,L. \tag{5}$$

A large positive $\Delta_l$ signals a sharp drop in relative intrinsic dimensionality at layer $l$: the network undergoes strong semantic compression at that transition. Layers at such boundaries carry the most discriminative geometric structure for OOD detection.
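A small worked example may help fix the scale of these quantities. The two normalized spectra below are made-up values for consecutive layers of equal width $D_{l-1}=D_l=4$; they are not taken from the paper.

```latex
\begin{align*}
\bar{\lambda}^{(l-1)} &= (0.25,\,0.25,\,0.25,\,0.25)
  &&\Rightarrow\; H_{l-1} = \ln 4 \approx 1.3863,
  &&\rho_{l-1} = H_{l-1}/4 \approx 0.3466,\\
\bar{\lambda}^{(l)} &= (0.7,\,0.1,\,0.1,\,0.1)
  &&\Rightarrow\; H_{l} = -(0.7\ln 0.7 + 0.3\ln 0.1) \approx 0.9404,
  &&\rho_{l} = H_{l}/4 \approx 0.2351,\\
\Delta_l &= \rho_{l-1}-\rho_l \approx 0.3466 - 0.2351 = 0.1115 .
\end{align*}
```

Concentrating 70% of the variance in one direction thus produces a positive drop, whereas a transition back toward a flatter spectrum would make $\Delta_l$ negative.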

Motivated by the neural collapse phenomenon \[[32](https://arxiv.org/html/2606.17352#bib.bib20)\], which indicates that the penultimate layer maximally encodes class-discriminative geometry, we always include the penultimate layer $L$ in the active set. We then select $K-1$ additional layers by their entropy density drops:

$$\mathcal{K}=\{L\}\;\cup\;\operatorname*{arg\,top\text{-}(K-1)}_{l\in\{2,\ldots,L-1\}}\Delta_l \quad\text{where } \Delta_l>0. \tag{6}$$

If $\Delta_l<0$, the entropy density increases from layer $l-1$ to layer $l$. This can happen in structural expansions (e.g., channel upsampling in hierarchical backbones) or attention-based mixing. Such layers are excluded from selection by the condition $\Delta_l>0$ in Eq. (6).
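The selection rule in Eqs. (5) and (6) can be sketched as below. The function name `select_layers` and the entropy values are illustrative assumptions. The source does not say what happens when fewer than $K-1$ candidates have a positive drop; this sketch simply returns fewer layers in that case.

```python
import numpy as np

def select_layers(H, D, K):
    """H[l-1] and D[l-1] hold the entropy and width of layer l = 1..L (1-based layers).

    Returns the sorted 1-based layer set K and the dict of drops Delta_l for l = 2..L.
    """
    rho = np.asarray(H, dtype=float) / np.asarray(D, dtype=float)   # entropy density
    L = len(rho)
    drops = {l: float(rho[l - 2] - rho[l - 1]) for l in range(2, L + 1)}   # Eq. (5)
    # Candidates are intermediate layers 2..L-1 with a strictly positive drop (Eq. 6).
    candidates = [l for l in range(2, L) if drops[l] > 0]
    ranked = sorted(candidates, key=lambda l: drops[l], reverse=True)
    return sorted(ranked[:K - 1] + [L]), drops   # layer L is always anchored

# Made-up entropies for a 5-layer encoder of constant width 8.
H = [5.0, 4.8, 3.0, 3.1, 1.0]
D = [8, 8, 8, 8, 8]

layers_k2, drops = select_layers(H, D, K=2)   # anchor plus the single sharpest drop
layers_k3, _ = select_layers(H, D, K=3)
layers_k4, _ = select_layers(H, D, K=4)       # only two positive candidates exist
```

In this toy profile the sharpest intermediate drop occurs at layer 3, and layer 4 is never selected because its entropy density rises relative to layer 3.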

### 3.4 Feature Fusion and OOD Scoring

Given the selected layer set $\mathcal{K}=\{l_1,\ldots,l_K\}$, we combine the per-layer representations into a single vector $\boldsymbol{\phi}(x)$ on which a class-conditional Mahalanobis detector is applied.

Cross-Layer Concatenated Fusion. We propose retaining the full, uncompressed information from every selected layer by concatenating the $\ell_2$-normalized feature vectors:

$$\boldsymbol{\phi}(x_i)=\bigl[\,\tilde{h}_{l_1}(x_i);\;\ldots;\;\tilde{h}_{l_K}(x_i)\,\bigr]\;\in\;\mathbb{R}^{\sum_{l\in\mathcal{K}}D_l}. \tag{7}$$

This representation is lossless for each layer $l\in\mathcal{K}$: it encodes the per-layer features without any mixing. Moreover, it is architecture-independent: for homogeneous architectures (e.g., ViT with $D_l=D$), the joint dimension is $D_{\mathcal{K}}=\sum_{l\in\mathcal{K}}D_l=K\times D$. For heterogeneous architectures (e.g., Swin-T or ConvNeXt with varying $D_l$), $D_{\mathcal{K}}=\sum_{l\in\mathcal{K}}D_l$. Crucially, no explicit aggregation weights are required: the scalar contribution of each layer is learned implicitly by the joint precision matrix described below. Thus, MM++ introduces only one hyperparameter ($K$).

Joint Precision Matrix Estimation. We compute fused class means and the tied covariance matrix for the selected $K$ layers:

$$\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}=\frac{1}{N_c}\sum_{i:\,y_i=c}\boldsymbol{\phi}(x_i),\quad \hat{\Sigma}_{\mathcal{K}}=\frac{1}{N}\sum_{c=1}^{C}\sum_{i:\,y_i=c}\bigl(\boldsymbol{\phi}(x_i)-\hat{\boldsymbol{\mu}}_{y_i}^{\mathcal{K}}\bigr)\bigl(\boldsymbol{\phi}(x_i)-\hat{\boldsymbol{\mu}}_{y_i}^{\mathcal{K}}\bigr)^{\top}. \tag{8}$$

We then apply Ledoit–Wolf shrinkage to the centered fused representations to obtain a well-conditioned tied covariance matrix $\hat{\Sigma}_{\mathcal{K},\text{shrink}}$:

$$\hat{\Sigma}_{\mathcal{K},\text{shrink}}=(1-\gamma)\hat{\Sigma}_{\mathcal{K}}+\gamma\left(\frac{\operatorname{Tr}(\hat{\Sigma}_{\mathcal{K}})}{D_{\mathcal{K}}}\right)\mathbf{I}\;\in\;\mathbb{R}^{D_{\mathcal{K}}\times D_{\mathcal{K}}}, \tag{9}$$

where $\mathbf{I}$ is the $D_{\mathcal{K}}\times D_{\mathcal{K}}$ identity matrix.

Let $\hat{\Sigma}_{\mathcal{K}}^{-1}=\hat{\Sigma}_{\mathcal{K},\text{shrink}}^{-1}$ denote the shrunk precision matrix in the fused representation space. The diagonal blocks $\hat{\Sigma}_{\mathcal{K},ll}^{-1}$ capture intra-layer class-conditional covariance at each selected layer $l$, recovering the behavior of single-layer Mahalanobis++ detectors. The off-diagonal blocks $\hat{\Sigma}_{\mathcal{K},ll'}^{-1}$ ($l\neq l'$) encode cross-layer covariance, capturing how representations at layers $l$ and $l'$ co-vary within a class. Such interactions cannot be modeled by additive fusion, which reduces the precision to a single $D\times D$ matrix and conflates intra- and cross-layer structure. Ledoit–Wolf shrinkage ensures a well-conditioned estimate even when $\sum_{l\in\mathcal{K}}D_l\gg N_c$ \[[18](https://arxiv.org/html/2606.17352#bib.bib10)\], with the shrinkage coefficient computed analytically from ID training data.
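The fusion and joint estimation steps (Eqs. (7) to (9)) might look like the following in NumPy. This is a sketch under assumptions: `fit_joint` and the helper names are hypothetical, the two "layers" are synthetic arrays of width 6 and 4, and the shrinkage intensity again uses the standard closed-form Ledoit–Wolf estimate rather than any implementation confirmed by the source.

```python
import numpy as np

def l2_normalize(H):
    return H / np.linalg.norm(H, axis=1, keepdims=True)

def ledoit_wolf_gamma(Xc):
    """Closed-form Ledoit-Wolf shrinkage intensity for centered rows (assumed estimator)."""
    n, p = Xc.shape
    S = Xc.T @ Xc / n
    mu = np.trace(S) / p
    delta = np.sum((S - mu * np.eye(p)) ** 2) / p
    beta = (np.sum((Xc ** 2).T @ (Xc ** 2)) / n - np.sum(S ** 2)) / (p * n)
    beta = min(max(beta, 0.0), delta)
    return 0.0 if delta == 0.0 else beta / delta

def fit_joint(layer_feats, y):
    """layer_feats: list of (N x D_l) arrays, one per selected layer in K."""
    phi = np.concatenate([l2_normalize(H) for H in layer_feats], axis=1)    # Eq. (7)
    classes = np.unique(y)
    mu = np.stack([phi[y == c].mean(axis=0) for c in classes])              # Eq. (8), means
    Xc = phi - mu[np.searchsorted(classes, y)]
    S = Xc.T @ Xc / len(Xc)                                                 # Eq. (8), covariance
    gamma = ledoit_wolf_gamma(Xc)
    Dk = S.shape[0]
    S_shrink = (1.0 - gamma) * S + gamma * np.trace(S) / Dk * np.eye(Dk)    # Eq. (9)
    return mu, np.linalg.inv(S_shrink)

rng = np.random.default_rng(0)
y = np.repeat(np.arange(4), 50)
H1 = rng.normal(size=(200, 6)) + y[:, None]         # stand-in intermediate layer, D = 6
H2 = 0.5 * H1[:, :4] + rng.normal(size=(200, 4))    # stand-in penultimate layer, depends on H1
mu, precision = fit_joint([H1, H2], y)

# Off-diagonal block linking the two layers; additive score fusion has no counterpart to it.
cross_block = precision[:6, 6:]
```

Because the second synthetic layer is built from the first, `cross_block` is clearly non-zero here; if the two layers were independent within each class, it would shrink toward zero and the joint distance would approach a sum of per-layer distances.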

OOD Score of MM++. The final OOD confidence score for a test input $x$ is the negative minimum Mahalanobis++ distance over all classes in the joint feature space:

$$\mathcal{S}_{\text{MM++}}(x)=-\min_{c\in\{1,\ldots,C\}}\;\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}\bigr)^{\top}\hat{\Sigma}_{\mathcal{K}}^{-1}\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}\bigr). \tag{10}$$

Higher values of $\mathcal{S}_{\text{MM++}}(x)$ indicate that $x$ is closer to some in-distribution class cluster and is therefore more likely in-distribution. A test sample may appear plausible under the single-layer distributions at $l_1$ and $l_2$ individually, yet be flagged by $\hat{\Sigma}_{\mathcal{K}}^{-1}$ because the joint configuration $(\tilde{h}_{l_1}(x),\tilde{h}_{l_2}(x))$ is inconsistent with any training class, a failure mode that additive fusion cannot detect. This method requires neither labeled OOD data nor additional model training or fine-tuning. Instead, it relies solely on closed-form Ledoit–Wolf covariance estimates computed from ID data.
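The scoring rule in Eq. (10), and the cross-layer effect described above, can be illustrated on a deliberately tiny example. Everything here is an assumption made for illustration: `mmpp_score` is a hypothetical name, the joint space consists of two one-dimensional "layers" with a hand-picked covariance (so no $\ell_2$ normalization is applied), and the additive baseline is modeled as a sum of per-layer distances with equal weights.

```python
import numpy as np

def mmpp_score(phi, mu, precision):
    """Eq. (10): negative minimum squared Mahalanobis distance over class means."""
    diff = phi[:, None, :] - mu[None, :, :]                       # (n, C, D)
    d2 = np.einsum("ncd,de,nce->nc", diff, precision, diff)       # (n, C)
    return -d2.min(axis=1)

# Toy joint space: two 1-D "layers" with within-class correlation 0.9, one class at the origin.
cov = np.array([[1.0, 0.9],
                [0.9, 1.0]])
mu = np.zeros((1, 2))
x = np.array([[1.5, 1.5],     # follows the usual cross-layer trend
              [1.5, -1.5]])   # same per-layer magnitudes, opposite trend

joint = mmpp_score(x, mu, np.linalg.inv(cov))
# Additive baseline: per-layer distances summed, i.e. a block-diagonal precision matrix.
additive = mmpp_score(x, mu, np.diag(1.0 / np.diag(cov)))
```

Both points sit 1.5 standard deviations from the mean in each layer, so the additive baseline assigns them the same score, while the joint score is far lower for the second point because its two layers disagree with the learned correlation.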

In MM++, calibration, which estimates $\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}$ and $\hat{\Sigma}_{\mathcal{K}}^{-1}$ in the unified space, is performed once offline. Using these estimates, MM++ performs online OOD detection for a test input $x$ (Eq. [10](https://arxiv.org/html/2606.17352#S3.E10)). The pseudocode for MM++ is presented in Algorithm [1](https://arxiv.org/html/2606.17352#alg1), and a theoretical justification is provided in Appendix [A](https://arxiv.org/html/2606.17352#A1).

Algorithm 1 MM++: Offline Calibration and Online Inference

Input: ID dataset $\mathcal{D}_{\text{ID}}$, pre-trained encoder $h$ with $L$ layers, fusion size $K$, test sample $x$

Output: OOD score $\mathcal{S}_{\text{MM++}}(x)$

1. Phase I: One-Time Offline Calibration
2. for $l\in\{1,\dots,L\}$ do
3. Extract and $\ell_2$-normalize features $\tilde{h}_l(x_i)$ for all training samples $x_i\in\mathcal{D}_{\text{ID}}$
4. Estimate within-class covariance $\hat{\Sigma}_l$ and apply Ledoit–Wolf shrinkage
5. Compute covariance entropy $H_l$
6. end for
7. Compute entropy density drops $\Delta_l$ for $l\geq 2$ and select layers: $\mathcal{K}=\{L\}\cup\arg\text{top}_{K-1}(\{\Delta_l\}_{l=2}^{L-1})$
8. for each sample $x_i\in\mathcal{D}_{\text{ID}}$ do
9. Fuse by concatenation: $\boldsymbol{\phi}(x_i)=\bigl[\,\tilde{h}_{l_1}(x_i);\;\ldots;\;\tilde{h}_{l_K}(x_i)\,\bigr]$
10. end for
11. for class $c\in\{1,\dots,C\}$ do
12. Compute fused class mean: $\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}=\frac{1}{N_c}\sum_{i:\,y_i=c}\boldsymbol{\phi}(x_i)$
13. end for
14. Compute tied covariance $\hat{\Sigma}_{\mathcal{K}}$ and apply Ledoit–Wolf shrinkage to obtain precision matrix $\hat{\Sigma}_{\mathcal{K}}^{-1}$
15. Phase II: Test-Time Inference
16. For test input $x$, extract and $\ell_2$-normalize features $\tilde{h}_l(x)$ for selected layers $l\in\mathcal{K}$
17. Construct fused representation: $\boldsymbol{\phi}(x)=\bigl[\,\tilde{h}_{l_1}(x);\;\ldots;\;\tilde{h}_{l_K}(x)\,\bigr]$
18. return $\mathcal{S}_{\text{MM++}}(x)=-\min_{c}\;\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}\bigr)^{\top}\hat{\Sigma}_{\mathcal{K}}^{-1}\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_c^{\mathcal{K}}\bigr)$

## 4Evaluation

### 4\.1Experimental Setup

In\-Distribution Dataset \(𝒟ID\\mathcal\{D\}\_\{\\text\{ID\}\}\):We evaluate MM\+\+ on ImageNet\-1K\[[36](https://arxiv.org/html/2606.17352#bib.bib32)\]and its long\-tailed counterpart, ImageNet\-LT\[[26](https://arxiv.org/html/2606.17352#bib.bib60)\], as primary ID benchmarks\. This enables a rigorous assessment of the framework’s stability under both balanced and skewed class distributions\. The specific ID dataset is indicated in each respective results discussion\.

Out\-of\-Distribution Datasets:We evaluate challenging distribution shifts covering near\-OOD datasets with high semantic similarity \(ImageNet\-V2\[[33](https://arxiv.org/html/2606.17352#bib.bib62)\], ImageNet\-C\[[11](https://arxiv.org/html/2606.17352#bib.bib65)\], ImageNet\-R\[[10](https://arxiv.org/html/2606.17352#bib.bib66)\], ImageNet\-ES\[[21](https://arxiv.org/html/2606.17352#bib.bib63)\]\) and far\-OOD datasets exhibiting strong domain shifts \(ImageNet\-O\[[14](https://arxiv.org/html/2606.17352#bib.bib31)\], OpenImage\-O\[[47](https://arxiv.org/html/2606.17352#bib.bib7)\], NINCO\[[3](https://arxiv.org/html/2606.17352#bib.bib30)\], Places365\[[49](https://arxiv.org/html/2606.17352#bib.bib28)\], Textures\[[4](https://arxiv.org/html/2606.17352#bib.bib27)\], iNaturalist\[[43](https://arxiv.org/html/2606.17352#bib.bib67)\], SUN\[[46](https://arxiv.org/html/2606.17352#bib.bib68)\]\)\.

Metrics. We report AUROC and FPR95. The threshold $\tau$ is set such that $P(\mathcal{S}(x) \geq \tau \mid x \in \mathcal{D}_{\text{ID}}) = 0.95$. An input $x$ is classified as ID if $\mathcal{S}(x) \geq \tau$, and as OOD otherwise.
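The two metrics can be computed from raw scores as in the sketch below. The function names and the toy scores are illustrative assumptions; on finite samples the threshold is approximated by the 5th percentile of the ID scores (NumPy's default linear interpolation), so that roughly 95% of ID inputs satisfy $\mathcal{S}(x) \geq \tau$.

```python
import numpy as np

def fpr_at_95_tpr(id_scores, ood_scores):
    """FPR95: fraction of OOD inputs accepted as ID when ~95% of ID inputs are accepted."""
    tau = np.percentile(id_scores, 5)        # ~95% of ID scores lie at or above tau
    return float(np.mean(np.asarray(ood_scores) >= tau)), float(tau)

def auroc(id_scores, ood_scores):
    """Probability that an ID score exceeds an OOD score (ties count as 1/2)."""
    id_s = np.asarray(id_scores)[:, None]
    ood_s = np.asarray(ood_scores)[None, :]
    return float(np.mean((id_s > ood_s) + 0.5 * (id_s == ood_s)))

# Toy scores (made up for illustration): higher means more ID-like.
toy_id = [0.9, 0.8, 0.7, 0.6]
toy_ood = [0.65, 0.2, 0.1]
toy_auroc = auroc(toy_id, toy_ood)           # 11 of 12 ID/OOD pairs are ordered correctly
toy_fpr, toy_tau = fpr_at_95_tpr(toy_id, toy_ood)
```

The pairwise AUROC is quadratic in the number of samples; for benchmark-sized score arrays a rank-based implementation would be preferable.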

Backbones. Main results are reported using an ImageNet-21K pretrained ViT-B/16 \[[5](https://arxiv.org/html/2606.17352#bib.bib36)\]. Appendix [B](https://arxiv.org/html/2606.17352#A2) provides additional results demonstrating cross-architectural applicability across diverse design paradigms, including Swin-T \[[24](https://arxiv.org/html/2606.17352#bib.bib37)\] (hierarchical attention), ConvNeXt-T \[[25](https://arxiv.org/html/2606.17352#bib.bib35)\] (modernized convolutions), and EVA02-S14 \[[7](https://arxiv.org/html/2606.17352#bib.bib69)\] (a large-scale pretrained vision transformer with improved representation capacity).

Baselines. We compare MM++ against established post-hoc methods: MSP \[[12](https://arxiv.org/html/2606.17352#bib.bib54)\], ODIN \[[20](https://arxiv.org/html/2606.17352#bib.bib55)\], Energy \[[23](https://arxiv.org/html/2606.17352#bib.bib53)\], ReAct \[[39](https://arxiv.org/html/2606.17352#bib.bib57)\], KNN \[[40](https://arxiv.org/html/2606.17352#bib.bib58)\], Mahalanobis \[[19](https://arxiv.org/html/2606.17352#bib.bib56)\], Relative Mahalanobis \[[35](https://arxiv.org/html/2606.17352#bib.bib48)\], Mahalanobis++ \[[31](https://arxiv.org/html/2606.17352#bib.bib52)\], Relative Mahalanobis++ (an $\ell_2$-normalized variant for evaluation), and X-Mahalanobis \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\] (reviewed in Appendix [D](https://arxiv.org/html/2606.17352#A4)).

We use public implementations with frozen parameters and consistent datasets. Only X-Mahalanobis fine-tunes the classifier \[[45](https://arxiv.org/html/2606.17352#bib.bib51)\]. For MM++, we use $K=2$ (the anchored penultimate layer and one information-gated intermediate layer). A sensitivity analysis for $K$ is presented in Section [4.3](https://arxiv.org/html/2606.17352#S4.SS3).

### 4.2 Comparison to State-of-the-Art

Results using ImageNet-1K as ID. As shown in Table [1](https://arxiv.org/html/2606.17352#S4.T1), specialized methods such as X-Mahalanobis, which fine-tunes on the ID training set, achieve the highest overall performance on the balanced ImageNet-1K benchmark, proving effective at establishing broad decision boundaries for far-OOD outliers. Nevertheless, MM++ offers a versatile, strictly post-hoc alternative: it achieves competitive performance, the best AUROC on Texture (97.00%), and an AUROC on SUN that ties X-Mahalanobis for best (89.64%), all without requiring auxiliary OOD data, extensive hyperparameter calibration, or fine-tuning.

Results using ImageNet-LT as ID. Table [2](https://arxiv.org/html/2606.17352#S4.T2) evaluates the challenging long-tailed ID regime. Here, MM++ demonstrates superior robustness, yielding the highest average AUROC (83.91%). While X-Mahalanobis’s fine-tuning grants an advantage on pure semantic shifts (NINCO, OpenImage-O), MM++ excels under severe near-OOD shifts (ImageNet-ES, -R, -V2). By integrating structural correlations from intermediate blocks with the anchored penultimate layer, MM++ effectively handles subtle semantic deviations that terminal-layer representations or additive multi-layer fusions, which aggregate individual layer-wise scores, are likely to overlook in imbalanced regimes.

Table 1: OOD detection performance using ViT-B/16 with ImageNet-1K as ID. († denotes original paper results. ↑: higher is better, ↓: lower is better; values are in %.)

| Method | ImageNet-O AUROC↑ | ImageNet-O FPR95↓ | Texture AUROC↑ | Texture FPR95↓ | Places365 AUROC↑ | Places365 FPR95↓ | iNaturalist AUROC↑ | iNaturalist FPR95↓ | SUN AUROC↑ | SUN FPR95↓ | Average AUROC↑ | Average FPR95↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSP | 73.63 | 75.30 | 82.72 | 58.39 | 80.44 | 65.90 | 92.57 | 34.35 | 84.75 | 55.98 | 82.82 | 57.98 |
| ODIN | 65.42 | 75.10 | 72.78 | 65.50 | 69.69 | 75.09 | 85.80 | 41.45 | 77.51 | 60.73 | 74.24 | 63.57 |
| Energy | 75.97 | 60.00 | 83.53 | 47.07 | 73.16 | 68.95 | 91.59 | 29.73 | 83.25 | 49.06 | 81.50 | 50.96 |
| ReAct | 79.20 | 56.95 | 85.53 | 44.45 | 78.37 | 61.87 | 95.54 | 18.75 | 84.81 | 47.59 | 84.69 | 45.52 |
| KNN | 84.98 | 67.90 | 89.56 | 42.93 | 85.15 | 61.54 | 96.57 | 16.20 | 87.48 | 55.26 | 88.75 | 48.77 |
| Maha | 83.65 | 75.70 | 85.43 | 64.57 | 84.57 | 64.58 | 96.98 | 13.14 | 85.97 | 64.26 | 87.32 | 56.45 |
| rMaha | 84.42 | 70.40 | 86.89 | 58.79 | 85.28 | 63.09 | 97.61 | 10.62 | 87.41 | 59.11 | 88.32 | 52.40 |
| Maha++ | 86.42 | 63.70 | 89.26 | 48.26 | 85.75 | 64.11 | 98.76 | 5.15 | 88.90 | 53.47 | 89.82 | 46.94 |
| rMaha++ | 86.45 | 63.65 | 89.31 | 48.26 | 85.77 | 64.32 | 98.77 | 5.16 | 88.93 | 53.63 | 89.85 | 47.00 |
| X-Maha† | 93.76 | 29.80 | 96.65 | 11.70 | 92.04 | 37.78 | 99.40 | 2.26 | 89.64 | 46.00 | 94.30 | 25.51 |
| MM++ (Ours) | 88.76 | 54.45 | 97.00 | 14.52 | 87.05 | 55.58 | 98.43 | 5.71 | 89.64 | 49.66 | 92.18 | 35.98 |

Table 2: OOD detection performance using ViT-B/16 with ImageNet-LT as ID.

| Method | NINCO AUROC↑ | NINCO FPR95↓ | OpenImage-O AUROC↑ | OpenImage-O FPR95↓ | ImageNet-C AUROC↑ | ImageNet-C FPR95↓ | ImageNet-ES AUROC↑ | ImageNet-ES FPR95↓ | ImageNet-R AUROC↑ | ImageNet-R FPR95↓ | ImageNet-V2 AUROC↑ | ImageNet-V2 FPR95↓ | Average AUROC↑ | Average FPR95↓ |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| MSP | 83.28 | 61.85 | 89.23 | 44.89 | 71.21 | 73.94 | 69.35 | 70.01 | 77.63 | 66.23 | 58.73 | 90.14 | 74.24 | 67.84 |
| ODIN | 74.28 | 64.69 | 81.93 | 48.20 | 64.61 | 79.35 | 63.22 | 74.31 | 67.90 | 74.49 | 54.82 | 92.10 | 67.79 | 72.52 |
| Energy | 83.23 | 51.22 | 89.16 | 33.76 | 72.34 | 66.65 | 67.54 | 70.31 | 72.87 | 67.04 | 57.29 | 90.85 | 73.74 | 63.31 |
| ReAct | 85.62 | 48.58 | 92.76 | 26.74 | 73.55 | 65.85 | 69.57 | 67.84 | 78.19 | 60.11 | 58.65 | 89.91 | 76.39 | 59.84 |
| KNN | 40.44 | 96.00 | 50.34 | 95.06 | 57.73 | 88.45 | 55.88 | 91.43 | 65.61 | 84.17 | 49.80 | 94.34 | 53.3 | 91.91 |
| Maha | 85.77 | 70.01 | 93.54 | 38.54 | 70.94 | 85.00 | 69.53 | 84.53 | 86.23 | 51.85 | 58.29 | 89.76 | 77.38 | 69.95 |
| rMaha | 88.51 | 57.85 | 94.62 | 29.87 | 71.73 | 80.32 | 71.30 | 80.22 | 85.02 | 52.00 | 59.18 | 89.60 | 78.39 | 64.98 |
| Maha++ | 90.37 | 47.30 | 95.81 | 24.55 | 74.86 | 73.05 | 73.51 | 73.09 | 87.22 | 46.92 | 59.52 | 89.37 | 80.22 | 59.71 |
| rMaha++ | 90.48 | 44.33 | 95.47 | 25.24 | 72.49 | 74.97 | 71.90 | 72.93 | 85.93 | 49.18 | 59.08 | 89.34 | 79.22 | 59.00 |
| X-Maha† | 94.98 | 26.74 | 98.21 | 9.72 | 83.96 | 53.57 | 76.78 | 63.45 | 88.49 | 42.85 | 58.36 | 91.36 | 83.46 | 47.95 |
| MM++ (Ours) | 91.34 | 43.20 | 96.43 | 20.98 | 84.35 | 53.80 | 81.70 | 52.43 | 88.84 | 41.83 | 60.78 | 88.94 | 83.91 | 50.20 |

![Refer to caption](https://arxiv.org/html/2606.17352v1/x2.png)
Figure 2: Score distributions for ViT-B/16 across five OOD benchmarks with ImageNet-LT as ID.

Analysis of Score Distributions (ViT-B/16 and Swin-T). Figures [2](https://arxiv.org/html/2606.17352#S4.F2) and [3](https://arxiv.org/html/2606.17352#S4.F3) illustrate score distributions under the ImageNet-LT ID regime. Across both the ViT-B/16 and hierarchical Swin-T backbones, MM++ effectively handles distinct shift types. Under continuous covariate shifts (ImageNet-C, -ES), MM++ exhibits a pronounced long tail (extending near $-10$), demonstrating fine-grained sensitivity to degradation severity. Conversely, for pure semantic (NINCO, ImageNet-O) and severe stylistic shifts (ImageNet-R), MM++ forms a concentrated, isolated mass, effectively separating these conceptual anomalies from the ID distribution.

Comparatively, the single-layer Mahalanobis++ baseline produces tightly clustered scores with substantial ID overlap on covariate shifts. X-Mahalanobis incorporates multi-layer data but falls short of MM++’s sensitivity on ViT-B/16, likely due to its additive fusion, which aggregates independent layer scores rather than constructing a unified feature space. Furthermore, X-Mahalanobis is excluded from the Swin-T evaluation; its assumption of homogeneous feature dimensionality makes it incompatible with Swin-T’s stage-wise spatial compression and channel expansion.

Ultimately, this cross-architectural consistency highlights MM++’s core advantage: by leveraging top-$K$ information gating and constructing a unified representation space, the framework seamlessly captures multi-layer correlations across distinct attention paradigms without requiring architecture-specific dimensional adaptations.

![Refer to caption](https://arxiv.org/html/2606.17352v1/x3.png)
Figure 3: Score distributions for Swin-T across six OOD benchmarks with ImageNet-LT as ID.

Table 3: Impact of layer selection and fusion strategies in MM++ on OOD detection with ViT-B/16, evaluated on ImageNet-C (ImageNet-LT as ID).

| Group | Aggregation Strategy | AUROC↑ | FPR95↓ |
| --- | --- | --- | --- |
| Single-layer baselines | Penultimate layer (Mahalanobis) | 70.94 | 85.00 |
| Single-layer baselines | Penultimate layer + $\ell_2$-norm (Mahalanobis++) | 74.86 | 73.05 |
| Multi-layer fusion (prior / supervised) | Variance-based weighting + fine-tuning (X-Mahalanobis†) | 83.96 | 53.57 |
| MM++ design variants | (1) Top-$K$ layers + no penultimate anchoring + pseudo-inverse | 73.47 | 83.53 |
| MM++ design variants | (2) Top-$(K-1)$ layers + penultimate anchoring + pseudo-inverse | 76.09 | 71.30 |
| MM++ design variants | MM++ (Top-$(K-1)$ + penultimate + Ledoit–Wolf shrinkage) | 84.35 | 53.80 |

### 4.3 Ablation Studies

Table [3](https://arxiv.org/html/2606.17352#S4.T3) details the effect of progressive layer aggregation on ViT-B/16, evaluated on ImageNet-C (ImageNet-LT ID).

(1) Single-layer limitations. Relying solely on the penultimate layer (Mahalanobis) yields moderate performance, as a single representation fails to capture structural deviations critical for near-OOD detection. Incorporating $\ell_2$ normalization (Mahalanobis++) improves performance, highlighting the necessity of scale invariance in high-dimensional spaces.

(2) Multi-layer fusion improves separation. Extending beyond single representations significantly enhances detection. In particular, X-Mahalanobis achieves competitive results via variance-based weighting and fine-tuning. However, its heuristic aggregation introduces sensitivity to less informative layers and requires supervised calibration, limiting its practicality in strictly post-hoc settings.

(3) Anchoring the terminal representation. Selecting layers solely via entropy density drops without enforcing penultimate inclusion causes substantial degradation (particularly in FPR95) compared to the anchored MM++ variant. This shows the penultimate layer provides a necessary structural anchor; retaining it while selecting the top $K-1$ intermediate layers ensures stable discrimination.
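The anchored selection rule described here can be sketched as a few lines of Python. This is an illustrative sketch only: the function name, the dictionary layout, and the drop values are hypothetical, and returning the layers in depth order is an assumption about how the fused vector is assembled.

```python
def select_layers(delta, penultimate, k):
    """Keep the penultimate layer and add the k-1 intermediate layers
    with the largest entropy density drop."""
    if k < 1:
        raise ValueError("k must be at least 1")
    candidates = [layer for layer in delta if layer != penultimate]
    ranked = sorted(candidates, key=lambda layer: delta[layer], reverse=True)
    return sorted(ranked[: k - 1] + [penultimate])

# Hypothetical drops per block index (made-up values, not measured ones).
delta = {3: 0.02, 5: 0.11, 8: 0.31, 10: 0.07, 11: 0.05}
selected_k1 = select_layers(delta, penultimate=11, k=1)
selected_k2 = select_layers(delta, penultimate=11, k=2)
selected_k3 = select_layers(delta, penultimate=11, k=3)
```

With $K=1$ the rule degenerates to the single-layer penultimate baseline, which matches the role of the anchor in the ablation.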

![Refer to caption](https://arxiv.org/html/2606.17352v1/x4.png)
Figure 4: Sensitivity of MM++ to $K$.

(4) Robust precision estimation. While traditional methods utilize pseudo-inverse estimation for the tied precision matrix, MM++ adopts Ledoit–Wolf shrinkage to obtain a well-conditioned covariance estimate. This numerically stable foundation yields substantial empirical gains over the pseudo-inverse variant in Table [3](https://arxiv.org/html/2606.17352#S4.T3): AUROC rises by 8.26 percentage points (76.09% to 84.35%) and FPR95 falls by 17.50 percentage points (71.30% to 53.80%). Crucially, this effectiveness is intrinsically linked to our top-$K$ gating, which isolates semantically relevant signals for the shrinkage estimator (see Table [9](https://arxiv.org/html/2606.17352#A2.T9) in Appendix [B](https://arxiv.org/html/2606.17352#A2)).
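The conditioning argument can be illustrated numerically. The sketch below is an assumption-laden toy, not the paper's experiment: it uses random Gaussian data with fewer samples than dimensions, and a fixed shrinkage intensity `alpha` toward a scaled identity in place of the intensity that Ledoit–Wolf would estimate from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 200                           # fewer samples than fused dimensions
x = rng.normal(size=(n, d))
x -= x.mean(axis=0)
sample_cov = x.T @ x / n                 # rank at most n - 1, hence singular

def shrink(cov, alpha):
    """Convex combination of the sample covariance and a scaled identity."""
    mu = np.trace(cov) / cov.shape[0]
    return (1.0 - alpha) * cov + alpha * mu * np.eye(cov.shape[0])

rank = np.linalg.matrix_rank(sample_cov)
shrunk = shrink(sample_cov, alpha=0.2)   # alpha is hand-picked for illustration
min_eig = np.linalg.eigvalsh(shrunk).min()
cond_shrunk = np.linalg.cond(shrunk)
```

The sample covariance has many zero eigenvalues, so a pseudo-inverse simply ignores those directions, whereas the shrunk estimate has a strictly positive smallest eigenvalue and a finite condition number, so every direction of the fused space contributes to the distance.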

(5) Sensitivity to capacity parameter $K$. MM++ exhibits measured sensitivity to $K$, as shown in Figure [4](https://arxiv.org/html/2606.17352#S4.F4) (ViT-B/16, ImageNet-LT as ID). AUROC jumps significantly from $K=1$ to $K=2$, affirming that intermediate information crucially complements the terminal representation. However, performance fluctuates, dropping at $K=3$ before rising for larger $K$. Structurally, this non-monotonicity stems from information redundancy: subsequently ranked layers likely capture highly correlated features. At $K=3$, marginal information gains are temporarily outweighed by the statistical noise of expanding the joint dimensionality. While $K=5$ marginally outperforms $K=2$ on ImageNet-V2 by eventually accumulating orthogonal signals, the minor improvement does not justify the increased computational footprint. Ultimately, small $K$ values provide the best trade-off between performance and efficiency.

## 5 Conclusion

In this work, we introduce MM\+\+, a fully unsupervised framework for post\-hoc Out\-of\-Distribution \(OOD\) detection\. Our approach moves beyond reliance on terminal representations by demonstrating that the hierarchical trajectory of feature evolution contains valuable discriminative information\. The effectiveness of MM\+\+ stems from modeling hierarchical patterns without requiring heuristic weighting or auxiliary OOD datasets\. We propose a principled method that identifies layers where semantic compression is most pronounced via entropy density drops\. By fusing the anchored penultimate layer with these selected intermediate representations into a joint Mahalanobis\+\+ space, the framework captures cross\-layer correlations latent in single\-layer approaches\. This effectively addresses the dichotomy between the stability of terminal layers and the depth of multilayer architectures\. Ultimately, the consistent performance of MM\+\+ across representative Transformer and Convolutional models—specifically ViT, EVA02, Swin, and ConvNeXt—suggests that robust OOD detection can be achieved through generalizable principles rather than architecture\-specific tuning\. By grounding detection in information\-theoretic and geometric foundations, MM\+\+ offers a rigorous path toward more reliable deep learning systems\.

## References

- [1] (2024) NECO: neural collapse based out-of-distribution detection. In Proceedings of the International Conference on Learning Representations.
- [2] H. Anthony and K. Kamnitsas (2023) On the use of Mahalanobis distance for out-of-distribution detection with neural networks for medical imaging. In Proceedings of the International Workshop on Uncertainty for Safe Utilization of Machine Learning in Medical Imaging.
- [3] J. Bitterwolf, M. Müller, and M. Hein (2023) In or out? Fixing ImageNet out-of-distribution detection evaluation. In Proceedings of the International Conference on Machine Learning, pp. 2441–2472.
- [4] M. Cimpoi, S. Maji, I. Kokkinos, A. Vedaldi, and S. Soatto (2014) Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3606–3613.
- [5] A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly, J. Uszkoreit, and N. Houlsby (2021) An image is worth 16x16 words: transformers for image recognition at scale. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=YicbFdNTTy).
- [6] X. Du, Z. Wang, M. Cai, and S. Li (2022) VOS: learning what you don’t know by virtual outlier synthesis. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=TW7d65uYu5M).
- [7] Y. Fang, Q. Sun, X. Wang, T. Huang, X. Wang, and Y. Cao (2024) Eva-02: a visual representation for neon genesis. Image and Vision Computing 149, pp. 105171.
- [8] J. Haas, W. Yolland, and B. T. Rabus (2024) Exploring simple, high quality out-of-distribution detection with l2 normalization. Transactions on Machine Learning Research 2024.
- [9] M. Y. Harun, J. Gallardo, and C. Kanan (2025) Controlling neural collapse enhances out-of-distribution detection and transfer learning. arXiv preprint arXiv:2502.10691.
- [10] D. Hendrycks, S. Basart, N. Mu, S. Kadavath, F. Wang, E. Dorundo, R. Desai, T. Zhu, S. Samat, S. Pimentel, et al. (2021) The many faces of robustness: a critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8340–8349.
- [11] D. Hendrycks and T. Dietterich (2019) Benchmarking neural network robustness to common corruptions and perturbations. In International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=HJz6tiCqYm).
- [12] D. Hendrycks and K. Gimpel (2017) A baseline for detecting misclassified and out-of-distribution examples in neural networks. In Proceedings of the International Conference on Learning Representations.
- [13] D. Hendrycks, M. Mazeika, S. Kadavath, and D. Song (2019) Using self-supervised learning can improve model robustness and uncertainty. In Proceedings of the Advances in Neural Information Processing Systems.
- [14] D. Hendrycks, K. Zhao, S. Basart, J. Steinhardt, and D. Song (2021) Natural adversarial examples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 15262–15271.
- [15] D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt (2022) Pixmix: Dreamlike pictures comprehensively improve safety measures. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16783–16792.
- [16] C. Hofmann, C. Huber, B. Lehner, D. Klotz, S. Hochreiter, and W. Zellinger (2026) AP-OOD: Attention Pooling for Out-of-Distribution Detection. arXiv preprint arXiv:2602.06031.
- [17] D. Janiak, J. Binkowski, and T. Kajdanowicz (2025) A geometry-based view of mahalanobis OOD detection. arXiv preprint arXiv:2510.15202.
- [18] O. Ledoit and M. Wolf (2004) A well-conditioned estimator for large-dimensional covariance matrices. Journal of multivariate analysis 88(2), pp. 365–411.
- [19] K. Lee, K. Lee, H. Lee, and J. Shin (2018) A simple unified framework for detecting out-of-distribution samples and adversarial attacks. In Proceedings of the Advances in Neural Information Processing Systems (NeurIPS), Vol. 31.
- [20] S. Liang, Y. Li, and R. Srikant (2018) Enhancing the reliability of out-of-distribution image detection in neural networks. In Proceedings of the International Conference on Learning Representations.
- [21] Y. Lin, W. Ding, S. Qiang, L. Deng, and G. Li (2021) ES-imagenet: a million event-stream classification dataset for spiking neural networks. Frontiers in Neuroscience 15. External Links: [Document](https://dx.doi.org/10.3389/fnins.2021.726582).
- [22] Z. Lin, S. Roy, and Y. Li (2021) Mood: multi-level out-of-distribution detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 15313–15323.
- [23] W. Liu, X. Wang, J. Owens, and Y. Li (2020) Energy-based out-of-distribution detection. In Proceedings of the Advances in Neural Information Processing Systems.
- [24] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo (2021) Swin transformer: hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 10012–10022.
- [25] Z. Liu, H. Mao, C. Wu, C. Feichtenhofer, T. Darrell, and S. Xie (2022) A convnet for the 2020s. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 11976–11986.
- [26] Z. Liu, Z. Miao, X. Zhan, J. Wang, B. Gong, and S. X. Yu (2019) Large-scale long-tailed recognition in an open world. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2537–2546.
- [27] H. Lu, D. Gong, S. Wang, J. Xue, L. Yao, and K. Moore (2024) Learning with mixture of prototypes for out-of-distribution detection. arXiv preprint arXiv:2402.02653.
- [28] S. Lu, Y. Wang, L. Sheng, L. He, A. Zheng, and J. Liang (2025) Out-of-distribution detection: A task-oriented survey of recent advances. ACM Computing Surveys 58(2), pp. 1–39.
- [29] A. F. Martins and R. F. Astudillo (2016) From softmax to sparsemax: A sparse model of attention and multi-label classification. In International Conference on Machine Learning, pp. 1614–1623.
- [30] Y. Ming, Y. Sun, O. Dia, and Y. Li (2023) How to exploit hyperspherical embeddings for out-of-distribution detection?. In Proceedings of the International Conference on Learning Representations.
- [31] M. Müller and M. Hein (2025) Mahalanobis++: Improving OOD detection via feature normalization. In Proceedings of the International Conference on Machine Learning.
- [32] V. Papyan, X. Han, and D. L. Donoho (2020) Prevalence of neural collapse during the terminal phase of deep learning training. Proceedings of the National Academy of Sciences 117(40), pp. 24652–24663.
- [33] B. Recht, R. Roelofs, L. Schmidt, and V. Shankar (2019) Do imagenet classifiers generalize to imagenet?. In International Conference on Machine Learning (ICML).
- [34] S. Regmi, B. Panthi, S. Dotel, P. K. Gyawali, D. Stoyanov, and B. Bhattarai (2024) T2fnorm: Train-time feature normalization for OOD detection in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 153–162.
- [35] J. Ren, S. Fort, J. Liu, A. G. Roy, S. Padhy, and B. Lakshminarayanan (2021) A simple fix to Mahalanobis distance for improving near-OOD detection. arXiv preprint arXiv:2106.09022.
- [36] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei (2015) Imagenet large scale visual recognition challenge. International Journal of Computer Vision 115(3), pp. 211–252.
- [37] V. Sehwag, M. Chiang, and P. Mittal (2021) SSD: A unified framework for self-supervised outlier detection. In Proceedings of the International Conference on Learning Representations.
- [38] R. Shwartz-Ziv and N. Tishby (2017) Opening the black box of deep neural networks via information. arXiv preprint arXiv:1703.00810.
- [39] Y. Sun, C. Guo, and Y. Li (2021) ReAct: Out-of-distribution detection with rectified activations. In Proceedings of the Advances in Neural Information Processing Systems, Vol. 34, pp. 144–157.
- [40] Y. Sun, Y. Ming, X. Zhu, and Y. Li (2022) Out-of-distribution detection with deep nearest neighbors. In Proceedings of the International Conference on Machine Learning, pp. 20827–20840.
- [41] L. Tao, X. Du, J. Zhu, and Y. Li (2023) Non-parametric outlier synthesis. In The Eleventh International Conference on Learning Representations. External Links: [Link](https://openreview.net/forum?id=JHklpEZqduQ).
- [42] N. Tishby and N. Zaslavsky (2015) Deep learning and the information bottleneck principle. In Proceedings of the IEEE Information Theory Workshop.
- [43] G. Van Horn, O. Mac Aodha, Y. Song, Y. Cui, C. Sun, A. Shepard, H. Adam, P. Perona, and S. Belongie (2018) The inaturalist species classification and detection dataset. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 8769–8778. External Links: [Document](https://dx.doi.org/10.1109/cvpr.2018.00914).
- [44] J. Wei and G. Wang (2023) Fine-grained out-of-distribution detection of medical images using combination of feature uncertainty and mahalanobis distance. In Proceedings of the IEEE International Symposium on Biomedical Imaging, pp. 1–5.
- [45] T. Wei, B. Wang, J. Shi, Y. Li, and M. Zhang (2025) X-Mahalanobis: Transformer feature mixing for reliable OOD detection. In Proceedings of the Annual Conference on Neural Information Processing Systems.
- [46] J. Xiao, J. Hays, B. C. Russell, G. Patterson, K. A. Ehinger, A. Torralba, and A. Oliva (2013) Basic level scene understanding: categories, attributes and structures. Frontiers in Psychology 4. External Links: [Document](https://dx.doi.org/10.3389/fpsyg.2013.00506).
- [47] J. Yang, P. Wang, D. Zou, Z. Zhou, K. Ding, W. Peng, H. Wang, G. Chen, B. Li, Y. Sun, et al. (2022) Openood: benchmarking generalized out-of-distribution detection. Advances in Neural Information Processing Systems 35, pp. 32598–32611.
- [48] J. Yang, K. Zhou, Y. Li, and Z. Liu (2024) Generalized out-of-distribution detection: A survey. International Journal of Computer Vision 132(12), pp. 5635–5662.
- [49] B. Zhou, A. Lapedriza, A. Khosla, A. Oliva, and A. Torralba (2017) Places: a 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 40(6), pp. 1452–1464.

## Appendix A Theoretical Justification of MM++

The effectiveness of MM\+\+ for out\-of\-distribution \(OOD\) detection is supported by three complementary principles: \(i\) entropy\-based layer selection, \(ii\) joint precision modeling across layers, and \(iii\) stable covariance estimation via shrinkage in high\-dimensional fused spaces\.

#### 1\. Entropy\-Based Layer Selection and Neural Collapse\.

The selection of informative layers via the entropy density drop $\Delta_l$ (Eq. [5](https://arxiv.org/html/2606.17352#S3.E5)) is motivated by the geometric properties of neural collapse \[[32](https://arxiv.org/html/2606.17352#bib.bib20)\]. As representations propagate through a deep network, within-class variability transitions from high-dimensional, distributed features in early layers to low-dimensional, highly structured embeddings in later layers.

Because our representations are $\ell_2$-normalized onto the unit hypersphere, the within-class covariance entropy $H_l$ acts as a direct proxy for the active intrinsic dimensionality of the ID data manifold. A sharp increase in $\Delta_l$ identifies the precise network depth where non-discriminative variability is aggressively suppressed, projecting ID samples onto a tightly constrained subspace. By selecting layers with maximal $\Delta_l$, MM++ isolates the boundaries of strongest semantic compression. OOD samples, lacking the semantic structure of ID classes, are not projected onto this same low-dimensional manifold, resulting in a measurable orthogonal deviation that the Mahalanobis distance subsequently penalizes.
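The link between covariance entropy and intrinsic dimensionality can be made concrete with a toy computation. The sketch below is not Eq. 5: it assumes, purely for illustration, that the entropy is the Shannon entropy of the normalized eigenvalue spectrum of the pooled within-class covariance, and it uses synthetic features in place of real layer activations. The function name and both synthetic "layers" are hypothetical.

```python
import numpy as np

def spectral_entropy(features, labels):
    """Shannon entropy of the normalized eigenvalue spectrum of the pooled
    within-class covariance (an ASSUMED stand-in for H_l, not Eq. 5)."""
    feats = features / np.linalg.norm(features, axis=1, keepdims=True)
    centered = np.concatenate(
        [feats[labels == c] - feats[labels == c].mean(axis=0)
         for c in np.unique(labels)]
    )
    eig = np.linalg.eigvalsh(centered.T @ centered / len(centered))
    eig = np.clip(eig, 0.0, None)            # remove tiny negative round-off
    p = eig / eig.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

rng = np.random.default_rng(0)
labels = np.arange(400) % 4
centers = 5.0 * rng.normal(size=(4, 32))

# "Early" layer: within-class noise spread over all 32 directions.
wide = centers[labels] + rng.normal(size=(400, 32))
# "Late" layer: within-class noise confined to 2 directions.
noise = np.zeros((400, 32))
noise[:, :2] = rng.normal(size=(400, 2))
narrow = centers[labels] + noise

h_wide = spectral_entropy(wide, labels)
h_narrow = spectral_entropy(narrow, labels)
```

Under this assumed definition, `np.exp(h)` behaves like an effective number of active dimensions, so a large decrease in entropy between consecutive layers signals the kind of compression the text describes.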

#### 2\. Joint Precision and Cross\-Layer Consistency\.

Multi-layer OOD detectors such as \[[45](https://arxiv.org/html/2606.17352#bib.bib51), [19](https://arxiv.org/html/2606.17352#bib.bib56)\] typically rely on additive fusion, which implicitly assumes independence across layers. This corresponds to approximating the joint covariance as a block-diagonal matrix, i.e., $\Sigma_{ll'} = \mathbf{0}$ for $l \neq l'$, thereby ignoring cross-layer dependencies.

Unlike methods that compute independent OOD scores for each layer and aggregate them post-hoc, MM++ computes a single Mahalanobis++ distance in a unified representation space. We estimate a joint precision matrix $\hat{\Sigma}_{\mathcal{K}}^{-1}$ across the entire set of selected layers $\mathcal{K}$. By defining $\phi_c(x)$ as the concatenated, $\ell_2$-normalized feature residual vector across all $l \in \mathcal{K}$, the class-conditional score is expressed as a single quadratic form:

𝒮c\(x\)=−ϕc\(x\)⊤Σ^𝒦−1ϕc\(x\)\.\\mathcal\{S\}\_\{c\}\(x\)=\-\\phi\_\{c\}\(x\)^\{\\top\}\\hat\{\\Sigma\}\_\{\\mathcal\{K\}\}^\{\-1\}\\phi\_\{c\}\(x\)\.\(A\.1\)
To illustrate why this differs fundamentally from a layer\-wise calculation, we partitionΣ^𝒦−1\\hat\{\\Sigma\}\_\{\\mathcal\{K\}\}^\{\-1\}into block components\. The score expands as:

𝒮c\(x\)=∑l∈𝒦𝒮c,l\(x\)⏟Intra\-layer terms\+∑l∈𝒦∑l′≠l𝒮c\(l,l′\)\(x\)⏟Inter\-layer interactions,\\mathcal\{S\}\_\{c\}\(x\)=\\underbrace\{\\sum\_\{l\\in\\mathcal\{K\}\}\\mathcal\{S\}\_\{c,l\}\(x\)\}\_\{\\text\{Intra\-layer terms\}\}\+\\underbrace\{\\sum\_\{l\\in\\mathcal\{K\}\}\\sum\_\{l^\{\\prime\}\\neq l\}\\mathcal\{S\}\_\{c\}^\{\(l,l^\{\\prime\}\)\}\(x\)\}\_\{\\text\{Inter\-layer interactions\}\},\(A\.2\)where𝒮c,l\(x\)=−ϕc,l\(x\)⊤Σ^𝒦,ll−1ϕc,l\(x\)\\mathcal\{S\}\_\{c,l\}\(x\)=\-\\phi\_\{c,l\}\(x\)^\{\\top\}\\hat\{\\Sigma\}^\{\-1\}\_\{\\mathcal\{K\},ll\}\\phi\_\{c,l\}\(x\)evaluates the marginal fit within layerll\. Crucially, the cross\-layer term𝒮c\(l,l′\)\(x\)\\mathcal\{S\}\_\{c\}^\{\(l,l^\{\\prime\}\)\}\(x\)is governed by the off\-diagonal blocks:

𝒮c\(l,l′\)\(x\)=−ϕc,l\(x\)⊤Σ^𝒦,ll′−1ϕc,l′\(x\)\.\\mathcal\{S\}\_\{c\}^\{\(l,l^\{\\prime\}\)\}\(x\)=\-\\phi\_\{c,l\}\(x\)^\{\\top\}\\hat\{\\Sigma\}\_\{\\mathcal\{K\},ll^\{\\prime\}\}^\{\-1\}\\phi\_\{c,l^\{\\prime\}\}\(x\)\.\(A\.3\)
Mathematically, these off\-diagonal precision blocks act as conditional expectation penalties\. Under a joint Gaussian assumption, the optimal prediction of a feature at layerl′l^\{\\prime\}given the feature at layerlldepends heavily on their cross\-covariance\. The second term in Eq\.[A\.2](https://arxiv.org/html/2606.17352#A1.E2)explicitly penalizes deviations from the learned ID feature trajectory𝔼\[ϕc,l′\|ϕc,l\]\\mathbb\{E\}\[\\phi\_\{c,l^\{\\prime\}\}\|\\phi\_\{c,l\}\]\. Consequently, MM\+\+ detects sophisticated OOD samples that might perfectly match the marginal statistics of layerlland layerl′l^\{\\prime\}independently, but violate the expected cross\-layer evolutionary dynamics dictated by the ID data\.
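The block expansion in Eq. A.2 can be checked numerically. The sketch below uses two hypothetical layers of widths 3 and 2 with synthetic, correlated residuals (none of this comes from the paper's data); it verifies that the joint quadratic form equals the intra-layer plus inter-layer terms, and that it differs from an additive score built from per-layer covariances.

```python
import numpy as np

rng = np.random.default_rng(0)
d1, d2, n = 3, 2, 500  # hypothetical layer widths and sample count

# Synthetic residuals: a shared component induces cross-layer correlation.
shared = rng.normal(size=(n, 1))
phi_l1 = rng.normal(size=(n, d1)) + shared
phi_l2 = rng.normal(size=(n, d2)) + shared
phi = np.hstack([phi_l1, phi_l2])

cov = phi.T @ phi / n            # joint covariance of the residuals
precision = np.linalg.inv(cov)   # joint precision matrix

# Blocks of the joint precision matrix (not inverses of covariance blocks).
p11, p12 = precision[:d1, :d1], precision[:d1, d1:]
p21, p22 = precision[d1:, :d1], precision[d1:, d1:]

x = phi[0]
x1, x2 = x[:d1], x[d1:]

joint_score = -(x @ precision @ x)                 # Eq. A.1
intra = -(x1 @ p11 @ x1) - (x2 @ p22 @ x2)         # first sum in Eq. A.2
inter = -(x1 @ p12 @ x2) - (x2 @ p21 @ x1)         # second sum in Eq. A.2

# Additive fusion: each layer scored with its own covariance block only.
additive_score = (-(x1 @ np.linalg.inv(cov[:d1, :d1]) @ x1)
                  - (x2 @ np.linalg.inv(cov[d1:, d1:]) @ x2))

print(joint_score, intra + inter, additive_score)
```

When the layers are correlated, the inter-layer term is non-zero and the joint score departs from the additive one, which is the difference the paragraph above describes.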

#### 3\. Well\-Conditioned Estimation via Ledoit–Wolf Shrinkage\.

While concatenated fusion unlocks cross-layer modeling, it significantly exacerbates the dimensionality problem. Let $D_{\mathcal{K}} = \sum_{l \in \mathcal{K}} D_l$ be the dimension of the joint space. When $D_{\mathcal{K}} \gg N_c$, the empirical joint covariance $\hat{\Sigma}_{\mathcal{K}}$ becomes highly ill-conditioned or strictly rank-deficient (possessing zero-valued eigenvalues), meaning the raw inverse $\hat{\Sigma}_{\mathcal{K}}^{-1}$ does not exist. Even when invertible, the empirical estimator systematically overestimates dominant eigenvalues and severely underestimates small eigenvalues. In the Mahalanobis distance computation, directions corresponding to these underestimated small eigenvalues are inverted, yielding an unstable, near-infinite penalty for negligible noise.

To guarantee a theoretically stable and positive\-definite precision matrix, MM\+\+ applies Ledoit–Wolf shrinkage \(Eq\.[3](https://arxiv.org/html/2606.17352#S3.E3)\) directly to the fused space:

$$\hat{\Sigma}_{\mathcal{K},\text{shrink}} = (1-\gamma)\,\hat{\Sigma}_{\mathcal{K}} + \gamma \left(\frac{\operatorname{Tr}(\hat{\Sigma}_{\mathcal{K}})}{D_{\mathcal{K}}}\right) \mathbf{I}, \tag{A.4}$$

where $\gamma \in (0,1)$ is the analytically optimal shrinkage intensity.

This operation rigidly bounds the eigenvalue spectrum of the resulting precision matrix. Let $\lambda_i \geq 0$ be an eigenvalue of the empirical $\hat{\Sigma}_{\mathcal{K}}$. The corresponding eigenvalue of the precision matrix $\hat{\Sigma}_{\mathcal{K},\text{shrink}}^{-1}$ becomes:

$$\tilde{\lambda}_i^{-1} = \frac{1}{(1-\gamma)\lambda_i + \gamma \frac{\operatorname{Tr}(\hat{\Sigma}_{\mathcal{K}})}{D_{\mathcal{K}}}}. \tag{A.5}$$

As $\lambda_i \to 0$, the precision eigenvalue is safely bounded by a maximum theoretical value of $D_{\mathcal{K}} / (\gamma \operatorname{Tr}(\hat{\Sigma}_{\mathcal{K}}))$. Thus, shrinkage acts as an optimal, data-driven Tikhonov regularizer. It suppresses spurious responses to high-frequency noise in the concatenated null-space, ensuring that the OOD score $\mathcal{S}_c(x)$ is driven by meaningful semantic deviations rather than numerical instability.
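The eigenvalue bound in Eq. A.5 can be checked on a rank-deficient toy covariance. This is a minimal sketch with synthetic residuals and a fixed, hypothetical shrinkage intensity `gamma = 0.2`; the Ledoit–Wolf estimator would instead compute $\gamma$ analytically from the data.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, dim = 50, 200  # fewer samples than dimensions

residuals = rng.normal(size=(n_samples, dim))
emp_cov = residuals.T @ residuals / n_samples  # rank is at most n_samples
rank = np.linalg.matrix_rank(emp_cov)

gamma = 0.2                        # hypothetical value, not the analytic optimum
mu = np.trace(emp_cov) / dim       # average eigenvalue, the shrinkage target scale
shrunk_cov = (1 - gamma) * emp_cov + gamma * mu * np.eye(dim)  # Eq. A.4

emp_eigs = np.linalg.eigvalsh(emp_cov)
shrunk_eigs = np.linalg.eigvalsh(shrunk_cov)
precision_eigs = 1.0 / shrunk_eigs                 # Eq. A.5
bound = dim / (gamma * np.trace(emp_cov))          # limit as lambda_i -> 0

print(f"rank of empirical covariance: {rank} of {dim}")
print(f"largest precision eigenvalue: {precision_eigs.max():.4f} (bound {bound:.4f})")
```

The empirical covariance is singular here, yet every eigenvalue of the shrunk matrix is at least $\gamma \operatorname{Tr}(\hat{\Sigma}_{\mathcal{K}})/D_{\mathcal{K}}$, so the precision spectrum stays below the stated bound.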

#### Summary\.

Entropy-based layer selection identifies the geometric boundaries of semantic compression, joint precision modeling captures conditional cross-layer dependencies, and analytically derived shrinkage ensures the numerical stability of the high-dimensional fused space. Together, these components underpin the improved separability of MM++.

## Appendix B Additional Results

### B.1 AUROC Visualization

To provide a holistic view of OOD detection performance across diverse distribution shifts, Figure [5](https://arxiv.org/html/2606.17352#A2.F5)(a) and (b) visualize the AUROC scores of MM++ against a comprehensive suite of baseline methods using the ViT-B/16 architecture.

![Refer to caption](https://arxiv.org/html/2606.17352v1/x5.png) (a) ID dataset: ImageNet-1K, Model: ViT-B/16
![Refer to caption](https://arxiv.org/html/2606.17352v1/x6.png) (b) ID dataset: ImageNet-LT, Model: ViT-B/16

Figure 5: AUROC comparison across datasets, visualizing the results in Tables [1](https://arxiv.org/html/2606.17352#S4.T1) and [2](https://arxiv.org/html/2606.17352#S4.T2).

Performance on Balanced ID (ImageNet-1K). Figure [5](https://arxiv.org/html/2606.17352#A2.F5)(a) evaluates robustness under the standard ImageNet-1K regime. X-Mahalanobis consistently forms the outermost perimeter of the radar plot, demonstrating its effectiveness for a balanced ID dataset. It outperforms the other approaches particularly on far-OOD benchmarks, likely due to its fine-tuning. MM++ is competitive on far-OOD datasets while achieving the best AUROC on Texture and SUN. This suggests that MM++ is a competitive alternative that remains fully post-hoc and unsupervised.

Robustness under Class Imbalance (ImageNet-LT). The advantages of MM++ are even more pronounced under the challenging long-tailed regime, as illustrated in Figure [5](https://arxiv.org/html/2606.17352#A2.F5)(b). While traditional distance-based methods suffer from unreliable statistic estimation on data-poor tail classes, MM++ remains much more stable. It establishes a dominant perimeter across all evaluated near-OOD benchmarks. This sustained resilience highlights the efficacy of combining top-$K$ feature gating with shrinkage-based covariance calibration in the fused space, effectively mitigating the geometric vulnerabilities associated with underrepresented tail classes.

Collectively, these visualizations demonstrate that unlike single\-layer baselines or additive multi\-layer approaches \(e\.g\., X\-Mahalanobis\), MM\+\+ avoids dataset\-specific failure modes, establishing a universally robust framework for varied deployment conditions\.

### B.2 Analysis of MM++ Sensitivity to $K$

Figure [6](https://arxiv.org/html/2606.17352#A2.F6) illustrates MM++'s sensitivity to $K$ with ImageNet-LT as the ID dataset. As discussed in Section [4](https://arxiv.org/html/2606.17352#S4), peak AUROC is typically achieved at a small $K$. AUROC then drops before rising again at large $K$. Nonetheless, the AUROC at larger $K$ values remains below the peak across diverse OOD datasets. A small $K$ (e.g., $K=2$) therefore provides the best trade-off between performance and complexity, consistent with the ablation results discussed in Section [4](https://arxiv.org/html/2606.17352#S4).

![Refer to caption](https://arxiv.org/html/2606.17352v1/x7.png) (a) ConvNeXt-T
![Refer to caption](https://arxiv.org/html/2606.17352v1/x8.png) (b) Swin-T

Figure 6: AUROC of MM++ as a function of the number of fused layers $K \in \{1, \dots, k_{max}\}$ for each model, where $K=1$ corresponds to using only the penultimate layer.

### B.3 EVA02-S14 Results

Table [4](https://arxiv.org/html/2606.17352#A2.T4) presents results on ImageNet-LT using the EVA02-S14 backbone (eva02\_small\_patch14\_336.mim\_in22k\_ft\_in1k). As expected, confidence-based methods degrade under long-tailed distributions, while Mahalanobis-based approaches provide stronger baselines.

A clear distinction emerges between near\- and far\-OOD settings\. Single\-layer methods and X\-Mahalanobis perform well on far\-OOD datasets, where large semantic shifts are present\. In contrast, MM\+\+ consistently achieves the best performance on near\-OOD benchmarks \(ImageNet\-C, ImageNet\-ES, ImageNet\-R, and ImageNet\-V2\)\.

This behavior highlights the key advantage of MM\+\+: near\-OOD samples induce subtle, distributed deviations across multiple layers, which cannot be captured by single\-layer representations or independent score aggregation\. By modeling cross\-layer dependencies, MM\+\+ provides a more sensitive and robust detection mechanism for realistic distribution shifts\.

### B.4 ConvNeXt-T Results

Table [5](https://arxiv.org/html/2606.17352#A2.T5) and Figure [7](https://arxiv.org/html/2606.17352#A2.F7) show the results and visualize the score distributions for the ConvNeXt-T backbone (convnext\_tiny.fb\_in1k). A similar trend persists across architectures. MM++ outperforms the baselines on the near-OOD datasets, with the exception of ImageNet-V2, where MSP and ReAct score slightly higher, and achieves the best average performance.

Unlike ViT and EVA02\-S14, X\-Mahalanobis is not applicable in this setting due to its reliance on homogeneous transformer features and adaptation\-based design\. This highlights an important advantage of MM\+\+: it operates purely in a post\-hoc manner and naturally extends to heterogeneous architectures without requiring model modification\.

Overall, the results demonstrate that multi\-layer fusion enables MM\+\+ to capture both low\-level covariate shifts and high\-level semantic inconsistencies, leading to improved robustness under long\-tailed conditions\.

Table 4: OOD detection performance using EVA02-S14 with ImageNet-LT as ID.

| Method | NINCO AUROC↑ | NINCO FPR95↓ | OpenImage-O AUROC↑ | OpenImage-O FPR95↓ | ImageNet-C AUROC↑ | ImageNet-C FPR95↓ | ImageNet-ES AUROC↑ | ImageNet-ES FPR95↓ | ImageNet-R AUROC↑ | ImageNet-R FPR95↓ | ImageNet-V2 AUROC↑ | ImageNet-V2 FPR95↓ | Average AUROC↑ | Average FPR95↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSP | 80.99 | 62.28 | 89.49 | 41.19 | 70.32 | 72.77 | 67.84 | 70.91 | 77.35 | 61.66 | 57.82 | 89.82 | 73.97 | 66.44 |
| ODIN | 71.31 | 72.80 | 82.24 | 52.54 | 65.38 | 78.96 | 62.65 | 80.21 | 70.52 | 70.99 | 55.47 | 91.79 | 67.93 | 74.55 |
| Energy | 75.62 | 62.35 | 86.28 | 39.72 | 66.54 | 74.06 | 65.32 | 73.25 | 74.11 | 58.17 | 55.87 | 90.43 | 70.62 | 66.33 |
| ReAct | 76.17 | 62.35 | 86.75 | 39.01 | 66.34 | 74.44 | 65.29 | 73.45 | 74.42 | 57.91 | 55.94 | 90.37 | 70.82 | 66.25 |
| KNN | 72.75 | 80.39 | 87.05 | 53.32 | 61.06 | 90.85 | 58.05 | 92.50 | 75.06 | 64.85 | 53.92 | 92.32 | 67.98 | 79.04 |
| Maha | 87.95 | 51.89 | 95.92 | 21.36 | 74.25 | 68.46 | 70.52 | 69.66 | 85.13 | 48.55 | 58.16 | 89.27 | 78.65 | 58.20 |
| rMaha | 88.28 | 50.94 | 95.54 | 22.96 | 73.90 | 69.21 | 70.35 | 68.08 | 84.15 | 51.24 | 58.44 | 89.27 | 78.44 | 58.62 |
| Maha++ | 88.42 | 50.29 | 96.08 | 20.35 | 75.20 | 65.82 | 71.42 | 65.45 | 85.39 | 47.66 | 58.35 | 89.24 | 79.15 | 56.47 |
| rMaha++ | 88.42 | 50.29 | 96.08 | 20.35 | 75.20 | 65.81 | 71.42 | 65.45 | 85.39 | 47.66 | 58.36 | 89.24 | 79.15 | 56.47 |
| X-Maha† | 88.71 | 51.75 | 95.50 | 24.17 | 76.70 | 65.17 | 74.30 | 64.14 | 87.15 | 46.55 | 58.12 | 90.09 | 80.08 | 56.98 |
| MM++ (Ours) | 82.06 | 72.87 | 87.49 | 66.96 | 89.25 | 42.51 | 88.24 | 47.31 | 90.31 | 42.91 | 65.68 | 85.62 | 83.84 | 59.70 |

Table 5: OOD detection performance using ConvNeXt-T with ImageNet-LT as ID.

| Method | NINCO AUROC↑ | NINCO FPR95↓ | OpenImage-O AUROC↑ | OpenImage-O FPR95↓ | ImageNet-C AUROC↑ | ImageNet-C FPR95↓ | ImageNet-R AUROC↑ | ImageNet-R FPR95↓ | ImageNet-ES AUROC↑ | ImageNet-ES FPR95↓ | ImageNet-V2 AUROC↑ | ImageNet-V2 FPR95↓ | Average AUROC↑ | Average FPR95↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSP | 80.68 | 65.45 | 82.83 | 61.63 | 78.65 | 60.80 | 77.58 | 67.02 | 74.36 | 65.06 | 61.53 | 88.49 | 75.94 | 68.08 |
| ODIN | 61.74 | 88.38 | 58.64 | 89.85 | 69.32 | 80.01 | 55.07 | 92.33 | 64.60 | 78.82 | 59.71 | 90.68 | 61.51 | 86.68 |
| Energy | 63.71 | 87.11 | 57.11 | 92.83 | 70.11 | 80.46 | 57.71 | 91.89 | 64.98 | 76.09 | 58.86 | 89.73 | 62.08 | 86.35 |
| ReAct | 68.02 | 85.23 | 64.45 | 90.72 | 73.33 | 77.46 | 64.86 | 89.88 | 67.80 | 75.12 | 60.06 | 89.37 | 66.42 | 84.63 |
| KNN | 34.64 | 99.07 | 44.80 | 97.76 | 59.54 | 84.86 | 65.56 | 91.85 | 52.80 | 94.42 | 46.21 | 96.34 | 50.59 | 94.05 |
| Maha | 82.11 | 74.49 | 91.57 | 48.96 | 75.59 | 74.85 | 87.03 | 51.92 | 70.13 | 82.78 | 57.61 | 91.22 | 77.34 | 70.70 |
| rMaha | 85.43 | 66.01 | 92.61 | 42.71 | 77.74 | 69.75 | 85.55 | 52.62 | 73.51 | 79.57 | 59.40 | 90.43 | 79.04 | 66.85 |
| Maha++ | 86.62 | 56.49 | 93.12 | 39.51 | 79.27 | 64.26 | 87.32 | 52.25 | 74.33 | 78.41 | 59.29 | 90.32 | 79.99 | 63.54 |
| rMaha++ | 86.68 | 56.45 | 93.14 | 39.47 | 79.24 | 64.32 | 87.29 | 52.37 | 74.35 | 78.48 | 59.32 | 90.32 | 80.00 | 63.57 |
| MM++ (Ours) | 86.97 | 55.61 | 93.40 | 37.69 | 82.52 | 56.39 | 88.08 | 49.12 | 77.98 | 69.36 | 59.70 | 90.33 | 81.44 | 59.75 |

Table 6: OOD detection performance using Swin-T with ImageNet-LT as ID.

| Method | NINCO AUROC↑ | NINCO FPR95↓ | OpenImage-O AUROC↑ | OpenImage-O FPR95↓ | ImageNet-C AUROC↑ | ImageNet-C FPR95↓ | ImageNet-R AUROC↑ | ImageNet-R FPR95↓ | ImageNet-ES AUROC↑ | ImageNet-ES FPR95↓ | ImageNet-V2 AUROC↑ | ImageNet-V2 FPR95↓ | Average AUROC↑ | Average FPR95↓ |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| MSP | 80.30 | 70.52 | 85.68 | 60.21 | 76.19 | 68.36 | 80.14 | 63.92 | 76.17 | 64.88 | 59.86 | 90.11 | 76.39 | 69.67 |
| ODIN | 67.68 | 85.41 | 72.12 | 81.28 | 62.81 | 85.08 | 72.10 | 78.45 | 63.80 | 80.89 | 56.70 | 92.02 | 65.87 | 83.86 |
| Energy | 77.10 | 72.42 | 79.08 | 67.98 | 75.27 | 65.33 | 76.95 | 66.59 | 76.74 | 59.48 | 59.77 | 90.18 | 74.15 | 70.33 |
| ReAct | 78.62 | 72.22 | 82.93 | 64.08 | 76.46 | 64.10 | 80.74 | 60.12 | 77.53 | 59.39 | 60.40 | 89.88 | 76.11 | 68.30 |
| KNN | 43.16 | 95.30 | 54.11 | 89.41 | 66.16 | 74.15 | 77.77 | 65.05 | 61.98 | 82.33 | 50.56 | 93.93 | 58.96 | 83.36 |
| Maha | 80.46 | 79.76 | 90.47 | 54.30 | 74.54 | 77.28 | 87.10 | 52.10 | 67.36 | 88.58 | 57.86 | 91.30 | 76.30 | 73.89 |
| rMaha | 83.59 | 70.50 | 91.87 | 43.74 | 76.37 | 70.69 | 85.45 | 52.49 | 73.12 | 83.09 | 58.56 | 90.49 | 78.16 | 68.50 |
| Maha++ | 85.87 | 57.79 | 93.14 | 36.86 | 79.60 | 58.98 | 88.32 | 45.72 | 76.63 | 63.90 | 59.13 | 89.92 | 80.45 | 58.86 |
| rMaha++ | 86.32 | 56.21 | 93.15 | 37.03 | 80.14 | 57.97 | 88.29 | 46.65 | 77.38 | 62.67 | 59.33 | 89.95 | 80.77 | 58.41 |
| MM++ (Ours) | 85.55 | 60.07 | 92.44 | 43.12 | 86.26 | 43.56 | 89.29 | 42.91 | 86.25 | 41.09 | 60.67 | 89.61 | 83.41 | 53.39 |

![Refer to caption](https://arxiv.org/html/2606.17352v1/x9.png)

Figure 7: Score distributions for ConvNeXt-T across six OOD benchmarks with ImageNet-LT as ID.

### B.5 Swin-T Results

Table [6](https://arxiv.org/html/2606.17352#A2.T6) reports results using the Swin-T backbone (swin\_tiny\_patch4\_window7\_224.ms\_in1k). The observations remain consistent with the previous settings.

MM\+\+ achieves the strongest performance on all near\-OOD benchmarks, while remaining competitive on far\-OOD datasets\. In contrast, single\-layer Mahalanobis variants show stronger performance on large semantic shifts but struggle to capture subtle distribution changes\.

Similar to ConvNeXt\-T, X\-Mahalanobis is not applicable to Swin\-T, further emphasizing the architectural flexibility of MM\+\+\. These results confirm that modeling cross\-layer interactions is particularly beneficial for detecting structured covariate shifts in modern hierarchical architectures\.

Table 7: AUROC on Near-OOD datasets averaged over ImageNet-V2, ImageNet-C, ImageNet-R, and ImageNet-ES with ImageNet-LT (ID).

| Model | ValAcc | MSP | ODIN | Energy | ReAct | KNN | Maha | rMaha | Maha++ | rMaha++ | MM++ (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-B16-In21k-augreg | 84.47 | 69.50 | 72.30 | 73.39 | 73.17 | 61.80 | 73.60 | 72.84 | 73.73 | 73.72 | 80.15 |
| ViT-L16-In21k-augreg | 85.81 | 69.63 | 74.39 | 74.29 | 74.20 | 61.36 | 74.64 | 72.46 | 74.48 | 74.47 | 78.98 |
| ViT-T16-In21k-augreg | 75.48 | 73.01 | 72.32 | 77.46 | 76.78 | 71.17 | 69.52 | 72.96 | 77.55 | 77.53 | 81.26 |
| ViT-S16-In21k-augreg | 81.37 | 70.84 | 72.72 | 75.33 | 75.05 | 64.92 | 72.05 | 72.78 | 75.67 | 75.64 | 82.62 |
| ViT-B16-augreg | 79.13 | 71.27 | 70.01 | 74.27 | 74.10 | 54.89 | 73.77 | 73.15 | 74.31 | 74.29 | 74.48 |
| ViT-S16-augreg | 78.79 | 71.74 | 70.11 | 74.37 | 74.45 | 47.85 | 74.19 | 74.02 | 74.32 | 74.30 | 80.53 |
| ViT-so400M-SigLip | 89.33 | 64.44 | 61.14 | 62.08 | 64.38 | 55.08 | 70.53 | 68.90 | 71.94 | 71.93 | 66.22 |
| ViT-B16-In21k-orig | 81.80 | 71.17 | 70.33 | 73.89 | 73.83 | 56.45 | 73.85 | 73.76 | 74.50 | 74.58 | 74.24 |
| ViT-L16-In21k-orig | 81.50 | 71.16 | 69.54 | 73.02 | 73.08 | 53.77 | 73.18 | 73.81 | 74.44 | 74.43 | 73.58 |
| ViT-B16-In21k-mil | 84.26 | 71.87 | 70.73 | 73.66 | 73.75 | 53.87 | 70.01 | 72.33 | 73.68 | 73.93 | 70.34 |
| ViT-B16-In21k-augreg2 | 85.06 | 69.23 | 64.04 | 67.51 | 69.99 | 57.26 | 71.25 | 71.81 | 73.78 | 73.82 | 78.92 |
| EVA02-L14-M38m-In21k | 89.56 | 62.30 | 63.62 | 60.81 | 61.05 | 47.03 | 67.88 | 66.36 | 68.83 | 68.83 | 81.68 |
| EVA02-B14-In21k | 88.38 | 65.44 | 63.17 | 62.97 | 63.78 | 45.22 | 69.76 | 68.89 | 71.04 | 71.04 | 82.11 |
| EVA02-S14 | 85.64 | 68.33 | 63.51 | 65.46 | 65.50 | 62.02 | 72.02 | 71.71 | 72.59 | 72.59 | 83.37 |
| EVA02-T14 | 80.58 | 70.11 | 64.20 | 68.90 | 68.83 | 63.79 | 73.21 | 72.91 | 74.02 | 74.03 | 78.02 |
| DeiT3-B16 | 83.72 | 69.83 | 65.15 | 67.92 | 66.67 | 67.04 | 71.51 | 71.40 | 72.11 | 72.33 | 71.90 |
| DeiT3-L16-In21k | 87.58 | 65.99 | 65.51 | 65.04 | 67.55 | 57.84 | 70.38 | 68.06 | 70.94 | 71.02 | 70.78 |
| DeiT3-L16 | 85.80 | 66.25 | 64.41 | 63.54 | 56.29 | 69.69 | 72.35 | 70.72 | 72.88 | 71.67 | 72.60 |
| DeiT3-B16-In1k | 84.95 | 70.33 | 61.91 | 64.35 | 60.43 | 68.66 | 72.58 | 72.08 | 73.11 | 73.39 | 73.14 |
| DeiT3-B16-384-In1k | 86.67 | 66.83 | 62.55 | 62.81 | 67.06 | 49.80 | 72.23 | 70.89 | 72.88 | 72.89 | 72.75 |
| DeiT3-B16-224-In1k | 85.72 | 66.95 | 63.70 | 63.83 | 67.57 | 51.36 | 71.64 | 70.53 | 72.39 | 72.40 | 72.19 |
| DeiT3-S16-In21k | 84.74 | 66.46 | 59.18 | 64.66 | 66.59 | 42.78 | 71.78 | 71.33 | 72.61 | 72.61 | 72.11 |
| DeiT3-S16 | 83.44 | 71.60 | 64.41 | 68.35 | 67.42 | 68.79 | 71.99 | 72.77 | 73.47 | 73.56 | 73.26 |
| Swin-T | 81.24 | 73.09 | 63.85 | 72.18 | 73.78 | 64.12 | 71.71 | 73.38 | 75.92 | 76.28 | 80.62 |
| SwinV2-S | 84.08 | 72.06 | 67.35 | 69.42 | 73.30 | 51.35 | 72.27 | 72.09 | 73.78 | 73.87 | 74.91 |
| SwinV2-B | 84.62 | 72.43 | 66.72 | 71.06 | 74.87 | 51.91 | 73.37 | 72.24 | 74.34 | 75.79 | 75.98 |
| SwinV2-L-In21k | 87.05 | 70.42 | 65.29 | 65.54 | 70.00 | 68.54 | 71.48 | 70.92 | 73.79 | 74.87 | 76.48 |
| SwinV2-B-In21k | 86.64 | 70.21 | 64.45 | 65.17 | 69.36 | 66.65 | 71.82 | 71.30 | 73.96 | 74.14 | 76.78 |
| ConvNeXt-T | 82.73 | 73.03 | 62.17 | 62.92 | 66.51 | 56.03 | 72.59 | 74.05 | 75.05 | 75.05 | 77.07 |
| ConvNeXt-B | 84.41 | 73.52 | 58.05 | 63.96 | 70.50 | 56.61 | 74.05 | 73.67 | 75.22 | 76.33 | 76.82 |
| ConvNeXt-B-In21k | 85.96 | 68.36 | 67.12 | 64.24 | 66.05 | 56.76 | 72.76 | 71.61 | 73.94 | 73.93 | 75.83 |
| ConvNeXtV2-B-In21k | 87.09 | 66.43 | 66.17 | 64.84 | 65.40 | 48.34 | 71.04 | 69.60 | 71.68 | 71.68 | 72.63 |
| ConvNeXtV2-T-In21k | 84.75 | 68.41 | 71.83 | 66.36 | 66.48 | 50.05 | 72.30 | 71.42 | 72.88 | 72.88 | 74.15 |
| Average | 84.31 | 69.46 | 66.12 | 67.99 | 69.02 | 57.66 | 72.04 | 71.72 | 73.51 | 73.63 | 75.95 |

### B.6 Extended Results

The extended results in Table [7](https://arxiv.org/html/2606.17352#A2.T7) and Table [8](https://arxiv.org/html/2606.17352#A2.T8) show that MM++ achieves the best average AUROC and FPR95 across models on Near-OOD benchmarks. MM++ achieves substantial improvements for models trained with standard supervised or augreg pipelines, where intermediate representations contain complementary semantic information that is effectively fused. The gains are particularly pronounced for architectures with strong hierarchical feature diversity, such as Swin and ConvNeXt, which preserve complementary spatial and semantic statistics across layers. This behavior suggests that near-OOD shifts manifest at multiple abstraction levels rather than exclusively in the penultimate representation.

Across most DeiT3 variants, MM\+\+ remains competitive but does not consistently outperform single\-layer Mahalanobis\+\+ baselines\. DeiT3 models are trained with strong augmentations \(e\.g\., Mixup, CutMix\) and regularization techniques \(e\.g\., label smoothing\), which are known to produce smoother decision boundaries and more uniform feature representations\.

As a result, intermediate layers tend to be more aligned with the final representation, reducing the diversity of information across layers\. This limits the benefit of multi\-layer fusion, as the representations become increasingly redundant, and single\-layer methods already capture most of the discriminative structure\.

Consequently, MM\+\+ provides smaller gains in DeiT3 compared to other architectures, where stronger hierarchical feature diversity enables more effective cross\-layer modeling\.

Table 8: FPR95 on Near-OOD averaged over ImageNet-V2, ImageNet-C, ImageNet-R, and ImageNet-ES with ImageNet-LT (ID).

| Model | ValAcc | MSP | ODIN | Energy | ReAct | KNN | Maha | rMaha | Maha++ | rMaha++ | MM++ (ours) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ViT-B16-In21k-augreg | 84.47 | 73.95 | 72.49 | 66.54 | 68.52 | 84.36 | 68.36 | 69.94 | 66.97 | 67.02 | 54.04 |
| ViT-L16-In21k-augreg | 85.81 | 74.56 | 67.79 | 66.41 | 67.40 | 84.32 | 68.39 | 71.77 | 66.90 | 66.90 | 57.10 |
| ViT-T16-In21k-augreg | 75.48 | 70.38 | 71.96 | 60.06 | 62.91 | 67.40 | 82.10 | 80.69 | 59.46 | 59.53 | 55.79 |
| ViT-S16-In21k-augreg | 81.37 | 72.08 | 70.28 | 63.27 | 64.28 | 79.68 | 74.78 | 74.75 | 62.10 | 62.13 | 50.77 |
| ViT-B16-augreg | 79.13 | 73.45 | 80.12 | 70.13 | 70.40 | 90.37 | 71.41 | 73.19 | 69.53 | 69.56 | 69.03 |
| ViT-S16-augreg | 78.79 | 73.10 | 79.24 | 69.50 | 68.95 | 95.26 | 68.37 | 69.94 | 68.49 | 68.58 | 54.87 |
| ViT-so400M-SigLip | 89.33 | 81.48 | 82.42 | 78.19 | 76.79 | 93.18 | 73.66 | 74.39 | 69.49 | 69.50 | 89.90 |
| ViT-B16-In21k-orig | 81.80 | 71.66 | 77.48 | 67.82 | 68.05 | 90.86 | 71.11 | 70.76 | 68.51 | 68.32 | 69.20 |
| ViT-L16-In21k-orig | 81.50 | 72.44 | 78.91 | 70.74 | 70.87 | 93.97 | 74.79 | 72.38 | 70.73 | 70.76 | 72.17 |
| ViT-B16-In21k-mil | 84.26 | 70.60 | 71.73 | 65.03 | 65.21 | 91.83 | 83.64 | 77.92 | 68.20 | 67.63 | 74.67 |
| ViT-B16-In21k-augreg2 | 85.06 | 75.08 | 77.88 | 73.71 | 70.93 | 89.60 | 77.79 | 75.54 | 70.61 | 70.51 | 59.25 |
| EVA02-L14-M38m-In21k | 89.56 | 81.88 | 83.30 | 81.63 | 81.73 | 94.98 | 78.63 | 78.55 | 77.33 | 77.33 | 59.59 |
| EVA02-B14-In21k | 88.38 | 77.75 | 82.55 | 78.48 | 78.22 | 96.00 | 74.38 | 74.49 | 72.11 | 72.11 | 56.89 |
| EVA02-S14 | 85.64 | 73.79 | 80.49 | 73.98 | 74.04 | 85.13 | 68.99 | 69.45 | 67.04 | 67.04 | 54.59 |
| EVA02-T14 | 80.58 | 73.74 | 81.42 | 71.62 | 71.88 | 81.21 | 70.05 | 70.51 | 66.62 | 66.62 | 65.29 |
| DeiT3-B16 | 83.72 | 75.69 | 77.51 | 79.73 | 81.84 | 80.43 | 77.76 | 76.56 | 75.48 | 75.13 | 79.00 |
| DeiT3-L16-In21k | 87.58 | 82.17 | 81.54 | 79.13 | 75.61 | 83.66 | 74.35 | 74.12 | 73.71 | 73.75 | 74.19 |
| DeiT3-L16 | 85.80 | 78.52 | 79.38 | 85.53 | 92.13 | 71.16 | 71.05 | 71.19 | 68.91 | 69.61 | 72.14 |
| DeiT3-B16-In1k | 84.95 | 75.24 | 80.29 | 82.92 | 86.47 | 76.84 | 74.14 | 73.85 | 72.79 | 72.48 | 74.97 |
| DeiT3-B16-384-In1k | 86.67 | 79.50 | 82.45 | 78.94 | 74.55 | 94.66 | 72.14 | 71.45 | 70.45 | 70.46 | 72.26 |
| DeiT3-B16-224-In1k | 85.72 | 79.17 | 81.61 | 78.24 | 74.34 | 94.05 | 74.04 | 72.70 | 71.62 | 71.61 | 73.90 |
| DeiT3-S16-In21k | 84.74 | 76.95 | 82.12 | 74.59 | 72.42 | 97.33 | 73.50 | 72.81 | 70.30 | 70.30 | 73.47 |
| DeiT3-S16 | 83.44 | 72.79 | 76.89 | 76.67 | 77.99 | 76.02 | 76.46 | 75.15 | 70.60 | 70.46 | 79.77 |
| Swin-T | 81.24 | 71.82 | 84.11 | 70.40 | 68.37 | 78.86 | 77.32 | 74.19 | 64.63 | 64.31 | 54.29 |
| SwinV2-S | 84.08 | 72.20 | 78.92 | 71.98 | 67.90 | 91.94 | 73.56 | 72.72 | 66.71 | 66.64 | 65.52 |
| SwinV2-B | 84.62 | 72.32 | 76.94 | 68.03 | 63.71 | 88.13 | 72.01 | 71.30 | 67.28 | 66.25 | 66.18 |
| SwinV2-L-In21k | 87.05 | 75.33 | 81.53 | 79.36 | 77.94 | 72.17 | 74.94 | 73.02 | 67.63 | 66.94 | 64.36 |
| SwinV2-B-In21k | 86.64 | 74.49 | 80.91 | 76.13 | 73.38 | 73.32 | 73.79 | 72.54 | 67.14 | 67.09 | 64.48 |
| ConvNeXt-T | 82.73 | 70.34 | 85.46 | 84.54 | 82.96 | 91.87 | 75.19 | 73.09 | 71.31 | 71.37 | 66.30 |
| ConvNeXt-B | 84.41 | 68.98 | 85.34 | 82.26 | 76.88 | 83.26 | 72.79 | 71.75 | 71.60 | 73.80 | 68.65 |
| ConvNeXt-B-In21k | 85.96 | 74.43 | 76.18 | 79.38 | 76.65 | 92.18 | 68.40 | 69.35 | 66.66 | 66.66 | 64.28 |
| ConvNeXtV2-B-In21k | 87.09 | 76.87 | 79.85 | 75.50 | 75.20 | 95.01 | 71.20 | 71.49 | 70.18 | 70.18 | 69.40 |
| ConvNeXtV2-T-In21k | 84.75 | 73.65 | 72.06 | 74.51 | 74.45 | 95.29 | 68.87 | 69.79 | 67.30 | 67.30 | 65.96 |
| Average | 84.31 | 74.74 | 78.82 | 74.39 | 73.73 | 86.50 | 73.57 | 73.07 | 69.04 | 69.03 | 66.43 |

#### Summary\.

Across diverse architectures—including standard transformers, hierarchical transformers, and convolutional networks—MM\+\+ consistently improves performance on near\-OOD benchmarks\. While single\-layer methods remain effective for large semantic shifts, they fail to capture subtle distribution changes\.

In contrast, MM\+\+ leverages entropy\-guided layer selection and joint multi\-layer modeling to detect these shifts more effectively\. Importantly, this is achieved in a strictly post\-hoc setting without fine\-tuning, making MM\+\+ both practical and broadly applicable across heterogeneous architectures\.

Table 9: Ablation of covariance estimation and calibration-set size for Maha++ on ViT-B/16 with ImageNet-1K as ID. Values are AUROC (%). EC denotes empirical covariance and LW denotes Ledoit–Wolf shrinkage.

| Method | iNaturalist | SUN | Places365 | Textures | ImageNet-O |
|---|---|---|---|---|---|
| Maha++ EC (full 1.28M) | 98.76 | 88.90 | 86.48 | 89.26 | 86.42 |
| Maha++ LW (full 1.28M) | 98.79 | 88.90 | 86.49 | 89.27 | 86.44 |
| Maha++ LW (115K) | 98.76 | 88.88 | 86.54 | 89.28 | 86.39 |
| Maha++ EC (115K) | 98.74 | 88.88 | 86.54 | 89.28 | 86.38 |

### B.7 Additional Ablation Results

Covariance estimation ablation. Table [9](https://arxiv.org/html/2606.17352#A2.T9) evaluates the impact of covariance estimation methods and calibration-set size on the single-layer Mahalanobis++ baseline. We compare empirical covariance (EC) and Ledoit–Wolf (LW) shrinkage using both the full ImageNet-1K training set (1.28M samples) and a reduced subset of 115K samples. Across all five OOD benchmarks, the performance differences between EC and LW are marginal, with AUROC variations remaining within 0.05 percentage points. Similarly, reducing the calibration set size from 1.28M to 115K samples has a negligible impact. These findings suggest that, following $\ell_2$ normalization, the single penultimate-layer feature distribution is already sufficiently well-conditioned; thus, standard EC estimation remains highly stable and data-efficient. Consequently, we retain EC for the standard Mahalanobis++ baseline in our experiments to maintain consistency with the original protocol \[[31](https://arxiv.org/html/2606.17352#bib.bib52)\].

Furthermore, these results demonstrate that applying LW shrinkage in isolation (without top-$K$ information gating and feature fusion) yields limited performance gains. This observation explicitly motivates the design of our proposed framework. While a single layer is inherently well-conditioned, MM++ concatenates the top $K-1$ intermediate layers with the penultimate layer into a unified, high-dimensional representation space. Because this architecture-independent feature fusion creates a complex joint distribution prone to ill-conditioning, MM++ crucially relies on LW shrinkage to regularize the precision matrix, allowing the model to effectively leverage multi-layer information for robust OOD detection.
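The size of the EC-versus-LW gap in Table 9 can be verified with a few lines of arithmetic. The AUROC values below are copied from Table 9; the dictionary layout and the helper `max_abs_gap` are illustrative names only.

```python
# AUROC (%) from Table 9, in column order:
# iNaturalist, SUN, Places365, Textures, ImageNet-O.
table9 = {
    ("EC", "1.28M"): [98.76, 88.90, 86.48, 89.26, 86.42],
    ("LW", "1.28M"): [98.79, 88.90, 86.49, 89.27, 86.44],
    ("LW", "115K"): [98.76, 88.88, 86.54, 89.28, 86.39],
    ("EC", "115K"): [98.74, 88.88, 86.54, 89.28, 86.38],
}

def max_abs_gap(row_a, row_b):
    """Largest absolute AUROC difference across the five benchmarks."""
    return round(max(abs(a - b) for a, b in zip(row_a, row_b)), 2)

# Effect of the estimator at a fixed calibration-set size.
estimator_gap_full = max_abs_gap(table9[("EC", "1.28M")], table9[("LW", "1.28M")])
estimator_gap_small = max_abs_gap(table9[("EC", "115K")], table9[("LW", "115K")])

# Effect of the calibration-set size for a fixed estimator.
size_gap_ec = max_abs_gap(table9[("EC", "1.28M")], table9[("EC", "115K")])
size_gap_lw = max_abs_gap(table9[("LW", "1.28M")], table9[("LW", "115K")])

print(estimator_gap_full, estimator_gap_small)  # 0.03 0.02
print(size_gap_ec, size_gap_lw)                 # 0.06 0.05
```

The estimator choice changes AUROC by at most 0.03 points on any benchmark, and the calibration-set size by at most 0.06 points, which is consistent with the single-layer covariance already being well-conditioned.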

Table 10: Practical overhead comparison on ViT-B/16.

| Method | Fine-tuning (offline) | Total time (offline) | # of layers (online) | Memory/sample (online) | GPU latency (online) |
|---|---|---|---|---|---|
| Maha++ | None | ∼16 s | 1 | 3 KB | 0.0009 ms |
| MM++ ($K=2$) | None | ∼55 s | 2 | 6 KB | 0.0020 ms |
| X-Maha | ∼29 h | ∼29 h | 12 | 36 KB | 0.0464 ms |

![Refer to caption](https://arxiv.org/html/2606.17352v1/x10.png) (a) Offline
![Refer to caption](https://arxiv.org/html/2606.17352v1/x11.png) (b) Online

Figure 8: Practical overhead comparison on ViT-B/16.

Efficiency Ablation. We compare the deployment cost of MM++ with Mahalanobis++ and X-Mahalanobis in Table [10](https://arxiv.org/html/2606.17352#A2.T10) and Fig. [8](https://arxiv.org/html/2606.17352#A2.F8). As shown in Fig. [8(a)](https://arxiv.org/html/2606.17352#A2.F8.sf1), X-Mahalanobis incurs substantial offline overhead due to fine-tuning ($\sim$29 h), particularly when the full ImageNet-1K training set is used as the ID dataset. In contrast, both Mahalanobis++ and MM++ require no fine-tuning and remain fully post-hoc with lightweight calibration (16 s and 55 s, respectively).

For online OOD detection, MM++ uses only $K=2$ layers, compared to 12 layers in X-Mahalanobis, resulting in significantly lower memory usage and latency. As illustrated in Fig. [8(b)](https://arxiv.org/html/2606.17352#A2.F8.sf2), MM++ requires only 6 KB of activation memory and achieves 0.002 ms GPU latency per sample, remaining close to Mahalanobis++ while being substantially more efficient than X-Mahalanobis (36 KB, 0.046 ms). Overall, MM++ preserves the efficiency of post-hoc methods while enabling multi-layer modeling without expensive fine-tuning.
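The per-sample memory figures in Table 10 are consistent with a simple back-of-the-envelope calculation. The sketch assumes, which the appendix does not state explicitly, that each fused layer contributes one 768-dimensional float32 feature vector (the ViT-B/16 hidden width) and that 1 KB is 1024 bytes; the function name is illustrative.

```python
HIDDEN_DIM = 768         # ViT-B/16 token width; assumed to be the cached feature size
BYTES_PER_FLOAT32 = 4    # assumed storage precision

def activation_memory_kb(num_layers, hidden_dim=HIDDEN_DIM):
    """Memory needed to hold one feature vector per fused layer, in KB."""
    return num_layers * hidden_dim * BYTES_PER_FLOAT32 / 1024

layers_used = {"Maha++": 1, "MM++ (K=2)": 2, "X-Maha": 12}
memory_kb = {name: activation_memory_kb(k) for name, k in layers_used.items()}
print(memory_kb)  # {'Maha++': 3.0, 'MM++ (K=2)': 6.0, 'X-Maha': 36.0}

# Latency ratio between X-Maha and MM++ from the values reported in Table 10.
latency_ratio = round(0.0464 / 0.0020, 1)
print(latency_ratio)  # 23.2
```

Under these assumptions the memory cost grows linearly with the number of fused layers, which reproduces the 3 KB, 6 KB, and 36 KB entries of the table.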

## Appendix C Reproduction Details

Evaluation Rigor and Stability. Because MM++ is a strictly post-hoc framework operating on frozen, standardized model checkpoints, the extracted feature representations and resulting OOD scores are deterministic.

Consequently, we validate the robustness of our method across distinct architectural paradigms and diverse distribution shifts rather than through stochastic repetition. The consistent performance of MM++ across architectural designs (global attention, hierarchical attention, and modernized convolutions) and under highly varied distribution shifts (from long-tailed class imbalance to severe covariate degradation) supports its algorithmic robustness.

Hyperparameter, Feature Extraction, and Well-Conditioned Covariance Estimation. MM++ is a data-driven, training-free method with a single hyperparameter ($K=2$). We use publicly available pretrained backbones (e.g., ViT-B/16, Swin-T, and ConvNeXt-T) and standard OOD benchmarks with their official splits. Features are extracted from intermediate layers, $\ell_2$-normalized, and class-conditional statistics are computed from in-distribution training data.

The joint covariance matrix over the selected layers is estimated using Ledoit–Wolf shrinkage, and OOD scores are computed via the Mahalanobis++ formulation described in Section [3](https://arxiv.org/html/2606.17352#S3). No additional training or hyperparameter tuning is required. Moreover, our detailed experimental set-up is described in Section [4](https://arxiv.org/html/2606.17352#S4).
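The calibration and scoring steps described above can be sketched end to end on synthetic data. This is an illustrative approximation rather than the released implementation: it assumes each selected layer is $\ell_2$-normalized separately before concatenation, uses a fixed hypothetical shrinkage intensity `gamma` in place of the analytic Ledoit–Wolf value, and all function names and toy feature dimensions are made up for the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-12):
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def fuse(layer_feats):
    """Normalize each selected layer separately, then concatenate (assumed order)."""
    return np.concatenate([l2_normalize(f) for f in layer_feats], axis=-1)

def fit(layer_feats, labels, gamma=0.1):
    """Class means and a shrunk tied precision matrix in the fused space."""
    phi = fuse(layer_feats)
    classes = np.unique(labels)
    means = np.stack([phi[labels == c].mean(axis=0) for c in classes])
    residuals = phi - means[np.searchsorted(classes, labels)]
    cov = residuals.T @ residuals / len(phi)       # tied (shared) covariance
    dim = cov.shape[0]
    shrunk = (1 - gamma) * cov + gamma * np.trace(cov) / dim * np.eye(dim)
    return means, np.linalg.inv(shrunk)

def score(layer_feats, means, precision):
    """Negative minimum class-conditional Mahalanobis distance (higher = more ID)."""
    phi = fuse(layer_feats)
    diff = phi[:, None, :] - means[None, :, :]     # (samples, classes, dim)
    dists = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return -dists.min(axis=1)

# Toy data: K = 2 hypothetical layers of widths 16 and 8, three ID classes.
rng = np.random.default_rng(0)
dims, n_classes, per_class = (16, 8), 3, 200
labels = np.repeat(np.arange(n_classes), per_class)
centers = [rng.normal(size=(n_classes, d)) * 5.0 for d in dims]

def sample_id():
    return [c[labels] + rng.normal(size=(len(labels), c.shape[1])) for c in centers]

train, test_id = sample_id(), sample_id()
test_ood = [rng.normal(size=(len(labels), d)) for d in dims]  # no class structure

means, precision = fit(train, labels)
id_scores = score(test_id, means, precision)
ood_scores = score(test_ood, means, precision)
print(np.median(id_scores), np.median(ood_scores))
```

In practice the shrinkage intensity would come from a Ledoit–Wolf estimator (scikit-learn's `sklearn.covariance.LedoitWolf` is one available implementation) and the features from forward hooks on the frozen backbone; both are omitted here to keep the sketch self-contained.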

Code Release. An anonymized ZIP archive containing the complete implementation is included with the submission. Upon acceptance, the codebase will be made publicly available. Our implementation ensures full reproducibility of all reported results using the publicly accessible OOD benchmarks detailed in this work.

Compute Resources. All calibration and evaluation are performed on a workstation with an Intel Core i9 processor (16 cores), 32 GB DDR5 SDRAM, and an NVIDIA RTX 4090 GPU with 24 GB GDDR6X SDRAM.

## Appendix D Methods

We briefly review the OOD detection methods evaluated in this paper\.

Letf\(x\)∈ℝCf\(x\)\\in\\mathbb\{R\}^\{C\}denote the pre\-softmax logits of the network, andh\(x\)∈ℝdh\(x\)\\in\\mathbb\{R\}^\{d\}denote the penultimate feature representation \(withhl\(x\)h\_\{l\}\(x\)for layer\-wise features\)\.

- MSP [[12](https://arxiv.org/html/2606.17352#bib.bib54)]: Uses the maximum softmax probability as the confidence score: $\mathcal{S}_{\text{MSP}}(x)=\max_{c}\frac{\exp(f_{c}(x))}{\sum_{j}\exp(f_{j}(x))}$
- ODIN [[20](https://arxiv.org/html/2606.17352#bib.bib55)]: Applies temperature scaling ($T$) and input perturbation ($x\rightarrow\tilde{x}$): $\mathcal{S}_{\text{ODIN}}(x)=\max_{c}\frac{\exp(f_{c}(\tilde{x})/T)}{\sum_{j}\exp(f_{j}(\tilde{x})/T)}$
- Energy [[23](https://arxiv.org/html/2606.17352#bib.bib53)]: Uses the free energy of logits as an unnormalized confidence score: $\mathcal{S}_{\text{Energy}}(x)=T\log\sum_{c}\exp(f_{c}(x)/T)$
- ReAct [[39](https://arxiv.org/html/2606.17352#bib.bib57)]: Truncates the penultimate activations $h(x)$ at a threshold $v$ to produce rectified logits $f^{\text{clip}}(x)$, computing the energy score as: $\mathcal{S}_{\text{ReAct}}(x)=T\log\sum_{c=1}^{C}\exp(f_{c}^{\text{clip}}(x)/T)$
- KNN [[40](https://arxiv.org/html/2606.17352#bib.bib58)]: Uses the negative distance to the $k$-th nearest neighbor in feature space: $\mathcal{S}_{\text{KNN}}(x)=-\|h(x)-h_{(k)}(x)\|_{2}$
- Mahalanobis [[19](https://arxiv.org/html/2606.17352#bib.bib56)]: Models features as class-conditional Gaussians with shared covariance $\Sigma$: $\mathcal{S}_{\text{Maha}}(x)=-\min_{c}(h(x)-\mu_{c})^{\top}\Sigma^{-1}(h(x)-\mu_{c})$
- Relative Mahalanobis [[35](https://arxiv.org/html/2606.17352#bib.bib48)]: Refines the Mahalanobis score by subtracting a background distance term based on $(\mu_{0},\Sigma_{0})$: $\mathcal{S}_{\text{rMaha}}(x)=-\min_{c}\left[d_{c}(x)-d_{0}(x)\right]$, where $d_{c}(x)=(h(x)-\mu_{c})^{\top}\Sigma^{-1}(h(x)-\mu_{c})$ and $d_{0}(x)=(h(x)-\mu_{0})^{\top}\Sigma_{0}^{-1}(h(x)-\mu_{0})$.
- Relative Mahalanobis++: Extends relative Mahalanobis by using $\ell_{2}$-normalized features $\tilde{h}(x)$: $\mathcal{S}_{\text{rMaha++}}(x)=-\min_{c}\Big[(\tilde{h}(x)-\tilde{\mu}_{c})^{\top}\tilde{\Sigma}^{-1}(\tilde{h}(x)-\tilde{\mu}_{c})-(\tilde{h}(x)-\tilde{\mu}_{0})^{\top}\Sigma_{0}^{-1}(\tilde{h}(x)-\tilde{\mu}_{0})\Big]$
- X-Mahalanobis [[45](https://arxiv.org/html/2606.17352#bib.bib51)]: Linearly aggregates layer-wise Mahalanobis distances using variance-based weights $\alpha_{l}$: $\mathcal{S}_{\text{X-Maha}}(x)=-\min_{c}\sum_{l=1}^{L}\alpha_{l}(h_{l}(x)-\mu_{c,l})^{\top}\Sigma_{l}^{-1}(h_{l}(x)-\mu_{c,l})$
- Mahalanobis++ [[31](https://arxiv.org/html/2606.17352#bib.bib52)]: Applies a scale-invariant Mahalanobis distance on $\ell_{2}$-normalized features: $\mathcal{S}_{\text{Maha++}}(x)=-\min_{c}(\tilde{h}(x)-\tilde{\mu}_{c})^{\top}\tilde{\Sigma}^{-1}(\tilde{h}(x)-\tilde{\mu}_{c})$
- MM++ (Ours): Evaluates a single joint Mahalanobis++ distance on a unified representation space $\phi_{c}(x)$, fusing $\ell_{2}$-normalized features from the terminal (penultimate) layer and the top $K-1$ intermediate layers. The tied precision matrix $\hat{\Sigma}_{\mathcal{K}}^{-1}$ is derived analytically via Ledoit–Wolf shrinkage: $\mathcal{S}_{\text{MM++}}(x)=-\min_{c}\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_{c}^{\mathcal{K}}\bigr)^{\top}\hat{\Sigma}_{\mathcal{K}}^{-1}\bigl(\boldsymbol{\phi}(x)-\hat{\boldsymbol{\mu}}_{c}^{\mathcal{K}}\bigr)$
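
The joint distance in the last item can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names are hypothetical, the fused vector is assumed to be a plain concatenation of per-layer $\ell_{2}$-normalized features, and the shrinkage intensity is a fixed placeholder rather than the analytic Ledoit-Wolf estimate used in the paper.

```python
import numpy as np

def l2_normalize(h, eps=1e-12):
    """Scale each row to unit Euclidean norm (makes the features scale-invariant)."""
    return h / np.maximum(np.linalg.norm(h, axis=1, keepdims=True), eps)

def fuse_layers(layer_feats):
    """Concatenate l2-normalized features of the selected layers (assumed fusion rule)."""
    return np.concatenate([l2_normalize(h) for h in layer_feats], axis=1)

def fit_tied_gaussian(phi, labels, shrinkage=0.1):
    """Estimate class means and one shared (tied) precision matrix.

    `shrinkage` is a fixed, hypothetical intensity; Ledoit-Wolf shrinkage
    would instead estimate this value analytically from the data.
    """
    classes = np.unique(labels)
    means = np.stack([phi[labels == c].mean(axis=0) for c in classes])
    # Center every sample by the mean of its own class, then pool.
    centered = phi - means[np.searchsorted(classes, labels)]
    cov = centered.T @ centered / len(phi)
    d = cov.shape[0]
    # Shrink toward a scaled identity so the matrix is well conditioned.
    target = (np.trace(cov) / d) * np.eye(d)
    cov_shrunk = (1.0 - shrinkage) * cov + shrinkage * target
    return means, np.linalg.inv(cov_shrunk)

def mahalanobis_score(phi, means, precision):
    """Negative minimum class-wise squared Mahalanobis distance (higher = more in-distribution)."""
    diff = phi[:, None, :] - means[None, :, :]            # shape (n, C, D)
    dists = np.einsum("ncd,de,nce->nc", diff, precision, diff)
    return -dists.min(axis=1)
```

In this sketch, `fuse_layers` would receive one `(n, d_l)` array per selected layer for the same `n` inputs, `fit_tied_gaussian` would be run once on in-distribution training features, and `mahalanobis_score` would be thresholded at test time.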

## Appendix E Limitations

MM++ incurs additional computational and memory overhead compared to single-layer methods, due to eigenspectrum estimation of within-class covariance matrices and storage of layer-wise statistics. However, the overhead is minimal, particularly when $K$ is small (e.g., $K=2$), as shown in Table [10](https://arxiv.org/html/2606.17352#A2.T10). This is because Top-$K$ information gating selects only the $K-1$ intermediate layers with the sharpest entropy density drops, while anchoring the penultimate layer.

As a strictly post-hoc method, MM++ also depends on the geometry of the pretrained representation space. Its gating mechanism assumes identifiable entropy density drops ($\Delta_{l}$) associated with semantic compression; when such structure is weak or absent, intermediate features may offer limited discriminative signal for near-OOD detection. For this reason, MM++ does not select intermediate layer $l$ if $\Delta_{l}\leq 0$.
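
The gating rule described in this appendix can be sketched in a few lines of Python. This is an illustrative reading, not the authors' code: the function name `select_layers` and the dictionary input are hypothetical, and the entropy density drops $\Delta_{l}$ are assumed to have been computed beforehand.

```python
def select_layers(entropy_drops, penultimate, k):
    """Pick the penultimate layer plus up to k-1 intermediate layers.

    entropy_drops: dict mapping intermediate layer index -> Delta_l
                   (assumed precomputed; how Delta_l is estimated is not shown here).
    penultimate:   index of the anchored terminal layer, always kept.
    k:             total number of layers to fuse (K in the text).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    # Layers with Delta_l <= 0 are never eligible.
    candidates = [(delta, layer) for layer, delta in entropy_drops.items()
                  if layer != penultimate and delta > 0]
    # Sharpest drops first; ties broken by the lower layer index.
    candidates.sort(key=lambda pair: (-pair[0], pair[1]))
    chosen = [layer for _, layer in candidates[:k - 1]]
    return sorted(chosen) + [penultimate]

drops = {1: 0.2, 2: -0.1, 3: 0.9, 4: 0.0, 5: 0.4}
print(select_layers(drops, penultimate=6, k=3))  # -> [3, 5, 6]
```

When fewer than $K-1$ layers have a positive drop, this sketch simply returns fewer layers, which matches the stated rule that layers with $\Delta_{l}\leq 0$ are not selected.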

Extending MM++ to dense prediction tasks remains an open challenge, requiring spatially localized estimates of intrinsic dimensionality to account for heterogeneous feature collapse.

## Appendix F Broader Impacts

This work improves the reliability of deep neural networks in open-world settings by enabling unsupervised, post-hoc out-of-distribution (OOD) detection. The proposed approach reduces computational overhead by avoiding retraining and does not rely on labeled proxy OOD data, thereby limiting potential sources of bias.

However, MM++ shares limitations common to OOD detection methods. It is not guaranteed to detect all anomalous inputs, and over-reliance on its outputs may lead to insufficient oversight. As such, it should be used as part of a broader safety framework that includes complementary safeguards, such as uncertainty estimation or human supervision.

## Appendix G Public Resources Used

In this section, we acknowledge the public resources used during the course of this work.

### G.1 Public Datasets Used

- OpenOOD: MIT License
- NINCO: MIT License
- ImageNet-1K: ImageNet Agreement
- ImageNet-LT: BSD-3-Clause License
- Texture: Creative Commons BY-SA 4.0
- OpenImage-O: Apache License 2.0
- ImageNet-C: ImageNet Agreement
- ImageNet-ES: ImageNet Agreement
- ImageNet-R: ImageNet Agreement
- ImageNet-V2: ImageNet Agreement
- Places365: Creative Commons Attribution 4.0 License
- iNaturalist: Mixed Creative Commons Licenses
- SUN: SUN Database License

### G.2 Public Implementations Used

- Mahalanobis++: Creative Commons BY 4.0 License
- timm: Apache License 2.0
- scikit-learn: BSD-3-Clause License
- X-Mahalanobis: Creative Commons BY 4.0 License
- ODIN: MIT License
- Energy: MIT License
- ReAct: MIT License