Uncovering the Latent Potential of Deep Intermediate Representations
Summary
This paper introduces LOES (Layer-wise Optimal Embedding Selection) and GeoReg (Geometric Regularization Loss), methods that select and fuse task-relevant intermediate layers from deep models to improve transfer learning performance, demonstrating consistent gains across architectures and modalities.
# Uncovering the Latent Potential of Deep Intermediate Representations
Source: [https://arxiv.org/html/2605.23033](https://arxiv.org/html/2605.23033)
###### Abstract
Foundational models pretrained on large amounts of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.
representation learning, layer selection, linear probes, embedding geometry, transfer learning, deep encoders, foundation models
Figure 1: Standard vs. LOES-based transfer learning: (a) conventional transfer learning uses a single encoder layer, typically the final layer, for downstream prediction, whereas (b) LOES selects and fuses multiple task-relevant layers from the encoder hierarchy using target supervision, enabling transfer that exploits complementary information across layers.

## 1 Introduction
While foundational models (Baevski et al., [2020](https://arxiv.org/html/2605.23033#bib.bib65); Devlin et al., [2019](https://arxiv.org/html/2605.23033#bib.bib2); Girdhar et al., [2023](https://arxiv.org/html/2605.23033#bib.bib1); Oquab et al., [2023](https://arxiv.org/html/2605.23033#bib.bib61); Radford et al., [2021](https://arxiv.org/html/2605.23033#bib.bib64)) define state-of-the-art transfer learning, standard adaptation protocols typically rely on the final-layer output under the assumption that semantic utility increases monotonically with depth. Recent empirical analyses challenge this view, demonstrating that intermediate layers (Skean et al., [2025](https://arxiv.org/html/2605.23033#bib.bib3); Park et al., [2024](https://arxiv.org/html/2605.23033#bib.bib4); Alain and Bengio, [2016](https://arxiv.org/html/2605.23033#bib.bib12)) frequently outperform the final representations by up to 10% in downstream tasks. This phenomenon occurs because the final layers often over-specialize to pretraining objectives such as next-token prediction or suffer from representation collapse. In autoregressive language models specifically, a “compression valley” (Queipo-de-Llano et al., [2025](https://arxiv.org/html/2605.23033#bib.bib5); Li et al., [2025](https://arxiv.org/html/2605.23033#bib.bib6)) emerges in mid-depth layers where the network optimally balances signal retention with noise suppression. Similar patterns are observed in vision transformers (Dosovitskiy, [2020](https://arxiv.org/html/2605.23033#bib.bib7)), suggesting that this is a domain-general property driven by optimization dynamics rather than data modality.
To effectively recover this information, we argue that one must identify the optimal task-specific subspace formed by the top-$k$ discriminative layers regardless of their depth. To this end, we introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral framework for subspace approximation. LOES (Figure [1](https://arxiv.org/html/2605.23033#S0.F1)) formulates layer selection as a ridge-residual optimization problem constrained by orthogonality and isotropy. By analyzing the eigenspectrum of layer-wise representations, our method systematically identifies orthogonal combinations of embeddings that maximize separability. To prevent feature collapse during the subsequent adaptation, we complement this selection with Geometric Regularization (GeoReg), an auxiliary loss that enforces a simplicial topology on the class manifolds. Our contributions are as follows:
- We demonstrate that the optimal representations for downstream tasks are rarely the final layer alone but rather a task-specific subspace formed by a subset of intermediate layers.
- We propose LOES, a spectral algorithm that identifies optimal layer combinations by minimizing residual error under geometric constraints. This method outperforms learnable weighting baselines by explicitly constructing subspaces with high effective rank and low anisotropy.
- By analyzing the layer selection preferences of LOES, we provide a new lens for interpretability that reveals where different foundational models encode specific semantic or structural attributes.
- Our analysis demonstrates that performance gains scale with model depth, and that these effects are modality- and data-agnostic, holding across a variety of foundational model architectures and pretraining techniques.
- We also introduce GeoReg, a loss function that improves stability in fine-tuning settings, mitigating the feature collapse often observed in large-scale models.
## 2 Related Works
### 2.1 Suboptimality of Final Layer Embeddings
Research across modalities suggests that final-layer embeddings are not universally optimal. In LLMs, “Concept Depth” indicates that while shallow layers handle simpler tasks, deeper layers are required for complex abstractions (Jin et al., [2025](https://arxiv.org/html/2605.23033#bib.bib41)). Studies have shown that mid-depth layers often encode robust features, challenging the usual emphasis on final-layer representations (Lad et al., [2024](https://arxiv.org/html/2605.23033#bib.bib42); Fan et al., [2024](https://arxiv.org/html/2605.23033#bib.bib43)). Similarly, in vision and multi-modal encoders, final-layer features are often suboptimal for downstream tasks when the model is trained on proxy or self-supervised objectives. Studies using image colorization as a pretraining task (Zhang et al., [2016](https://arxiv.org/html/2605.23033#bib.bib44); Zheng et al., [2025](https://arxiv.org/html/2605.23033#bib.bib45)) found middle layers more effective for classification. This trend persists in autoregressive models like iGPT (Chen et al., [2020](https://arxiv.org/html/2605.23033#bib.bib47)) and AIMv1 (El-Nouby et al., [2024](https://arxiv.org/html/2605.23033#bib.bib46)), as well as video models like Toto (Rajasegaran et al., [2025](https://arxiv.org/html/2605.23033#bib.bib48)), where intermediate layers excel in classification, tracking, and robotics.
### 2.2 Interpretability and Representational Dynamics
Researchers have introduced linear probes to monitor layer-wise classification suitability, noting monotonic increases in separability with depth (Alain and Bengio, [2016](https://arxiv.org/html/2605.23033#bib.bib12)). Techniques such as SVCCA (Raghu et al., [2017](https://arxiv.org/html/2605.23033#bib.bib13)) enable cross-layer comparisons, showing that networks typically converge bottom-up. While BERT’s (Devlin et al., [2019](https://arxiv.org/html/2605.23033#bib.bib2)) hierarchy often mirrors classical NLP pipelines (De Vries et al., [2020](https://arxiv.org/html/2605.23033#bib.bib9)), recent spectral analyses identify three distinct geometric phases (warmup, entropy-seeking, and compression) that define representational evolution during pretraining (Li et al., [2025](https://arxiv.org/html/2605.23033#bib.bib6)). Saponati et al. ([2025](https://arxiv.org/html/2605.23033#bib.bib49)) present a theoretical analysis of how different pretext tasks, such as next-token prediction and masked language modeling, influence the structure of learned representations.
### 2.3 Performance Metrics for Intermediate Representations
RankMe (Garrido et al., [2023](https://arxiv.org/html/2605.23033#bib.bib50)) uses effective rank as an unsupervised performance indicator, while LevyScore ([Maes et al.](https://arxiv.org/html/2605.23033#bib.bib51)) assesses confidence via pretraining density deviations. Layer by Layer (Skean et al., [2025](https://arxiv.org/html/2605.23033#bib.bib3)) presents an information-theoretic and geometric framework that explains why mid-depth representations often outperform final layers, a property leveraged by Perception Encoder (Bolya et al., [2025](https://arxiv.org/html/2605.23033#bib.bib39)) through specialized intermediate alignment.
Motivated by closed-form optimization approaches (Bertinetto et al., [2018](https://arxiv.org/html/2605.23033#bib.bib52)) and recent findings suggesting that embeddings approaching an isotropic Gaussian distribution yield lower downstream prediction risk (Balestriero and LeCun, [2025](https://arxiv.org/html/2605.23033#bib.bib27)), our algorithm aims to select the most effective layers for a downstream task, mainly by using closed-form ridge regression to predict residuals and by encouraging isotropy in the resulting representations.
## 3 Methodology
We propose a unified framework that challenges the assumption of a single task\-optimal layer by explicitly modeling the encoder as a hierarchy of candidate representation subspaces\. Our approach consists of two coupled components: \(1\) a Layer\-wise Optimal Embedding Selection \(LOES\) algorithm \(Algorithm[1](https://arxiv.org/html/2605.23033#alg1)\) that approximates the task\-optimal subspace via spectral\-ridge minimization, and \(2\) Geometric Regularization \(GeoReg\) as an auxiliary objective that aligns the training dynamics with the topological requirements of LOES while fine\-tuning\.
### 3.1 Preliminaries and Problem Formulation
Consider a foundational encoder $f_{\theta}(\cdot)$ that maps an input $x$ to a sequence of $L$ hierarchical representations $\mathcal{H}=\{h^{(1)},h^{(2)},\dots,h^{(L)}\}$, where $h^{(\ell)}\in\mathbb{R}^{d_{\ell}}$. Standard transfer learning typically utilizes a projection on the final layer, $\hat{y}=Wh^{(L)}$, or a learned scalar combination, $\hat{y}=W(\sum_{\ell}\alpha_{\ell}h^{(\ell)})$ (Yang et al., [2025](https://arxiv.org/html/2605.23033#bib.bib8); De Vries et al., [2020](https://arxiv.org/html/2605.23033#bib.bib9); Peters et al., [2018](https://arxiv.org/html/2605.23033#bib.bib10); Chiu et al., [2024](https://arxiv.org/html/2605.23033#bib.bib11); Alain and Bengio, [2016](https://arxiv.org/html/2605.23033#bib.bib12)). Restricting adaptation to the final-layer basis assumes that downstream risk decreases monotonically with depth. This assumption is violated when task-discriminative information is distributed non-monotonically across layers (Raghu et al., [2017](https://arxiv.org/html/2605.23033#bib.bib13); Kornblith et al., [2019](https://arxiv.org/html/2605.23033#bib.bib14)). We therefore define the task-optimal representation as a constructive latent subspace $S\subset\mathrm{span}(\mathcal{H})$, formed by orthogonal selection of informative features across the encoder hierarchy (Saxe et al., [2013](https://arxiv.org/html/2605.23033#bib.bib16)). This enables selective use of final-layer representations while integrating complementary intermediate features, yielding a geometry that minimizes empirical risk and regularizes against representation collapse (Andriopoulos et al., [2024](https://arxiv.org/html/2605.23033#bib.bib17)).
While we have primarily focused on classification and regression tasks along with their derivatives such as semantic segmentation, the LOES framework is inherently task\-agnostic\. By altering the target label space Y, the selection mechanism can be seamlessly adapted for complex multimodal objectives like Visual Question Answering \(VQA\) or sequence\-to\-sequence generation, where informative signals are often sparsely distributed across the encoder depth\.
### 3.2 Layer-wise Optimal Embedding Selection (LOES)
We propose Layer-wise Optimal Embedding Selection (LOES), a framework that operates on layer-wise embeddings and constructs a compact, task-discriminative subspace by iteratively selecting non-redundant layers whose embeddings both explain residual task signal and exhibit geometric properties that promote stable probing and fine-tuning. LOES proceeds in two distinct stages. (i) Initialization: the first layer is selected by fitting a ridge probe on the *raw* features $\mathbf{X}_{\ell}$ against the original targets $\mathbf{Y}$ and minimizing a simplified score that combines fit loss with an isotropy term. (ii) Iterative selection: given a current selected set $S$ and cumulative prediction $\widehat{\mathbf{Y}}$, each remaining layer is orthogonalized against $\mathrm{span}(\mathbf{X}_{S})$ (Eq. [3](https://arxiv.org/html/2605.23033#S3.E3)), scored against the current residual $\mathbf{R}=\mathbf{Y}-\widehat{\mathbf{Y}}$ using a composite objective (Eq. [7](https://arxiv.org/html/2605.23033#S3.E7)), and the best candidate is then refit on its *raw* features against $\mathbf{Y}$ and accumulated into $\widehat{\mathbf{Y}}$. The procedure stops when $|S|=K$.
Let $\{(\mathbf{x}_{i},y_{i})\}_{i=1}^{N}$ be a calibration set (sampled from the training set). For an encoder with $L$ layers, $\mathbf{X}_{\ell}\in\mathbb{R}^{N\times d_{\ell}}$ denotes the feature matrix for the $N$ samples extracted at layer $\ell$. For classification into $C$ classes, we use one-hot encoded target values $\mathbf{Y}\in\mathbb{R}^{N\times C}$. All feature matrices are column-centered. LOES measures the *linear accessibility* of task information in a layer via closed-form Tikhonov-regularized regression (Hoerl and Kennard, [1970](https://arxiv.org/html/2605.23033#bib.bib18)). Given features $\mathbf{X}\in\mathbb{R}^{N\times d}$ and targets $\mathbf{Y}\in\mathbb{R}^{N\times C}$, we consider the objective

$$\mathbf{W}^{\star}=\arg\min_{\mathbf{W}\in\mathbb{R}^{d\times C}}\|\mathbf{X}\mathbf{W}-\mathbf{Y}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2},\tag{1}$$

where $\lambda>0$ is the Tikhonov (ridge) regularizer. This yields a closed-form solution

$$\mathbf{W}^{\star}=(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I}_{d})^{-1}\mathbf{X}^{\top}\mathbf{Y}.\tag{2}$$

Treating classification as multi-output regression with one-hot encoded targets allows a ridge probe to produce linear class scores, with prediction given by the maximum score across classes. This formulation directly measures the linear accessibility of class information, independent of any specific classifier or loss.
Ridge regularization is essential in deep feature spaces. Encoder embeddings are often anisotropic or effectively low-rank (Li and Huang, [2026](https://arxiv.org/html/2605.23033#bib.bib19); Godey et al., [2024](https://arxiv.org/html/2605.23033#bib.bib20); [Huh et al.](https://arxiv.org/html/2605.23033#bib.bib21)), making the unregularized normal equations ill-conditioned and leading to unstable, high-variance probes. Ridge regularization stabilizes the inversion in Eq. ([2](https://arxiv.org/html/2605.23033#S3.E2)), controls the effective degrees of freedom of the fitted predictor, and yields numerically stable scores that are comparable across layers and models. When the feature dimension exceeds the calibration set size, we compute the exact solution efficiently using the Woodbury identity (Hager, [1989](https://arxiv.org/html/2605.23033#bib.bib23)). Implementation details are provided in the Appendix.
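The closed-form probe of Eq. (2) and its rewriting for the case $d>N$ can be sketched as follows. This is a minimal NumPy illustration, not the authors' appendix implementation: the function names, the synthetic data, and the choice $\lambda=1$ are assumptions made for the example. The dual form uses the push-through identity $(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I}_{d})^{-1}\mathbf{X}^{\top}=\mathbf{X}^{\top}(\mathbf{X}\mathbf{X}^{\top}+\lambda\mathbf{I}_{N})^{-1}$, which follows from the Woodbury identity and replaces a $d\times d$ solve with an $N\times N$ one.

```python
import numpy as np

def ridge_primal(X, Y, lam):
    """Eq. (2): W = (X^T X + lam I_d)^{-1} X^T Y, a d x d linear solve."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)

def ridge_dual(X, Y, lam):
    """Equivalent form W = X^T (X X^T + lam I_N)^{-1} Y, an N x N linear solve."""
    n = X.shape[0]
    return X.T @ np.linalg.solve(X @ X.T + lam * np.eye(n), Y)

def ridge_probe(X, Y, lam=1.0):
    """Hypothetical helper: use the smaller of the two linear systems."""
    n, d = X.shape
    return ridge_dual(X, Y, lam) if d > n else ridge_primal(X, Y, lam)

# Synthetic calibration set (assumption): N=40 samples, d=200 features, C=5 classes.
rng = np.random.default_rng(0)
N, d, C = 40, 200, 5
X = rng.standard_normal((N, d))
X -= X.mean(axis=0)                      # column-center the features
labels = rng.integers(0, C, size=N)
Y = np.eye(C)[labels]                    # one-hot targets

W = ridge_probe(X, Y, lam=1.0)
pred = (X @ W).argmax(axis=1)            # predicted class = maximum linear score
max_gap = np.abs(ridge_primal(X, Y, 1.0) - ridge_dual(X, Y, 1.0)).max()
```

Both forms return the same weights up to floating-point error, so the choice between them is purely a matter of cost.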
To avoid selecting redundant layers, LOES evaluates candidate layers after removing components already explained by the currently selected subspace. If $S$ denotes the indices of selected layers and $\mathbf{X}_{S}=[\mathbf{X}_{j_{1}}\;\mathbf{X}_{j_{2}}\;\cdots]\in\mathbb{R}^{N\times D_{S}}$ denotes their concatenation, then for a candidate layer $\ell\notin S$ we compute ridge-regularized orthogonalized features as

$$\widetilde{\mathbf{X}}_{\ell}=\mathbf{X}_{\ell}-\mathbf{X}_{S}(\mathbf{X}_{S}^{\top}\mathbf{X}_{S}+\varepsilon\mathbf{I}_{D_{S}})^{-1}\mathbf{X}_{S}^{\top}\mathbf{X}_{\ell}.\tag{3}$$

$\widetilde{\mathbf{X}}_{\ell}$ is the minimum-norm residual of $\mathbf{X}_{\ell}$ with respect to the span of $\mathbf{X}_{S}$ in the ridge sense, and $\varepsilon>0$ is a small value for numerical stability. Orthogonalization ensures that the fit term in our composite score measures only the marginal explanatory power of layer $\ell$ beyond $S$; a separate redundancy term (introduced below) acts on the raw features to filter globally similar candidates before they enter the score.
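To make Eq. (3) concrete, the sketch below builds a candidate layer whose first three columns are linear mixtures of already selected features and whose remaining columns are new directions. The helper name `ridge_orthogonalize`, the synthetic data, and $\varepsilon=10^{-6}$ are assumptions for illustration only.

```python
import numpy as np

def ridge_orthogonalize(X_cand, X_sel, eps=1e-6):
    """Eq. (3): remove from X_cand the part explained by span(X_sel), with ridge eps."""
    D = X_sel.shape[1]
    coef = np.linalg.solve(X_sel.T @ X_sel + eps * np.eye(D), X_sel.T @ X_cand)
    return X_cand - X_sel @ coef

rng = np.random.default_rng(0)
N = 100
X_sel = rng.standard_normal((N, 8))                   # features of selected layers
novel = rng.standard_normal((N, 4))                   # directions not in span(X_sel)
mixed = X_sel @ rng.standard_normal((8, 3))           # directions fully inside span(X_sel)
X_cand = np.hstack([mixed, novel])

X_tilde = ridge_orthogonalize(X_cand, X_sel)
leak = np.abs(X_sel.T @ X_tilde).max()                # residual correlation with X_sel
redundant_norm = np.linalg.norm(X_tilde[:, :3])       # what is left of the mixed columns
novel_norm = np.linalg.norm(X_tilde[:, 3:])           # what is left of the new columns
```

In this toy setting the mixed columns are reduced to nearly zero while the new columns keep most of their norm, which is the behavior the fit term in the composite score relies on.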
Selection in LOES balances residual reduction with desired geometric properties that improve probe stability and downstream separability. We quantify three geometric diagnostics (Bihani and Rayz, [2021](https://arxiv.org/html/2605.23033#bib.bib24); Kudrjashov et al., [2024](https://arxiv.org/html/2605.23033#bib.bib25)). The first is an isotropy score computed from the empirical covariance $\mathbf{\Sigma}_{\ell}=\tfrac{1}{N}\mathbf{X}_{\ell}^{\top}\mathbf{X}_{\ell}$. Let $\{\mu_{j}\}_{j=1}^{d_{\ell}}$ be the eigenvalues of $\mathbf{\Sigma}_{\ell}$ and $\overline{\mu}=\tfrac{1}{d_{\ell}}\sum_{j}\mu_{j}$. The isotropy score is

$$\mathrm{Iso}(\mathbf{X}_{\ell})=\frac{\overline{\mu}}{\sqrt{\operatorname{Var}(\{\mu_{j}\})+\delta}},\tag{4}$$

with small $\delta>0$ for numerical stability. High isotropy indicates an eigenspectrum concentrated around its mean (a flat spectrum) and a well-conditioned embedding. Such embeddings yield lower-variance ridge probes, more stable regression mappings, and typically smoother downstream optimization trajectories (Figure [3](https://arxiv.org/html/2605.23033#S3.F3)). The second diagnostic, $\mathrm{Red}_{\ell}$, complements Eq. ([3](https://arxiv.org/html/2605.23033#S3.E3)) rather than duplicating it: the two mechanisms act at different stages and on different objects. Eq. ([3](https://arxiv.org/html/2605.23033#S3.E3)) produces the geometric residual $\widetilde{\mathbf{X}}_{\ell}$ used *inside* the fit loss, whereas $\mathrm{Red}_{\ell}$ acts on the *raw* features through a normalized Frobenius inner product between column spaces,

$$\mathrm{Red}_{\ell}=\max_{j\in S}\frac{\|\mathbf{X}_{\ell}^{\top}\mathbf{X}_{j}\|_{F}}{\|\mathbf{X}_{\ell}\|_{F}\,\|\mathbf{X}_{j}\|_{F}},\tag{5}$$

measuring global alignment with each selected layer before any projection, so that high values indicate that the candidate largely re-expresses already captured directions. In short, Eq. ([3](https://arxiv.org/html/2605.23033#S3.E3)) controls *what enters* the fit loss, while Eq. ([5](https://arxiv.org/html/2605.23033#S3.E5)) controls *which candidates are admitted*. The third term is optional and used only for classification; for regression and dense prediction we set it to 0. On the orthogonalized features $\widetilde{\mathbf{X}}_{\ell}$, we compute class centroids and estimate the average triangle area among random triplets of centroids,
$$\mathrm{Tri}(\widetilde{\mathbf{X}}_{\ell})\approx\mathbb{E}_{(a,b,c)}\bigl[\tfrac{1}{2}\sqrt{\|b-a\|^{2}\|c-a\|^{2}-\langle b-a,c-a\rangle^{2}}\,\bigr].\tag{6}$$

A large value of Eq. ([6](https://arxiv.org/html/2605.23033#S3.E6)) indicates that class centroids span a higher-volume simplex rather than lying near a low-dimensional line, which correlates with linear separability and robustness to perturbations. Let $\widehat{\mathbf{Y}}$ denote the cumulative ridge prediction from layers in $S$ (initialized at $\mathbf{0}$) and $\mathbf{R}=\mathbf{Y}-\widehat{\mathbf{Y}}$ the current residual. LOES fits a ridge probe on each orthogonalized candidate $(\widetilde{\mathbf{X}}_{\ell},\mathbf{R})$ and computes the mean squared error loss, $\mathrm{Loss}_{\ell}$, which thus measures only the *marginal* contribution of layer $\ell$ beyond $S$. Candidates are scored by the composite objective

$$\mathrm{Score}(\ell)=\mathrm{Loss}_{\ell}+\alpha\bigl(1-\mathrm{Iso}(\mathbf{X}_{\ell})\bigr)+\gamma\,\mathrm{Red}_{\ell}-\eta\,\mathrm{Tri}(\widetilde{\mathbf{X}}_{\ell}),\tag{7}$$

with nonnegative trade-off parameters $\alpha,\gamma,\eta$. Note that $\mathrm{Iso}$ and $\mathrm{Red}_{\ell}$ are computed on the raw features $\mathbf{X}_{\ell}$ (geometry before any projection; $\mathrm{Iso}$ is independent of $S$, whereas $\mathrm{Red}_{\ell}$ compares against each layer in $S$), while $\mathrm{Loss}_{\ell}$ and $\mathrm{Tri}$ are computed on the orthogonalized residual $\widetilde{\mathbf{X}}_{\ell}$ (marginal contribution beyond $S$). Upon selecting the layer $\ell^{\star}$ that minimizes this score, we refit a ridge probe on the *raw* features $\mathbf{X}_{\ell^{\star}}$ against the original targets $\mathbf{Y}$ (not against $\mathbf{R}$). The resulting prediction is accumulated into the ensemble:

$$\widehat{\mathbf{Y}}\;\leftarrow\;\widehat{\mathbf{Y}}+\mathbf{X}_{\ell^{\star}}\mathbf{W}_{\ell^{\star}},$$

after which the residual $\mathbf{R}=\mathbf{Y}-\widehat{\mathbf{Y}}$ is recomputed, the layer is appended to the selected set $S$, and the procedure is repeated until the layer budget $K$ is reached. This scoring/refit asymmetry is deliberate: scoring uses $(\widetilde{\mathbf{X}}_{\ell},\mathbf{R})$ so candidates are ranked by marginal contribution, while refitting uses $(\mathbf{X}_{\ell^{\star}},\mathbf{Y})$ so the downstream probe operates on unmodified encoder features at inference.
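The selection procedure described above can be sketched end to end in NumPy. This is an illustrative reading of Eqs. (2) to (7), not the authors' Algorithm 1: all function names, the synthetic layers, $\lambda=1$, and the number of sampled triplets are assumptions. The first pick uses the simplified score (fit loss plus isotropy term), and later picks use the full composite score.

```python
import numpy as np

def ridge_fit(X, Y, lam=1.0):
    # Closed-form ridge weights, Eq. (2).
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ Y)

def orthogonalize(X, X_sel, eps=1e-6):
    # Ridge-regularized residual of X with respect to span(X_sel), Eq. (3).
    G = X_sel.T @ X_sel + eps * np.eye(X_sel.shape[1])
    return X - X_sel @ np.linalg.solve(G, X_sel.T @ X)

def isotropy(X, delta=1e-8):
    # Eq. (4): mean over standard deviation of the covariance eigenvalues.
    mu = np.linalg.eigvalsh(X.T @ X / X.shape[0])
    return mu.mean() / np.sqrt(mu.var() + delta)

def redundancy(X, chosen):
    # Eq. (5): largest normalized cross-Gram norm against the selected layers.
    nx = np.linalg.norm(X)
    return max(np.linalg.norm(X.T @ Xj) / (nx * np.linalg.norm(Xj)) for Xj in chosen)

def triangle_area(X, labels, rng, n_triplets=32):
    # Eq. (6): mean area of triangles spanned by random triplets of class centroids.
    cents = np.stack([X[labels == c].mean(axis=0) for c in np.unique(labels)])
    if len(cents) < 3:
        return 0.0
    areas = []
    for _ in range(n_triplets):
        a, b, c = cents[rng.choice(len(cents), size=3, replace=False)]
        u, v = b - a, c - a
        areas.append(0.5 * np.sqrt(max((u @ u) * (v @ v) - (u @ v) ** 2, 0.0)))
    return float(np.mean(areas))

def loes_select(layers, Y, labels, K=3, lam=1.0, alpha=1.0, gamma=0.5, eta=0.1, seed=0):
    rng = np.random.default_rng(seed)
    layers = [X - X.mean(axis=0) for X in layers]      # column-center every layer
    selected, Y_hat = [], np.zeros(Y.shape)
    while len(selected) < K:
        R = Y - Y_hat                                  # current residual
        best, best_score = None, np.inf
        for idx, X in enumerate(layers):
            if idx in selected:
                continue
            if selected:
                chosen = [layers[j] for j in selected]
                X_t = orthogonalize(X, np.hstack(chosen))
                red, tri = redundancy(X, chosen), triangle_area(X_t, labels, rng)
            else:                                      # initialization: simplified score
                X_t, red, tri = X, 0.0, 0.0
            loss = np.mean((X_t @ ridge_fit(X_t, R, lam) - R) ** 2)
            score = loss + alpha * (1.0 - isotropy(X)) + gamma * red - eta * tri  # Eq. (7)
            if score < best_score:
                best, best_score = idx, score
        # Refit on raw features against the original targets, then accumulate.
        Y_hat = Y_hat + layers[best] @ ridge_fit(layers[best], Y, lam)
        selected.append(best)
    return selected

# Synthetic example (assumption): 6 noise layers, layer 2 also carries class signal.
rng = np.random.default_rng(0)
N, d, C, L = 120, 16, 3, 6
labels = np.arange(N) % C
Y = np.eye(C)[labels]
layers = [rng.standard_normal((N, d)) for _ in range(L)]
layers[2] = layers[2] + 3.0 * rng.standard_normal((C, d))[labels]
selected = loes_select(layers, Y, labels, K=3)
fit_only_first = loes_select(layers, Y, labels, K=1, alpha=0.0, gamma=0.0, eta=0.0)
```

With all geometric weights set to zero, the first pick is driven by fit loss alone and lands on the layer that carries the class signal. With the default weights, the ranking also depends on the relative scale of the isotropy, redundancy, and triangle terms, so the selected set on such toy data need not coincide with the fit-only choice.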
In practice, LOES is applied on a modest calibration budget $N_{\mathrm{cal}}\ll$ full dataset size. The isotropy term improves conditioning of the probe and reduces variance in the fitted predictor, while the redundancy penalty avoids selecting layers that merely repackage existing directions. These geometric elements consistently improve stability and generalization over residual-only selection in our ablations.
Figure 2: GeoReg prevents representation collapse during fine-tuning. Validation accuracy with trainable BERT Base (Devlin et al., [2019](https://arxiv.org/html/2605.23033#bib.bib2)) on the TweetEval Emoji dataset (Barbieri et al., [2020](https://arxiv.org/html/2605.23033#bib.bib87), [2018](https://arxiv.org/html/2605.23033#bib.bib88)). Without GeoReg (green), accuracy degrades after $\sim$10k steps despite initial gains. GeoReg (magenta) maintains stable performance. The last-layer baseline (blue) exhibits similar collapse. Dots indicate the best checkpoint.

Furthermore, we provide a theoretical analysis (Appendix [A.7](https://arxiv.org/html/2605.23033#A1.SS7)) demonstrating that our isotropy maximization objective guarantees the selection of features that minimize the worst-case parameter estimation error.
### 3.3 Geometric Regularization (GeoReg)
LOES selects layers whose embeddings exhibit favorable geometry: high isotropy and well-separated class centroids. However, when the encoder is fine-tuned, gradient updates can degrade these properties, causing the fused representation to collapse into a low-dimensional subspace or class centroids to converge. This phenomenon, termed representation collapse, is well documented in self-supervised learning (Bardes et al., [2021](https://arxiv.org/html/2605.23033#bib.bib26); Balestriero and LeCun, [2025](https://arxiv.org/html/2605.23033#bib.bib27)), where methods such as VICReg and LeJEPA address it through variance-covariance regularization during *pretraining*. Related geometric losses that enforce inter-class orthogonality have also been explored for metric learning (Lezama et al., [2018](https://arxiv.org/html/2605.23033#bib.bib89)). GeoReg applies analogous principles at *transfer time*, preserving the geometric structure that motivated layer selection in the first place. Let $\{\mathbf{z}_{i}\}_{i=1}^{B}$ denote the fused embeddings for a minibatch, obtained by concatenating adapter (Appendix [A.3](https://arxiv.org/html/2605.23033#A1.SS3)) outputs from LOES-selected layers. GeoReg regularizes two complementary aspects of this representation. The first term penalizes *spectral imbalance* in the embedding covariance. Let $\mathbf{\Sigma}=\frac{1}{B}\sum_{i}(\mathbf{z}_{i}-\bar{\mathbf{z}})(\mathbf{z}_{i}-\bar{\mathbf{z}})^{\top}$ with eigenvalues $\{\lambda_{j}\}$. We define

$$\mathcal{L}_{\mathrm{iso}}=\mathrm{Var}(\{\lambda_{j}\}),\tag{8}$$

which is minimized when eigenvalues are uniform, corresponding to isotropic utilization of the representational capacity. The second term encourages *class separation* using the same geometric criterion as in LOES layer scoring (Eq. [6](https://arxiv.org/html/2605.23033#S3.E6)): we compute class centroids $\{\boldsymbol{\mu}_{c}\}$ from the fused embeddings within each minibatch and penalize configurations where centroids span low volume, indicating collapse toward a degenerate subspace. The combined objective is

$$\mathcal{L}_{\mathrm{GeoReg}}=\lambda_{\mathrm{geo}}\left(\mathcal{L}_{\mathrm{iso}}-\log(A+\epsilon)\right),\tag{9}$$

where $A$ denotes the centroid triangle area and $\epsilon$ is a small constant that ensures numerical stability.
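A forward-pass sketch of Eqs. (8) and (9) in NumPy is given below. It is only an illustration of the loss value, not a training-ready implementation: the function names and the value $\lambda_{\mathrm{geo}}=0.1$ are assumptions, the area $A$ is averaged over all centroid triplets rather than random ones, and the area term is skipped when a minibatch contains fewer than three classes (the source does not specify this case).

```python
import numpy as np
from itertools import combinations

def centroid_triangle_area(Z, labels):
    # Mean area of triangles over all triplets of class centroids (assumed variant of Eq. 6).
    cents = [Z[labels == c].mean(axis=0) for c in np.unique(labels)]
    areas = []
    for a, b, c in combinations(cents, 3):
        u, v = b - a, c - a
        areas.append(0.5 * np.sqrt(max((u @ u) * (v @ v) - (u @ v) ** 2, 0.0)))
    return float(np.mean(areas)) if areas else None

def georeg_loss(Z, labels, lam_geo=0.1, eps=1e-8):
    # Eq. (8): variance of the eigenvalues of the minibatch covariance.
    Zc = Z - Z.mean(axis=0)
    eigvals = np.linalg.eigvalsh(Zc.T @ Zc / Z.shape[0])
    l_iso = eigvals.var()
    # Eq. (9): subtract the log of the centroid triangle area.
    A = centroid_triangle_area(Z, labels)
    area_term = -np.log(A + eps) if A is not None else 0.0
    return lam_geo * (l_iso + area_term)

# Two synthetic minibatches (assumption): well-spread vs. collinear class centroids.
rng = np.random.default_rng(0)
B, d = 96, 8
labels = np.arange(B) % 3
noise = 0.1 * rng.standard_normal((B, d))
spread_cents = 2.0 * np.eye(d)[:3]                        # centroids on three axes
line_cents = np.outer([0.0, 2.0, 4.0], np.eye(d)[0])      # centroids on a single line
loss_spread = georeg_loss(spread_cents[labels] + noise, labels)
loss_collapsed = georeg_loss(line_cents[labels] + noise, labels)
```

In this toy comparison the collinear configuration receives the larger penalty, since its centroid triangle is nearly degenerate and its covariance spectrum is dominated by a single direction.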
Unlike VICReg and LeJEPA, which regularize encoder representations during pretraining, GeoReg operates on the*fused*multi\-layer embeddings during downstream adaptation\. This placement is deliberate: the geometry exploited by LOES must be preserved under competing fine\-tuning gradients\. As shown in Figure[2](https://arxiv.org/html/2605.23033#S3.F2), without GeoReg, validation accuracy degrades after initial gains when the encoder is trainable, a signature of representational collapse\. GeoReg stabilizes the optimization trajectory throughout the training\. When the encoder is frozen, GeoReg has no effect since the embedding geometry remains fixed, confirming that its benefit arises specifically from constraining how fine\-tuning reshapes the representation manifold\.
Figure 3:Layer\-wise representation geometry for CLIP\-B/32 on Stanford Cars\. Effective rank \(top; higher means more dimensions contribute\) and isotropy score \(bottom; higher means a flatter covariance eigenspectrum\) peak in mid layers\. Stars mark LOES\-selected layers, which align with high\-rank, near\-isotropic representations\.
## 4 Computational Efficiency
Table[1](https://arxiv.org/html/2605.23033#S4.T1)shows that LOES introduces minimal computational overhead across both image and text benchmarks\. In terms of time complexity, LOES adds a one\-time calibration and layer\-selection cost that scales linearly with the number of encoder layers and calibration samples and does not scale with training epochs\. Exact time complexity and pseudocode are provided in the Appendix\. During training, LOES differs from last\-layer transfer only by fusing a small number of intermediate representations, while the dominant cost remains that of the encoder’s forward pass, which is identical for both methods\. The parameter overhead introduced by LOES arises solely from the linear probe operating on the fused representation\. For a representative 100M\-parameter frozen backbone, the probe adds fewer than 1M parameters, corresponding to an increase of under 1% in total parameter count\.
Table 1: End-to-end wall-clock time in seconds over 15 epochs for last-layer transfer and LOES.

Empirically, this efficiency is reflected in wall-clock results, with LOES remaining within a few percent of the last-layer baseline across datasets and model depths, including ModernBERT and task-adapted variants for segmentation and regression. Overall, LOES is computationally efficient and a drop-in replacement for last-layer transfer.

Table 2: Pretraining paradigm influences layer selection patterns and LOES gains. Selected layers are shown for representative tasks (vision: Stanford Cars, CUB-200; language: MTOP, Emotion; speech: ASVspoof, CREMA-D). Relative gain is computed as the average percentage improvement of LOES over last-layer transfer across evaluated datasets.

Figure 4: LOES score distribution across encoder depth (lower is better). Models pretrained exclusively on ImageNet (ViT-IN21k, MAE, DeiT) exhibit monotonically decreasing scores toward final layers, indicating task-discriminative information concentrates at depth. CLIP, pretrained on 400M diverse image-text pairs, shows comparatively flatter profiles with competitive scores in mid-depth layers, consistent with the early-to-mid layer selections reported in Table [2](https://arxiv.org/html/2605.23033#S4.T2). These patterns suggest that pretraining data diversity influences how task-relevant information is distributed across the encoder hierarchy.
## 5Experimental setup
All experiments follow a single, consistent transfer pipeline unless otherwise stated\. Layer\-wise embeddings are extracted from pretrained encoders and used to train a linear probe\. All probe and adapter training runs use 15 epochs\. Optimization is performed with AdamW\(Loshchilov and Hutter,[2017](https://arxiv.org/html/2605.23033#bib.bib60)\)with weight decay1×10−41\\times 10^\{\-4\}and a cosine\-annealing learning\-rate schedule\. The probe and adapters use a learning rate of1×10−41\\times 10^\{\-4\}, while backbone fine\-tuning, when enabled, uses1×10−51\\times 10^\{\-5\}\. The random seed is fixed to 0 for all experiments\. Standard batch size is 256\. Layer selection and scoring are performed on a calibration split comprising20%20\\%of the training data\. Based on the calibration study \(Appendix Table[A3](https://arxiv.org/html/2605.23033#A1.T3)\), we fix the calibration fraction to 20% in all subsequent experiments, as it consistently achieves near\-peak performance across datasets with substantially fewer calibration samples\.
Hyperparameters\. LOES uses a small set of interpretable hyperparameters controlling isotropy \($\alpha$\), redundancy \($\gamma$\), class geometry \($\eta$\), and the number of selected layers $k$\. In all experiments, we use $\alpha = 1.0$, $\gamma = 0.5$, $\eta = 0.1/0.0$, and select $k \in \{3, 4\}$ layers, which consistently yields optimal or near\-optimal performance\. We observe that performance typically saturates after $k = 3$–$4$, with negligible gains beyond this range\. A detailed ablation and sensitivity analysis validating these choices is provided in Appendix [A\.4\.1](https://arxiv.org/html/2605.23033#A1.SS4.SSS1) \(Appendix Table [A1](https://arxiv.org/html/2605.23033#A1.T1)\)\.
Models\. We evaluate LOES on a diverse set of pretrained encoders covering multiple pretraining paradigms\. Vision models include DINOv2 \(Oquab et al\., [2023](https://arxiv.org/html/2605.23033#bib.bib61)\) and DINOv3 \(Siméoni et al\., [2025](https://arxiv.org/html/2605.23033#bib.bib62)\) \(self\-distillation\), MAE \(masked reconstruction\), DeiT \(Touvron et al\., [2021](https://arxiv.org/html/2605.23033#bib.bib63)\) and ViT \(Dosovitskiy, [2020](https://arxiv.org/html/2605.23033#bib.bib7)\) \(supervised or distillation\-based pretraining\), and CLIP \(Chiu et al\., [2024](https://arxiv.org/html/2605.23033#bib.bib11)\) \(contrastive vision\-language alignment\)\. For language, we consider both classical and modern masked language models, including BERT\-large \(Devlin et al\., [2019](https://arxiv.org/html/2605.23033#bib.bib2)\) and ModernBERT \(Warner et al\., [2024](https://arxiv.org/html/2605.23033#bib.bib59)\), enabling analysis of depth and large\-scale pretraining effects\. We additionally evaluate speech encoders such as Wav2Vec 2\.0 \(Baevski et al\., [2020](https://arxiv.org/html/2605.23033#bib.bib65)\), trained via self\-supervised temporal prediction\.
Tasks\. We evaluate LOES across a diverse set of downstream tasks, including classification, regression, and dense prediction\. Image classification experiments are conducted on ImageNet\-1K \(Russakovsky et al\., [2015](https://arxiv.org/html/2605.23033#bib.bib57)\), CUB\-200 \(Wah et al\., [2011](https://arxiv.org/html/2605.23033#bib.bib66)\), Stanford Cars \(Krause et al\., [2013](https://arxiv.org/html/2605.23033#bib.bib58)\), CIFAR\-100 \(Krizhevsky et al\., [2009](https://arxiv.org/html/2605.23033#bib.bib67)\), DTD \(Cimpoi et al\., [2014](https://arxiv.org/html/2605.23033#bib.bib68)\), Mini\-ImageNet \(Russakovsky et al\., [2015](https://arxiv.org/html/2605.23033#bib.bib57)\), and SUN397 \(Xiao et al\., [2010](https://arxiv.org/html/2605.23033#bib.bib55), [2014](https://arxiv.org/html/2605.23033#bib.bib56)\)\. Text classification is evaluated on multiple MTEB \(Enevoldsen et al\., [2025](https://arxiv.org/html/2605.23033#bib.bib70); Muennighoff et al\., [2023](https://arxiv.org/html/2605.23033#bib.bib69)\) benchmarks, including Emotion \(Saravia et al\., [2018](https://arxiv.org/html/2605.23033#bib.bib71)\), Amazon Massive Intent \(FitzGerald et al\., [2022](https://arxiv.org/html/2605.23033#bib.bib72)\), Amazon Massive Scenario \(FitzGerald et al\., [2022](https://arxiv.org/html/2605.23033#bib.bib72)\), Amazon Counterfactual \(O’Neill et al\., [2021](https://arxiv.org/html/2605.23033#bib.bib73)\), MTOP Domain Classification \(Li et al\., [2021](https://arxiv.org/html/2605.23033#bib.bib74)\), Banking77 \(Casanueva et al\., [2020](https://arxiv.org/html/2605.23033#bib.bib75)\), Tweet Sentiment Extraction \(Maggie, [2020](https://arxiv.org/html/2605.23033#bib.bib76)\), and Toxic Conversations 50K \(cjadams et al\., [2019](https://arxiv.org/html/2605.23033#bib.bib77)\)\.
We further evaluate speech classification on ASVspoof 2019 \(Wang et al\., [2020](https://arxiv.org/html/2605.23033#bib.bib78)\), CREMA\-D \(Cao et al\., [2014](https://arxiv.org/html/2605.23033#bib.bib79)\), and Google Speech Commands \(Warden, [2018](https://arxiv.org/html/2605.23033#bib.bib80)\); regression/classification on multimodal and vision–language datasets such as Amazon Products 23 \(Asaniczka, [2023](https://arxiv.org/html/2605.23033#bib.bib81)\), Fakeddit \(Nakamura et al\., [2020](https://arxiv.org/html/2605.23033#bib.bib83)\), and FashionGen \(Rostamzadeh et al\., [2018](https://arxiv.org/html/2605.23033#bib.bib82)\); and dense prediction on the semantic segmentation benchmarks Cityscapes \(Cordts et al\., [2016](https://arxiv.org/html/2605.23033#bib.bib84)\) and COCOStuff \(Caesar et al\., [2018](https://arxiv.org/html/2605.23033#bib.bib85)\)\. For classification, LOES is used in its standard form to select a task\-discriminative subset of layers, whose representations are concatenated and fed to a linear classifier\. For regression and dense prediction, the LOES objective is adapted by replacing classification\-specific terms with residual and geometric criteria suited to continuous or pixel\-level supervision \(see Appendix\)\. Across all settings, the core principle remains unchanged: selecting complementary, non\-redundant layers that maximize linearly accessible task signal\.
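The classification pipeline described above (concatenate the selected layers, then apply a linear classifier) can be illustrated with a minimal sketch. The function names, the dictionary layout keyed by layer index, and the toy numbers are hypothetical; pooling of token or patch embeddings into one vector per sample is assumed to have happened upstream, and the selection step itself is not shown.

```python
from typing import Dict, List, Sequence

Matrix = List[List[float]]  # rows are samples, columns are features


def concat_selected_layers(layer_embeddings: Dict[int, Matrix], selected: Sequence[int]) -> Matrix:
    """Concatenate per-sample embeddings from the selected layers, in the given order."""
    num_samples = len(layer_embeddings[selected[0]])
    for layer in selected:
        if len(layer_embeddings[layer]) != num_samples:
            raise ValueError(f"layer {layer} has a different number of samples")
    fused: Matrix = []
    for i in range(num_samples):
        row: List[float] = []
        for layer in selected:
            row.extend(layer_embeddings[layer][i])
        fused.append(row)
    return fused


def linear_logits(features: Matrix, weights: Matrix, bias: List[float]) -> Matrix:
    """Linear classifier: one weight row and one bias entry per class."""
    return [
        [sum(x * w for x, w in zip(row, w_c)) + b_c for w_c, b_c in zip(weights, bias)]
        for row in features
    ]


# Toy data: three layers, two samples, two features per layer.
toy = {
    0: [[1.0, 0.0], [0.0, 1.0]],
    6: [[2.0, 2.0], [3.0, 3.0]],
    11: [[0.5, 0.5], [1.5, 1.5]],
}
fused = concat_selected_layers(toy, selected=[6, 11])
logits = linear_logits(
    fused,
    weights=[[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]],
    bias=[0.0, 1.0],
)
```

With $k$ selected layers of width $d$, the probe input has $k \cdot d$ features, so the classifier grows linearly in $k$.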
Table 3: Test\-set accuracy \(%\) comparing the last\-layer baseline and LOES with $k=3$\. All results are reported as mean $\pm$ standard deviation over 5 independent runs with different random seeds\. The best result for each model–dataset pair is highlighted in bold\.

| Encoder / State | Method \($k$\) | S\. Cars | Mini\-IN | S\. Dogs | CUB\-200 | DTD | CIFAR\-100 | SUN397 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| DINOv2\-S/14 \(Frozen\) | Last Layer \(–\) | 49.4 | 93.7 | 81.3 | 76.3 | 61.6 | 83.4 | 68.5 |
| | Learnable Weights \(–\) | 49.6 | 94.0 | 82.6 | 77.8 | 67.8 | 83.4 | 69.9 |
| | Last 3 Layers Concat \(–\) | 50.1 | 94.1 | 83.2 | 77.9 | 67.8 | 84.4 | 70.1 |
| | LOES \+ GeoReg \(2\) | 55.7 | 94.0 | 83.3 | 80.3 | 67.4 | 83.9 | 71.0 |
| | LOES \+ GeoReg \(4\) | 60.3 | 94.1 | 84.3 | 83.1 | 69.6 | 84.2 | 72.7 |
| | LOES \+ GeoReg \(3\) | 60.1 | 94.2 | 84.0 | 83.1 | 68.6 | 84.2 | 72.0 |
| DINOv2\-S/14 \(Trainable\) | Last Layer \(–\) | 76.8 | 93.7 | 80.5 | 79.2 | 72.6 | 90.8 | 66.4 |
| | Learnable Weights \(–\) | 78.0 | 93.8 | 80.2 | 80.8 | 73.4 | 91.0 | 66.8 |
| | Last 3 Layers Concat \(–\) | 79.3 | 93.3 | 80.4 | 80.8 | 72.4 | 90.4 | 66.9 |
| | LOES \+ GeoReg \(2\) | 80.6 | 93.9 | 81.0 | 81.2 | 73.6 | 91.1 | 68.3 |
| | LOES \+ GeoReg \(4\) | 81.9 | 94.0 | 81.1 | 81.9 | 72.8 | 91.0 | 69.0 |
| | LOES \(no GeoReg\) \(3\) | 81.9 | 94.0 | 81.2 | 82.1 | 74.7 | 91.0 | 68.9 |
| | LOES \+ GeoReg \(3\) | 82.7 | 94.8 | 81.3 | 82.2 | 74.8 | 91.1 | 69.9 |

Table 4: Layer selection and geometric regularization improve transfer accuracy\. Top\-1 classification accuracy \(%\) across seven image benchmarks using DINOv2\-S/14\.

Table 5: Test accuracy \(%\) across MTEB classification datasets\. Average is computed over all datasets excluding Toxic\. Horizontal rules separate *Last*, *Last\-4*, and *LOES \($k=4$\)*\. Best result per model block is bolded\.
## 6 Results and Discussion
### 6\.1 Pretraining Paradigm Shapes Layer Selection
A central finding of our experiments is that the pretraining objective influences which layers encode task\-discriminative information\. The layers ultimately selected, and the extent to which LOES improves over baselines, may also depend on the downstream task\. Table [2](https://arxiv.org/html/2605.23033#S4.T2) summarizes layer selection patterns across modalities for different pretraining paradigms, while Figure [4](https://arxiv.org/html/2605.23033#S4.F4) visualizes the LOES score distribution across layers for a given downstream task\.
### 6\.2 LOES Improves Existing Layer Selection Methods
LOES\-selected embeddings outperform not only last\-layer transfer \(Figure [5](https://arxiv.org/html/2605.23033#S6.F5), Table [3](https://arxiv.org/html/2605.23033#S5.T3)\) but also learnable layer weights, fixed concatenation of final layers, and some other recently proposed selection methods\.
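For reference, two of the multi-layer baselines named above can be sketched as follows. The exact form of the learnable-weights baseline is not given in this section; the sketch assumes a softmax-normalized scalar weight per layer (a scalar mix), which is one common way to realize it, and all function names are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax over per-layer scalar scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_layer_mix(layer_vectors: Sequence[Vector], layer_scores: Sequence[float]) -> Vector:
    """Softmax-weighted sum over all layers; in practice the scores are trained with the probe."""
    weights = softmax(layer_scores)
    dim = len(layer_vectors[0])
    return [sum(w * vec[j] for w, vec in zip(weights, layer_vectors)) for j in range(dim)]


def last_n_concat(layer_vectors: Sequence[Vector], n: int = 3) -> Vector:
    """Fixed baseline: concatenate the final n layers without any selection."""
    return [x for vec in layer_vectors[-n:] for x in vec]


# Toy embeddings for one sample across four layers (two features each).
layers = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 2.0]]
mixed = weighted_layer_mix(layers, layer_scores=[0.0, 0.0, 0.0, 0.0])  # uniform weights
stacked = last_n_concat(layers, n=3)
```

Both baselines fix in advance which layers contribute (all of them, or the last few); a selection method instead chooses the contributing subset per task.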
Figure 5: LOES boosts performance and leads to faster convergence across multiple downstream tasks such as classification, segmentation, and regression using popular foundation models such as DINOv2, ModernBERT, and CLIP\.

Vision Encoders: Layer selection patterns correlate with pretraining data diversity \(Table [2](https://arxiv.org/html/2605.23033#S4.T2)\)\. CLIP, trained on 400 million image\-text pairs, and DINOv2/DINOv3, trained on the 142\-million image LVD dataset, consistently select early\-to\-middle layers \(\[5, 6, 4\] for CLIP on Stanford Cars; \[11, 10, 7\] for DINOv2\)\. In contrast, MAE, DeiT, and ViT\-IN21k, pretrained exclusively on ImageNet variants, select predominantly final layers \(\[11, 10, 9\] for MAE; \[10, 9, 8\] for ViT\-IN21k\)\. This pattern suggests that models pretrained on larger and more diverse data distribute task\-relevant information more evenly across depth, whereas models trained on narrower distributions concentrate discriminative features in later layers\. Consequently, explicit layer inspection via methods like LOES becomes more valuable as pretraining corpora grow in scale and diversity\.
On Stanford Cars with trainable DINOv2\-S/14 \(Table [4](https://arxiv.org/html/2605.23033#S5.T4)\), last\-layer transfer achieves 76\.8%, last\-3 concatenation achieves 79\.3%, learnable weights achieve 78\.0%, while LOES achieves 82\.7%, demonstrating that principled selection outperforms both single\-layer and naive multi\-layer baselines\. LOES also extends to dense prediction tasks: on Cityscapes and COCOStuff semantic segmentation \(Appendix Table [A13](https://arxiv.org/html/2605.23033#A1.T13)\), LOES consistently selects early layers alongside the final layer \(\[0, 9, 11\] for DINOv2; \[0, 3, 11\] for BEiT\), with BEiT showing the largest gain \(\+3\.14 mIoU on Cityscapes\)\. We additionally compared LOES against exhaustive search over all 220 three\-layer subsets of frozen BERT\-base on MTOP and Emotion \(Appendix Table [A5](https://arxiv.org/html/2605.23033#A1.T5)\)\. LOES recovers a subset within 1\.16 and 1\.69 percentage points of the global optimum at a small fraction of the search cost, supporting its use on deeper encoders where exhaustive enumeration is infeasible\.
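The figure of 220 subsets matches the number of ways to choose 3 layers out of the 12 transformer layers of BERT-base, i.e. $\binom{12}{3} = 220$. The sketch below enumerates such subsets and shows how the count grows for deeper encoders. The scoring function is a hypothetical stand-in for training and evaluating a probe on each subset, and treating exactly the 12 transformer layers as candidates is an assumption.

```python
import itertools
import math
from typing import Callable, Tuple


def exhaustive_best_subset(num_layers: int, k: int, evaluate: Callable[[Tuple[int, ...]], float]):
    """Evaluate every k-layer subset; return (best_subset, best_score, num_evaluated)."""
    best_subset, best_score, count = None, float("-inf"), 0
    for subset in itertools.combinations(range(num_layers), k):
        score = evaluate(subset)
        count += 1
        if score > best_score:  # strict comparison keeps the first subset among ties
            best_subset, best_score = subset, score
    return best_subset, best_score, count


def toy_score(subset: Tuple[int, ...]) -> float:
    # Hypothetical stand-in for probe accuracy: rewards subsets that span
    # different depth ranges, with a small bonus for including a late layer.
    return len({layer // 4 for layer in subset}) + 0.01 * max(subset)


best, score, evaluated = exhaustive_best_subset(num_layers=12, k=3, evaluate=toy_score)

# Number of probes an exhaustive search would need at other depths and k.
subsets_12_choose_3 = math.comb(12, 3)
subsets_24_choose_4 = math.comb(24, 4)
```

Each enumerated subset corresponds to one full probe training run, so the cost scales with the binomial coefficient: 220 runs for 12 layers and $k=3$, but 10626 runs for 24 layers and $k=4$.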
Language Encoders: Model depth amplifies the benefits of layer selection\. BERT\-large \(24 layers\) improves modestly from 95\.32% to 97\.81% on MTOP \(Table [5](https://arxiv.org/html/2605.23033#S5.T5)\), with LOES selecting predominantly late layers \[9, 19, 20, 21\]\. ModernBERT \(22 layers\) improves substantially from 78\.07% to 94\.48%, with LOES selecting \[3, 15, 1, 4\], spanning early and later layers\. Embeddings selected by LOES also outperform \(Table [5](https://arxiv.org/html/2605.23033#S5.T5)\) selection based on the intrinsic dimensionality of layers \(Cheng et al\., [2024](https://arxiv.org/html/2605.23033#bib.bib53); Razzhigaev et al\., [2024](https://arxiv.org/html/2605.23033#bib.bib54)\) and a recent layer\-selection framework \(Skean et al\., [2025](https://arxiv.org/html/2605.23033#bib.bib3)\) that introduced information\-theoretic, geometric, and augmentation\-invariance metrics such as entropy and curvature\.
LOES also outperforms a stronger learned fusion baseline that applies per\-layer linear projections and concatenates across all 22 layers of MBERT\-base \(Appendix Table [A9](https://arxiv.org/html/2605.23033#A1.T9)\): LOES\-4 reaches 94\.19 on MTOP and 78\.51 on AM\-Scenario, against 93\.91 and 77\.51 for the all\-layer learned fusion, despite using 1\.7x fewer parameters and FLOPs\. This indicates that the gains come from the selection criterion itself rather than from the richer concatenated representation\.
Figure 6: Cross\-lingual evaluation on Amazon Massive Scenario \(mBERT\-base\)\. Left: LOES \($k=4$\) outperforms baselines, with larger gains on underrepresented languages \(Hindi \+6\.5%, Arabic \+7\.2%, Urdu \+10\.9% over last\-3\)\. Right: LOES consistently selects mid\-depth layer 6 alongside the final layer across languages, indicating cross\-lingually transferable structure at intermediate depths\.

Cross\-lingual results on Amazon MASSIVE with mBERT\-base \(22 layers\) \(Marone et al\., [2025](https://arxiv.org/html/2605.23033#bib.bib86)\) \(Figure [6](https://arxiv.org/html/2605.23033#S6.F6), Table [A12](https://arxiv.org/html/2605.23033#A1.T12)\) reveal that LOES benefits vary substantially across languages\. High\-resource languages with Latin scripts \(English, German, French\) show moderate gains over the last\-3 baseline \(\+2–3%\)\. However, languages underrepresented in typical pretraining corpora show markedly larger improvements: Hindi \(\+6\.5%\), Arabic \(\+7\.2%\), and Urdu \(\+10\.9%\)\. Despite these differences in magnitude, LOES consistently selects layer 6 alongside the final layer across all languages, suggesting that mid\-depth representations encode cross\-lingually transferable structure that complements language\-specific features in later layers\. These results indicate that LOES may be particularly beneficial for transfer to underrepresented languages, where the final layer alone inadequately captures task\-relevant information\. Full cross\-lingual results are provided in Appendix Table [A12](https://arxiv.org/html/2605.23033#A1.T12)\.
Speech Encoders: Wav2Vec 2\.0 \(Table [A14](https://arxiv.org/html/2605.23033#A1.T14)\) exhibits task\-dependent layer selection\. For spoofed speech detection \(ASVspoof 2019\), LOES selects early layers \[3, 0, 1, 2\], improving from 90\.89% to 98\.80%\. For emotion recognition \(CREMA\-D\), mid\-depth layers \[7, 6, 8, 5\] are selected, improving from 37\.84% to 69\.06%\. This heterogeneity suggests that acoustic artifacts relevant to spoofing detection manifest in early layers, while paralinguistic features for emotion recognition emerge at intermediate depths\.
Multimodal Regression Tasks: LOES extends beyond classification to regression and multimodal settings \(Appendix Table [A11](https://arxiv.org/html/2605.23033#A1.T11)\)\. On Amazon Products 23, LOES reduces RMSE from 161\.80 to 154\.44; on FashionGen, from 527\.19 to 460\.46\. In both cases, LOES selects mid\-to\-late layers \(\[5, 8, 10, 11\] and \[5, 9, 10, 11\] respectively\), indicating that the framework generalizes across output modalities while maintaining interpretable layer selection patterns\.
### 6\.3 LOES Outperforms Greedy and Random Layer Selection
We include Random baselines and a Greedy selection baseline \(Table [6](https://arxiv.org/html/2605.23033#S6.T6)\) that picks the top 4 layers by probe accuracy \(for fair comparison with LOES\-4\), evaluated on MBERT\-B across two datasets\. For Greedy, we train one probe per layer under two settings, 1 or 5 epochs, which incurs notable overhead \(2\.8 min for Greedy1, 14 min for Greedy5\)\. LOES\-4 outperforms Random \(2\-5\) and Greedy1 and matches Greedy5, while selecting layers 32x and 160x faster than Greedy1 and Greedy5, respectively\.
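The Greedy baseline and the reported speedups can be made concrete with a short sketch. The per-layer accuracies below are invented for illustration and are not results from the paper; only the timings (2.8 min, 14 min) and speedup factors (32x, 160x) are taken from the text.

```python
from typing import Dict, List


def greedy_top_k(probe_accuracy: Dict[int, float], k: int = 4) -> List[int]:
    """Pick the k layers with the highest single-layer probe accuracy."""
    ranked = sorted(probe_accuracy, key=lambda layer: probe_accuracy[layer], reverse=True)
    return ranked[:k]


def implied_selection_seconds(baseline_minutes: float, speedup: float) -> float:
    """Selection time implied by a baseline duration and a reported speedup factor."""
    return baseline_minutes * 60.0 / speedup


# Hypothetical per-layer probe accuracies (layer index -> accuracy).
toy_accuracy = {0: 0.41, 3: 0.63, 6: 0.78, 9: 0.74, 15: 0.81, 21: 0.70}
selected = greedy_top_k(toy_accuracy, k=4)

# Timings and speedups quoted in the text.
from_greedy1 = implied_selection_seconds(2.8, 32)
from_greedy5 = implied_selection_seconds(14.0, 160)
```

Taken at face value, both reported speedups imply the same LOES selection time of roughly 5 seconds, so the two figures are mutually consistent. Note that ranking layers independently, as Greedy does, cannot account for redundancy between highly ranked neighboring layers.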
Table 6: Comparison of LOES with Random\-$k$ and Greedy layer selection on MBERT\-B\. LOES\-4 outperforms Random and Greedy1, while Greedy5 achieves comparable performance at significantly higher computational cost\.
## 7 Conclusion
We presented LOES, a spectral framework for identifying task\-discriminative layers in pretrained encoders\. Our experiments reveal that task\-relevant information is distributed non\-monotonically across depth, challenging the widespread reliance on final\-layer representations\. The distribution of useful features depends systematically on pretraining: models trained on large, diverse corpora encode transferable structure in earlier layers, while those trained on narrower distributions concentrate information at depth\. This finding suggests that as foundation models scale and diversify, principled layer selection becomes increasingly valuable\. LOES consistently outperforms last\-layer transfer, learnable weighting, and prior selection methods across vision, language, speech, and multimodal benchmarks\. Cross\-lingual evaluation reveals that underrepresented languages benefit disproportionately, suggesting practical value for low\-resource transfer\. Beyond accuracy improvements, LOES provides interpretable insights into how pretrained models organize knowledge\. The layer selection patterns expose which depths encode task\-relevant structure, offering a complementary view to existing probing methodologies\. We hope this work encourages further investigation into the geometry of intermediate representations and their role in effective transfer learning\.
## Acknowledgements
The authors thank the Infosys Centre for AI and Indraprastha Institute of Information Technology Delhi \(IIIT\-Delhi\) for financial support and for providing computational infrastructure used in this work\. We also thank the members of SBILab for helpful discussions and feedback\.
## Software and Data Availability
## Impact Statement
This paper presents work whose goal is to improve transfer learning and interpretability for pretrained foundation models\. By identifying task\-relevant intermediate layers, LOES can reduce reliance on full\-model adaptation and provide insight into how useful information is distributed across model depth\. This may make downstream adaptation more efficient and more transparent in some settings\. We do not introduce new datasets involving sensitive personal information, and all experiments are conducted on established public benchmarks\.
## Accessibility
We have aimed to make the paper accessible by using standard LaTeX typesetting, vector\-based figures where possible, descriptive captions, and visualizations that do not rely solely on color for interpretation\.
Footnote: Average relative gain is computed as the mean of per\-dataset percentage improvements: $\frac{1}{|D|}\sum_{d\in D}\frac{\text{LOES}_{d}-\text{Last}_{d}}{\text{Last}_{d}}\times 100$\.
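As a worked illustration of the relative-gain formula, the sketch below applies it to the frozen DINOv2-S/14 rows of the image-classification results table (Last Layer versus LOES + GeoReg with $k=3$). The function name is hypothetical, and this particular pairing of rows is chosen only as an example; it is not necessarily the set of datasets behind the relative gains reported in Table 2.

```python
from typing import Sequence


def average_relative_gain(loes: Sequence[float], last: Sequence[float]) -> float:
    """Mean over datasets of (LOES_d - Last_d) / Last_d, expressed in percent."""
    if len(loes) != len(last) or not loes:
        raise ValueError("need two non-empty sequences of equal length")
    return 100.0 * sum((l - b) / b for l, b in zip(loes, last)) / len(loes)


# Accuracy (%) on S. Cars, Mini-IN, S. Dogs, CUB-200, DTD, CIFAR-100, SUN397.
last_layer = [49.4, 93.7, 81.3, 76.3, 61.6, 83.4, 68.5]
loes_k3 = [60.1, 94.2, 84.0, 83.1, 68.6, 84.2, 72.0]

gain = average_relative_gain(loes_k3, last_layer)
```

Because each term is normalized by its own baseline, datasets with a low last-layer accuracy (here Stanford Cars) contribute more to the average than the same absolute improvement would on a near-saturated dataset.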
## References
- G\. Alain and Y\. Bengio \(2016\)Understanding intermediate layers using linear classifier probes\.arXiv preprint arXiv:1610\.01644\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- G\. Andriopoulos, Z\. Dong, L\. Guo, Z\. Zhao, and K\. Ross \(2024\)The prevalence of neural collapse in neural multivariate regression\.Advances in Neural Information Processing Systems37,pp\. 126417–126451\.Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- Asaniczka \(2023\)Amazon products dataset 2023 \(1\.4m products\)\.Kaggle\.External Links:[Link](https://www.kaggle.com/ds/3798081),[Document](https://dx.doi.org/10.34740/KAGGLE/DS/3798081)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- A\. Baevski, Y\. Zhou, A\. Mohamed, and M\. Auli \(2020\)Wav2vec 2\.0: a framework for self\-supervised learning of speech representations\.Advances in neural information processing systems33,pp\. 12449–12460\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- R\. Balestriero and Y\. LeCun \(2025\)Lejepa: provable and scalable self\-supervised learning without the heuristics\.arXiv preprint arXiv:2511\.08544\.Cited by:[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p2.1),[§3\.3](https://arxiv.org/html/2605.23033#S3.SS3.p1.3)\.
- F\. Barbieri, J\. Camacho\-Collados, L\. Espinosa\-Anke, and L\. Neves \(2020\)TweetEval:Unified Benchmark and Comparative Evaluation for Tweet Classification\.InProceedings of Findings of EMNLP,Cited by:[Figure 2](https://arxiv.org/html/2605.23033#S3.F2),[Figure 2](https://arxiv.org/html/2605.23033#S3.F2.2.1.1)\.
- F\. Barbieri, J\. Camacho\-Collados, F\. Ronzano, L\. Espinosa\-Anke, M\. Ballesteros, V\. Basile, V\. Patti, and H\. Saggion \(2018\)Semeval 2018 task 2: multilingual emoji prediction\.InProceedings of The 12th International Workshop on Semantic Evaluation,pp\. 24–33\.Cited by:[Figure 2](https://arxiv.org/html/2605.23033#S3.F2),[Figure 2](https://arxiv.org/html/2605.23033#S3.F2.2.1.1)\.
- A\. Bardes, J\. Ponce, and Y\. LeCun \(2021\)Vicreg: variance\-invariance\-covariance regularization for self\-supervised learning\.arXiv preprint arXiv:2105\.04906\.Cited by:[§A\.4\.5](https://arxiv.org/html/2605.23033#A1.SS4.SSS5.p1.1),[§3\.3](https://arxiv.org/html/2605.23033#S3.SS3.p1.3)\.
- L\. Bertinetto, J\. F\. Henriques, P\. H\. Torr, and A\. Vedaldi \(2018\)Meta\-learning with differentiable closed\-form solvers\.arXiv preprint arXiv:1805\.08136\.Cited by:[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p2.1)\.
- G\. Bihani and J\. Rayz \(2021\)Low anisotropy sense retrofitting \(laser\): towards isotropic and sense enriched representations\.InProceedings of deep learning inside out \(DeeLIO\): The 2nd workshop on knowledge extraction and integration for deep learning architectures,pp\. 81–95\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p5.4)\.
- D\. Bolya, P\. Huang, P\. Sun, J\. H\. Cho, A\. Madotto, C\. Wei, T\. Ma, J\. Zhi, J\. Rajasegaran, H\. Rasheed, J\. Wang, M\. Monteiro, H\. Xu, S\. Dong, N\. Ravi, D\. Li, P\. Dollár, and C\. Feichtenhofer \(2025\)Perception encoder: the best visual embeddings are not at the output of the network\.arXiv preprint arXiv:2504\.13181\.External Links:[Document](https://dx.doi.org/10.48550/arXiv.2504.13181)Cited by:[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p1.1)\.
- H\. Caesar, J\. Uijlings, and V\. Ferrari \(2018\)Coco\-stuff: thing and stuff classes in context\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 1209–1218\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- H\. Cao, D\. G\. Cooper, M\. K\. Keutmann, R\. C\. Gur, A\. Nenkova, and R\. Verma \(2014\)Crema\-d: crowd\-sourced emotional multimodal actors dataset\.IEEE transactions on affective computing5\(4\),pp\. 377–390\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- I\. Casanueva, T\. Temčinas, D\. Gerz, M\. Henderson, and I\. Vulić \(2020\)Efficient intent detection with dual sentence encoders\.InProceedings of the 2nd Workshop on Natural Language Processing for Conversational AI,T\. Wen, A\. Celikyilmaz, Z\. Yu, A\. Papangelis, M\. Eric, A\. Kumar, I\. Casanueva, and R\. Shah \(Eds\.\),Online,pp\. 38–45\.External Links:[Document](https://dx.doi.org/10.18653/v1/2020.nlp4convai-1.5),[Link](https://aclanthology.org/2020.nlp4convai-1.5)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- M\. Chen, A\. Radford, R\. Child, J\. Wu, H\. Jun, D\. Luan, and I\. Sutskever \(2020\)Generative pretraining from pixels\.InInternational conference on machine learning,pp\. 1691–1703\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- E\. Cheng, D\. Doimo, C\. Kervadec, I\. Macocco, J\. Yu, A\. Laio, and M\. Baroni \(2024\)Emergence of a high\-dimensional abstraction phase in language transformers\.arXiv preprint arXiv:2405\.15471\.Cited by:[§6\.2](https://arxiv.org/html/2605.23033#S6.SS2.p4.1)\.
- S\. Chiu, C\. Wu, J\. Hsieh, Y\. Tsao, and H\. Wang \(2024\)Learnable layer selection and model fusion for speech self\-supervised learning models\.InProc\. Interspeech 2024,pp\. 3914–3918\.Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8),[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- M\. Cimpoi, S\. Maji, I\. Kokkinos, S\. Mohamed, and A\. Vedaldi \(2014\)Describing textures in the wild\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 3606–3613\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- cjadams, D\. Borkan, inversion, J\. Sorensen, L\. Dixon, L\. Vasserman, and nithum \(2019\)Jigsaw unintended bias in toxicity classification\.Kaggle\.External Links:[Link](https://kaggle.com/competitions/jigsaw-unintended-bias-in-toxicity-classification)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- M\. Cordts, M\. Omran, S\. Ramos, T\. Rehfeld, M\. Enzweiler, R\. Benenson, U\. Franke, S\. Roth, and B\. Schiele \(2016\)The cityscapes dataset for semantic urban scene understanding\.InProceedings of the IEEE conference on computer vision and pattern recognition,pp\. 3213–3223\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- W\. De Vries, A\. Van Cranenburgh, and M\. Nissim \(2020\)What’s so special about bert’s layers? a closer look at the nlp pipeline in monolingual and multilingual models\.arXiv preprint arXiv:2004\.06499\.Cited by:[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\)Bert: pre\-training of deep bidirectional transformers for language understanding\.InProceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 \(long and short papers\),pp\. 4171–4186\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1),[Figure 2](https://arxiv.org/html/2605.23033#S3.F2),[Figure 2](https://arxiv.org/html/2605.23033#S3.F2.2.1.1),[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- A\. Dosovitskiy \(2020\)An image is worth 16x16 words: transformers for image recognition at scale\.arXiv preprint arXiv:2010\.11929\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- A\. El\-Nouby, M\. Klein, S\. Zhai, M\. A\. Bautista, A\. Toshev, V\. Shankar, J\. M\. Susskind, and A\. Joulin \(2024\)Scalable pre\-training of large autoregressive image models\.arXiv preprint arXiv:2401\.08541\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- K\. Enevoldsen, I\. Chung, I\. Kerboua, M\. Kardos, A\. Mathur, D\. Stap, J\. Gala, W\. Siblini, D\. Krzemiński, G\. I\. Winata,et al\.\(2025\)Mmteb: massive multilingual text embedding benchmark\.arXiv preprint arXiv:2502\.13595\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- S\. Fan, X\. Jiang, X\. Li, X\. Meng, P\. Han, S\. Shang, A\. Sun, Y\. Wang, and Z\. Wang \(2024\)Not all layers of llms are necessary during inference\.arXiv preprint arXiv:2403\.02181\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- J\. FitzGerald, C\. Hench, C\. Peris, S\. Mackie, K\. Rottmann, A\. Sanchez, A\. Nash, L\. Urbach, V\. Kakarala, R\. Singh, S\. Ranganath, L\. Crist, M\. Britan, W\. Leeuwis, G\. Tur, and P\. Natarajan \(2022\)MASSIVE: a 1m\-example multilingual natural language understanding dataset with 51 typologically\-diverse languages\.External Links:2204\.08582Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- Q\. Garrido, R\. Balestriero, L\. Najman, and Y\. Lecun \(2023\)Rankme: assessing the downstream performance of pretrained self\-supervised representations by their rank\.InInternational conference on machine learning,pp\. 10929–10974\.Cited by:[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p1.1)\.
- R\. Girdhar, A\. El\-Nouby, Z\. Liu, M\. Singh, K\. V\. Alwala, A\. Joulin, and I\. Misra \(2023\)Imagebind: one embedding space to bind them all\.InProceedings of the IEEE/CVF conference on computer vision and pattern recognition,pp\. 15180–15190\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1)\.
- N\. Godey, É\. Clergerie, and B\. Sagot \(2024\)Anisotropy is inherent to self\-attention in transformers\.InProceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 35–48\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p3.1)\.
- W\. W\. Hager \(1989\)Updating the inverse of a matrix\.SIAM review31\(2\),pp\. 221–239\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p3.1)\.
- A\. E\. Hoerl and R\. W\. Kennard \(1970\)Ridge regression: biased estimation for nonorthogonal problems\.Technometrics12\(1\),pp\. 55–67\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p2.8)\.
- M\. Huh, H\. Mobahi, R\. Zhang, B\. Cheung, P\. Agrawal, and P\. Isola \(2023\)The low\-rank simplicity bias in deep networks\.arXiv preprint arXiv:2103\.10427\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p3.1)\.
- M\. Jin, Q\. Yu, J\. Huang, Q\. Zeng, Z\. Wang, W\. Hua, H\. Zhao, K\. Mei, Y\. Meng, K\. Ding, F\. Yang, M\. Du, and Y\. Zhang \(2025\)Exploring concept depth: how large language models acquire knowledge and concept at different layers?\.External Links:2404\.07066,[Link](https://arxiv.org/abs/2404.07066)Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- S\. Kornblith, M\. Norouzi, H\. Lee, and G\. Hinton \(2019\)Similarity of neural network representations revisited\.InInternational conference on machine learning,pp\. 3519–3529\.Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- J\. Krause, J\. Deng, M\. Stark, and L\. Fei\-Fei \(2013\)Collecting a large\-scale dataset of fine\-grained cars\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- A\. Krizhevsky, G\. Hinton,et al\.\(2009\)Learning multiple layers of features from tiny images\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- S\. Kudrjashov, O\. Karpik, and E\. Klyshinsky \(2024\)Shrink the longest: improving latent space isotropy with simplicial geometry\.InInternational Conference on Analysis of Images, Social Networks and Texts,pp\. 120–130\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p5.4)\.
- V\. Lad, J\. H\. Lee, W\. Gurnee, and M\. Tegmark \(2024\)The remarkable robustness of llms: stages of inference?\.arXiv preprint arXiv:2406\.19384\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- J\. Lezama, Q\. Qiu, P\. Musé, and G\. Sapiro \(2018\)Ole: orthogonal low\-rank embedding\-a plug and play geometric loss for deep learning\.InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition,pp\. 8109–8118\.Cited by:[§3\.3](https://arxiv.org/html/2605.23033#S3.SS3.p1.3)\.
- H\. Li, A\. Arora, S\. Chen, A\. Gupta, S\. Gupta, and Y\. Mehdad \(2021\)MTOP: a comprehensive multilingual task\-oriented semantic parsing benchmark\.InProceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume,P\. Merlo, J\. Tiedemann, and R\. Tsarfaty \(Eds\.\),Online,pp\. 2950–2962\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.eacl-main.257),[Link](https://aclanthology.org/2021.eacl-main.257)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- H\. Li and C\. Huang \(2026\)Task\-driven kernel flows: label rank compression and laplacian spectral filtering\.arXiv preprint arXiv:2601\.00276\.Cited by:[§3\.2](https://arxiv.org/html/2605.23033#S3.SS2.p3.1)\.
- M\. Z\. Li, K\. K\. Agrawal, A\. Ghosh, K\. K\. Teru, A\. Santoro, G\. Lajoie, and B\. A\. Richards \(2025\)Tracing the representation geometry of language models from pretraining to post\-training\.arXiv preprint arXiv:2509\.23024\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1)\.
- I\. Loshchilov and F\. Hutter \(2017\)Decoupled weight decay regularization\.arXiv preprint arXiv:1711\.05101\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p1.4)\.
- L\. Maes, D\. Scieur, and R\. Balestriero\. LevyScore: a fast sample\-wise confidence score of pretrained joint embedding model\.InUniReps: 3rd Edition of the Workshop on Unifying Representations in Neural Models\.Cited by:[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p1.1)\.
- W\. C\. Maggie \(2020\)Tweet sentiment extraction\.Kaggle\.External Links:[Link](https://kaggle.com/competitions/tweet-sentiment-extraction)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- M\. Marone, O\. Weller, W\. Fleshman, E\. Yang, D\. Lawrie, and B\. Van Durme \(2025\)Mmbert: a modern multilingual encoder with annealed language learning\.arXiv preprint arXiv:2509\.06888\.Cited by:[§6\.2](https://arxiv.org/html/2605.23033#S6.SS2.p6.1)\.
- N\. Muennighoff, N\. Tazi, L\. Magne, and N\. Reimers \(2023\)Mteb: massive text embedding benchmark\.InProceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics,pp\. 2014–2037\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- K\. Nakamura, S\. Levy, and W\. Y\. Wang \(2020\)Fakeddit: a new multimodal benchmark dataset for fine\-grained fake news detection\.InProceedings of the twelfth language resources and evaluation conference,pp\. 6149–6157\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- J\. O’Neill, P\. Rozenshtein, R\. Kiryo, M\. Kubota, and D\. Bollegala \(2021\)I wish I would have loved this one, but I didn’t – a multilingual dataset for counterfactual detection in product review\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing,M\. Moens, X\. Huang, L\. Specia, and S\. W\. Yih \(Eds\.\),Online and Punta Cana, Dominican Republic,pp\. 7092–7108\.External Links:[Document](https://dx.doi.org/10.18653/v1/2021.emnlp-main.568),[Link](https://aclanthology.org/2021.emnlp-main.568)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- M\. Oquab, T\. Darcet, T\. Moutakanni, H\. Vo, M\. Szafraniec, V\. Khalidov, P\. Fernandez, D\. Haziza, F\. Massa, A\. El\-Nouby,et al\.\(2023\)Dinov2: learning robust visual features without supervision\.arXiv preprint arXiv:2304\.07193\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- K\. Park, Y\. J\. Choe, Y\. Jiang, and V\. Veitch \(2024\)The geometry of categorical and hierarchical concepts in large language models\.arXiv preprint arXiv:2406\.01506\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1)\.
- M\. E\. Peters, M\. Neumann, M\. Iyyer, M\. Gardner, C\. Clark, K\. Lee, and L\. Zettlemoyer \(2018\)Deep contextualized word representations\.InProceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long Papers\),M\. Walker, H\. Ji, and A\. Stent \(Eds\.\),New Orleans, Louisiana,pp\. 2227–2237\.External Links:[Link](https://aclanthology.org/N18-1202/),[Document](https://dx.doi.org/10.18653/v1/N18-1202)Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- E\. Queipo\-de\-Llano, Á\. Arroyo, F\. Barbero, X\. Dong, M\. Bronstein, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Attention sinks and compression valleys in llms are two sides of the same coin\.arXiv preprint arXiv:2510\.06477\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1)\.
- A\. Radford, J\. W\. Kim, C\. Hallacy, A\. Ramesh, G\. Goh, S\. Agarwal, G\. Sastry, A\. Askell, P\. Mishkin, J\. Clark,et al\.\(2021\)Learning transferable visual models from natural language supervision\.InInternational conference on machine learning,pp\. 8748–8763\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1)\.
- M\. Raghu, J\. Gilmer, J\. Yosinski, and J\. Sohl\-Dickstein \(2017\)Svcca: singular vector canonical correlation analysis for deep learning dynamics and interpretability\.Advances in neural information processing systems30\.Cited by:[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1),[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- J\. Rajasegaran, I\. Radosavovic, R\. Ravishankar, Y\. Gandelsman, C\. Feichtenhofer, and J\. Malik \(2025\)An empirical study of autoregressive pre\-training from videos\.arXiv preprint arXiv:2501\.05453\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- A\. Razzhigaev, M\. Mikhalchuk, E\. Goncharova, I\. Oseledets, D\. Dimitrov, and A\. Kuznetsov \(2024\)The shape of learning: anisotropy and intrinsic dimensions in transformer\-based models\.InFindings of the Association for Computational Linguistics: EACL 2024,pp\. 868–874\.Cited by:[§6\.2](https://arxiv.org/html/2605.23033#S6.SS2.p4.1)\.
- N\. Rostamzadeh, S\. Hosseini, T\. Boquet, W\. Stokowiec, Y\. Zhang, C\. Jauvin, and C\. Pal \(2018\)Fashion\-gen: the generative fashion dataset and challenge\.arXiv preprint arXiv:1806\.08317\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- O\. Russakovsky, J\. Deng, H\. Su, J\. Krause, S\. Satheesh, S\. Ma, Z\. Huang, A\. Karpathy, A\. Khosla, M\. Bernstein, A\. C\. Berg, and L\. Fei\-Fei \(2015\)ImageNet Large Scale Visual Recognition Challenge\.International Journal of Computer Vision \(IJCV\)115\(3\),pp\. 211–252\.External Links:[Document](https://dx.doi.org/10.1007/s11263-015-0816-y)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- M\. Saponati, P\. Sager, P\. V\. Aceituno, T\. Stadelmann, and B\. Grewe \(2025\)The underlying structures of self\-attention: symmetry, directionality, and emergent dynamics in transformer training\.arXiv preprint arXiv:2502\.10927\.Cited by:[§2\.2](https://arxiv.org/html/2605.23033#S2.SS2.p1.1)\.
- E\. Saravia, H\. T\. Liu, Y\. Huang, J\. Wu, and Y\. Chen \(2018\)CARER: contextualized affect representations for emotion recognition\.InProceedings of the 2018 Conference on Empirical Methods in Natural Language Processing,E\. Riloff, D\. Chiang, J\. Hockenmaier, and J\. Tsujii \(Eds\.\),Brussels, Belgium,pp\. 3687–3697\.External Links:[Document](https://dx.doi.org/10.18653/v1/D18-1404),[Link](https://aclanthology.org/D18-1404)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- A\. M\. Saxe, J\. L\. McClelland, and S\. Ganguli \(2013\)Exact solutions to the nonlinear dynamics of learning in deep linear neural networks\.arXiv preprint arXiv:1312\.6120\.Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- O\. Siméoni, H\. V\. Vo, M\. Seitzer, F\. Baldassarre, M\. Oquab, C\. Jose, V\. Khalidov, M\. Szafraniec, S\. Yi, M\. Ramamonjisoa,et al\.\(2025\)Dinov3\.arXiv preprint arXiv:2508\.10104\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- O\. Skean, M\. R\. Arefin, D\. Zhao, N\. Patel, J\. Naghiyev, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Layer by layer: uncovering hidden representations in language models\.arXiv preprint arXiv:2502\.02013\.Cited by:[§1](https://arxiv.org/html/2605.23033#S1.p1.1),[§2\.3](https://arxiv.org/html/2605.23033#S2.SS3.p1.1),[§6\.2](https://arxiv.org/html/2605.23033#S6.SS2.p4.1)\.
- H\. Touvron, M\. Cord, M\. Douze, F\. Massa, A\. Sablayrolles, and H\. Jégou \(2021\)Training data\-efficient image transformers & distillation through attention\.InInternational conference on machine learning,pp\. 10347–10357\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- C\. Wah, S\. Branson, P\. Welinder, P\. Perona, and S\. Belongie \(2011\)The caltech\-ucsd birds\-200\-2011 dataset\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- X\. Wang, J\. Yamagishi, M\. Todisco, H\. Delgado, A\. Nautsch, N\. Evans, M\. Sahidullah, V\. Vestman, T\. Kinnunen, K\. A\. Lee,et al\.\(2020\)ASVspoof 2019: a large\-scale public database of synthesized, converted and replayed speech\.Computer Speech & Language64,pp\. 101114\.Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- P\. Warden \(2018\)Speech Commands: A Dataset for Limited\-Vocabulary Speech Recognition\.ArXiv e\-prints\.External Links:1804\.03209,[Link](https://arxiv.org/abs/1804.03209)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- B\. Warner, A\. Chaffin, B\. Clavié, O\. Weller, O\. Hallström, S\. Taghadouini, A\. Gallagher, R\. Biswas, F\. Ladhak, T\. Aarsen, N\. Cooper, G\. Adams, J\. Howard, and I\. Poli \(2024\)Smarter, better, faster, longer: a modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference\.External Links:2412\.13663,[Link](https://arxiv.org/abs/2412.13663)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p3.1)\.
- J\. Xiao, K\. A\. Ehinger, J\. Hays, A\. Torralba, and A\. Oliva \(2014\)SUN database: exploring a large collection of scene categories\.International Journal of Computer Vision119,pp\. 3–22\.External Links:[Link](https://api.semanticscholar.org/CorpusID:10224573)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- J\. Xiao, J\. Hays, K\. A\. Ehinger, A\. Oliva, and A\. Torralba \(2010\)SUN database: large\-scale scene recognition from abbey to zoo\.In2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition,Vol\.,pp\. 3485–3492\.External Links:[Document](https://dx.doi.org/10.1109/CVPR.2010.5539970)Cited by:[§5](https://arxiv.org/html/2605.23033#S5.p4.1)\.
- S\. Yang, H\. Wang, Z\. Xing, S\. Chen, and L\. Zhu \(2025\)Segdino: an efficient design for medical and natural image segmentation with dino\-v3\.arXiv preprint arXiv:2509\.00833\.Cited by:[§3\.1](https://arxiv.org/html/2605.23033#S3.SS1.p1.8)\.
- R\. Zhang, P\. Isola, and A\. A\. Efros \(2016\)Colorful image colorization\.InEuropean conference on computer vision,pp\. 649–666\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
- H\. Zheng, S\. Wang, C\. Thomas, and L\. Huang \(2025\)Advancing chart question answering with robust chart component recognition\.In2025 IEEE/CVF Winter Conference on Applications of Computer Vision \(WACV\),pp\. 5741–5750\.Cited by:[§2\.1](https://arxiv.org/html/2605.23033#S2.SS1.p1.1)\.
## Appendix A
### A.1 Discussion and Future Directions
The results in this work consistently show that only a small subset of intermediate representations exhibit favorable geometric and statistical properties, particularly with respect to isotropy, reduced redundancy, and stable ridge\-based linear alignment\. These observations suggest several concrete and feasible directions for future research that directly build on the principles validated in our experiments\.
##### Parameter\-Efficient Adaptation\.
Our empirical findings indicate that layers selected based on residual alignment and isotropy tend to contribute complementary information for downstream tasks\. This naturally aligns with parameter\-efficient fine\-tuning methods such as low\-rank adaptation and lightweight adapters, where adaptation capacity must be carefully allocated\. A promising direction is to use the same layer\-wise diagnostics to determine where adaptation modules should be inserted, focusing updates on layers that are both linearly accessible and well\-conditioned\. Such integration may be particularly effective in low\-data regimes, where our results already suggest improved robustness through regularization and spectral balance\.
##### Unsupervised and Noisy Regimes\.
Several components emphasized in this work, particularly isotropy and redundancy, do not depend on label information and therefore remain meaningful in unsupervised settings\. This suggests a feasible extension in which intermediate representations are selected based on intrinsic geometric properties alone, providing a principled way to identify stable and complementary features prior to downstream use\. In settings with limited or noisy supervision, our results indicate that ridge regularization and spectral balance jointly reduce sensitivity to spurious correlations\. Future work could explore adaptive strategies that emphasize geometry\-driven criteria when labels are unreliable, while gradually incorporating residual alignment as supervision improves\. Such approaches may be especially relevant for real\-world data, where clean labels are often scarce or imperfect\.
##### Scaling, Multimodality, and Generation\.
As models grow deeper and are applied to more diverse tasks, exhaustive layer\-wise analysis becomes increasingly costly\. However, the core operations used here, ridge regression, orthogonal projection, and covariance\-based isotropy estimation, admit efficient approximations\. Hierarchical or block\-level selection strategies could preserve the selection behavior observed in our experiments while remaining practical for large\-scale architectures\. The same principles extend naturally to multimodal and generative settings, where redundancy across modalities or conditioning features is common\. Explicitly favoring isotropic and orthogonal representations may help identify features that contribute stable and complementary information prior to fusion or conditioning\.
##### Interpretability and Robustness\.
The layer\-wise scores introduced in this work provide a structured view of how representation geometry evolves across depth and tasks\. Analyzing trends in isotropy, redundancy, and residual alignment can offer insight into why certain layers consistently support downstream adaptation better than others\. This geometry\-driven perspective complements existing interpretability approaches by grounding explanations in measurable spectral properties\. Moreover, because isotropy and redundancy do not rely on labels, these ideas are well suited to unsupervised, low\-data, and noisy settings\. Emphasizing geometry\-driven criteria in such regimes may reduce sensitivity to unreliable supervision, while ridge\-based residual fitting provides a controlled mechanism for incorporating task information when available\.
Overall, these directions highlight how the empirical patterns and theoretical insights identified in this work can be extended to larger models, broader tasks, and more challenging data regimes, while remaining grounded in the same underlying principles\.
### A.2 Limitations
Theoretical and modeling assumptions limit guarantees\. Our analysis and the isotropy objective are derived under simplified priors \(e\.g\., a rotationally invariant task prior\) that are useful as conservative, worst\-case safeguards but do not exactly match every real downstream problem\. Tasks with strong, structured priors or domain\-specific signal can favor particular spectral directions, so forcing isotropy is not guaranteed to be strictly optimal in every case\. Consequently, the diagnostics and theorems should be read as guidance about structural risk and probe stability rather than as absolute causal claims about which features are “true” for a given task\.
Hyperparameters and calibration sensitivity remain a practical limitation. LOES depends on a few interpretable knobs (ridge strength, isotropy weight, redundancy weight, class-geometry weight, and the number of selected layers) and on a small calibration set. Selection and final performance can vary when these settings are poorly chosen or when the calibration data is non-representative of the deployment distribution. Crucially, however, our empirical results show that the method is robust: even under suboptimal hyperparameter settings or modestly misspecified training conditions, LOES typically improves over the single-layer baseline. Ridge regularization and the isotropy penalty in particular reduce variance and make the probe less brittle to noisy labels or small calibration budgets, so practical gains often persist without extensive tuning. A second practical caveat is that models pretrained narrowly on a single distribution (e.g., supervised ImageNet distillation in DeiT-B/16) already concentrate task-discriminative signal in the final layer, so the headroom for LOES is small and gains can be marginal or slightly negative (see Appendix Table [A7](https://arxiv.org/html/2605.23033#A1.T7)).
Computation and greedy search introduce additional practical constraints. The greedy selection procedure is efficient and effective in our experiments but is not guaranteed to find the global optimum; early choices can affect later residuals. Likewise, exact closed-form solves and eigendecompositions scale poorly as embedding dimension or spatial resolution grows. In practice these costs are mitigated by small calibration budgets, low layer budgets (we use $k\in\{3,4\}$), and approximate linear-algebra techniques when necessary, and the net effect is still a reproducible accuracy gain across modalities. Nevertheless, users should be aware that extreme model sizes, strong distribution shift in calibration data, or highly noisy labels may reduce the margin of improvement and may require lightweight approximations or modest validation to retain the benefits described in the paper.
### A.3 Layer Fusion Mechanisms
#### A.3.1 Adaptors
Let
$$\mathcal{S}=\{l_{1},l_{2},\dots,l_{k}\}$$
denote the set of layer indices selected by LOES, where $k=|\mathcal{S}|$. For an input sample $x$, the hidden representation extracted from layer $l_{i}$ is denoted as
$$\mathbf{h}_{l_{i}}\in\mathbb{R}^{D}$$
where $D$ is the hidden dimensionality of the foundation model. To reduce dimensionality and learn task-specific transformations, each selected layer representation is passed through an independent lightweight adapter module. Each adapter consists of a Layer Normalization layer, followed by a linear projection and a GELU activation:
$$\mathbf{z}_{l_{i}}=\mathrm{GELU}\left(W_{i}\,\mathrm{LN}(\mathbf{h}_{l_{i}})+b_{i}\right)$$
where
$$W_{i}\in\mathbb{R}^{d_{p}\times D},\qquad b_{i}\in\mathbb{R}^{d_{p}},\qquad d_{p}=256.$$
The transformed representations from all selected layers are then concatenated to form a fused representation:
$$\mathbf{z}_{\text{fused}}=[\mathbf{z}_{l_{1}};\mathbf{z}_{l_{2}};\dots;\mathbf{z}_{l_{k}}]\in\mathbb{R}^{kd_{p}}$$
#### A.3.2 Probes
The fused representation is subsequently passed through a task-specific prediction head consisting of Layer Normalization, Dropout, and a linear transformation:
$$\hat{\mathbf{y}}=W_{c}\,\mathrm{Dropout}\left(\mathrm{LN}(\mathbf{z}_{\text{fused}})\right)+b_{c}$$
where
$$W_{c}\in\mathbb{R}^{C\times d_{\text{fused}}},$$
$d_{\text{fused}}=kd_{p}$ is the dimension of the fused representation, and $C$ is the number of output classes for classification tasks. For regression settings, $C=1$.
The overall architecture enables LOES\-selected intermediate representations to be efficiently fused while maintaining a lightweight parameter footprint through low\-dimensional adapters\.
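The adapter, concatenation, and probe equations above can be traced end to end in a few lines of NumPy. The following is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`layer_norm`, `gelu`, `init_params`, `fuse_and_predict`), the random initialization, the tanh approximation of GELU, the omission of learned LayerNorm scale and shift, and the illustrative sizes $D=768$ and $C=11$ are all hypothetical choices. Only $d_{p}=256$ and the order of operations come from the text.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize over the feature axis (learned scale/shift omitted in this sketch).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def gelu(x):
    # tanh approximation of GELU (an assumption; the exact form uses erf).
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def init_params(rng, k, D, d_p, C):
    # One independent (W_i, b_i) adapter per selected layer, plus the probe (W_c, b_c).
    adapters = [(rng.normal(0.0, 0.02, (d_p, D)), np.zeros(d_p)) for _ in range(k)]
    W_c = rng.normal(0.0, 0.02, (C, k * d_p))
    b_c = np.zeros(C)
    return adapters, W_c, b_c

def fuse_and_predict(hidden, adapters, W_c, b_c, drop_p=0.0, rng=None):
    # hidden: list of k arrays of shape (batch, D), one per selected layer.
    z = [gelu(layer_norm(h) @ W.T + b) for h, (W, b) in zip(hidden, adapters)]
    z_fused = np.concatenate(z, axis=-1)            # shape (batch, k * d_p)
    u = layer_norm(z_fused)
    if drop_p > 0.0 and rng is not None:
        mask = rng.random(u.shape) >= drop_p        # inverted dropout, training only
        u = u * mask / (1.0 - drop_p)
    return u @ W_c.T + b_c, z_fused                 # logits of shape (batch, C)

rng = np.random.default_rng(0)
k, D, d_p, C, batch = 4, 768, 256, 11, 2            # D and C are illustrative only
adapters, W_c, b_c = init_params(rng, k, D, d_p, C)
hidden = [rng.normal(size=(batch, D)) for _ in range(k)]
logits, z_fused = fuse_and_predict(hidden, adapters, W_c, b_c)
```

With $k=4$ and $d_{p}=256$ the fused vector has $kd_{p}=1024$ dimensions, so the probe weight holds $C\times 1024$ entries regardless of the backbone width $D$.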
### A.4 Additional Results
#### A.4.1 Effect of Hyperparameters
LOES selects a subset of $K$ layers by minimizing a composite objective that balances residual task fitting and geometric regularization. At each selection step, the score for a candidate layer is
$$\mathcal{L}=\mathcal{L}_{\text{res}}+\alpha\,(1-\mathrm{Iso})+\gamma\,\mathrm{Red}-\eta\,\mathrm{Tri},\tag{10}$$
where $\mathcal{L}_{\text{res}}$ denotes the residual regression loss, $\mathrm{Iso}$ measures representation isotropy, $\mathrm{Red}$ captures similarity with previously selected layers, and $\mathrm{Tri}$ encourages class-separating geometry via centroid-based triangular area.
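To make the role of each weight in Eq. 10 concrete, the sketch below runs a greedy argmin over the composite score using precomputed per-layer terms. It is a simplified illustration, not the LOES procedure itself: the function names are hypothetical, the redundancy term is assumed to be the maximum similarity to any already selected layer, and the residual term is held fixed across steps, whereas the paper re-fits residuals after each selection. The toy numbers are invented for the example.

```python
import numpy as np

def composite_score(res, iso, red, tri, alpha=1.0, gamma=0.5, eta=0.1):
    # Eq. 10: lower is better.
    return res + alpha * (1.0 - iso) + gamma * red - eta * tri

def greedy_select(res, iso, tri, sim, K, alpha=1.0, gamma=0.5, eta=0.1):
    """res, iso, tri: per-layer arrays; sim: (L, L) layer-similarity matrix."""
    selected = []
    for _ in range(K):
        best, best_score = None, np.inf
        for l in range(len(res)):
            if l in selected:
                continue
            # Assumed redundancy: max similarity to the layers chosen so far.
            red = max((sim[l, s] for s in selected), default=0.0)
            score = composite_score(res[l], iso[l], red, tri[l], alpha, gamma, eta)
            if score < best_score:
                best, best_score = l, score
        selected.append(best)
    return selected

# Toy inputs (hypothetical, three candidate layers).
res = np.array([0.50, 0.40, 0.45])
iso = np.array([0.90, 0.50, 0.80])
tri = np.zeros(3)
sim = np.array([[1.0, 0.2, 0.9],
                [0.2, 1.0, 0.3],
                [0.9, 0.3, 1.0]])

with_penalty = greedy_select(res, iso, tri, sim, K=2, gamma=0.5)
without_penalty = greedy_select(res, iso, tri, sim, K=2, gamma=0.0)
```

In this toy case layer 2 is nearly a copy of layer 0, so the second pick switches from layer 2 to layer 1 once the redundancy weight is non-zero.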
Table A1: Ablation of LOES hyperparameters on MTOP Domain Classification. All results are reported using ModernBERT-base. $\alpha$ controls isotropy regularization, $\gamma$ penalizes redundancy with previously selected layers, and $\eta$ promotes class-separating geometry. Higher accuracy is better.

| Method | $\alpha$ | $\gamma$ | $\eta$ | $K$ | $n_{\text{cal}}$ | Selected Layers | Test Acc ↑ |
|---|---|---|---|---|---|---|---|
| Last Layer | – | – | – | 1 | – | [12] | 90.89 |
| Last-4 Concat | – | – | – | 4 | – | [9, 10, 11, 12] | 94.83 |
| LOES | 0.0 | 0.0 | 0.0 | 4 | 512 | [8, 11, 12, 10] | 95.53 |
| LOES | 0.0 | 0.1 | 0.0 | 4 | 512 | [8, 0, 1, 2] | 97.90 |
| LOES | 0.0 | 0.5 | 0.0 | 4 | 512 | [8, 0, 1, 2] | 97.90 |
| LOES | 1.0 | 0.5 | 0.1 | 4 | 64 | [1, 0, 3, 2] | 98.13 |
| LOES | 1.0 | 0.5 | 0.1 | 4 | 512 | [1, 0, 3, 2] | 98.30 |

Table A2: Cross-modal hyperparameter validation on ASVspoof 2019 LA with Wav2Vec 2.0 (frozen). The default configuration ($\alpha=1.0,\ \gamma=0.5,\ \eta=0.1$) selected for text in Table [A1](https://arxiv.org/html/2605.23033#A1.T1) also yields the best performance in the audio modality, with under 0.3 percentage-point spread across all non-zero settings. Even the weakest LOES configuration improves over Last-4 Concat and the last-layer baseline.

Figure 7: 2D sensitivity sweep over $\alpha$ and $\gamma$ on MTOP (ModernBERT-base, $K=4$). The accuracy surface is a broad plateau: the default $(\alpha=1.0,\gamma=0.5)$ reaches 95.90%, within 0.25 percentage points of the grid optimum (96.15%), and even the weakest configuration in the grid (94.80%) outperforms the last-layer baseline (81.37%) by more than 13 points. LOES is therefore insensitive to fine hyperparameter tuning within a wide regime.

Table A3: Comparison of Last, Last-4, and LOES ($k=4$) on MTEB classification tasks. We report test accuracy (%) using MBERT-B. *Last* uses the final encoder layer, while *Last-4* concatenates the final four layers. *LOES ($k=4$)* selects four layers using a calibration set comprising 1–50% of the training data and concatenates their representations. For each calibration fraction, the first row reports accuracy and the italicized row lists the selected encoder layers per dataset. Average accuracy is computed across all seven datasets.

Table A4: Effect of LOES layer count ($k$) on MTOP domain classification. Test accuracy (%) reported for varying $k$. Horizontal rules separate *Last*, *Last-3*, and LOES sweeps. Best result per model is in bold.

| Model | Mode | $k$ | Accuracy (%) | Selected Layers |
|---|---|---|---|---|
| ModernBERT-base | Last | 1 | 81.37 | [21] |
| | Last-3 | 3 | 82.44 | [19, 20, 21] |
| | LOES | 1 | 92.50 | [3] |
| | LOES | 2 | 95.51 | [1, 21] |
| | LOES | 3 | 95.99 | [3, 21, 1] |
| | LOES | 4 | 95.76 | [3, 21, 1, 4] |
| | LOES | 5 | **96.01** | [1, 21, 3, 4, 10] |
| | LOES | 6 | 95.92 | [1, 20, 3, 4, 10, 21] |
| | LOES | 7 | 95.78 | [1, 21, 3, 4, 10, 17, 19] |
| | LOES | 8 | 95.67 | [3, 21, 1, 4, 10, 19, 20, 18] |
| | LOES | 9 | 95.12 | [1, 21, 3, 4, 15, 20, 18, 19, 16] |
| | LOES | 10 | 95.42 | [3, 21, 1, 4, 11, 20, 19, 18, 17, 15] |
| BERT-base | Last | 1 | 96.67 | [11] |
| | Last-3 | 3 | 97.40 | [9, 10, 11] |
| | LOES | 1 | 96.08 | [7] |
| | LOES | 2 | 97.15 | [11, 10] |
| | LOES | 3 | 97.97 | [11, 10, 6] |
| | LOES | 4 | 97.74 | [11, 10, 6, 7] |
| | LOES | 5 | 97.83 | [6, 10, 7, 11, 9] |
| | LOES | 6 | 97.74 | [11, 10, 6, 9, 7, 8] |
| | LOES | 7 | 98.18 | [7, 10, 6, 9, 8, 11, 4] |
| | LOES | 8 | 98.34 | [11, 10, 9, 6, 7, 8, 4, 5] |
| | LOES | 9 | 98.38 | [11, 10, 6, 9, 7, 8, 4, 5, 3] |
| | LOES | 10 | **98.43** | [7, 10, 9, 6, 11, 8, 5, 4, 3, 1] |
#### A.4.2 Comparison Against Globally Optimal Subsets
LOES uses a greedy selection rule, which is not guaranteed to recover the globally optimal subset of $K$ layers. To quantify the gap, we exhaustively trained and evaluated all $\binom{12}{3}=220$ three-layer subsets of frozen BERT-base on two MTEB benchmarks and compared each against the greedy LOES choice.
Table A5: LOES against exhaustive search over $\binom{12}{3}=220$ subsets on frozen BERT-base. LOES reaches within 1.16 percentage points of the optimum on MTOP and 1.69 on Emotion, sharing 2 of 3 selected layers with the optimal subset in both cases.
LOES recovers a subset within 1.16 percentage points (MTOP) and 1.69 percentage points (Emotion) of the globally optimal one, sharing two of three layers with it on both datasets. Crucially, the exhaustive search required training all 220 subsets, costing several GPU-hours even on BERT-base. For deeper encoders the search space grows rapidly: ModernBERT-base (22 layers) gives $\binom{22}{3}=1{,}540$ subsets, and BERT-large (24 layers) gives $\binom{24}{3}=2{,}024$. LOES, in contrast, requires a small calibration set and a few closed-form computations followed by a single training run, and yields the largest gains precisely on these deeper models (Table [5](https://arxiv.org/html/2605.23033#S5.T5)).
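The subset counts quoted above follow directly from the binomial coefficient. As a quick check, the short snippet below reproduces them with Python's standard library; the helper name `n_subsets` is only illustrative.

```python
from math import comb

def n_subsets(num_layers, k=3):
    # Number of distinct k-layer subsets an exhaustive search must train.
    return comb(num_layers, k)

counts = {
    name: n_subsets(n)
    for name, n in [("BERT-base", 12), ("ModernBERT-base", 22), ("BERT-large", 24)]
}
```

Doubling the depth from 12 to 24 layers multiplies the number of three-layer subsets by roughly nine, which is why exhaustive search quickly becomes impractical for deeper encoders.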
##### Isotropy weight $\alpha$.
Non-zero isotropy regularization is essential for strong performance. Setting $\alpha=0$ leads to noticeably lower accuracy, while moderate values yield substantial gains. On MTOP Domain Classification, the best-performing configurations consistently occur at $\alpha=0.5$ to $\alpha=1.0$, indicating that mildly encouraging isotropic representations improves linear separability without suppressing task-relevant signal.
##### Redundancy weight $\gamma$.
Contrary to acting purely as a penalty, the redundancy term can positively contribute to performance. Increasing $\gamma$ from zero improves accuracy in several settings, with the best results achieved at $\gamma=0.5$. This suggests that controlled redundancy encourages the selection of semantically aligned layers that reinforce task-discriminative features, rather than enforcing strict decorrelation.
##### Geometric weight $\eta$.
The triangular class-geometry term has a stabilizing effect but does not require precise tuning. Performance remains stable across a broad range of $\eta\in[0,0.2]$ once suitable layers have been selected. This indicates that class-geometry regularization serves as a secondary refinement rather than a dominant driver of layer selection.
##### Number of selected layers $K$ and calibration size $n_{\text{cal}}$.
All experiments use $K=4$. As shown in Appendix Table [A4](https://arxiv.org/html/2605.23033#A1.T4), accuracy continues to improve for some encoders beyond this point but at a roughly linear cost in head parameters and fused-representation dimension, so $K\in\{3,4\}$ offers the best accuracy-efficiency tradeoff. Calibration is sample-efficient: $n_{\text{cal}}=64$ consistently matches or closely approaches the best performance, demonstrating that LOES is robust to calibration set size.
##### Summary.
The optimal regime
$$\alpha\in[0.5,1.0],\quad\gamma=0.5,\quad\eta\in[0,0.2],\quad K=4$$
balances residual task fitting with representation geometry. This regime is stable across modalities: the same default configuration is near-optimal for text (Table [A1](https://arxiv.org/html/2605.23033#A1.T1)) and for audio (Table [A2](https://arxiv.org/html/2605.23033#A1.T2)), and the 2D $\alpha$–$\gamma$ surface in Figure [7](https://arxiv.org/html/2605.23033#A1.F7) forms a broad plateau rather than a sharp optimum. These findings validate the design of LOES and support its use without per-task hyperparameter tuning.
#### A.4.3 Non-Linear Probing
We ran additional experiments using a ReLU activation in the probe (Section [A.3](https://arxiv.org/html/2605.23033#A1.SS3)):
$$\hat{\mathbf{y}}=\mathrm{ReLU}\left(W_{c}\,\mathrm{Dropout}\left(\mathrm{LN}(\mathbf{z}_{\text{fused}})\right)+b_{c}\right)$$
Table [A6](https://arxiv.org/html/2605.23033#A1.T6) shows the results on MBERT-B for the AM-Scenario and MTOP datasets.
Table A6: Performance of LOES under non-linear probing (ReLU) on MBERT-B. LOES generalizes effectively beyond linear probes, significantly outperforming standard baselines.
Table A7: Statistical significance of LOES vs last-layer baseline on the test set. Paired $t$-tests are conducted over $n=5$ seeds.
Table A8: One-way ANOVA across models using LOES test accuracy. The analysis tests whether different pretrained encoders achieve significantly different performance on each dataset. All datasets show statistically significant differences across models ($p<0.05$).
Table A9: LOES against a stronger learned fusion baseline on MBERT-base. Learnable Concat applies a per-layer linear projection to a low-dimensional space and concatenates across all 22 layers, giving it 1.7x more head parameters and FLOPs than LOES-4. LOES-4 still matches or exceeds this baseline on both tasks, indicating that the gains come from the selection criterion rather than from richer downstream fusion.
#### A.4.4 Sensitivity to the Ridge Regularizer $\lambda$
Throughout all experiments, the Tikhonov regularizer $\lambda$ in the ridge probe (Eq. [1](https://arxiv.org/html/2605.23033#S3.E1)) is fixed at $10^{-3}$ for numerical stability and is not tuned per dataset. To verify that LOES is not sensitive to this choice, we sweep $\lambda$ across four orders of magnitude on two MTEB classification tasks using ModernBERT-base. As shown in Appendix Table [A10](https://arxiv.org/html/2605.23033#A1.T10), LOES outperforms the Last-4 concatenation baseline by 14–16 percentage points across all values of $\lambda$, with less than $1.4\%$ variation across the sweep range. This confirms that the ridge regularizer functions purely as a conditioning safeguard for the closed-form solve, not as a performance-critical hyperparameter, and that the default value $\lambda=10^{-3}$ can be used without per-dataset tuning.
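For readers who want to see how $\lambda$ enters a closed-form ridge solve, the sketch below fits a linear probe on synthetic data and sweeps $\lambda$ over four orders of magnitude. It assumes the standard Tikhonov form $W=(X^{\top}X+\lambda I)^{-1}X^{\top}Y$ with a summed squared residual; the exact normalization of Eq. 1 is not restated in this appendix, so treat the form, the function name `ridge_residual`, and the random data as assumptions.

```python
import numpy as np

def ridge_residual(X, Y, lam=1e-3):
    # Closed-form ridge solve; lam * I keeps X^T X well conditioned.
    d = X.shape[1]
    W = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)
    R = Y - X @ W
    return W, float(np.sum(R**2))

rng = np.random.default_rng(0)
X = rng.normal(size=(64, 16))                    # 64 calibration samples, 16 features
Y = np.eye(4)[rng.integers(0, 4, size=64)]       # one-hot targets for 4 classes

# Training residual for each regularization strength in the sweep.
residuals = {lam: ridge_residual(X, Y, lam)[1] for lam in (1e-4, 1e-3, 1e-2, 1e-1)}
```

On well-conditioned inputs such as these, the residual changes very little across the sweep, which matches the role of $\lambda$ described above as a conditioning safeguard.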
Table A10: Sensitivity of LOES to the ridge regularizer $\lambda$. Test accuracy (%) on MTOP and AM-Scenario using ModernBERT-base with $k=4$. LOES is robust across four orders of magnitude in $\lambda$, with all configurations outperforming the Last-4 baseline by a wide margin.
#### A.4.5 Comparison with Variance-Covariance Regularizers
GeoReg is designed to preserve spectral isotropy and class-centroid separation during fine-tuning, where competing task gradients can otherwise erode both. This concern is shared with self-supervised regularizers such as VICReg (Bardes et al., [2021](https://arxiv.org/html/2605.23033#bib.bib26)) and SIGReg, which similarly target representation collapse via variance-covariance penalties. The key distinction is timing: VICReg and related methods regularize the encoder during *pretraining*, whereas GeoReg operates on the *fused* multi-layer representation during downstream adaptation, where the LOES-selected geometry must be defended against task-specific gradient updates.
To validate this design choice, we compare GeoReg against a VICReg-style adaptation applied at the same fusion point on TweetEval-Emoji using BERT-base with $K=3$. Without geometric regularization, validation accuracy degrades after approximately $5$k steps due to representation collapse, consistent with the trajectory shown in Figure [2](https://arxiv.org/html/2605.23033#S3.F2). The VICReg-adapted baseline achieves $30.69\%$, marginally above the last-layer baseline at $30.15\%$. GeoReg reaches $31.04\%$ and exhibits stronger resistance to late-training collapse, indicating that the combination of spectral isotropy and centroid-volume terms is more effective than variance-covariance penalties alone in the transfer setting. A more detailed comparison covering VICReg, SIGReg, and additional regularizers is in preparation for the main paper but omitted here due to page constraints.
Figure 8: t-SNE visualizations comparing standard last-layer representations with LOES-selected layer fusion on CIFAR-100 (top, using DINOv2) and ASVspoof 2019 (bottom, using Wav2Vec 2.0). On CIFAR-100, simple concatenation of the last three layers exhibits moderate class mixing, whereas LOES ($k=3$; layers 6, 7, and last) produces tighter and better-separated clusters, demonstrating the advantage of selective layer fusion. On ASVspoof 2019, last-layer embeddings show strong overlap between bonafide and spoof samples, while concatenation of the last three layers provides only marginal improvement. In contrast, LOES-selected representations achieve substantially clearer separation, with LOES Best Layer ($k=1$) identifying a more informative representation and LOES Best 3 Layers ($k=3$) yielding the most compact and well-separated clusters.
#### A.4.6 t-SNE for LOES Selected Layers
LOES\-selected layers produce tighter, more separated clusters compared to last\-layer and last\-K baselines\. Figure[8](https://arxiv.org/html/2605.23033#A1.F8)shows that the adapted representations \(after training\) of LOES have better cluster compactness and separability than the last layer\(s\) \(also adapted\), consistent with the gains reported earlier\.
### A.5 Statistical Evaluation Protocol
We conduct a comprehensive statistical analysis to assess whether LOES provides consistent and significant improvements over a last\-layer baseline\. All experiments are performed using five independent random seeds per model and dataset to account for variability arising from initialization and data ordering\.
##### Mean and Variance Estimation\.
For each model and dataset pair, we compute the mean and standard deviation of validation and test accuracies across seeds. These statistics are reported as mean $\pm$ standard deviation in all result tables and serve as stable estimates of expected performance and variability.
##### Paired Significance Testing\.
To evaluate whether LOES significantly outperforms the baseline, we perform paired two\-sided $t$\-tests for each model and dataset combination\. The paired design matches LOES and baseline runs using identical random seeds, thereby controlling for seed\-specific noise\.
Let $d_{i}$ denote the difference in accuracy between LOES and the baseline for seed $i$\. The test statistic is computed as
$$t=\frac{\bar{d}}{s_{d}/\sqrt{n}},$$
where $\bar{d}$ is the mean of the per\-seed differences, $s_{d}$ is their standard deviation, and $n=5$ is the number of seeds\.
We apply this test independently to validation and test accuracies\. A result is considered statistically significant if the corresponding $p$\-value is below $0.05$\. While validation results are analyzed for completeness, we primarily report test\-set significance, as it reflects generalization performance\.
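The paired test statistic can be sketched in a few lines\. This is an illustrative sketch only: the function name `paired_t_statistic` and the accuracy values are hypothetical, the standard deviation is assumed to be the sample \(Bessel\-corrected\) one, and the $p<0.05$ decision is expressed through the standard two\-sided critical value of the $t$ distribution with 4 degrees of freedom \(about 2\.776\)\.

```python
import math
import statistics


def paired_t_statistic(loes_acc, base_acc):
    """Return (t, degrees_of_freedom) for seed-matched accuracy pairs."""
    if len(loes_acc) != len(base_acc):
        raise ValueError("runs must be paired seed by seed")
    diffs = [a - b for a, b in zip(loes_acc, base_acc)]  # d_i per seed
    n = len(diffs)
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # assumption: sample std (n - 1 denominator)
    return d_bar / (s_d / math.sqrt(n)), n - 1


# Hypothetical per-seed test accuracies (not results from the paper).
loes = [91.2, 90.8, 91.5, 91.0, 91.3]
base = [90.1, 90.0, 90.6, 90.2, 90.3]

t_stat, dof = paired_t_statistic(loes, base)
T_CRIT_DOF4 = 2.776  # two-sided critical value at alpha = 0.05 for 4 dof
significant = abs(t_stat) > T_CRIT_DOF4
print(f"t = {t_stat:.2f}, dof = {dof}, significant = {significant}")
```

A positive `t_stat` means LOES is ahead of the baseline on average; a negative one means the baseline is ahead\.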
##### Interpretation of Results\.
The paired $t$\-tests indicate that LOES yields statistically significant improvements over the last\-layer baseline in the majority of model and dataset combinations\. Gains are particularly pronounced on fine\-grained benchmarks such as CUB\-200, Stanford Cars, and DTD\. In a small number of cases, differences are not statistically significant, indicating that LOES does not degrade performance when improvements are absent\.
The sign of the $t$\-statistic further reflects the direction of change\. Positive values correspond to LOES outperforming the baseline, while negative values indicate marginal baseline advantages, which are rare and often not statistically significant\.
##### Characterizing Failure Cases\.
Two model and dataset pairs show a statistically significant negative effect: DeiT\-B/16 on Mini\-ImageNet and DeiT\-B/16 on Stanford Dogs\. The absolute differences are small \(0\.19% and 0\.25%, respectively\), so the regression is mild in practice but consistent across seeds\. DeiT\-B/16 is pretrained via supervised distillation exclusively on ImageNet variants, which concentrates task\-discriminative information in the final layers\. In this setting the last layer is already close to optimal, intermediate layers contribute noise rather than complementary signal, and LOES has little room to improve over the baseline\. This pattern is consistent with the broader trend in Table [2](https://arxiv.org/html/2605.23033#S4.T2): narrowly pretrained models \(DeiT, MAE, ViT\-IN21k\) concentrate discriminative features in late layers, whereas models pretrained on diverse corpora \(CLIP, DINOv2\) distribute them across depth and benefit most from LOES\.
##### Cross\-Model Analysis\.
To examine whether different pretrained encoders exhibit significantly different performance under LOES, we additionally perform one\-way analysis of variance tests across models for each dataset\. The results show that model choice remains a significant factor in performance, confirming that LOES complements rather than replaces the inductive biases introduced by pretraining\.
Overall, the statistical analysis demonstrates that LOES provides robust and reproducible improvements across architectures and datasets\. The observed gains are consistent across random seeds and are supported by rigorous paired significance testing, providing strong evidence for the effectiveness of the proposed method\.
### A\.6 Additional Discussion
This subsection provides additional discussion and supporting empirical results that complement the main analysis and further validate the proposed framework across different settings, metrics, and evaluation protocols\.
##### Effect of Layer Count on Performance\.
Appendix Table [A4](https://arxiv.org/html/2605.23033#A1.T4) reports performance as a function of the number of selected layers $k$ on MTOP\. The trend depends on the encoder: ModernBERT\-base peaks at $k=5$ \(96\.01%\) with marginal differences across $k\in\{3,4,5\}$, while BERT\-base improves nearly monotonically from $k=1$ \(96\.08%\) to $k=10$ \(98\.43%\)\. However, larger $k$ comes at a roughly linear cost in fused\-representation dimensionality and head parameters: for ModernBERT, moving from $k=3$ to $k=5$ adds 67% more parameters in the head for a 0\.02 point gain; for BERT\-base, moving from $k=4$ to $k=10$ multiplies the head parameters by 2\.5x for a 0\.69 point gain\. We therefore use $k\in\{3,4\}$ throughout the paper as the best accuracy\-efficiency tradeoff rather than as the saturation point of the underlying curve\.
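The parameter ratios quoted above follow from simple arithmetic if the head is assumed to be a single linear layer over the concatenation of $k$ layer embeddings\. The sketch below makes that assumption explicit; `head_params`, the embedding width of 768, and the class count of 100 are hypothetical choices for illustration, and the actual head in the paper may differ\.

```python
def head_params(k, d, num_classes):
    """Weights plus biases of a linear head over k concatenated d-dim layers.

    Assumption: the fused representation has dimension k * d and the head is
    one dense layer; this is an illustration, not the paper's exact head.
    """
    return k * d * num_classes + num_classes


def growth_factor(k_from, k_to, d=768, num_classes=100):
    """Ratio of head parameter counts when moving from k_from to k_to layers."""
    return head_params(k_to, d, num_classes) / head_params(k_from, d, num_classes)


ratio_3_to_5 = growth_factor(3, 5)    # roughly 5/3, i.e. about 67% more
ratio_4_to_10 = growth_factor(4, 10)  # roughly 10/4, i.e. about 2.5x
print(f"k=3 -> k=5: {100 * (ratio_3_to_5 - 1):.0f}% more head parameters")
print(f"k=4 -> k=10: {ratio_4_to_10:.2f}x head parameters")
```

Because the bias term is negligible next to the weights, the ratio is essentially the ratio of the two $k$ values, which is why the cost is described as roughly linear\.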
##### Calibration\-Free and Minimal\-Calibration Baselines\.
Appendix Table [A3](https://arxiv.org/html/2605.23033#A1.T3) includes Last and Last\-4 baselines that require no calibration data\. These results show that naive concatenation of final layers provides limited improvements, whereas LOES consistently yields stronger performance even with small calibration budgets\.
##### Statistical Robustness Across Random Seeds\.
Appendix Table [A7](https://arxiv.org/html/2605.23033#A1.T7) reports paired $t$\-test results comparing LOES with the last\-layer baseline across multiple vision benchmarks\. The results indicate that the observed improvements are statistically significant across random seeds and not driven by favorable initialization\.
##### Cross\-Model Variability Under LOES\.
Appendix Table [A8](https://arxiv.org/html/2605.23033#A1.T8) presents one\-way ANOVA results assessing performance differences across pretrained encoders under LOES\. While LOES improves accuracy across models, encoder choice remains statistically significant, indicating that LOES preserves and leverages model\-specific inductive biases rather than homogenizing representations\.
##### Additional Multimodal and Regression Results\.
Appendix Table [A11](https://arxiv.org/html/2605.23033#A1.T11) reports layer\-selection results for multimodal least squares regression and binary classification tasks using CLIP ViT\-B\. LOES improves performance across heterogeneous output spaces, including settings evaluated with RMSE, demonstrating that the method extends beyond classification accuracy\.
##### Extended Segmentation and Speech Results\.
Appendix Tables [A13](https://arxiv.org/html/2605.23033#A1.T13) and [A14](https://arxiv.org/html/2605.23033#A1.T14) provide additional evidence that LOES generalizes to dense prediction and speech classification tasks\.
Table A11: Layer selection results using CLIP ViT\-B \(base\) with joint image–text embeddings\. Results are reported on Amazon Products 23 and FashionGen \(least squares regression; lower RMSE is better\) and Fakeddit \(binary classification; higher accuracy is better\)\. Baseline methods and LOES variants are separated by horizontal rules\.

Table A12: Cross\-lingual LOES results on Amazon Massive Scenario using mBERT\-base \(21 layers\)\. Accuracy \(%\) reported for last\-layer, last\-3 concatenation, and LOES with $k$=4\. $\Delta$ vs Last\-3 denotes the improvement of LOES over last\-3 concatenation in percentage points\. Languages below the mid\-rule \(Hindi, Arabic, Urdu\) are underrepresented in typical pretraining corpora and show substantially larger LOES gains\. Layer indices are zero\-indexed; layer 21 is the final layer\.

| Language | Last | Last\-3 | LOES \($k$=4\) | $\Delta$ vs Last\-3 | Selected Layers |
| --- | --- | --- | --- | --- | --- |
| English \(EN\) | 85\.2 | 84\.6 | 86\.6 | \+2\.0 | \[6, 21, 7, 9\] |
| French \(FR\) | 80\.9 | 77\.5 | 81\.0 | \+3\.5 | \[6, 21, 7, 19\] |
| German \(DE\) | 81\.9 | 81\.1 | 84\.0 | \+2\.9 | \[6, 21, 7, 9\] |
| Italian \(IT\) | 82\.4 | 81\.8 | 84\.6 | \+2\.8 | \[6, 21, 7, 4\] |
| Japanese \(JA\) | 83\.7 | 83\.0 | 83\.7 | \+0\.7 | \[6, 21, 7, 19\] |
| Spanish \(ES\) | 82\.5 | 80\.8 | 84\.1 | \+3\.3 | \[4, 21, 6, 7\] |
| Russian \(RU\) | 81\.1 | 80\.5 | 82\.6 | \+2\.1 | \[6, 21, 7, 19\] |
| Hindi \(HI\) | 76\.4 | 74\.9 | 81\.4 | \+6\.5 | \[6, 21, 9, 7\] |
| Arabic \(AR\) | 75\.4 | 73\.4 | 80\.6 | \+7\.2 | \[6, 21, 7, 9\] |
| Urdu \(UR\) | 71\.7 | 68\.7 | 79\.6 | \+10\.9 | \[6, 20, 9, 7\] |

Table A13: LOES extends to semantic segmentation\. Mean IoU \(%\) reported for last\-layer and LOES with $k$=3 on Cityscapes and COCOStuff\-164k\. $\Delta$ denotes improvement in percentage points\. All models use frozen encoders with a linear segmentation head\. Cityscapes models were trained for 20 epochs; COCOStuff models were trained for 8 epochs due to larger dataset size\. Layer indices are zero\-indexed; layer 11 is the final layer\.
#### A\.6\.1 Adapting LOES for Semantic Segmentation
The standard LOES framework operates on image\-level \(pooled\) embeddings for classification tasks\. To extend LOES to dense prediction tasks such as semantic segmentation, we adapt the calibration and selection procedure to operate at the pixel level while preserving the core algorithmic structure\.
##### Pixel\-Level Calibration\.
For a calibration set of $N_{\text{cal}}$ images, we extract intermediate representations from all encoder layers and reshape them to spatial feature maps\. Let $\mathbf{F}_{\ell}\in\mathbb{R}^{B\times H_{f}\times W_{f}\times d}$ denote the spatial features at layer $\ell$, where $H_{f}=W_{f}=H_{\text{img}}/p$ is the feature resolution and $p$ is the patch size\. We downsample the ground\-truth segmentation masks to match this resolution using nearest\-neighbor interpolation\.
To construct a tractable calibration set, we randomly sample $M$ pixels per image from valid \(non\-ignored\) spatial locations\. This yields a pixel\-level feature matrix $\mathbf{X}_{\ell}\in\mathbb{R}^{N\times d}$, where $N=N_{\text{cal}}\times M$, and corresponding one\-hot encoded class labels $\mathbf{Y}\in\mathbb{R}^{N\times C}$\.
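The pixel sampling step can be sketched as follows\. This is a minimal NumPy illustration under stated assumptions: the function name `sample_pixel_calibration` is hypothetical, masks are assumed to be already downsampled to the feature resolution, and the ignore label is assumed to be 255 \(a common convention, not something this section specifies\)\.

```python
import numpy as np


def sample_pixel_calibration(features, masks, num_classes, m_per_image,
                             ignore_index=255, seed=0):
    """Build pixel-level (X, Y) from spatial features and label masks.

    features: (B, Hf, Wf, d) float array for one layer.
    masks:    (B, Hf, Wf) integer labels at feature resolution.
    Returns X of shape (N, d) and one-hot Y of shape (N, C).
    """
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for feat, mask in zip(features, masks):
        flat_labels = mask.reshape(-1)
        valid = np.flatnonzero(flat_labels != ignore_index)
        if valid.size == 0:
            continue  # image has no usable pixels
        take = min(m_per_image, valid.size)  # fewer than M if the image is mostly ignored
        idx = rng.choice(valid, size=take, replace=False)
        xs.append(feat.reshape(-1, feat.shape[-1])[idx])
        ys.append(flat_labels[idx])
    X = np.concatenate(xs, axis=0)
    labels = np.concatenate(ys, axis=0)
    Y = np.eye(num_classes, dtype=X.dtype)[labels]  # one-hot rows
    return X, Y


# Toy usage with random data: 2 images, 4x4 feature maps, d = 8, C = 3.
rng = np.random.default_rng(1)
feats = rng.normal(size=(2, 4, 4, 8))
masks = rng.integers(0, 3, size=(2, 4, 4))
masks[:, 0, 0] = 255  # mark one location per image as ignored
X, Y = sample_pixel_calibration(feats, masks, num_classes=3, m_per_image=5)
print(X.shape, Y.shape)
```

When every image has at least $M$ valid pixels, the number of rows equals $N_{\text{cal}}\times M$ as in the text\.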
##### Layer Scoring and Selection\.
Given pixel\-level embeddings, the LOES scoring and selection procedure remains unchanged\. For each candidate layer $\ell$, we compute the closed\-form ridge regression solution as in Eq\. \([2](https://arxiv.org/html/2605.23033#S3.E2)\) and evaluate the composite score:
$$\mathrm{Score}(\ell)=\mathcal{L}_{\text{ridge}}(\mathbf{X}_{\ell},\mathbf{Y})+\alpha\bigl(1-\mathrm{Iso}(\mathbf{X}_{\ell})\bigr)+\gamma\,\mathrm{Red}_{\ell},\tag{11}$$
where $\mathrm{Iso}(\cdot)$ is the isotropy score \(Eq\. \([4](https://arxiv.org/html/2605.23033#S3.E4)\)\) and $\mathrm{Red}_{\ell}$ penalizes redundancy with previously selected layers \(Eq\. \([5](https://arxiv.org/html/2605.23033#S3.E5)\)\)\. We omit the triangular class\-geometry term \($\eta=0$\) for segmentation, as computing class centroids over hundreds of pixel classes is computationally prohibitive and the term provides marginal benefit in this setting\.
Layer selection proceeds greedily as in Algorithm [1](https://arxiv.org/html/2605.23033#alg1): we first select the layer minimizing the initial score, then iteratively add layers that best reduce the residual prediction error while avoiding redundancy with the current selection\.
##### Decoder Architecture\.
Selected layer features are fused via per\-layer linear adapters followed by concatenation, consistent with the classification pipeline\. The fused representation is passed through a lightweight convolutional segmentation head and bilinearly upsampled to the original image resolution\. The encoder remains frozen throughout training; only the adapters and segmentation head are optimized\.
##### Summary\.
The key adaptation for segmentation is the shift from image\-level to pixel\-level calibration: instead of pooling spatial tokens, we treat each valid pixel as an independent sample for ridge regression and isotropy computation\. This preserves the geometric and residual\-based selection principles of LOES while accommodating dense prediction tasks\. As shown in Appendix Table [A13](https://arxiv.org/html/2605.23033#A1.T13), this adaptation consistently improves mIoU over last\-layer baselines across multiple encoders and datasets\.
Table A14: Layer selection results using Wav2Vec 2\.0 with a frozen encoder\. Classification accuracy is reported on ASVspoof 2019, CREMA\-D, and Google Speech Commands\. For LOES with $k=4$, selected layer indices are shown in the order: ASVspoof 2019 / CREMA\-D / Google Speech Commands\.

Figure 9: Epoch\-wise validation accuracy comparing LOES \($k=3$\) and last\-layer baselines across datasets and models\.
#### A\.6\.2 Eigenspectrum Analysis Across Layers
Figure [10](https://arxiv.org/html/2605.23033#A1.F10) visualizes the normalized eigenspectrum of layer\-wise covariance matrices, revealing how representation geometry evolves with depth and varies by pretraining paradigm\.
Figure 10: Normalized eigenspectrum across encoder layers on Stanford Cars\. Each row shows the top\-50 eigenvalues \(log\-scale\) of the layer\-wise covariance matrix; brighter colors indicate higher eigenvalues\. Green borders mark LOES\-selected layers\. CLIP selects mid\-depth layers \(4–6\) with flatter spectra, while DINOv2 selects later layers \(7, 10, 11\), reflecting their distinct pretraining paradigms\.

CLIP, trained on diverse image\-text pairs, exhibits relatively uniform spectra in mid\-depth layers, consistent with its early\-to\-mid layer selection in Table [2](https://arxiv.org/html/2605.23033#S4.T2)\. DINOv2, trained via self\-distillation on curated images, concentrates isotropic structure in later layers\. These spectral patterns provide geometric justification for the layer selection differences: LOES identifies layers where the covariance eigenspectrum is flatter \(higher isotropy\), yielding better\-conditioned ridge probes as established in Theorem [A\.4](https://arxiv.org/html/2605.23033#A1.Thmtheorem4)\.
### A\.7 Theoretical Analysis: Geometric Regularization in LOES
The LOES algorithm selects layer residuals by maximizing a hybrid score:
$$\mathcal{S}(l)=-\mathcal{L}_{ridge}(\tilde{\mathbf{x}}_{l})-\alpha\bigl(1-\text{Iso}(\tilde{\mathbf{x}}_{l})\bigr).\tag{12}$$
The first term, $\mathcal{L}_{ridge}$, is intuitive: it ensures the selected feature correlates with the task labels on the available source data\. However, the second term, which enforces geometric isotropy, requires rigorous justification\. Why should we prefer isotropic residuals over anisotropic ones, even if they yield similar empirical errors?
In this section, we provide this justification from first principles\. We posit that for general\-purpose representation learning, the goal is not merely to minimize prediction error on a specific dataset \(which can lead to overfitting spurious correlations\), but to maximize the fidelity of the representation, that is, its ability to allow a linear probe to recover the true underlying causal mechanism of the task\. We prove that maximizing isotropy is mathematically equivalent to minimizing the worst\-case structural misalignment \(Bias\) of the probe\.
#### A\.7\.1 Problem Formulation
Let $\tilde{\mathbf{x}}\in\mathbb{R}^{d}$ be the candidate residual feature vector \(the component of layer $l$ orthogonal to the current subspace $\mathcal{S}$\)\. We model the downstream task target $y$ as a linear response to these features\.
###### Assumption A\.1 \(Linear Mechanism\)\.
The target $y\in\mathbb{R}$ is generated by an unknown task weight vector $\mathbf{w}^{*}\in\mathbb{R}^{d}$:
$$y=(\mathbf{w}^{*})^{\top}\tilde{\mathbf{x}}+\epsilon,\quad\epsilon\sim\mathcal{N}(0,\sigma^{2}),\tag{13}$$
where $\tilde{\mathbf{x}}$ is centered with covariance $\mathbf{\Sigma}=\mathbb{E}[\tilde{\mathbf{x}}\tilde{\mathbf{x}}^{\top}]$\.
###### Assumption A\.2 \(Spherical Task Prior\)\.
LOES constructs a representation intended to transfer to many unknown downstream tasks\. We lack a priori knowledge of which feature directions will be important for future tasks\. Therefore, we invoke the Principle of Maximum Entropy and assume the task weights $\mathbf{w}^{*}$ are drawn from a rotationally invariant distribution:
$$\mathbb{E}[\mathbf{w}^{*}]=\mathbf{0},\quad\mathbb{E}[\mathbf{w}^{*}(\mathbf{w}^{*})^{\top}]=\frac{R^{2}}{d}\mathbf{I}_{d}.\tag{14}$$
This implies that no direction in the feature space is privileged; the "true" task vector is equally likely to align with any eigenvector of the feature covariance\.
#### A\.7\.2 Spectral Decomposition of Fidelity
We define the Fidelity of the representation as the expected squared Euclidean distance between the estimated weights $\hat{\mathbf{w}}_{\lambda}$ \(learned by a Ridge probe\) and the true causal weights $\mathbf{w}^{*}$\. This metric, $\mathcal{E}_{param}$, quantifies the structural correctness of the learned solution\.
###### Lemma A\.3 \(Spectral Decomposition of Parameter Error\)\.
The expected squared parameter error decomposes spectrally into two distinct terms: Alignment Bias \(structural error\) and Estimation Variance \(stochastic error\), both dependent on the eigenvalues $\{\mu_{i}\}_{i=1}^{d}$ of $\mathbf{\Sigma}$:
$$\mathcal{E}_{param}(\mathbf{\Sigma})\coloneqq\mathbb{E}[\|\hat{\mathbf{w}}_{\lambda}-\mathbf{w}^{*}\|^{2}]=\sum_{i=1}^{d}\left(\underbrace{\frac{R^{2}}{d}\frac{\lambda^{2}}{(\mu_{i}+\lambda)^{2}}}_{\text{Alignment Bias }\mathcal{B}(\mu_{i})}+\underbrace{\sigma^{2}\frac{\mu_{i}}{(\mu_{i}+\lambda)^{2}}}_{\text{Estimation Variance }\mathcal{V}(\mu_{i})}\right)\tag{15}$$
###### Proof\.
Step 1: Explicit Form of the Estimator Error\. The Ridge Regression estimator is given by $\hat{\mathbf{w}}_{\lambda}=(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}\mathbf{\Sigma}_{xy}$\. We substitute the generative model for the cross\-covariance $\mathbf{\Sigma}_{xy}=\mathbf{\Sigma}\mathbf{w}^{*}+\mathbf{\Sigma}^{1/2}\mathbf{z}$, where $\mathbf{z}$ is zero\-mean Gaussian noise with $\mathbb{E}[\mathbf{z}\mathbf{z}^{\top}]=\sigma^{2}\mathbf{I}$\. The error vector $\mathbf{e}=\hat{\mathbf{w}}_{\lambda}-\mathbf{w}^{*}$ is:
$$\begin{aligned}\mathbf{e}&=(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}(\mathbf{\Sigma}\mathbf{w}^{*}+\mathbf{\Sigma}^{1/2}\mathbf{z})-\mathbf{w}^{*}&&\text{(16)}\\&=\left[(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}\mathbf{\Sigma}-\mathbf{I}\right]\mathbf{w}^{*}+(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}\mathbf{\Sigma}^{1/2}\mathbf{z}.&&\text{(17)}\end{aligned}$$
We simplify the bracketed bias term using the matrix identity $\mathbf{A}^{-1}\mathbf{B}-\mathbf{I}=\mathbf{A}^{-1}(\mathbf{B}-\mathbf{A})$\. Setting $\mathbf{A}=\mathbf{\Sigma}+\lambda\mathbf{I}$ and $\mathbf{B}=\mathbf{\Sigma}$, we get $\mathbf{B}-\mathbf{A}=-\lambda\mathbf{I}$\. Thus:
$$\mathbf{e}=\underbrace{-\lambda(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}\mathbf{w}^{*}}_{\mathbf{e}_{bias}}+\underbrace{(\mathbf{\Sigma}+\lambda\mathbf{I})^{-1}\mathbf{\Sigma}^{1/2}\mathbf{z}}_{\mathbf{e}_{var}}.\tag{18}$$
Step 2: Computing the Expected Squared Norm\. We seek $\mathbb{E}[\|\mathbf{e}\|^{2}]$\. Since the noise $\mathbf{z}$ has zero mean and is independent of $\mathbf{w}^{*}$, the cross\-term $\mathbb{E}[\mathbf{e}_{bias}^{\top}\mathbf{e}_{var}]$ vanishes\. We analyze the squared norms of the bias and variance terms separately\.
Part A: The Bias Term\.
$$\|\mathbf{e}_{bias}\|^{2}=\lambda^{2}(\mathbf{w}^{*})^{\top}(\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\mathbf{w}^{*}.\tag{19}$$
Taking the expectation over the task prior $\mathbf{w}^{*}$ using the trace trick $\mathbb{E}[\mathbf{u}^{\top}\mathbf{M}\mathbf{u}]=\text{Tr}(\mathbf{M}\,\mathbb{E}[\mathbf{u}\mathbf{u}^{\top}])$:
$$\begin{aligned}\mathbb{E}_{\mathbf{w}^{*}}[\|\mathbf{e}_{bias}\|^{2}]&=\lambda^{2}\,\text{Tr}\left((\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\,\mathbb{E}[\mathbf{w}^{*}(\mathbf{w}^{*})^{\top}]\right)&&\text{(20)}\\&=\lambda^{2}\,\text{Tr}\left((\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\frac{R^{2}}{d}\mathbf{I}\right)\quad\text{(by Assumption A.2)}&&\text{(21)}\\&=\frac{R^{2}\lambda^{2}}{d}\sum_{i=1}^{d}\frac{1}{(\mu_{i}+\lambda)^{2}}.&&\text{(22)}\end{aligned}$$
Part B: The Variance Term\.
$$\|\mathbf{e}_{var}\|^{2}=\mathbf{z}^{\top}\mathbf{\Sigma}^{1/2}(\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\mathbf{\Sigma}^{1/2}\mathbf{z}.\tag{23}$$
Taking the expectation over noise $\mathbf{z}$ with $\mathbb{E}[\mathbf{z}\mathbf{z}^{\top}]=\sigma^{2}\mathbf{I}$:
$$\begin{aligned}\mathbb{E}_{\mathbf{z}}[\|\mathbf{e}_{var}\|^{2}]&=\text{Tr}\left(\mathbf{\Sigma}^{1/2}(\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\mathbf{\Sigma}^{1/2}\cdot\sigma^{2}\mathbf{I}\right)&&\text{(24)}\\&=\sigma^{2}\,\text{Tr}\left(\mathbf{\Sigma}(\mathbf{\Sigma}+\lambda\mathbf{I})^{-2}\right)\quad\text{(cyclic property)}&&\text{(25)}\\&=\sigma^{2}\sum_{i=1}^{d}\frac{\mu_{i}}{(\mu_{i}+\lambda)^{2}}.&&\text{(26)}\end{aligned}$$
Summing Part A and Part B yields Eq\. [15](https://arxiv.org/html/2605.23033#A1.E15)\. ∎
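As a sanity check on the lemma, the trace expressions from the proof can be compared numerically with the eigenvalue sums of Eq\. \(15\)\. The sketch below is illustrative only: the function names, the spectrum, and the values of $\lambda$, $R$, and $\sigma$ are hypothetical, and the covariance is built from a random rotation so that it is not diagonal\.

```python
import numpy as np


def spectral_terms(mu, lam, R, sigma):
    """Bias and variance sums written in terms of the eigenvalues mu_i."""
    mu = np.asarray(mu, dtype=float)
    d = mu.size
    bias = np.sum((R ** 2 / d) * lam ** 2 / (mu + lam) ** 2)
    var = np.sum(sigma ** 2 * mu / (mu + lam) ** 2)
    return bias, var


def trace_terms(Sigma, lam, R, sigma):
    """Same quantities via the trace expressions used in the proof."""
    d = Sigma.shape[0]
    A_inv = np.linalg.inv(Sigma + lam * np.eye(d))
    A_inv2 = A_inv @ A_inv  # (Sigma + lam I)^{-2}
    bias = lam ** 2 * (R ** 2 / d) * np.trace(A_inv2)
    var = sigma ** 2 * np.trace(Sigma @ A_inv2)
    return bias, var


# Hypothetical spectrum and hyperparameters, rotated by a random orthogonal Q.
rng = np.random.default_rng(0)
mu = np.array([3.0, 1.0, 0.5, 0.1])
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Sigma = Q @ np.diag(mu) @ Q.T

spec_bias, spec_var = spectral_terms(mu, lam=0.5, R=1.0, sigma=0.3)
tr_bias, tr_var = trace_terms(Sigma, lam=0.5, R=1.0, sigma=0.3)
print(spec_bias, tr_bias)
print(spec_var, tr_var)
```

The two routes agree up to floating\-point error because both traces depend on $\mathbf{\Sigma}$ only through its eigenvalues\.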
#### A\.7\.3 Optimality of Isotropy in LOES
In the context of LOES, we are particularly concerned with the Alignment Bias \($\mathcal{B}$\)\. The variance term $\mathcal{V}$ depends heavily on label noise, which varies by dataset\. The bias term, however, represents a fundamental geometric mismatch between our chosen features and the task mechanism\. To ensure our representation is robust to any task orientation, we must minimize this bias\.
We now prove that minimizing the Alignment Bias strictly requires the feature covariance to be isotropic\.
###### Theorem A\.4 \(Isotropy Minimizes Alignment Bias\)\.
Let $\mathcal{S}_{E}=\{\mathbf{\Sigma}\succeq 0\mid\text{Tr}(\mathbf{\Sigma})=E\}$ be the set of valid covariance matrices with fixed total signal energy $E$\. The Alignment Bias component of the parameter error is strictly minimized if and only if $\mathbf{\Sigma}$ is isotropic\.
$$\arg\min_{\mathbf{\Sigma}\in\mathcal{S}_{E}}\sum_{i=1}^{d}\mathcal{B}(\mu_{i})=\frac{E}{d}\mathbf{I}_{d}\tag{27}$$
###### Proof\.
Step 1: Convexity Analysis of the Bias Kernel\. Let $g(\mu)=\frac{1}{(\mu+\lambda)^{2}}$ be the bias contribution of a single eigenvalue $\mu$ \(omitting the positive constant factor $R^{2}\lambda^{2}/d$\)\. We compute the first and second derivatives with respect to $\mu$:
$$\begin{aligned}g^{\prime}(\mu)&=\frac{d}{d\mu}(\mu+\lambda)^{-2}=-2(\mu+\lambda)^{-3}&&\text{(28)}\\g^{\prime\prime}(\mu)&=\frac{d}{d\mu}\left[-2(\mu+\lambda)^{-3}\right]=6(\mu+\lambda)^{-4}.&&\text{(29)}\end{aligned}$$
Since the regularization parameter $\lambda>0$ and eigenvalues $\mu\geq 0$, the term $(\mu+\lambda)^{4}$ is strictly positive\. Consequently, $g^{\prime\prime}(\mu)>0$ for all valid $\mu$\. This implies that $g(\mu)$ is a strictly convex function\.
Step 2: Constrained Optimization via Jensen’s Inequality\. We seek to minimize the total bias $J(\boldsymbol{\mu})=\sum_{i=1}^{d}g(\mu_{i})$ subject to the trace constraint $\sum_{i=1}^{d}\mu_{i}=E$\. By Jensen’s Inequality for convex functions, the average of the function values is bounded below by the function of the average:
$$\frac{1}{d}\sum_{i=1}^{d}g(\mu_{i})\geq g\left(\frac{1}{d}\sum_{i=1}^{d}\mu_{i}\right).\tag{30}$$
Substituting the constraint $\sum\mu_{i}=E$:
$$\sum_{i=1}^{d}\frac{1}{(\mu_{i}+\lambda)^{2}}\geq d\cdot\frac{1}{(E/d+\lambda)^{2}}.\tag{31}$$
Step 3: Condition for Strict Optimality\. For a strictly convex function like $g(\mu)$, equality in Jensen’s Inequality holds if and only if all inputs are identical:
$$\mu_{1}=\mu_{2}=\dots=\mu_{d}=\frac{E}{d}.\tag{32}$$
This spectral condition corresponds uniquely to the isotropic covariance matrix $\mathbf{\Sigma}=\frac{E}{d}\mathbf{I}$\.
Conclusion: Any deviation from isotropy \(i\.e\., spectral skewness\) strictly raises the Alignment Bias above the lower bound in Eq\. \(31\)\. Therefore, an isotropic distribution of residual variance is the theoretical optimum for minimizing structural error in the probe\. ∎
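A small numeric illustration of the theorem: three spectra with the same trace $E$ are compared, and the flat one attains the Jensen lower bound $R^{2}\lambda^{2}/(E/d+\lambda)^{2}$\. The function name `alignment_bias` and the specific spectra and constants are hypothetical values chosen for illustration\.

```python
def alignment_bias(mu, lam, R=1.0):
    """Sum over eigenvalues of (R^2 / d) * lam^2 / (mu_i + lam)^2."""
    d = len(mu)
    return (R ** 2 * lam ** 2 / d) * sum(1.0 / (m + lam) ** 2 for m in mu)


E, d, lam = 4.0, 4, 0.5
isotropic = [E / d] * d            # flat spectrum
skewed = [3.0, 0.5, 0.3, 0.2]      # same trace, uneven
collapsed = [4.0, 0.0, 0.0, 0.0]   # same trace, rank one

bias_iso = alignment_bias(isotropic, lam)
bias_skew = alignment_bias(skewed, lam)
bias_collapsed = alignment_bias(collapsed, lam)
lower_bound = lam ** 2 / (E / d + lam) ** 2  # Jensen bound with R = 1

print(f"isotropic {bias_iso:.4f}, skewed {bias_skew:.4f}, "
      f"collapsed {bias_collapsed:.4f}, bound {lower_bound:.4f}")
```

The ordering isotropic < skewed < collapsed reflects the strict convexity of $g(\mu)=1/(\mu+\lambda)^{2}$: moving energy away from a flat spectrum always increases the bias sum\.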
#### A\.7\.4 Theoretical Justification of the LOES Objective
This theoretical analysis directly informs the design of the LOES selection criterion $\mathcal{S}(l)$\. While the $\mathcal{L}_{ridge}$ term reduces the empirical error on the source task, it cannot guarantee that the features capture the correct causal structure; it is prone to latching onto high\-variance spurious correlations\. Theorem [A\.4](https://arxiv.org/html/2605.23033#A1.Thmtheorem4) establishes that to safeguard against such structural misalignment, the representation must be isotropic\. The term $\alpha(1-\text{Iso}(\tilde{\mathbf{x}}_{l}))$ in our objective serves as a geometric regularizer that enforces the optimality condition derived in Theorem [A\.4](https://arxiv.org/html/2605.23033#A1.Thmtheorem4)\. By penalizing anisotropy, LOES forces the selected residual features to adopt the optimal geometry for mechanism recovery, ensuring that every selected dimension contributes equally to the representation’s transfer capability\.
### A\.8 Detailed Justification of LOES Objective Components
This subsection provides a detailed justification of the individual components used in the LOES objective\. The discussion builds directly on the theoretical results established in the preceding section, with particular emphasis on the role of spectral properties of representations in linear probing\. Throughout, isotropy plays a central role as it directly governs the conditioning, stability, and predictability of ridge\-based linear evaluation\.
#### A\.8\.1 Setting and Notation
Let $\mathbf{X}_{\ell}\in\mathbb{R}^{N\times d_{\ell}}$ denote the centered embedding matrix extracted from layer $\ell$ on a calibration set of $N$ samples, and let $\mathbf{Y}\in\mathbb{R}^{N\times C}$ denote one\-hot encoded targets\. The objective is to select a subset of layers $S\subset\{1,\dots,L\}$ with $|S|=K$ such that the resulting representation subspace supports stable and complementary linear prediction\.
At each iteration, candidate layers are evaluated using
$$\mathrm{Score}(\ell)=\mathcal{L}_{\mathrm{res}}(\ell)+\alpha\bigl(1-\mathrm{Iso}(\mathbf{X}_{\ell})\bigr)+\gamma\,\mathrm{Red}(\ell)-\eta\,\mathrm{Tri}(\ell),\tag{33}$$
where each term corresponds to a distinct failure mode identified in the theoretical analysis\.
#### A\.8\.2 Residual Ridge Loss and Spectral Sensitivity
For a feature matrix $\mathbf{X}$ and targets $\mathbf{Y}$, LOES relies on ridge regression,
$$\hat{\mathbf{W}}=\arg\min_{\mathbf{W}}\|\mathbf{X}\mathbf{W}-\mathbf{Y}\|_{F}^{2}+\lambda\|\mathbf{W}\|_{F}^{2},\qquad\hat{\mathbf{W}}=(\mathbf{X}^{\top}\mathbf{X}+\lambda\mathbf{I})^{-1}\mathbf{X}^{\top}\mathbf{Y}.\tag{34}$$
As shown in Theorem [A\.4](https://arxiv.org/html/2605.23033#A1.Thmtheorem4), the expected error of the ridge estimator depends not only on the alignment between $\mathbf{Y}$ and the feature subspace, but also on the distribution of eigenvalues of $\mathbf{X}^{\top}\mathbf{X}$\. In particular, highly anisotropic spectra lead to estimators whose behavior is dominated by a small number of principal directions, while directions associated with smaller eigenvalues contribute disproportionately to variance\.
Let $\widehat{\mathbf{Y}}$ denote the cumulative prediction from already selected layers, and define the residual
$$\mathbf{R}=\mathbf{Y}-\widehat{\mathbf{Y}}.$$
The residual ridge loss $\mathcal{L}_{\mathrm{res}}(\ell)$ evaluates how effectively a candidate layer explains this residual signal\. Importantly, residuals often emphasize directions with lower variance in the feature space, making their estimation particularly sensitive to anisotropy\. This sensitivity motivates the inclusion of an explicit isotropy term alongside residual fitting\.
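The closed form in Eq\. \(34\) and the residual fit can be sketched with NumPy\. This is a minimal illustration on synthetic data, not the paper's implementation: `ridge_fit` and `residual_ridge_loss` are hypothetical names, and normalizing the loss as a mean squared error is an assumption, since this section does not state how $\mathcal{L}_{\mathrm{res}}$ is scaled\.

```python
import numpy as np


def ridge_fit(X, Y, lam):
    """Closed-form ridge weights: (X^T X + lam I)^{-1} X^T Y."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ Y)


def residual_ridge_loss(X_cand, R, lam):
    """Fit residual R from a candidate layer; return (MSE, fitted values)."""
    W = ridge_fit(X_cand, R, lam)
    fitted = X_cand @ W
    return float(np.mean((R - fitted) ** 2)), fitted


# Synthetic example: targets depend on two "layers", X_a and X_b.
rng = np.random.default_rng(0)
N, C, lam = 200, 3, 1e-2
X_a = rng.normal(size=(N, 5))
X_b = rng.normal(size=(N, 5))
Y = X_a @ rng.normal(size=(5, C)) + X_b @ rng.normal(size=(5, C))

W_a = ridge_fit(X_a, Y, lam)   # first selected layer
R = Y - X_a @ W_a              # residual left for later layers
loss_before = float(np.mean(R ** 2))
loss_after, fitted_b = residual_ridge_loss(X_b, R, lam)
print(f"residual MSE before {loss_before:.3f}, after adding X_b {loss_after:.3f}")
```

Using `np.linalg.solve` instead of forming the inverse explicitly is a numerical convenience and gives the same solution\.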
#### A\.8\.3 Orthogonalization, Redundancy, and Spectral Overlap
Let $\mathbf{X}_{S}$ denote the concatenation of features from the selected layers\. LOES evaluates each candidate layer after removing its projection onto the current subspace:
$$\widetilde{\mathbf{X}}_{\ell}=\mathbf{X}_{\ell}-\mathbf{X}_{S}(\mathbf{X}_{S}^{\top}\mathbf{X}_{S}+\varepsilon\mathbf{I})^{-1}\mathbf{X}_{S}^{\top}\mathbf{X}_{\ell}.\tag{35}$$
This operation isolates directions that are not already represented\. From a spectral perspective, it prevents repeated selection of layers whose dominant eigen\-directions coincide with those already chosen\. This is particularly important when individual layers are anisotropic, as their strongest directions may otherwise dominate the selection process despite offering little new information\.
Residual redundancy is further quantified by
$$\mathrm{Red}(\ell)=\max_{j\in S}\frac{\|\mathbf{X}_{\ell}^{\top}\mathbf{X}_{j}\|_{F}}{\|\mathbf{X}_{\ell}\|_{F}\|\mathbf{X}_{j}\|_{F}},\tag{36}$$
which explicitly penalizes candidates whose feature directions remain strongly aligned with those already selected, even after orthogonalization\.
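Both operations reduce to a few matrix products\. The following NumPy sketch uses hypothetical function names and synthetic data; the value of $\varepsilon$ is an arbitrary small constant, and whether the redundancy formula is fed raw or orthogonalized candidate features is left to the caller because the equation and the surrounding sentence can be read either way\.

```python
import numpy as np


def orthogonalize(X_l, X_S, eps=1e-6):
    """Remove from X_l its (regularized) projection onto the columns of X_S."""
    G = X_S.T @ X_S + eps * np.eye(X_S.shape[1])
    return X_l - X_S @ np.linalg.solve(G, X_S.T @ X_l)


def redundancy(X_l, selected):
    """Max over selected layers of ||X_l^T X_j||_F / (||X_l||_F ||X_j||_F)."""
    return max(
        np.linalg.norm(X_l.T @ X_j) / (np.linalg.norm(X_l) * np.linalg.norm(X_j))
        for X_j in selected
    )


# Synthetic candidate that partly lives in the span of the selected features.
rng = np.random.default_rng(0)
X_S = rng.normal(size=(50, 3))
X_l = X_S @ rng.normal(size=(3, 4)) + rng.normal(size=(50, 4))

X_tilde = orthogonalize(X_l, X_S)
red_before = redundancy(X_l, [X_S])
red_after = redundancy(X_tilde, [X_S])
print(f"redundancy before {red_before:.3f}, after orthogonalization {red_after:.2e}")
```

`np.linalg.norm` on a 2\-D array defaults to the Frobenius norm, which matches the norm used in the formula\.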
#### A\.8\.4 Isotropy and Conditioning of Linear Probes
Let $\mathbf{\Sigma}_{\ell}=\frac{1}{N}\mathbf{X}_{\ell}^{\top}\mathbf{X}_{\ell}$ with eigenvalues $\{\mu_{i}\}$\. LOES measures isotropy via
$$\mathrm{Iso}(\mathbf{X}_{\ell})=\frac{\bar{\mu}}{\sqrt{\mathrm{Var}(\{\mu_{i}\})+\delta}},\qquad\bar{\mu}=\tfrac{1}{d_{\ell}}\sum_{i}\mu_{i}.\tag{37}$$
This quantity decreases as the covariance spectrum becomes more uneven\. As established in Theorem [A\.4](https://arxiv.org/html/2605.23033#A1.Thmtheorem4), representations with larger spectral variance admit ridge estimators whose risk is more sensitive to the orientation of the target relative to dominant eigen\-directions\. Conversely, representations with more uniform spectra exhibit more consistent behavior across tasks and sample realizations\.
In the context of LOES, isotropy does not serve as a proxy for task relevance\. Instead, it acts as a regularizer on the selection process by favoring layers whose linear probes are better conditioned\. This is particularly important in the greedy setting, where early selection of highly anisotropic layers can distort subsequent residual estimates and lead to unstable layer rankings\.
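The isotropy score of Eq\. \(37\) can be computed directly from the covariance eigenvalues\. The sketch below is an illustration under assumptions: `isotropy_score` is a hypothetical name, the variance is taken as the population variance of the eigenvalues, and $\delta$ is an arbitrary small constant\.

```python
import numpy as np


def isotropy_score(X, delta=1e-8):
    """Mean eigenvalue divided by sqrt(eigenvalue variance + delta)."""
    Xc = X - X.mean(axis=0, keepdims=True)  # re-centering is harmless if X is centered
    Sigma = Xc.T @ Xc / Xc.shape[0]
    mu = np.linalg.eigvalsh(Sigma)          # real eigenvalues of a symmetric matrix
    return float(mu.mean() / np.sqrt(mu.var() + delta))


# Synthetic comparison: near-flat spectrum versus a strongly skewed one.
rng = np.random.default_rng(0)
X_flat = rng.normal(size=(2000, 4))
X_skew = X_flat * np.array([10.0, 1.0, 0.1, 0.01])

iso_flat = isotropy_score(X_flat)
iso_skew = isotropy_score(X_skew)
print(f"flat spectrum {iso_flat:.2f}, skewed spectrum {iso_skew:.2f}")
```

Read literally, this ratio is not capped at 1, so $1-\mathrm{Iso}$ can become negative for nearly flat spectra; how the score is normalized in practice is not stated in this section\.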
#### A\.8\.5Class Geometry
For classification tasks, LOES additionally evaluates the geometry of class centroids computed from $\widetilde{\mathbf{X}}_{\ell}$. Let $\boldsymbol{\mu}_{c}$ denote the centroid of class $c$. The class geometry term is
$$\mathrm{Tri}(\ell)=\mathbb{E}_{(a,b,c)}\Big[\tfrac{1}{2}\sqrt{\|\boldsymbol{\mu}_{b}-\boldsymbol{\mu}_{a}\|^{2}\,\|\boldsymbol{\mu}_{c}-\boldsymbol{\mu}_{a}\|^{2}-\langle\boldsymbol{\mu}_{b}-\boldsymbol{\mu}_{a},\boldsymbol{\mu}_{c}-\boldsymbol{\mu}_{a}\rangle^{2}}\Big].\tag{38}$$
This term penalizes representations in which class means collapse into low\-dimensional affine configurations\. Such collapse is often correlated with strong anisotropy, where only a small number of directions dominate both variance and class separation\.
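To make Eq. (38) concrete, the sketch below averages the triangle area over all unordered triplets of class centroids. Enumerating every triplet is an assumption made for clarity (the expectation could equally be estimated by sampling), and the names `triangle_area` and `tri_score` are hypothetical.

```python
import itertools
import numpy as np

def triangle_area(a, b, c):
    """Area of the triangle with vertices a, b, c in any dimension."""
    u, v = b - a, c - a
    gram = (u @ u) * (v @ v) - (u @ v) ** 2   # squared parallelogram area
    return 0.5 * np.sqrt(max(gram, 0.0))      # clamp tiny negative round-off

def tri_score(X, y):
    """Eq. (38): mean centroid-triangle area over all class triplets."""
    centroids = [X[y == c].mean(axis=0) for c in np.unique(y)]
    areas = [triangle_area(a, b, c)
             for a, b, c in itertools.combinations(centroids, 3)]
    return float(np.mean(areas)) if areas else 0.0
```

Collinear centroids give an area of zero, which is the degenerate configuration the term is meant to discourage.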
Overall, isotropy interacts with all other components of the LOES objective\. It stabilizes ridge estimation, moderates the effects of residual fitting, reduces the influence of dominant but redundant directions, and supports more balanced class geometry\. For these reasons, isotropy serves as a structural regularizer within LOES rather than a standalone selection criterion\.
### A\.9Time Complexity Analysis of LOES
We analyze the computational complexity of LOES, focusing on the calibration and layer\-selection stages, which constitute the only additional overhead introduced beyond standard frozen\-encoder transfer learning\.
##### Notation\.
Let $L$ denote the total number of encoder layers, $d$ the embedding dimension per layer, $C$ the number of classes, $N_{\text{cal}}$ the number of calibration samples, and $K\ll L$ the number of layers selected by LOES. All encoder parameters remain frozen during calibration.
##### Calibration Feature Extraction\.
LOES first extracts intermediate representations from all encoder layers for $N_{\text{cal}}$ samples. This requires a single forward pass through the encoder per batch and incurs $\mathcal{O}(N_{\text{cal}}\cdot L\cdot d)$ time and $\mathcal{O}(L\cdot N_{\text{cal}}\cdot d)$ memory to store pooled layer representations. This cost is incurred once per dataset.
##### Per\-Layer Scoring\.
Each layer is scored independently using a closed-form linear probe and a geometric regularization term. Solving the linear system dominates this step and requires $\mathcal{O}(d^{3}+N_{\text{cal}}\cdot d^{2})$ per layer. Aggregated across all layers, the total cost is
$$\mathcal{O}\big(L\cdot(d^{3}+N_{\text{cal}}\cdot d^{2})\big).$$
##### Greedy Complementary Selection\.
LOES selects $K$ layers using a greedy procedure that evaluates candidate layers against the current selection. At each iteration, all remaining layers are considered, and scoring involves residual fitting, orthogonalization with respect to previously selected layers, and redundancy-aware geometric penalties. Each iteration incurs
$$\mathcal{O}\big(L\cdot(d^{3}+N_{\text{cal}}\cdot d^{2})\big),$$
leading to a total selection cost of
$$\mathcal{O}\big(K\cdot L\cdot(d^{3}+N_{\text{cal}}\cdot d^{2})\big).$$
##### Overall Complexity\.
Combining all stages, the total time complexity of LOES is
$$\mathcal{O}\big(N_{\text{cal}}\cdot L\cdot d\;+\;K\cdot L\cdot(d^{3}+N_{\text{cal}}\cdot d^{2})\big),$$
while the memory complexity is dominated by stored calibration features and scales as
$$\mathcal{O}(L\cdot N_{\text{cal}}\cdot d).$$
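For a rough sense of scale, the leading-order terms above can be evaluated for a concrete configuration. The helper `loes_cost` below is a hypothetical convenience function, and the example values in the usage note are illustrative assumptions rather than settings reported in the paper. The results are term counts from the asymptotic expressions, not measured FLOPs.

```python
def loes_cost(n_cal, L, d, K):
    """Evaluate the leading-order cost terms of the complexity analysis."""
    extraction = n_cal * L * d              # pooled features for every layer
    per_layer = d**3 + n_cal * d**2         # closed-form probe for one layer
    selection = K * L * per_layer           # K greedy rounds over L candidates
    return {
        "extraction": extraction,
        "per_layer": per_layer,
        "selection": selection,
        "total": extraction + selection,
        "memory_floats": L * n_cal * d,     # stored calibration features
    }
```

For example, `loes_cost(n_cal=1000, L=12, d=768, K=3)` shows that the selection term exceeds the extraction term by several orders of magnitude, so the $d^{3}$ and $N_{\text{cal}}\cdot d^{2}$ factors drive the overall cost.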
##### Practical Implications\.
In practice, LOES is computationally efficient because calibration sets are small, the number of selected layers is limited ($K\leq 4$ in all experiments), and all operations are performed on frozen representations. Consequently, the overhead of LOES is negligible compared to encoder pretraining or downstream fine-tuning and typically completes within seconds on a single GPU for standard vision and language encoders.
### A\.10GeoReg
In addition to the layer selection procedure, some experiments incorporate a geometric regularization term during probe training to mildly constrain the geometry of the fused representation. This term, referred to as GeoReg in the implementation, is controlled by a single scalar hyperparameter that multiplies the entire regularization contribution. In all reported experiments, this weight is fixed to a small value of $0.1$ when enabled, and set to zero otherwise. The design choice reflects the intended role of the regularizer as a secondary bias rather than a competing objective. Concretely, GeoReg combines two effects computed on the current batch features: a spectral dispersion term given by the variance of the eigenvalues of the feature covariance matrix, and a class-level separation term based on the logarithm of the area of a simplex formed by randomly sampled class centroids. The spectral component penalizes highly anisotropic representations by increasing loss when variance concentrates along a few directions, while the centroid-based term encourages non-degenerate class configurations by favoring larger geometric separation.
The relative scaling of these components is fixed in the code, leaving the outer weight as the only tunable hyperparameter\. Empirically, larger values were found to interfere with optimization of the primary cross\-entropy objective, especially in low\-data or high\-class\-count regimes, while smaller values had negligible effect\. The chosen value therefore reflects a compromise that consistently biases training toward better\-conditioned feature spaces without destabilizing learning\. Importantly, the regularizer is applied only when the number of classes is sufficient to define meaningful centroid geometry, and it is automatically disabled otherwise\. This conditional use ensures that GeoReg does not introduce spurious gradients in degenerate or low\-class settings\. Overall, the hyperparameterization mirrors the broader philosophy of the method: geometric criteria are used to gently shape representation structure, while predictive performance remains primarily driven by the ridge\-regularized linear probe\.
### A\.11GeoReg Computation Overhead
GeoReg adds 5.6% per-batch overhead on DINOv2-S/14 with Stanford Cars (trainable, $K=3$), dominated by the eigenvalue decomposition of the $D$-dimensional covariance matrix (11 ms for $D=768$; see Table [A15](https://arxiv.org/html/2605.23033#A1.T15)).
Table A15: Per-batch computational overhead of GeoReg on DINOv2-S/14 (Stanford Cars, trainable, $k=3$). The overhead is modest (+5.6%) and arises primarily from covariance eigendecomposition.
### A\.12Additional Implementation Details of Geometric Regularization
This subsection provides additional details on the geometric regularization term used during downstream adaptation, focusing on design and implementation choices that are not covered in the main text\. The goal of this regularizer is to gently bias learned representations toward favorable linear geometry, consistent with the ridge\-based analysis and layer selection criteria, without introducing task\-specific constraints or significant optimization overhead\.
The regularizer is applied to the fused representation produced after layer aggregation and optional adapter projection. Let $Z\in\mathbb{R}^{B\times d}$ denote the batch of fused features, where $B$ is the batch size. The geometric penalty is added to the task loss with a fixed scalar weight and is evaluated independently for each mini-batch. Importantly, the regularizer operates on batch-level statistics and does not require storing running estimates or dataset-level covariance statistics.
The isotropy component is computed using the empirical batch covariance
$$\Sigma_{Z}=\frac{1}{B}(Z-\bar{Z})^{\top}(Z-\bar{Z}),$$
where $\bar{Z}$ denotes the batch mean. Rather than normalizing the spectrum by trace or enforcing a hard constraint on eigenvalues, the implementation penalizes the variance of the eigenvalue spectrum. This choice directly targets spectral concentration while remaining numerically stable and differentiable. In practice, a small diagonal jitter is added to $\Sigma_{Z}$ to avoid numerical issues when batch size is small or features are nearly collinear. This formulation aligns with the theoretical analysis linking isotropy to reduced worst-case alignment bias in ridge regression, while avoiding additional normalization hyperparameters.
The class\-geometry component is implemented using batch\-level class centroids\. When at least three distinct classes are present in a batch, three centroids are sampled uniformly at random and the area of the triangle they form is computed\. This computation is intentionally stochastic and lightweight: only a single triplet is used per batch, and no attempt is made to enumerate or average over all possible triplets\. The logarithm of the resulting area is incorporated into the loss with a negative sign, discouraging degenerate or nearly collinear class configurations\. If fewer than three classes are present in a batch, this term is skipped entirely\. This conditional behavior ensures that the regularizer does not introduce spurious gradients in low\-class or imbalanced settings\.
Both components are combined under a single weighting coefficient\. Across all experiments, this coefficient is fixed to a small value and is not tuned per dataset or model\. Empirically, larger values were observed to interfere with optimization of the primary task objective, while smaller values had negligible effect\. The chosen value therefore reflects a conservative trade\-off, where geometric structure is encouraged without dominating the training dynamics\. Notably, even when this hyperparameter is suboptimal, improvements over the last\-layer baseline are consistently observed, indicating that the method is not overly sensitive to precise tuning\.
When the backbone encoder is frozen, gradients from the geometric regularizer affect only the probe and any adapter layers\. When the encoder is trainable, the regularizer additionally stabilizes representation geometry during fine\-tuning, mitigating collapse of variance into a small number of directions\. For dense prediction and regression tasks, where class centroids are ill\-defined or computationally expensive to estimate, the centroid\-based term is disabled and only the isotropy component is retained\.
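The batch-level computation described in this subsection can be sketched as follows. This is a forward-value illustration in NumPy only; an actual training loop would need the same operations in an autodiff framework. The equal weighting of the two components, the `jitter` and `eps` defaults, and the function name `georeg_penalty` are assumptions, since the text states only that the relative scaling is fixed in the code and that the outer weight is 0.1.

```python
import numpy as np

def georeg_penalty(Z, labels, weight=0.1, jitter=1e-6, eps=1e-12, rng=None):
    """Sketch of the GeoReg value for one mini-batch (no gradients)."""
    rng = np.random.default_rng() if rng is None else rng
    B, d = Z.shape

    # Isotropy component: variance of the batch-covariance eigenvalues.
    Zc = Z - Z.mean(axis=0, keepdims=True)
    cov = Zc.T @ Zc / B + jitter * np.eye(d)      # diagonal jitter for stability
    spectral = np.linalg.eigvalsh(cov).var()

    # Class-geometry component: negative log-area of one random centroid triangle.
    separation = 0.0
    classes = np.unique(labels)
    if len(classes) >= 3:                          # skipped for fewer than 3 classes
        picked = rng.choice(classes, size=3, replace=False)
        a, b, c = (Z[labels == k].mean(axis=0) for k in picked)
        u, v = b - a, c - a
        gram = (u @ u) * (v @ v) - (u @ v) ** 2
        area = 0.5 * np.sqrt(max(gram, 0.0))
        separation = -np.log(area + eps)           # eps guards log(0) (assumption)

    return weight * (spectral + separation)
```

Sampling a single triplet per batch keeps the cost independent of the number of classes, at the price of a noisier estimate. For regression or dense prediction, passing labels with fewer than three distinct values reduces the sketch to the isotropy component alone.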
**Algorithm 1** LOES Layer Selection Algorithm (Classification)

**Input:** Embeddings $\mathcal{E}=\{X_{1},\dots,X_{L}\}$, labels $y$, budget $K$, parameters $\lambda,\alpha,\gamma,\eta$

**Output:** Selected layer indices $\mathcal{S}$

1. $Y\leftarrow\mathrm{OneHot}(y)$
2. $\mathcal{S}\leftarrow\emptyset$, $X_{\mathcal{S}}\leftarrow\emptyset$, $\widehat{Y}\leftarrow\mathbf{0}$
3. Stage (i): Initial Selection (raw features, original targets)
   - $s^{\star}\leftarrow\arg\min_{l}\Big[\mathcal{L}_{\mathrm{ridge}}(X_{l},Y)+\alpha\big(1-\mathrm{Iso}(X_{l})\big)\Big]$ {Eq. ([1](https://arxiv.org/html/2605.23033#S3.E1)), ([4](https://arxiv.org/html/2605.23033#S3.E4))}
   - $\mathcal{S}\leftarrow\{s^{\star}\}$, $X_{\mathcal{S}}\leftarrow X_{s^{\star}}$
   - $\widehat{Y}\leftarrow\mathrm{RidgePred}(X_{s^{\star}},Y)$ {refit on raw $X_{s^{\star}}$ vs $Y$}
   - $R\leftarrow Y-\widehat{Y}$ {residual; see Eq. ([7](https://arxiv.org/html/2605.23033#S3.E7)) discussion}
4. Stage (ii): Greedy Complementary Selection. **while** $|\mathcal{S}|<K$ **do**
   - $v_{\text{best}}\leftarrow\infty$, $s^{\star}\leftarrow\text{None}$
   - **for** $l\in\{1,\dots,L\}\setminus\mathcal{S}$ **do**
     - Orthogonalize $X_{l}$ w.r.t. selected context $X_{\mathcal{S}}$ (Eq. ([3](https://arxiv.org/html/2605.23033#S3.E3))):
       - $X_{l}^{c}\leftarrow X_{l}-\mu(X_{l})$, $X_{\mathcal{S}}^{c}\leftarrow X_{\mathcal{S}}-\mu(X_{\mathcal{S}})$
       - $\widetilde{X}_{l}\leftarrow X_{l}^{c}-X_{\mathcal{S}}^{c}\big((X_{\mathcal{S}}^{c})^{\top}X_{\mathcal{S}}^{c}+\epsilon I\big)^{-1}(X_{\mathcal{S}}^{c})^{\top}X_{l}^{c}+\mu(X_{l})$
     - Score components (Eq. ([7](https://arxiv.org/html/2605.23033#S3.E7))):
       - $\mathcal{L}\leftarrow\mathcal{L}_{\mathrm{ridge}}(\widetilde{X}_{l},R)$ {$\mathrm{Loss}_{\ell}$: marginal fit on residual}
       - $\rho\leftarrow\max_{j\in\mathcal{S}}\dfrac{\|X_{l}^{\top}X_{j}\|_{F}}{\|X_{l}\|_{F}\,\|X_{j}\|_{F}}$ {$\mathrm{Red}_{\ell}$, Eq. ([5](https://arxiv.org/html/2605.23033#S3.E5))}
       - $\tau\leftarrow\mathrm{TriGeo}(\widetilde{X}_{l},y)$ {$\mathrm{Tri}$, Eq. ([6](https://arxiv.org/html/2605.23033#S3.E6)); $\eta=0$ if regression}
       - $v\leftarrow\mathcal{L}+\alpha\big(1-\mathrm{Iso}(X_{l})\big)+\gamma\,\rho-\eta\,\tau$ {Eq. ([7](https://arxiv.org/html/2605.23033#S3.E7))}
     - **if** $v<v_{\text{best}}$ **then** $v_{\text{best}}\leftarrow v$, $s^{\star}\leftarrow l$ **end if**
   - **end for**
   - **if** $s^{\star}=\text{None}$ **then** break **end if**
   - Refit on raw features against original targets and update ensemble:
     - $\widehat{Y}\leftarrow\widehat{Y}+\mathrm{RidgePred}(X_{s^{\star}},Y)$ {refit uses $(X_{s^{\star}},Y)$, not $(\widetilde{X}_{s^{\star}},R)$}
     - $R\leftarrow Y-\widehat{Y}$
     - $\mathcal{S}\leftarrow\mathcal{S}\cup\{s^{\star}\}$, $X_{\mathcal{S}}\leftarrow[X_{\mathcal{S}},X_{s^{\star}}]$
   - **end while**
5. **return** $\mathcal{S}$
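The following NumPy sketch mirrors the steps of Algorithm 1 under stated assumptions, and is not the authors' implementation. The ridge loss is taken to be the mean squared residual of a closed-form ridge fit (Eq. (1) is not reproduced in this appendix), the default values of `lam`, `alpha`, `gamma`, `eta`, `eps`, and `delta` are placeholders, the isotropy and triangle terms follow Eqs. (37) and (38) literally, and layers are indexed from zero.

```python
import itertools
import numpy as np

def ridge_pred(X, T, lam):
    """Closed-form ridge fit of T on X, returning the fitted values."""
    W = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ T)
    return X @ W

def ridge_loss(X, T, lam):
    """Mean squared residual of the ridge fit (assumed form of L_ridge)."""
    return float(np.mean((T - ridge_pred(X, T, lam)) ** 2))

def isotropy(X, delta=1e-8):
    mu = np.linalg.eigvalsh(X.T @ X / X.shape[0])
    return float(mu.mean() / np.sqrt(mu.var() + delta))

def redundancy(X_l, X_j):
    return float(np.linalg.norm(X_l.T @ X_j)
                 / (np.linalg.norm(X_l) * np.linalg.norm(X_j)))

def tri_geo(X, y):
    cents = [X[y == c].mean(axis=0) for c in np.unique(y)]
    areas = []
    for a, b, c in itertools.combinations(cents, 3):
        u, v = b - a, c - a
        areas.append(0.5 * np.sqrt(max((u @ u) * (v @ v) - (u @ v) ** 2, 0.0)))
    return float(np.mean(areas)) if areas else 0.0

def orthogonalize(X_l, X_S, eps=1e-6):
    """Remove the part of centered X_l explained by centered X_S, then re-add the mean."""
    mu_l = X_l.mean(axis=0, keepdims=True)
    Xc, Sc = X_l - mu_l, X_S - X_S.mean(axis=0, keepdims=True)
    coef = np.linalg.solve(Sc.T @ Sc + eps * np.eye(Sc.shape[1]), Sc.T @ Xc)
    return Xc - Sc @ coef + mu_l

def loes_select(layers, y, K, lam=1.0, alpha=0.1, gamma=0.1, eta=0.1):
    """Greedy layer selection; `layers` is a list of (N, d_l) arrays, `y` integer labels."""
    Y = np.eye(int(y.max()) + 1)[y]                       # one-hot targets
    # Stage (i): best single layer on raw features and original targets.
    first = min(range(len(layers)),
                key=lambda l: ridge_loss(layers[l], Y, lam)
                + alpha * (1 - isotropy(layers[l])))
    S, X_S = [first], layers[first]
    Y_hat = ridge_pred(layers[first], Y, lam)
    R = Y - Y_hat
    # Stage (ii): greedily add layers that explain the residual.
    while len(S) < K:
        best_v, best_l = np.inf, None
        for l in range(len(layers)):
            if l in S:
                continue
            X_t = orthogonalize(layers[l], X_S)
            v = (ridge_loss(X_t, R, lam)
                 + alpha * (1 - isotropy(layers[l]))
                 + gamma * max(redundancy(layers[l], layers[j]) for j in S)
                 - eta * tri_geo(X_t, y))
            if v < best_v:
                best_v, best_l = v, l
        if best_l is None:
            break
        # Refit on raw features against the original targets, as in Algorithm 1.
        Y_hat = Y_hat + ridge_pred(layers[best_l], Y, lam)
        R = Y - Y_hat
        S.append(best_l)
        X_S = np.hstack([X_S, layers[best_l]])
    return S
```

The returned list preserves selection order, so its first entry is the Stage (i) layer. Setting `eta=0` disables the class-geometry term, as the algorithm prescribes for regression; a regression variant would also pass continuous targets in place of the one-hot matrix.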
### Datasets
| Task | Dataset | Modality | Samples | Split |
| --- | --- | --- | --- | --- |
| Audio Classification | ASVspoof 2019 (LA) | Audio | 121,461 | Train/Dev/Test |
| Audio Classification | CREMA-D | Audio-Visual | 7,442 | Train/Test |
| Audio Classification | Google Speech Commands | Audio | 105,829 | Train/Val/Test |
| Segmentation | COCO-Stuff 164K | Image | 164,000 | Train/Test |
| Segmentation | Cityscapes | Image | 5,000 | Train/Val/Test |
| Image Classification | Stanford Cars (196) | Image | 16,185 | Train/Test |
| Image Classification | Mini-ImageNet | Image | 60,000 | Train/Val/Test |
| Image Classification | Stanford Dogs | Image | 20,580 | Train/Test |
| Image Classification | CUB-200-2011 | Image | 11,788 | Train/Test |
| Image Classification | CIFAR-100 | Image | 60,000 | Train/Test |
| Image Classification | SUN397 | Image | 108,754 | Train/Test |
| Image Classification | ImageNet | Image | 1,281,167 | Train/Val/Test |
| Text Classification | MTOP Domain | Text | 15,667 | Train/Val/Test |
| Text Classification | Twitter Emotion | Text | 20,000 | Train/Val/Test |
| Text Classification | Amazon Counterfactual | Text | 4,000 | Train/Test |
| Text Classification | MASSIVE Scenario | Text | 16,521 | Train/Val/Test |
| Text Classification | MASSIVE Intent | Text | 16,521 | Train/Val/Test |
| Text Classification | Tweet Sentiment Extraction | Text | 31,015 | Train/Test |
| Multimodal | Amazon Products-2023 | Image/Text | 140,000 | Train/Val/Test |
| Multimodal | Fashion-Gen | Image/Text | 67,306 | Train/Val |
| Multimodal | Fakeddit | Multimodal | 140,000 | Train/Val/Test |

Table A16: Comprehensive overview of benchmark datasets across modalities. This table details the tasks, modalities, and sample distributions for the datasets utilized in this study, including benchmarks from the MTEB suite and fine-grained vision tasks.

Below we provide detailed descriptions of the datasets (Appendix Table [A16](https://arxiv.org/html/2605.23033#A1.T16)) utilized in our experiments, categorized by their primary modality.
##### Audio Classification Datasets
- ASVspoof 2019 Logical Access (LA): A standard benchmark derived from the VCTK corpus for detecting logical access attacks (Text-to-Speech and Voice Conversion) against speaker verification systems. It includes genuine speech alongside spoofing attacks generated by 19 different algorithms.
- CREMA-D: The Crowd-sourced Emotional Multimodal Actors Dataset features facial and vocal expressions from 91 actors of varying ages and ethnicities. It contains audio-visual clips of 12 fixed sentences spoken with six different emotions (Anger, Disgust, Fear, Happy, Neutral, Sad).
- Google Speech Commands (V2): A large-scale benchmark for Keyword Spotting (KWS) consisting of one-second audio clips of 35 specific spoken words. It is designed to train small-footprint models for on-device applications.
##### Segmentation Datasets
- COCO-Stuff 164K: A large-scale scene understanding dataset that extends COCO 2017. Unlike the standard COCO, which focuses on "things" (countable objects), COCO-Stuff provides dense pixel-wise annotations for 172 classes, including 91 "stuff" classes (amorphous regions like sky, grass, and water).
- Cityscapes: A premier dataset for semantic urban scene understanding. It features high-quality pixel-level annotations of complex street scenes from 50 different cities, capturing diverse traffic situations and scene layouts.
##### Image Classification Datasets
- Stanford Cars (Cars196): A fine-grained classification benchmark containing 196 classes of vehicles defined by Make, Model, and Year (e.g., distinguishing a "2011 BMW M3" from a "2012 BMW M3").
- Mini-ImageNet: A subset of ILSVRC-12 designed for few-shot learning and rapid prototyping. It consists of 100 classes selected from the larger ImageNet dataset, typically resized to lower resolutions for meta-learning tasks.
- Stanford Dogs: A fine-grained subset of ImageNet that challenges models to distinguish between 120 closely related dog breeds.
- CUB-200-2011: The Caltech-UCSD Birds dataset is a benchmark for fine-grained visual categorization. It contains 200 bird species and includes rich annotations such as bounding boxes, part locations, and binary attributes.
- CIFAR-100: A widely used benchmark containing low-resolution ($32\times 32$) images across 100 classes. The classes are hierarchically organized into 20 superclasses, making it suitable for testing hierarchical learning.
- SUN397: A definitive benchmark for scene recognition rather than object classification. It covers 397 distinct scene categories (e.g., "Abbey," "Diner," "Forest") from the larger SUN Database.
- ImageNet-1k: A large-scale dataset for object recognition containing 1,000 object categories. It is a subset of the larger ImageNet database and follows the WordNet hierarchy, containing a variety of animal, plant, and object classes.
##### Text Classification Datasets \(MTEB\)
- MTOP Domain: A task-oriented semantic parsing dataset used to classify user utterances into 11 specific action domains (e.g., Alarm, Messaging, Weather).
- Twitter Emotion: Also known as the MTEB Emotion dataset, this consists of English tweets labeled with six basic emotions to evaluate how well embeddings capture emotional nuance.
- Amazon Counterfactual: A dataset designed to challenge models in identifying counterfactual statements within product reviews, distinguishing between actual events and hypothetical scenarios.
- Amazon MASSIVE Scenario: A Natural Language Understanding task where utterances are classified into 18 broad functional categories.
- Amazon MASSIVE Intent: A more granular NLU task requiring the classification of user utterances into 60 specific action-based goals.
- Tweet Sentiment Extraction: A Natural Language Processing task where, given a social media post and its overall sentiment (positive, negative, or neutral), the objective is to extract the specific word or phrase from the text that best reflects and supports that sentiment.
##### Multimodal Datasets
- Amazon Products-2023 (Amazon-M2): A stratified sampling strategy was applied to obtain 140K products, covering all 248 categories with balanced category sizes when possible and full inclusion of low-frequency categories.
- Fashion-Gen: A dataset for high-resolution text-to-image synthesis in the fashion domain, pairing professional images with stylist-written descriptions. We filtered for unique product IDs, resulting in 67,306 samples.
- Fakeddit: A massive multimodal fake news detection benchmark from Reddit. It pairs text posts with images and metadata to detect various types of misinformation. We randomly sampled 140K image-text pairs for our experiments.