Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Summary
This paper demonstrates that cosine similarity is a poor proxy for assessing layer importance in LLMs, and proposes using the actual accuracy drop from layer removal as a more robust metric.
# Rethinking Layer Relevance in Large Language Models Beyond Cosine Similarity
Source: [https://arxiv.org/html/2605.14075](https://arxiv.org/html/2605.14075)
Cristian Hinostroza1,2, Rodrigo Toro Icarte1,2, Christ Devia2,3, Andres Carvallo2, Eugenio Herrera\-Berg2, Denis Parra1,2, Jorge F\. Silva2,3

1 Pontificia Universidad Católica de Chile, 2 National Center for Artificial Intelligence \(CENIA\), 3 Universidad de Chile
###### Abstract
Large language models \(LLMs\) have revolutionized natural language processing\. Understanding their internal mechanisms is crucial for developing more interpretable and optimized architectures\. Mechanistic interpretability has led to the development of various methods for assessing layer relevance, with cosine similarity being a widely used tool in the field\. In this work, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal\. Our theoretical analysis shows that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model’s performance\. On the other hand, empirical evidence from a range of LLMs confirms that the correlation between cosine similarity and actual performance degradation is often weak or moderate, leading to misleading interpretations of a transformer’s internal mechanisms\. We propose a more robust metric for assessing layer relevance: the actual drop in model accuracy resulting from the removal of a layer\. Even though it is a computationally costly metric, this approach offers a more accurate picture of layer importance, allowing for more informed pruning strategies and lightweight models\. Our findings have significant implications for the development of interpretable LLMs and highlight the need to move beyond cosine similarity in assessing layer relevance\.
## 1 Introduction
Transformers \(Vaswani et al\., [2017](https://arxiv.org/html/2605.14075#bib.bib1)\), initially designed for tasks related to large language models \(LLMs\) \(Chkirbene et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib41)\), have become the main architecture for modern AI\. They now support applications in computer vision \(Caron et al\., [2021](https://arxiv.org/html/2605.14075#bib.bib10)\), reinforcement learning \(Li et al\., [2023a](https://arxiv.org/html/2605.14075#bib.bib3)\), multimodal learning \(Xu et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib11)\), recommender systems \(Villa et al\., [2020](https://arxiv.org/html/2605.14075#bib.bib16)\), and beyond\. Since these models play a central role in AI, uncovering which parts matter the most can guide us toward more interpretable and optimized architectures\.
*Mechanistic interpretability* aims to reverse\-engineer pre\-trained LLMs to better understand how they work \(Ferrando et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib54)\)\. In this context, cosine similarity has become a standard tool for assessing semantic relationships between internal representations \(Sanh et al\., [2019](https://arxiv.org/html/2605.14075#bib.bib50); Li et al\., [2023b](https://arxiv.org/html/2605.14075#bib.bib53); Sun et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib17); Modell et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib48)\)\. Intuitively, when the angle between two token embeddings is small, the tokens are assumed to encode similar information\.
Recent studies have used cosine similarity to assess layer relevance in pre\-trained LLMs \(e\.g\., Sajjad et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib51); He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33); Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39); Sun et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib17); Yang et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib49)\)\. The core idea is that layers making minimal changes to their input vectors are considered less relevant, with relevance quantified as one minus the cosine similarity between a layer’s input and output vectors\. This score has been applied in various contexts: for example, Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\) used it to prune models and analyze performance across tasks, finding that reasoning tasks require more layers than factual ones\. Similarly, He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\) visualized relevance scores across datasets \(Figure [1](https://arxiv.org/html/2605.14075#S1.F1)B\), showing that some layers consistently appear irrelevant regardless of the task\.
While these results offer valuable insights, they hinge on the assumption that cosine similarity is a reliable indicator of layer relevance, an assumption we challenge\. In this paper, we demonstrate that cosine similarity is a poor proxy for the actual performance degradation caused by layer removal\. For example, layer 16 in OLMo appears to be of low relevance according to cosine similarity, as illustrated in Figure [1](https://arxiv.org/html/2605.14075#S1.F1)B \(where irrelevant layers are shown in yellow\)\. However, removing this layer results in an average accuracy drop of 66% across the ten datasets presented\. In fact, eliminating layer 16 alone reduces OLMo’s performance to chance level on ARC\-C\. These findings suggest that relying on cosine similarity as a relevance metric can lead to misleading interpretations of a transformer’s internal mechanisms\.
Figure 1: Relevance of the layers of OLMo \(Groeneveld et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib44)\) across ten datasets\. A\. Accuracy\-based relevance scores: We measure the drop in task accuracy to evaluate the relevance of each layer\. Layers that increase accuracy when removed are highlighted in green, layers that do not affect accuracy appear in white, and layers that reduce the model’s accuracy when removed are indicated in red/purple\. B\. Relevance scores computed using the cosine similarity score, which measures how much each layer transforms its input\. The least relevant layers appear in yellow\.

In this paper, we provide a formal proof demonstrating that a layer can exhibit an arbitrarily low cosine similarity score while still being crucial to the model’s performance\. In particular, removing such a layer can drastically alter the model’s output, potentially reducing its accuracy from perfect to zero\. This phenomenon arises when the layer introduces a subtle modification to its input vector that is subsequently amplified by downstream layers, resulting in a snowball effect\. Consequently, despite its near\-zero cosine similarity score, the removal of this layer can significantly disrupt the model’s final predictions\.
We then show that this theoretical worst\-case scenario does occur, to some degree, in practice\. Empirically, we find that the correlation between cosine similarity and actual performance degradation is often weak or moderate, depending on the model\. As a result, cosine similarity either overestimates or underestimates a layer’s true relevance in over 90% of cases we studied\.
Having established that cosine similarity is an unreliable metric for assessing layer relevance, we next investigate the implications of re\-running previously proposed experiments using a more robust alternative\. Specifically, we argue that for the purposes of mechanistic interpretability, the most appropriate metric is the actual drop in model accuracy resulting from the removal of a layer\. While this approach is computationally expensive—requiring layer\-by\-layer removal and performance re\-evaluation—it avoids the shortcomings inherent to cosine similarity\. Crucially, this metric captures the complex interdependencies among layers in Transformer architectures\.
We begin by replicating the layer relevance visualization introduced by He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\)\. Figure [1](https://arxiv.org/html/2605.14075#S1.F1)A displays the relevance of each layer in OLMo across ten datasets, measured by the change in model accuracy after removing each layer individually\. Red/purple indicates a drop in accuracy, green an improvement, and white no change\. This visualization offers a markedly different perspective from cosine similarity, revealing that layer relevance varies by dataset and highlighting the critical role of layers 8 and 16 in OLMo’s performance\.
We then replicated the task analysis proposed by Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\), which involved pruning layers deemed irrelevant based on cosine similarity and observing the resulting performance drop\. Instead, we ranked layers by the actual decrease in accuracy on the task’s training set and pruned accordingly\. Results are shown in Figure [2](https://arxiv.org/html/2605.14075#S1.F2)\. Because our metric better reflects layer relevance, the performance drop in HellaSwag is less pronounced than in the original study\. This challenges the conclusion that all layers are essential for reasoning tasks: using a more informative metric, we find that over 75% accuracy can be maintained even after removing 22% of the layers\.
Figure 2: We evaluate LLaMA\-3\-8B using the cosine\-similarity pruning strategy proposed by Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\), and compare it with our method\. In contrast to cosine similarity, our approach mitigates immediate performance degradation in reasoning tasks, highlighting the critical role of selecting an appropriate metric for interpreting model internals\.

We conclude with a practical application in structured pruning \(Anwar et al\., [2017](https://arxiv.org/html/2605.14075#bib.bib19)\), which aims to remove layers from trained models with minimal impact on performance\. In the task\-dependent setting, we show that pruning layers based on our accuracy\-based relevance score yields superior results compared to existing methods, including Taylor approximations \(Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36); Ma et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib37)\), cosine similarity \(He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34)\), and FinerCut \(Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39)\)\. In the task\-independent setting, our method also achieves the best performance, though it is sensitive to the choice of calibration dataset\.
## 2 Related Work
A central challenge in Transformer research is accurately measuring layer relevance\. This question is critical for two main applications: mechanistic interpretability, which seeks to understand how pre\-trained LLMs operate, and structured pruning, which aims to reduce model size by removing irrelevant layers while preserving performance\. Cosine similarity has become a popular metric for both tasks due to its computational efficiency and intuitive appeal \(e\.g\., Sajjad et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib51); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34); He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33); Yang et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib49); Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39)\)\. It assumes that layers making minimal changes to their input vectors are less relevant\. Moreover, cosine\-based pruning has achieved strong results in task\-independent settings\.
However, cosine similarity is only a proxy for what truly matters: downstream performance\. While prior work has raised concerns about its use in comparing token embeddings \(Timkey and Van Schijndel, [2021](https://arxiv.org/html/2605.14075#bib.bib2)\), to our knowledge, this is the first study to rigorously evaluate, both theoretically and empirically, its limitations in estimating layer relevance in Transformer models\. We then propose an alternative: an accuracy\-based relevance score, which considers a layer relevant only if its removal significantly degrades performance on a given task\.
Beyond cosine similarity, which is typically used as a local metric \(e\.g\., Sajjad et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib51); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34); He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33)\), several global metrics have been proposed\. These assess relevance by evaluating changes in the model’s output after removing a layer\. Global metrics fall into two categories: consistency\-based and performance\-based\. Consistency\-based metrics compare the model’s output distributions with and without a target layer \(Sieberling et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib55); Yang et al\., [2026](https://arxiv.org/html/2605.14075#bib.bib47); Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39)\), identifying layers whose removal leaves the output unchanged\. However, these metrics focus on output invariance rather than predictive accuracy, and may overlook layers that subtly affect performance\.
Performance\-based metrics are more aligned with our approach \(Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36); Ma et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib37); Zhong et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib38); Song et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib46)\)\. These metrics rely on ground\-truth information to assess the relevance of a layer\. For example, Ma et al\. \([2023](https://arxiv.org/html/2605.14075#bib.bib37)\) use Taylor expansions to estimate the change in loss when a layer is removed\. Other works rely on perplexity\-based scores, deeming layers irrelevant if their removal does not significantly increase perplexity \(Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36); Zhong et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib38); Song et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib46)\)\. Like our accuracy\-based score, these methods aim to identify layers whose exclusion yields minimal performance degradation\. Nevertheless, as we show in Section [6](https://arxiv.org/html/2605.14075#S6), our metric consistently outperforms these alternatives in structured pruning tasks\.
Finally, our work connects with a broader literature on understanding how Transformers represent and process information \(Brinkmann et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib29); Clark et al\., [2019b](https://arxiv.org/html/2605.14075#bib.bib22); Devlin et al\., [2019](https://arxiv.org/html/2605.14075#bib.bib23); Geva et al\., [2021](https://arxiv.org/html/2605.14075#bib.bib18); [2022](https://arxiv.org/html/2605.14075#bib.bib20); [2023](https://arxiv.org/html/2605.14075#bib.bib25); Gurnee and Tegmark, [2024](https://arxiv.org/html/2605.14075#bib.bib26); Jawahar et al\., [2019](https://arxiv.org/html/2605.14075#bib.bib21); Lioubashevski et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib31); Meng et al\., [2022](https://arxiv.org/html/2605.14075#bib.bib24); Sun et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib17); Tigges et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib30)\)\. In particular, our relevance metric could be used to revisit studies that identify functional behaviors in specific attention layers \(Clark et al\., [2019b](https://arxiv.org/html/2605.14075#bib.bib22); Geva et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib25)\) or MLPs \(Geva et al\., [2021](https://arxiv.org/html/2605.14075#bib.bib18); Meng et al\., [2022](https://arxiv.org/html/2605.14075#bib.bib24)\), offering new insights into their contributions\. It may also help bridge findings on global Transformer behavior \(Gurnee and Tegmark, [2024](https://arxiv.org/html/2605.14075#bib.bib26); Brinkmann et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib29); Tigges et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib30)\) with specific layers or processing stages\. We expand on these connections and review additional pruning methods in Appendix [A](https://arxiv.org/html/2605.14075#A1)\.
## 3 Cosine\-Similarity Score
Let us begin by formally defining the *cosine\-similarity score*\. The cosine\-similarity score is a local metric that examines the difference between the input and output vectors of a layer to assess its relevance \(Sajjad et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib51); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34); He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33)\)\. Intuitively, if the output of a layer is identical to its input, removing that layer would have no effect on the model’s performance\. Formally, given two vectors $\bm{x}$ and $\bm{y}$, the cosine similarity is defined as follows:

$$\text{CosineSim}(\bm{x},\bm{y})=\frac{\bm{x}\cdot\bm{y}}{\|\bm{x}\|\cdot\|\bm{y}\|} \tag{1}$$

To define a score where the least relevant layers receive a value of zero, we compute the cosine\-similarity score as one minus the cosine similarity between the input and output vectors of a layer\. Given a calibration dataset $\mathbb{D}=\{s^{(i)}\}_{i=1}^{N}$, the relevance of a layer is then calculated as the average cosine\-similarity score across all tokens and instances:

$$\operatorname{CosSimScore}(l;\mathbb{D})=\frac{1}{N}\sum_{i=1}^{N}\frac{1}{n^{(i)}}\sum_{j=1}^{n^{(i)}}\Big(1-\text{CosineSim}\!\big(\bm{X}^{(l,i)}_{j,:},\bm{X}^{(l+1,i)}_{j,:}\big)\Big), \tag{2}$$

where each sequence $s^{(i)}$ has $n^{(i)}$ tokens, $\bm{X}^{(l,i)}\in\mathbb{R}^{n^{(i)}\times d}$ is the intermediate layer representation of $s^{(i)}$ at layer $l$, and $\bm{X}^{(l,i)}_{j,:}\in\mathbb{R}^{d}$ denotes the representation of the $j$\-th token at layer $l$\.
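As a concrete reading of Equations 1 and 2, the following minimal sketch computes the score for one layer from its input and output token vectors. It uses plain Python lists in place of real hidden states; the function names and the toy 2\-dimensional vectors are illustrative assumptions, not taken from the paper.

```python
import math

def cosine_sim(x, y):
    """Cosine similarity between two equal-length vectors (Equation 1)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def cos_sim_score(inputs, outputs):
    """Mean of (1 - cosine similarity) over tokens, then over sequences (Equation 2).

    inputs[i][j] is the vector of token j of sequence i entering the layer;
    outputs[i][j] is the vector of the same token leaving it.
    """
    per_sequence = []
    for seq_in, seq_out in zip(inputs, outputs):
        token_scores = [1.0 - cosine_sim(x, y) for x, y in zip(seq_in, seq_out)]
        per_sequence.append(sum(token_scores) / len(token_scores))
    return sum(per_sequence) / len(per_sequence)

# Toy data (assumed): two sequences with 2 and 1 tokens, hidden size 2.
layer_in = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]]
identity_out = layer_in                                    # layer changes nothing
rotated_out = [[[0.0, 1.0], [-1.0, 0.0]], [[-1.0, 1.0]]]   # every token rotated by 90 degrees

print(cos_sim_score(layer_in, identity_out))  # close to 0: layer looks irrelevant
print(cos_sim_score(layer_in, rotated_out))   # close to 1: layer looks relevant
```

Averaging within each sequence before averaging across sequences mirrors the nested sums in Equation 2, so long and short sequences contribute equally.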
## 4 Rethinking Layer Relevance: Beyond Cosine Similarity
This section highlights the limitations of cosine similarity as a layer relevance metric\. We show, both theoretically and empirically, that layers assessed as irrelevant by cosine similarity can still cause significant drops in downstream performance when removed\. To address this, we propose an accuracy\-based metric that directly evaluates relevance based on what truly matters: the model’s predictive performance\.
### 4\.1 Limitations of Cosine Similarity for Layer Relevance
We begin by formally demonstrating that a layer can have an arbitrarily low cosine similarity score while still having a significant impact on model performance\. Specifically, the following theorem shows that for any dataset $\mathbb{D}$ and any $\epsilon>0$, it is possible to construct a decoder\-only Transformer that achieves perfect accuracy on $\mathbb{D}$, yet the removal of the layer with the lowest cosine similarity score reduces the model’s performance to zero\. Moreover, the cosine similarity score of that layer is $\epsilon$\.
###### Theorem 1
Let $f^{L}$ denote a Transformer model with $L$ layers, and $f^{L}_{-l}$ represent the same model with layer $l$ removed\. Then, for any $\epsilon>0$ and any calibration dataset $\mathbb{D}=\{(s^{(i)},y^{(i)})\}_{i=1}^{N}$ such that $s^{(i)}\neq s^{(j)}$ for all $i\neq j$ and $y^{(i)}\in\{0,\dots,C-1\}$, there exists a decoder\-only Transformer $f^{L}$ with $L\geq 3$ satisfying the following conditions:

1. There exists an intermediate layer $l\in\{1,\dots,L-2\}$ such that $\operatorname{CosSimScore}(l;\mathbb{D})=\epsilon$, and $\operatorname{CosSimScore}(i;\mathbb{D})>\epsilon$ for all $i\neq l$\.
2. The full model achieves perfect accuracy: $f^{L}(s^{(i)})=y^{(i)}$ for all $s^{(i)}\in\mathbb{D}$, but removing layer $l$ causes the model’s accuracy to drop to zero: $f^{L}_{-l}(s^{(i)})\neq y^{(i)}$ for all $s^{(i)}\in\mathbb{D}$\.
To construct a Transformer in which a layer has an arbitrarily low cosine similarity score yet significantly impacts model performance, two key conditions must be met\. First, a snowball effect must occur: the target layer introduces a subtle change to its input vector, which is then amplified by subsequent layers\. This allows the layer to have a minimal cosine similarity score while still influencing the final output\. Second, some embedding dimensions must be irrelevant to the model’s prediction\. This enables other layers to make large changes in those irrelevant dimensions, inflating their cosine similarity scores without contributing to performance\. A complete proof is provided in Appendix [B](https://arxiv.org/html/2605.14075#A2)\.
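The snowball effect can be illustrated with a small numeric toy. This is not the construction from the proof in Appendix B and not a Transformer: the two hand\-written "layers", the gain of 1000, and the 2\-dimensional hidden state are all assumptions chosen only to show how a near\-zero score can coexist with a decisive effect on the prediction.

```python
import math

def cos_sim_score(x, y):
    """One minus the cosine similarity between a layer's input x and output y."""
    dot = sum(a * b for a, b in zip(x, y))
    return 1.0 - dot / (math.hypot(*x) * math.hypot(*y))

def nudge(h):
    """Target layer (toy): a tiny shift along the task-relevant dimension 0."""
    return [h[0] + 0.02, h[1]]

def amplify(h):
    """Downstream layer (toy): scales dimension 0, magnifying any earlier shift."""
    return [1000.0 * h[0], h[1]]

def predict(h):
    """Read-out (toy): the predicted class is decided by the sign of dimension 0."""
    return 1 if h[0] > 0 else 0

h0 = [-0.01, 1.0]   # assumed hidden state; the correct label is 1
h1 = nudge(h0)

nudge_score = cos_sim_score(h0, h1)               # tiny: the layer barely rotates h0
amplify_score = cos_sim_score(h1, amplify(h1))    # large: the layer rotates h1 a lot
with_layer = predict(amplify(nudge(h0)))          # full model
without_layer = predict(amplify(h0))              # model with the nudge layer removed

print(nudge_score, amplify_score, with_layer, without_layer)
```

In this toy, the nudge layer would be ranked least relevant by the cosine similarity score, yet removing it flips the prediction from the correct class to the wrong one.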
We believe both phenomena can naturally arise in pre\-trained LLMs, particularly in task\-dependent settings\. For a given task, many transformations applied by the model may be irrelevant to solving it\. Empirically, we observe a snowball effect in models like OLMo, where layer 16 exhibits a very low cosine similarity score yet has a substantial impact on performance \(see Figure [1](https://arxiv.org/html/2605.14075#S1.F1)\)\.
Figure 3: A\. Relationship between cosine similarity scores and performance variation after removing a layer\. Each point represents a specific layer–task pair from one of the 28 middle layers in Pythia, Mistral, or OLMo, evaluated across ten tasks \(same set as in Figure [1](https://arxiv.org/html/2605.14075#S1.F1)\)\. B\. Alignment between cosine similarity rankings and performance rankings across three models, ten tasks, and 28 layers\. Cell $(i,j)$ indicates the number of times cosine similarity assigned rank $j$ while the ground\-truth rank was $i$ \(rank 1 = least relevant\)\. The heatmap uses three distinct color scales: green for the diagonal \(perfect alignment\), blue for low\-cost misrankings, and red for all other cells\.

We now present a more in\-depth empirical evaluation of the cosine similarity score as a proxy for layer relevance\. Specifically, we aim to assess how well cosine similarity predicts the actual drop in downstream performance when a layer is removed\. Figure [3](https://arxiv.org/html/2605.14075#S4.F3)A compares the cosine similarity score with the observed reduction in accuracy after removing individual layers from three pre\-trained LLMs \(Mistral, Pythia, and OLMo\) across ten datasets: C4, CodeAlpaca, LIMA, MathInstruct, BoolQ, ARC\-Challenge, ARC\-Easy, HellaSwag, PIQA, and Winogrande\. We exclude the first and last two layers, as they are trivially identifiable as relevant and behave as clear outliers\.
As shown in Figure [3](https://arxiv.org/html/2605.14075#S4.F3)A, there is some correlation between cosine similarity and performance degradation\. However, the strength of this correlation varies by model: moderate in Pythia \(R = \-0\.46\), weak in Mistral \(R = \-0\.23\), and very weak in OLMo \(R = \-0\.15\)\.
To further evaluate the reliability of cosine similarity, we compare its layer relevance ranking against a ground\-truth ranking based on actual performance drop\. Figure [3](https://arxiv.org/html/2605.14075#S4.F3)B presents a confusion matrix summarizing the results across the same three models and ten datasets\. In this matrix, cell $(i,j)$ indicates the number of times cosine similarity ranked a layer as the $j$\-th least relevant, while its true rank was $i$ according to performance drop\. Diagonal entries represent perfect agreement; entries below the diagonal indicate underestimation of relevance, and those above indicate overestimation\. Overall, cosine similarity misestimated a layer’s relevance in 93\.8% of cases\. That said, not all errors are equally severe: entries near the diagonal reflect small ranking deviations\. But even when considering only substantial errors \(highlighted in red\), cosine similarity still fails in 53\.6% of cases\.
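The rank comparison described above can be sketched for a single model and task as follows. The four made\-up score pairs and the tie\-breaking rule are assumptions for illustration; the paper aggregates such counts over three models, ten datasets, and 28 layers.

```python
def ranks(scores):
    """Rank positions with 1 = least relevant (smallest score); ties broken by index."""
    order = sorted(range(len(scores)), key=lambda k: scores[k])
    rank = [0] * len(scores)
    for position, layer in enumerate(order, start=1):
        rank[layer] = position
    return rank

def rank_confusion(true_scores, proxy_scores):
    """matrix[i-1][j-1] counts layers whose true rank is i and proxy rank is j."""
    n = len(true_scores)
    matrix = [[0] * n for _ in range(n)]
    for i, j in zip(ranks(true_scores), ranks(proxy_scores)):
        matrix[i - 1][j - 1] += 1
    return matrix

def misranked_fraction(matrix):
    """Share of layers that fall off the diagonal (proxy rank differs from true rank)."""
    total = sum(sum(row) for row in matrix)
    on_diagonal = sum(matrix[k][k] for k in range(len(matrix)))
    return 1.0 - on_diagonal / total

# Made-up numbers for four layers (not results from the paper).
accuracy_drop = [0.02, 0.60, 0.05, 0.30]   # ground-truth relevance
cosine_score = [0.01, 0.03, 0.20, 0.10]    # proxy relevance

matrix = rank_confusion(accuracy_drop, cosine_score)
for row in matrix:
    print(row)
print(misranked_fraction(matrix))
```

Here the second layer has true rank 4 but proxy rank 2, so it lands below the diagonal: the proxy underestimates its relevance.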
Overall, these results demonstrate that cosine similarity is an unreliable and noisy metric for estimating layer relevance\. In many cases, layers deemed irrelevant by cosine similarity lead to substantial drops in performance when removed, and vice versa\. This inconsistency highlights the need for caution when using cosine similarity, particularly in the context of mechanistic interpretability\. Relying on such a flawed metric risks drawing incorrect conclusions about how Transformer models function\. We illustrate this issue with two concrete examples in Section [5](https://arxiv.org/html/2605.14075#S5)\.
### 4\.2 Accuracy\-Based Relevance Score
Rather than relying on a proxy, we propose directly visualizing the performance drop to assess how each layer contributes to the model’s effectiveness\. We do so by using an *accuracy\-based score*\. Given a dataset $\mathbb{D}$ and a Transformer model $f^{L}$ with $L$ layers, we assess the relevance of layer $l$ as:

$$\operatorname{AccBasedRelevance}(f^{L},l,\mathbb{D})=1-\frac{\max(\text{Accuracy}(f^{L}_{-l},\mathbb{D})-r(\mathbb{D}),0)}{\max(\text{Accuracy}(f^{L},\mathbb{D})-r(\mathbb{D}),0)}, \tag{3}$$

where $\text{Accuracy}(f^{L},\mathbb{D})$ denotes the accuracy of the full model on dataset $\mathbb{D}$, and $r(\mathbb{D})$ represents the expected performance of a random predictor on the dataset\.
This score ranges from $-\infty$ to $+1$: negative values indicate improved performance upon removal of the layer, zero indicates no change, and positive values reflect a drop in performance\. Thus, higher scores correspond to greater relevance of the layer for the task\. It is important to note that this range is valid only when the full model performs better than a random predictor\. If the model’s accuracy does not exceed that of a random predictor, the denominator is zero, the relevance score becomes ill\-defined, and the analysis should not be applied in such cases\.
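Equation 3 translates directly into a few lines of code. The sketch below takes the three accuracies as plain numbers; how they are measured (which evaluation harness, which dataset split) is left open, and the function name and the four\-way multiple\-choice example values are assumptions for illustration.

```python
def acc_based_relevance(acc_full, acc_pruned, acc_random):
    """Accuracy-based relevance of a layer (Equation 3).

    acc_full:   accuracy of the full model on the dataset
    acc_pruned: accuracy of the model with the layer removed
    acc_random: expected accuracy of a random predictor on the dataset
    """
    full_margin = max(acc_full - acc_random, 0.0)
    if full_margin == 0.0:
        # The full model does not beat chance, so the score is ill-defined.
        raise ValueError("full model must outperform the random predictor")
    pruned_margin = max(acc_pruned - acc_random, 0.0)
    return 1.0 - pruned_margin / full_margin

# Assumed example: four-way multiple choice (chance = 0.25), full model at 0.75.
print(acc_based_relevance(0.75, 0.75, 0.25))  # removal changes nothing
print(acc_based_relevance(0.75, 0.25, 0.25))  # removal drops the model to chance
print(acc_based_relevance(0.75, 0.85, 0.25))  # removal improves accuracy
print(acc_based_relevance(0.75, 0.10, 0.25))  # below chance is clipped to chance
```

Subtracting the random\-predictor accuracy makes scores comparable across datasets with different chance levels, and clipping at zero means a layer whose removal pushes the model below chance still receives a score of exactly 1.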
The accuracy\-based score can be applied to any component of a transformer\-based model, including a single weight, a multi\-head attention layer, an MLP, a Transformer block, or multiple blocks\. That said, in the next section, we will focus on visualizing the importance of Transformer blocks\.
## 5 Case Studies
To assess the practical impact of our findings, we revisit two case studies that used cosine similarity to evaluate layer relevance in pre\-trained LLMs\. Replacing cosine similarity with our accuracy\-based metric, we observed significantly different outcomes\. These results highlight the limitations of proxy metrics and reinforce the value of accuracy\-based evaluation for mechanistic interpretability\.
### 5\.1 Relevance Consistency Across Datasets
Figure 4: Relevance of Mistral’s Transformer blocks across datasets\. A\. Accuracy\-based relevance scores\. B\. Cosine similarity score\.

We begin by revisiting the study *What Matters in Transformers* by He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\), which proposes a method to visualize layer relevance using cosine similarity\. Figure [4](https://arxiv.org/html/2605.14075#S5.F4)B shows the relevance of each layer in Mistral \(Jiang et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib62)\) across multiple datasets, with yellow indicating low cosine similarity score and purple indicating high\. Based on these visualizations, the authors conclude that layer relevance is largely task\-independent, a pattern we also observed in OLMo \(Figure [1](https://arxiv.org/html/2605.14075#S1.F1)B\)\.
When applying our accuracy\-based metric, we obtain a markedly different view of layer relevance, as shown in Figure [1](https://arxiv.org/html/2605.14075#S1.F1)A for OLMo and Figure [4](https://arxiv.org/html/2605.14075#S5.F4)A for Mistral\. These visualizations use a fixed color scale: green for performance gains, white for no change, and red/purple for performance drops\. Unlike cosine similarity, our metric reveals that layer relevance is highly task\-dependent\. For example, removing block 14 in OLMo reduces accuracy by $\sim$41% on MathInstruct but has minimal impact \($\sim$1%\) on CodeAlpaca\. Some layers even show negative relevance, improving performance when removed; for example, block 23 in Mistral increases accuracy by $\sim$25% on MathInstruct but decreases it by $\sim$6% on CodeAlpaca\. Finally, our metric also captures broader task sensitivity\. For instance, Mistral shows consistently lower relevance across blocks on MMLU compared to BoolQ, a distinction not visible in cosine similarity plots\.
Figure 5: Relevances Across Datasets

To ensure these differences are not artifacts of visualization, we conducted a statistical comparison between the two metrics\. Using z\-score normalization, we computed the average variance of each of OLMo’s 32 blocks across ten datasets\. As shown in Figure [5](https://arxiv.org/html/2605.14075#S5.F5), our accuracy\-based score exhibits significantly greater variance than cosine similarity\. A Wilcoxon test \(Appendix [C\.3](https://arxiv.org/html/2605.14075#A3.SS3)\) confirms these differences are statistically significant, reinforcing the visual evidence that layer relevance is task\-dependent\.
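One way to carry out such a variance comparison is sketched below. The text does not spell out the normalization details, so this sketch assumes each metric's blocks\-by\-datasets score matrix is z\-scored with its global mean and standard deviation before the per\-block variance across datasets is taken; the toy matrices are made up, and the Wilcoxon test itself is not reproduced here.

```python
import statistics

def zscore_matrix(scores):
    """Z-score every entry of a blocks x datasets matrix (global mean and std)."""
    flat = [value for row in scores for value in row]
    mean = statistics.fmean(flat)
    std = statistics.pstdev(flat)
    return [[(value - mean) / std for value in row] for row in scores]

def per_block_variance(scores):
    """Variance of each block's normalized score across datasets."""
    return [statistics.pvariance(row) for row in zscore_matrix(scores)]

def mean_block_variance(scores):
    """Average of the per-block variances: higher means more task-dependent."""
    return statistics.fmean(per_block_variance(scores))

# Made-up matrices: rows are blocks, columns are datasets.
cosine_like = [[0.1, 0.1, 0.1], [0.5, 0.5, 0.5]]     # same relevance on every dataset
accuracy_like = [[0.0, 0.4, 0.1], [0.9, 0.2, 0.5]]   # relevance changes with the dataset

print(mean_block_variance(cosine_like))
print(mean_block_variance(accuracy_like))
```

Normalizing first puts both metrics on a common scale, so a larger per\-block variance reflects stronger dependence on the dataset rather than a difference in units.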
Beyond cross\-task consistency, we also explored how relevance evolves during training \(Appendix [C\.4](https://arxiv.org/html/2605.14075#A3.SS4)\) and pruning \(Appendix [C\.5](https://arxiv.org/html/2605.14075#A3.SS5)\)\. In pruning, we found that a layer’s relevance depends on the presence of other layers: removing one can increase or decrease the importance of another\. In training, no clear pattern emerged: some layers gained relevance over time, while others fluctuated\.
### 5\.2 Differences Between Types of Tasks
We now revisit *The Unreasonable Ineffectiveness of the Deeper Layers* by Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\), which argues that deeper layers in pre\-trained LLMs are essential for reasoning tasks \(e\.g\., GSM8K, HellaSwag\) but less relevant for factual retrieval tasks such as MMLU\. Their hypothesis is based on the idea that, when faced with a reasoning task, the model must compute intermediate steps to arrive at the final answer, which implies that all layers contribute meaningfully to such tasks\. Their analysis on LLaMA 2\-70B showed that MMLU retained accuracy under early pruning, while GSM8K and HellaSwag degraded instantly and continued to decline, supporting the hypothesis that deeper layers play a critical role in reasoning\.
Their pruning strategy, however, was based on cosine similarity rather than direct depth\-based ablation\. To assess the robustness of their findings, we replicated the experiment on LLaMA 3\-8B, comparing cosine\-similarity pruning against pruning guided by our accuracy\-based relevance metric\. As shown in Figure [2](https://arxiv.org/html/2605.14075#S1.F2), cosine\-similarity pruning reproduces similar task\-dependent trends: MMLU remained stable under initial pruning, while GSM8K and HellaSwag showed performance drops, particularly in GSM8K\.
In contrast, when pruning is guided by our accuracy\-based metric, a different pattern emerges: the model maintains strong performance on HellaSwag even after several blocks are removed, and GSM8K shows minimal degradation after pruning two blocks\. This suggests that cosine similarity may underestimate the importance of certain blocks for reasoning tasks—pruning them prematurely and thereby reducing model performance\. Moreover, cosine similarity appears unable to identify blocks that are not relevant for reasoning, whereas our method successfully distinguishes between essential and non\-essential layers\.
Finally, we note that, unlike Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\), who pruned contiguous block groups based on aggregate cosine similarity, our method prunes blocks iteratively, re\-evaluating the model after each step\. In Appendix [D](https://arxiv.org/html/2605.14075#A4), we compare both strategies: cosine similarity as used in their work and our iterative method\. While some trends are less pronounced, the core conclusion remains: different metrics yield different insights into model behavior\.
## 6 Empirical Results in Structured Pruning
The primary goal of our accuracy\-based relevance score was to support mechanistic interpretability by providing a reliable measure of layer importance\. However, this metric also proves effective for structured pruning, i\.e\., reducing model size by removing layers with minimal impact on performance \(Anwar et al\., [2017](https://arxiv.org/html/2605.14075#bib.bib19)\)\. Surprisingly, pruning layers deemed irrelevant by our score yields state\-of\-the\-art results while remaining simple to implement\.
Structured pruning methods typically rely on a calibration set to estimate layer relevance and prune up to $p\%$ of the model’s weights\. These methods differ in whether they apply one\-shot or iterative pruning, and in the criteria used to rank layers\. To evaluate pruning effectiveness, we compare generalization performance across standard benchmarks\. We applied our accuracy\-based score iteratively to prune LLaMA3\-8B \(Grattafiori et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib63)\), selected for its use in prior state\-of\-the\-art pruning work \(Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39)\)\. We also replicated the experiment on Mistral\-7B \(Jiang et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib62)\) and evaluated one\-shot pruning \(see Appendix [E\.3](https://arxiv.org/html/2605.14075#A5.SS3)\)\.
Our method was benchmarked against leading pruning techniques, including Taylor approximations \(Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36); Ma et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib37)\), cosine similarity \(He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Men et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib33); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34)\), output\-based metrics \(e\.g\., output cosine similarity, norm similarity, divergence similarity\) \(Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39); Yang et al\., [2026](https://arxiv.org/html/2605.14075#bib.bib47); Sieberling et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib55)\), and perplexity\-based relevance \(Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36); Zhong et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib38); Song et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib46)\)\.
To ensure fair comparison, all methods pruned the same layer types using identical calibration data, removing up to 25% of the model\. Metrics were recomputed after each pruning step\. We also included SliceGPT \(Ashkboos et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib52)\), which reduces layer size rather than removing entire layers; we matched its pruning ratio to 25%\. No healing or postprocessing was applied, as our focus was on evaluating the effectiveness of the relevance metric itself\.
We assessed performance across eight widely used benchmarks: ARC\-Challenge \(Clark et al\., [2018](https://arxiv.org/html/2605.14075#bib.bib70)\), ARC\-Easy \(Clark et al\., [2018](https://arxiv.org/html/2605.14075#bib.bib70)\), BoolQ \(Clark et al\., [2019a](https://arxiv.org/html/2605.14075#bib.bib66)\), HellaSwag \(Zellers et al\., [2019](https://arxiv.org/html/2605.14075#bib.bib68)\), PIQA \(Bisk et al\., [2020](https://arxiv.org/html/2605.14075#bib.bib67)\), OpenBookQA \(Mihaylov et al\., [2018](https://arxiv.org/html/2605.14075#bib.bib72)\), Winogrande \(Sakaguchi et al\., [2021](https://arxiv.org/html/2605.14075#bib.bib69)\), and MMLU \(Hendrycks et al\., [2021](https://arxiv.org/html/2605.14075#bib.bib71)\)\. These span a range of reasoning and knowledge tasks, each with a train/test split\. Implementation details are provided in Appendix [E\.1](https://arxiv.org/html/2605.14075#A5.SS1)\.
We first report task\-dependent results, where the goal is to optimize performance for a specific task\. Each model was pruned using the corresponding training set as calibration data and evaluated on the test set\. Table [1](https://arxiv.org/html/2605.14075#S6.T1) presents results for LLaMA3\-8B\. Our accuracy\-based score consistently outperformed all baselines and, in some cases, even surpassed the unpruned model\. Similar trends were observed with Mistral\-7B \(see Appendix [E\.2](https://arxiv.org/html/2605.14075#A5.SS2)\)\. These findings indicate that our score can effectively prune pre\-trained LLMs when the deployment task is known\. For example, if a lightweight model is needed for math problem solving, our score identifies and removes layers unrelated to that domain\. While this may reduce performance on unrelated tasks \(e\.g\., poetry generation\), such trade\-offs are acceptable when the goal is task\-specific efficiency\.
Table 1: Task\-dependent results for LLaMA3\-8B models across multiple tasks\. All methods remove 25% of the model using each task’s training set\. “Original” refers to the unpruned model\.

| Method | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 53\.16 | 81\.02 | 82\.02 | 78\.94 | 44\.8 | 81\.28 | 73\.56 | 65\.11 | 69\.99 |
| Taylor | 31\.48 | 67\.97 | 61\.31 | 62\.73 | 38\.4 | 76\.55 | 55\.64 | 25\.03 | 52\.39 |
| Cosine Similarity | 45\.73 | 67\.8 | 66\.33 | 69\.52 | 38\.6 | 72\.91 | 71\.35 | 44\.05 | 59\.54 |
| Out\. Cosine\-Sim | 39\.51 | 65\.57 | 72\.11 | 67\.97 | 36\.8 | 76\.88 | 65\.11 | 36\.82 | 57\.6 |
| Out\. Norm\-Sim | 40\.19 | 66\.08 | 72\.08 | 64\.96 | 39\.6 | 75\.46 | 68\.27 | 49\.03 | 59\.46 |
| Out\. Divergence\-Sim | 41\.13 | 65\.87 | 72\.14 | 67\.20 | 34\.0 | 74\.43 | 69\.46 | 35\.12 | 57\.42 |
| Perplexity | 38\.14 | 53\.11 | 62\.14 | 58\.92 | 38\.4 | 67\.19 | 62\.12 | 59\.04 | 54\.88 |
| Slice\-GPT | 41\.64 | 73\.27 | 75\.75 | 67\.35 | 39\.6 | 77\.15 | 70\.56 | 48\.74 | 61\.76 |
| Accuracy \(Ours\) | 49\.57 | 74\.96 | 84\.04 | 71\.53 | 44 | 79\.06 | 73\.8 | 62\.97 | 67\.49 |

We now turn to the task\-independent setting, where the objective is to prune a pre\-trained LLM while preserving performance across a diverse set of tasks\. In this context, we observed that the effectiveness of our accuracy\-based score is highly sensitive to the choice of calibration set\.
Table [2](https://arxiv.org/html/2605.14075#S6.T2) reports results for LLaMA3\-8B using a calibration set composed of 10% of the training data from each of the eight benchmarks\. Under this configuration, our method outperforms all baselines, yielding a pruned model that achieves the highest average performance across tasks\. However, when the calibration set is restricted to a single benchmark, performance varies significantly \(as shown in Appendix [E\.5](https://arxiv.org/html/2605.14075#A5.SS5)\), while cosine similarity remains stable at $\approx 60\%$ regardless of the calibration set\.
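The mixed calibration set described above can be sketched in a few lines of Python\. This is an illustrative sketch rather than the authors’ implementation: the function name, the toy benchmark lists, and the fixed seed are hypothetical, and only the 10% sampling fraction comes from the text\.

```python
import random


def build_mixed_calibration_set(train_sets, fraction=0.10, seed=0):
    """Sample `fraction` of every benchmark's training set and pool the results.

    `train_sets` maps a benchmark name to a list of training examples.
    All names here are illustrative placeholders (assumption), not the paper's API.
    """
    rng = random.Random(seed)
    calibration = []
    for name in sorted(train_sets):  # sorted for a reproducible order
        examples = train_sets[name]
        k = max(1, round(fraction * len(examples)))  # keep at least one example
        calibration.extend((name, ex) for ex in rng.sample(examples, k))
    rng.shuffle(calibration)  # mix benchmarks so no task forms a contiguous block
    return calibration


# Toy stand-ins for benchmark training splits (hypothetical data).
toy_train_sets = {
    "arc_easy": [f"arc-{i}" for i in range(200)],
    "boolq": [f"boolq-{i}" for i in range(500)],
    "mmlu": [f"mmlu-{i}" for i in range(100)],
}
calibration_set = build_mixed_calibration_set(toy_train_sets)
print(len(calibration_set))  # 80
```

With the toy sizes above \(200, 500, and 100 examples\), the pooled set holds 20 \+ 50 \+ 10 = 80 examples\.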
To further examine our method’s sensitivity to the choice of calibration set, the last two rows of Table [2](https://arxiv.org/html/2605.14075#S6.T2) report the task\-independent performance of the model pruned with two different calibration sets: ARC\-E and C4\. First, pruning with ARC\-E yields strong performance across most tasks, indicating that layer relevance derived from one task can generalize effectively to others\. Second, pruning with C4 demonstrates the opposite effect: task\-specific relevance can severely degrade performance on unrelated tasks\. Finally, certain tasks remain difficult to generalize to\. In particular, BoolQ and MMLU exhibit relevance patterns that differ substantially from the rest, such that strong performance was only achievable when incorporating part of their training data during pruning\. Overall, these findings reinforce a key observation of our work: layer relevance is inherently task\-dependent\.
An important direction for future work is to understand what properties a calibration set should possess to enable strong task\-independent performance\. Our observations suggest that accuracy improves when the calibration set includes a diverse mix of data—for example, sampling approximately 10% from the training sets of multiple benchmarks\. We also find that certain tasks, such as ARC\-E, exhibit layer relevance patterns that align well with many other tasks\. However, the underlying reasons for this consistency remain unclear and warrant further investigation\.
Table 2: Task\-independent results for LLaMA3\-8B across multiple tasks\. Each pruning method uses the same calibration dataset to prune 25% of the model once, which is then evaluated on all tasks\.

| Method | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 53\.16 | 81\.02 | 82\.02 | 78\.94 | 44\.8 | 81\.28 | 73\.56 | 65\.11 | 69\.99 |
| Taylor | 45\.39 | 67\.97 | 61\.31 | 62\.73 | 41\.4 | 76\.55 | 68\.11 | 25\.03 | 56\.06 |
| Cosine Similarity | 43\.6 | 66\.96 | 75\.53 | 69\.35 | 36\.2 | 73\.23 | 71\.82 | 44\.07 | 60\.1 |
| Out\. Cosine\-Sim | 40\.61 | 65\.78 | 67\.58 | 64\.6 | 36\.2 | 75\.3 | 69\.51 | 30\.16 | 56\.22 |
| Out\. Norm\-Sim | 37\.88 | 65\.32 | 57\.77 | 61\.73 | 39\.6 | 74\.86 | 65\.43 | 26\.5 | 53\.64 |
| Out\. Divergence\-Sim | 39\.51 | 64\.39 | 64\.8 | 64\.37 | 34\.2 | 73\.83 | 68\.59 | 33\.5 | 55\.4 |
| Perplexity | 31\.83 | 48\.95 | 59\.27 | 48\.01 | 30\.5 | 66\.76 | 61\.88 | 29\.31 | 47\.06 |
| Slice\-GPT | 41\.16 | 70\.28 | 77\.49 | 61\.19 | 36\.8 | 73\.66 | 62\.66 | 45\.03 | 58\.53 |
| Accuracy \(Ours\) | 47\.35 | 71\.68 | 78\.38 | 73\.41 | 43\.8 | 76\.55 | 71\.11 | 58\.04 | 65\.04 |
| Accuracy \(Arc\-E\) | 51\.37 | 74\.96 | 66\.94 | 73\.62 | 43\.6 | 78\.51 | 71\.59 | 44\.82 | 63\.18 |
| Accuracy \(C4\) | 36\.69 | 56\.52 | 53\.36 | 60\.16 | 33\.8 | 72\.63 | 60\.14 | 28\.5 | 50\.23 |

### 6\.1 Computational Cost Comparison
As discussed in Section [4](https://arxiv.org/html/2605.14075#S4), cosine similarity has clear limitations as a measure of layer relevance\. Its primary advantage, however, is speed\. Computing layer relevance via cosine similarity is highly efficient\. Let $N$ denote the number of layers and $T$ the number of instances in the calibration set\. Cosine similarity requires only $T$ forward passes to compute relevance for all layers\. In contrast, our accuracy\-based score requires $N \times T$ forward passes, output\-based methods require $(N+1) \times T$ forward passes, and Taylor approximations require $T$ forward and $T$ backward passes\.
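The pass counts above can be tabulated with a short helper\. The per\-method formulas are taken from the text; the function names and the example sizes are hypothetical, and the iterative total rests on the assumption that every remaining layer is rescored from scratch after each removal\.

```python
def passes_per_scoring_round(num_layers, num_instances):
    """Forward passes needed to score every layer once, per the counts in the text."""
    n, t = num_layers, num_instances
    return {
        "cosine_similarity": t,
        "accuracy": n * t,
        "output_based": (n + 1) * t,
        "taylor": t,  # plus t backward passes, which are not counted here
    }


def iterative_accuracy_passes(num_layers, num_instances, num_removed):
    """Total forward passes for greedy accuracy-based pruning.

    Assumption: after each removal, all remaining layers are rescored,
    so round i (0-based) scores num_layers - i layers.
    """
    return sum((num_layers - i) * num_instances for i in range(num_removed))


# Hypothetical sizes: 32 layers, 1,000 calibration instances, 8 layers removed.
print(passes_per_scoring_round(32, 1000))
print(iterative_accuracy_passes(32, 1000, 8))  # 228000
```

Under these illustrative sizes, a single accuracy\-based scoring round costs 32 times as many forward passes as cosine similarity, and the gap persists across pruning rounds\.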
Table [3](https://arxiv.org/html/2605.14075#S6.T3) illustrates this trade\-off by reporting the time required to prune 25% of LLaMA\-3\-8B\. Cosine similarity is by far the fastest, requiring only minutes \(8\.64 on average\) to prune 25% of the model\. Our accuracy\-based method averages 4\.6 hours, comparable to the perplexity\- and output\-based baselines \(4\.62 to 4\.83 hours\) and less than half the Taylor baseline \(10\.37 hours\)\. However, our method consistently produces pruned models with superior performance\.
Reducing the computational cost of our metric is a key direction for future work, particularly for larger models\. Using our current \(naïve\) implementation, we estimate that pruning 50% of LLaMA\-3\-70B on two NVIDIA H100 GPUs would take between 1\.1 days \(with CodeAlpaca\) and 8\.5 days \(with C4\)\. These estimates represent worst\-case scenarios, as we have not yet fully exploited parallelization\. Further details are provided in Appendix [E\.6](https://arxiv.org/html/2605.14075#A5.SS6)\.
Moreover, our iterative pruning strategy is inherently suboptimal\. At present, we employ a greedy approach: we compute each layer’s relevance, prune the least relevant layer, and then recompute relevance scores before proceeding to the next pruning step\. However, identifying the optimal combination of $n$ layers to remove would require a search\-based method, such as A\*, capable of backtracking and considering cases where removing a seemingly suboptimal layer might lead to better overall performance later\. Unfortunately, even our current greedy procedure is computationally expensive, making exhaustive search for the optimal set of $n$ layers infeasible\. Nevertheless, if we could accelerate the evaluation \(or estimation\) of accuracy\-based relevance, it would open the door to exploring how much performance could be gained by finding truly optimal pruning combinations\.
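The greedy procedure can be sketched as follows\. This is a toy illustration under stated assumptions, not the authors’ code: `accuracy_fn` is a hypothetical callback standing in for a calibration\-set evaluation, and the toy model is constructed so that layers 1 and 2 are redundant with each other\. The last line counts the candidate subsets an exhaustive search would face for illustrative values \(32 layers, 8 removals\)\.

```python
from math import comb


def greedy_prune(layers, accuracy_fn, num_to_remove):
    """Greedily remove layers one at a time.

    At each step, score every remaining layer by the accuracy drop caused by
    removing it alone, drop the layer with the smallest drop, then rescore.
    `accuracy_fn(kept_layers)` is a hypothetical evaluation callback.
    """
    kept = list(layers)
    removed = []
    for _ in range(num_to_remove):
        base = accuracy_fn(kept)
        relevance = {
            layer: base - accuracy_fn([other for other in kept if other != layer])
            for layer in kept
        }
        least_relevant = min(kept, key=lambda layer: relevance[layer])
        kept.remove(least_relevant)
        removed.append(least_relevant)
    return kept, removed


def toy_accuracy(kept):
    """Toy stand-in for model accuracy (hypothetical numbers)."""
    present = set(kept)
    acc = 0.25
    acc += 0.30 if 0 in present else 0.0
    acc += 0.20 if (1 in present or 2 in present) else 0.0  # layers 1 and 2 are redundant
    acc += 0.05 if 3 in present else 0.0
    acc += 0.10 if 5 in present else 0.0  # layer 4 contributes nothing
    return acc


kept, removed = greedy_prune(list(range(6)), toy_accuracy, 3)
print(kept, removed)  # [0, 2, 5] [1, 4, 3]

# Number of 8-layer subsets of 32 layers an exhaustive search would evaluate.
print(comb(32, 8))  # 10518300
```

In this toy run, layer 2 has zero relevance while layer 1 is present and becomes relevant once layer 1 is removed, which is why scores are recomputed after every step\.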
Table 3: End\-to\-end runtime for pruning 25% of LLaMA\-3\-8B on NVIDIA L40s GPU\.

| Calibration set | Accuracy | Cos\. Sim\. | Perplexity | Out\. Cos | Out\. Norm | Out\. Div | Taylor |
| --- | --- | --- | --- | --- | --- | --- | --- |
| C4 | 7\.70 hrs | 9\.54 min | 7\.71 hrs | 8\.08 hrs | 7\.98 hrs | 8\.08 hrs | 11\.30 hrs |
| LIMA | 6\.46 hrs | 14\.45 min | 6\.38 hrs | 6\.62 hrs | 6\.65 hrs | 6\.83 hrs | 14\.17 hrs |
| MathInstruct | 3\.16 hrs | 7\.01 min | 3\.30 hrs | 3\.22 hrs | 3\.29 hrs | 3\.31 hrs | 8\.45 hrs |
| CodeAlpaca | 1\.06 hrs | 3\.55 min | 1\.08 hrs | 1\.12 hrs | 1\.11 hrs | 1\.12 hrs | 7\.59 hrs |
| Mean | 4\.60 hrs | 8\.64 min | 4\.62 hrs | 4\.76 hrs | 4\.76 hrs | 4\.83 hrs | 10\.37 hrs |
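To express the trade\-off in relative terms, the mean runtimes in Table 3 can be converted into slowdown factors with respect to cosine similarity\. The snippet below only performs this arithmetic on the reported means; the variable names are illustrative\.

```python
# Mean runtimes copied from Table 3 (hours, except cosine similarity in minutes).
mean_hours = {
    "Accuracy": 4.60,
    "Perplexity": 4.62,
    "Out. Cos": 4.76,
    "Out. Norm": 4.76,
    "Out. Div": 4.83,
    "Taylor": 10.37,
}
cosine_minutes = 8.64

# Convert hours to minutes, then divide by the cosine-similarity runtime.
slowdown = {name: hours * 60 / cosine_minutes for name, hours in mean_hours.items()}
for name, ratio in slowdown.items():
    print(f"{name}: {ratio:.1f}x slower than cosine similarity")
```

On these means, the accuracy\-based score is roughly 32 times slower than cosine similarity, and the Taylor baseline roughly 72 times slower\.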
### 6\.2 Results of Pruning with Healing
Figure 6: Impact of healing \(dark colors\) after pruning \(light colors\) across varying pruning ratios\.

Recent work has introduced a healing phase after pruning \(e\.g\., Sun et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib60); Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34); Song et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib46); Kim et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib36)\)\. This phase fine\-tunes the pruned model for a few steps, aiming to mitigate distributional mismatches across layers caused by pruning\. In this section, we examine whether healing complements our accuracy\-based pruning approach\.
Figure [6](https://arxiv.org/html/2605.14075#S6.F6) shows task\-dependent pruning results with and without healing\. Scores are normalized: 100 indicates perfect performance, 0 a random predictor\. When pruning 25% of layers, performance after healing is nearly invariant to the pruning strategy\. In fact, a random baseline that removes 25% of layers at random \(averaged over five seeds\) performs competitively after healing\. In other words, for small pruning ratios, healing largely neutralizes differences between pruning methods\. Still, accuracy\-based pruning combined with healing slightly outperforms cosine similarity \(60\.5% vs\. 59\.6% normalized average\)\. At 50% pruning, the gap widens: accuracy\-based \+ healing achieves 42\.2% normalized performance, compared to 31\.2% for cosine similarity and 20\.5% for random\. At 75% pruning, all methods degrade severely, yet accuracy\-based pruning remains superior \(12\.2% vs\. 9\.3% for cosine similarity and 9\.4% for random\)\.
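One simple way to obtain a scale with these two anchor points is a linear rescaling between chance accuracy and 100%\. The formula below is an assumption for illustration: the text states only the anchors, and the paper’s exact normalization may differ\. The chance level of 25% corresponds to a hypothetical four\-way multiple\-choice task\.

```python
def normalized_score(accuracy, chance):
    """Linearly rescale accuracy (in %) so chance maps to 0 and 100% maps to 100.

    Assumed form of the normalization; not confirmed by the source text.
    """
    if not 0.0 <= chance < 100.0:
        raise ValueError("chance accuracy must lie in [0, 100)")
    return 100.0 * (accuracy - chance) / (100.0 - chance)


print(normalized_score(100.0, 25.0))  # perfect predictor -> 100.0
print(normalized_score(25.0, 25.0))   # random predictor  -> 0.0
print(normalized_score(62.5, 25.0))   # halfway between   -> 50.0
```

Under this assumed form, a model that falls below chance receives a negative score, so a normalized value near 0 means the pruned model is no better than guessing\.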
Implementation details are provided in Appendix [E\.7](https://arxiv.org/html/2605.14075#A5.SS7)\. In brief, we fine\-tuned using LoRA for up to 10 epochs, reporting the best performance across epochs \(full curves appear in Appendix [E\.7](https://arxiv.org/html/2605.14075#A5.SS7)\)\. Ten epochs are sufficient for all methods to exhibit signs of overfitting\. Although early stopping relied on the test set \(ideally, a validation set should be used\), this does not alter the main conclusion: healing substantially reduces differences between pruning strategies at low pruning ratios, yet accuracy\-based pruning consistently achieves the best performance as pruning becomes more aggressive\.
## 7 Conclusion
In this paper, we challenged the common use of cosine similarity as a proxy for layer relevance in LLMs, showing through theory and experiments that it often misrepresents true importance\. To address this, we propose an accuracy\-based relevance metric that directly measures performance impact, offering a more faithful view of layer significance\. Beyond interpretability, this metric enables superior structured pruning, outperforming existing methods in both task\-dependent and task\-independent settings\. Our findings call for a shift toward performance\-grounded evaluations to better understand model internals and design more effective pruning strategies\.
#### Acknowledgments
We gratefully acknowledge the support of the Agencia Nacional de Investigación y Desarrollo \(ANID\), Chile\. This research was funded by the National Center for Artificial Intelligence CENIA FB210017, Basal ANID\. The work of R\. Toro Icarte was supported by Fondecyt Iniciación 11230762\. The work of C\. Devia was supported by Fondecyt 11241551 and FONDEQUIP EQM230106\. The work of A\. Carvallo was supported by Fondecyt 3240001\. The work of D\. Parra was supported by Fondecyt Regular 1231724 and the Millennium Initiative research centers iHealth ICN2021\_004 and IMFD ICN17\_002\. The work of J\. F\. Silva was supported by Fondecyt Regular 1250098 and ANID AC3E CIA250006\.
## References
- S\. Anwar, K\. Hwang, and W\. Sung \(2017\) Structured pruning of deep convolutional neural networks\. ACM Journal on Emerging Technologies in Computing Systems \(JETC\) 13\(3\), pp\. 1–18\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§6](https://arxiv.org/html/2605.14075#S6.p1.1)\.
- S\. Ashkboos, M\. L\. Croci, M\. G\. do Nascimento, T\. Hoefler, and J\. Hensman \(2024\) SliceGPT: compress large language models by deleting rows and columns\. In Proceedings of the 12th International Conference on Learning Representations \(ICLR\)\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p4.1)\.
- Y\. Bisk, R\. Zellers, R\. Le Bras, J\. Gao, and Y\. Choi \(2020\) PIQA: reasoning about physical commonsense in natural language\. In Proceedings of the 34th AAAI Conference on Artificial Intelligence \(AAAI\), pp\. 7432–7439\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- J\. Brinkmann, A\. Sheshadri, V\. Levoso, P\. Swoboda, and C\. Bartelt \(2024\) A mechanistic analysis of a transformer trained on a symbolic multi\-step reasoning task\. In Findings of the Association for Computational Linguistics: ACL 2024, pp\. 4082–4102\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p3.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- M\. Caron, H\. Touvron, I\. Misra, H\. Jégou, J\. Mairal, P\. Bojanowski, and A\. Joulin \(2021\) Emerging properties in self\-supervised vision transformers\. In Proceedings of the IEEE/CVF International Conference on Computer Vision \(ICCV\), pp\. 9650–9660\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- Z\. Chkirbene, R\. Hamila, A\. Gouissem, and U\. Devrim \(2024\) Large language models \(LLM\) in industry: a survey of applications, challenges, and trends\. In Proceedings of the IEEE 21st International Conference on Smart Communities: Improving Quality of Life using AI, Robotics and IoT \(HONET\), pp\. 229–234\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- C\. Clark, K\. Lee, M\. Chang, T\. Kwiatkowski, M\. Collins, and K\. Toutanova \(2019a\) BoolQ: exploring the surprising difficulty of natural yes/no questions\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\), pp\. 2924–2936\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- K\. Clark, U\. Khandelwal, O\. Levy, and C\. D\. Manning \(2019b\) What does BERT look at? an analysis of BERT’s attention\. In Proceedings of the 2nd BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP \(BlackboxNLP\), pp\. 276–286\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\) Think you have solved question answering? try ARC, the AI2 reasoning challenge\. arXiv preprint arXiv:1803\.05457\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- T\. Dettmers, A\. Pagnoni, A\. Holtzman, and L\. Zettlemoyer \(2023\) QLoRA: efficient finetuning of quantized LLMs\. In Proceedings of the 37th Conference on Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p6.1)\.
- J\. Devlin, M\. Chang, K\. Lee, and K\. Toutanova \(2019\) BERT: pre\-training of deep bidirectional transformers for language understanding\. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies \(NAACL\-HLT\), pp\. 4171–4186\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- J\. Ferrando, G\. Sarti, A\. Bisazza, and M\. R\. Costa\-jussà \(2024\) A primer on the inner workings of transformer\-based language models\. CoRR abs/2405\.00208\. External Links: [Link](https://doi.org/10.48550/arXiv.2405.00208), [Document](https://dx.doi.org/10.48550/ARXIV.2405.00208), 2405\.00208\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p2.1)\.
- L\. Gao, J\. Tow, B\. Abbasi, S\. Biderman, S\. Black, A\. DiPofi, C\. Foster, L\. Golding, J\. Hsu, A\. Le Noac’h, H\. Li, K\. McDonell, N\. Muennighoff, C\. Ociepa, J\. Phang, L\. Reynolds, H\. Schoelkopf, A\. Skowron, L\. Sutawika, E\. Tang, A\. Thite, B\. Wang, K\. Wang, and A\. Zou \(2024\) The language model evaluation harness\. Zenodo\. External Links: [Document](https://dx.doi.org/10.5281/zenodo.12608602), [Link](https://zenodo.org/records/12608602)\. Cited by: [§C\.1](https://arxiv.org/html/2605.14075#A3.SS1.p1.1), [§C\.1](https://arxiv.org/html/2605.14075#A3.SS1.p5.1), [§E\.1](https://arxiv.org/html/2605.14075#A5.SS1.p4.1)\.
- M\. Geva, J\. Bastings, K\. Filippova, and A\. Globerson \(2023\) Dissecting recall of factual associations in auto\-regressive language models\. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 12216–12235\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p2.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- M\. Geva, A\. Caciularu, K\. Wang, and Y\. Goldberg \(2022\) Transformer feed\-forward layers build predictions by promoting concepts in the vocabulary space\. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 30–45\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p4.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p7.1), [§C\.5](https://arxiv.org/html/2605.14075#A3.SS5.p2.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- M\. Geva, R\. Schuster, J\. Berant, and O\. Levy \(2021\) Transformer feed\-forward layers are key\-value memories\. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 5484–5495\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p2.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- A\. Grattafiori, A\. Dubey, A\. Jauhri, A\. Pandey, A\. Kadian, A\. Al\-Dahle, A\. Letman, A\. Mathur, A\. Schelten, A\. Vaughan, et al\. \(2024\) The Llama 3 herd of models\. arXiv preprint arXiv:2407\.21783\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p2.1)\.
- D\. Groeneveld, I\. Beltagy, E\. Walsh, A\. Bhagia, R\. Kinney, O\. Tafjord, A\. Jha, H\. Ivison, I\. Magnusson, Y\. Wang, et al\. \(2024\) OLMo: accelerating the science of language models\. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Cited by: [Figure 1](https://arxiv.org/html/2605.14075#S1.F1)\.
- A\. Gromov, K\. Tirumala, H\. Shapourian, P\. Glorioso, and D\. A\. Roberts \(2025\) The unreasonable ineffectiveness of the deeper layers\. In Proceedings of the 13th International Conference on Learning Representations \(ICLR\)\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p3.1), [Figure 18](https://arxiv.org/html/2605.14075#A4.F18), [§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p4.1), [§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p6.1), [Figure 2](https://arxiv.org/html/2605.14075#S1.F2), [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§1](https://arxiv.org/html/2605.14075#S1.p3.1), [§1](https://arxiv.org/html/2605.14075#S1.p9.1), [§2](https://arxiv.org/html/2605.14075#S2.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p3.1), [§3](https://arxiv.org/html/2605.14075#S3.p1.2), [§5\.2](https://arxiv.org/html/2605.14075#S5.SS2.p1.1), [§5\.2](https://arxiv.org/html/2605.14075#S5.SS2.p4.1), [§6\.2](https://arxiv.org/html/2605.14075#S6.SS2.p1.1), [§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- W\. Gurnee and M\. Tegmark \(2024\) Language models represent space and time\. In Proceedings of the 12th International Conference on Learning Representations \(ICLR\)\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p3.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- S\. He, G\. Sun, Z\. Shen, and A\. Li \(2024\) What matters in transformers? not all attention is needed\. arXiv preprint arXiv:2406\.15786\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p3.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p6.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p7.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p8.1), [§C\.1](https://arxiv.org/html/2605.14075#A3.SS1.p2.1), [Appendix C](https://arxiv.org/html/2605.14075#A3.p1.1), [§E\.1](https://arxiv.org/html/2605.14075#A5.SS1.p2.1), [§E\.1](https://arxiv.org/html/2605.14075#A5.SS1.p3.1), [§E\.2](https://arxiv.org/html/2605.14075#A5.SS2.p2.1), [§E\.4](https://arxiv.org/html/2605.14075#A5.SS4.p2.1), [§E\.4](https://arxiv.org/html/2605.14075#A5.SS4.p3.1), [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§1](https://arxiv.org/html/2605.14075#S1.p3.1), [§1](https://arxiv.org/html/2605.14075#S1.p8.1), [§2](https://arxiv.org/html/2605.14075#S2.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p3.1), [§3](https://arxiv.org/html/2605.14075#S3.p1.2), [§5\.1](https://arxiv.org/html/2605.14075#S5.SS1.p1.1), [§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\) Measuring massive multitask language understanding\. In Proceedings of the 9th International Conference on Learning Representations \(ICLR\)\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- E\. Hu, Y\. Shen, P\. Wallis, Z\. Allen\-Zhu, Y\. Li, S\. Wang, L\. Wang, and W\. Chen \(2022\) LoRA: low\-rank adaptation of large language models\. In Proceedings of the 10th International Conference on Learning Representations \(ICLR\)\. Cited by: [§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p6.1)\.
- G\. Jawahar, B\. Sagot, and D\. Seddah \(2019\) What does BERT learn about the structure of language? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics \(ACL\)\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- A\. Q\. Jiang, A\. Sablayrolles, A\. Mensch, C\. Bamford, D\. S\. Chaplot, D\. d\. l\. Casas, F\. Bressand, G\. Lengyel, G\. Lample, L\. Saulnier, et al\. \(2023\) Mistral 7B\. arXiv preprint arXiv:2310\.06825\. Cited by: [§5\.1](https://arxiv.org/html/2605.14075#S5.SS1.p1.1), [§6](https://arxiv.org/html/2605.14075#S6.p2.1)\.
- B\. Kim, G\. Kim, T\. Kim, T\. Castells, S\. Choi, J\. Shin, and H\. Song \(2024\) Shortened LLaMA: a simple depth pruning for large language models\. In Proceedings of the ICLR 2024 Workshop on Mathematical and Empirical Understanding of Foundation Models\. External Links: [Link](https://openreview.net/forum?id=18VGxuOdpu)\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p3.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p5.1), [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§2](https://arxiv.org/html/2605.14075#S2.p4.1), [§6\.2](https://arxiv.org/html/2605.14075#S6.SS2.p1.1), [§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- H\. Li, A\. Kadav, I\. Durdanovic, H\. Samet, and H\. P\. Graf \(2016\) Pruning filters for efficient convnets\. arXiv preprint arXiv:1608\.08710\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p2.1)\.
- W\. Li, H\. Luo, Z\. Lin, C\. Zhang, Z\. Lu, and D\. Ye \(2023a\) A survey on transformers in reinforcement learning\. Transactions on Machine Learning Research\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- Y\. Li, Y\. Li, and A\. Risteski \(2023b\) How do transformers learn topic structure: towards a mechanistic understanding\. In Proceedings of the 40th International Conference on Machine Learning \(ICML\), pp\. 19689–19729\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p2.1)\.
- D\. Lioubashevski, T\. M\. Schlank, G\. Stanovsky, and A\. Goldstein \(2025\) Looking beyond the top\-1: transformers determine top tokens in order\. In Proceedings of the 42nd International Conference on Machine Learning \(ICML\)\. External Links: [Link](https://openreview.net/forum?id=2B11W1Z6ID)\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p4.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p7.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- Y\. Lu, H\. Cheng, Y\. Fang, Z\. Wang, J\. Wei, D\. Xu, Q\. Xuan, X\. Yang, and Z\. Zhu \(2024\) Reassessing layer pruning in LLMs: new insights and methods\. arXiv preprint arXiv:2411\.15558\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p6.1)\.
- X\. Ma, G\. Fang, and X\. Wang \(2023\) LLM\-Pruner: on the structural pruning of large language models\. In Proceedings of the 37th Conference on Advances in Neural Information Processing Systems \(NeurIPS\), pp\. 21702–21720\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p3.1), [§E\.1](https://arxiv.org/html/2605.14075#A5.SS1.p3.1), [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§2](https://arxiv.org/html/2605.14075#S2.p4.1), [§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- S\. Mangrulkar, S\. Gugger, L\. Debut, Y\. Belkada, S\. Paul, B\. Bossan, and M\. Tietz \(2022\) PEFT: state\-of\-the\-art parameter\-efficient fine\-tuning methods\. Note: [https://github\.com/huggingface/peft](https://github.com/huggingface/peft)\. Cited by: [§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p6.1)\.
- X\. Men, M\. Xu, Q\. Zhang, Q\. Yuan, B\. Wang, H\. Lin, Y\. Lu, X\. Han, and W\. Chen \(2025\) ShortGPT: layers in large language models are more redundant than you expect\. In Findings of the Association for Computational Linguistics: ACL 2025, pp\. 20192–20204\. Cited by: [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1), [§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p3.1), [§1](https://arxiv.org/html/2605.14075#S1.p10.1), [§1](https://arxiv.org/html/2605.14075#S1.p3.1), [§2](https://arxiv.org/html/2605.14075#S2.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p3.1), [§3](https://arxiv.org/html/2605.14075#S3.p1.2), [§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- K\. Meng, D\. Bau, A\. Andonian, and Y\. Belinkov \(2022\) Locating and editing factual associations in GPT\. In Proceedings of the 36th Conference on Advances in Neural Information Processing Systems \(NeurIPS\)\. Cited by: [§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p2.1), [§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\) Can a suit of armor conduct electricity? a new dataset for open book question answering\. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing \(EMNLP\), pp\. 2381–2391\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- A\. Modell, P\. Rubin\-Delanchy, and N\. Whiteley \(2025\) The origins of representation manifolds in large language models\. arXiv preprint arXiv:2505\.18235\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p2.1)\.
- H\. Sajjad, F\. Dalvi, N\. Durrani, and P\. Nakov \(2023\) On the effect of dropping layers of pre\-trained transformer models\. Computer Speech & Language 77, pp\. 101429\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p3.1), [§2](https://arxiv.org/html/2605.14075#S2.p1.1), [§2](https://arxiv.org/html/2605.14075#S2.p3.1), [§3](https://arxiv.org/html/2605.14075#S3.p1.2)\.
- K\. Sakaguchi, R\. L\. Bras, C\. Bhagavatula, and Y\. Choi \(2021\) WinoGrande: an adversarial Winograd schema challenge at scale\. Communications of the ACM 64\(9\), pp\. 99–106\. Cited by: [§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- V\. Sanh, L\. Debut, J\. Chaumond, and T\. Wolf \(2019\) DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter\. arXiv preprint arXiv:1910\.01108\. Cited by: [§1](https://arxiv.org/html/2605.14075#S1.p2.1)\.
- S\. A\. Siddiqui, X\. Dong, G\. Heinrich, T\. Breuel, J\. Kautz, D\. Krueger, and P\. Molchanov \(2024\)A deeper look at depth pruning of llms\.InProceedings of the ICML 2024 Workshop on Theoretical Foundations of Foundation Models,Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p5.1)\.
- O\. Sieberling, D\. Kuznedelev, E\. Kurtic, and D\. Alistarh \(2024\)Evopress: towards optimal dynamic model compression via evolutionary search\.arXiv preprint arXiv:2410\.14649\.Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p4.1),[§E\.3](https://arxiv.org/html/2605.14075#A5.SS3.p3.1),[§2](https://arxiv.org/html/2605.14075#S2.p3.1),[§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- J\. Song, K\. Oh, T\. Kim, H\. Kim, Y\. Kim, and J\. Kim \(2024\)Sleb: streamlining llms through redundancy verification and elimination of transformer blocks\.InProceedings of the 41st International Conference on Machine Learning \(ICML\),Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p5.1),[§2](https://arxiv.org/html/2605.14075#S2.p4.1),[§6\.2](https://arxiv.org/html/2605.14075#S6.SS2.p1.1),[§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- M\. Sun, Z\. Liu, A\. Bair, and J\. Z\. Kolter \(2024\)A simple and effective pruning approach for large language models\.InProceedings of the 12th International Conference on Learning Representations \(ICLR\),Cited by:[§6\.2](https://arxiv.org/html/2605.14075#S6.SS2.p1.1)\.
- Q\. Sun, M\. Pickett, A\. K\. Nain, and L\. Jones \(2025\)Transformer layers as painters\.InProceedings of the 39th AAAI Conference on Artificial Intelligence \(AAAI\),pp\. 25219–25227\.Cited by:[§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p4.1),[§1](https://arxiv.org/html/2605.14075#S1.p2.1),[§1](https://arxiv.org/html/2605.14075#S1.p3.1),[§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- C\. Tigges, O\. J\. Hollinsworth, A\. Geiger, and N\. Nanda \(2023\)Linear representations of sentiment in large language models\.arXiv preprint arXiv:2310\.15154\.Cited by:[§A\.1](https://arxiv.org/html/2605.14075#A1.SS1.p3.1),[§2](https://arxiv.org/html/2605.14075#S2.p5.1)\.
- W\. Timkey and M\. Van Schijndel \(2021\)All bark and no bite: rogue dimensions in transformer language models obscure representational quality\.InProceedings of the 2021 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),pp\. 4527–4546\.Cited by:[§2](https://arxiv.org/html/2605.14075#S2.p2.1)\.
- A\. Vaswani, N\. Shazeer, N\. Parmar, J\. Uszkoreit, L\. Jones, A\. N\. Gomez, Ł\. Kaiser, and I\. Polosukhin \(2017\)Attention is all you need\.InProceedings of the 31st Conference on Advances in Neural Information Processing Systems \(NIPS\),Cited by:[§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- A\. Villa, V\. Araujo, F\. Cattan, and D\. Parra \(2020\)Interpretable contextual team\-aware item recommendation: application in multiplayer online battle arena games\.InProceedings of the 14th ACM Conference on Recommender Systems,pp\. 503–508\.Cited by:[§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- F\. Wilcoxon \(1992\)Individual comparisons by ranking methods\.InBreakthroughs in statistics: Methodology and distribution,pp\. 196–202\.Cited by:[§C\.3](https://arxiv.org/html/2605.14075#A3.SS3.p1.1)\.
- T\. Wolf, L\. Debut, V\. Sanh, J\. Chaumond, C\. Delangue, A\. Moi, P\. Cistac, T\. Rault, R\. Louf, M\. Funtowicz, J\. Davison, S\. Shleifer, P\. von Platen, C\. Ma, Y\. Jernite, J\. Plu, C\. Xu, T\. Le Scao, S\. Gugger, M\. Drame, Q\. Lhoest, and A\. Rush \(2020\)Transformers: state\-of\-the\-art natural language processing\.InProceedings of the 2020 Conference on Empirical Methods in Natural Language Processing \(EMNLP\),Q\. Liu and D\. Schlangen \(Eds\.\),Online,pp\. 38–45\.External Links:[Link](https://aclanthology.org/2020.emnlp-demos.6/),[Document](https://dx.doi.org/10.18653/v1/2020.emnlp-demos.6)Cited by:[§E\.7](https://arxiv.org/html/2605.14075#A5.SS7.p6.1)\.
- P\. Xu, X\. Zhu, and D\. A\. Clifton \(2023\)Multimodal learning with transformers: a survey\.IEEE Transactions on Pattern Analysis and Machine Intelligence45\(10\),pp\. 12113–12132\.Cited by:[§1](https://arxiv.org/html/2605.14075#S1.p1.1)\.
- G\. Yang, Y\. Zhou, X\. Zhang, W\. Cheng, K\. Liu, X\. Chen, T\. Y\. Zhuo, and T\. Chen \(2026\)Less is more: towards green code large language models via unified structural pruning\.Information Processing & Management63\(4\),pp\. 104580\.Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p4.1),[§2](https://arxiv.org/html/2605.14075#S2.p3.1),[§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- Y\. Yang, Z\. Cao, and H\. Zhao \(2024\)Laco: large language model pruning via layer collapse\.InFindings of the Association for Computational Linguistics: EMNLP 2024,pp\. 6401–6417\.Cited by:[§1](https://arxiv.org/html/2605.14075#S1.p3.1),[§2](https://arxiv.org/html/2605.14075#S2.p1.1)\.
- R\. Zellers, A\. Holtzman, Y\. Bisk, A\. Farhadi, and Y\. Choi \(2019\)Hellaswag: can a machine really finish your sentence?\.InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics \(ACL\),pp\. 4791–4800\.Cited by:[§6](https://arxiv.org/html/2605.14075#S6.p5.1)\.
- Y\. Zhang, Y\. Dong, and K\. Kawaguchi \(2024a\)Investigating layer importance in large language models\.InProceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP \(BlackboxNLP\),pp\. 469–479\.Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p5.1)\.
- Y\. Zhang, Y\. Li, X\. Wang, Q\. Shen, B\. Plank, B\. Bischl, M\. Rezaei, and K\. Kawaguchi \(2024b\)FinerCut: finer\-grained interpretable layer pruning for large language models\.InProceedings of the NeurIPS 2024 Workshop on Machine Learning and Compression,Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p1.1),[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p4.1),[§E\.1](https://arxiv.org/html/2605.14075#A5.SS1.p2.1),[§E\.2](https://arxiv.org/html/2605.14075#A5.SS2.p2.1),[§E\.4](https://arxiv.org/html/2605.14075#A5.SS4.p2.1),[§E\.4](https://arxiv.org/html/2605.14075#A5.SS4.p4.1),[§1](https://arxiv.org/html/2605.14075#S1.p10.1),[§1](https://arxiv.org/html/2605.14075#S1.p3.1),[§2](https://arxiv.org/html/2605.14075#S2.p1.1),[§2](https://arxiv.org/html/2605.14075#S2.p3.1),[§6](https://arxiv.org/html/2605.14075#S6.p2.1),[§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
- L\. Zhong, F\. Wan, R\. Chen, X\. Quan, and L\. Li \(2025\)Blockpruner: fine\-grained pruning for large language models\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 5065–5080\.Cited by:[§A\.2](https://arxiv.org/html/2605.14075#A1.SS2.p5.1),[§2](https://arxiv.org/html/2605.14075#S2.p4.1),[§6](https://arxiv.org/html/2605.14075#S6.p3.1)\.
## Appendix A Related Work (Extended Version)
### A.1 Understanding Transformer Internals
The question of how Transformer models represent and process information was first explored in depth with BERT\(Devlinet al\.,[2019](https://arxiv.org/html/2605.14075#bib.bib23)\)\. Early studies revealed that BERT captures structural properties of language across its layers\. Lower layers focus on phrase\-level and surface features, while intermediate layers encode a rich hierarchy of linguistic information—starting with syntactic structures and transitioning to semantic representations at higher layers\(Jawaharet al\.,[2019](https://arxiv.org/html/2605.14075#bib.bib21)\)\. Additionally, some attention heads within BERT specialize in specific linguistic tasks, such as syntactic parsing and coreference resolution, aligning with traditional linguistic notions\(Clarket al\.,[2019b](https://arxiv.org/html/2605.14075#bib.bib22)\)\. These findings provided initial insights into how Transformer\-based models organize knowledge, setting the stage for broader investigations into their internal mechanisms\.
Recent studies have further refined our understanding of how Transformers encode and manipulate information\. Feed\-forward \(FF\) layers function as key\-value memory systems, storing patterns from training and influencing the model’s output distribution\(Gevaet al\.,[2021](https://arxiv.org/html/2605.14075#bib.bib18)\)\. This structured memory is particularly important for factual recall, as knowledge is primarily stored in the FF layers of middle blocks\(Menget al\.,[2022](https://arxiv.org/html/2605.14075#bib.bib24)\)\. Meanwhile, attention layers propagate and retrieve stored information, dynamically integrating relevant associations for prediction\(Gevaet al\.,[2023](https://arxiv.org/html/2605.14075#bib.bib25)\)\.
Beyond factual recall, Transformers encode abstract and structured representations\. They capture spatiotemporal relationships in text\(Gurnee and Tegmark,[2024](https://arxiv.org/html/2605.14075#bib.bib26)\)and implement a depth\-bounded recurrent mechanism that stores intermediate results at selected token positions\(Brinkmannet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib29)\)\. Additionally, high\-level concepts such as sentiment are encoded in linear activation structures\(Tiggeset al\.,[2023](https://arxiv.org/html/2605.14075#bib.bib30)\), highlighting the model’s ability to organize information hierarchically\.
Another line of research suggests that certain Transformer layers contribute little to the model’s final prediction\. By analyzing how probability distributions evolve across blocks, researchers observed that in many cases, a model’s prediction stabilizes early—once a token becomes the most probable, it remains unchanged until the final layer\. These stabilization points, known as saturation events, suggest that the model’s later layers primarily refine rather than reshape its output\(Gevaet al\.,[2022](https://arxiv.org/html/2605.14075#bib.bib20)\)\. Further studies confirmed that even lower\-ranked tokens follow the same pattern once the top\-1 prediction stabilizes\(Lioubashevskiet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib31)\)\. Moreover, experimental evidence shows that middle blocks can be removed or swapped with minimal impact on performance\(Sunet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib17)\)\.
These findings have led to the prevailing belief that some Transformer blocks are inherently unimportant. In this work, we revisit this assumption by showing that a block’s relevance can vary significantly depending on the task, which suggests that global conclusions about importance may overlook task-specific dynamics.
### A.2 Measuring Block Relevance
Most research on measuring block relevance has been conducted in the context of structured pruning\(Menet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib33); Gromovet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib34); Heet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib35); Kimet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib36); Maet al\.,[2023](https://arxiv.org/html/2605.14075#bib.bib37); Zhanget al\.,[2024b](https://arxiv.org/html/2605.14075#bib.bib39); Yanget al\.,[2026](https://arxiv.org/html/2605.14075#bib.bib47); Sieberlinget al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib55)\)\. The goal is to remove the least relevant blocks while preserving model performance, which has led to the development of several techniques for estimating a block’s importance\.
A foundational but now outdated approach is magnitude\-based pruning, which removes blocks based on their parameter magnitudes\. While widely used for individual weight pruning\(Liet al\.,[2016](https://arxiv.org/html/2605.14075#bib.bib32)\), this method proved too simplistic at the block level\. Still, it served as a useful baseline in the early development of structured pruning techniques\.
More recent work has focused on proxy\-based relevance scores that analyze how much a block transforms its input\. One popular class of methods uses cosine similarity between a block’s input and output, assuming that low transformation implies low relevance\(Menet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib33); Gromovet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib34); Heet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib35)\)\. Other studies rely on Taylor expansion techniques to estimate the change in loss when a weight or block is removed, providing a more gradient\-informed view of importance\(Kimet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib36); Maet al\.,[2023](https://arxiv.org/html/2605.14075#bib.bib37)\)\.
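As a minimal sketch of this class of proxies, the snippet below computes a cosine-similarity relevance score for a single block from its input and output hidden states. The function name `cos_sim_score`, the averaging over tokens, and the `1 - cos` convention are assumptions made for illustration (the convention mirrors the scores computed in Appendix B); individual methods differ in which tokens and samples they aggregate over.

```python
import numpy as np

def cos_sim_score(x_in: np.ndarray, x_out: np.ndarray, eps: float = 1e-12) -> float:
    """Return 1 - mean token-wise cosine similarity between block input and output.

    x_in, x_out: arrays of shape (n_tokens, d).
    A score near 0 means the block barely changes the direction of its input;
    a score near 1 means input and output are close to orthogonal.
    """
    dot = np.sum(x_in * x_out, axis=-1)
    norms = np.linalg.norm(x_in, axis=-1) * np.linalg.norm(x_out, axis=-1)
    return float(1.0 - np.mean(dot / np.maximum(norms, eps)))

# Toy hidden states (hypothetical values, two tokens with d = 2).
x = np.array([[1.0, 0.0], [0.0, 2.0]])
identity_like = cos_sim_score(x, 1.5 * x)    # output is a rescaled copy of the input
orthogonal = cos_sim_score(x, x[:, ::-1])    # output is orthogonal to the input
```

Note that the score depends only on the angle between input and output, so a block that rescales its input without rotating it is rated as irrelevant under this proxy.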
Another set of methods evaluates relevance by comparing the pruned model’s output to the original model’s, using metrics like cosine similarity, norm differences, and divergence-based measures. Zhang et al. ([2024b](https://arxiv.org/html/2605.14075#bib.bib39)), for example, employ Jensen-Shannon divergence to guide pruning and achieve state-of-the-art results. Follow-up work builds on this idea using KL divergence, a closely related metric: Yang et al. ([2026](https://arxiv.org/html/2605.14075#bib.bib47)) apply it as part of a multi-step strategy to create smaller models tailored to code generation, while Sieberling et al. ([2024](https://arxiv.org/html/2605.14075#bib.bib55)) combine it with a novel selection algorithm that prunes blocks jointly rather than iteratively.
Perplexity\-based metrics are also used, especially in language modeling, where blocks are considered irrelevant if their removal does not significantly increase perplexity\(Kimet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib36); Zhonget al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib38); Songet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib46)\)\. Beyond pruning\-specific methods, some studies draw on game\-theoretic tools, such as approximations of Shapley values, to assess a block’s contribution to the model’s output in a more theoretically grounded way\(Zhanget al\.,[2024a](https://arxiv.org/html/2605.14075#bib.bib40); Siddiquiet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib56)\)\.
While most pruning research focuses on overall effectiveness, few works ask whether a block’s relevance remains stable as the model is progressively pruned. He et al. ([2024](https://arxiv.org/html/2605.14075#bib.bib35)) and Lu et al. ([2024](https://arxiv.org/html/2605.14075#bib.bib45)), for instance, compare one-shot pruning, where relevance scores are computed a single time and used to select all blocks to prune, with iterative pruning, where relevance is recalculated and re-ranked after each pruning step. Their findings suggest that one-shot pruning can match or even outperform iterative pruning for structured sparsity. However, their analyses center on end-task accuracy rather than on how relevance itself shifts during the process. This leaves an important question unanswered: does pruning change block relevance? We address this question in Section [C.5](https://arxiv.org/html/2605.14075#A3.SS5).
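The difference between the two selection schemes can be sketched in a few lines. This is an illustrative sketch only: `one_shot_prune`, `iterative_prune`, and the toy scoring function `toy_scores` are hypothetical names, and the relevance values are made up to show a case where the two schemes disagree.

```python
from typing import Callable, Dict, List, Sequence

# A scoring function maps the ids of the remaining blocks to their relevance.
ScoreFn = Callable[[Sequence[int]], Dict[int, float]]

def one_shot_prune(blocks: Sequence[int], score_fn: ScoreFn, k: int) -> List[int]:
    """Score once on the full model and remove the k least relevant blocks."""
    scores = score_fn(blocks)
    return sorted(blocks, key=lambda b: scores[b])[:k]

def iterative_prune(blocks: Sequence[int], score_fn: ScoreFn, k: int) -> List[int]:
    """Re-score after every removal and remove the current least relevant block."""
    remaining, removed = list(blocks), []
    for _ in range(k):
        scores = score_fn(remaining)
        victim = min(remaining, key=lambda b: scores[b])
        remaining.remove(victim)
        removed.append(victim)
    return removed

def toy_scores(remaining: Sequence[int]) -> Dict[int, float]:
    """Made-up relevance: block 2 becomes critical once block 1 is gone."""
    base = {0: 0.5, 1: 0.1, 2: 0.2, 3: 0.9}
    scores = {b: base[b] for b in remaining}
    if 1 not in remaining and 2 in scores:
        scores[2] = 1.0
    return scores

one_shot = one_shot_prune(range(4), toy_scores, k=2)
iterative = iterative_prune(range(4), toy_scores, k=2)
```

In this toy setting the two schemes select different blocks precisely because relevance shifts after the first removal, which is the effect the question above asks about.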
While most methods rely on a single calibration dataset to assess relevance, some recent studies have started exploring the generalizability of relevance scores across datasets\.Heet al\.\([2024](https://arxiv.org/html/2605.14075#bib.bib35)\)found that relevance maps computed via cosine similarity appear largely consistent across datasets, leading them to conclude that certain layers may be universally important or unimportant\. This perceived dataset\-agnostic behavior motivated their decision to use a single calibration dataset throughout their experiments\. Their findings also connect to saturation\-based analyses\(Gevaet al\.,[2022](https://arxiv.org/html/2605.14075#bib.bib20); Lioubashevskiet al\.,[2025](https://arxiv.org/html/2605.14075#bib.bib31)\), which similarly suggest that once a model’s prediction stabilizes, later computations may be less critical\.
We take He et al. ([2024](https://arxiv.org/html/2605.14075#bib.bib35)) as our primary baseline because they provide one of the few systematic attempts to visualize and quantify block relevance across tasks. Their heatmaps offer a clear point of comparison for our own cross-task analysis, which builds on their setup but replaces similarity-based relevance with a task-grounded, accuracy-based metric.
## Appendix B Proof of Theorem 1
### B.1 Auxiliary Result
Before proving Theorem [1](https://arxiv.org/html/2605.14075#Thmtheorem1), we first prove the following auxiliary theorem:
###### Theorem 2
For any $\epsilon > 0$ and unlabeled calibration dataset $\mathbb{D} = \{s^{(i)}\}_{i=1}^{N}$, there exists a decoder-only Transformer $f^{L}$ with $L \geq 3$ and a labeling function $\mathcal{L}: \mathbb{D} \to \{0,1\}$ satisfying the following conditions:
1. There exists an intermediate layer $l \in \{1, \dots, L-2\}$ such that $\operatorname{CosSimScore}(l; \mathbb{D}) = \epsilon$, and $\operatorname{CosSimScore}(i; \mathbb{D}) > \epsilon$ for all $i \neq l$.
2. The full model achieves perfect accuracy: $f^{L}(s^{(i)}) = \mathcal{L}(s^{(i)})$ for all $s^{(i)} \in \mathbb{D}$, but removing layer $l$ causes the model’s accuracy to drop to zero: $f^{L}_{-l}(s^{(i)}) \neq \mathcal{L}(s^{(i)})$ for all $s^{(i)} \in \mathbb{D}$.
This result can be viewed as a simplified version of Theorem [1](https://arxiv.org/html/2605.14075#Thmtheorem1), where the labeling function is binary and freely chosen, rather than being fixed by the dataset $\mathbb{D}$.
Let $E(s^{(i)}) = \bm{X}^{(0,i)}$ denote the embedding of a sequence $s^{(i)}$, where $\bm{X}^{(l,i)} \in \mathbb{R}^{n \times d}$, with $n$ the number of tokens and $d$ the hidden dimension. The transformation at block $l$ is given by
$$\bm{X}^{(l+1,i)} = \bm{X}^{(l,i)} + f\big(\bm{X}^{(l,i)}; \bm{\theta}^{(l)}\big).$$
A decoder-only Transformer with $L$ blocks is then
$$f^{L}(s^{(i)}) = U\left(E(s^{(i)}) + \sum_{l=0}^{L-1} f\big(\bm{X}^{(l,i)}; \bm{\theta}^{(l)}\big)\right),$$
where $U(\cdot)$ denotes the final transformation applied to the output of the last block (e.g., an unembedding layer for next-token prediction or a classification head).
We also define the model obtained by removing block $l$, denoted $f_{-l}^{L}$. In this case, the output of block $l-1$ is fed directly to block $l+1$, bypassing block $l$. Formally,
$$f_{-l}^{L}(s^{(i)}) = U\left(E(s^{(i)}) + \sum_{\begin{subarray}{c}k=0\\ k\neq l\end{subarray}}^{L-1} f\big(\bm{X}^{(k,i)}; \bm{\theta}^{(k)}\big)\right),$$
with the convention that
$$\bm{X}^{(l+1,i)} = \bm{X}^{(l-1,i)} + f\big(\bm{X}^{(l-1,i)}; \bm{\theta}^{(l-1)}\big) \quad \text{for the pruned model}.$$
Let $\bm{1}_{n}$ denote the column vector of size $n$ with all entries equal to one, and $\bm{0}_{n}$ the zero vector of the same size.
Consider the embedding function
$$E(s^{(i)}) = \begin{bmatrix} \bm{0}_{n^{(i)}} & \delta \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} \end{bmatrix}, \qquad \forall s^{(i)} \in \mathbb{D},$$
with hidden dimension $d = 3$ and $\delta > 0$. Thus, every token in the vocabulary has the same embedding.
Define the labeling function $\mathcal{L}(s^{(i)}) = 0$ for all $s^{(i)} \in \mathbb{D}$, i.e., all sentences belong to the same class. The final transformation is a standard classification head
$$U(\bm{X}) = \operatorname*{arg\,max}_{j \in \{0,1\}} \mathrm{softmax}\!\left[\bm{X}_{n^{(i)}} \bm{W}^{U}\right]_{j},$$
where $\bm{X}_{n^{(i)}}$ is the representation of the last token, and
$$\bm{W}^{U} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ 0 & 0 \end{bmatrix}.$$
We now construct three blocks as follows (with $M \gg 1$):
$$\begin{aligned}
\bm{X}^{(1,i)} &= \bm{X}^{(0,i)} + \begin{bmatrix} \bm{0}_{n^{(i)}} & \bm{0}_{n^{(i)}} & M \cdot \bm{1}_{n^{(i)}} \end{bmatrix}, \\
\bm{X}^{(2,i)} &= \bm{X}^{(1,i)} + \begin{bmatrix} \delta \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} & \bm{0}_{n^{(i)}} \end{bmatrix}, \\
\bm{X}^{(3,i)} &= \bm{X}^{(2,i)} + \begin{bmatrix} \delta M \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} & -M \cdot \bm{1}_{n^{(i)}} \end{bmatrix}.
\end{aligned}$$
Each Transformer block contains a feed-forward network of the form
$$\text{FFN}(\bm{X}) = \mathrm{ReLU}\big(\bm{X}\bm{W}_{1} + \bm{1}_{n}\bm{b}_{1}^{\top}\big)\bm{W}_{2} + \bm{1}_{n}\bm{b}_{2}^{\top},$$
where $\bm{W}_{1}, \bm{W}_{2} \in \mathbb{R}^{d \times d}$ and $\bm{b}_{1}, \bm{b}_{2} \in \mathbb{R}^{d}$. Note that the bias vectors are written as $\bm{1}_{n}\bm{b}^{\top}$ so that dimensions match for sequence length $n$.
To enforce that the multi\-head attention does not modify the representation, we set its output to zero, so that the residual connection yields the identity mapping\.
For the FFN, we choose $\bm{W}_{1} = \bm{I}$ and $\bm{b}_{1} = \bm{0}$, and in Blocks 1 and 2 we set $\bm{W}_{2} = \bm{0}$, so that $\text{FFN}(\bm{X}) = \bm{1}_{n}\bm{b}_{2}^{\top}$ and the residual effect comes only from $\bm{b}_{2}$. Specifically:
- In Block 1, set $\bm{b}_{2} = (0, 0, M)^{\top}$ to add $M$ in the third coordinate.
- In Block 2, set $\bm{b}_{2} = (\delta, 0, 0)^{\top}$ to add $\delta$ in the first coordinate.
- In Block 3, we instead choose $\bm{W}_{2} = \begin{bmatrix} M & 0 & 0 \\ 0 & 0 & 0 \\ 0 & 0 & 0 \end{bmatrix}$ and $\bm{b}_{2} = (0, 0, -M)^{\top}$, so that the FFN contributes the transformation $\begin{bmatrix} \delta M \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} & -M \cdot \bm{1}_{n^{(i)}} \end{bmatrix}$.
With this construction, the model output is
$$f^{3}(s^{(i)}) = U\!\left(\begin{bmatrix} \delta(M+1) \cdot \bm{1}_{n^{(i)}} & \delta \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} \end{bmatrix}\right) = 0 = \mathcal{L}(s^{(i)}),$$
while pruning the second block (block index $l = 1$) yields
$$f^{3}_{-1}(s^{(i)}) = U\!\left(\begin{bmatrix} \bm{0}_{n^{(i)}} & \delta \cdot \bm{1}_{n^{(i)}} & \bm{0}_{n^{(i)}} \end{bmatrix}\right) = 1 \neq \mathcal{L}(s^{(i)}).$$
Finally, we compute the cosine-similarity scores between the input and output of each block, using the token representations $(0, \delta, 0)$, $(0, \delta, M)$, $(\delta, \delta, M)$, and $(\delta(M+1), \delta, 0)$ of $\bm{X}^{(0,i)}, \dots, \bm{X}^{(3,i)}$:
$$\begin{aligned}
\operatorname{CosSimScore}(0; \mathbb{D}) &= 1 - \frac{\delta}{\sqrt{\delta^{2} + M^{2}}}, \\
\operatorname{CosSimScore}(1; \mathbb{D}) &= 1 - \frac{\sqrt{\delta^{2} + M^{2}}}{\sqrt{2\delta^{2} + M^{2}}}, \\
\operatorname{CosSimScore}(2; \mathbb{D}) &= 1 - \frac{\delta(M+2)}{\sqrt{2\delta^{2} + M^{2}}\,\sqrt{(M+1)^{2} + 1}}.
\end{aligned}$$
As $M \to \infty$, we obtain
$$\operatorname{CosSimScore}(0; \mathbb{D}) \to 1, \quad \operatorname{CosSimScore}(1; \mathbb{D}) \to 0, \quad \operatorname{CosSimScore}(2; \mathbb{D}) \to 1.$$
Thus, by choosing appropriate values of $M$ and $\delta$, we can ensure
$$\operatorname{CosSimScore}(1; \mathbb{D}) = \epsilon,$$
which completes the proof of Theorem [2](https://arxiv.org/html/2605.14075#Thmtheorem2).
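The construction can be checked numerically with a small sketch. The code below is not the authors' implementation: it tracks a single token row (all rows are identical in the construction), omits attention and LayerNorm, and assumes that the FFN in Blocks 1 and 2 contributes only its bias $\bm{b}_{2}$ (realized here with a zero $\bm{W}_{2}$), which is what the stated block updates require. The helper names `ffn`, `build_blocks`, `run`, and `cos_sim_score` and the values $\delta = 1$, $M = 1000$ are illustrative choices.

```python
import numpy as np

def ffn(x, w2, b2):
    # FFN(X) = ReLU(X W1 + b1) W2 + b2 with W1 = I and b1 = 0.
    return np.maximum(x, 0.0) @ w2 + b2

def build_blocks(delta, M):
    zero = np.zeros((3, 3))  # assumed W2 for Blocks 1 and 2 (bias-only update)
    return [
        (zero, np.array([0.0, 0.0, M])),                      # Block 1: add M to coordinate 3
        (zero, np.array([delta, 0.0, 0.0])),                  # Block 2: add delta to coordinate 1
        (np.diag([M, 0.0, 0.0]), np.array([0.0, 0.0, -M])),   # Block 3: amplify coord 1, cancel coord 3
    ]

def run(delta, M, skip=None):
    """Return (predicted label, list of hidden states); `skip` is a 0-based block index."""
    x = np.array([0.0, delta, 0.0])        # shared token embedding
    states = [x]
    for idx, (w2, b2) in enumerate(build_blocks(delta, M)):
        if idx == skip:
            continue
        x = x + ffn(x, w2, b2)             # residual update
        states.append(x)
    label = int(np.argmax(x[:2]))          # W^U keeps the first two coordinates
    return label, states

def cos_sim_score(a, b):
    return 1.0 - float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

delta, M = 1.0, 1000.0
full_label, states = run(delta, M)
pruned_label, _ = run(delta, M, skip=1)    # remove the second block
scores = [cos_sim_score(states[l], states[l + 1]) for l in range(3)]
```

Under these assumptions the full model predicts class 0, the pruned model predicts class 1, and the middle block has by far the smallest score, which is the behaviour the theorem describes.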
It is also worth noting that this argument can be extended to multiple dimensions that do not affect the task, rather than relying on a single one. In this way, instead of requiring a single large value of $M$, we can spread several smaller values $M_{1}, M_{2}, \dots, M_{d}$ across those dimensions.
Finally, one might worry that this construction would fail in practice because each block also includes a LayerNorm operation applied after the residual aggregation. However, in our setup every row of $\bm{X}^{(l)}$ is identical, so each token representation has the same mean and variance at every step. Consequently, the effect of LayerNorm is deterministic and can be exactly canceled out by choosing the LayerNorm parameters $(\gamma, \beta)$ appropriately. In particular, setting $\gamma$ and $\beta$ to rescale and shift the normalized vectors recovers the pre-normalized representation, ensuring that LayerNorm does not alter the intended behavior of the construction.
### B.2 General Case
We first show that a decoder-only Transformer can trivially overfit any labeled calibration dataset $\mathbb{D} = \{(s^{(i)}, y^{(i)})\}_{i=1}^{N}$, where $s^{(i)} \neq s^{(j)}$ for $i \neq j$ and $y^{(i)} \in \{0, \dots, C-1\}$.
Suppose that the tokenizer assigns one token to each sequence $s^{(i)}$. Define an embedding function $E(\cdot)$ such that
$$E(s^{(i)}) = \bm{X}^{(i)} \in \mathbb{R}^{1 \times C}.$$
If we let
$$E(s^{(i)}) = \big(\bm{e}^{(y^{(i)}+1)}\big)^{\top},$$
where $\bm{e}^{(j)}$ is the $j$-th standard basis vector in $\mathbb{R}^{C}$, then the classification head
$$U(\bm{X}) = \operatorname*{arg\,max}_{j \in \{0, \dots, C-1\}} \mathrm{softmax}\!\bigl[\,\bm{X}\bm{W}^{U}\,\bigr]_{j},$$
with
$$\bm{W}^{U} = \begin{bmatrix} (\bm{e}^{(1)})^{\top} \\ \vdots \\ (\bm{e}^{(C)})^{\top} \end{bmatrix},$$
perfectly classifies the dataset, i.e., $f(s^{(i)}) = y^{(i)}$. Thus, the model can memorize the dataset without any Transformer blocks, using only embeddings and the unembedding matrix.
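A short sketch makes the memorization argument concrete. The function name `memorizing_classifier` and the example labels are hypothetical; the code only instantiates the one-hot embedding and identity unembedding described above.

```python
import numpy as np

def memorizing_classifier(labels: np.ndarray, num_classes: int) -> np.ndarray:
    """Predictions of a block-free model: one-hot embedding followed by W^U = I."""
    embeddings = np.eye(num_classes)[labels]   # row i is e^(y_i + 1) in the 1-based notation
    w_u = np.eye(num_classes)                  # rows are the standard basis vectors
    logits = embeddings @ w_u
    # Softmax is monotone, so the argmax of the logits equals the argmax of the softmax.
    return np.argmax(logits, axis=-1)

labels = np.array([2, 0, 1, 2, 1])             # made-up labels with C = 3
preds = memorizing_classifier(labels, num_classes=3)
```

Because each sequence is its own token, the embedding table can store the label directly, and no Transformer block is needed to recover it.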
We now extend this idea to construct a model satisfying the conditions of Theorem [1](https://arxiv.org/html/2605.14075#Thmtheorem1). Let the hidden dimension be $d = 2C + 1$. Define the embedding as
$$E(s^{(i)}) = \delta \cdot \big(\bm{e}^{(C + y^{(i)} + 2)}\big)^{\top} \in \mathbb{R}^{d},$$
so that each input is mapped into a unique coordinate among the last $C$ dimensions (beyond the first $C + 1$).
We construct three Transformer blocks as follows (with $M \gg 1$):
- Block 1. Adds $M$ to coordinate $C+1$: $\bm{X}^{(1,i)} = \bm{X}^{(0,i)} + M \cdot \bm{e}^{(C+1)}$.
- Block 2. Adds $\delta \cdot \bm{e}^{(y^{(i)}+1)}$, i.e., a one-hot signal in the first $C$ coordinates corresponding to the correct class: $\bm{X}^{(2,i)} = \bm{X}^{(1,i)} + \delta \cdot \bm{e}^{(y^{(i)}+1)}$.
- Block 3. Amplifies the signal in the first $C$ coordinates by $(M - \delta)$, subtracts $M$ from coordinate $C+1$, and adds a misleading one-hot vector from the last $C$ dimensions: $\bm{X}^{(3,i)} = \bm{X}^{(2,i)} + (M - \delta) \cdot \bm{e}^{(y^{(i)}+1)} - M \cdot \bm{e}^{(C+1)} + \delta \cdot \bm{e}^{(C - y^{(i)})}$.
As in the proof of Theorem [2](https://arxiv.org/html/2605.14075#Thmtheorem2), we ensure that multi-head attention acts as the identity by setting its output projection $\bm{W}_{O} = 0$, and we use the feed-forward networks with suitable $(\bm{W}_{1}, \bm{W}_{2}, \bm{b}_{1}, \bm{b}_{2})$ to realize the desired additive transformations.
After the three blocks, the first $C$ coordinates of $\bm{X}^{(3,i)}$ are dominated by $(M - \delta + \delta) \cdot \bm{e}^{(y^{(i)}+1)} = M \cdot \bm{e}^{(y^{(i)}+1)}$, while the misleading additions are suppressed. Thus, the classifier $U$ correctly outputs $y^{(i)}$ for all $i$, and the model achieves perfect accuracy.
However, if Block 2 is removed, then the model never inserts the signal in the first $C$ coordinates. Block 3 then only contributes spurious information, and the classification head produces incorrect labels for all samples. Therefore, the pruned model fails completely.
Finally, as in Theorem [2](https://arxiv.org/html/2605.14075#Thmtheorem2), we compute the cosine similarity scores for each block. By taking $M \to \infty$ and choosing $\delta$ appropriately, we ensure that Block 2 attains $\operatorname{CosSimScore}(l; \mathbb{D}) = \epsilon$, while the others approach $1$. Thus, the theorem follows.
Two final remarks are worth noting. First, if the number of classes $C$ is odd, the pruned model may not achieve zero accuracy. Specifically, instances assigned to class $\frac{C-1}{2}$ will be classified correctly, as the misleading signal coincides with the correct label. This issue is trivial to resolve by adjusting the label assignment or class structure.
Second, as in the proof of Theorem[2](https://arxiv.org/html/2605.14075#Thmtheorem2), the presence of normalization layers does not invalidate the construction\. This is because the mean and variance of each token representation within a block remain constant across instances, ensuring that normalization does not interfere with the mechanism underlying the proof\.
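The general construction, including the odd-$C$ remark, can be sanity-checked with the following sketch. Several modelling choices here are assumptions rather than part of the proof: attention and LayerNorm are omitted, the classifier is taken to read the first $C$ coordinates, and Block 3's amplification is implemented as a multiplicative gain $(M - \delta)/\delta$ on the first $C$ coordinates, so that a $\delta$-sized signal gains $M - \delta$ and an absent signal gains nothing. The names `e`, `predict`, and `accuracy` are hypothetical.

```python
import numpy as np

def e(j: int, d: int) -> np.ndarray:
    """1-based standard basis vector e^(j) in R^d."""
    v = np.zeros(d)
    v[j - 1] = 1.0
    return v

def predict(y: int, C: int, delta: float = 1.0, M: float = 1000.0,
            skip_block2: bool = False) -> int:
    d = 2 * C + 1
    x = delta * e(C + y + 2, d)            # embedding: class stored in the last C coordinates
    x = x + M * e(C + 1, d)                # Block 1
    if not skip_block2:
        x = x + delta * e(y + 1, d)        # Block 2: correct one-hot signal
    # Block 3 (assumed form): scale the first C coordinates, cancel coordinate C+1,
    # and add the misleading one-hot vector e^(C - y).
    update = np.zeros(d)
    update[:C] = (M - delta) / delta * x[:C]
    x = x + update - M * e(C + 1, d) + delta * e(C - y, d)
    return int(np.argmax(x[:C]))           # assumed classifier: argmax over the first C coordinates

def accuracy(C: int, skip_block2: bool) -> float:
    return float(np.mean([predict(y, C, skip_block2=skip_block2) == y for y in range(C)]))

full_even, pruned_even = accuracy(4, False), accuracy(4, True)
pruned_odd = accuracy(3, True)
```

With an even number of classes the pruned model misclassifies every class, while with $C = 3$ only the middle class $\frac{C-1}{2} = 1$ survives, in line with the first remark above.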
## Appendix C Further Analysis about Relevance Consistency Across Datasets
In this section, we extend the analysis of Section [5.1](https://arxiv.org/html/2605.14075#S5.SS1), which concerns the work of He et al. ([2024](https://arxiv.org/html/2605.14075#bib.bib35)).
### C.1 Implementation Details
All experiments were conducted on pre\-trained models, using code based on the EleutherAI LM Evaluation Harness\(Gaoet al\.,[2024](https://arxiv.org/html/2605.14075#bib.bib61)\)for our accuracy\-based scores\. We used a batch size of 4 and ran evaluations on NVIDIA RTX A6000 and RTX 4090 GPUs\.
To compute cosine similarity relevance scores, we used the same hardware and followed the methodology introduced in Section[3](https://arxiv.org/html/2605.14075#S3), based on the implementation fromHeet al\.\([2024](https://arxiv.org/html/2605.14075#bib.bib35)\)\. Each sample in this method is a full input sequence matching the model’s context length \(e\.g\., 4096 tokens for Mistral\-7B\), constructed by concatenating multiple dataset instances until the required token length is reached\. Because instance lengths vary across datasets, the number of instances per sample also varies\. FollowingHeet al\.\([2024](https://arxiv.org/html/2605.14075#bib.bib35)\), we use 256 such samples per dataset for C4, LIMA, CodeAlpaca, and MathInstruct\.
For fairness, we used the same dataset instances to compute our accuracy\-based relevance scores\. However, unlike cosine similarity, we did not concatenate instances into long sequences\. Instead, we evaluated next\-token prediction at the instance level, computing accuracy on the last token of each instance\. Thus, while the underlying data is shared, the two metrics differ in their evaluation granularity\.
For the remaining datasets, we used the training split associated with each task and modified the input format used during relevance scoring to compute cosine similarity scores\. Specifically, instead of generating full answer phrases, we presented all answer options explicitly \(e\.g\., “A”, “B”, “C”, “D”\) within the prompt and computed the probability of generating only the correct option token\. This adjustment was necessary to ensure the model received all relevant information required for task evaluation\. In contrast, no such modification was needed for our accuracy\-based metric, as we followed standard LM evaluation protocols for multiple\-choice tasks\.
Following these protocols \(Gao et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib61)\), we constructed input prompts by concatenating the context, question, and each answer option individually\. For each example, we computed the total log\-probability of the full prompt associated with each option and selected the one with the highest value\. We report normalized accuracy, which adjusts log\-probabilities for option length to ensure fairness between longer and shorter candidates\. A prediction is counted as correct if the selected option matches the gold label\.
### C\.2 Normalization Details
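The selection rule can be illustrated with a small sketch\. It assumes the per\-option log\-probabilities have already been computed and normalizes by dividing each one by an option length; the unit of that length \(characters, bytes, or tokens\) is an assumption here, and the function names and numbers are hypothetical\.

```python
from typing import Sequence, Tuple

# (log-probability per option, length per option, index of the gold option)
Example = Tuple[Sequence[float], Sequence[int], int]


def pick_option(logprobs: Sequence[float], lengths: Sequence[int], normalize: bool = True) -> int:
    """Return the index of the highest-scoring option, optionally length-normalized."""
    scores = [lp / n if normalize else lp for lp, n in zip(logprobs, lengths)]
    return max(range(len(scores)), key=scores.__getitem__)


def accuracy(examples: Sequence[Example], normalize: bool = True) -> float:
    """Fraction of examples whose selected option matches the gold label."""
    hits = sum(pick_option(lp, ln, normalize) == gold for lp, ln, gold in examples)
    return hits / len(examples)


toy_examples = [
    ([-6.0, -9.0], [3, 10], 1),  # the longer gold option wins only after normalization
    ([-2.0, -8.0], [4, 4], 0),   # equal lengths: normalization changes nothing
]
raw_acc = accuracy(toy_examples, normalize=False)
norm_acc = accuracy(toy_examples, normalize=True)
print(raw_acc, norm_acc)  # 0.5 1.0
```

The first toy example shows why normalization matters: the raw log\-probability favors the short option, while the length\-adjusted score favors the longer gold option\.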
To complement our block relevance visualizations and quantify how our accuracy\-based score captures more variation across datasets than cosine similarity, we compute the variance in relevance across tasks for both methods\. For each model and method, we first apply z\-score normalization to the block relevance scores, then calculate the variance across datasets for each of the 32 layers—yielding 32 variance values per model\-method combination\. In Figure[5](https://arxiv.org/html/2605.14075#S5.F5), we report the mean variance and standard deviation error bars for each model and method\.
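The computation above can be sketched with the standard library\. Two details are assumptions, because the text does not specify them: the z\-score is applied per dataset across blocks, and population \(not sample\) variance is used\. The dataset names and scores are toy values\.

```python
from statistics import mean, pstdev, pvariance
from typing import Dict, List


def zscore(values: List[float]) -> List[float]:
    """Standardize one dataset's block scores to zero mean and unit variance."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def per_block_variance(scores_by_dataset: Dict[str, List[float]]) -> List[float]:
    """Variance across datasets of the z-scored relevance, one value per block."""
    normalized = [zscore(row) for row in scores_by_dataset.values()]
    return [pvariance(column) for column in zip(*normalized)]


# Toy example with 3 blocks and 2 datasets that rank the blocks in opposite order.
toy_scores = {"dataset_a": [1.0, 2.0, 3.0], "dataset_b": [3.0, 2.0, 1.0]}
toy_variances = per_block_variance(toy_scores)
print(toy_variances)        # approximately [1.5, 0.0, 1.5]
print(mean(toy_variances))  # approximately 1.0
```

A metric that assigns the same relevance profile to every dataset would give variances near zero for all blocks, which is the behavior the comparison is designed to expose\.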
### C\.3 Wilcoxon signed\-rank test
When comparing our accuracy\-based relevance to cosine similarity, we found significantly higher variance under our metric\. The Wilcoxon signed\-rank test \(Wilcoxon, [1992](https://arxiv.org/html/2605.14075#bib.bib4)\) yielded p = 1\.7e\-7, W = 20 for Mistral; p = 8\.8e\-9, W = 7 for OLMo; and p = 4\.6e\-10, W = 0 for Pythia\.
### C\.4 Relevance During Training
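As a sanity check on how W maps to a p\-value, the sketch below computes an exact two\-sided p\-value for the signed\-rank statistic\. It assumes n = 32 paired per\-block variances with no ties and no zero differences, and that W is the smaller of the two signed rank sums; the text does not state which variant of the test was used, so this is an illustration and not a reproduction of the authors' procedure\.

```python
def exact_wilcoxon_p(w: int, n: int) -> float:
    """Exact two-sided p-value for W = min(W+, W-) with n untied, non-zero pairs."""
    # counts[s] = number of subsets of the ranks {1, ..., n} whose sum equals s.
    # Only sums up to w are needed, so the table is capped at w.
    counts = [0] * (w + 1)
    counts[0] = 1
    for rank in range(1, n + 1):
        for s in range(w, rank - 1, -1):
            counts[s] += counts[s - rank]
    # Under the null, each of the 2**n sign patterns is equally likely.
    return min(1.0, 2 * sum(counts) / 2 ** n)


for model, w in [("Mistral", 20), ("OLMo", 7), ("Pythia", 0)]:
    print(model, w, exact_wilcoxon_p(w, n=32))
```

Under these assumptions the function returns roughly 1\.73e\-7, 8\.85e\-9, and 4\.66e\-10, which is in line with the values reported above\. W = 0 means every block had higher variance under one metric, giving the smallest possible p\-value for 32 pairs, 2 / 2^32\.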
Figure 7: Block relevance during training in OLMo \(left\) and Pythia \(right\) on the MathInstruct dataset\. A, each row corresponds to a model checkpoint trained on a given number of tokens in billions \(OLMo\) or training iterations in millions \(Pythia, in B\), with accuracy reported on the y\-axis\. A, B, accuracy\-based score\. C, D, cosine\-similarity score\.

Block relevance patterns evolve during training, but in markedly different ways depending on the metric\. Our accuracy\-based metric \(Figures [7](https://arxiv.org/html/2605.14075#A3.F7)A and [7](https://arxiv.org/html/2605.14075#A3.F7)B\) behaves erratically throughout training, with some blocks gaining or losing relevance between checkpoints without following smooth trends\. While certain blocks in OLMo tend to increase in relevance, these changes are rarely monotonic\. This fluctuation suggests that blocks may take on transient, adaptive roles throughout training, dynamics that cosine similarity tends to obscure\.
Cosine similarity \(Figures[7](https://arxiv.org/html/2605.14075#A3.F7)C and[7](https://arxiv.org/html/2605.14075#A3.F7)D\) reveals consistent patterns across models\. For both models, most blocks either maintain their relevance scores or gradually increase throughout training\. This suggests that some blocks increasingly modify their inputs as training progresses\. However, it’s important to note that this does not necessarily reflect how much each block contributes to the model’s output\.
Figure 8: Block relevance during training in OLMo on the CodeAlpaca dataset\. Each row corresponds to a model checkpoint trained on a given number of tokens in billions \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figure 9: Block relevance during training in OLMo on the C4 dataset\. Each row corresponds to a model checkpoint trained on a given number of tokens in billions \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figure 10: Block relevance during training in OLMo on the LIMA dataset\. Each row corresponds to a model checkpoint trained on a given number of tokens in billions \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figures [8](https://arxiv.org/html/2605.14075#A3.F8), [9](https://arxiv.org/html/2605.14075#A3.F9), and [10](https://arxiv.org/html/2605.14075#A3.F10) present the results on CodeAlpaca, C4, and LIMA, respectively, using OLMo\. As with MathInstruct, the cosine similarity\-based relevance \(bottom figures\) produces nearly identical heatmaps across datasets, reinforcing the metric’s stability and dataset\-agnostic nature\. Interestingly, we also find a pattern not discussed in prior work: block 2 shows a non\-monotonic trajectory in which its relevance increases at early stages and later decreases, a pattern that could be studied in future work\.
In contrast, the accuracy\-based relevance \(top figures\) continues to show less consistent and less interpretable patterns\. While some blocks exhibit periods of increased or decreased relevance, there are no clear, sustained trends comparable to those seen with cosine similarity\.
Figure 11: Block relevance during training in Pythia on the CodeAlpaca dataset\. Each row corresponds to a model checkpoint trained with a given number of iterations in thousands \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figure 12: Block relevance during training in Pythia on the C4 dataset\. Each row corresponds to a model checkpoint trained with a given number of iterations in thousands \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figure 13: Block relevance during training in Pythia on the LIMA dataset\. Each row corresponds to a model checkpoint trained with a given number of iterations in thousands \(shown on the y\-axis\), with accuracy reported in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figures [11](https://arxiv.org/html/2605.14075#A3.F11) to [13](https://arxiv.org/html/2605.14075#A3.F13) show the results for the same experiment, but with Pythia on the same four datasets used in previous sections\. Unlike the OLMo figures, we apply a separate color scale for block 1 in the cosine similarity plots \(bottom figures\) for all Pythia figures\. This is necessary because the relevance values of the first block are significantly higher than the rest; a single color scale would make differences between blocks 2 to 31 nearly invisible\.
As with OLMo, cosine similarity yields nearly identical relevance patterns across datasets, reinforcing the observation that this metric is largely insensitive to the specific task\. However, a new behavior emerges in Pythia: some blocks show an initial drop in relevance between the first and second checkpoints, but then stabilize or fluctuate rather than continue decreasing\. This, along with the unusual pattern in block 2 in OLMo, suggests that certain relevance dynamics may be model\-specific\. More precisely, we suspect they may be seed\-specific: different initializations of the same model trained on the same data could produce distinct relevance trajectories\.
In contrast, our accuracy\-based metric continues to show no clear, smooth patterns across training steps, and exhibits noticeable differences between datasets\. One particularly interesting finding is that blocks 17 to 31 appear nearly irrelevant under cosine similarity for all datasets—yet our method shows that pruning some of these blocks can significantly hurt performance\. This further illustrates that cosine similarity can miss important functional contributions of blocks, reinforcing the need for task\-aware relevance measures\.
Finally, through our experiments, we do not observe a clear relationship between block relevance patterns and the model’s accuracy gains throughout training for either metric\. In other words, changes in block relevance do not directly correlate with improvements in overall performance, highlighting the complexity of the internal dynamics involved during model learning\.
### C\.5 Relevance During Pruning
Figure 14: Block relevance on Mistral in MathInstruct \(left\) and CodeAlpaca \(right\) as blocks are iteratively pruned\. A, at each row the least relevant block, according to the accuracy\-based score of Mistral on MathInstruct, is removed and shown with a gray cross\. The accuracy of the pruned model is shown on the right\. B, the same on CodeAlpaca\. C, D, using cosine\-similarity score\.

Pruning significantly changes block relevance, especially under our accuracy\-based metric\. As model blocks are pruned, we observe that certain blocks increase in importance while others become less critical\. These shifts reveal that accuracy\-based relevance captures latent dependencies and compensatory dynamics between layers\. To better understand how these shifts in relevance emerge, we performed iterative structured pruning on Mistral\-7B\. At each step, we \(1\) compute block relevance using either our accuracy\-based method or cosine similarity, \(2\) remove the least relevant block, and \(3\) repeat steps one and two until 25% of blocks are pruned\. Figure [14](https://arxiv.org/html/2605.14075#A3.F14) shows results for MathInstruct and CodeAlpaca, while Figure [15](https://arxiv.org/html/2605.14075#A3.F15) and Figure [16](https://arxiv.org/html/2605.14075#A3.F16) show results for C4 and LIMA, respectively\. The figures also report the accuracy for the same dataset used for pruning\.
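The three\-step procedure above can be sketched as a short loop\. The scoring function is passed in as a parameter because either metric can be used; `toy_score_fn` and its numbers are hypothetical and only mimic the kind of interaction discussed in this section, where removing one block raises the relevance of another\.

```python
from typing import Callable, Iterable, List


def iterative_prune(
    blocks: Iterable[int],
    score_fn: Callable[[List[int]], List[float]],
    ratio: float = 0.25,
) -> List[int]:
    """Repeatedly re-score the kept blocks and remove the least relevant one."""
    kept = list(blocks)
    removed: List[int] = []
    n_remove = int(len(kept) * ratio)
    for _ in range(n_remove):
        scores = score_fn(kept)  # step 1: one relevance score per kept block
        idx = min(range(len(kept)), key=scores.__getitem__)
        removed.append(kept.pop(idx))  # step 2: drop the least relevant block
    return removed  # step 3 is the loop itself


def toy_score_fn(kept: List[int]) -> List[float]:
    """Hypothetical relevance: block 5 becomes important once block 6 is gone."""
    base = {0: 0.9, 1: 0.5, 2: 0.4, 3: 0.6, 4: 0.3, 5: 0.2, 6: 0.1, 7: 0.8}
    scores = {b: base[b] for b in kept}
    if 6 not in kept and 5 in kept:
        scores[5] = 0.7
    return [scores[b] for b in kept]


removed_blocks = iterative_prune(range(8), toy_score_fn, ratio=0.25)
print(removed_blocks)  # [6, 4]
```

In this toy setting, ranking the initial scores once would remove blocks 6 and 5, whereas re\-scoring after each removal spares block 5 and removes block 4 instead\.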
Pruning using the accuracy\-based score \(Figures [14](https://arxiv.org/html/2605.14075#A3.F14)A and [14](https://arxiv.org/html/2605.14075#A3.F14)B\) reveals complex dynamics\. First, after pruning a block, earlier \(closer to the input\) and/or later \(closer to the output\) blocks can gain relevance\. For example, in MathInstruct \(Figure [14](https://arxiv.org/html/2605.14075#A3.F14)A\), block 17, initially of moderate importance, becomes highly relevant once later blocks are removed, suggesting that pruning can reassign or expose latent functional roles\. Second, blocks with negative relevance \(green blocks\) become neutral or positive after pruning\. For example, in the first row of Figure [14](https://arxiv.org/html/2605.14075#A3.F14)A, several green blocks change behavior after pruning block 23, implying that they were not inherently harmful but instead interacted negatively with it\. Third, blocks with high relevance decrease in relevance after pruning\. For example, block 31 becomes less relevant after block 23 is removed, which we speculate reflects a compensatory role: block 31 may have been mitigating the detrimental effects of block 23, a pattern that aligns with prior findings on corrective behavior \(Geva et al\., [2022](https://arxiv.org/html/2605.14075#bib.bib20)\)\. These examples suggest that our metric can serve as a tool for studying the inner workings of transformers\.
As expected, when pruning using our method, we observed differences in relevance at the dataset level\. In MathInstruct, pruning blocks triggers sharp shifts in relevance, while in CodeAlpaca, the relevance landscape remains relatively stable in the early pruning steps\. Notably, CodeAlpaca lacks negatively relevant blocks at initialization, suggesting less redundancy or a more uniform functional distribution, among other possible explanations\. This phenomenon opens new avenues for research\.
On the other hand, under cosine similarity \(Figures[14](https://arxiv.org/html/2605.14075#A3.F14)C and[14](https://arxiv.org/html/2605.14075#A3.F14)D\), we observe that relevance changes after pruning are generally local and limited\. Only the later blocks of the network, those positioned after the pruned block, display relevance changes according to this measure\. For instance, in MathInstruct, pruning block 27 results in slight increases of cosine\-similarity score in later blocks, while earlier blocks remain unaffected\. Even though one might expect this behavior given the local nature of the metric, this explanation is only partially correct\. Since cosine similarity is computed locally, only blocks following the pruned one can exhibit changes in relevance\. Mathematically, these changes could be either increases or decreases; however, in practice, we observe only increases\.
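The locality argument can be checked on a toy residual stack\. The sketch assumes the score of a block is one minus the cosine similarity between its input and its output, which is the usual form of this metric; the exact definition used in the paper is given in Section 3 and is not repeated here\. The two\-dimensional blocks and their coefficients are hypothetical\.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]
Block = Callable[[Vector], Vector]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def cosine_relevance(blocks: Sequence[Block], x: Vector) -> List[float]:
    """Score each block by how much it rotates its own input (assumed form: 1 - cos)."""
    scores = []
    for block in blocks:
        y = block(x)
        scores.append(1.0 - cosine(x, y))
        x = y
    return scores


def make_block(a: float, b: float) -> Block:
    """Toy residual block: adds a small mix of the other coordinate to each coordinate."""
    return lambda x: [x[0] + a * x[1], x[1] + b * x[0]]


toy_blocks = [make_block(0.5, 0.0), make_block(0.0, 0.3), make_block(0.2, 0.1)]
x0 = [1.0, 1.0]
full_scores = cosine_relevance(toy_blocks, x0)
pruned_scores = cosine_relevance([toy_blocks[0], toy_blocks[2]], x0)  # middle block removed

print(full_scores[0] == pruned_scores[0])  # True: the earlier block sees the same input
print(full_scores[2], pruned_scores[1])    # the later block's score changes
```

The block before the pruned one receives exactly the same input as before, so its score cannot move\. The block after it receives a different input, so its score can move in either direction; in this toy case it happens to decrease\.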
When using accuracy\-based relevance, iterative pruning produces a different model compared to one\-shot pruning, which removes all least\-relevant blocks simultaneously based on initial relevance scores\. As shown in Figure[14](https://arxiv.org/html/2605.14075#A3.F14), our metric reveals that block relevance changes significantly after each pruning step, with new dependencies and compensatory patterns emerging across layers\. For example, under one\-shot pruning, blocks 16, 19, 21, 23, 26, 27, 28, and 29 would be removed from Mistral \(Figure[14](https://arxiv.org/html/2605.14075#A3.F14)A first row\); in this case, the pruned model would exhibit an accuracy of 0\.22 \(data not shown\)\. In contrast, based on iterative pruning, we removed different blocks, resulting in a pruned model accuracy of 0\.44\. Our results indicate that one\-shot pruning may not be suitable when employing accuracy\-based relevance\. In contrast, cosine similarity yields nearly identical results for both one\-shot and iterative pruning since relevance scores remain largely stable throughout the pruning steps\.
Figure 15: Block relevance in Mistral on the C4 dataset as layers are incrementally pruned\. In each row, the least relevant block \(according to the corresponding metric\) is removed and shown with a gray cross\. The accuracy of the pruned model is shown in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

Figure 16: Block relevance in Mistral on the LIMA dataset as layers are incrementally pruned\. In each row, the least relevant block \(according to the corresponding metric\) is removed and shown with a gray cross\. The accuracy of the pruned model is shown in parentheses\. \(Top\) Accuracy\-based score\. \(Bottom\) Cosine\-similarity score\.

For the C4 and LIMA datasets, we observe patterns similar to those discussed above: our accuracy\-based relevance scores reveal richer dynamics than cosine similarity\.
With the accuracy\-based metric, we see both increases and decreases in relevance as pruning progresses\. In rare cases, such as block 1, relevance remains stable throughout\. In contrast, cosine similarity mostly shows increasing relevance in blocks that follow the pruned one, while other blocks remain largely unaffected\.
An interesting pattern emerges in both C4 and LIMA: block 20 consistently increases in relevance under our metric\. This may suggest a shared functional role between these two tasks, though it may also be coincidental\. A deeper investigation into this connection would be valuable\.
Regarding the comparison between one\-shot and iterative pruning, we noted that the two approaches often select different sets of blocks for removal\. However, the reasons for these divergences differ depending on the relevance metric\.
For cosine similarity \(Figures[14](https://arxiv.org/html/2605.14075#A3.F14)C and[14](https://arxiv.org/html/2605.14075#A3.F14)D\), the evolution is mostly predictable\. As discussed earlier, pruning a block tends to increase the similarity scores of subsequent blocks\. As a result, iterative pruning diverges from one\-shot pruning primarily when the least relevant block is not one of the later\-positioned layers\. For example, in MathInstruct, block 28 initially had low relevance, but pruning earlier blocks \(e\.g\., block 27\) increased its relevance, causing it to be excluded from later pruning steps\. A similar shift happens with block 26\. If the initial relevance ordering of blocks 21–28 had been strictly decreasing, both pruning methods would have selected the same blocks\. The observed deviations result from small, local shifts in relevance caused by positional effects during pruning\.
### C\.6 Size of the calibration dataset
To further analyze our metric, we applied it to LLaMA\-3\-8B using four different datasets: C4, LIMA, MathInstruct, and CodeAlpaca, each with varying dataset sizes\. Figure[17](https://arxiv.org/html/2605.14075#A3.F17)presents the results of this experiment\. We observe that after using approximately 500 samples, the heatmaps begin to converge toward the scores computed with 3,000 samples\. While some differences remain noticeable, they are not critical for the overall ranking; in other words, the sets of relevant and irrelevant blocks remain consistent\.
Figure 17: Block relevance in LLaMA\-3\-8B across four datasets\. Each row uses a different calibration\-set size to compute the metric\.
## Appendix D Further Analysis about Differences Between Type of Tasks
Figure[18](https://arxiv.org/html/2605.14075#A4.F18)shows the same results as Figure[2](https://arxiv.org/html/2605.14075#S1.F2), with the addition of results obtained using cosine similarity under an iterative pruning strategy\. As discussed in the main paper, the original conclusions still hold—although the performance drop in HellaSwag is now less abrupt\.
It’s also worth noting that there are clear differences between the two ways of applying the cosine similarity score, which supports our argument that this metric should be used with caution when making assumptions about the internal mechanisms of Transformer models\.
Figure 18: Evaluation of LLaMA\-3\-8B under the cosine\-similarity pruning strategy of Gromov et al\. \([2025](https://arxiv.org/html/2605.14075#bib.bib34)\) compared with our proposed method and cosine\-similarity score with an iterative pruning strategy\.
## Appendix E Structured Pruning
### E\.1 Implementation details
Since all benchmarks used in our structured pruning experiments are multiple\-choice tasks, we followed the same protocol and considerations as mentioned in Appendix[C\.1](https://arxiv.org/html/2605.14075#A3.SS1)\.
We evaluate models in a zero\-shot setting on all tasks except for MMLU, where we use the five\-shot format commonly adopted in prior work \(Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39); He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35)\)\.
For Taylor relevance, we implement the element\-wise importance formulation from Ma et al\. \([2023](https://arxiv.org/html/2605.14075#bib.bib37)\), using absolute weight–gradient products aggregated via sum, which their study identified as the best\-performing setup\. For Cosine Similarity, we follow the approach of He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\), concatenating multiple examples to form long input sequences that align with the model’s context window, as explained in Appendix [C\.1](https://arxiv.org/html/2605.14075#A3.SS1)\.
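The Taylor score described above reduces to a sum of absolute weight\-gradient products\. The sketch below shows only that aggregation on flat lists; in practice the weights and gradients would be a block's parameter tensors and the gradients of the loss with respect to them\. The function name and toy values are hypothetical, and this is not the code of Ma et al\.

```python
from typing import Sequence


def taylor_block_importance(weights: Sequence[float], grads: Sequence[float]) -> float:
    """Sum of element-wise |weight * gradient| over all parameters of one block."""
    return sum(abs(w * g) for w, g in zip(weights, grads))


# Toy block with three parameters and their (hypothetical) loss gradients.
toy_weights = [0.5, -2.0, 1.0]
toy_grads = [0.2, 0.1, -0.3]
toy_importance = taylor_block_importance(toy_weights, toy_grads)
print(toy_importance)  # approximately 0.6
```

Taking the absolute value before summing prevents positive and negative products from cancelling each other out\.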
All models are evaluated using the LM Evaluation Harness \(Gao et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib61)\), ensuring consistency with prior structured pruning work\. Experiments were run on NVIDIA RTX A6000 and RTX 4090 GPUs, using a batch size of 4\.
### E\.2 Mistral
Table 4: Accuracy of pruned Mistral\-7B models on eight downstream tasks\. All methods remove 25% of the layers using task\-specific relevance estimates computed from each task’s training set\. Our accuracy\-based approach consistently outperforms baselines\. Best results per task are in bold\. “Original” refers to the unpruned model\.

| Methods | Arc\-C | Arc\-E | BoolQ | OBQA | HS | PIQA | WG | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 53\.67 | 79\.55 | 83\.73 | 44\.4 | 81\.03 | 82\.26 | 74\.27 | 62\.48 |
| Taylor | 23\.46 | 29\.8 | 55\.72 | 24\.4 | 32\.15 | 65\.94 | 51\.46 | 24\.16 |
| Cosine Similarity | 42\.41 | 60\.23 | 66\.73 | 37\.4 | 70\.43 | 73\.72 | 70\.48 | 42\.29 |
| Out\. Cosine\-Sim | 38\.41 | 68\.39 | 64\.62 | 37\.8 | 70\.82 | 78\.56 | 61\.8 | 26\.69 |
| Out\. Norm\-Sim | 38\.41 | 68\.39 | 66\.54 | 37\.8 | 70\.47 | 78\.56 | 61\.96 | 38\.73 |
| Out\. Divergence\-Sim | 32\.51 | 56\.99 | 58\.2 | 34\.4 | 66\.97 | 74\.32 | 59\.83 | 33\.41 |
| Perplexity | 40\.96 | 59\.18 | 64\.86 | 36\.4 | 62\.98 | 71\.71 | 64\.72 | 57\.86 |
| Acc \(Ours\) | **46\.42** | **74\.83** | **82\.29** | **42\.4** | **75\.77** | **80\.52** | **72\.46** | **61\.18** |

To assess the generality of our approach across architectures, we replicate the structured pruning experiment described in the main paper \(Section [6](https://arxiv.org/html/2605.14075#S6)\) using Mistral\-7B\. The pruning setup, datasets, evaluation method, and relevance proxies are identical to those used in the LLaMA3 experiment\.
As shown in Table [4](https://arxiv.org/html/2605.14075#A5.T4), our accuracy\-based relevance method consistently outperforms all baselines across tasks, confirming its robustness beyond a single model family\. However, unlike with LLaMA3, the task\-specific pruned models do not surpass the performance of the unpruned model\. This aligns with observations in prior work \(He et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib35); Zhang et al\., [2024b](https://arxiv.org/html/2605.14075#bib.bib39)\), which also report significant differences between models in the percentage of original performance retained after pruning\.
### E\.3 One\-shot
Table 5: One\-shot structured pruning results on LLaMA3\-8B across eight downstream benchmarks\. In this setting, relevance scores are computed once and used to prune 25% of layers in a single step\. While our method occasionally underperforms others in this configuration, it remains highly competitive overall\. Notably, the iterative version of our method consistently outperforms all one\-shot baselines, highlighting the benefits of dynamic relevance estimation\.

| Method | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 53\.16 | 81\.02 | 82\.02 | 78\.94 | 44\.8 | 81\.28 | 73\.56 | 65\.11 |
| Taylor | 33\.36 | 56\.14 | 61\.25 | 56\.77 | 34\.8 | 71\.98 | 54\.14 | 23\.64 |
| Cosine Similarity | 47\.61 | 68\.86 | 70\.4 | 71\.09 | 39\.4 | 76\.39 | 70\.39 | 35\.12 |
| Out\. Cosine\-Sim | 44\.8 | 68\.01 | 56\.33 | 51\.19 | 38\.2 | 73\.29 | 59\.19 | 23\.72 |
| Out\. Norm\-Sim | 38\.46 | 65\.07 | 64\.65 | 57\.21 | 37\.6 | 72\.86 | 64\.88 | 23\.72 |
| Out\. Divergence\-Sim | 42\.46 | 56\.44 | 70\.34 | 66\.36 | 32\.4 | 71\.16 | 67\.96 | 30\.12 |
| Perplexity | 39\.85 | 57\.66 | 62\.42 | 55\.05 | 37\.2 | 66\.81 | 65\.59 | 59\.63 |
| Acc 1\-Shot \(Ours\) | 42\.24 | 72\.09 | 52\.2 | 74\.49 | 44\.4 | 79\.54 | 66\.93 | 53\.21 |
| Acc Iterative \(Ours\) | 49\.57 | 74\.96 | 84\.04 | 71\.53 | 44 | 79\.06 | 73\.8 | 62\.97 |

To assess how our method performs in a simpler pruning setup, we replicate the main structured pruning experiment using a one\-shot approach\. Instead of iteratively updating relevance scores during pruning, we compute each method’s scores only once, rank the layers accordingly, and prune the bottom 25% in a single step\.
Results are shown in Table[5](https://arxiv.org/html/2605.14075#A5.T5)\. While our method occasionally underperforms others in the one\-shot setting \(e\.g\., on BoolQ\), the iterative version of our method still outperforms all baselines—including one\-shot variants—highlighting the benefits of reevaluating relevance dynamically\. This is consistent with our earlier findings in Section[C\.5](https://arxiv.org/html/2605.14075#A3.SS5), where we showed that block relevance evolves during pruning\.
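For contrast with the iterative procedure, the one\-shot variant scores the blocks once and removes the bottom fraction of that single ranking\. The sketch below is a minimal illustration with hypothetical scores; `one_shot_prune` is not a function from the authors' code\.

```python
from typing import List, Sequence


def one_shot_prune(scores: Sequence[float], ratio: float = 0.25) -> List[int]:
    """Rank blocks by a single set of relevance scores and remove the lowest ones."""
    n_remove = int(len(scores) * ratio)
    ranked = sorted(range(len(scores)), key=scores.__getitem__)  # least relevant first
    return sorted(ranked[:n_remove])


# Hypothetical relevance scores for 8 blocks.
toy_scores = [0.9, 0.5, 0.4, 0.6, 0.3, 0.2, 0.1, 0.8]
print(one_shot_prune(toy_scores, ratio=0.25))  # [5, 6]
```

Because the scores are never recomputed, any change in relevance caused by the first removal is invisible to this procedure\.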
Interestingly, for a few datasets \(e\.g\., HellaSwag and OpenBookQA\), our one\-shot variant marginally outperforms its iterative counterpart\. We hypothesize that this may result from domain shifts between the training and test splits, which can affect our accuracy\-based signal\. Additionally, selecting the optimal pruning set is ultimately a challenging search problem, one that has been tackled explicitly in recent work \(Sieberling et al\., [2024](https://arxiv.org/html/2605.14075#bib.bib55)\)\.
### E\.4 Task\-Independent Pruning
The task\-independent structured pruning setup consists of using a single dataset—commonly referred to as a calibration dataset—to compute relevance scores and prune the model accordingly\. This results in one pruned model per pruning method, which is then evaluated across multiple downstream tasks\. The tasks used for evaluation typically mirror those presented in the main paper\.
It is worth noting that there is no standardized protocol regarding which dataset to use as calibration data or how many samples to include\. For example, Zhang et al\. \([2024b](https://arxiv.org/html/2605.14075#bib.bib39)\) use WikiText\-2 with 10 randomly selected instances, while He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\) use C4, selecting 256 samples where each sample may span multiple instances \(due to concatenation to match the model’s input length; see Appendix [C\.1](https://arxiv.org/html/2605.14075#A3.SS1)\)\.
In our experiments, we adopt the setup of He et al\. \([2024](https://arxiv.org/html/2605.14075#bib.bib35)\) for consistency\. However, to ensure a fair comparison, especially against pruning methods like cosine similarity that operate on instance\-level granularity, we avoid concatenation and instead use the same 1,500 instances employed in the cosine similarity baseline\. These instances were selected to construct 256 full\-context\-length samples in a consistent and comparable manner\.
Table [6](https://arxiv.org/html/2605.14075#A5.T6) presents results for the LLaMA3\-8B model under this classic pruning setup\. As shown, the cosine similarity method outperforms all others, including our accuracy\-based metric\. This outcome contrasts with the results reported by Zhang et al\. \([2024b](https://arxiv.org/html/2605.14075#bib.bib39)\), likely due to differences in calibration dataset choice and sample size\.
These results are consistent with our expectations\. Our method is tightly coupled to the calibration dataset, and—as demonstrated throughout this paper—relevance is highly task\- and data\-dependent\. Therefore, when the calibration dataset is misaligned with the evaluation tasks, performance is likely to degrade\.
Table 6: Task\-independent structured pruning results for LLaMA3\-8B across eight downstream benchmarks\. Each pruning method uses the same 1,500\-instance calibration dataset to prune the model once, which is then evaluated on all tasks\. Cosine similarity performs best in this setup, while our accuracy\-based method underperforms, likely due to its strong dependency on the calibration dataset\.

| Method | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Original | 53\.15 | 81\.02 | 82\.02 | 78\.94 | 44\.8 | 81\.28 | 73\.56 | 65\.11 | 69\.99 |
| Taylor | 45\.39 | 67\.97 | 61\.31 | 63\.73 | 41\.4 | 76\.55 | 68\.11 | 25\.03 | 56\.19 |
| Cosine Similarity | 43\.34 | 65\.32 | 76\.7 | 70\.24 | 36\.8 | 73\.39 | 70\.96 | 40\.78 | 59\.69 |
| Out\. Cosine\-Sim | 44\.2 | 72\.05 | 71\.99 | 66\.57 | 40\.2 | 77\.37 | 66\.93 | 34\.56 | 59\.23 |
| Out\. Norm\-Sim | 42\.66 | 70\.12 | 66\.94 | 67 | 41\.4 | 78\.73 | 67\.72 | 34\.1 | 58\.58 |
| Out\. Divergence\-Sim | 43\.54 | 71\.25 | 68\.62 | 65\.09 | 39\.6 | 76\.01 | 65\.94 | 30\.57 | 57\.58 |
| Perplexity | 40\.19 | 63\.47 | 44\.43 | 65\.96 | 39\.2 | 74\.48 | 64\.4 | 29\.4 | 52\.69 |
| Acc \(Ours\) | 36\.69 | 56\.52 | 53\.36 | 60\.16 | 33\.8 | 72\.63 | 60\.14 | 28\.5 | 50\.23 |

We then evaluate our method using an alternative calibration dataset that combines training data from the target evaluation tasks\. Table [2](https://arxiv.org/html/2605.14075#S6.T2) reports results for LLaMA3\-8B using a calibration set composed of 10% of the training data from each of the eight benchmarks\. Under this configuration, our method outperforms all baselines, yielding a pruned model that achieves the highest average performance across tasks\.
### E\.5 Task Relations
Given the task\-independent results presented in Table[2](https://arxiv.org/html/2605.14075#S6.T2), a natural question arises: can the training set of one task serve as a suitable calibration set for pruning models used in other tasks? Table[7](https://arxiv.org/html/2605.14075#A5.T7)explores this by showing the performance of different training sets used as calibration data\. We observe that most tasks achieve good average performance; notably, some tasks serve as particularly effective calibration sets for others\.
Table 7: Calibration dataset analysis\. Each row shows the performance of a LLaMA3\-8B model pruned at 25% with our method using a different train set as a calibration dataset\.

| Calibration set | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU | Mean |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Arc\-C | 49\.57 | 74\.45 | 69\.08 | 72\.91 | 42\.2 | 77\.8 | 67\.56 | 40\.2 | 61\.72 |
| Arc\-E | 51\.37 | 74\.96 | 66\.94 | 73\.62 | 43\.6 | 78\.51 | 71\.59 | 44\.82 | 63\.18 |
| BoolQ | 40\.7 | 66\.96 | 84\.1 | 67\.97 | 38\.6 | 73\.23 | 71\.82 | 35\.09 | 59\.81 |
| HS | 44\.45 | 62\.5 | 65\.84 | 71\.53 | 44\.2 | 73\.5 | 64\.72 | 42\.83 | 58\.7 |
| OBQA | 45\.82 | 66\.5 | 75\.38 | 66\.88 | 44 | 75\.24 | 65\.9 | 50\.45 | 61\.27 |
| PIQA | 44\.62 | 70\.2 | 48\.62 | 68\.45 | 44\.2 | 79\.05 | 67\.8 | 27\.8 | 56\.34 |
| WG | 44\.71 | 68\.73 | 78\.13 | 69\.85 | 39\.2 | 75\.03 | 73\.8 | 42\.4 | 61\.48 |
| MMLU | 40\.61 | 62\.46 | 75\.32 | 63\.11 | 35 | 71\.49 | 69\.3 | 62\.97 | 60\.03 |

Building on Table [7](https://arxiv.org/html/2605.14075#A5.T7), Figure [19](https://arxiv.org/html/2605.14075#A5.F19) presents a graph illustrating the relationships between tasks\. Each node corresponds to a task, and a directed edge from task 1 to task 2 indicates that the training set of task 1 serves as either a good \(blue\) or poor \(red\) calibration set for task 2\. We define “good” as achieving at least 90% of the performance obtained when pruning with task 2’s own training set \(see Table [1](https://arxiv.org/html/2605.14075#S6.T1)\), and “poor” as 10% or lower\. To further indicate the strength of the effect, edge transparency varies: lighter blue denotes values closer to the 90% threshold, while lighter red denotes values closer to 10%\.
Several observations emerge from this analysis:
- •ARC\-E and ARC\-C serve as good proxies for almost all tasks, with the exception of BoolQ and MMLU\. Interestingly, ARC\-E is a better proxy than ARC\-C, despite being an easier version of the same benchmark\.
- •Nearly all tasks act as good proxies for HellaSwag, except for MMLU\. This finding is noteworthy because HellaSwag is generally considered a commonsense reasoning task, whereas MMLU requires broader world knowledge\.
- •No task provides a good proxy for MMLU or BoolQ\. For MMLU, this is expected: as a world\-knowledge benchmark, it likely requires calibration sets with overlapping domain coverage, which the other tasks lack\. For BoolQ, however, the absence of good proxies is less straightforward\. One possible explanation is that its yes/no format introduces unique structural properties that are particularly sensitive to pruning\.
Figure 19: Relations between tasks, computed from the data in Table [7](https://arxiv.org/html/2605.14075#A5.T7)\.
### E\.6 Computational Cost
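The 90% rule for “good” edges can be sketched on a subset of Table 7\. Two simplifications are assumptions made for brevity: only four of the eight tasks are included, and the diagonal of Table 7 \(each task pruned with its own training set\) stands in for the reference values of Table 1, which are not reproduced in this appendix\. The function name is hypothetical\.

```python
from typing import Dict, List, Tuple

# Rows: calibration task. Columns: evaluation task. Values copied from Table 7.
table7_subset: Dict[str, Dict[str, float]] = {
    "Arc-C": {"Arc-C": 49.57, "Arc-E": 74.45, "HS": 72.91, "MMLU": 40.2},
    "Arc-E": {"Arc-C": 51.37, "Arc-E": 74.96, "HS": 73.62, "MMLU": 44.82},
    "HS": {"Arc-C": 44.45, "Arc-E": 62.5, "HS": 71.53, "MMLU": 42.83},
    "MMLU": {"Arc-C": 40.61, "Arc-E": 62.46, "HS": 63.11, "MMLU": 62.97},
}


def good_edges(table: Dict[str, Dict[str, float]], threshold: float = 0.9) -> List[Tuple[str, str]]:
    """Edges (calibration task, target task) reaching the threshold of the target's own score."""
    edges = []
    for source, row in table.items():
        for target, acc in row.items():
            own = table[target][target]  # assumption: diagonal as the reference value
            if source != target and acc >= threshold * own:
                edges.append((source, target))
    return edges


toy_edges = good_edges(table7_subset)
print(toy_edges)
# [('Arc-C', 'Arc-E'), ('Arc-C', 'HS'), ('Arc-E', 'Arc-C'), ('Arc-E', 'HS')]
```

On this subset the result matches the observations listed above: both ARC variants are good proxies for HellaSwag and for each other, while nothing reaches the threshold for MMLU\.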
To assess how long our method would take to achieve higher pruning ratios, quantify the benefits of using more compute, and evaluate larger models, we estimate the time required to prune 50% of LLaMA\-3\-8B using an NVIDIA L40S and a pair of NVIDIA H100s, as well as the time required to prune 50% of LLaMA\-3\-70B using two NVIDIA H100s\. These estimates are derived from the timings reported in Table[3](https://arxiv.org/html/2605.14075#S6.T3), along with additional inference runs performed on both models using the dual\-H100 setup\. For each dataset, we measured the runtime using the maximum feasible batch size and recorded the reduction in runtime obtained after removing a block\. The resulting estimates, summarized in Table[8](https://arxiv.org/html/2605.14075#A5.T8), show that additional compute substantially benefits our method and that pruning to higher ratios—even for larger models—is feasible\.
Additionally, we analyze how runtime scales with the size of the calibration dataset\. Figure[20](https://arxiv.org/html/2605.14075#A5.F20)shows the time per sample for different calibration\-set sizes when pruning 25% of LLaMA\-3\-8B on an NVIDIA L40S\. The results indicate that, beyond approximately 250 samples, the time per sample becomes stable and even slightly decreases across all datasets\. This suggests that, in the worst case, our method scales linearly with the number of calibration samples\.
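A back\-of\-the\-envelope count helps explain why cost grows quickly with the pruning ratio\. The sketch assumes that each pruning step scores every remaining block with one pass over the calibration set, and that LLaMA\-3\-8B and LLaMA\-3\-70B have 32 and 80 blocks, respectively\. It counts calibration passes only and ignores the fact that each pass gets cheaper as blocks are removed, so it is not a substitute for the measured timings\.

```python
def ablation_passes(n_blocks: int, ratio: float) -> int:
    """Calibration passes needed when every remaining block is re-scored at each step."""
    n_remove = int(n_blocks * ratio)
    # Step k (counting from 0) scores the n_blocks - k blocks that are still in the model.
    return sum(n_blocks - step for step in range(n_remove))


print(ablation_passes(32, 0.25))  # 228
print(ablation_passes(32, 0.50))  # 392
print(ablation_passes(80, 0.50))  # 2420
```

Since the number of passes is fixed by the model depth and the pruning ratio, total runtime under this assumption is the pass count times the cost of one pass, which is consistent with the roughly linear scaling in the number of calibration samples\.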
Reducing the computational cost of our approach remains an important direction for future work\. Parallelizing relevance computation across multiple GPUs—for example, assigning different subsets of layers to each device—could substantially reduce runtime\. Additional gains may come from inference\-optimized frameworks or quantization, though the latter may affect pruning behavior and requires further study\. Moreover, many optimized frameworks do not yet support model modification, limiting their applicability to our method\.
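As a rough illustration of the multi\-GPU idea, the sketch below splits the candidate blocks across devices and merges the per\-block scores. The round\-robin split, the function names, and the `accuracy_drop` callback are all hypothetical; the loop runs sequentially here, whereas a real setup would launch one worker per GPU.

```python
from typing import Callable, Dict, List, Sequence


def partition_blocks(block_ids: Sequence[int], num_devices: int) -> List[List[int]]:
    """Assign blocks to devices round-robin so shard sizes differ by at most one."""
    shards: List[List[int]] = [[] for _ in range(num_devices)]
    for position, block_id in enumerate(block_ids):
        shards[position % num_devices].append(block_id)
    return shards


def least_relevant_block(
    block_ids: Sequence[int],
    num_devices: int,
    accuracy_drop: Callable[[int, int], float],
) -> int:
    """Score every block on its assigned device and return the one with the smallest drop."""
    scores: Dict[int, float] = {}
    for device, shard in enumerate(partition_blocks(block_ids, num_devices)):
        # Sequential stand-in: in practice each shard would run on its own GPU.
        for block_id in shard:
            scores[block_id] = accuracy_drop(block_id, device)
    return min(scores, key=scores.get)


# Toy scoring function (hypothetical): pretends block 2 matters least.
print(partition_blocks(range(5), 2))
print(least_relevant_block(range(4), 2, lambda block_id, device: abs(block_id - 2)))
```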
Table 8: Estimated time \(hours or days\) required to prune 50% of the model with our method, using 1,000 instances per dataset\.

| Model | Batch Size | Hardware | C4 | LIMA | MathInstruct | CodeAlpaca |
| --- | --- | --- | --- | --- | --- | --- |
| Llama 3\-8B | 8 | 1 × L40S | 11\.56 hrs | 10\.02 hrs | 4\.72 hrs | 1\.60 hrs |
| Llama 3\-8B | 64 | 2 × H100 | 7\.46 hrs | 4\.68 hrs | 3\.23 hrs | 0\.83 hrs |
| Llama 3\-70B | 8 | 2 × H100 | 8\.49 days | 6\.85 days | 3\.39 days | 1\.12 days |

Figure 20: Time per sample versus the number of calibration samples for our method\. Results are shown across multiple datasets using an L40S GPU while pruning 25% of LLaMA\-3\-8B\.

### E\.7 Healing
Given that our results so far have not used healing as a post\-processing step, a natural question arises: Could healing allow less computationally expensive methods, such as Cosine Similarity, to achieve comparable performance? Moreover, are the blocks selected by our method truly less important for the task than those selected by other methods?
To address these questions, we applied a healing process to our task\-dependent setup using the training split of each benchmark \(the same data used for calibration during pruning\)\. Tables [9](https://arxiv.org/html/2605.14075#A5.T9), [10](https://arxiv.org/html/2605.14075#A5.T10), and [11](https://arxiv.org/html/2605.14075#A5.T11) compare our method, Cosine Similarity, and random block selection on the same 8 benchmarks used in previous sections\. After pruning at a 25% ratio and then healing \(Table [9](https://arxiv.org/html/2605.14075#A5.T9)\), our method performs similarly to Cosine Similarity\. However, we do not interpret this as evidence that our method is selecting worse, or even equally important, blocks compared to Cosine Similarity\. Instead, we believe this behavior reflects the fact that a LLaMA\-3\-8B pruned by 25% remains expressive enough to perform well on these tasks\. Supporting this interpretation, we observe that even randomly selecting blocks to prune, followed by healing, can sometimes achieve competitive results\.
We repeated the same experiment at pruning ratios of 50% \(Table [10](https://arxiv.org/html/2605.14075#A5.T10)\) and 75% \(Table [11](https://arxiv.org/html/2605.14075#A5.T11)\)\. At 50% pruning, our method consistently outperforms Cosine Similarity even after healing\. At 75%, it still outperforms the baselines in several cases\. The cases where our method no longer leads, however, coincide with performance levels close to random selection, suggesting that the model simply lacks sufficient parameters to solve the task at such an extreme pruning ratio\.
A similar trend, where healing provides clear benefits over other pruning methods only around the 50% pruning regime, was also reported by \(Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34)\), who compared Cosine Similarity with an even simpler pruning method across multiple pruning ratios and tasks\.
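The 50% comparison discussed above can be checked directly against the healed rows of Table 10. The short sketch below copies those two rows and computes the per\-task gap; the variable names are illustrative only.

```python
TASKS = ["Arc-C", "Arc-E", "BoolQ", "HS", "OBQA", "PIQA", "WG", "MMLU"]

# Healed ("Yes") rows of Table 10: 50% pruning of LLaMA-3-8B.
OURS_HEALED = [38.65, 62.96, 86.33, 57.39, 40.60, 75.24, 70.48, 55.53]
COSINE_HEALED = [33.96, 59.68, 78.38, 51.31, 40.40, 73.01, 66.14, 26.11]

# Per-task accuracy gap (ours minus Cosine Similarity), in percentage points.
gaps = {
    task: round(ours - cosine, 2)
    for task, ours, cosine in zip(TASKS, OURS_HEALED, COSINE_HEALED)
}

ours_wins_everywhere = all(gap > 0 for gap in gaps.values())
smallest_gap_task = min(gaps, key=gaps.get)
largest_gap_task = max(gaps, key=gaps.get)
print(gaps, ours_wins_everywhere, smallest_gap_task, largest_gap_task)
```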
To implement the healing stage, we fine\-tuned each pruned model for 10 epochs using the corresponding training set for each task \(see Figure [21](https://arxiv.org/html/2605.14075#A5.F21) for per\-epoch performance\)\. For each method–task pair, the tables report the best\-performing epoch within this window\. Across all experiments, model performance consistently peaked during the 10\-epoch schedule and then began to decline, indicating the onset of overfitting to the training set\.
Following prior work \(Gromov et al\., [2025](https://arxiv.org/html/2605.14075#bib.bib34)\), we employed the Hugging Face Trainer API \(Wolf et al\., [2020](https://arxiv.org/html/2605.14075#bib.bib74)\), QLoRA quantization using the bitsandbytes library \(Dettmers et al\., [2023](https://arxiv.org/html/2605.14075#bib.bib75)\), and LoRA adapters \(Hu et al\., [2022](https://arxiv.org/html/2605.14075#bib.bib76)\) implemented with the peft library \(Mangrulkar et al\., [2022](https://arxiv.org/html/2605.14075#bib.bib77)\)\. The fine\-tuning configuration was as follows:
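The best\-epoch reporting rule described above amounts to taking the maximum over the per\-epoch accuracies. A minimal sketch follows; the accuracy values are invented purely for illustration and are not taken from Figure 21.

```python
from typing import Sequence, Tuple


def best_epoch(accuracy_per_epoch: Sequence[float]) -> Tuple[int, float]:
    """Return the 1-indexed epoch with the highest accuracy and that accuracy."""
    best_index = max(range(len(accuracy_per_epoch)), key=lambda i: accuracy_per_epoch[i])
    return best_index + 1, accuracy_per_epoch[best_index]


# Hypothetical 10-epoch curve that peaks and then declines as overfitting sets in.
curve = [0.52, 0.58, 0.63, 0.66, 0.68, 0.67, 0.66, 0.64, 0.63, 0.61]
epoch, accuracy = best_epoch(curve)
print(epoch, accuracy)
```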
- Applied modules: \[gate\_proj, down\_proj, up\_proj\]
- Batch size: 16
- LoRA α: 2
- LoRA rank: 2
- Peak learning rate: 3e\-4
- LoRA dropout: 0\.05
- LR scheduler: cosine
- Warmup steps: 100
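To make the listed hyperparameters concrete, the sketch below collects them in a plain dataclass and derives two quantities from them: the LoRA scaling factor α / rank from the standard LoRA formulation, and the learning rate at a given step. The source specifies only "cosine" and 100 warmup steps, so the linear warmup, the decay to zero, and the total step count are assumptions. `HealingConfig` and the helper names are hypothetical and do not belong to peft or transformers.

```python
import math
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class HealingConfig:
    """Hyperparameters copied from the list above (container name is hypothetical)."""
    target_modules: Tuple[str, ...] = ("gate_proj", "down_proj", "up_proj")
    batch_size: int = 16
    lora_alpha: int = 2
    lora_rank: int = 2
    peak_lr: float = 3e-4
    lora_dropout: float = 0.05
    warmup_steps: int = 100
    epochs: int = 10


def lora_scaling(cfg: HealingConfig) -> float:
    """Standard LoRA scales the low-rank update by alpha / rank."""
    return cfg.lora_alpha / cfg.lora_rank


def lr_at_step(step: int, total_steps: int, cfg: HealingConfig) -> float:
    """Linear warmup to the peak, then cosine decay to zero (assumed schedule shape)."""
    if step < cfg.warmup_steps:
        return cfg.peak_lr * step / cfg.warmup_steps
    decay_steps = max(1, total_steps - cfg.warmup_steps)
    progress = min(1.0, (step - cfg.warmup_steps) / decay_steps)
    return cfg.peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


cfg = HealingConfig()
# total_steps=1000 is a placeholder; the real value depends on dataset size and epochs.
print(lora_scaling(cfg), lr_at_step(50, 1000, cfg), lr_at_step(550, 1000, cfg))
```

With α equal to the rank, the adapter update is applied at unit scale, so the effective step size is governed by the learning rate alone.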
Table 9: Results on the 8 tasks after pruning 25% of LLaMA\-3\-8B using our method, cosine similarity, or random block pruning, with and without the healing stage\.

| Method | Healing | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy \(Ours\) | No | 49\.57 | 74\.96 | 84\.04 | 71\.53 | 44\.00 | 79\.06 | 73\.80 | 62\.97 |
| Accuracy \(Ours\) | Yes | 57\.42 | 83\.84 | 89\.30 | 75\.71 | 51\.20 | 84\.22 | 82\.08 | 61\.64 |
| Cosine Similarity | No | 45\.73 | 67\.80 | 66\.33 | 69\.52 | 38\.60 | 72\.91 | 71\.35 | 44\.05 |
| Cosine Similarity | Yes | 56\.48 | 82\.07 | 88\.99 | 76\.08 | 53\.80 | 81\.66 | 83\.03 | 58\.76 |
| Random | No | 24\.08 | 36\.80 | 46\.03 | 45\.74 | 25\.48 | 51\.73 | 44\.73 | 25\.47 |
| Random | Yes | 43\.86 | 73\.68 | 82\.16 | 67\.06 | 46\.84 | 78\.68 | 74\.65 | 34\.63 |

Table 10: Results on the 8 tasks after pruning 50% of LLaMA\-3\-8B using our method, cosine similarity, or random block pruning, with and without the healing stage\.

| Method | Healing | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy \(Ours\) | No | 28\.67 | 40\.95 | 79\.57 | 48\.69 | 32\.40 | 63\.28 | 54\.46 | 55\.64 |
| Accuracy \(Ours\) | Yes | 38\.65 | 62\.96 | 86\.33 | 57\.39 | 40\.60 | 75\.24 | 70\.48 | 55\.53 |
| Cosine Similarity | No | 25\.34 | 27\.78 | 42\.72 | 26\.64 | 29\.40 | 52\.77 | 50\.51 | 22\.95 |
| Cosine Similarity | Yes | 33\.96 | 59\.68 | 78\.38 | 51\.31 | 40\.40 | 73\.01 | 66\.14 | 26\.11 |
| Random | No | 20\.87 | 19\.72 | 42\.24 | 26\.52 | 22\.36 | 41\.14 | 39\.43 | 24\.85 |
| Random | Yes | 26\.96 | 52\.00 | 67\.06 | 41\.55 | 34\.12 | 68\.78 | 59\.16 | 25\.67 |

Table 11: Results on the 8 tasks after pruning 75% of LLaMA\-3\-8B using our method, cosine similarity, or random block pruning, with and without the healing stage\.

| Method | Healing | Arc\-C | Arc\-E | BoolQ | HS | OBQA | PIQA | WG | MMLU |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Accuracy \(Ours\) | No | 26\.11 | 28\.66 | 62\.17 | 27\.56 | 26\.80 | 55\.66 | 50\.51 | 25\.46 |
| Accuracy \(Ours\) | Yes | 25\.09 | 40\.91 | 63\.06 | 30\.43 | 31\.00 | 65\.02 | 51\.70 | 25\.83 |
| Cosine Similarity | No | 27\.30 | 26\.56 | 57\.13 | 26\.05 | 28\.00 | 51\.80 | 50\.12 | 24\.32 |
| Cosine Similarity | Yes | 26\.19 | 34\.60 | 63\.00 | 27\.48 | 29\.40 | 61\.59 | 50\.83 | 24\.28 |
| Random | No | 20\.90 | 20\.39 | 37\.99 | 26\.50 | 22\.80 | 40\.97 | 40\.68 | 24\.32 |
| Random | Yes | 24\.78 | 38\.57 | 61\.11 | 28\.50 | 28\.00 | 61\.10 | 51\.74 | 25\.22 |

Figure 21: Impact of healing after pruning across varying pruning ratios and training epochs\.