# A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models
Source: [https://arxiv.org/html/2605.08504](https://arxiv.org/html/2605.08504)
###### Abstract
We investigate the origins of massive activations in large language models (LLMs) and identify a specific layer, named the Massive Emergence Layer (ME Layer), that is consistently observed across model families, where massive activations first emerge and subsequently propagate to deeper layers through residual connections. We show that, within the ME Layer, the RMSNorm and the FFN parameters jointly contribute to the emergence of massive activations. Once formed, the massive activation token representation remains largely invariant across layers, reducing the diversity of hidden representations passed to the attention module. Motivated by this limitation, we propose a simple and effective method to reduce the rigidity of the massive activation token. Our approach consistently improves LLM performance across multiple tasks, including instruction following and math reasoning, in both training-free and fine-tuning settings. Moreover, we show that our method mitigates attention sinks by selectively weakening their influence, elucidating their origin at the hidden-state level and shedding new light on principled mitigation strategies. The model and code have been released at [MELayer & WeMask](https://github.com/vanpe20/A-Single-Layer-to-Explain-Them-All-Understanding-Massive-Values-in-Large-Language-Models.git).
## 1 Introduction
Figure 1: This figure illustrates how massive activations emerge and propagate. In the top panel, we trace the flow of massive activations: they arise only at the FFN of a specific layer and then propagate to subsequent layers through residual connections. The $\rightarrow$ arrows denote the generation and propagation of massive activations. The bottom panel shows how the output $\ell_2$ norm changes across layers. ME Layer means Massive Emergence Layer.

Large Language Models (LLMs) (Yang et al., [2025](https://arxiv.org/html/2605.08504#bib.bib23); Liu et al., [2024](https://arxiv.org/html/2605.08504#bib.bib22)) have demonstrated strong capabilities across a wide range of complex tasks, motivating increasing efforts to probe their internal mechanisms (Zhao et al., [2024](https://arxiv.org/html/2605.08504#bib.bib66); Shi et al., [2025](https://arxiv.org/html/2605.08504#bib.bib65); Zhang et al., [2025c](https://arxiv.org/html/2605.08504#bib.bib45), [b](https://arxiv.org/html/2605.08504#bib.bib37)). Other work leverages embeddings for downstream tasks (Shi et al., [2026](https://arxiv.org/html/2605.08504#bib.bib38)). One emerging line of work focuses on massive activations: in intermediate representations, the embeddings of a few tokens can attain values several orders of magnitude larger than the rest. This raises a fundamental question: why do such extreme activations arise in LLMs, what do they encode, and how do they shape model behavior? Recent studies suggest that massive activations can behave like dominant bias terms (Sun et al., [2024](https://arxiv.org/html/2605.08504#bib.bib1)), affect contextual information processing (Jin et al., [2025](https://arxiv.org/html/2605.08504#bib.bib52)), and alter attention behavior and training dynamics ([Kaul et al.](https://arxiv.org/html/2605.08504#bib.bib67); Gallego-Feliciano et al., [2025](https://arxiv.org/html/2605.08504#bib.bib68)). Despite these advances, existing work still lacks a clear account of how massive activations emerge end-to-end and how their emergence connects to their downstream functional effects in LLMs.
In this paper, we provide a systematic analysis of the emergence of massive activations in LLMs. We find that massive activations are generated at a single layer of the model and, once formed, propagate to subsequent layers through residual connections. As shown in [Figure 1](https://arxiv.org/html/2605.08504#S1.F1) and [Appendix H](https://arxiv.org/html/2605.08504#A8), in this particular layer, the activation values of the massive activation tokens increase by several hundred times compared to the previous layer. We refer to this layer as the ME Layer (Massive Emergence Layer). In [Figure 1](https://arxiv.org/html/2605.08504#S1.F1), we illustrate how massive activations are generated at the ME Layer and then propagate into later layers. Surprisingly, we show that the ME Layer is consistently observed across models of different sizes and families (see [Appendix H](https://arxiv.org/html/2605.08504#A8)), suggesting a shared, architecture-level mechanism and positioning the ME Layer as the primary locus for systematic analysis of massive activation emergence.
To unpack the ME Layer mechanism, we conduct a fine-grained analysis within this layer and find that massive activation emergence is jointly driven by the pre-FFN RMSNorm and the FFN in the ME Layer. We further find that massive activations exhibit a high degree of stability and consistency ([subsection 3.2](https://arxiv.org/html/2605.08504#S3.SS2) and [Appendix D](https://arxiv.org/html/2605.08504#A4)). This invariance reduces representation diversity. When it propagates into self-attention, the shared direction biases how tokens interact, making attention patterns more similar across inputs and less context-adaptive in practice.
To mitigate the effects of massive activation–induced directional invariance in hidden states, we propose a method that starts from the ME Layer and selectively masks dimensions in the attention input corresponding to large RMSNorm weights, which tend to amplify dominant directions in the hidden state\. This operation relaxes the directional rigidity of the massive activation token while preserving the overall structure of the representation, thereby restoring greater directional diversity in the attention input\. As a result, the attention mechanism can better adjust its similarity structure across different inputs\. Experimental results show that our method consistently improves model performance across downstream tasks, both as an inference\-time, training\-free intervention and when applied during fine\-tuning\.
We further analyze the attention sink phenomenon (Xiao et al., [2024](https://arxiv.org/html/2605.08504#bib.bib7)), in which LLMs assign disproportionately large attention weights to a small subset of tokens, typically the first token. We find that attention sinks emerge in the layer immediately following the ME Layer, and that the corresponding attention weights exhibit low-rank properties similar to those of the massive activations produced in the ME Layer. Our method partially attenuates attention sinks, and this controlled reduction is consistently associated with improved model performance. These results suggest a new perspective on attention sinks from a representational standpoint: attention sinks are not inherently detrimental, but instead appear to play a functional role in model computation. Rather than eliminating them entirely, moderately reducing their dominance while preserving their presence yields more effective and stable behavior, highlighting the importance of balancing representational flexibility with structural regularization.
In summary, our contributions are as follows:
- •We trace the massive activation phenomenon back to its root cause and identify the ME Layer: the massive activations in the hidden state originate at this layer and propagate via residual connections.
- •We show that massive activations arise from the characteristics of the RMSNorm and FFN weights in the ME Layer, and that the properties of the massive activation token remain highly consistent across different inputs and layers.
- •We propose a method that relaxes the directional rigidity of the massive\-activation token, enabling self\-attention to respond more contextually across inputs and delivering consistent performance gains across multiple model families and tasks\.
- •We provide a new perspective on the attention sink phenomenon based on our findings, offering a hidden-state-level explanation of its origin and new insights into mitigating the adverse influence of attention sinks.
## 2 Related Work
### 2.1 Massive Activation
Timkey and Van Schijndel ([2021](https://arxiv.org/html/2605.08504#bib.bib27)) first identified the phenomenon that certain feature dimensions exhibit extremely large activations in GPT-2. Following this observation, several studies began to investigate such outlier features in hidden states (Dettmers et al., [2022](https://arxiv.org/html/2605.08504#bib.bib31); Zeng et al., [2022](https://arxiv.org/html/2605.08504#bib.bib30); Ahmadian et al., [2023](https://arxiv.org/html/2605.08504#bib.bib29)). Subsequent work explored these outlier features from different perspectives: Owen et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib8)) studied them through quantification analysis, while Zhao et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib28)) examined their functional roles. Other studies attempted to suppress or remove outlier dimensions to improve model robustness or quantization (Bondarenko et al., [2023](https://arxiv.org/html/2605.08504#bib.bib34)). More recent work reported the presence of unusually large-magnitude hidden states, often referred to as massive activations (Sun et al., [2024](https://arxiv.org/html/2605.08504#bib.bib1); Son et al., [2024](https://arxiv.org/html/2605.08504#bib.bib50)). Oh et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib51)) further suggested that such massive activations can be driven by large FFN weights. In addition, Gallego-Feliciano et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib68)) analyzed how massive activations emerge during training, while He et al. ([2024](https://arxiv.org/html/2605.08504#bib.bib9)) investigated how massive activations affect model performance and behavior. Meanwhile, other studies argue that attention sinks may serve functional roles rather than being purely pathological artifacts; for example, Ruscio et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib5)) and [Zhang et al.](https://arxiv.org/html/2605.08504#bib.bib54) interpret attention sinks as structural anchors in the model. Cancedda ([2024](https://arxiv.org/html/2605.08504#bib.bib41)) and Ferrando and Voita ([2024](https://arxiv.org/html/2605.08504#bib.bib42)) report that the BOS token's residual stream writes into a "dark subspace" that remains stable across layers. Queipo-de-Llano et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib43)) develop a unified theory showing that massive activations explain both attention sinks and compression valleys, and use it to motivate a Mix–Compress–Refine view of depth-wise computation. Despite these advances, existing work still lacks a unified analysis that connects the emergence of massive activations with their downstream effects, particularly attention sinks, and that leverages such source-level understanding to develop targeted mitigation methods.
### 2.2 Attention Sink
In LLM self-attention, a small subset of tokens consistently receives disproportionately large attention weights, a phenomenon known as attention sinks. Prior work observes attention sinks in both LLMs and VLMs (Xiao et al., [2024](https://arxiv.org/html/2605.08504#bib.bib7); [Darcet et al.](https://arxiv.org/html/2605.08504#bib.bib32)). Gu et al. ([2024](https://arxiv.org/html/2605.08504#bib.bib3)) characterize sinks as non-informative key biases arising from softmax-induced coupling, motivating a line of work that mitigates sinks by modifying the attention mechanism (Ramapuram et al., [2024](https://arxiv.org/html/2605.08504#bib.bib33); Zuhri et al., [2025](https://arxiv.org/html/2605.08504#bib.bib35); Bondarenko et al., [2023](https://arxiv.org/html/2605.08504#bib.bib34); Miller, [2023](https://arxiv.org/html/2605.08504#bib.bib49)). Representative approaches include attention gating and clipping (Bondarenko et al., [2023](https://arxiv.org/html/2605.08504#bib.bib34)), gated attention modules (Qiu et al., [2025](https://arxiv.org/html/2605.08504#bib.bib4)), and decoupling value states from sink dynamics (Bu et al., [2025](https://arxiv.org/html/2605.08504#bib.bib53)). Other works also discuss related safety mechanisms (Shang et al., [2025](https://arxiv.org/html/2605.08504#bib.bib44); Zhang et al., [2025a](https://arxiv.org/html/2605.08504#bib.bib40); Zhang and Zhang, [2025](https://arxiv.org/html/2605.08504#bib.bib39)). However, existing analyses largely focus on attention, overlooking the role of embeddings.
## 3 Emergence of Massive Activations in a Single Transformer Layer
As shown in [Figure 1](https://arxiv.org/html/2605.08504#S1.F1), massive activations emerge abruptly within a single transformer layer, the ME Layer, rather than accumulating gradually across layers. We analyze the origin of this phenomenon in [subsection 3.1](https://arxiv.org/html/2605.08504#S3.SS1), linking it to the ME Layer's normalization behavior and weight structure. In [subsection 3.2](https://arxiv.org/html/2605.08504#S3.SS2), we further show that once formed, these activations become directionally stable, reducing representational diversity and constraining downstream self-attention.
### 3.1 Understanding the Emergence in the ME Layer
Key Takeaway: Massive activations emerge only at the ME Layer, driven by unusually large and directionally aligned RMSNorm and FFN parameters that selectively amplify the massive-activation token.
In this section, we use Qwen3-4B as a case study to pinpoint the computations in the ME Layer that trigger massive activations. [Figure 1](https://arxiv.org/html/2605.08504#S1.F1) reveals a clear transition in activation magnitude centered at the ME Layer. Before this layer, activations remain comparable across tokens, whereas at the ME Layer the first token exhibits a sudden and isolated increase in magnitude that is subsequently preserved through residual connections. The lower panels further localize this transition within the ME Layer: deviation first appears at the RMSNorm output and is sharply amplified by the FFN into a massive activation. Once formed, this large-magnitude representation is directly propagated to later layers. This staged behavior localizes the origin of massive activations to the internal transformations of the ME Layer. Among the components of a decoder block, only RMSNorm and the FFN can induce such rapid, token-specific amplification within a single layer, motivating a focused analysis of these two modules. We find that Qwen3-4B consistently exhibits massive activations on the first token across diverse inputs; accordingly, in the following sections, we use the first token as our primary object of analysis.
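To make this analysis concrete, the following minimal sketch (ours, not the authors' released code) locates the ME Layer by tracking the per-layer $\ell_2$ norm of the first token's hidden state with HuggingFace Transformers; the checkpoint identifier and prompt are illustrative assumptions.

```python
# Sketch: locate the layer where the first token's hidden-state norm jumps abruptly.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
model.eval()

inputs = tok("The capital of France is", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding output; hidden_states[l] is the output of layer l.
norms = [h[0, 0].float().norm().item() for h in out.hidden_states]
for l in range(1, len(norms)):
    ratio = norms[l] / max(norms[l - 1], 1e-6)
    print(f"layer {l:2d}: ||h_0|| = {norms[l]:10.2f} (x{ratio:.1f} vs. previous)")
# The ME Layer is the layer whose output norm grows by orders of magnitude.
```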
Figure 2: Comparison of the magnification applied by RMSNorm to token 0 and to the other tokens in Qwen3-4B across layers.

#### Amplification effect of RMSNorm

We analyze the scaling factors of RMSNorm layer by layer and find that the amplification effect of the ME Layer on the hidden state far exceeds that of other layers. In [Figure 2](https://arxiv.org/html/2605.08504#S3.F2), we measure the RMSNorm weighted activation norm, which represents the overall magnitude of the RMSNorm output for each token: $\mathrm{WeightNorm}_{l}(t)=\lVert\hat{h}_{l,t}\rVert_{2}$, where $\hat{h}_{l,t}=\mathrm{RMSNorm}(h_{l,t})$ denotes the output of RMSNorm at layer $l$ and token position $t$. We observe that before layer 7, the first token and the other tokens are amplified to a similar extent. However, at layer 7, RMSNorm produces a much larger output magnitude for the first token than for the other tokens.
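One way to reproduce this measurement is to hook each layer's pre-FFN RMSNorm and record the per-token output norm. The sketch below reuses `model` and `inputs` from the previous snippet and assumes the Qwen2-style attribute names `model.model.layers[l].post_attention_layernorm`, which may differ in other implementations.

```python
# Sketch: record WeightNorm_l(t) = ||RMSNorm(h_{l,t})||_2 for every layer and token.
import torch

weight_norms = {}  # layer index -> per-token norms

def make_hook(layer_idx):
    def hook(module, hook_inputs, output):
        # output: (batch, seq_len, hidden_dim), the RMSNorm output \hat{h}_{l,t}
        weight_norms[layer_idx] = output[0].float().norm(dim=-1).detach().cpu()
    return hook

handles = []
for l, layer in enumerate(model.model.layers):  # attribute path is an assumption
    handles.append(layer.post_attention_layernorm.register_forward_hook(make_hook(l)))

with torch.no_grad():
    model(**inputs)

for h in handles:
    h.remove()

# Compare the first token against the mean of the remaining tokens per layer.
for l, norms in sorted(weight_norms.items()):
    print(f"layer {l:2d}: token0 = {norms[0]:.1f}, others = {norms[1:].mean():.1f}")
```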
Figure 3: This metric captures the contribution of high-weight dimensions and reflects how well a token's values align with weight-based amplification across layers.

To further analyze whether this amplification is associated with dimensions corresponding to large RMSNorm scaling factors, we examine how the squared magnitude of the RMSNorm output is distributed across dimensions. Let $\mathcal{K}$ denote the index set of the top-$K$ largest RMSNorm scaling factors. We define the total squared magnitude of the output as $E_{t}=\sum_{i=1}^{D}\hat{h}_{t,i}^{2}$, and the contribution from dimensions in $\mathcal{K}$ as $E_{t}^{\mathcal{K}}=\sum_{i\in\mathcal{K}}\hat{h}_{t,i}^{2}$. The fraction of the output magnitude contributed by high-scaling dimensions is then defined as $\mathrm{Frac}_{t}=\frac{E_{t}^{\mathcal{K}}}{E_{t}}$. We compute the difference between the first token and the average of the remaining tokens as
$$\Delta\mathrm{Frac}=\mathrm{Frac}_{0}-\frac{1}{S-1}\sum_{t=1}^{S-1}\mathrm{Frac}_{t}. \qquad (1)$$

Meanwhile, we also measure the similarity between the RMSNorm output distribution and the distribution induced by the RMSNorm scaling factors using KL divergence:
$$\Delta\mathrm{KL}=\mathrm{KL}\left(p_{0}\,\|\,g\right)-\frac{1}{S-1}\sum_{t=1}^{S-1}\mathrm{KL}\left(p_{t}\,\|\,g\right), \qquad (2)$$

where $p_{i}=\frac{\hat{h}_{i}^{2}}{\sum_{j=1}^{D}\hat{h}_{j}^{2}}$, $g_{i}=\frac{f_{i}^{2}}{\sum_{j=1}^{D}f_{j}^{2}}$, and $f_{i}$ denotes the RMSNorm scaling factor of dimension $i$. As shown in [Figure 4](https://arxiv.org/html/2605.08504#S3.F4), at the ME Layer a large positive $\Delta\mathrm{Frac}$ indicates that the RMSNorm output of the first token is more strongly concentrated on dimensions associated with large scaling factors, while a negative $\Delta\mathrm{KL}$ shows that the overall output pattern of the first token is more consistent with the distribution induced by RMSNorm scaling. These results indicate that RMSNorm disproportionately amplifies the first token at the ME Layer through concentrated scaling effects.
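The two diagnostics above reduce to a few tensor operations. The sketch below computes $\mathrm{Frac}_t$, $\Delta\mathrm{Frac}$ (Eq. 1), and $\Delta\mathrm{KL}$ (Eq. 2) on synthetic placeholder tensors; selecting the top-$K$ dimensions by absolute scaling factor and the value of $K$ are our own assumptions.

```python
# Sketch of the Frac / KL diagnostics on a single layer's RMSNorm output `h_hat`
# of shape (seq_len, hidden_dim) and scaling vector `f` of shape (hidden_dim,).
import torch

def frac_topk(h_hat, f, k=10):
    """Fraction of each token's squared output carried by the top-k scaled dims."""
    topk_dims = f.abs().topk(k).indices                      # index set K
    e_total = (h_hat ** 2).sum(dim=-1)                       # E_t
    e_topk = (h_hat[:, topk_dims] ** 2).sum(dim=-1)          # E_t^K
    return e_topk / e_total                                   # Frac_t

def kl_to_scaling(h_hat, f, eps=1e-12):
    """KL(p_t || g): normalized squared output vs. normalized squared scaling factors."""
    p = (h_hat ** 2) / (h_hat ** 2).sum(dim=-1, keepdim=True)  # p_t
    g = (f ** 2) / (f ** 2).sum()                              # g
    return (p * ((p + eps) / (g + eps)).log()).sum(dim=-1)

seq_len, dim = 16, 2560                 # placeholder sizes
h_hat = torch.randn(seq_len, dim)       # synthetic RMSNorm output
f = torch.randn(dim)                    # synthetic scaling factors

frac = frac_topk(h_hat, f)
kl = kl_to_scaling(h_hat, f)
delta_frac = frac[0] - frac[1:].mean()   # Eq. (1)
delta_kl = kl[0] - kl[1:].mean()         # Eq. (2)
print(delta_frac.item(), delta_kl.item())
```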
Figure 4: The line chart (left y-axis) shows the difference in projection concentration between the first token and the other tokens after different modules in the FFN. The bar chart (right y-axis) shows the amplification factor of the MLP on the token hidden state.

#### Amplification effect of FFN
In addition to RMSNorm, the FFN in the ME Layer also contributes to the magnification of hidden states. To characterize how selectively a token's representation is shaped by the FFN, we compute the projection concentration, which measures how concentrated the hidden state is along a small subset of representation dimensions after the FFN transformation. A higher projection concentration indicates that the resulting token representation is dominated by a limited number of projection-induced directions, rather than being evenly distributed across the representation space. This metric captures the downstream representational effect of selective activation induced by these projections. As such, projection concentration serves as an indirect indicator of how strongly the input representation is shaped by a small subset of FFN projection directions, rather than by a uniform transformation across all dimensions. The formula is defined as follows:
$$\mathcal{C}_{t}=\sum_{i=1}^{d}\left(\frac{h_{t,i}^{2}}{\sum_{j=1}^{d}h_{t,j}^{2}}\right)^{2}, \qquad (3)$$

where $d$ denotes the hidden-state dimension, and $h_{t,i}$ denotes the $i$-th dimension of the $t$-th token. The results are shown in [Figure 4](https://arxiv.org/html/2605.08504#S3.F4). We observe that only at the ME Layer does the difference between the first token and the other tokens simultaneously reach its maximum across all three FFN modules. This indicates that, at the ME Layer, the first token exhibits a substantially stronger selective activation pattern under FFN transformations than in other layers, consistent with its disproportionately amplified activation at this layer. Meanwhile, we also report the amplification factor of the MLP for the first token. As shown in the figure, at the ME Layer the projection contributions of the three FFN projections jointly peak, resulting in the strongest amplification effect.
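The projection concentration of Eq. (3) is an inverse-participation-ratio-style statistic and can be computed directly. The following toy sketch uses synthetic activations to show that a token dominated by a handful of dimensions receives a much larger $\mathcal{C}_t$.

```python
# Sketch of the projection concentration in Eq. (3) for a (seq_len, d) activation.
import torch

def projection_concentration(h: torch.Tensor) -> torch.Tensor:
    p = h.pow(2) / h.pow(2).sum(dim=-1, keepdim=True)  # per-dim share of squared mass
    return p.pow(2).sum(dim=-1)                         # C_t in Eq. (3)

h = torch.randn(8, 2560)
h[0, :5] *= 100.0               # a token dominated by a handful of dimensions
c = projection_concentration(h)
print(c[0].item(), c[1:].mean().item())   # C_0 is much larger than the others
```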
In[Appendix B](https://arxiv.org/html/2605.08504#A2), we examine the respective contributions of RMSNorm and the FFN to the emergence of massive activations\. The results highlight a complementary interaction between the FFN and the preceding RMSNorm within the ME Layer\. Specifically, the FFN is the primary driver responsible for generating and sustaining massive activations, while the pre\-FFN RMSNorm plays a critical role in regulating their scale\. Together, these components amplify the massive\-activation token to levels that are hundreds or even thousands of times greater than those of other tokens\.
Figure 5: (a) L2 norm of the first token's hidden state across layers for different input instances. (b) The activations of token 0 in different layers of the model; the red line indicates the ME Layer. (c) Heatmap of the cosine similarity between different inputs' first-token hidden states across layers.
### 3.2 The Direction of Massive Activation
Key Takeaway: Once the massive activation emerges at the ME Layer, its hidden state exhibits strong input-invariant directionality and remains stable across subsequent layers.
After identifying the ME Layer, we further investigate the massive activation from the perspective of hidden states in the layers following the ME Layer. We observe that the value and direction of the massive activation's hidden state remain highly consistent across different tasks and input instances.
To identify the nature of the massive activation token, we similarly use Qwen3-4B as the representative model. Unlike models with an explicit begin-of-sequence token, Qwen3-4B does not introduce a dedicated start-token embedding at the input. Therefore, any massive activation observed at a specific token position cannot be trivially attributed to a fixed or input-independent embedding, but must emerge from the interaction between the input content and the model's internal transformations. We construct several different inputs from different tasks and compute: ❶ the L2 norm of the massive activation's hidden state; ❷ the massive activation token's hidden state across layers; ❸ the cosine similarity of the massive-activation hidden states across layers with respect to a different input. The results are shown in [Figure 5](https://arxiv.org/html/2605.08504#S3.F5). As shown in [Figure 5](https://arxiv.org/html/2605.08504#S3.F5)(a), once the massive activation emerges, its L2 norm remains stable across subsequent middle layers, indicating limited influence from later transformations. As shown in [Figure 5](https://arxiv.org/html/2605.08504#S3.F5)(b), the hidden-state patterns of the massive activation remain similar across layers after the ME Layer, suggesting that the activation direction is preserved. Consistently, [Figure 5](https://arxiv.org/html/2605.08504#S3.F5)(c) shows that the cosine similarity across different inputs remains nearly unchanged after the ME Layer. Together, these results demonstrate that the hidden state of the massive activation token remains stable across layers and inputs once it emerges. More results are provided in [Appendix D](https://arxiv.org/html/2605.08504#A4) and [Appendix F](https://arxiv.org/html/2605.08504#A6).
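Diagnostic ❸ can be reproduced with the same hidden-state trace used earlier. The sketch below (reusing `model` and `tok` from the first snippet; the prompts are illustrative) compares the first token's hidden states for two different inputs layer by layer.

```python
# Sketch: cross-input cosine similarity of the first token's hidden state per layer.
import torch
import torch.nn.functional as F

def first_token_states(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    return torch.stack([h[0, 0].float() for h in out.hidden_states])  # (L+1, d)

a = first_token_states("Solve: 12 * 7 = ?")
b = first_token_states("Write a short poem about the sea.")
cos = F.cosine_similarity(a, b, dim=-1)
for l, c in enumerate(cos.tolist()):
    print(f"layer {l:2d}: cos = {c:.3f}")   # expected to plateau near 1 after the ME Layer
```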
## 4 Weight-Guided Dimension Masking
Based on the previous analysis, we observe that after the ME Layer, the information encoded in massive activations remains largely identical across different inputs. While such massive activations can serve as a stable and shared global reference vector, a fixed hidden-state direction introduces inherent limitations. Once this direction becomes rigid, it restricts the attention mechanism's ability to conditionally adapt to diverse inputs, thereby reducing its input-dependent flexibility during inference.
Table 1: This table reports the performance of our method across multiple benchmarks, evaluating the model's generalization ability after instruction fine-tuning. TF denotes a training-free inference-time setting without parameter updates, while SFT denotes supervised fine-tuning with parameter updates. Bold indicates the best performance under the corresponding experimental settings.

### 4.1 Directional Rigidity Constrains Attention
To understand why directional similarity persists when hidden states enter the attention module, we examine the effect of the pre-attention RMSNorm. Before attention, hidden states are normalized by RMSNorm, defined as $\mathrm{RMSNorm}(\mathbf{x})=\frac{\mathbf{x}}{\sqrt{\frac{1}{d}\sum_{i=1}^{d}x_{i}^{2}+\epsilon}}\odot w$. Without the learnable scaling vector $w$, RMSNorm strictly rescales the magnitude of the hidden state while preserving its direction. With learnable scaling, RMSNorm performs a dimension-wise reweighting, which in general can alter the representation direction. However, in the regime we study, the massive activation's hidden state after the ME Layer is highly concentrated along a small subset of dimensions. In such cases, dimension-wise scaling primarily amplifies already-dominant components rather than introducing new directional components. As a result, although RMSNorm may change the exact direction, the dominant orientation of the representation remains largely consistent across inputs after normalization. Therefore, when entering the attention module, the massive activation's hidden state retains a highly similar direction across different inputs.
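The claim that dimension-wise scaling barely rotates a spiky hidden state can be checked numerically. The toy sketch below uses synthetic tensors, not the model's actual weights, and is only meant to illustrate the geometric argument.

```python
# Sketch: RMSNorm on a hidden state concentrated on a few dimensions barely rotates it.
import torch
import torch.nn.functional as F

d = 2560
h = torch.randn(d)
h[:4] += 500.0                       # massive-activation-like spike on a few dims
w = 1.0 + 0.5 * torch.rand(d)        # a generic positive scaling vector (illustrative)

rms = h.pow(2).mean().add(1e-6).sqrt()
h_norm = h / rms                      # magnitude rescaling only: direction unchanged
h_scaled = h_norm * w                 # dimension-wise reweighting

print(F.cosine_similarity(h, h_norm, dim=0).item())    # exactly 1.0 (up to eps)
print(F.cosine_similarity(h, h_scaled, dim=0).item())  # still close to 1.0
```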
In self-attention, keys are obtained via a linear projection, $k_{0}=h_{0}W_{K}$. By decomposing the hidden state as $h_{0}=\lVert h_{0}\rVert\hat{h}_{0}$, where $\hat{h}_{0}$ denotes the unit vector, we can rewrite the key as $k_{0}=\lVert h_{0}\rVert(\hat{h}_{0}W_{K})$. This decomposition highlights that when the direction $\hat{h}_{0}$ of the massive activation remains stable across inputs, the resulting key occupies an approximately fixed position in the attention similarity space. Since attention scores are computed as inner products, $l_{i0}=q_{i}^{\top}k_{0}$, a directionally invariant key induces stable similarity patterns that vary little with the input. Consequently, such keys act as fixed reference points in self-attention. This interpretation is consistent with prior findings showing that highly similar hidden states induce rigid representations that reduce input sensitivity and representation diversity (Oh et al., [2025](https://arxiv.org/html/2605.08504#bib.bib51)). Moreover, earlier studies demonstrate that when representations concentrate along a small number of dominant directions, these directions can dominate the representation space, leading to degraded representational quality and reduced effective dimensionality (Ethayarajh, [2019](https://arxiv.org/html/2605.08504#bib.bib36); Timkey and Van Schijndel, [2021](https://arxiv.org/html/2605.08504#bib.bib27)).
### 4.2 Proposed Method
Motivated by these limitations, we propose a method named WeMask (Weight-guided Masking) that selectively suppresses dominant dimensions in the massive activation, thereby restoring the directional diversity required for effective attention computation, without altering the overall transformer structure or incurring additional computational cost. An overview of the method is shown in [Figure 6](https://arxiv.org/html/2605.08504#S4.F6). Pre-attention RMSNorm preserves direction while amplifying dominant dimensions, reinforcing directional rigidity and reducing attention diversity. Based on this observation, we select the dimensions with large RMSNorm weights as candidates for suppression, defined as $\mathcal{S}^{(l)}=\mathrm{TopK}\left(\left|w^{(l)}\right|,k\right)$, where $w^{(l)}$ is the weight of layer $l$'s RMSNorm, $k$ denotes the number of selected dimensions determined by the mask rate multiplied by the hidden dimension, and $\mathcal{S}^{(l)}$ represents the selected dimensions. After choosing them, we build a mask as:
$$\mathbf{m}^{(l)}\in\{0,1\}^{D},\qquad m^{(l)}_{d}=\begin{cases}1,& d\in\mathcal{S}^{(l)};\\ 0,&\text{otherwise}.\end{cases} \qquad (4)$$

Then, we use it to mask the corresponding dimensions in the input to the attention module, as follows:
$$\tilde{\mathbf{h}}^{(l)}_{0}=\mathbf{h}^{(l)}_{0}\odot\left(1-\mathbf{m}^{(l)}\right), \qquad (5)$$

where $\mathbf{h}$ denotes the input hidden state of the attention module. We insert this operation before the attention layer in each subsequent layer, starting from the ME Layer, to reduce the rigidity of the massive activation's direction, and then train the model.
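For concreteness, the following sketch re-implements WeMask as an inference-time intervention; it is our reading of Eqs. (4)–(5) rather than the authors' released code. Module and attribute names follow the Qwen2-style layout in HuggingFace Transformers, the masking point (the hidden state fed into the pre-attention RMSNorm, which leaves the residual stream untouched) is one possible interpretation of the attention input, and `ME_LAYER` and `MASK_RATE` are illustrative values.

```python
# Sketch: mask top-k dimensions (by pre-attention RMSNorm weight magnitude) of the
# first token's hidden state before attention, in every layer from the ME Layer on.
import torch

ME_LAYER = 7       # illustrative; determined per model as in Section 3
MASK_RATE = 0.02   # fraction of hidden dimensions to suppress (illustrative)

def install_wemask(model, me_layer=ME_LAYER, mask_rate=MASK_RATE):
    handles = []
    for l, layer in enumerate(model.model.layers):       # attribute path is an assumption
        if l < me_layer:
            continue
        w = layer.input_layernorm.weight.detach()         # pre-attention RMSNorm weights
        k = max(1, int(mask_rate * w.numel()))
        dims = w.abs().topk(k).indices                     # S^(l): top-k dimensions

        def pre_hook(module, hook_inputs, dims=dims):
            h = hook_inputs[0].clone()                     # (batch, seq, hidden)
            h[:, 0, dims] = 0.0                            # mask the massive-activation token
            return (h,)                                    # replace the module input

        handles.append(layer.input_layernorm.register_forward_pre_hook(pre_hook))
    return handles  # call .remove() on each handle to uninstall

handles = install_wemask(model)
```

In the fine-tuning setting described later, the same hooks could in principle remain installed during training, since the masking is a simple elementwise operation.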
Figure 6: Schematic diagram of our method. We choose the top-k dimensions based on the RMSNorm weights and then mask the corresponding dimensions in the hidden state.

Table 2: This table presents the performance on math reasoning and safety alignment benchmarks after math-oriented fine-tuning and safety-oriented fine-tuning. TF denotes a training-free inference-time setting without parameter updates, while SFT denotes supervised fine-tuning with parameter updates. Bold indicates the best performance under the corresponding experimental settings.
## 5 Experiments
### 5.1 Settings
Method Details and Training Setups: We adopt Qwen3-4B as the base model and apply our method both as a training-free inference-time technique and as a training-time strategy across multiple tasks, including instruction fine-tuning, math reasoning, and safety alignment. For each task, we fine-tune the model on the corresponding datasets: FLAN ([Wei et al.](https://arxiv.org/html/2605.08504#bib.bib10)) and OpenOrca (Lian et al., [2023](https://arxiv.org/html/2605.08504#bib.bib11)) for instruction fine-tuning, GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.08504#bib.bib12)) for math reasoning, and HH-RLHF (Bai et al., [2022](https://arxiv.org/html/2605.08504#bib.bib26)) for safety alignment. The context length is set to 4096. Task-specific training configurations, such as learning rate and batch size, are provided in the corresponding sections, while all other hyperparameters follow the default AdamW settings. In [Appendix F](https://arxiv.org/html/2605.08504#A6), we further apply WeMask to Llama-3.1-8B-Instruct and Qwen3-8B, demonstrating that our method scales effectively across different model families and parameter sizes.
Evaluation: We evaluate in the 0-shot setting on several benchmarks; the maximum number of new output tokens is 512, except for GSM8K (128). For every evaluation, we vary the random seed and run three times to compute the mean and standard deviation. The benchmarks include MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.08504#bib.bib13)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.08504#bib.bib14)), ARC-C (Clark et al., [2018](https://arxiv.org/html/2605.08504#bib.bib15)), MathQA (Amini et al., [2019](https://arxiv.org/html/2605.08504#bib.bib18)), StrategyQA (Geva et al., [2021](https://arxiv.org/html/2605.08504#bib.bib17)), GSM8K (Cobbe et al., [2021](https://arxiv.org/html/2605.08504#bib.bib12)), AIME22-24 (AIME, [2024](https://arxiv.org/html/2605.08504#bib.bib21)), Math500 (Lightman et al., [2023](https://arxiv.org/html/2605.08504#bib.bib19)), SorryBench (Xie et al., [2025](https://arxiv.org/html/2605.08504#bib.bib25)), and XSTest (Röttger et al., [2023](https://arxiv.org/html/2605.08504#bib.bib24)).
### 5.2 Experimental Results Analysis
Instruction Fine-tuning. We first evaluate our method on instruction fine-tuning tasks using Qwen3-4B as the base model, with a global batch size of 256 and a learning rate of 2e-5. Results are reported in [Table 1](https://arxiv.org/html/2605.08504#S4.T1). Qwen3-4B + SFT denotes standard SFT on the training set; Qwen3-4B + SFT + WeMask (TF) applies our method only at inference time; and Qwen3-4B + WeMask (SFT) jointly fine-tunes the model with our method enabled. The mask rate indicates the proportion of dimensions, corresponding to the largest weights, that are masked. Our method consistently improves performance across instruction fine-tuning tasks, in both the training-free and fine-tuning settings.
Math Reasoning and Safety Alignment. We next apply our method to math reasoning and safety alignment tasks. We adopt Qwen3-4B as the base model, using a global batch size of 64 for math reasoning and 256 for safety alignment, while keeping all other experimental settings identical to those used in instruction fine-tuning. The results are summarized in [Table 2](https://arxiv.org/html/2605.08504#S4.T2). Across both task-specific settings, incorporating our method consistently improves model performance, indicating that its effectiveness extends beyond instruction fine-tuning. These gains demonstrate that our approach generalizes across different optimization objectives, training paradigms, and data distributions, covering both reasoning-oriented and safety-critical tasks. In particular, on XSTest, standard SFT tends to induce overly conservative refusal behaviors, leading to a noticeable degradation in overall performance. By contrast, integrating our method mitigates this issue by reducing excessive representational rigidity, thereby better balancing safety and helpfulness and substantially restoring overall performance.
Ablation study. In [Appendix E](https://arxiv.org/html/2605.08504#A5), we evaluate the effectiveness of our method by comparing it with different masking strategies, including randomly masking a fixed proportion of dimensions and masking the dimensions with the largest activation magnitudes. The results show that these alternative masking methods lead to a substantial degradation in model performance. In contrast, only our method consistently improves performance, demonstrating the effectiveness and necessity of weight-guided dimension masking.
Table 3: Performance on safety alignment benchmarks after DPO training. TF and TA denote training-free and training-aware settings, respectively. Bold indicates the best performance; underline indicates the second best.

Weight-guided Masking in RL Training. In this part, we extend our approach to reinforcement learning (RL) and show that it continues to improve the performance of RL-trained models.
For safety alignment, we employ DPO (Rafailov et al., [2023](https://arxiv.org/html/2605.08504#bib.bib47)) to train Qwen3-4B on the HH-RLHF benchmark, randomly sampling 3,000 training instances. The model is trained with a batch size of 8, a maximum sequence length of 1024, and a learning rate of $5\times 10^{-6}$. Evaluation is performed on XSTest (Röttger et al., [2023](https://arxiv.org/html/2605.08504#bib.bib24)) and AdvBench (Zou et al., [2023](https://arxiv.org/html/2605.08504#bib.bib20)). For math reasoning, we adopt GRPO (Shao et al., [2024](https://arxiv.org/html/2605.08504#bib.bib48)) to train Qwen3-4B on GSM8K, using a batch size of 256, a maximum sequence length of 256, and a learning rate of $1\times 10^{-6}$. The resulting model is evaluated on AIME 2022–2024 (AIME, [2024](https://arxiv.org/html/2605.08504#bib.bib21)) and Math500 (Lightman et al., [2023](https://arxiv.org/html/2605.08504#bib.bib19)). As shown in [Table 3](https://arxiv.org/html/2605.08504#S5.T3) and [Table 4](https://arxiv.org/html/2605.08504#S5.T4), our method consistently improves performance across both safety alignment and math reasoning tasks, achieving gains on most evaluation benchmarks. These results demonstrate that our approach generalizes well to reinforcement learning–based training paradigms, highlighting its robustness and scalability beyond supervised fine-tuning.
Table 4: Performance on math reasoning after GRPO training. TF and TA respectively denote training-free and training-aware settings. Bold indicates the best performance; underline indicates the second best.
## 6 Discussion: Rethinking Attention Sink from a Representation Perspective
Figure 7: (a) Heatmap of attention weights in the ME Layer (layer 7). (b) The layer after the ME Layer (layer 8).

Our findings share similarities with prior studies on attention sinks. Previous works, such as Qiu et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib4)) and Gu et al. ([2024](https://arxiv.org/html/2605.08504#bib.bib3)), show that attention weights are often heavily concentrated on a single token across multiple heads. This concentration implies a low-rank structure in the attention matrix, reducing the richness of information aggregation. Moreover, attention sinks are observed to persist across different inputs, indicating a degree of input invariance. Similarly, our work focuses on an earlier stage of the model. We find that after the ME Layer, the first token's hidden state exhibits an almost input-invariant direction while its magnitude becomes larger than that of other tokens. This behavior suggests a similar low-rank effect, but at the level of hidden representations rather than attention weights.
Figure 8: (a) Attention heatmap without our method. (b) Attention heatmap with our method.

Motivated by this connection, we further investigate the relationship between massive activation onset, our proposed intervention, and the emergence of attention sinks. As shown in [Figure 7](https://arxiv.org/html/2605.08504#S6.F7)(a,b), attention sinks consistently appear in layers following the onset of massive activation. Notably, the attention sink observed at the ME Layer is not caused by the FFN output of the same layer, as multi-head attention precedes the FFN in the forward pass. Instead, it reflects a directionally rigid representation already consolidated in the residual stream, which becomes explicitly amplified as a massive activation at the ME Layer and subsequently influences attention in later layers. As shown in [Figure 8](https://arxiv.org/html/2605.08504#S6.F8)(a,b), our method does not fully eliminate the attention sink but substantially reduces its dominance, resulting in more balanced attention distributions.
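A simple way to quantify this effect is to measure, per layer, the average attention mass that non-initial queries place on the first token, with and without the intervention. The sketch below reuses `model`, `tok`, and `install_wemask` from the earlier snippets; obtaining attention weights requires loading the model with eager attention, and the prompt is illustrative.

```python
# Sketch: per-layer attention mass on token 0, before and after installing WeMask.
# Note: the model must be loaded with attn_implementation="eager" so that
# output_attentions=True returns the attention probabilities.
import torch

def sink_strength(prompt):
    ids = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_attentions=True)
    # out.attentions: one (batch, heads, seq, seq) tensor per layer;
    # average attention from query positions 1..S-1 to key position 0.
    return [a[0, :, 1:, 0].mean().item() for a in out.attentions]

baseline = sink_strength("Explain why the sky is blue.")
handles = install_wemask(model)
masked = sink_strength("Explain why the sky is blue.")
for h in handles:
    h.remove()

for l, (b, m) in enumerate(zip(baseline, masked)):
    print(f"layer {l:2d}: sink mass {b:.3f} -> {m:.3f}")
```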
Based on these findings, we provide a new perspective on the attention sink phenomenon. We show that attention sinks originate from the ME Layer, where the first token undergoes abrupt magnitude amplification and becomes highly consistent across inputs, collapsing representations into a low-dimensional subspace before entering the attention module. This collapse leads to highly similar keys and queries for the first token, suggesting that attention sinks are a downstream consequence of massive-activation–induced representation collapse rather than an artifact of the softmax operation, as emphasized in prior work (Ruscio et al., [2025](https://arxiv.org/html/2605.08504#bib.bib5); Xiao et al., [2024](https://arxiv.org/html/2605.08504#bib.bib7)). Importantly, we find that completely eliminating attention sinks is suboptimal: fully removing the sink consistently degrades performance, whereas moderate attenuation preserves useful information while improving overall results. This indicates that attention sinks encode beneficial signals but become harmful when their representations are overly rigid, and that partially relaxing this rigidity yields better model performance.
## 7 Conclusion
In this paper, we analyze the origin of massive activations in large language models and identify the ME Layer as their point of emergence. We show that once formed, the massive activation token exhibits highly consistent hidden-state patterns across layers, even under diverse inputs, leading to reduced representational diversity and increased directional rigidity. Motivated by this observation, we propose a simple and effective method that relaxes this excessive consistency by intervening directly on hidden-state representations, without modifying the model architecture or training objective. This intervention yields consistent performance improvements across multiple tasks and training settings. Our analysis also offers a new perspective on attention sinks, attributing their emergence and mitigation to hidden-state dynamics rather than the attention mechanism alone.
## Impact Statement
This paper aims to advance the understanding of internal mechanisms in large language models and to improve their performance through principled representation\-level interventions\. While enhanced model capabilities may influence downstream applications, we do not identify any ethical concerns or societal risks specific to this work beyond those generally associated with progress in machine learning research\.
## References
- A\. Ahmadian, S\. Dash, H\. Chen, B\. Venkitesh, Z\. S\. Gou, P\. Blunsom, A\. Üstün, and S\. Hooker \(2023\)Intriguing properties of quantization at scale\.Advances in Neural Information Processing Systems36,pp\. 34278–34294\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- AIME \(2024\)External Links:[Link](https://huggingface.co/datasets/AI-MO/aimo-validation-aime)Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1),[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- A\. Amini, S\. Gabriel, S\. Lin, R\. Koncel\-Kedziorski, Y\. Choi, and H\. Hajishirzi \(2019\)MathQA: towards interpretable math word problem solving with operation\-based formalisms\.InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 \(Long and Short Papers\),Minneapolis, Minnesota,pp\. 2357–2367\.External Links:[Link](https://aclanthology.org/N19-1245),[Document](https://dx.doi.org/10.18653/v1/N19-1245)Cited by:[Appendix F](https://arxiv.org/html/2605.08504#A6.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- Y\. Bai, A\. Jones, K\. Ndousse, A\. Askell, A\. Chen, N\. DasSarma, D\. Drain, S\. Fort, D\. Ganguli, T\. Henighan,et al\.\(2022\)Training a helpful and harmless assistant with reinforcement learning from human feedback\.arXiv preprint arXiv:2204\.05862\.Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p1.1)\.
- Y\. Bisk, R\. Zellers, R\. L\. Bras, J\. Gao, and Y\. Choi \(2020\)PIQA: reasoning about physical commonsense in natural language\.InThirty\-Fourth AAAI Conference on Artificial Intelligence,Cited by:[Appendix F](https://arxiv.org/html/2605.08504#A6.p1.1),[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- Y\. Bondarenko, M\. Nagel, and T\. Blankevoort \(2023\)Quantizable transformers: removing outliers by helping attention heads do nothing\.Advances in Neural Information Processing Systems36,pp\. 75067–75096\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1),[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- R\. Bu, H\. Zhong, W\. Chen, and Y\. Li \(2025\)Value\-state gated attention for mitigating extreme\-token phenomena in transformers\.arXiv preprint arXiv:2510\.09017\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- N\. Cancedda \(2024\)Spectral filters, dark signals, and attention sinks\.InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics \(Volume 1: Long Papers\),pp\. 4792–4808\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- P\. Clark, I\. Cowhey, O\. Etzioni, T\. Khot, A\. Sabharwal, C\. Schoenick, and O\. Tafjord \(2018\)Think you have solved question answering? try arc, the ai2 reasoning challenge\.arXiv:1803\.05457v1\.Cited by:[Appendix F](https://arxiv.org/html/2605.08504#A6.p1.1),[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- K\. Cobbe, V\. Kosaraju, M\. Bavarian, M\. Chen, H\. Jun, L\. Kaiser, M\. Plappert, J\. Tworek, J\. Hilton, R\. Nakano, C\. Hesse, and J\. Schulman \(2021\)Training verifiers to solve math word problems\.arXiv preprint arXiv:2110\.14168\.Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- T\. Darcet, M\. Oquab, J\. Mairal, and P\. Bojanowski\. Vision transformers need registers\. In The Twelfth International Conference on Learning Representations\. Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- T\. Dettmers, M\. Lewis, Y\. Belkada, and L\. Zettlemoyer \(2022\)GPT3\.int8\(\): 8\-bit matrix multiplication for transformers at scale\.Advances in Neural Information Processing Systems35,pp\. 30318–30332\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- K\. Ethayarajh \(2019\)How contextual are contextualized word representations? Comparing the geometry of BERT, ELMo, and GPT\-2 embeddings\.Cited by:[§4\.1](https://arxiv.org/html/2605.08504#S4.SS1.p2.6)\.
- J\. Ferrando and E\. Voita \(2024\)Information flow routes: automatically interpreting language models at scale\.InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing,pp\. 17432–17445\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- J\. Gallego\-Feliciano, S\. A\. McClendon, J\. Morinelli, S\. Zervoudakis, and A\. Saravanos \(2025\)Hidden dynamics of massive activations in transformer training\.arXiv preprint arXiv:2508\.03616\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- M\. Geva, D\. Khashabi, E\. Segal, T\. Khot, D\. Roth, and J\. Berant \(2021\)Did Aristotle Use a Laptop? A Question Answering Benchmark with Implicit Reasoning Strategies\.Transactions of the Association for Computational Linguistics \(TACL\)\.Cited by:[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- X\. Gu, T\. Pang, C\. Du, Q\. Liu, F\. Zhang, C\. Du, Y\. Wang, and M\. Lin \(2024\)When attention sink emerges in language models: an empirical view\.arXiv preprint arXiv:2410\.10781\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1),[§6](https://arxiv.org/html/2605.08504#S6.p1.1)\.
- B\. He, L\. Noci, D\. Paliotta, I\. Schlag, and T\. Hofmann \(2024\)Understanding and minimising outlier features in transformer training\.Advances in Neural Information Processing Systems37,pp\. 83786–83846\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- D\. Hendrycks, C\. Burns, S\. Basart, A\. Zou, M\. Mazeika, D\. Song, and J\. Steinhardt \(2021\)Measuring massive multitask language understanding\.Proceedings of the International Conference on Learning Representations \(ICLR\)\.Cited by:[Appendix F](https://arxiv.org/html/2605.08504#A6.p1.1),[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1),[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- M\. Jin, K\. Mei, W\. Xu, M\. Sun, R\. Tang, M\. Du, Z\. Liu, and Y\. Zhang \(2025\)Massive values in self\-attention modules are the key to contextual knowledge understanding\.arXiv preprint arXiv:2502\.01563\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- P\. Kaul, C\. Ma, I\. Elezi, and J\. Deng\. From attention to activation: unraveling the enigmas of large language models\. In The Thirteenth International Conference on Learning Representations\. Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- W\. Lian, B\. Goodson, E\. Pentland, A\. Cook, C\. Vong, and ”Teknium” \(2023\)OpenOrca: an open dataset of gpt augmented flan reasoning traces\.HuggingFace\.Note:[https://huggingface.co/datasets/Open-Orca/OpenOrca](https://huggingface.co/datasets/Open-Orca/OpenOrca)Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p1.1)\.
- H\. Lightman, V\. Kosaraju, Y\. Burda, H\. Edwards, B\. Baker, T\. Lee, J\. Leike, J\. Schulman, I\. Sutskever, and K\. Cobbe \(2023\)Let’s verify step by step\.arXiv preprint arXiv:2305\.20050\.Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1),[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- A\. Liu, B\. Feng, B\. Xue, B\. Wang, B\. Wu, C\. Lu, C\. Zhao, C\. Deng, C\. Zhang, C\. Ruan,et al\.\(2024\)Deepseek\-v3 technical report\.arXiv preprint arXiv:2412\.19437\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- T\. Mihaylov, P\. Clark, T\. Khot, and A\. Sabharwal \(2018\)Can a suit of armor conduct electricity? a new dataset for open book question answering\.InEMNLP,Cited by:[Appendix F](https://arxiv.org/html/2605.08504#A6.p1.1),[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1)\.
- E\. Miller \(2023\)Attention is off by one\.URL https://www\.evanmiller\.org/attention\-is\-off\-by\-one\.html\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- J\. Oh, S\. Shin, and D\. Oh \(2025\)House of cards: massive weights in llms\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08504#S4.SS1.p2.6)\.
- L\. Owen, N\. R\. Chowdhury, A\. Kumar, and F\. Güra \(2025\)A refined analysis of massive activations in llms\.arXiv preprint arXiv:2503\.22329\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- Z\. Qiu, Z\. Wang, B\. Zheng, Z\. Huang, K\. Wen, S\. Yang, R\. Men, L\. Yu, F\. Huang, S\. Huang,et al\.\(2025\)Gated attention for large language models: non\-linearity, sparsity, and attention\-sink\-free\.arXiv preprint arXiv:2505\.06708\.Cited by:[Appendix G](https://arxiv.org/html/2605.08504#A7.p1.1),[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1),[§6](https://arxiv.org/html/2605.08504#S6.p1.1)\.
- E\. Queipo\-de\-Llano, Á\. Arroyo, F\. Barbero, X\. Dong, M\. Bronstein, Y\. LeCun, and R\. Shwartz\-Ziv \(2025\)Attention sinks and compression valleys in llms are two sides of the same coin\.arXiv preprint arXiv:2510\.06477\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- R\. Rafailov, A\. Sharma, E\. Mitchell, C\. D\. Manning, S\. Ermon, and C\. Finn \(2023\)Direct preference optimization: your language model is secretly a reward model\.Advances in neural information processing systems36,pp\. 53728–53741\.Cited by:[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- J\. Ramapuram, F\. Danieli, E\. Dhekane, F\. Weers, D\. Busbridge, P\. Ablin, T\. Likhomanenko, J\. Digani, Z\. Gu, A\. Shidani,et al\.\(2024\)Theory, analysis, and best practices for sigmoid self\-attention\.arXiv preprint arXiv:2409\.04431\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- P\. Röttger, H\. R\. Kirk, B\. Vidgen, G\. Attanasio, F\. Bianchi, and D\. Hovy \(2023\)Xstest: a test suite for identifying exaggerated safety behaviours in large language models\.arXiv preprint arXiv:2308\.01263\.Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1),[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- V\. Ruscio, U\. Nanni, and F\. Silvestri \(2025\)What are you sinking? a geometric approach on attention sink\.arXiv preprint arXiv:2508\.02546\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1),[§6](https://arxiv.org/html/2605.08504#S6.p3.1)\.
- B\. Shang, Y\. Chen, Y\. Zhang, B\. Shen, and S\. Liu \(2025\)Forgetting to forget: attention sink as a gateway for backdooring llm unlearning\.arXiv preprint arXiv:2510\.17021\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- Z\. Shao, P\. Wang, Q\. Zhu, R\. Xu, J\. Song, X\. Bi, H\. Zhang, M\. Zhang, Y\. Li, Y\. Wu,et al\.\(2024\)Deepseekmath: pushing the limits of mathematical reasoning in open language models\.arXiv preprint arXiv:2402\.03300\.Cited by:[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- Z\. Shi, K\. Mei, Y\. Quan, D\. N\. Metaxas, and R\. Tang \(2026\)Improving visual reasoning with iterative evidence refinement\.arXiv preprint arXiv:2603\.14117\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- Z\. Shi, Y\. Wan, Z\. Wang, Q\. Wang, F\. Yang, E\. Kreiss, and R\. Tang \(2025\)Meaningless tokens, meaningful gains: how activation shifts enhance llm reasoning\.arXiv preprint arXiv:2510\.01032\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- S\. Son, W\. Park, W\. Han, K\. Kim, and J\. Lee \(2024\)Prefixing attention sinks can mitigate activation outliers for large language model quantization\.arXiv preprint arXiv:2406\.12016\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- M\. Sun, X\. Chen, J\. Z\. Kolter, and Z\. Liu \(2024\)Massive activations in large language models\.arXiv preprint arXiv:2402\.17762\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1),[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- W\. Timkey and M\. Van Schijndel \(2021\)All bark and no bite: rogue dimensions in transformer language models obscure representational quality\.arXiv preprint arXiv:2109\.04404\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1),[§4\.1](https://arxiv.org/html/2605.08504#S4.SS1.p2.6)\.
- J\. Wei, M\. Bosma, V\. Zhao, K\. Guu, A\. W\. Yu, B\. Lester, N\. Du, A\. M\. Dai, and Q\. V\. Le\. Finetuned language models are zero\-shot learners\. In International Conference on Learning Representations\. Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p1.1)\.
- G\. Xiao, Y\. Tian, B\. Chen, S\. Han, and M\. Lewis \(2024\)Efficient streaming language models with attention sinks\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p5.1),[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1),[§6](https://arxiv.org/html/2605.08504#S6.p3.1)\.
- T\. Xie, X\. Qi, Y\. Zeng, Y\. Huang, U\. M\. Sehwag, K\. Huang, L\. He, B\. Wei, D\. Li, Y\. Sheng, R\. Jia, B\. Li, K\. Li, D\. Chen, P\. Henderson, and P\. Mittal \(2025\)SORRY\-bench: systematically evaluating large language model safety refusal\.InThe Thirteenth International Conference on Learning Representations,External Links:[Link](https://openreview.net/forum?id=YfKNaRktan)Cited by:[§5\.1](https://arxiv.org/html/2605.08504#S5.SS1.p2.1)\.
- A\. Yang, A\. Li, B\. Yang, B\. Zhang, B\. Hui, B\. Zheng, B\. Yu, C\. Gao, C\. Huang, C\. Lv,et al\.\(2025\)Qwen3 technical report\.arXiv preprint arXiv:2505\.09388\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- A\. Zeng, X\. Liu, Z\. Du, Z\. Wang, H\. Lai, M\. Ding, Z\. Yang, Y\. Xu, W\. Zheng, X\. Xia,et al\.\(2022\)Glm\-130b: an open bilingual pre\-trained model\.arXiv preprint arXiv:2210\.02414\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- B\. Zhang, Y\. Yu, J\. Guo, and J\. Shao \(2025a\)Dive into the agent matrix: a realistic evaluation of self\-replication risk in llm agents\.arXiv preprint arXiv:2509\.25302\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- B\. Zhang and R\. Zhang \(2025\)Cot\-uq: improving response\-wise uncertainty quantification in llms with chain\-of\-thought\.InFindings of the Association for Computational Linguistics: ACL 2025,pp\. 26114–26133\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
- S\. Zhang, M\. Khan, and V\. Papyan\. Attention sinks: a ’catch, tag, release’ mechanism for embeddings\. In The Thirty\-ninth Annual Conference on Neural Information Processing Systems\. Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- X\. Zhang, Y\. Quan, C\. Shen, C\. Gu, X\. Yuan, S\. Yan, J\. Cao, H\. Cheng, K\. Wu, and J\. Ye \(2025b\)Shallow focus, deep fixes: enhancing shallow layers vision attention sinks to alleviate hallucination in lvlms\.InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing,pp\. 3512–3534\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- X\. Zhang, Y\. Quan, C\. Shen, X\. Yuan, S\. Yan, L\. Xie, W\. Wang, C\. Gu, H\. Tang, and J\. Ye \(2025c\)From redundancy to relevance: information flow in lvlms across reasoning tasks\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 2289–2299\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- H\. Zhao, H\. Chen, F\. Yang, N\. Liu, H\. Deng, H\. Cai, S\. Wang, D\. Yin, and M\. Du \(2024\)Explainability for large language models: a survey\.ACM Transactions on Intelligent Systems and Technology15\(2\),pp\. 1–38\.Cited by:[§1](https://arxiv.org/html/2605.08504#S1.p1.1)\.
- T\. Zhao, K\. Y\. Singh, S\. Appalaraju, P\. Tang, Y\. N\. Wu, and L\. E\. Li \(2025\)On the analysis and distillation of emergent outlier properties in pre\-trained language models\.InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies \(Volume 1: Long Papers\),pp\. 8475–8507\.Cited by:[§2\.1](https://arxiv.org/html/2605.08504#S2.SS1.p1.1)\.
- A\. Zou, Z\. Wang, J\. Z\. Kolter, and M\. Fredrikson \(2023\)Universal and transferable adversarial attacks on aligned language models\.External Links:2307\.15043Cited by:[§5\.2](https://arxiv.org/html/2605.08504#S5.SS2.p5.2)\.
- Z\. M\. Zuhri, E\. H\. Fuadi, and A\. F\. Aji \(2025\)Softpick: no attention sink, no massive activations with rectified softmax\.arXiv preprint arXiv:2504\.20966\.Cited by:[§2\.2](https://arxiv.org/html/2605.08504#S2.SS2.p1.1)\.
## Appendix A Limitations and Future Work
While our analysis focuses on the emergence and propagation of massive activations in the middle layers, we observe that the final layers exhibit qualitatively different behavior\. In particular, the model again produces massive activations in the first token within the last two layers, suggesting that these layers may serve distinct functional roles compared to intermediate layers, such as output consolidation or task\-specific representation shaping\. However, our current study does not provide a detailed mechanistic explanation for this phenomenon, and a systematic analysis of massive\-value formation in the final layers remains beyond the scope of this work\.
Moreover, our evaluation primarily considers the post\-training setting, where the proposed method is applied after supervised fine\-tuning or reinforcement learning\. Although we observe consistent performance improvements under this setting, we do not investigate the effects of integrating our method into the pre\-training process\. Understanding whether suppressing dominant dimensions during large\-scale pre\-training would lead to similar or even stronger benefits, without adversely affecting representation learning, remains an open and important direction for future research\.
## Appendix B Comparing the Roles of RMSNorm and the FFN
As discussed earlier, both RMSNorm and the FFN contribute to the emergence of massive activations. To disentangle their respective roles, we conduct controlled ablation studies by separately removing the RMSNorm preceding the FFN and the FFN itself, and analyze how each modification affects the formation and propagation of massive activations across layers. The results are shown in [Figure 9](https://arxiv.org/html/2605.08504#A2.F9). When the FFN is removed, the massive-activation token still emerges in the intermediate layers, indicating that earlier components of the network can transiently produce elevated activations. However, these massive activations fail to persist and gradually vanish in deeper layers. This suggests that, without the FFN, the network lacks a mechanism to continuously amplify or maintain such activations as they propagate through the residual stream. In contrast, when the RMSNorm before the FFN is removed, the massive activation remains observable throughout the network, but its magnitude is significantly reduced compared to the original model. This indicates that RMSNorm substantially influences the scale of massive activations, likely by reweighting and amplifying specific dimensions of the hidden representation before it enters the FFN. Taken together, these results suggest a complementary interplay between the FFN and the preceding RMSNorm in the ME Layer: the FFN appears to be the dominant component responsible for generating and sustaining massive activations, whereas the RMSNorm before the FFN plays a crucial role in modulating their magnitude. This interaction helps explain why massive activations emerge sharply and reach extreme values specifically within the ME Layer.
Figure 9: The hidden state at the output of the decoder layer. Left: the FFN in the ME Layer is removed; middle: the RMSNorm in the ME Layer is removed; right: all modules are retained.
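To make the ablation concrete, the following is a minimal sketch of how such a component-removal experiment could be run with forward hooks. It assumes a Llama/Qwen-style decoder layer exposing `.mlp` and `.post_attention_layernorm`, the Hugging Face `transformers` API, and an illustrative ME Layer index of 7; it is not the exact code used to produce Figure 9.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
ids = tok("What is the capital of France?", return_tensors="pt").input_ids

ME_LAYER = 7  # illustrative index; the actual value is model-specific (see Table 8)
layer = model.model.layers[ME_LAYER]

def zero_ffn_hook(module, inputs, output):
    # "Remove the FFN": its contribution to the residual stream becomes zero.
    return torch.zeros_like(output)

def bypass_rmsnorm_hook(module, inputs, output):
    # "Remove the pre-FFN RMSNorm": pass the unnormalized residual straight to the FFN.
    return inputs[0]

# Ablate one component at a time; swap the hook to test the other variant.
handle = layer.mlp.register_forward_hook(zero_ffn_hook)
# handle = layer.post_attention_layernorm.register_forward_hook(bypass_rmsnorm_hook)

with torch.no_grad():
    out = model(ids, output_hidden_states=True)
handle.remove()

# Per-layer L2 norm of the first token under the ablation (cf. Figure 9).
print([round(h[0, 0].float().norm().item(), 1) for h in out.hidden_states])
```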
## Appendix C More Experimental Settings
During training, WeMask is applied to every layer following the onset of massive activation\. In contrast, during evaluation, we adopt different configurations depending on the task type\. For tasks that primarily assess the model’s ability to generalize knowledge, we use the same setting as in training and apply WeMask to all layers after massive activation\. However, for task\-specific evaluations such as mathematical reasoning and safety alignment, WeMask is applied only to the first layer where massive activation emerges during inference\.
This design choice is motivated by the different functional roles of WeMask during training and inference, as well as the varying sensitivity of downstream tasks to representational intervention\. During training, massive activations emerging after the ME Layer tend to propagate through the residual stream and repeatedly reinforce a directionally rigid representation across subsequent layers\. If left unmitigated, this rigidity can accumulate layer by layer, shaping the overall geometry of the hidden\-state space\. Applying WeMask to all layers following the onset of massive activation therefore acts as a form of representation\-level regularization\. This encourages the model to learn under reduced directional dominance and to distribute representational capacity more evenly across dimensions throughout the network, leading to more stable and flexible hidden\-state dynamics\.
During inference, however, the objectives and sensitivities of different tasks diverge\. For tasks that primarily assess the model’s ability to generalize knowledge across domains or inputs, maintaining consistency between training and evaluation is important\. In these settings, we therefore apply WeMask in the same manner as during training, i\.e\., to all layers following the onset of massive activation\. In contrast, task\-specific evaluations such as mathematical reasoning and safety alignment rely more heavily on precise intermediate computations and task\-specialized circuits formed in deeper layers\. Applying WeMask uniformly across all post\-ME Layer layers during inference in these tasks may introduce unnecessary interference, potentially suppressing useful task\-dependent representations\. To address this, we adopt a more targeted intervention strategy: WeMask is applied only at the first layer where massive activation emerges\. This setting directly mitigates the initial source of representational rigidity while allowing subsequent layers to operate largely unperturbed, thereby preserving the model’s capacity for fine\-grained reasoning and decision making\. This design balances effectiveness and minimality: WeMask is applied broadly during training to reshape representation learning, while during inference it is selectively deployed to correct the root cause of rigidity without over\-constraining downstream computations\.
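A minimal sketch of this layer-selection policy is shown below. The function name `wemask_layers` and the task labels are hypothetical illustrations; actually installing WeMask on the returned layers is assumed to be handled elsewhere.

```python
def wemask_layers(me_layer: int, num_layers: int, mode: str, task: str = "general"):
    """Return the decoder-layer indices on which WeMask should be active."""
    if mode == "train":
        # Training: regularize every layer from the onset of massive activation onward.
        return list(range(me_layer, num_layers))
    if task in {"math", "safety"}:
        # Task-specific inference: intervene only at the source of the rigidity.
        return [me_layer]
    # Knowledge-generalization tasks: mirror the training-time configuration.
    return list(range(me_layer, num_layers))

# Illustrative example: an ME Layer at index 7 in a 36-layer model.
print(wemask_layers(7, 36, mode="train"))              # [7, 8, ..., 35]
print(wemask_layers(7, 36, mode="eval", task="math"))  # [7]
```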
## Appendix D Stability of the ME Layer
In this section, we demonstrate that the emergence of the ME Layer is not an incidental phenomenon tied to specific input examples, but a systematic and input-agnostic behavior of the model. We adopt Qwen3-4B as the base model for analysis and evaluate its behavior under a diverse set of input conditions. Specifically, we construct inputs spanning multiple task categories, including commonsense question answering, mathematical problem solving, logical reasoning, and open-ended text continuation. In addition, we vary the input length from short sequences of approximately 10 tokens to long contexts exceeding 1,000 tokens. As shown in [Figure 10](https://arxiv.org/html/2605.08504#A4.F10), regardless of input type or sequence length, Qwen3-4B consistently exhibits massive activation at the same layer, which we identify as the ME Layer. This consistency across heterogeneous inputs indicates that the ME Layer reflects an intrinsic property of the model's internal representation dynamics, rather than a task-specific or input-dependent artifact.
Figure 10:L2 norm of the first token across layers for different input instances\. Each curve corresponds to a distinct example\.
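As a rough illustration of how this stability check can be reproduced, the sketch below measures the first token's L2 norm at every layer and reports the layer with the largest jump over its predecessor. The checkpoint id, the prompts, and the jump-based detection criterion are assumptions rather than the paper's exact protocol.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()

prompts = [
    "What is the capital of France?",             # commonsense QA
    "Compute 17 * 24 step by step.",              # math
    "Continue the story: Once upon a time, ...",  # open-ended continuation
]

for p in prompts:
    ids = tok(p, return_tensors="pt").input_ids
    with torch.no_grad():
        out = model(ids, output_hidden_states=True)
    # hidden_states[l] has shape (batch, seq, dim); track the first token's L2 norm.
    norms = torch.stack([h[0, 0].float().norm() for h in out.hidden_states])
    # Identify the decoder layer whose output norm jumps most over the previous layer.
    ratios = norms[1:] / norms[:-1]
    me_layer = int(torch.argmax(ratios))  # 0-indexed decoder layer producing the jump
    print(f"{p[:30]!r:35} ME Layer = {me_layer}, max ratio = {ratios.max():.1f}")
```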
## Appendix E Performance of Different Mask Methods
In this section, we evaluate different masking strategies by incorporating them into the inference stage as training\-free interventions, in order to examine their impact on model performance\. For each masking method, we adopt the mask ratio that yields the best performance on the corresponding benchmark, as reported in[Table 1](https://arxiv.org/html/2605.08504#S4.T1)\.
Random Mask randomly masks a fixed proportion of dimensions in the hidden state of the massive-activation token. Magnitude Mask masks the top-k dimensions with the largest activation magnitudes in the massive-activation token. The results are summarized in [Table 5](https://arxiv.org/html/2605.08504#A5.T5). We observe that, except for our method, all alternative masking strategies lead to a substantial degradation in model performance, often causing severe harm to the model's reasoning ability. In contrast, our method consistently improves performance across benchmarks. These results demonstrate that indiscriminately masking dimensions, either randomly or based solely on activation magnitude, destroys critical representational structure, whereas selectively masking dimensions guided by RMSNorm weights provides a principled and effective way to suppress harmful dominance while preserving useful information.
Table 5:Performance of different masking strategies applied to Qwen3\-4B across multiple benchmarks\.
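The toy comparison below sketches the three selection rules on a synthetic hidden state. It assumes that "masking" means zeroing the selected dimensions and uses a random vector `gamma` as a stand-in for the ME Layer's RMSNorm weights, so it only illustrates the selection rules, not the exact WeMask criterion.

```python
import torch

torch.manual_seed(0)
dim, ratio = 4096, 0.1
k = int(ratio * dim)
h = torch.randn(dim)
h[torch.randint(0, dim, (4,))] *= 1e3       # inject a few "massive" dimensions
gamma = torch.rand(dim)                     # stand-in for the ME Layer's RMSNorm weights

def random_mask(h, k):
    idx = torch.randperm(h.numel())[:k]
    out = h.clone(); out[idx] = 0.0
    return out

def magnitude_mask(h, k):
    idx = h.abs().topk(k).indices
    out = h.clone(); out[idx] = 0.0
    return out

def weight_guided_mask(h, gamma, k):
    # Mask the dimensions most strongly amplified by the RMSNorm weights.
    idx = gamma.abs().topk(k).indices
    out = h.clone(); out[idx] = 0.0
    return out

for name, masked in [("random", random_mask(h, k)),
                     ("magnitude", magnitude_mask(h, k)),
                     ("weight-guided", weight_guided_mask(h, gamma, k))]:
    print(f"{name:14} ||h|| {h.norm():.1f} -> {masked.norm():.1f}")
```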
## Appendix F Performance of Other Models
Table 6: Performance of our method applied to different models on several benchmarks.

To evaluate the generality of our method, we further select Llama-3.1-8B-Instruct and Qwen3-8B as base models and fine-tune them using WeMask. We then evaluate the resulting models on MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.08504#bib.bib13)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.08504#bib.bib14)), ARC-C (Clark et al., [2018](https://arxiv.org/html/2605.08504#bib.bib15)), OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2605.08504#bib.bib16)), and MathQA (Amini et al., [2019](https://arxiv.org/html/2605.08504#bib.bib18)). The results are reported in [Table 6](https://arxiv.org/html/2605.08504#A6.T6). As shown in the table, compared to the training-free variant, the SFT-based WeMask approach exhibits more stable performance and consistently outperforms the standard SFT baselines across multiple benchmarks. These results demonstrate that WeMask generalizes well across different model architectures and reliably improves model performance.
## Appendix G Comparison with Other Methods that Eliminate Attention Sinks
In the preceding sections, we examined the relationship between our method and the attention sink phenomenon. In this section, we directly compare the effectiveness of our method with existing attention-sink removal approaches (Qiu et al., [2025](https://arxiv.org/html/2605.08504#bib.bib4)). We adopt the gated attention method to fine-tune the model with supervised fine-tuning (SFT) and evaluate its performance on MMLU (Hendrycks et al., [2021](https://arxiv.org/html/2605.08504#bib.bib13)), PIQA (Bisk et al., [2020](https://arxiv.org/html/2605.08504#bib.bib14)), ARC-C (Clark et al., [2018](https://arxiv.org/html/2605.08504#bib.bib15)), OpenBookQA (Mihaylov et al., [2018](https://arxiv.org/html/2605.08504#bib.bib16)), and StrategyQA (Geva et al., [2021](https://arxiv.org/html/2605.08504#bib.bib17)). The results are summarized in Table 7. We observe that, compared to methods that directly suppress attention sinks within the attention module, our approach achieves consistently better performance after fine-tuning. These results further support the validity of our new perspective on attention sinks. Specifically, Qiu et al. ([2025](https://arxiv.org/html/2605.08504#bib.bib4)) primarily introduce gated modules during the pre-training stage to eliminate attention sinks and improve performance. However, when applied during fine-tuning, such interventions may disrupt representations and inductive biases already learned by the model, leading to suboptimal results. In contrast, our method, which is applicable in both training-free and fine-tuning settings, provides a simpler and more effective way to improve model performance while mitigating the impact of attention sinks.
Table 7:Performance of our method compared to other attention sink removal methods, with the mask rate set to 0\.1\.
## Appendix H The Universality of the ME Layer
In[Table 8](https://arxiv.org/html/2605.08504#A8.T8), we present the ME Layer indices for different models\. The results show that the ME Layer is a ubiquitous phenomenon across architectures, and its position is largely consistent within the same model family\. For example, both Qwen3\-8B and Qwen3\-4B\-Instruct locate the ME Layer at layer 7\.
Table 8: The position of the ME Layer in different models and its magnification relative to the previous layer.

In this appendix, we show the L2 norm of the hidden state after RMSNorm, after the FFN, and at the decoder-layer output for different models, to illustrate the universality of the ME Layer. [Figure 11](https://arxiv.org/html/2605.08504#A8.F11), [Figure 12](https://arxiv.org/html/2605.08504#A8.F12), [Figure 13](https://arxiv.org/html/2605.08504#A8.F13), [Figure 14](https://arxiv.org/html/2605.08504#A8.F14), [Figure 15](https://arxiv.org/html/2605.08504#A8.F15), [Figure 16](https://arxiv.org/html/2605.08504#A8.F16), [Figure 17](https://arxiv.org/html/2605.08504#A8.F17), [Figure 18](https://arxiv.org/html/2605.08504#A8.F18), [Figure 19](https://arxiv.org/html/2605.08504#A8.F19), and [Figure 20](https://arxiv.org/html/2605.08504#A8.F20) show the outputs of RMSNorm, the FFN, and the decoder layer. We observe that the ME Layer consistently exists across all evaluated models. For models within the same family, such as Qwen3-8B and Qwen3-4B, the ME Layer emerges at the same layer. The output of RMSNorm in Llama3.1 exhibits a different pattern from that in Qwen3: in Llama3.1 and Mistral, the L2 norm of the massive-activation token continues to increase after the ME Layer, whereas in Qwen3 models it peaks sharply at the ME Layer. Despite this difference in post-ME Layer behavior, both architectures share a common characteristic: within the ME Layer, the L2 norm of the massive-activation token reaches its maximum, indicating a structurally consistent emergence of massive activations across model families.
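A minimal instrumentation sketch for reproducing these per-module norm curves is given below, assuming a Llama/Qwen-style architecture that exposes `model.model.layers`, `.post_attention_layernorm`, and `.mlp`; the checkpoint id and the probe prompt are placeholders.

```python
import torch
from collections import defaultdict
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-8B"  # assumed checkpoint id
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16).eval()
ids = tok("The quick brown fox jumps over the lazy dog.", return_tensors="pt").input_ids

norms = defaultdict(list)  # module name -> per-layer L2 norm of the first token

def make_hook(name):
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        norms[name].append(hidden[0, 0].float().norm().item())
    return hook

handles = []
for layer in model.model.layers:
    handles.append(layer.post_attention_layernorm.register_forward_hook(make_hook("rmsnorm")))
    handles.append(layer.mlp.register_forward_hook(make_hook("ffn")))
    handles.append(layer.register_forward_hook(make_hook("decoder_out")))

with torch.no_grad():
    model(ids)
for h in handles:
    h.remove()

# One curve per module type, mirroring the per-layer plots in Figures 11-20.
for name, values in norms.items():
    print(name, [round(v, 1) for v in values])
```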
Figure 11: The hidden state at the output of RMSNorm, FFN, and decoder layer on Qwen3-8B.
Figure 12: The hidden state at the output of RMSNorm, FFN, and decoder layer on Qwen3-4B-Instruct.
Figure 13: The hidden state at the output of RMSNorm, FFN, and decoder layer on Qwen2.5-7B.
Figure 14: The hidden state at the output of RMSNorm, FFN, and decoder layer on Qwen2.5-7B-Instruct.
Figure 15: The hidden state at the output of RMSNorm, FFN, and decoder layer on Qwen2.5-32B.
Figure 16: The hidden state at the output of RMSNorm, FFN, and decoder layer on Llama3.1-8B.
Figure 17: The hidden state at the output of RMSNorm, FFN, and decoder layer on Llama3.1-8B-Instruct.
Figure 18: The hidden state at the output of RMSNorm, FFN, and decoder layer on Mistral-7B-v0.1.
Figure 19: The hidden state at the output of RMSNorm, FFN, and decoder layer on DeepSeek-llm-7b-chat.
Figure 20: The hidden state at the output of RMSNorm, FFN, and decoder layer on Phi3-mini-4k-instruct.