I Figured Out What Causes 'Super Weights'

Reddit r/ArtificialInteligence 06/23/26, 08:31 PM Papers

super-weights quantization softmax attention model-optimization llm research

Summary

Explains that super weights in large language models arise from the SoftMax-Attention interaction creating a 'Nothing Dump' token that serves as a stable reference point; removing these weights cripples performance.

'Super weights' are a phenomenon first highlighted by Apple in 2024. A very small number of a model's parameters are responsible for a very large part of its performance. The interesting thing is that these weights often look like straight-up 'garbage' once you examine them. You cannot get rid of them though: model performance drops 15-20% or more if you eliminate even one of them.

When a model is quantized, this is specifically accounted for. That is why new quantization methods for AI keep getting invented. Each new method gets better at retaining the full structure of the super weights while still quantizing everything else.

But why do super weights occur in the first place? If you could figure that out, you would not need to invent the exotic math downstream to account for them. Are they just an SGD artifact? That was my base assumption basically forever.

The research shows that the weights do not pool in the Attention layer, so Attention does not seem to be the direct cause. SoftMax does. There is a critical interaction between SoftMax and Attention that has not been explored for this particular problem. SoftMax forces the Attention weights at every step to sum to exactly 1.0. Even if the model does not want to devote any Attention on that step, it has nothing in its architecture to represent that.

So, it creates a 'Nothing Dump'. A random useless token becomes the 'Nothing Dump'. Maybe it's the first token every time, maybe it's the <BOS> token. It does not really matter which specific token it is. What matters is that the same token always ends up as the dump. That creates a stable reference point for nothing.

A stable reference point for nothing can be very useful because it can be measured against. You can measure something vs nothing, etc. The model can actually begin to utilize this during training. It becomes a Landmark within the Latent Space. Always there. Useful because it is always there, not because of what is in it. Nothing is in it lol.

If you ablate it though, you destroy the Landmark. The model can no longer measure against the Landmark, so you basically destroy all of that training by eliminating that one single parameter.

Deeper Visual Dive: https://youtu.be/hkom1BDuZHU
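The SoftMax constraint behind the 'Nothing Dump' idea can be sketched in a few lines of plain Python. This is only a toy illustration: the scores below are made-up numbers, not values measured from any model, and `content_scores` and `sink_score` are hypothetical names. It shows that SoftMax always hands out a full 1.0 of Attention, and that one token with a higher score can absorb nearly all of it. It does not demonstrate anything about super weights themselves.

```python
import math

def softmax(scores):
    """Numerically stable softmax: outputs are positive and sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical attention scores for one query over four content tokens.
# All are equally low: the query "wants" to attend to none of them.
content_scores = [-4.0, -4.0, -4.0, -4.0]
no_sink = softmax(content_scores)

# Same scores, plus one hypothetical dump token at index 0 with a higher score.
sink_score = 2.0
with_sink = softmax([sink_score] + content_scores)

# Without a dump token, the full 1.0 is still spread over the content tokens.
print("no sink:  ", no_sink, "sum =", sum(no_sink))

# With a dump token, almost all of the mass lands on index 0.
print("with sink:", with_sink[0], "left for content =", sum(with_sink[1:]))
```

In the first case every content token is forced to receive a quarter of the Attention even though none of them is relevant. In the second case the dump token takes roughly 99% of it, which leaves the content tokens close to zero.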


Similar Articles

"They're made out of weights"

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

A Single Layer to Explain Them All: Understanding Massive Activations in Large Language Models

Large Vision-Language Models Get Lost in Attention

Bug or Feature^2: Weight Drift, Activation Sparsity, and Spikes
