Explains that super weights in large language models arise from the SoftMax-Attention interaction creating a 'Nothing Dump' token that serves as a stable reference point; removing these weights cripples performance.
'Super weights' are a phenomenon first highlighted by Apple in 2024. A very small number of the model's parameters are responsible for a very large part of its performance. The interesting thing is that these weights often look like straight-up 'garbage' once you examine them. You cannot get rid of them, though: model performance drops 15-20% or more if you eliminate even one of them.

When a model is quantized, this is specifically accounted for. That is why new quantization methods for AI keep getting invented. The new methods keep getting better at retaining the full structure of the super weights while still quantizing everything else.

But why do super weights occur in the first place? If you could figure that out, you would not need to invent the exotic math downstream to account for them. Are they just an SGD artifact? That was my base assumption basically forever.

The research shows that the super weights do not pool in the Attention layer, so Attention itself does not seem to be the direct cause. SoftMax does. There is a critical interaction between SoftMax and Attention that has not been explored when it comes to this particular problem. Because of SoftMax, every turn of Attention has to hand out weights that sum to exactly 1.0. Even if the model does not want to devote any Attention that turn, it has nothing within its architecture to represent this. So it creates a 'Nothing Dump'.

A random useless token becomes the 'Nothing Dump'. Maybe it's the first token every time, maybe it's the <BOS> token. It does not really matter which specific token it is. What matters is that it is always the same token. That creates a stable reference point for nothing.

A stable reference point for nothing can be very useful, because it can be measured against. You can measure something vs. nothing, etc. You can actually begin to utilize this in your training. It becomes a Landmark within your Latent Space. Always there. Useful because it is always there, not because of what is in it. Nothing is in it lol.

If you ablate it, though, you destroy the Landmark. The model can no longer measure against the Landmark, so you basically destroy all of that training by eliminating that one single parameter.

Deeper Visual Dive: https://youtu.be/hkom1BDuZHU
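To make the SoftMax constraint in the post concrete, here is a minimal toy sketch in plain Python. The scores, the choice of index 0 as the 'Nothing Dump', and the helper name `softmax` are all assumptions made for illustration. It is not taken from any real model, and dropping the sink token here is only a loose stand-in for ablating a super weight.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of raw attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores for one query over four keys.
# Index 0 plays the role of the first/<BOS> token (the 'Nothing Dump').
no_match = [-4.0, -4.0, -4.0, -4.0]   # nothing is relevant, no sink available
with_sink = [6.0, -4.0, -4.0, -4.0]   # nothing is relevant, sink scores high

uniform = softmax(no_match)    # weights still sum to 1.0, spread evenly
dumped = softmax(with_sink)    # weights sum to 1.0, almost all on index 0

# Toy 'ablation': drop the sink and renormalize over the remaining tokens.
# The attention that was parked on the sink is forced back onto real tokens.
ablated = softmax(with_sink[1:])

print("no sink:  ", [round(w, 4) for w in uniform])
print("with sink:", [round(w, 4) for w in dumped])
print("ablated:  ", [round(w, 4) for w in ablated])
```

In this toy setup the weights sum to 1.0 in every case. The only thing that changes is where the unwanted attention ends up: parked on one fixed token, or smeared across tokens the query did not actually care about.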
A creative dialogue explores the idea that large language models are fundamentally just matrices of weights, challenging notions of understanding and sentience.
Introduces Contribution Weights, a projection-based metric that accounts for attention weight, value magnitude, and directional alignment to more faithfully measure token importance in transformer LLMs, revealing active functional roles of attention sinks.
This paper identifies the 'Massive Emergence Layer' where extreme activations in LLMs originate and propagate, proposing a method to mitigate their rigidity and improve model performance on tasks like math reasoning and instruction following.
This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
This paper formally proves that training neural networks with asymmetric activation functions like ReLU, GELU, or SiLU causes weights to drift negative, leading to up to 90% activation sparsity. It also shows that squared activations like ReLU² improve performance but cause activation spikes, which can be fixed by clipping, with GELU² achieving the best validation loss.