@gurtej__gill_: The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in …
Summary
The Kimi Team's paper 'Attention Residuals' (AttnRes) replaces uniform residual connections in Transformers with softmax attention over depth, allowing each layer to dynamically select earlier representations. In pre-training a 48B-parameter model on 1.4 trillion tokens, it stabilizes hidden-state growth and significantly improves performance on reasoning tasks.
Cached at: 07/04/26, 08:41 AM
The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in Transformer design: “uniform residual connections.”
Right now, we blindly add every layer’s output together with the exact same weight, which dilutes early-layer features as networks get deeper.
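A rough sketch of the dilution argument, assuming a standard residual stream where the embedding and every layer output are added with coefficient 1. The function names here are hypothetical and only for illustration; this is not taken from the paper.

```python
def residual_coefficients(num_layers):
    """Coefficients of [embedding, layer 1, ..., layer L] in the final
    residual stream under plain addition: every term enters with weight 1."""
    return [1.0] * (num_layers + 1)


def uniform_share(num_layers):
    """Fraction of the total coefficient mass held by any single term."""
    coeffs = residual_coefficients(num_layers)
    return coeffs[0] / sum(coeffs)


# The share of any one early contribution shrinks as depth grows.
for depth in (4, 16, 64):
    print(depth, uniform_share(depth))
```

With equal weights, no layer can ask for more of an early feature than it gets by default, which is the rigidity the post is pointing at.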
Their paper, “Attention Residuals” (AttnRes), replaces that rigid addition with softmax attention over the depth dimension.
It lets each layer dynamically look back and grab only the specific earlier representations it actually needs.
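Here is a minimal sketch of what "softmax attention over depth" could look like, not the paper's exact formulation. It assumes each layer has a learned query vector (`query` below is hypothetical), scores every earlier representation with a scaled dot product, and mixes them with the resulting softmax weights. The scaling and all names are assumptions.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    shifted = x - np.max(x)
    exp = np.exp(shifted)
    return exp / exp.sum()


def depth_attention(history, query):
    """Mix earlier representations with softmax weights over depth.

    history: (num_prev, d) array, one row per earlier representation.
    query:   (d,) array, a hypothetical learned per-layer vector.
    """
    history = np.asarray(history, dtype=float)
    scores = history @ query / np.sqrt(history.shape[1])
    weights = softmax(scores)
    return weights @ history, weights


rng = np.random.default_rng(0)
history = rng.normal(size=(5, 8))  # 5 earlier representations of width 8
query = rng.normal(size=8)
mixed, weights = depth_attention(history, query)
print(weights.round(3), weights.sum())
```

In this sketch a zero query gives equal weights across depth, so the learned query is what lets a layer favor some earlier representations over others.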
To keep memory from exploding at scale, they designed “Block AttnRes,” which is described in the paper.
It groups layers into blocks so the attention step runs only across block-level representations.
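A hedged sketch of the block idea: the post does not say how a block is summarized, so this assumes each block is represented by the sum of its layers' outputs and that attention then runs over those block summaries only. `block_summaries`, `attend_over_blocks`, and the query vector are hypothetical names, not the paper's API.

```python
import numpy as np


def block_summaries(layer_outputs, block_size):
    """Collapse consecutive layer outputs into one vector per block.

    Assumption: a block's summary is the sum of its layers' outputs.
    """
    layer_outputs = np.asarray(layer_outputs, dtype=float)
    num_layers = layer_outputs.shape[0]
    return np.stack([
        layer_outputs[start:start + block_size].sum(axis=0)
        for start in range(0, num_layers, block_size)
    ])


def attend_over_blocks(summaries, query):
    """Softmax attention across block summaries instead of every layer."""
    scores = summaries @ query / np.sqrt(summaries.shape[1])
    scores = scores - scores.max()  # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()
    return weights @ summaries, weights


rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(12, 8))  # 12 layers, width 8
summaries = block_summaries(layer_outputs, block_size=4)
mixed, weights = attend_over_blocks(summaries, rng.normal(size=8))
print(summaries.shape, weights.shape)
```

With 12 layers and a block size of 4, the attention step sees 3 stored vectors instead of 12, which is where the memory saving comes from in this sketch.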
They actually pre-trained this on 1.4 trillion tokens using a 48B-parameter Kimi Linear model, and it stabilized hidden-state growth while giving a big boost on heavy reasoning tasks.
It’s a great reminder that tweaking how the information flows through a network can be way more powerful than just throwing more compute at it.
Read the full paper here: https://arxiv.org/pdf/2603.15031
Source: https://arxiv.org/html/2603.15031
Similar Articles
Delta Attention Residuals
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
Kuramoto Attention: Synchronizing Self-Attention on the Torus
Introduces Kuramoto attention, a self-attention layer where hidden states are phase angles on a torus, enabling synchronization through gated cosine similarity and circular mean updates. The layer performs comparably to standard transformers on character-level language modeling.
Adaptive Computation Depth via Learned Token Routing in Transformers
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432
Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.
@omarsar0: NEW paper worth reading. (bookmark it) The basic idea is to pair a compressive recurrent state with a small exact memor…
HOLA (Hippocampal Linear Attention) augments linear attention with a bounded exact KV cache inspired by hippocampal memory, improving long-range recall and perplexity without sacrificing efficiency. At 340M parameters, it outperforms full-attention Transformers on Wikitext and achieves robust needle recall up to 32k tokens.