@gurtej__gill_: The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in …

X AI KOLs Timeline 07/03/26, 04:50 PM Papers

transformer attention-residuals deep-learning pre-training reasoning architecture

Summary

The Kimi Team's paper 'Attention Residuals' (AttnRes) replaces uniform residual connections in Transformers with softmax attention over depth, allowing each layer to dynamically select earlier representations. In pre-training a 48B parameter Kimi Linear model on 1.4 trillion tokens, the method stabilized hidden-state growth and improved performance on reasoning tasks.
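To make the contrast concrete, here is a minimal NumPy sketch of uniform residual addition next to softmax attention over depth, for a single token. It is an illustration under stated assumptions, not the paper's implementation: the per-layer `query` vector, the scaled dot-product scoring, and the function names are all hypothetical, and the paper's actual parameterization may differ.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def uniform_residual(outputs):
    # Standard residual stream: every earlier output is added with weight 1.
    return np.sum(outputs, axis=0)

def depth_attention(outputs, query):
    # Mix earlier representations with softmax weights over the depth axis.
    # outputs: (num_earlier_layers, d), one row per earlier representation
    # query:   (d,), a hypothetical learned per-layer query (an assumption)
    d = outputs.shape[-1]
    scores = outputs @ query / np.sqrt(d)   # one score per earlier layer
    weights = softmax(scores)               # non-negative, sums to 1
    return weights @ outputs, weights

rng = np.random.default_rng(0)
outputs = rng.normal(size=(6, 8))   # 6 earlier layers, hidden size 8
query = rng.normal(size=8)

summed = uniform_residual(outputs)
mixed, weights = depth_attention(outputs, query)
```

Because the softmax weights form a convex combination, the norm of `mixed` cannot exceed the largest norm among the earlier representations, whereas the plain sum in `uniform_residual` can keep growing as layers are added. That is one intuition for the hidden-state stabilization the summary mentions.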

The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in Transformer design: “Uniform Residual Connections”. Right now, we blindly add every layer’s output together with the exact same weight, which dilutes early-layer features as networks get deeper. Their paper “Attention Residuals” (AttnRes) replaces that rigid addition with softmax attention over the depth dimension, so each layer can dynamically look back and grab only the specific earlier representations it actually needs.

To keep memory from exploding at scale, they designed “Block AttnRes”, which is described in the paper. It groups layers into blocks so the attention step runs only across block-level representations rather than every individual layer. They pre-trained this on 1.4 trillion tokens using a 48B parameter Kimi Linear model; it stabilized hidden state growth and gave a big boost on heavy reasoning tasks.

It’s a great reminder that tweaking how information flows through a network can be way more powerful than just throwing more compute at it. Read the full paper here: https://arxiv.org/pdf/2603.15031
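The block idea from the post can be sketched in a few lines of NumPy. This is a rough illustration only: summarizing each block by summing its layer outputs, the `block_size` parameter, the per-layer `query` vector, and all function names are assumptions made for the example, not details confirmed by the post or taken from the paper.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D array.
    x = x - np.max(x)
    e = np.exp(x)
    return e / e.sum()

def block_summaries(layer_outputs, block_size):
    # Assumption: layers inside a block are combined by ordinary addition,
    # giving one summary vector per block.
    n = layer_outputs.shape[0]
    return np.stack([
        layer_outputs[start:start + block_size].sum(axis=0)
        for start in range(0, n, block_size)
    ])

def block_depth_attention(layer_outputs, query, block_size):
    # Attend over block summaries instead of over every individual layer,
    # so the number of attended items is ceil(num_layers / block_size).
    summaries = block_summaries(layer_outputs, block_size)
    scores = summaries @ query / np.sqrt(summaries.shape[-1])
    weights = softmax(scores)
    return weights @ summaries, weights

rng = np.random.default_rng(0)
layer_outputs = rng.normal(size=(12, 8))   # 12 layers, hidden size 8
query = rng.normal(size=8)                 # hypothetical learned query

summaries = block_summaries(layer_outputs, block_size=4)
mixed, weights = block_depth_attention(layer_outputs, query, block_size=4)
```

With 12 layers and a block size of 4, only 3 summaries need to be kept and scored instead of 12 per-layer outputs, which is the memory saving the post is pointing at.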

Original Article

Cached at: 07/04/26, 08:41 AM

Source: https://arxiv.org/html/2603.15031

Similar Articles

Delta Attention Residuals

Kuramoto Attention: Synchronizing Self-Attention on the Torus

Adaptive Computation Depth via Learned Token Routing in Transformers

@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432

@omarsar0: NEW paper worth reading. (bookmark it) The basic idea is to pair a compressive recurrent state with a small exact memor…
