Delta Attention Residuals

Hugging Face Daily Papers 05/13/26, 12:00 AM Papers

attention-residuals delta-attention layer-routing deep-learning model-architecture fine-tuning

Summary

Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
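To put the reported range in more familiar units, the sketch below converts a relative perplexity reduction into the equivalent drop in cross-entropy loss, using the standard identity perplexity = exp(loss). It assumes that a "gain" of p% means perplexity falls by p% relative to the baseline, which the summary does not state explicitly; the function name is illustrative only.

```python
import math

def loss_drop_from_ppl_gain(relative_gain: float) -> float:
    """Cross-entropy drop (nats/token) implied by a relative perplexity reduction.

    Assumes ppl_new = ppl_base * (1 - relative_gain) and ppl = exp(loss),
    so loss_base - loss_new = -ln(1 - relative_gain).
    """
    if not 0.0 <= relative_gain < 1.0:
        raise ValueError("relative_gain must be in [0, 1)")
    return -math.log1p(-relative_gain)

for gain in (0.017, 0.082):  # endpoints of the reported 1.7-8.2% range
    drop = loss_drop_from_ppl_gain(gain)
    print(f"{gain:.1%} lower perplexity ~ {drop:.4f} nats lower loss per token")
```

Under this reading, the reported range corresponds to roughly 0.017 to 0.086 nats per token of validation loss.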

Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over the cumulative hidden states of previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states from previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M–7.6B parameters), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7–8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
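The difference between routing over cumulative states and routing over deltas can be illustrated with a toy NumPy sketch. This is not the authors' implementation: the random sublayer updates, the single routing query, and plain dot-product scoring without learned projections are all assumptions made for brevity. It only shows how deltas are formed from v_i = h_{i+1} - h_i and why cumulative states overlap with one another.

```python
import numpy as np

def softmax(scores: np.ndarray) -> np.ndarray:
    """Numerically stable softmax over a 1-D score vector."""
    shifted = scores - scores.max()
    exp = np.exp(shifted)
    return exp / exp.sum()

def routing_weights(query: np.ndarray, sources: np.ndarray) -> np.ndarray:
    """Softmax attention of one query over per-layer source vectors.

    Dot-product scoring without learned key projections is an assumption
    made for brevity, not the paper's parameterization.
    """
    scores = sources @ query / np.sqrt(sources.shape[-1])
    return softmax(scores)

def mean_pairwise_cosine(vectors: np.ndarray) -> float:
    """Average cosine similarity over all distinct pairs of rows."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sim = unit @ unit.T
    off_diagonal = ~np.eye(len(vectors), dtype=bool)
    return float(sim[off_diagonal].mean())

rng = np.random.default_rng(0)
num_sublayers, d_model = 8, 64

# Toy hidden states: h[0] is the embedding, h[i+1] = h[i] + (sublayer update).
# Random updates stand in for real sublayer outputs (an assumption).
h0 = rng.normal(size=d_model)
updates = 0.5 * rng.normal(size=(num_sublayers, d_model))
h = np.vstack([h0, h0 + np.cumsum(updates, axis=0)])  # (num_sublayers + 1, d_model)

cumulative = h[1:]       # cumulative state after each sublayer
deltas = h[1:] - h[:-1]  # v_i = h_{i+1} - h_i

# The deltas telescope back to the total change in the hidden state.
assert np.allclose(deltas.sum(axis=0), h[-1] - h[0])

query = rng.normal(size=d_model)  # hypothetical routing query
w_cumulative = routing_weights(query, cumulative)
w_delta = routing_weights(query, deltas)

print("mean pairwise cosine, cumulative:", round(mean_pairwise_cosine(cumulative), 3))
print("mean pairwise cosine, deltas:    ", round(mean_pairwise_cosine(deltas), 3))
print("max routing weight, cumulative:  ", round(float(w_cumulative.max()), 3))
print("max routing weight, deltas:      ", round(float(w_delta.max()), 3))
```

In this toy setup the cumulative states are strongly correlated because each one contains every earlier update, while the deltas are close to orthogonal. The printed max routing weights come from random vectors and an untrained query, so they should not be expected to match the ≈ 0.2 and ≈ 0.6 figures reported for trained models.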

Original Article

Cached at: 05/20/26, 02:35 AM

Paper page - Delta Attention Residuals

Source: https://huggingface.co/papers/2605.18855

Abstract

Delta Attention Residuals improve layer-wise routing by attending to feature changes rather than cumulative states, resulting in better attention distributions and model performance across different scales.


Similar Articles

Delta Attention Residuals [R]

Adaptive Computation Depth via Learned Token Routing in Transformers

Rethinking Cross-Layer Information Routing in Diffusion Transformers

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
