Delta Attention Residuals
Summary
Delta Attention Residuals improve layer-wise routing in transformer models by attending to feature changes (deltas) rather than cumulative hidden states, achieving 1.7–8.2% validation perplexity gains across scales from 220M to 7.6B parameters.
Cached at: 05/20/26, 02:35 AM
Source: https://huggingface.co/papers/2605.18855
Abstract
Delta Attention Residuals improve layer-wise routing by attending to feature changes rather than cumulative states, resulting in better attention distributions and model performance across different scales.
Attention Residuals replace standard additive residual connections with learned softmax attention over previous layer outputs, enabling selective cross-layer routing. However, standard Attention Residuals still attend over cumulative hidden states in previous layers, which are highly redundant. We show that this redundancy leads to routing collapse in deeper layers: attention weights become low-contrast and closer to uniform (max weight ≈ 0.2), limiting the model's ability to select informative states in previous layers. This raises a key but underexplored design question: what layer-wise representations should be routed in Attention Residuals? To answer this question, we propose Delta Attention Residuals, which attend over deltas, the change introduced by each sublayer (v_i = h_{i+1} - h_i), instead of cumulative states. Delta representations are structurally diverse and yield higher-contrast attention distributions (max weight ≈ 0.6), enabling more selective and effective routing across layers. This principle applies at both per-sublayer and block granularity. Across all tested scales (220M–7.6B), Delta Attention Residuals consistently outperform both standard residuals and Attention Residuals, with 1.7–8.2% validation perplexity gains. Pretrained checkpoints can also be converted to Delta Attention Residuals via standard fine-tuning. Code is available at https://github.com/wdlctc/delta-attention-residuals-code.
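The difference between routing over cumulative states and routing over deltas can be illustrated with a small toy sketch. This is not the authors' implementation: the function names, the single query vector, the dot-product scoring, and the hand-picked hidden states are all assumptions made for illustration, and the numbers it produces are not results from the paper.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def deltas(states):
    """v_i = h_{i+1} - h_i for each pair of consecutive hidden states."""
    return [[b - a for a, b in zip(h, h_next)]
            for h, h_next in zip(states, states[1:])]

def route(sources, query):
    """Softmax attention over `sources` using a hypothetical learned query.

    Returns the attention weights and the weighted mix of the sources.
    Scaled dot-product scoring is an assumption, not the paper's exact form.
    """
    dim = len(query)
    scores = [sum(q * x for q, x in zip(query, src)) / math.sqrt(dim)
              for src in sources]
    weights = softmax(scores)
    mixed = [sum(w * src[d] for w, src in zip(weights, sources))
             for d in range(dim)]
    return weights, mixed

# Toy hidden states h_0..h_3 (hand-picked, hypothetical values).
states = [
    [1.0, 0.0, 0.0],
    [1.0, 2.0, 0.0],
    [1.0, 2.0, -1.0],
    [0.5, 2.0, -1.0],
]
query = [0.0, 1.0, 0.0]  # stand-in for a learned routing query

# Routing over cumulative states h_1..h_3: every state carries the same
# component along the query, so the scores tie and the weights are flat.
cumulative_weights, _ = route(states[1:], query)

# Routing over deltas v_0..v_2: only v_0 changed that component,
# so the weights concentrate on it.
delta_sources = deltas(states)
delta_weights, delta_mix = route(delta_sources, query)

print("max weight, cumulative:", max(cumulative_weights))
print("max weight, deltas:    ", max(delta_weights))
```

In this toy setup the cumulative states keep re-presenting the same feature to the router, while each delta isolates what one sublayer contributed, which is the intuition behind the higher-contrast attention described above. Note that the deltas lose no information relative to the states: summing them onto h_0 recovers the final hidden state.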
Similar Articles
𝐃𝐞𝐥𝐭𝐚 𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐑𝐞𝐬𝐢𝐝𝐮𝐚𝐥𝐬 [R]
Delta Attention Residuals is a drop-in upgrade to residual connections that routes over deltas instead of cumulative hidden states, achieving sharper cross-layer routing and 1.7-8.2% lower perplexity at scales up to 7.6B parameters, and enabling fine-tuning of pretrained models like Qwen3-0.6B with negligible overhead.
Adaptive Computation Depth via Learned Token Routing in Transformers
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
Rethinking Cross-Layer Information Routing in Diffusion Transformers
This paper proposes Diffusion-Adaptive Routing (DAR), a learnable, timestep-adaptive residual replacement that improves cross-layer information flow in Diffusion Transformers, leading to significant training acceleration and quality improvements.
Variational Linear Attention: Stable Associative Memory for Long-Context Transformers
This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.