The Routing and Filtering Structure of Attention

arXiv cs.LG 05/20/26, 04:00 AM Papers

attention transformer routing filtering decomposition spectral-cascade linear-attention

Summary

The paper decomposes the attention interaction matrix $QK^{\top}$ into a skew-symmetric routing component and a symmetric filtering component, and introduces $S$-$D$ attention as a parameterization that disentangles the two. With routing isolated, it finds a spectral cascade that predicts where attention can be simplified: cascade-allocated architectures use 47% to 65% fewer attention parameters at a perplexity cost of +3.9% to +8.4%.
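The split named in the summary is the standard symmetric/skew-symmetric decomposition of a square matrix. Below is a minimal NumPy sketch of it, assuming a random stand-in for the weight kernel rather than a pretrained head. The names `split_routing_filtering` and `effective_rank` are hypothetical, and the entropy-based effective rank is an assumed definition; the paper's own metric may differ.

```python
import numpy as np

def split_routing_filtering(M):
    """Split a square matrix into symmetric and skew-symmetric parts."""
    S = 0.5 * (M + M.T)  # symmetric part ("filtering" in the paper's terms)
    A = 0.5 * (M - M.T)  # skew-symmetric part ("routing" in the paper's terms)
    return S, A

def effective_rank(X, eps=1e-12):
    """Entropy-based effective rank: exp of the entropy of the
    normalized singular values (an assumed definition)."""
    s = np.linalg.svd(X, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-(p * np.log(p)).sum()))

# Random stand-in for one head's kernel W_q W_k^T (not real model weights).
rng = np.random.default_rng(0)
d_model, d_head = 64, 16
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))
M = W_q @ W_k.T  # shape (d_model, d_model)

S, A = split_routing_filtering(M)
print("reconstruction error:", np.abs(S + A - M).max())
print("effective rank of routing part:", effective_rank(A))
print("effective rank of filtering part:", effective_rank(S))
```

For a bilinear score $x^{\top} M y$, the symmetric part scores a pair identically in both directions, while the skew-symmetric part flips sign when $x$ and $y$ are swapped. That asymmetry is what makes it a natural carrier for directed information flow.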

arXiv:2605.18826v1

Abstract: The attention interaction matrix $QK^{\top}$ contains two entangled computations: a skew-symmetric component that redistributes information between positions (routing) and a symmetric component that scales mutual relevance (filtering). We decompose 1776 heads across five pretrained transformers and find routing operating at low rank, well below the routing capacity allocated by the weight kernel. We introduce $S$-$D$ attention as a diagnostic parameterization that disentangles routing from filtering by construction with guaranteed stability ($\mathrm{Re}(\lambda) \le 0$) and trains stably without layer normalization. When disentangled and unnormalized, routing self-organizes into a spectral cascade, effective rank $2$ at the first layer, expanding with depth across six scales from 7M to 355M parameters. The cascade predicts where attention can be simplified: linearizing the first seven layers of 125M $S$-$D$ attention costs ${<}5\%$ perplexity, whereas standard attention collapses under the same intervention. The linearizable region widens with depth. Replacing the first four layers with ELU+1 linear attention reaches within $1.4\%$ of baseline at full head dimension. Cascade-allocated architectures trade attention parameters for perplexity ($47\%$ to $65\%$ fewer attention parameters at $+3.9\%$ to $+8.4\%$ PPL). The routing-filtering decomposition makes the spectral budget legible; the cascade makes it actionable.
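The ELU+1 linear attention mentioned in the abstract replaces the softmax with a positive feature map $\phi(x) = \mathrm{ELU}(x) + 1$, so attention can be computed from running sums instead of a full score matrix. The sketch below is a generic single-head version, not the paper's implementation: the causal masking, the normalization, and the function names `elu_plus_one` and `causal_linear_attention` are all assumptions.

```python
import numpy as np

def elu_plus_one(x):
    # ELU(x) + 1: x + 1 for x > 0, exp(x) otherwise; strictly positive.
    return np.where(x > 0, x + 1.0, np.exp(np.minimum(x, 0.0)))

def causal_linear_attention(Q, K, V):
    """Causal linear attention for one head.

    Q, K: (n, d) arrays; V: (n, d_v) array. Runs in O(n * d * d_v)
    by carrying running sums instead of an (n, n) score matrix.
    """
    phi_q, phi_k = elu_plus_one(Q), elu_plus_one(K)
    n, d = Q.shape
    state = np.zeros((d, V.shape[1]))  # running sum of phi(k_j) v_j^T
    norm = np.zeros(d)                 # running sum of phi(k_j)
    out = np.empty((n, V.shape[1]))
    for i in range(n):
        state += np.outer(phi_k[i], V[i])
        norm += phi_k[i]
        out[i] = (phi_q[i] @ state) / (phi_q[i] @ norm)
    return out

# Toy inputs (random, for illustration only).
rng = np.random.default_rng(0)
n, d = 8, 4
Q = rng.standard_normal((n, d))
K = rng.standard_normal((n, d))
V = rng.standard_normal((n, d))
Y = causal_linear_attention(Q, K, V)
print(Y.shape)
```

Because the feature map is strictly positive, the normalizer is never zero, and each output row is a convex combination of the value rows seen so far.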


Similar Articles

Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

Reddit r/artificial

This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.

Functional Attention: From Pairwise Affinities to Functional Correspondences

Hugging Face Daily Papers

Functional Attention is a novel attention mechanism that reinterprets attention as a functional correspondence between adaptive bases, replacing softmax affinities with structured linear operators inspired by geometric functional maps. The method achieves state-of-the-art performance on operator learning tasks including PDE solving and 3D segmentation while remaining resolution-invariant.

The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content

arXiv cs.CL

This paper identifies and formalizes the 'structural attention tax' phenomenon, where the format of retrieved content (e.g., knowledge graph triples) independently distorts LLM attention distribution regardless of semantic relevance, leading to compressed demonstration attention. It provides a formal framework, empirical evidence across models and benchmarks, and proposes structure-aware mitigation strategies.

Chiaroscuro Attention: Spending Compute in the Dark

Hugging Face Daily Papers

CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.

Rethinking the Role of Efficient Attention in Hybrid Architectures