The Routing and Filtering Structure of Attention
Summary
The paper decomposes the attention interaction matrix into a routing (skew-symmetric) component and a filtering (symmetric) component, and introduces S-D attention to disentangle them. It reports a spectral cascade in the routing component that predicts where attention can be simplified; the resulting simplifications reduce parameter count significantly with minimal perplexity loss.
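The symmetric/skew-symmetric split mentioned in the summary can be illustrated with a minimal NumPy sketch. It assumes the interaction matrix of a single head is M = W_q W_k^T, so that the pre-softmax score between token vectors x_i and x_j is x_i^T M x_j. The names `split_interaction`, `W_q`, `W_k`, and the dimensions are hypothetical, chosen only for illustration; the summary does not give the paper's exact definitions, and S-D attention itself is not reproduced here.

```python
import numpy as np

def split_interaction(W_q, W_k):
    """Split M = W_q @ W_k.T into its symmetric and skew-symmetric parts.

    Assumption: M is the d_model x d_model bilinear form of one head.
    """
    M = W_q @ W_k.T
    sym = 0.5 * (M + M.T)    # symmetric part ("filtering" in the summary)
    skew = 0.5 * (M - M.T)   # skew-symmetric part ("routing" in the summary)
    return M, sym, skew

# Hypothetical sizes and random weights, for illustration only.
rng = np.random.default_rng(0)
d_model, d_head = 8, 4
W_q = rng.standard_normal((d_model, d_head))
W_k = rng.standard_normal((d_model, d_head))

M, sym, skew = split_interaction(W_q, W_k)

# Two hypothetical token vectors.
x_i = rng.standard_normal(d_model)
x_j = rng.standard_normal(d_model)

# The symmetric part scores (i, j) and (j, i) identically;
# the skew-symmetric part flips sign when the pair is swapped.
print("sym  i->j, j->i:", x_i @ sym @ x_j, x_j @ sym @ x_i)
print("skew i->j, j->i:", x_i @ skew @ x_j, x_j @ skew @ x_i)
```

Any square matrix splits uniquely this way, and the two parts always sum back to M. Because x^T A x = 0 for every skew-symmetric A, the skew part contributes nothing to a token's score against itself; it only encodes the direction-dependent difference between the (i, j) and (j, i) scores, which is one way to read the summary's "routing" label.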
Similar Articles
Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.
Functional Attention: From Pairwise Affinities to Functional Correspondences
Functional Attention is a novel attention mechanism that reinterprets attention as a functional correspondence between adaptive bases, replacing softmax affinities with structured linear operators inspired by geometric functional maps. The method achieves state-of-the-art performance on operator learning tasks including PDE solving and 3D segmentation while remaining resolution-invariant.
The Structural Attention Tax: How Retrieval Format Hijacks In-Context Learning Independent of Content
This paper identifies and formalizes the 'structural attention tax' phenomenon, where the format of retrieved content (e.g., knowledge graph triples) independently distorts LLM attention distribution regardless of semantic relevance, leading to compressed demonstration attention. It provides a formal framework, empirical evidence across models and benchmarks, and proposes structure-aware mitigation strategies.
Chiaroscuro Attention: Spending Compute in the Dark
CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.
Rethinking the Role of Efficient Attention in Hybrid Architectures
This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures. It finds that different designs converge in long-context performance under sufficient training, and that long-range retrieval is carried primarily by full attention while efficient attention shapes the optimization trajectory. The analysis also reveals a 'Large-Window Laziness' phenomenon.