Chiaroscuro Attention: Spending Compute in the Dark
Summary
CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.
Source: https://huggingface.co/papers/2606.08327
Abstract
Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators (DCT spectral mixing, RBF kernel mixing, or full self-attention) based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103, a 45% reduction in perplexity relative to a full-attention baseline (PPL 66.62), at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings, both the wins and the losses, together define when and why spectral routing earns its keep.
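To make the routing idea concrete, here is a minimal sketch of entropy-based operator selection for a single token vector. It is not the paper's implementation: the abstract does not say how spectral entropy is computed or how it maps to operators. The sketch assumes the entropy is the normalised Shannon entropy of the token's DCT power spectrum, and that low, middle, and high entropy map to DCT, RBF, and attention respectively. The function names `dct2`, `spectral_entropy`, and `route`, and the thresholds `low` and `high`, are all hypothetical.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a sequence of floats (O(n^2), for illustration)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Normalised Shannon entropy of the DCT power spectrum, in [0, 1].

    Assumption: this is one plausible definition of "per-token spectral
    entropy"; the paper's exact definition may differ.
    """
    if len(x) < 2:
        return 0.0
    power = [c * c for c in dct2(x)]
    total = sum(power)
    if total <= eps:
        return 0.0  # an all-zero vector carries no spectral content
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(x))


def route(x, low=0.3, high=0.7):
    """Pick an operator name from the entropy.

    Assumption: the thresholds and the entropy-to-operator mapping are
    hypothetical placeholders, not values from the paper.
    """
    h = spectral_entropy(x)
    if h < low:
        return "dct"
    if h < high:
        return "rbf"
    return "attention"


# A constant vector concentrates all energy in the DC coefficient,
# so its spectral entropy is near zero.
flat_token = [1.0] * 8
# A one-hot vector spreads energy across many DCT coefficients.
spiky_token = [1.0] + [0.0] * 7

print(route(flat_token), route(spiky_token))
```

Under these assumptions, a token whose energy sits in a few frequency components is cheap to mix spectrally, while a token with a flat spectrum is sent to full attention. A learned router, as the abstract's "routing collapse" finding suggests the paper uses, would replace the fixed thresholds.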
Similar Articles
The Routing and Filtering Structure of Attention
The paper decomposes the attention interaction matrix into routing (skew-symmetric) and filtering (symmetric) components, introducing S-D attention to disentangle them. It reveals a spectral cascade in routing that predicts where attention can be simplified, achieving significant parameter reduction with minimal perplexity loss.
Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.
SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers
SEGA is a training-free method that improves high-resolution text-to-image generation by adaptively scaling attention across RoPE components based on spatial-frequency structure during denoising steps.
Interdomain Attention: Beyond Token-Level Key-Value Memory
Proposes Interdomain Attention, a new method that integrates state space models into attention via kernel methods, achieving efficient long-context modeling with a fixed-size state and outperforming SSMs and softmax attention in language modeling experiments up to 1.3B parameters.
Elastic Attention Cores for Scalable Vision Transformers [R]
This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention methods like DINOv3.