Chiaroscuro Attention: Spending Compute in the Dark

Hugging Face Daily Papers

Summary

CHIAR-Former uses spectral entropy-based routing to dynamically select between DCT, RBF, and self-attention operators, achieving improved efficiency on large text datasets while maintaining performance through hybrid attention mechanisms.

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
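
The routing idea in the abstract can be illustrated with a small sketch. The abstract does not give the exact entropy definition or the routing rule, so everything below is an assumption made for illustration: entropy is taken over the normalised DCT-II power spectrum of one token's feature vector, and the hypothetical thresholds `low` and `high` send low-entropy tokens to DCT mixing, mid-entropy tokens to RBF mixing, and high-entropy tokens to self-attention. It is not the authors' implementation.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a list of floats (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalised DCT power spectrum, scaled to [0, 1].

    Assumption: the paper's "per-token spectral entropy" may be defined
    differently; this is one common way to define a spectral entropy.
    """
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero vector
    probs = [p / total for p in power]
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return h / math.log(len(x))  # divide by the maximum entropy log(n)


def route(x, low=0.3, high=0.7):
    """Pick an operator name for one token vector.

    Assumption: `low` and `high` are hypothetical thresholds, and the mapping
    of entropy bands to operators is a guess, not taken from the paper.
    """
    h = spectral_entropy(x)
    if h < low:
        return "dct"
    if h < high:
        return "rbf"
    return "attention"


if __name__ == "__main__":
    flat = [1.0] * 8             # all energy in the DC coefficient
    impulse = [1.0] + [0.0] * 7  # energy spread across many coefficients
    print(route(flat), route(impulse))
```

In this sketch, setting `low` equal to `high` removes the RBF band entirely, which mirrors the DCT+Attention-only variant the abstract describes after observing routing collapse.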

Cached at: 06/09/26, 08:40 AM

Source: https://huggingface.co/papers/2606.08327

Similar Articles

The Routing and Filtering Structure of Attention

arXiv cs.LG

The paper decomposes the attention interaction matrix into routing (skew-symmetric) and filtering (symmetric) components, introducing S-D attention to disentangle them. It reveals a spectral cascade in routing that predicts where attention can be simplified, achieving significant parameter reduction with minimal perplexity loss.

Interdomain Attention: Beyond Token-Level Key-Value Memory

arXiv cs.LG

Proposes Interdomain Attention, a new method that integrates state space models into attention via kernel methods, achieving efficient long-context modeling with a fixed-size state and outperforming SSMs and softmax attention in language modeling experiments up to 1.3B parameters.

Elastic Attention Cores for Scalable Vision Transformers [R]

Reddit r/MachineLearning

This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention methods like DINOv3.