Chiaroscuro Attention: Spending Compute in the Dark

Hugging Face Daily Papers 06/06/26, 12:00 AM Papers

spectral-entropy routing hybrid-attention transformer-efficiency dct rbf wikitext

Summary

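CHIAR-Former routes each token to a DCT, RBF, or self-attention operator based on its spectral entropy. On WikiText-103, a DCT+Attention-only variant reaches validation perplexity 36.54 against 66.62 for a full-attention baseline while using 62.5% fewer attention FLOPs; full attention remains stronger on small datasets and synthetic pattern-matching tasks.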

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
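To make the routing idea concrete, here is a minimal sketch of a per-token spectral entropy score and a hard three-way router. The abstract does not give the exact formulation, so everything here is an assumption for illustration: the entropy is taken over the normalised DCT-II power spectrum of a single token's feature vector, low entropy is assumed to map to the cheaper operators, and the names `dct2`, `spectral_entropy`, `route`, and the thresholds `low` and `high` are hypothetical rather than taken from the paper.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a 1-D sequence (O(n^2), for illustration only)."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def spectral_entropy(x, eps=1e-12):
    """Shannon entropy of the normalised DCT power spectrum, scaled to [0, 1].

    Assumes len(x) >= 2 so that the log(n) normaliser is non-zero.
    """
    power = [c * c for c in dct2(x)]
    total = sum(power) + eps  # eps guards against an all-zero input
    probs = [p / total for p in power]
    entropy = -sum(p * math.log(p) for p in probs if p > 0.0)
    return entropy / math.log(len(x))


def route(entropy, low=0.3, high=0.7):
    """Hard three-way routing on the entropy score.

    The thresholds and the entropy-to-operator direction are hypothetical.
    """
    if entropy < low:
        return "dct"
    if entropy < high:
        return "rbf"
    return "attention"


# A constant vector concentrates all energy in one DCT coefficient,
# while a single spike spreads energy across many coefficients.
flat_token = [1.0] * 8
spiky_token = [1.0] + [0.0] * 7
for name, token in [("flat", flat_token), ("spiky", spiky_token)]:
    score = spectral_entropy(token)
    print(name, round(score, 3), route(score))
```

Under these assumptions, a token whose energy sits in a few frequency components scores near zero and goes to the cheap spectral operator, while a token with a broad spectrum scores closer to one and is sent to full attention. The routing collapse reported in the abstract would correspond to the middle band being used rarely or not at all.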

Original Article

Cached at: 06/09/26, 08:40 AM

Source: https://huggingface.co/papers/2606.08327

Similar Articles

Chimera: Designing and Chinchilla-Scaling Hybrid Visual Diffusion Transformers

The Routing and Filtering Structure of Attention

Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

SEGA: Spectral-Energy Guided Attention for Resolution Extrapolation in Diffusion Transformers

Uncertainty-gated selection for block-sparse attention
