Elastic Attention Cores for Scalable Vision Transformers [R]

Reddit r/MachineLearning Papers

Summary

This post presents a new paper on Elastic Attention Cores for Vision Transformers. It proposes a core-periphery block-sparse attention structure that scales better than dense self-attention while remaining competitive in accuracy with models such as DINOv3.

Wanted to share our latest paper on an alternative building block for Vision Transformers.

[Illustration of our model's accuracy and dense features](https://preview.redd.it/x4acnx478w0h1.png?width=2457&format=png&auto=webp&s=3ce49caf2b0cdea5d35141aebb7297862fdc6a7d)

Traditional ViTs use dense O(N²) self-attention, which becomes costly at higher resolutions. In this work, we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as O(2NC + C²) for C core tokens. We further train it with nested dropout, which lets the inference cost be adjusted elastically at test time. The full model achieves very competitive dense-prediction and classification accuracy compared with DINOv3, and is stable across resolutions (256 all the way to 1024).

Interestingly, the core-dense attention patterns exhibit strong emergent behavior. In the early layers of the network the attention maps are isotropic (spherical), but they become increasingly semantically aligned deeper into the network.

[Visual Elastic Core Attention paper abstract](https://preview.redd.it/zjea47ez7w0h1.png?width=935&format=png&auto=webp&s=dc78ddcd4b6faf5b135f78cd9881cdf6650e4cc8)

The number of core tokens also shapes the attention patterns. With fewer cores, the patterns become more diffuse and cover a spatially larger region. With more cores, they become smaller and more concentrated.

Paper: [https://arxiv.org/abs/2605.12491](https://arxiv.org/abs/2605.12491)

Project with the code (still in progress): [https://github.com/alansong1322/VECA](https://github.com/alansong1322/VECA)

Happy to answer any questions about our research.
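For readers who want to check the cost argument, here is a minimal sketch in plain Python. It is not taken from the VECA code. It assumes that "core-periphery" means core tokens attend to every token while patch (periphery) tokens attend only to the cores; under that assumption the number of allowed query-key pairs is 2NC + C². The function names, the 16-pixel patch size, and the uniform sampling of the core budget are all hypothetical choices made for illustration.

```python
import random


def core_periphery_mask(num_patches, num_cores):
    """Return mask[q][k] = True where query token q may attend to key token k.

    Tokens 0..num_cores-1 are cores; the rest are patch (periphery) tokens.
    Assumption: cores attend to every token, patches attend only to cores.
    """
    total = num_cores + num_patches
    return [
        [q < num_cores or k < num_cores for k in range(total)]
        for q in range(total)
    ]


def count_pairs(mask):
    """Number of (query, key) pairs the mask allows."""
    return sum(sum(row) for row in mask)


def sparse_pairs(num_patches, num_cores):
    """Closed form for the mask above: 2NC + C^2."""
    return 2 * num_patches * num_cores + num_cores ** 2


def dense_pairs(num_patches):
    """Dense self-attention over the patches alone: N^2."""
    return num_patches ** 2


def sample_core_budget(max_cores, rng):
    """Hypothetical nested-dropout step: keep only the first k cores.

    Uniform sampling of k is an assumption made for illustration.
    """
    return rng.randint(1, max_cores)


if __name__ == "__main__":
    # The brute-force count matches the closed form on a small case.
    assert count_pairs(core_periphery_mask(16, 4)) == sparse_pairs(16, 4)

    # Assumed 16x16 patches at resolution 1024 -> 64 * 64 = 4096 patch tokens.
    n = (1024 // 16) ** 2
    for c in (16, 64, 256):
        print(c, sparse_pairs(n, c), dense_pairs(n))

    # During training, a fresh core budget would be drawn per step.
    k = sample_core_budget(256, random.Random(0))
    print("cores kept this step:", k)
```

Because the sparse count grows linearly in N for a fixed C, choosing a smaller C at test time lowers the cost directly, which is the kind of elastic trade-off that training with a varying core budget is meant to allow.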

Similar Articles

@gurtej__gill_: The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in …

X AI KOLs Timeline

The Kimi Team's paper 'Attention Residuals' (AttnRes) replaces uniform residual connections in Transformers with softmax attention over depth, allowing each layer to dynamically select earlier representations. In a 48B-parameter model pre-trained on 1.4 trillion tokens, it stabilizes hidden states and significantly improves performance on reasoning tasks.

Generative modeling with sparse transformers

OpenAI Blog

OpenAI introduces the Sparse Transformer, a deep neural network that improves the attention mechanism from O(N²) to O(N√N) complexity, enabling modeling of sequences 30x longer than previously possible across text, images, and audio. The model uses sparse attention patterns and checkpoint-based memory optimization to train networks up to 128 layers deep, achieving state-of-the-art performance across multiple domains.

@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432

X AI KOLs Timeline

Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.