Elastic Attention Cores for Scalable Vision Transformers [R]

Reddit r/MachineLearning Papers

Summary

This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that scales better than dense self-attention while remaining competitive in accuracy with dense backbones such as DINOv3.

Wanted to share our latest paper on an alternative building block for Vision Transformers.

[Illustration of our model's accuracy and dense features](https://preview.redd.it/x4acnx478w0h1.png?width=2457&format=png&auto=webp&s=3ce49caf2b0cdea5d35141aebb7297862fdc6a7d)

Traditional ViTs use dense O(N²) self-attention, which becomes quite costly at higher resolutions. In this work we propose an alternative backbone with a core-periphery block-sparse attention structure that scales as O(2NC + C²) for C core tokens. We train it with nested dropout, which enables elastic adjustment of the inference cost at test time. The full model achieves very competitive dense and classification accuracy compared with DINOv3, and is stable across resolutions (256 all the way to 1024).

Interestingly, the core-dense attention patterns exhibit strong emergent behavior: in the early layers of the network the attention maps are isotropic (roughly spherical), but they become increasingly semantically aligned deeper into the network.

[Visual Elastic Core Attention paper abstract](https://preview.redd.it/zjea47ez7w0h1.png?width=935&format=png&auto=webp&s=dc78ddcd4b6faf5b135f78cd9881cdf6650e4cc8)

The number of core tokens can also be adjusted at inference time: decreasing it makes the attention patterns more diffuse and spatially larger, while increasing it makes them smaller and more concentrated. A single-head sketch of the block structure is included below.

Paper: [https://arxiv.org/abs/2605.12491](https://arxiv.org/abs/2605.12491)

Project with the code (still in progress): [https://github.com/alansong1322/VECA](https://github.com/alansong1322/VECA)

Happy to answer any questions about our research.
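Roughly, the core-periphery block structure for a single attention head can be sketched as follows. This is an illustrative simplification in plain PyTorch, with the convention (our assumption here) that the first C tokens form the core; the nested-dropout training that makes C elastic is omitted, and the actual implementation will live in the repo linked above.

```python
import torch
import torch.nn.functional as F

def core_periphery_attention(q, k, v, num_core):
    """Sketch of core-periphery block-sparse attention for one head.

    Tokens [0:num_core] are treated as core tokens; the rest are periphery.
    Core queries attend to every token, periphery queries attend only to the
    core, so the number of attention scores is roughly 2NC rather than N².
    """
    C = num_core

    # Core queries attend to all N keys (C x N scores).
    core_out = F.scaled_dot_product_attention(q[..., :C, :], k, v)

    # Periphery queries attend only to the C core keys ((N - C) x C scores).
    peri_out = F.scaled_dot_product_attention(
        q[..., C:, :], k[..., :C, :], v[..., :C, :]
    )

    return torch.cat([core_out, peri_out], dim=-2)

# Example: 1024 tokens, 64 of them core, single head of width 64.
q = torch.randn(1, 1024, 64)
k = torch.randn(1, 1024, 64)
v = torch.randn(1, 1024, 64)
out = core_periphery_attention(q, k, v, num_core=64)
print(out.shape)  # torch.Size([1, 1024, 64])
```

Changing `num_core` at test time is exactly the elastic knob described above: fewer cores means cheaper inference with more diffuse attention, more cores means the opposite.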

Similar Articles

Generative modeling with sparse transformers

OpenAI Blog

OpenAI introduces the Sparse Transformer, a deep neural network that improves the attention mechanism from O(N²) to O(N√N) complexity, enabling modeling of sequences 30x longer than previously possible across text, images, and audio. The model uses sparse attention patterns and checkpoint-based memory optimization to train networks up to 128 layers deep, achieving state-of-the-art performance across multiple domains.
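For reference, the pattern behind the O(N√N) figure can be expressed as a boolean attention mask. The sketch below is a simplified hybrid of the strided/fixed patterns (a causal local window plus periodic "summary" columns), an illustration rather than OpenAI's actual kernel.

```python
import torch

def strided_sparse_mask(n, stride):
    """Boolean attention mask for a sparse pattern (True = may attend).

    Each query position i attends to (a) the previous `stride` positions and
    (b) every earlier "summary" column at a multiple of `stride`, so each row
    has O(sqrt(N)) nonzeros when stride ~ sqrt(N), i.e. O(N*sqrt(N)) overall.
    """
    i = torch.arange(n).unsqueeze(1)   # query positions
    j = torch.arange(n).unsqueeze(0)   # key positions
    causal = j <= i
    local = (i - j) < stride                 # recent window
    summary = (j % stride) == (stride - 1)   # periodic summary columns
    return causal & (local | summary)

mask = strided_sparse_mask(n=16, stride=4)
print(mask.sum(dim=1))  # per-row attention budget stays small
```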

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

arXiv cs.LG

This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
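As a rough picture of the kind of update such methods stabilize: plain linear attention accumulates an associative memory with rank-one updates, and viewing each update as one step of online ridge-regularized least squares adds a per-step shrinkage that keeps the state norm bounded. The toy sketch below uses an illustrative shrinkage rule, not the paper's exact objective or update.

```python
import torch

def regularized_linear_attention_step(S, k, v, lam=0.1):
    """One recurrent update of a linear-attention memory matrix S.

    Plain linear attention accumulates S <- S + v k^T, which can grow without
    bound. Solving min_S ||S - (S_prev + v k^T)||^2 + lam * ||S||^2 instead
    shrinks the state toward zero each step (illustrative rule only).
    """
    decay = 1.0 / (1.0 + lam)  # shrinkage induced by the L2 penalty
    return decay * (S + torch.outer(v, k))

def readout(S, q):
    """Query the associative memory: output = S q."""
    return S @ q

d = 8
S = torch.zeros(d, d)
for _ in range(1000):
    k, v = torch.randn(d), torch.randn(d)
    S = regularized_linear_attention_step(S, k, v)
print(S.norm())  # stays bounded thanks to the per-step shrinkage
```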

Large Vision-Language Models Get Lost in Attention

arXiv cs.AI

This research paper analyzes the internal mechanics of Large Vision-Language Models (LVLMs) using information theory, revealing that attention mechanisms may be redundant while Feed-Forward Networks drive semantic innovation. The authors demonstrate that replacing learned attention weights with random values can yield comparable performance, suggesting current models 'get lost in attention'.
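The headline intervention is easy to picture: swap the learned attention logits for random values of the same shape and renormalize, then check whether downstream performance changes. The snippet below is a hypothetical sketch of that ablation, not the paper's exact protocol.

```python
import torch

def randomize_attention(attn_logits):
    """Replace learned attention logits with Gaussian noise of the same shape,
    then softmax-normalize so each query still gets a valid distribution over
    keys. Illustrative only."""
    return torch.softmax(torch.randn_like(attn_logits), dim=-1)

# Example: a 12-head attention map over 197 ViT tokens.
logits = torch.randn(1, 12, 197, 197)
random_attn = randomize_attention(logits)
print(random_attn.sum(dim=-1))  # every row sums to 1
```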

Lite3R: A Model-Agnostic Framework for Efficient Feed-Forward 3D Reconstruction

Hugging Face Daily Papers

Lite3R is a model-agnostic framework that improves the efficiency of transformer-based 3D reconstruction using sparse linear attention and FP8-aware quantization. It reduces latency and memory usage by up to 2.4x while maintaining geometric accuracy on backbones like VGGT and DA3-Large.