Elastic Attention Cores for Scalable Vision Transformers [R]
Summary
This article presents a new paper on Elastic Attention Cores for Vision Transformers, which proposes a core-periphery block-sparse attention structure that improves scalability and accuracy over dense self-attention baselines such as DINOv3.
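To make "core-periphery block-sparse attention" concrete, here is a minimal sketch of one possible block mask. It is an illustration under stated assumptions, not the paper's actual layout: the summary does not say how blocks are assigned or connected, so this sketch assumes that the first `num_core` blocks form the core, that core blocks attend to and are attended by every block, and that each periphery block otherwise attends only to itself. The names `core_periphery_block_mask`, `density`, `num_blocks`, and `num_core` are hypothetical.

```python
def core_periphery_block_mask(num_blocks, num_core):
    """Build a num_blocks x num_blocks boolean block mask.

    mask[i][j] is True when query block i may attend to key block j.
    Assumption (not from the paper): blocks 0..num_core-1 are core blocks,
    every block may exchange attention with the core, and a periphery
    block otherwise sees only itself.
    """
    if not 0 <= num_core <= num_blocks:
        raise ValueError("num_core must be between 0 and num_blocks")
    mask = []
    for i in range(num_blocks):
        row = []
        for j in range(num_blocks):
            touches_core = i < num_core or j < num_core
            row.append(touches_core or i == j)
        mask.append(row)
    return mask


def density(mask):
    """Fraction of block pairs that are kept (1.0 means dense attention)."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return kept / total


mask = core_periphery_block_mask(num_blocks=8, num_core=2)
for row in mask:
    # '#' marks a computed block, '.' marks a skipped block
    print("".join("#" if keep else "." for keep in row))
print("density:", density(mask))
```

Under these assumptions, with n blocks and c core blocks the mask keeps c·n + (n − c)(c + 1) block pairs, which grows linearly in n for a fixed c, whereas dense attention keeps all n² pairs.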
Similar Articles
Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.
@gurtej__gill_: The Kimi Team wrote a really clever paper back in March that fixes a fundamental flaw we have sort of just accepted in …
The Kimi Team's paper 'Attention Residuals' (AttnRes) replaces uniform residual connections in Transformers with softmax attention over depth, allowing each layer to dynamically select earlier representations. In a 48B-parameter model pre-trained on 1.4 trillion tokens, it stabilizes hidden states and significantly improves performance on reasoning tasks.
Grammatically-Guided Sparse Attention for Efficient and Interpretable Transformers
This paper introduces a grammatically-guided sparse attention mechanism for Transformers, aiming to improve efficiency and interpretability by leveraging linguistic structure.
Generative modeling with sparse transformers
OpenAI introduces the Sparse Transformer, a deep neural network that reduces the cost of the attention mechanism from O(N²) to O(N√N), enabling modeling of sequences 30x longer than previously possible across text, images, and audio. The model uses sparse attention patterns and checkpoint-based memory optimization to train networks up to 128 layers deep, achieving state-of-the-art performance across multiple domains.
@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432
Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.