Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention
Summary
This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.
Similar Articles
Adaptive Computation Depth via Learned Token Routing in Transformers
This paper presents Token-Selective Attention (TSA), a differentiable token routing mechanism that learns to skip unnecessary computations per token in transformer layers, reducing token-layer operations by 14–23% with minimal quality loss on language modeling tasks.
Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps
RTPurbo converts full-attention LLMs into sparse models with only a few hundred training steps, achieving near-lossless accuracy and up to 9.36x prefill and 2.01x decode speedups.
Elastic Attention Cores for Scalable Vision Transformers [R]
This article presents a new paper on Elastic Attention Cores for Vision Transformers, proposing a core-periphery block-sparse attention structure that improves scalability and accuracy compared to dense self-attention methods like DINOv3.
Block-Wise Differentiable Sinkhorn Attention: Tail-Refinement Gradients with a Gap-Aware Dustbin Bridge
This paper presents Block-Wise Differentiable Sinkhorn Attention, a method for efficient long-context balanced entropic optimal transport attention on TPU hardware. It introduces a tail-refinement surrogate for exact differentiation, proving an efficient backward pass schedule and demonstrating significant improvements in Pfam sequence alignment reconstruction.
Block-Based Double Decoders
Proposes block-based double decoders, a novel transformer architecture using doubly-causal block-based attention masks to combine decoder-only training efficiency with encoder-decoder inference efficiency, achieving strong scaling performance and reduced KV-cache memory.