Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

Reddit r/artificial 05/30/26, 05:01 AM Papers

sparse-attention ultrametric block-sparse triton-kernel transformer long-context hardware-acceleration

Summary

This paper introduces Dynamic Ultrametric Attention, a framework in which Transformers learn per-head block-sparse routing topologies during training and then offload them to a custom Triton block-sparse kernel at inference time, achieving up to a 28× wall-clock inference speedup (at 8192 tokens) and a 98.4% memory reduction over dense attention.

Abstract. Standard dense self-attention scales quadratically in sequence length, creating an intractable memory and compute bottleneck for long-context Transformers. We introduce Dynamic Ultrametric Attention, a framework in which a Transformer autonomously learns per-head block-sparse routing topologies during training via Gumbel-Sigmoid depth gates, then offloads those learned sparsity patterns directly to a custom Triton block-sparse kernel at inference time. The routing topology is derived from an ultrametric (tree-structured) distance matrix that encodes hierarchical relationships between token positions. Across nine experiments spanning Dyck-k bracket languages, the Long Range Arena ListOps benchmark, autoregressive serving, and natural language modeling, we demonstrate that: (1) the dynamic gates organically discover layer-wise specialization (dedicating early layers to hierarchical parsing and later layers to dense aggregation) without any architectural constraint; (2) the learned sparsity maps transfer losslessly to a block-sparse Triton kernel that skips entire SRAM loads for non-attending blocks; (3) the resulting system achieves an 11.59× wall-clock inference speedup over PyTorch dense attention at 2048 tokens, scaling to 28× at 8192 tokens with 98.4% memory reduction; (4) a sparse PagedAttention decoding kernel achieves 8× effective memory bandwidth over dense decoding by conditionally skipping KV-cache block loads; and (5) when augmented with a local sliding window, the architecture maintains >88% sparsity across all layers on real natural language (Shakespeare) while reducing cross-entropy loss from 10.9 to 1.55. To our knowledge, this is the first demonstration of an LLM learning its own hardware-optimal sparsity pattern and bridging it to a physically accelerated kernel without post-hoc pruning or distillation.

https://github.com/sneed-and-feed/adelic-spectral-zeta/blob/main/papers/learning_to_skip_blocks.md
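To make the abstract's pipeline concrete, here is a minimal sketch of how a tree-structured distance between blocks, plus per-depth gates, could yield a block-sparse mask. This is an illustration under stated assumptions, not the paper's implementation: the complete binary tree over block indices, the XOR-based distance, the "one gate per tree distance" layout, and every function name (`tree_distance`, `gumbel_sigmoid`, `hard_levels`, `block_mask`, `sparsity`) are hypothetical choices made for this example.

```python
import math
import random


def tree_distance(i: int, j: int) -> int:
    """Ultrametric distance between block indices i and j.

    Assumption: blocks are leaves of a complete binary tree, and the distance
    is the height of their lowest common ancestor (0 when i == j).
    """
    return (i ^ j).bit_length()


def _sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def gumbel_sigmoid(logit: float, tau: float = 1.0, rng=random) -> float:
    """Relaxed Bernoulli gate: sigmoid((logit + logistic noise) / tau)."""
    u = min(max(rng.random(), 1e-9), 1.0 - 1e-9)  # keep log() finite
    noise = math.log(u) - math.log(1.0 - u)  # logistic noise (difference of two Gumbels)
    return _sigmoid((logit + noise) / tau)


def hard_levels(logits):
    """Inference-time hardening: keep tree distance d when its gate logit is positive."""
    return {d for d, logit in enumerate(logits) if logit > 0.0}


def block_mask(num_blocks: int, open_levels, causal: bool = True):
    """mask[i][j] is True when query block i may attend to key block j."""
    return [
        [
            tree_distance(i, j) in open_levels and (not causal or j <= i)
            for j in range(num_blocks)
        ]
        for i in range(num_blocks)
    ]


def sparsity(mask) -> float:
    """Fraction of (query block, key block) pairs a kernel could skip."""
    total = len(mask) * len(mask[0])
    kept = sum(sum(row) for row in mask)
    return 1.0 - kept / total


# Hypothetical gate logits for one head, indexed by tree distance 0..3.
logits = [4.0, 2.0, -3.0, -5.0]
mask = block_mask(8, hard_levels(logits))
print(sparsity(mask))  # 0.8125: 12 of 64 block pairs are kept
```

During training, `gumbel_sigmoid` would stand in for the hard threshold so the gate stays differentiable in a real autograd framework; at inference, the boolean mask is the kind of structure a block-sparse kernel could consult to skip loading key/value blocks whose entry is `False`.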

Similar Articles

Adaptive Computation Depth via Learned Token Routing in Transformers

Full Attention Strikes Back: Transferring Full Attention into Sparse within Hundred Training Steps

Uncertainty-gated selection for block-sparse attention

Elastic Attention Cores for Scalable Vision Transformers [R]

Linear Attention Architectures: Mechanisms, Trade-offs, and Cross-Layer Routing
