Tag
This paper introduces Dynamic Ultrametric Attention, a framework in which Transformers learn a block-sparse routing topology for each attention head during training. At inference time, the learned topologies are offloaded to a custom Triton block-sparse kernel, which achieves up to a 28x speedup and a 98.4% memory reduction relative to dense attention.
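To make the idea of a per-head block-sparse routing topology concrete, the following is a minimal NumPy sketch of attention restricted by a block-level mask. It is an illustration under stated assumptions, not the paper's method: the function name `block_sparse_attention`, the boolean `block_mask` layout, and the fixed `block_size` are all hypothetical, the mask is supplied by hand rather than learned, and the computation is a dense reference that only reproduces the numerical result of skipping masked blocks. The speed and memory savings quoted above would come from a kernel that never materializes the masked blocks at all.

```python
import numpy as np


def block_sparse_attention(q, k, v, block_mask, block_size):
    """Dense reference for block-masked attention (illustrative sketch only).

    q, k, v:     arrays of shape (heads, seq, dim)
    block_mask:  bool array of shape (heads, seq // block_size, seq // block_size);
                 entry [h, i, j] is True if query block i may attend to key block j
                 in head h (hypothetical layout, one topology per head)
    """
    heads, seq, dim = q.shape
    n_blocks = seq // block_size
    assert seq % block_size == 0, "seq must be a multiple of block_size"
    assert block_mask.shape == (heads, n_blocks, n_blocks)
    # Every query block needs at least one visible key block,
    # otherwise the softmax row would be empty.
    assert block_mask.any(axis=-1).all(), "each query block needs an active key block"

    # Expand the block-level mask to a token-level mask of shape (heads, seq, seq).
    token_mask = np.repeat(np.repeat(block_mask, block_size, axis=1), block_size, axis=2)

    # Scaled dot-product scores; masked positions are set to -inf before the softmax.
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(dim)
    scores = np.where(token_mask, scores, -np.inf)

    # Numerically stable softmax over the key axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Example: two heads with different topologies over 8 tokens in blocks of 4.
rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((2, 8, 4)) for _ in range(3))
block_mask = np.stack([
    np.eye(2, dtype=bool),        # head 0: each block attends only to itself
    np.ones((2, 2), dtype=bool),  # head 1: fully dense
])
out = block_sparse_attention(q, k, v, block_mask, block_size=4)
```

In this sketch, head 0 attends only within each block while head 1 remains dense, which is the sense in which each head can carry its own topology. A block-sparse kernel would iterate only over the `True` entries of `block_mask` instead of computing and then discarding the masked scores.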