triton-kernel

#triton-kernel

Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

Reddit r/artificial ↗ · 5d ago

This paper introduces Dynamic Ultrametric Attention, a framework where Transformers learn per-head block-sparse routing topologies during training, which are then offloaded to a custom Triton block-sparse kernel at inference time, achieving up to 28x speedup and 98.4% memory reduction over dense attention.

0 favorites 0 likes

triton-kernel

Learning to Skip Blocks: Self-Discovered Ultrametric Routing for Hardware-Accelerated Sparse Attention

Submit Feedback