@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…
Summary
This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA to select the top-k KV blocks per group, enabling GPU-native sparsity. On a 109B multimodal MoE, it cuts 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.
Cached at: 06/14/26, 07:39 AM
“MiniMax Sparse Attention”
This paper from MiniMax adds a tiny Index Branch to GQA that picks the top-k KV blocks per group and then runs exact softmax only on those blocks, making sparsity GPU-native with exp-free TopK and KV-outer sparse kernels.
On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.
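The select-then-attend idea described above can be illustrated with a minimal NumPy sketch. This is not the paper's implementation or its kernels: the function name `block_sparse_attention`, the separate index features `q_idx` and `K_idx`, and the mean-pooled block scoring are all assumptions made for illustration, since the summary does not say how the Index Branch scores blocks. The sketch only shows the general pattern: rank KV blocks once per GQA group, keep the top k, then run exact softmax over the kept tokens.

```python
import numpy as np

def block_sparse_attention(q, K, V, q_idx, K_idx, block_size, top_k):
    """Top-k block-sparse attention for one GQA group (illustrative sketch).

    q:     (heads, d)  query heads in the group, all sharing K and V
    K, V:  (n, d)      shared keys and values
    q_idx: (d_idx,)    hypothetical low-dimensional index query
    K_idx: (n, d_idx)  hypothetical low-dimensional index keys
    """
    n, d = K.shape
    assert n % block_size == 0, "sketch assumes n is a multiple of block_size"
    n_blocks = n // block_size

    # Assumed scoring rule: mean-pool index keys per block, dot with index query.
    block_keys = K_idx.reshape(n_blocks, block_size, -1).mean(axis=1)
    block_scores = block_keys @ q_idx  # ranking raw scores needs no exp

    # Keep the k highest-scoring blocks, shared by every head in the group.
    k = min(top_k, n_blocks)
    chosen = np.sort(np.argpartition(-block_scores, k - 1)[:k])

    # Gather the tokens of the chosen blocks.
    token_ids = (chosen[:, None] * block_size + np.arange(block_size)).reshape(-1)
    Ks, Vs = K[token_ids], V[token_ids]

    # Exact softmax attention restricted to the selected tokens.
    logits = q @ Ks.T / np.sqrt(d)
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ Vs, chosen
```

With `top_k` equal to the number of blocks, this reduces to ordinary dense attention, which gives a simple sanity check. With a smaller `top_k`, the softmax touches only `top_k * block_size` keys instead of all `n`, which is where the compute reduction in such schemes comes from.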
Similar Articles
MiniMax Sparse Attention
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …
MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.
MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)
MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.
GQA-μP: The maximal parameterization update for grouped query attention
This paper extends the maximal update parameterization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.
EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.