@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…

X AI KOLs Timeline Papers

Summary

This paper from Minimax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA to select top-k KV blocks per group, enabling GPU-native sparsity with exponential speedups on a 109B multimodal MoE.

"MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group, then runs exact softmax only on those blocks, making sparsity GPU native, with exp free TopK and KV outer sparse kernels. On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.
Original Article
View Cached Full Text

Cached at: 06/14/26, 07:39 AM

“MiniMax Sparse Attention”

This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group, then runs exact softmax only on those blocks, making sparsity GPU native, with exp free TopK and KV outer sparse kernels.

On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

GQA-{\mu}P: The maximal parameterization update for grouped query attention

arXiv cs.LG

This paper extends the maximal update parameterization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG

EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.