@askalphaxiv: "MiniMax Sparse Attention" This paper from MiniMax adds a tiny Index Branch to GQA that picks top-k KV blocks per group…

X AI KOLs Timeline 06/13/26, 04:43 PM Papers

sparse-attention gqa minimax efficiency long-context multimodal moe

Summary

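This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA that selects the top-k KV blocks per group and runs exact softmax only on those blocks, making sparsity GPU-native. On a 109B multimodal MoE, it cuts 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups, while keeping dense GQA quality.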

"MiniMax Sparse Attention" This paper from MiniMax adds a tiny Index Branch to GQA that picks the top-k KV blocks per group, then runs exact softmax only on those blocks, making sparsity GPU-native, with exp-free TopK and KV-outer sparse kernels. On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.

Original Article

Cached at: 06/14/26, 07:39 AM

“MiniMax Sparse Attention”

This paper from MiniMax adds a tiny Index Branch to GQA that picks the top-k KV blocks per group, then runs exact softmax only on those blocks, making sparsity GPU-native, with exp-free TopK and KV-outer sparse kernels.

On a 109B multimodal MoE, it keeps dense GQA quality while cutting 1M-context attention compute by 28.4x, with 14.2x prefill and 7.6x decode speedups.
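
The mechanism described above (score KV blocks with a small index branch, keep the top-k blocks per GQA group, then run exact softmax over only those blocks) can be illustrated with a minimal NumPy sketch for a single decode step. This is not the paper's implementation or its kernels: the function name, the `idx_q` / `idx_k` index projections, the mean-pooling of index keys into one vector per block, and the `block_size` and `top_k` parameters are all assumptions made for illustration. Block selection here ranks raw scores without computing a softmax, which is valid because softmax preserves ordering; whether this matches the paper's "exp-free TopK" is also an assumption.

```python
import numpy as np

def block_topk_sparse_attention(q, k, v, idx_q, idx_k, block_size, top_k):
    """One decode step for one GQA group (illustrative sketch only).

    q:     (H, D)  query heads that share a single KV head
    k, v:  (T, D)  cached keys / values for that KV head
    idx_q: (Di,)   hypothetical index-branch query for the group
    idx_k: (T, Di) hypothetical index-branch keys, one per cached token
    """
    T, D = k.shape
    assert T % block_size == 0, "sketch assumes T is a multiple of block_size"
    n_blocks = T // block_size
    top_k = min(top_k, n_blocks)

    # Assumption: summarize each block by mean-pooling its index keys.
    block_keys = idx_k.reshape(n_blocks, block_size, -1).mean(axis=1)
    block_scores = block_keys @ idx_q                      # (n_blocks,)

    # Top-k on raw scores; no exp / softmax is needed to rank blocks.
    selected = np.sort(np.argpartition(-block_scores, top_k - 1)[:top_k])

    # Gather the tokens of the selected blocks.
    token_idx = (selected[:, None] * block_size + np.arange(block_size)).reshape(-1)
    k_sel, v_sel = k[token_idx], v[token_idx]

    # Exact softmax attention, restricted to the selected blocks.
    logits = (q @ k_sel.T) / np.sqrt(D)                    # (H, top_k * block_size)
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v_sel, selected
```

In this sketch all query heads in the group share one block selection, so the gathered K/V tiles are read once per group. When `top_k` covers every block, the result reduces to ordinary dense attention.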

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …

X AI KOLs Following

MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.

MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)

TLDR AI

MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.

GQA-μP: The maximal parameterization update for grouped query attention

arXiv cs.LG

This paper extends the maximal update parameterization (μP) framework to grouped-query attention (GQA), deriving scaling laws for hyperparameter transfer across model architectures. It introduces spectral norm conditions for feature learning and addresses issues with low-rank weight matrices in GQA.

EntmaxKV: Support-Aware Decoding for Entmax Attention