@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …
Summary
MiniMax Sparse Attention (MSA) reduces attention compute by up to 28.4x at 1M tokens by adding a routing branch that selects which key-value blocks each query group attends to. This yields 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while mostly matching full-attention benchmark performance.
Cached at: 06/15/26, 09:06 PM
Quite incredible: MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs, while mostly matching the full-attention version's benchmark performance.
Gains like these become possible when attention stops treating every token as equally worth revisiting.
The trick is not to abandon softmax attention, but to make it selective before it becomes expensive.
MSA adds a small routing branch beside ordinary Grouped Query Attention, letting each query group choose the key-value blocks it should inspect while the main branch performs exact attention only inside that chosen set.
The model is no longer paying to compare every new thought with the entire past, only with the parts its learned indexer predicts are worth comparing.
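To make the routing idea concrete, here is a minimal NumPy sketch of one decode step for a single KV group. It is an illustration under stated assumptions, not the paper's kernel: the function name `block_sparse_gqa_decode`, the index projections `q_idx` and `k_idx`, and the max-over-block scoring rule are all hypothetical choices made for the example. Only the overall shape (a small selector picks top-k KV blocks per query group, then exact softmax attention runs inside that set) comes from the description above.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def block_sparse_gqa_decode(q, k, v, q_idx, k_idx, block_size, top_k):
    """One decode step for one KV group (illustrative only).

    q:     (H, D)   query heads that share this KV group
    k, v:  (N, D)   cached keys / values for the group
    q_idx: (Di,)    hypothetical low-dimensional index query for the group
    k_idx: (N, Di)  hypothetical index keys, one per cached token
    Assumes N is a multiple of block_size and 1 <= top_k <= N // block_size.
    """
    n, d = k.shape
    n_blocks = n // block_size
    # Routing branch: score each block by its best-matching index key (assumed rule).
    token_scores = k_idx @ q_idx                                  # (N,)
    block_scores = token_scores.reshape(n_blocks, block_size).max(axis=1)
    chosen = np.sort(np.argsort(block_scores)[-top_k:])           # (top_k,)
    # Expand the chosen block ids into the token positions they cover.
    token_ids = (chosen[:, None] * block_size + np.arange(block_size)).reshape(-1)
    k_sel, v_sel = k[token_ids], v[token_ids]
    # Main branch: exact softmax attention restricted to the chosen tokens.
    logits = (q @ k_sel.T) / np.sqrt(d)                           # (H, top_k * block_size)
    out = softmax(logits, axis=-1) @ v_sel                        # (H, D)
    return out, chosen

# Toy usage with random tensors (sizes are arbitrary).
rng = np.random.default_rng(0)
H, D, Di, N, B = 4, 16, 8, 64, 8
q = rng.standard_normal((H, D))
k = rng.standard_normal((N, D))
v = rng.standard_normal((N, D))
q_idx = rng.standard_normal(Di)
k_idx = rng.standard_normal((N, Di))

sparse_out, sparse_blocks = block_sparse_gqa_decode(q, k, v, q_idx, k_idx, B, top_k=2)
# Selecting every block should reproduce dense attention.
dense_out, _ = block_sparse_gqa_decode(q, k, v, q_idx, k_idx, B, top_k=N // B)
```

In this sketch the main branch touches `top_k * block_size` keys instead of `N`, so its cost shrinks by roughly `N / (top_k * block_size)`, before accounting for the selector's own cost. The real block size, top-k, and scoring function are not given in the text above.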
Long context is not a memory feature by itself; it is a retrieval problem under brutal latency constraints, where the model must decide what deserves bandwidth at the moment of use.
MiniMax Sparse Attention is compelling because it moves that decision into the architecture and trains the selector against the model’s own attention patterns.
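One way to picture "training the selector against the model's own attention patterns" is a distillation-style loss: sum the full-attention probability mass that falls in each KV block and ask the selector's block distribution to match it. The sketch below assumes a KL-divergence objective and a per-block mass target. Both are assumptions for illustration, as is the name `indexer_distill_loss`; the text above does not specify the actual training objective.

```python
import numpy as np

def log_softmax(x):
    x = x - x.max()
    return x - np.log(np.exp(x).sum())

def indexer_distill_loss(attn_probs, block_scores, block_size):
    """KL(target || selector) over blocks for a single query (assumed objective).

    attn_probs:   (N,) full-attention probabilities for one query (teacher)
    block_scores: (N // block_size,) raw selector scores per block (student)
    """
    # Teacher target: attention mass that lands in each block.
    target = attn_probs.reshape(-1, block_size).sum(axis=1)
    log_pred = log_softmax(block_scores)
    eps = 1e-12  # guards log(0) for blocks with no attention mass
    return float(np.sum(target * (np.log(target + eps) - log_pred)))

# Toy usage: a selector whose scores equal the log block mass is a perfect match.
rng = np.random.default_rng(0)
logits = rng.standard_normal(32)
probs = np.exp(logits - logits.max())
probs /= probs.sum()
block_mass = probs.reshape(-1, 8).sum(axis=1)

loss_matched = indexer_distill_loss(probs, np.log(block_mass), block_size=8)
loss_uniform = indexer_distill_loss(probs, np.zeros(4), block_size=8)
```

The loss is near zero when the selector's distribution already matches the teacher's block mass and grows as the two diverge, which is the behavior one would want from any objective of this kind.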
Link – arxiv.org/abs/2606.13392
Title: “MiniMax Sparse Attention”
Similar Articles
MiniMax Sparse Attention
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)
MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.
MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost (12 minute read)
MiniMax has released a detailed technical report on its M2 series and teased the upcoming M3 model, which uses a novel sparse attention mechanism to achieve up to 15.6× faster decoding at million-token contexts.
@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…
This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA to select top-k KV blocks per group, enabling GPU-native sparsity with substantial speedups on a 109B multimodal MoE.
MiniMaxAI/MiniMax-M3
MiniMax releases M3, a native multimodal model with 1M context and ~428B parameters, using MiniMax Sparse Attention (MSA) for efficient long-context processing, achieving frontier-level coding and agentic performance.