@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …


Summary

MiniMax Sparse Attention (MSA) reduces attention compute by up to 28.4x at 1M tokens by adding a routing branch that selects which key-value blocks each query group attends to. This yields 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while largely matching full-attention benchmark performance.


Cached at: 06/15/26, 09:06 PM

Quite incredible: MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs, while mostly matching the full-attention version’s benchmark performance.

This becomes possible when attention stops treating every token as equally worth revisiting.

The trick is not to abandon softmax attention, but to make it selective before it becomes expensive.

MSA adds a small routing branch beside ordinary Grouped Query Attention, letting each query group choose the key-value blocks it should inspect while the main branch performs exact attention only inside that chosen set.
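
To make the two-branch idea concrete, here is a minimal pure-Python sketch for a single query. It is not the paper's kernel: the mean-pooled block summaries, the dot-product router score, and all function names (`mean_pool_blocks`, `select_blocks`, `sparse_attention`) are assumptions for illustration, and the toy usage reuses the attention query as the router query, whereas MSA uses a separate learned routing branch.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def mean_pool_blocks(keys, block_size):
    """Hypothetical block summary: the mean key vector of each block."""
    blocks = [keys[i:i + block_size] for i in range(0, len(keys), block_size)]
    return [[sum(col) / len(blk) for col in zip(*blk)] for blk in blocks]

def select_blocks(router_q, block_summaries, top_k):
    """Routing branch: score every block, keep the top_k block indices."""
    scores = [dot(router_q, s) for s in block_summaries]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:top_k])

def sparse_attention(q, keys, values, block_size, chosen_blocks):
    """Main branch: exact softmax attention over tokens in the chosen blocks only."""
    idx = [t for b in chosen_blocks
           for t in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(q))
    weights = softmax([dot(q, keys[t]) * scale for t in idx])
    return [sum(w * values[t][d] for w, t in zip(weights, idx))
            for d in range(len(values[0]))]

# Toy usage: 6 tokens, 3 blocks of 2, one query inspecting its top 2 blocks.
keys = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0]]
values = [[2.0, 3.0], [2.0, 3.0], [5.0, 5.0], [5.0, 5.0], [9.0, 9.0], [9.0, 9.0]]
q = [1.0, 0.5]
chosen = select_blocks(q, mean_pool_blocks(keys, 2), top_k=2)
out = sparse_attention(q, keys, values, 2, chosen)
```

The softmax inside `sparse_attention` is still exact; the only approximation is which tokens are allowed into it.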

The model is no longer paying to compare every new thought with the entire past, only with the parts its learned indexer predicts are worth comparing.
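
A rough cost model shows where the savings come from. This is a back-of-the-envelope sketch, not the paper's accounting: it counts one unit per routing score and one per key comparison, and the `block_size` and `top_k` values used with it are hypothetical, so it is not meant to reproduce the 28.4X figure.

```python
def dense_cost(n_tokens):
    """Key comparisons per query under full attention."""
    return n_tokens

def sparse_cost(n_tokens, block_size, top_k):
    """One routing score per block, plus exact attention inside top_k blocks."""
    n_blocks = -(-n_tokens // block_size)  # ceiling division
    return n_blocks + min(top_k, n_blocks) * block_size

def reduction(n_tokens, block_size, top_k):
    """Ratio of dense to sparse per-query cost under this simplified model."""
    return dense_cost(n_tokens) / sparse_cost(n_tokens, block_size, top_k)
```

Under this model, with `block_size` and `top_k` held fixed, the ratio grows with context length, which fits the source quoting its headline number at 1M tokens.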

Long context is not a memory feature by itself; it is a retrieval problem under brutal latency constraints, where the model must decide what deserves bandwidth at the moment of use.

MiniMax Sparse Attention is compelling because it moves that decision into the architecture and trains the selector against the model’s own attention patterns.
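
One plausible reading of "trains the selector against the model's own attention patterns" is a distillation-style loss: sum the full-attention probabilities within each block to get a target distribution, then penalize the router's block distribution for diverging from it. The sketch below assumes a KL-divergence objective; the post does not state the actual loss, and the function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def block_attention_mass(attn_probs, block_size):
    """Target: total attention probability that falls inside each block."""
    return [sum(attn_probs[i:i + block_size])
            for i in range(0, len(attn_probs), block_size)]

def kl_divergence(target, predicted, eps=1e-12):
    """KL(target || predicted), skipping blocks with zero target mass."""
    return sum(t * math.log((t + eps) / (p + eps))
               for t, p in zip(target, predicted) if t > 0)

def selector_loss(attn_probs, router_scores, block_size):
    """Assumed objective: match router block probabilities to attention mass."""
    target = block_attention_mass(attn_probs, block_size)
    predicted = softmax(router_scores)
    return kl_divergence(target, predicted)
```

The loss is near zero when the router already ranks blocks in proportion to where full attention puts its weight, and it grows as the router spreads probability over blocks the model rarely attends to.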


Link – arxiv.org/abs/2606.13392

Title: “MiniMax Sparse Attention”

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

MiniMaxAI/MiniMax-M3

Hugging Face Models Trending

MiniMax releases M3, a native multimodal model with 1M context and ~428B parameters, using MiniMax Sparse Attention (MSA) for efficient long-context processing, achieving frontier-level coding and agentic performance.