@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …

X AI KOLs Following 06/15/26, 08:39 PM Papers

sparse-attention long-context efficient-transformer minimax attention-mechanism inference-speed

Summary

MiniMax Sparse Attention (MSA) cuts attention compute by up to 28.4x at 1M tokens by adding a routing branch beside Grouped Query Attention that selects which key-value blocks each query group attends to. This yields 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while mostly matching full-attention benchmark performance.

Original Article

Cached at: 06/15/26, 09:06 PM

Quite incredible: MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and 7.6X faster decoding on H800 GPUs, while mostly matching the full-attention version's benchmark performance.

This can happen when attention stops treating every token as equally worth revisiting.

The trick is not to abandon softmax attention, but to make it selective before it becomes expensive.

MSA adds a small routing branch beside ordinary Grouped Query Attention, letting each query group choose the key-value blocks it should inspect while the main branch performs exact attention only inside that chosen set.
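The routing idea can be sketched for a single decode step in plain Python. This is a minimal illustration, not the paper's implementation: the block summary (mean of the keys in a block), the group score (sum over the group's query heads), and the names `select_blocks` and `sparse_group_attention` are all assumptions made for the example. The only parts taken from the description above are that selection happens per query group over key-value blocks, and that exact softmax attention runs inside the chosen set.

```python
import math
import random


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def select_blocks(group_queries, keys, block_size, top_k):
    """Routing branch (assumed design): pick top_k KV blocks for one query group."""
    n_blocks = math.ceil(len(keys) / block_size)
    dim = len(keys[0])
    scores = []
    for b in range(n_blocks):
        block = keys[b * block_size:(b + 1) * block_size]
        # Assumption: a block is summarized by the mean of its keys.
        summary = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        # Assumption: one score per block, shared by every head in the group.
        scores.append(sum(dot(q, summary) for q in group_queries))
    ranked = sorted(range(n_blocks), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])


def sparse_group_attention(group_queries, keys, values, block_size, top_k):
    """Main branch: exact softmax attention, restricted to the chosen blocks.

    keys/values stand for the single KV head shared by the query group (GQA).
    """
    chosen = select_blocks(group_queries, keys, block_size, top_k)
    idx = [i for b in chosen
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in group_queries:
        weights = softmax([dot(q, keys[i]) * scale for i in idx])
        outputs.append([sum(w * values[i][d] for w, i in zip(weights, idx))
                        for d in range(len(values[0]))])
    return outputs, chosen


if __name__ == "__main__":
    random.seed(0)
    dim, n_tokens = 4, 16
    keys = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    values = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
    queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(2)]
    out, chosen = sparse_group_attention(queries, keys, values, block_size=4, top_k=2)
    print("blocks inspected:", chosen)
```

When `top_k` covers every block, the sketch reduces to ordinary dense attention, which is a handy sanity check for any selective variant.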

The model is no longer paying to compare every new thought with the entire past, only with the parts its learned indexer predicts are worth comparing.
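A rough count shows why this saves work as context grows. The sketch below counts query-key comparisons for one new token, assuming (hypothetically) that routing costs one comparison per block summary and that each comparison costs the same as a full key comparison. The block size and top-k values are made up for illustration and are not taken from the paper, so the ratio is not meant to reproduce the reported 28.4X.

```python
import math


def comparisons_per_query(n_tokens, block_size, top_k):
    """Comparisons one query makes under block-sparse attention (assumed cost model)."""
    n_blocks = math.ceil(n_tokens / block_size)
    routing = n_blocks                              # one score per block summary
    selected = min(top_k * block_size, n_tokens)    # exact attention inside chosen blocks
    return routing + selected


def dense_to_sparse_ratio(n_tokens, block_size, top_k):
    """Dense attention compares the query with all n_tokens keys."""
    return n_tokens / comparisons_per_query(n_tokens, block_size, top_k)


if __name__ == "__main__":
    # Hypothetical settings: 128-token blocks, 16 blocks kept per query group.
    for n in (1_024, 100_000, 1_000_000):
        print(n, round(dense_to_sparse_ratio(n, block_size=128, top_k=16), 2))
```

Under this cost model the ratio is below 1 for short contexts, where routing is pure overhead, and grows with context length, which matches the post's framing that the savings show up at the 1M-token scale.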

Long context is not a memory feature by itself; it is a retrieval problem under brutal latency constraints, where the model must decide what deserves bandwidth at the moment of use.

MiniMax Sparse Attention is compelling because it moves that decision into the architecture and trains the selector against the model's own attention patterns.
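One way to read "trains the selector against the model's own attention patterns" is as a distillation objective. The sketch below is an assumption about what such a loss could look like, not the loss used in the paper: it sums a query's dense attention weights inside each block to form a target distribution, then measures the KL divergence from the selector's softmaxed block scores. The function names and the `eps` smoothing term are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def block_attention_mass(attn_weights, block_size):
    """Sum one query's dense attention weights inside each KV block."""
    return [sum(attn_weights[i:i + block_size])
            for i in range(0, len(attn_weights), block_size)]


def selector_kl_loss(attn_weights, block_scores, block_size, eps=1e-12):
    """KL(target || predicted) between block-level attention mass and selector output.

    Assumption: the target comes from the main branch's attention weights and
    is treated as a constant when training the selector.
    """
    target = block_attention_mass(attn_weights, block_size)
    pred = softmax(block_scores)
    return sum(t * (math.log(t + eps) - math.log(p + eps))
               for t, p in zip(target, pred) if t > 0)


if __name__ == "__main__":
    weights = [0.125] * 8                  # uniform attention over 8 tokens
    print(selector_kl_loss(weights, [0.0, 0.0], block_size=4))
    print(selector_kl_loss(weights, [5.0, -5.0], block_size=4))
```

The loss is zero when the selector's block distribution matches where the attention mass actually falls, and grows as the selector concentrates on blocks the main branch does not use.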

Link: arxiv.org/abs/2606.13392

Title: “MiniMax Sparse Attention”

Similar Articles

MiniMax Sparse Attention

MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)

MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost (12 minute read)

@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…

MiniMaxAI/MiniMax-M3
