MiniMax Sparse Attention
Summary
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
Cached at: 06/12/26, 06:50 AM
Source: https://huggingface.co/papers/2606.13392
Abstract
MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance.
Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
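The two-branch flow described in the abstract can be illustrated with a toy, pure-Python sketch for a single decoding position of one GQA group. This is not the released kernel or the paper's Index Branch: the block scoring rule used here (mean query dotted with each block's mean key) is an assumption chosen for brevity, and the function names `select_blocks`, `block_sparse_attention`, and `msa_like_attention` are hypothetical. Causal masking is omitted on the assumption that all keys precede the current position.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def select_blocks(group_queries, keys, block_size, top_k):
    """Index step. ASSUMED scoring rule: mean query of the group dotted with
    the mean key of each block. Returns the top_k block ids in ascending order."""
    dim = len(keys[0])
    n_q = len(group_queries)
    q_mean = [sum(q[d] for q in group_queries) / n_q for d in range(dim)]
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        k_mean = [sum(k[d] for k in block) / len(block) for d in range(dim)]
        scores.append(dot(q_mean, k_mean))
    # Ranking compares raw scores, so no exponentials are needed to pick the top k.
    ranked = sorted(range(len(scores)), key=lambda b: scores[b], reverse=True)
    return sorted(ranked[:top_k])

def block_sparse_attention(query, keys, values, blocks, block_size):
    """Main step: exact softmax attention restricted to the selected blocks."""
    idx = [i for b in blocks
           for i in range(b * block_size, min((b + 1) * block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, keys[i]) * scale for i in idx])
    dim_v = len(values[0])
    return [sum(w * values[i][d] for w, i in zip(weights, idx))
            for d in range(dim_v)]

def msa_like_attention(group_queries, keys, values, block_size, top_k):
    """All query heads in one GQA group share the same selected blocks."""
    blocks = select_blocks(group_queries, keys, block_size, top_k)
    outputs = [block_sparse_attention(q, keys, values, blocks, block_size)
               for q in group_queries]
    return outputs, blocks

# Toy data: 8 keys in 4 blocks of 2; two query heads in the group.
keys = [[0.0, 1.0], [0.0, 1.0], [-1.0, 0.0], [-1.0, 0.0],
        [1.0, 0.0], [1.0, 0.0], [0.0, -1.0], [0.0, -1.0]]
values = [[float(i)] for i in range(8)]
group_queries = [[1.0, 0.0], [2.0, 0.0]]
outputs, blocks = msa_like_attention(group_queries, keys, values,
                                     block_size=2, top_k=1)
```

In this toy setup each query head attends over only `top_k * block_size` keys rather than all of them, which is where the per-token compute saving comes from. Setting `top_k` to the total number of blocks recovers dense attention.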
Similar Articles
MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost (12 minute read)
MiniMax has released a detailed technical report on its M2 series and teased the upcoming M3 model, which uses a novel sparse attention mechanism to achieve up to 15.6× faster decoding at million-token contexts.
MiniMaxAI/MiniMax-M3
MiniMax releases M3, a native multimodal model with 1M context and ~428B parameters, using MiniMax Sparse Attention (MSA) for efficient long-context processing, achieving frontier-level coding and agentic performance.
MiniMax M3 (2 minute read)
MiniMax introduces M3, the first open-weights model to combine coding, agentic, and multimodal capabilities with up to 1M context via sparse attention.
EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.
SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.