MiniMax Sparse Attention

Hugging Face Daily Papers Papers

Summary

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMax Sparse Attention (MSA), a blockwise sparse attention built upon Grouped Query Attention (GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attention over only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-free Top-k selection and KV-outer sparse attention to improve tensor-core utilization under block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-token attention compute by 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2x prefill and 7.6x decoding wall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.
Original Article
View Cached Full Text

Cached at: 06/12/26, 06:50 AM

Paper page - MiniMax Sparse Attention

Source: https://huggingface.co/papers/2606.13392 Authors:

,

,

,

,

,

,

,

,

,

Abstract

MiniMax Sparse Attention enables efficient processing of ultra-long contexts in large language models through blockwise sparsity and optimized GPU execution, achieving significant speedups while maintaining performance.

Ultra-long-context capability is becoming indispensable for frontier LLMs: agentic workflows, repository-scale code reasoning, and persistent memory all require the model to jointly attend over hundreds of thousands to millions of tokens, yet the quadratic cost of softmax attention makes this untenable at deployment scale. We introduce MiniMaxSparse Attention(MSA), ablockwise sparse attentionbuilt uponGrouped Query Attention(GQA). A lightweight Index Branch scores key-value blocks and independently selects a Top-k subset for each GQA group, enabling group-specific sparse retrieval while maintaining efficient block-level execution; the Main Branch then performs exact block-sparse attentionover only the selected blocks. Designed around a principle of simplicity and scalability, MSA is deliberately streamlined, making it straightforward to deploy efficiently across a broad range of GPUs. To translate sparsity into practical speedups, we co-design MSA with a GPU execution path that uses exp-freeTop-k selectionand KV-outersparse attentionto improvetensor-core utilizationunder block-granular access. On a 109B-parameter model with native multimodal training, MSA performs on par with GQA while reducing per-tokenattention computeby 28.4x at 1M context. Paired with our co-designed kernel, MSA achieves 14.2xprefilland 7.6xdecodingwall-clock speedups on H800. Our inference kernel is available at: https://github.com/MiniMax-AI/MSA. A production-grade natively multimodal model powered by MSA has been publicly released at: https://huggingface.co/MiniMaxAI/MiniMax-M3.

View arXiv pageView PDFGitHub145Add to collection

Get this paper in your agent:

hf papers read 2606\.13392

Don’t have the latest CLI?curl \-LsSf https://hf\.co/cli/install\.sh \| bash

Models citing this paper0

No model linking this paper

Cite arxiv.org/abs/2606.13392 in a model README.md to link it from this page.

Datasets citing this paper0

No dataset linking this paper

Cite arxiv.org/abs/2606.13392 in a dataset README.md to link it from this page.

Spaces citing this paper0

No Space linking this paper

Cite arxiv.org/abs/2606.13392 in a Space README.md to link it from this page.

Collections including this paper0

No Collection including this paper

Add this paper to acollectionto link it from this page.

Similar Articles

MiniMaxAI/MiniMax-M3

Hugging Face Models Trending

MiniMax releases M3, a native multimodal model with 1M context and ~428B parameters, using MiniMax Sparse Attention (MSA) for efficient long-context processing, achieving frontier-level coding and agentic performance.

MiniMax M3 (2 minute read)

TLDR AI

MiniMax introduces M3, the first open-weights model to combine coding, agentic, and multimodal capabilities with up to 1M context via sparse attention.

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG

EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.