"inference falls back to dense attention" for MiniMax M3 - does it mean 428B weights used at each step?

Reddit r/LocalLLaMA 06/12/26, 09:28 PM Models

Summary

The post asks about a note on the MiniMax M3 GGUF release: MiniMax Sparse Attention is not yet supported in GGUF, so inference falls back to dense attention. The poster wonders whether this means all 428B weights are used at each step, and how much slower inference is than with a full implementation.

So like 100x (or how much) slower vs. full implementation?

https://huggingface.co/unsloth/MiniMax-M3-GGUF

> Note: MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention.

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
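As a rough illustration of where a per-token attention saving of this kind comes from, the sketch below counts how many keys a single query token attends to under dense attention versus blockwise sparse attention. The function name `attended_keys` and the values of `block_size` and `top_k` are hypothetical, chosen only for illustration; they are not the paper's settings. The count also ignores the cost of choosing the blocks, so it is not expected to reproduce the 28.4x figure.

```python
def attended_keys(context_len, block_size, top_k):
    """Return (dense, sparse) counts of keys one query token attends to."""
    num_blocks = -(-context_len // block_size)  # ceiling division
    dense = context_len                         # dense: every key in the context
    sparse = min(top_k, num_blocks) * block_size  # sparse: only the selected blocks
    return dense, min(sparse, context_len)


# Hypothetical settings, not taken from the MiniMax paper.
dense, sparse = attended_keys(context_len=1_000_000, block_size=128, top_k=64)
print(f"dense: {dense} keys, sparse: {sparse} keys, ratio: {dense / sparse:.1f}x")
```

In this simplified count, the gap between dense and sparse attention grows with context length and does not depend on the model's parameter count.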

@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …

X AI KOLs Following

MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.

MiniMax M3 (2 minute read)

TLDR AI

MiniMax introduces M3, the first open-weights model to combine coding, agentic, and multimodal capabilities with up to 1M context via sparse attention.

MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost (12 minute read)

TLDR AI

MiniMax has released a detailed technical report on its M2 series and teased the upcoming M3 model, which uses a novel sparse attention mechanism to achieve up to 15.6× faster decoding at million-token contexts.

@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…

X AI KOLs Timeline

This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA that selects the top-k KV blocks per group, enabling GPU-native sparsity with large speedups on a 109B multimodal MoE.
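The top-k block selection described here can be sketched for a single query in plain Python. This is a minimal toy under stated assumptions, not the paper's kernel: the function name `block_sparse_attention` is hypothetical, and scoring each block by the query's dot product with the block's mean key is a stand-in for the Index Branch, whose actual design is not given in this summary. GQA grouping is omitted.

```python
import math


def block_sparse_attention(query, keys, values, block_size, top_k):
    """Single-query softmax attention over only the top_k highest-scoring key blocks."""
    dim = len(query)
    scale = 1.0 / math.sqrt(dim)

    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    # Split key positions into contiguous blocks.
    blocks = [range(start, min(start + block_size, len(keys)))
              for start in range(0, len(keys), block_size)]

    # Assumed stand-in for the index branch: score a block by the query's
    # dot product with the block's mean key.
    def block_score(block):
        mean_key = [sum(keys[i][d] for i in block) / len(block) for d in range(dim)]
        return dot(query, mean_key)

    ranked = sorted(range(len(blocks)), key=lambda b: block_score(blocks[b]), reverse=True)
    selected = sorted(ranked[:top_k])

    # Ordinary softmax attention, restricted to positions in the selected blocks.
    positions = [i for b in selected for i in blocks[b]]
    logits = [dot(query, keys[i]) * scale for i in positions]
    max_logit = max(logits)
    weights = [math.exp(logit - max_logit) for logit in logits]
    total = sum(weights)
    output = [sum(w * values[i][d] for w, i in zip(weights, positions)) / total
              for d in range(len(values[0]))]
    return output, selected


# Toy data: 8 keys in 4 blocks of 2; the second block matches the query best.
query = [1.0, 0.0]
keys = [[0.0, 0.0], [0.0, 0.0], [5.0, 0.0], [5.0, 0.0],
        [-1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
values = [[float(i)] for i in range(8)]
output, selected = block_sparse_attention(query, keys, values, block_size=2, top_k=1)
```

Setting `top_k` to the total number of blocks makes the function attend to every key, which corresponds to the dense fallback mentioned in the GGUF note.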