"inference falls back to dense attention" for MiniMax M3 - does it mean 428B weights used at each step?

Reddit r/LocalLLaMA Models

Summary

The Unsloth GGUF release of MiniMax M3 notes that MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention. The poster asks whether this means all 428B weights are used at each step, and how much slower the fallback is compared with a full implementation.

So like 100x (or how much) slower vs. full implementation?

https://huggingface.co/unsloth/MiniMax-M3-GGUF

> Note: MiniMax Sparse Attention is not supported yet, so inference falls back to dense attention.
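One way to frame the question: dense versus sparse attention changes how many cached context positions a new token attends to, which is a separate matter from how many weights are touched per step. The sketch below is a rough back-of-the-envelope comparison, not MiniMax's actual algorithm. It assumes a generic blockwise scheme in which each new token attends to a fixed number of key blocks; `block_size` and `top_k_blocks` are hypothetical values chosen for illustration, not M3's real configuration.

```python
import math


def keys_read_dense(context_len: int) -> int:
    """Dense attention: a new token attends to every cached position."""
    return context_len


def keys_read_block_sparse(context_len: int, block_size: int, top_k_blocks: int) -> int:
    """Generic blockwise sparse attention: attend to at most top_k_blocks blocks.

    Illustrative assumption only; the real MiniMax Sparse Attention
    selection rule and parameters are not given in this thread.
    """
    n_blocks = math.ceil(context_len / block_size)
    selected = min(top_k_blocks, n_blocks)
    return min(selected * block_size, context_len)


def dense_over_sparse(context_len: int, block_size: int = 128, top_k_blocks: int = 64) -> float:
    """Ratio of keys read per decoded token, dense vs. block-sparse."""
    dense = keys_read_dense(context_len)
    sparse = keys_read_block_sparse(context_len, block_size, top_k_blocks)
    return dense / sparse


if __name__ == "__main__":
    # Hypothetical context lengths, from short chats to 1M tokens.
    for n in (4_096, 32_768, 262_144, 1_048_576):
        print(f"{n:>9} tokens: dense reads {dense_over_sparse(n):.1f}x the keys")
```

With these made-up parameters the ratio stays at 1.0 up to 8,192 tokens and grows with context length beyond that, so the cost of the dense fallback depends on how long the context is rather than being a fixed factor. The sketch counts only attention keys and ignores block-selection overhead and all non-attention work, which is presumably why the paper summary quoted below reports a 28.4x reduction in attention compute but a 7.6x decoding speedup at 1M context.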

Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

MiniMax M3 (2 minute read)

TLDR AI

MiniMax introduces M3, the first open-weights model to combine coding, agentic, and multimodal capabilities with up to 1M context via sparse attention.