@eliebakouch: the new sparse attention method introduced with this model is basically a combination of components from existing ones.…
Summary
Meituan introduces LongCat-2.0, a 1.6T-parameter MoE model with ~48B active parameters and a 1M-token context, featuring a new LongCat Sparse Attention (LSA) method that combines components from existing sparse attention techniques.
Cached at: 06/30/26, 07:42 AM
the new sparse attention method introduced with this model is basically a combination of components from existing ones. let’s go over each sparse attention method and what they keep from them:
- deepseek sparse attention (DSA): they keep the top-k indexer, this is the basis of all the following sparse attention methods
- DSA + index sharing (glm5.2): they share the index over multiple layers
- minimax sparse attention (and also NSA from deepseek): they do block level top-k indexing -> they add another top-k at the token level to select precisely each token in the selected block
- compressed sparse attention (from deepseek V4) and also NSA: they keep the sliding window -> they also add a sink token iirc
tl;dr: they do block level indexing, then token level on the selected blocks + add a sliding window and sink component with a 50/50 budget split. they share the token level top-k across layers
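To make the tl;dr concrete, here is a rough sketch of how one query could pick its keys under that recipe. It is not LongCat's implementation: the function names, the max-pooled block score, the reading of "50/50" as half the budget for sink + sliding window and half for the top-k tokens, and the "recompute every N layers" sharing rule are all assumptions made for illustration.

```python
from typing import Callable, List


def select_keys(scores: List[float], query_pos: int, block_size: int,
                top_blocks: int, budget: int, n_sink: int = 1) -> List[int]:
    """Pick the key positions one query attends to (hypothetical sketch).

    scores[t] is an indexer score for key t (higher = more relevant).
    """
    n_ctx = query_pos + 1  # causal: only keys 0..query_pos are visible

    # Assumed 50/50 split: half the budget for sink + sliding window.
    local_budget = budget // 2
    sink = list(range(min(n_sink, n_ctx)))
    window_len = max(local_budget - len(sink), 0)
    window_start = max(n_ctx - window_len, 0)
    local = set(sink) | set(range(window_start, n_ctx))

    # Stage 1: block-level top-k. Block score = max token score (assumption).
    n_blocks = (n_ctx + block_size - 1) // block_size
    block_scores = [
        max(scores[b * block_size:min((b + 1) * block_size, n_ctx)])
        for b in range(n_blocks)
    ]
    chosen = sorted(range(n_blocks), key=lambda b: block_scores[b],
                    reverse=True)[:top_blocks]

    # Stage 2: token-level top-k, restricted to the selected blocks.
    candidates = [
        t
        for b in chosen
        for t in range(b * block_size, min((b + 1) * block_size, n_ctx))
        if t not in local
    ]
    sparse_k = budget - local_budget
    top_tokens = sorted(candidates, key=lambda t: scores[t],
                        reverse=True)[:sparse_k]

    return sorted(local | set(top_tokens))


def indices_per_layer(n_layers: int, share_every: int,
                      compute: Callable[[int], List[int]]) -> List[List[int]]:
    """Index sharing: recompute the selection every `share_every` layers
    and reuse it for the layers in between (assumed sharing pattern)."""
    out: List[List[int]] = []
    cached: List[int] = []
    for layer in range(n_layers):
        if layer % share_every == 0:
            cached = compute(layer)
        out.append(cached)
    return out


# Toy example: 16 keys, blocks of 4, keep 2 blocks, total budget of 8 keys.
scores = [0.1] * 16
scores[2], scores[5], scores[6], scores[9], scores[10] = 0.5, 0.9, 0.8, 0.7, 0.6
print(select_keys(scores, query_pos=15, block_size=4, top_blocks=2, budget=8))
# -> [0, 5, 6, 9, 10, 13, 14, 15]
```

In the toy run, key 2 scores higher than most of the keys that get kept, but its block loses the block-level top-k so it never reaches the token-level stage. That is the coarse-then-fine trade-off the thread describes: the token-level pass only has to rank tokens inside the few blocks that survived.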
Meituan LongCat (@Meituan_LongCat): Introducing LongCat-2.0 🐱 1.6T parameters · MoE with ~48B active · 1M context The full model behind Owl Alpha on @OpenRouter — now available.
Built for agentic coding from the ground up: ◆ LongCat Sparse Attention (LSA) — scales efficiently for 1M-context tokens ◆
Similar Articles
@Meituan_LongCat: Introducing LongCat-2.0 1.6T parameters · MoE with ~48B active · 1M context The full model behind Owl Alpha on @OpenRou…
Meituan introduces LongCat-2.0, a 1.6T parameter MoE model with ~48B active parameters and 1M context, featuring novel architectures like LongCat Sparse Attention and Zero-Compute Experts, achieving strong benchmark scores on coding and reasoning tasks.
MiniMax Sparse Attention
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
LongCat-2.0, a large-scale MoE model with 1.6T total and 48B Active
LongCat-2.0 is a large-scale Mixture-of-Experts (MoE) model with 1.6 trillion total parameters and 48 billion active parameters.
MiniMax teases upcoming M3 model with new sparse attention mechanism and 15.6X long-context response speed boost (12 minute read)
MiniMax has released a detailed technical report on its M2 series and teased the upcoming M3 model, which uses a novel sparse attention mechanism to achieve up to 15.6× faster decoding at million-token contexts.
@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …
MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.