@eliebakouch: the new sparse attention method introduced with this model is basically a combination of components from existing ones.…

X AI KOLs Following 06/30/26, 03:19 AM Models

Summary

Meituan introduces LongCat-2.0, a 1.6T-parameter MoE model with ~48B active parameters and a 1M-token context length. It features a new LongCat Sparse Attention (LSA) method that combines components from existing sparse attention techniques.

the new sparse attention method introduced with this model is basically a combination of components from existing ones. let's go over each sparse attention method and what they keep from them:

- deepseek sparse attention (DSA): they keep the top-k indexer, this is the basis of all the following sparse attention methods
- DSA + index sharing (glm5.2): they share the index over multiple layers
- minimax sparse attention (and also NSA from deepseek): they do block level top-k indexing -> they add another top-k at the token level to select precisely each token in the selected block
- compressed sparse attention (from deepseek V4) and also NSA: they keep the sliding window -> they also add a sink token iirc

tl;dr: they do block level indexing, then token level on the selected blocks + add a sliding window and sink component with a 50/50 budget split. they share the token level top-k across layers
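To make the tl;dr concrete, here is a minimal pure-Python sketch of the key selection for a single query position. It is not LongCat's implementation. The function names (`top_k`, `select_keys`), the precomputed `block_scores` and `token_scores` standing in for an indexer, and the reading of "50/50 budget split" as half the budget for sink + sliding window and half for the token-level top-k are all assumptions made for illustration.

```python
from typing import List, Sequence


def top_k(scores: Sequence[float], candidates: Sequence[int], k: int) -> List[int]:
    """Return the k candidates with the highest score (ties go to the lower index)."""
    return sorted(candidates, key=lambda i: (-scores[i], i))[:k]


def select_keys(
    query_pos: int,
    block_scores: Sequence[float],  # hypothetical indexer score per key block
    token_scores: Sequence[float],  # hypothetical indexer score per key token
    block_size: int,
    n_blocks: int,                  # blocks kept by the block-level top-k
    budget: int,                    # total keys this query may attend to
    n_sink: int = 1,
) -> List[int]:
    visible = query_pos + 1              # causal mask: keys 0..query_pos
    local_budget = budget // 2           # ASSUMED meaning of the 50/50 split
    sparse_budget = budget - local_budget

    # Local part: sink token(s) plus a sliding window ending at the query.
    sink = list(range(min(n_sink, visible)))
    window_len = max(local_budget - len(sink), 0)
    window = list(range(max(visible - window_len, 0), visible))
    local = set(sink) | set(window)

    # Stage 1: block-level top-k over blocks that contain visible keys.
    visible_blocks = range((visible + block_size - 1) // block_size)
    kept_blocks = top_k(block_scores, visible_blocks, n_blocks)

    # Stage 2: token-level top-k, restricted to tokens inside the kept blocks.
    candidates = [
        t
        for b in kept_blocks
        for t in range(b * block_size, min((b + 1) * block_size, visible))
        if t not in local
    ]
    sparse = top_k(token_scores, candidates, sparse_budget)
    return sorted(local | set(sparse))


# Toy example: 16 keys in 4 blocks of 4, budget of 8 keys for the last query.
block_scores = [0.1, 0.9, 0.8, 0.2]
token_scores = [0.0, 0.0, 0.0, 1.0, 0.1, 0.9, 0.7, 0.2,
                0.3, 0.8, 0.6, 0.1, 0.0, 0.0, 0.0, 0.0]
selected = select_keys(15, block_scores, token_scores,
                       block_size=4, n_blocks=2, budget=8)
print(selected)  # [0, 5, 6, 9, 10, 13, 14, 15]

# Index sharing (assumed grouping): compute once, reuse for several layers.
shared_per_layer = {layer: selected for layer in range(4)}
```

Token 3 has the highest token score in the toy data but is never picked, because its block lost the block-level top-k. That is the cost of the two-stage design: the token-level indexer only scores tokens inside the surviving blocks. Sharing `selected` across layers means the indexer does not have to run again for each of them.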


Meituan LongCat (@Meituan_LongCat): Introducing LongCat-2.0 🐱 1.6T parameters · MoE with ~48B active · 1M context The full model behind Owl Alpha on @OpenRouter — now available.

Built for agentic coding from the ground up: ◆ LongCat Sparse Attention (LSA) — scales efficiently for 1M-context tokens ◆
