Tag
This paper studies sparse self-attention with Fibonacci-spaced offsets and per-layer scaling, finding that static layer-wise schedules outperform learned or fixed ones, and that sparse variants robustly extrapolate to 4x training length while dense attention collapses.
Meituan introduces LongCat-2.0, a 1.6T parameter MoE model with 48B active parameters and 1M context length, featuring a new LongCat Sparse Attention (LSA) method that combines components from existing sparse attention techniques.
Meituan introduces LongCat-2.0, a 1.6T parameter MoE model with ~48B active parameters and 1M context, featuring novel architectures like LongCat Sparse Attention and Zero-Compute Experts, achieving strong benchmark scores on coding and reasoning tasks.
The paper introduces Grouped Query Experts, which improves long-context attention by routing each token to only a few query-head experts on top of grouped-query attention, achieving 1.7-1.8x faster prefill while matching accuracy.
New paper finds that emergent capabilities in LLMs arise randomly from learning sparse attention patterns; the bottleneck is learning which tokens to attend to, which is slow and unpredictable.
We're the first to run the full GLM-5.2 (753B FP8) on RTX 4090s by porting sparse-attention kernels to Ada GPUs, enabling a frontier open-weights model to run on commodity hardware.
Sub Quadratic claims to have a model with a 12-million-token context, but access is limited to partners; it performs well on the "needle in a haystack" test, but no evidence of general reasoning ability has been shown, which raises doubts.
IndexShare is a technique in the GLM-5.2 report that shares a single indexer across multiple layers in sparse attention, reducing FLOPs by 2.9x at 1M context by avoiding redundant top-key selections per layer.
Z.AI releases GLM-5.2, a new flagship model with a solid 1M-token context, enhanced coding capabilities with flexible thinking effort, and improved architecture via IndexShare. It is released under an MIT open-source license.
Subquadratic AI introduces SubQ-1.1-Small, a model leveraging Smart Sparse Attention to achieve near-perfect long-context retrieval up to 12M tokens with up to 1,000x attention compute reduction. It balances long-context optimization with strong general reasoning, outperforming baselines on benchmarks like NIAH and RULER.
Z.AI releases GLM-5.2, a flagship open-source model with a solid 1M-token context, improved coding capabilities, and a new IndexShare sparse attention architecture that reduces FLOPs by 2.9x at 1M context.
MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.
MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.
A curated thread covering three notable AI papers: MiniMax Sparse Attention for efficient long-context inference, Self-Harness for self-improving agent scaffolds, and Agents' Last Exam benchmark for measuring agent economic value.
A technical overview of the state of local AI models in mid-2026, highlighting how open-weight models have narrowed the gap to frontier models through advances in mixture-of-experts and sparse attention, enabling efficient local inference.
This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA to select top-k KV blocks per group, enabling GPU-native sparsity with exponential speedups on a 109B multimodal MoE.
Notes that sparse attention for MiniMax M3 is not yet supported in the GGUF format, so inference falls back to dense attention, potentially using all 428B weights at each step and causing a significant slowdown.
FlashMemory-DeepSeek-V4 proposes a novel inference paradigm called Lookahead Sparse Attention (LSA), which uses a neural memory indexer to actively predict future context needs, compressing physical KV cache usage to 13.5% of full context baseline while improving average accuracy by 0.6%. This method adopts a decoupled training strategy that allows independent training of the indexer without loading the base model, significantly reducing training cost.
Avatar V is a production-scale framework for generating behaviorally recognizable avatar videos conditioned on full video references, introducing sparse reference attention and motion representation streams to achieve state-of-the-art identity preservation and lip synchronization.
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.