sparse-attention

#sparse-attention

Depth-Staggered Fibonacci Spacing for Sparse Attention: Static Schedules Beat Learned Dilation and Extrapolate Where Dense Attention Fails

arXiv cs.CL · 15h ago

This paper studies sparse self-attention with Fibonacci-spaced offsets and per-layer scaling. It finds that static layer-wise schedules outperform learned or fixed alternatives, and that the sparse variants extrapolate robustly to 4x the training length while dense attention collapses.


@eliebakouch: the new sparse attention method introduced with this model is basically a combination of components from existing ones.…

X AI KOLs Following · 15h ago

Meituan introduces LongCat-2.0, a 1.6T parameter MoE model with 48B active parameters and 1M context length, featuring a new LongCat Sparse Attention (LSA) method that combines components from existing sparse attention techniques.


@Meituan_LongCat: Introducing LongCat-2.0 1.6T parameters · MoE with ~48B active · 1M context The full model behind Owl Alpha on @OpenRou…

X AI KOLs Timeline · 16h ago

Meituan introduces LongCat-2.0, a 1.6T parameter MoE model with ~48B active parameters and 1M context, featuring novel architectures like LongCat Sparse Attention and Zero-Compute Experts, achieving strong benchmark scores on coding and reasoning tasks.


@rohanpaul_ai: This paper makes long-context attention cheaper and faster by letting each token use only the query heads it needs. Rea…

X AI KOLs Following · 2d ago

The paper introduces Grouped Query Experts, which improves long-context attention by routing each token to only a few query-head experts on top of grouped-query attention, achieving 1.7-1.8x faster prefill while matching accuracy.


@Pavel_Izmailov: New paper: Emergent Capabilities Arise Randomly from Learning Sparse Attention Patterns! Main takeaway: when LLMs learn…

X AI KOLs Timeline · 4d ago

New paper finds that emergent capabilities in LLMs arise randomly from learning sparse attention patterns; the bottleneck is learning which tokens to attend to, which is slow and unpredictable.


@totheagi: We're the first to make the full GLM-5.2 (FP8) run on RTX 4090s. GLM-5.2 is the new 753B SOTA open-weights model, and i…

X AI KOLs Timeline · 2026-06-18

@totheagi claims to be the first to run the full GLM-5.2 (753B, FP8) on RTX 4090s, by porting its sparse-attention kernels to Ada GPUs, which brings a frontier open-weights model to commodity hardware.


Do you guys think subquadratic actually has a 12 million context model

Reddit r/ArtificialInteligence · 2026-06-18

Subquadratic claims to have a model with a 12-million-token context, but access is limited to partners. The model performs well on the "needle in a haystack" test, but there is no evidence of general reasoning ability, which raises doubts.


@ziv_ravid: I read the GLM-5.2 report and saw they use IndexShare, which is a cool, simple trick. Regular attention makes every tok…

X AI KOLs Timeline · 2026-06-17

IndexShare is a technique in the GLM-5.2 report that shares a single indexer across multiple layers in sparse attention, reducing FLOPs by 2.9x at 1M context by avoiding redundant top-key selections per layer.


zai-org/GLM-5.2 is here!

Reddit r/LocalLLaMA · 2026-06-16

Z.AI releases GLM-5.2, a new flagship model with a solid 1M-token context, enhanced coding capabilities with flexible thinking effort, and improved architecture via IndexShare. It is released under an MIT open-source license.


Subquadratic AI introduces SubQ-1.1-Small, a new model using Smart Sparse Attention

Reddit r/singularity · 2026-06-16

Subquadratic AI introduces SubQ-1.1-Small, a model leveraging Smart Sparse Attention to achieve near-perfect long-context retrieval up to 12M tokens with up to 1,000x attention compute reduction. It balances long-context optimization with strong general reasoning, outperforming baselines on benchmarks like NIAH and RULER.


zai-org/GLM-5.2-FP8

Hugging Face Models Trending · 2026-06-16

Z.AI releases GLM-5.2, a flagship open-source model with a solid 1M-token context, improved coding capabilities, and a new IndexShare sparse attention architecture that reduces FLOPs by 2.9x at 1M context.


@rohanpaul_ai: Quite incredible, MiniMax Sparse Attention cuts attention compute by 28.4X at 1M tokens, with 14.2X faster prefill and …

X AI KOLs Following · 2026-06-15

MiniMax Sparse Attention (MSA) achieves up to 28.4x reduction in attention compute at 1M tokens by adding a routing branch that selectively chooses key-value blocks for attention, enabling 14.2x faster prefill and 7.6x faster decoding on H800 GPUs while matching full attention benchmark performance.


MiniMax Sparse Attention for Million-Token Contexts (GitHub Repo)

TLDR AI · 2026-06-15

MiniMaxAI releases MSA, a library for dense and sparse attention kernels optimized for NVIDIA SM100 GPUs, enabling efficient processing of million-token contexts with FlashAttention and sparse top-k attention.


@dair_ai: https://x.com/dair_ai/status/2066174390048358760

X AI KOLs Following · 2026-06-14

A curated thread covering three notable AI papers: MiniMax Sparse Attention for efficient long-context inference, Self-Harness for self-improving agent scaffolds, and Agents' Last Exam benchmark for measuring agent economic value.


Local models in mid-2026

Reddit r/LocalLLaMA · 2026-06-14

A technical overview of the state of local AI models in mid-2026, highlighting how open-weight models have narrowed the gap to frontier models through advances in mixture-of-experts and sparse attention, enabling efficient local inference.


@askalphaxiv: "MiniMax Sparse Attention" This paper from Minimax adds a tiny Index Branch to GQA that picks top k KV blocks per group…

X AI KOLs Timeline · 2026-06-13

This paper from MiniMax introduces MiniMax Sparse Attention, which adds a tiny Index Branch to GQA that selects the top-k KV blocks per group, enabling GPU-native sparsity with exponential speedups on a 109B multimodal MoE.


"inference falls back to dense attention" for MiniMax M3 - does it mean 428B weights used at each step?

Reddit r/LocalLLaMA · 2026-06-12

A discussion of MiniMax M3 in GGUF format: sparse attention is not yet supported there, so inference falls back to dense attention. The thread asks whether this means all 428B weights are used at each step, which would cause a significant slowdown.


@karminski3: Magic! DeepSeekV4 context memory compressed to 1/10! Everyone knows DeepSeekV4 supports 1M context and is heavily optimized. To actually use 1M context, VRAM usage is only about 10GB (compared to DeepSeek-V3.2 which needs about…

X AI KOLs Following · 2026-06-12

FlashMemory-DeepSeek-V4 proposes a novel inference paradigm called Lookahead Sparse Attention (LSA), which uses a neural memory indexer to actively predict future context needs, compressing physical KV cache usage to 13.5% of full context baseline while improving average accuracy by 0.6%. This method adopts a decoupled training strategy that allows independent training of the indexer without loading the base model, significantly reducing training cost.


Avatar V: Scaling Video-Reference Avatar Video Generation

Hugging Face Daily Papers · 2026-06-11

Avatar V is a production-scale framework for generating behaviorally recognizable avatar videos conditioned on full video references, introducing sparse reference attention and motion representation streams to achieve state-of-the-art identity preservation and lip synchronization.


MiniMax Sparse Attention

Hugging Face Daily Papers · 2026-06-11

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
