@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…

Summary

Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 on decode. Its benefit depends on the optimizer: it holds under Muon but disappears under AdamW, where the model learns to suppress it, which highlights the role of optimizer geometry.

Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode. The most impressive part is that the architecture's benefit works with Muon but disappears under AdamW because the model learns to suppress it. It's conditional on optimizer geometry! Lots of high-quality papers coming from Tilde, great work!

Cached at: 06/10/26, 07:54 PM


Similar Articles

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4× at 1M context with wall-clock speedups of 14.2× for prefill and 7.6× for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG

EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.