@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…

X AI KOLs Following 06/10/26, 04:15 PM Papers

attention-mechanism local-linear-attention flash-attention optimizer-geometry muon adamw research

Summary

Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 on decode. Its benefit depends on the optimizer: it holds with Muon but disappears under AdamW, where the model learns to suppress it, which highlights the role of optimizer geometry.

Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode. The most impressive part is that the architecture's benefit works with Muon but disappears under AdamW because the model learns to suppress it. It's conditional on optimizer geometry! Lots of high-quality papers coming from Tilde, great work!
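For readers unfamiliar with the term, "optimizer geometry" usually refers to how an optimizer reshapes the raw gradient before applying it. The sketch below is not taken from the Parallax paper and does not reproduce its experiments. It is a simplified, single-step illustration in NumPy: the function names `adam_like_direction` and `muon_like_direction` are hypothetical, momentum is omitted, and an exact SVD stands in for the Newton-Schulz iteration that Muon uses to approximate orthogonalization.

```python
import numpy as np

def adam_like_direction(grad, eps=1e-8):
    """First bias-corrected Adam step: each entry is divided by its own
    magnitude, so the direction is roughly sign(grad), entry by entry."""
    return grad / (np.abs(grad) + eps)

def muon_like_direction(grad):
    """Orthogonalized step: replace G = U S V^T with U V^T.
    Assumption: exact SVD is used here for clarity; Muon itself
    approximates this with a Newton-Schulz iteration on the momentum."""
    u, _, vt = np.linalg.svd(grad, full_matrices=False)
    return u @ vt

rng = np.random.default_rng(0)
grad = rng.standard_normal((4, 6))  # stand-in for a weight-matrix gradient

adam_dir = adam_like_direction(grad)
muon_dir = muon_like_direction(grad)

# Elementwise view: the Adam-like direction has entries of magnitude ~1.
print("max |entry|, Adam-like:", np.abs(adam_dir).max())
# Spectral view: the Muon-like direction has all singular values equal to 1.
print("singular values, Muon-like:", np.linalg.svd(muon_dir, compute_uv=False))
print("singular values, Adam-like:", np.linalg.svd(adam_dir, compute_uv=False))
```

The Adam-like step equalizes magnitudes per coordinate, while the Muon-like step equalizes them per singular direction of the whole matrix. Whether this difference explains the behavior reported for Parallax is the paper's claim to support, not something this sketch demonstrates.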


Cached at: 06/10/26, 07:54 PM


Similar Articles

Parallax: Parameterized Local Linear Attention for Language Modeling

Hugging Face Daily Papers

Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.

@zhaoran_wang: for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direct…

X AI KOLs Timeline

Discussion of a finding that all softmax/linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes link to paper and code.

MiniMax Sparse Attention

Hugging Face Daily Papers

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference

arXiv cs.CL

SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.

EntmaxKV: Support-Aware Decoding for Entmax Attention

arXiv cs.LG

EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.