@maximelabonne: Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode…
Summary
Parallax is a new parametrized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its benefit depends on the optimizer: it appears under Muon but vanishes under AdamW, where the model learns to suppress the mechanism. This highlights the role of optimizer geometry.
Cached at: 06/10/26, 07:54 PM
Parallax is a parametrized form of Local Linear Attention that drops the numerical solvers and matches FA 2/3 on decode.
The most impressive part is that the architecture’s benefit shows up with Muon but disappears under AdamW, because the model learns to suppress it.
It’s conditional on optimizer geometry!
Lots of high-quality papers coming from Tilde, great work!
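For readers unfamiliar with the phrase "optimizer geometry", here is a rough, self-contained sketch of how an elementwise update direction can differ from an orthogonalized one for the same gradient matrix. It is not taken from the Parallax paper or code. The function names are hypothetical, the sign update is a deliberate simplification of AdamW (no moment averaging, epsilon, or weight decay), and the classical cubic Newton-Schulz iteration stands in for the tuned variant that Muon uses.

```python
# Illustrative only: contrasts an elementwise (Adam-style) update direction
# with an orthogonalized (Muon-style) one. All names here are hypothetical.

def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def frobenius_norm(a):
    return sum(x * x for row in a for x in row) ** 0.5

def sign_update(grad):
    """Elementwise direction: each entry is treated independently.
    Assumption: a simplification of Adam-style behavior, ignoring moment
    estimates, epsilon, and weight decay."""
    return [[(x > 0) - (x < 0) for x in row] for row in grad]

def orthogonalize(grad, steps=30):
    """Cubic Newton-Schulz iteration X <- 1.5 X - 0.5 X X^T X.
    After Frobenius normalization every singular value lies in (0, 1],
    so the iteration pushes all of them toward 1. Assumption: a stand-in
    for Muon's orthogonalization step, not its exact recipe."""
    norm = frobenius_norm(grad)
    x = [[v / norm for v in row] for row in grad]
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * x[i][j] - 0.5 * xxt_x[i][j]
              for j in range(len(x[0]))] for i in range(len(x))]
    return x

# A gradient dominated by one direction (singular values roughly 5.0 and 0.08).
grad = [[4.0, 2.0], [2.0, 1.1]]

elementwise = sign_update(grad)      # every entry has the same sign here
orthogonal = orthogonalize(grad)     # both singular directions weighted equally

for name, m in (("elementwise", elementwise), ("orthogonalized", orthogonal)):
    print(name, [[round(v, 4) for v in row] for row in m])
```

In this toy case the elementwise direction is a rank-1 matrix, while the orthogonalized direction is close to the identity because the example gradient is symmetric positive definite. The two rules move the weights along different directions even though they see the same gradient, which is one way to read the claim that an architectural benefit can depend on the optimizer.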
Similar Articles
Parallax: Parameterized Local Linear Attention for Language Modeling
Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.
@zhaoran_wang: for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direct…
Discussion of a finding that all softmax/linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes link to paper and code.
MiniMax Sparse Attention
MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.
SparDA: Sparse Decoupled Attention for Efficient Long-Context LLM Inference
SparDA proposes a decoupled sparse attention architecture that adds a lightweight 'Forecast' projection to predict future KV cache needs, enabling lookahead prefetching from CPU to GPU and reducing selection overhead. On 8B sparse-pretrained models, it achieves up to 1.25× prefill and 1.7× decode speedup, with up to 5.3× higher decode throughput over non-offload baselines.
EntmaxKV: Support-Aware Decoding for Entmax Attention
EntmaxKV introduces a support-aware sparse decoding framework for entmax attention that reduces KV-cache memory traffic by exploiting sparsity before loading pages, achieving significant speedups on long-context benchmarks while maintaining output quality.