Parallax is a new parameterized form of Local Linear Attention that eliminates numerical solvers and matches FlashAttention 2/3 in decoding. Its effectiveness depends on the optimizer: it works with Muon but not with AdamW, which highlights the role of optimizer geometry.
Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.