@NousResearch: Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that deli…
Summary
Nous Research releases Lighthouse Attention, a selection-based hierarchical attention mechanism for long-context pre-training. It delivers a 1.4-1.7x wall-clock speedup at 98K context and a forward/backward pass roughly 17x faster than standard attention at 512K context on a single B200. The results are validated on 530M-parameter Llama-3 models trained on 50B tokens.
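To make "selection-based" concrete, here is a minimal sketch of the general idea behind selective attention: score coarse summaries of key blocks, keep only the best-scoring blocks, and run ordinary softmax attention inside them. This is a generic illustration, not the Lighthouse Attention algorithm. The function names, the mean-pooled block summaries, and the `block_size` and `top_k` parameters are all assumptions made for this example, and the sketch ignores causal masking and batching.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, keys, values):
    """Standard scaled dot-product attention for a single query vector."""
    scale = 1.0 / math.sqrt(len(query))
    weights = softmax([dot(query, k) * scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]

def block_select_attend(query, keys, values, block_size, top_k):
    """Hypothetical coarse-to-fine selection (an assumption, not the released method).

    1. Mean-pool keys into blocks of `block_size`.
    2. Rank blocks by the query's dot product with each block summary.
    3. Attend only over the tokens in the `top_k` best blocks.
    Returns the attention output and the selected token indices.
    """
    starts = list(range(0, len(keys), block_size))
    summaries = []
    for s in starts:
        block = keys[s:s + block_size]
        summaries.append([sum(col) / len(block) for col in zip(*block)])
    ranked = sorted(range(len(starts)),
                    key=lambda b: dot(query, summaries[b]),
                    reverse=True)
    chosen = sorted(ranked[:top_k])  # restore original token order
    idx = [i for b in chosen
           for i in range(starts[b], min(starts[b] + block_size, len(keys)))]
    out = attend(query, [keys[i] for i in idx], [values[i] for i in idx])
    return out, idx

# Toy data: the first two keys align with the query, the last two do not.
keys = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
values = [[1.0], [2.0], [3.0], [4.0]]
query = [1.0, 0.0]

sparse_out, sparse_idx = block_select_attend(query, keys, values, block_size=2, top_k=1)
full_out, full_idx = block_select_attend(query, keys, values, block_size=2, top_k=2)
```

With `top_k=1` only the first block is attended, so the per-query cost scales with the number of selected tokens rather than the full sequence length. With `top_k` equal to the number of blocks, the result matches full attention.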
Similar Articles
Lighthouse Attention (11 minute read)
Lighthouse Attention is a selection-based hierarchical attention mechanism that accelerates long-context pretraining: the forward+backward pass runs ~17× faster at 512K context, and end-to-end training is 1.4–1.7× faster at 98K context. Results are validated with a 530M-parameter Llama-3 model on 50B tokens.
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention is a training-only hierarchical selection-based attention algorithm that reduces computational complexity for long sequence training of causal transformers, enabling faster pre-training with competitive final loss after a recovery phase.
@omarsar0: Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you …
Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.
@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432
Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.
Parallax: Parameterized Local Linear Attention for Language Modeling
Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.