@NousResearch: Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that deli…


Summary

NousResearch releases Lighthouse Attention, a selection-based hierarchical attention that achieves a 1.4-1.7x wall-clock speedup at 98K context and a ~17x faster forward/backward pass than standard attention at 512K context on a single B200, validated on 530M-parameter Llama-3 models across 50B tokens.

Today we release Lighthouse Attention, a selection-based hierarchical attention for long-context pre-training that delivers a 1.4-1.7× wall-clock speedup at 98K context. It runs the same forward+backward pass ~17× faster than standard attention at 512K context on a single B200, without a custom sparse attention kernel, a straight-through estimator, or an auxiliary loss. During training, queries, keys, and values are pooled symmetrically into a multi-resolution pyramid. We then score every pyramid level, a top-k cascade selects a small, hierarchical, dense sub-sequence, and, after a sorting pass that enforces causality, we use standard attention for token mixing. A brief full-attention resume at the end converts the checkpoint back into a competent dense-attention model. We validated this using 530M-parameter Llama-3 models across 50B tokens, with up to 1M-token benchmarks across 32 B200s under context parallelism. The work on Lighthouse Attention was led by @bloc97_, @SubhoGhosh02, and @theemozilla.
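
The pipeline described above (pool, score, top-k cascade, sort, dense attention) can be illustrated with a toy, pure-Python sketch. This is not the released implementation. Several details are assumptions made only for illustration: mean pooling as the pooling operator, the key norm as a placeholder scorer, a cascade that refines the top-k entries of each level into their children while keeping the rest as coarse summaries, and sorting by start position as the causality-enforcing pass. All function names are hypothetical.

```python
import math
import random


def mean_pool(rows, factor):
    """Average each run of `factor` consecutive vectors into one vector."""
    return [[sum(col) / factor for col in zip(*rows[i:i + factor])]
            for i in range(0, len(rows), factor)]


def build_pyramid(rows, factor, levels):
    """Level 0 holds the raw tokens; each higher level pools the one below."""
    pyramid = [rows]
    for _ in range(levels - 1):
        pyramid.append(mean_pool(pyramid[-1], factor))
    return pyramid


def key_norm(k_pyr, level, i):
    """Placeholder scorer (an assumption): L2 norm of the pooled key."""
    return math.sqrt(sum(x * x for x in k_pyr[level][i]))


def select(k_pyr, factor, top_k):
    """Assumed top-k cascade, run from the coarsest level down to the tokens.

    At each level the top_k highest-scoring entries are replaced by their
    children one level down; the remaining entries stay as coarse summaries.
    Returns (level, index) pairs ordered by the first token they cover.
    """
    top = len(k_pyr) - 1
    candidates = list(range(len(k_pyr[top])))
    chosen = []
    for level in range(top, 0, -1):
        ranked = sorted(candidates, key=lambda i: key_norm(k_pyr, level, i),
                        reverse=True)
        winners, losers = ranked[:top_k], ranked[top_k:]
        chosen += [(level, i) for i in losers]
        candidates = [i * factor + j for i in winners for j in range(factor)]
    chosen += [(0, i) for i in candidates]
    # Sorting pass: order entries by start position so a causal mask is valid.
    chosen.sort(key=lambda e: e[1] * factor ** e[0])
    return chosen


def causal_attention(q, k, v):
    """Plain dense softmax attention with a causal mask."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i, qi in enumerate(q):
        logits = [scale * sum(a * b for a, b in zip(qi, k[j]))
                  for j in range(i + 1)]
        m = max(logits)
        w = [math.exp(x - m) for x in logits]
        z = sum(w)
        out.append([sum(w[j] * v[j][d] for j in range(i + 1)) / z
                    for d in range(len(v[0]))])
    return out


def lighthouse_like_attention(q, k, v, factor=2, levels=3, top_k=1):
    """Pool q, k, v identically, select a short sub-sequence, attend densely."""
    assert len(q) % factor ** (levels - 1) == 0, "length must fill the pyramid"
    q_pyr, k_pyr, v_pyr = (build_pyramid(x, factor, levels) for x in (q, k, v))
    chosen = select(k_pyr, factor, top_k)
    gather = lambda pyr: [pyr[level][i] for level, i in chosen]
    return chosen, causal_attention(gather(q_pyr), gather(k_pyr), gather(v_pyr))


random.seed(0)
n, d = 16, 4
q, k, v = ([[random.gauss(0, 1) for _ in range(d)] for _ in range(n)]
           for _ in range(3))
chosen, out = lighthouse_like_attention(q, k, v)
print(len(chosen), "entries instead of", n)  # 6 entries instead of 16
```

In this toy setting the dense attention step sees 6 entries rather than 16 tokens, which is where the savings would come from at long context. The sketch stops at the sub-sequence output; how outputs map back to full resolution, and how the real scorer works, are not covered by the announcement and are left out here.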

Similar Articles

Lighthouse Attention (11 minute read)

TLDR AI

Lighthouse Attention is a selection-based hierarchical attention mechanism that accelerates long-context pretraining by running forward+backward passes ~17× faster at 512K context and delivering 1.4–1.7× end-to-end speedup at 98K context, validated with Llama-3 530M on 50B tokens.

Long Context Pre-Training with Lighthouse Attention

Hugging Face Daily Papers

Lighthouse Attention is a training-only hierarchical selection-based attention algorithm that reduces computational complexity for long sequence training of causal transformers, enabling faster pre-training with competitive final loss after a recovery phase.

@tilderesearch: https://x.com/tilderesearch/status/2061771450168889432

X AI KOLs Timeline

Wall Attention generalizes diagonal forget gates to softmax attention, enabling state-of-the-art length extrapolation from 4k to 160k+ context zero-shot and outperforming RoPE and FoX in pretraining. It is released as a drop-in replacement with open-source Triton kernels.