efficient-attention

Memory-Managed Long-Context Attention: A Preliminary Study of Editable Request-Local Memory

arXiv cs.CL · 5d ago Cached

This paper investigates memory-managed long-context attention, a research direction that separates efficient state compression from explicit editable memory slots. Experiments show that a hybrid approach combining fast recurrent/sparse backbones with explicit memory management outperforms pure fixed-state or pure sparse methods across synthetic tasks and long-context benchmarks.

NLL-Guided Full-Attention Layer Selection for Training-Free Sliding-Window Adaptation

arXiv cs.CL · 6d ago Cached

Proposes a training-free, NLL-guided method for selecting which layers retain full attention in hybrid attention models. On long-context tasks, keeping full attention in 1/4 of the layers achieves accuracy comparable to a periodic baseline that keeps it in 1/2 of the layers.

Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention

arXiv cs.CL · 2026-06-26 Cached

Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
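A minimal sketch of what decoupling the erase and write addresses can look like in a delta-rule style memory. This is not taken from the paper: the function names, the `alpha` and `beta` strengths, and the exact update form (erase along one key, then write along another) are assumptions chosen for illustration. When `erase_key` equals `write_key` and `alpha` equals `beta`, the update reduces to the classic delta rule.

```python
def erase_then_write(state, erase_key, write_key, value, alpha=1.0, beta=1.0):
    """One hypothetical memory update on a d_v x d_k state matrix.

    Erase step:  S <- S - alpha * (S e) e^T   (suppress what is stored at address e)
    Write step:  S <- S + beta  * v k^T       (store value v at address k)
    """
    d_k = len(erase_key)
    # What the memory currently returns for the erase address.
    stored = [sum(row[j] * erase_key[j] for j in range(d_k)) for row in state]
    erased = [
        [row[j] - alpha * stored[i] * erase_key[j] for j in range(d_k)]
        for i, row in enumerate(state)
    ]
    return [
        [erased[i][j] + beta * value[i] * write_key[j] for j in range(d_k)]
        for i in range(len(state))
    ]


def read(state, query):
    """Read from the memory: returns S q."""
    return [sum(row[j] * query[j] for j in range(len(query))) for row in state]


# Store [1, 2] at address [1, 0], then erase that address while writing
# [5, 5] at a different address [0, 1].
memory = [[0.0, 0.0], [0.0, 0.0]]
memory = erase_then_write(memory, [1.0, 0.0], [1.0, 0.0], [1.0, 2.0])
memory = erase_then_write(memory, [1.0, 0.0], [0.0, 1.0], [5.0, 5.0])
```

With unit-norm keys and `alpha=1.0`, the erase step removes everything previously stored at the erase address, which is why the old content is gone after the second update even though the new content was written elsewhere.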

Gaussian Mixture Attention: Linear-Time Sequence Mixing via Probabilistic Latent Routing

arXiv cs.LG · 2026-06-18 Cached

This paper introduces Gaussian Mixture Attention (GMA), a probabilistic attention mechanism that replaces explicit pairwise query-key comparisons with routing through learned Gaussian mixture components, achieving linear-time complexity in sequence length. Experiments show competitive performance on long-context tasks with fixed-K linear memory scaling.

@ziv_ravid: I read the GLM-5.2 report and saw they use IndexShare, which is a cool, simple trick. Regular attention makes every tok…

X AI KOLs Timeline · 2026-06-17 Cached

IndexShare is a technique in the GLM-5.2 report that shares a single indexer across multiple layers in sparse attention, reducing FLOPs by 2.9x at 1M context by avoiding redundant top-key selections per layer.
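The idea of sharing one indexer across layers can be sketched roughly as follows. This is an illustrative toy, not the GLM-5.2 implementation: `top_k_indices`, `sparse_attention`, `shared_index_forward`, and the per-layer `(query, keys, values)` layout are all hypothetical names and assumptions. The point it shows is that the top-key selection runs once and every layer in the group attends only to the same selected positions.

```python
import math


def top_k_indices(scores, k):
    """Return the positions of the k highest scores, in ascending position order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def sparse_attention(query, keys, values, selected):
    """Softmax attention restricted to the selected key positions."""
    scale = math.sqrt(len(query))
    logits = [sum(q * x for q, x in zip(query, keys[i])) / scale for i in selected]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(logit - peak) for logit in logits]
    total = sum(weights)
    return [
        sum(w * values[i][d] for w, i in zip(weights, selected)) / total
        for d in range(len(values[0]))
    ]


def shared_index_forward(indexer_scores, k, layers):
    """Select keys once, then reuse the selection in every layer of the group.

    layers: list of (query, keys, values) tuples, one per layer (assumed layout).
    """
    selected = top_k_indices(indexer_scores, k)
    outputs = [sparse_attention(q, ks, vs, selected) for q, ks, vs in layers]
    return selected, outputs
```

Without sharing, each layer would run its own indexer and its own top-k selection; the saving in this sketch comes from calling `top_k_indices` once per group instead of once per layer.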

Rethinking the Role of Efficient Attention in Hybrid Architectures

arXiv cs.CL · 2026-06-16 Cached

This paper systematically analyzes the role of efficient attention modules in hybrid language model architectures. It finds that different designs converge in long-context performance under sufficient training, and that long-range retrieval is carried primarily by full attention while efficient attention shapes the optimization trajectory. The analysis also reveals a 'Large-Window Laziness' phenomenon.

MiniMax Sparse Attention

Hugging Face Daily Papers · 2026-06-11 Cached

MiniMax Sparse Attention introduces a blockwise sparse attention mechanism that achieves significant speedups for ultra-long-context LLMs, reducing per-token attention compute by 28.4x at 1M context with wall-clock speedups of 14.2x for prefill and 7.6x for decoding on H800 GPUs. The method is accompanied by an open-source inference kernel and a publicly released multimodal model.

Blurry Window Attention

arXiv cs.LG · 2026-06-10 Cached

Introduces Blurry Window Attention (BLA), a novel attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associative Recall task.

Dynamic Linear Attention

arXiv cs.CL · 2026-06-10 Cached

This paper proposes DLA, a dynamic memory modeling framework for multi-state linear attention that adaptively merges states based on token information variation and maintains a fixed-size state cache, enabling better long-context representation without the quadratic complexity of standard attention.

Wall Attention (GitHub Repo)

TLDR AI · 2026-06-03 Cached

Wall Attention is a new attention variant with per-channel, per-timestep multiplicative decay, providing content-dependent forgetting rates and efficient training/decode kernels implemented in Triton.
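For readers unfamiliar with per-channel, per-timestep decay, here is a generic sketch of how such a decay can enter a recurrent linear-attention state. It is not taken from the Wall Attention repository: the function names, the state layout (key channels by value dimensions), and the exact recurrence are assumptions. In a content-dependent variant the `decay` vector would be computed from the current token; here it is passed in directly.

```python
def decayed_state_update(state, decay, key, value):
    """One recurrent step: S_t = diag(decay_t) S_{t-1} + k_t v_t^T.

    state: d_k x d_v matrix; decay: one factor in [0, 1] per key channel,
    supplied fresh at every timestep.
    """
    return [
        [decay[c] * state[c][j] + key[c] * value[j] for j in range(len(value))]
        for c in range(len(key))
    ]


def read_state(state, query):
    """Output for a query: o_t = S_t^T q_t."""
    return [
        sum(query[c] * state[c][j] for c in range(len(query)))
        for j in range(len(state[0]))
    ]


# Two steps: the second step halves channel 0 and leaves channel 1 untouched.
state = [[0.0], [0.0]]
state = decayed_state_update(state, [1.0, 1.0], [1.0, 0.0], [3.0])
state = decayed_state_update(state, [0.5, 1.0], [0.0, 1.0], [4.0])
```

Because each channel has its own factor at each step, the state can forget one channel quickly while retaining another, which is the behavior the summary describes as content-dependent forgetting rates.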

Parallax: Parameterized Local Linear Attention for Language Modeling

Hugging Face Daily Papers · 2026-05-27 Cached

Introduces Parallax, a parameterized local linear attention mechanism with hardware-aware optimization that improves LLM pretraining efficiency and performance, achieving Pareto improvements at 0.6B and 1.7B scales.

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Hugging Face Daily Papers · 2026-05-21 Cached

Gated DeltaNet-2 introduces separate erase and write gates for linear attention, achieving superior performance in long-context language modeling and retrieval tasks.

CompactAttention: Accelerating Chunked Prefill with Block-Union KV Selection

Hugging Face Daily Papers · 2026-05-16 Cached

CompactAttention introduces Block-Union KV Selection to accelerate chunked prefill for long-context LLMs, achieving up to 2.72x attention speedup on LLaMA-3.1-8B at 128K context while maintaining accuracy close to dense attention.

Training-Inference Consistent Segmented Execution for Long-Context LLMs

arXiv cs.CL · 2026-05-13 Cached

This paper proposes a training-inference consistent segmented execution framework for long-context LLMs to address the mismatch between full-context training and restricted inference regimes, achieving comparable performance with significantly reduced memory usage.

@omarsar0: Cool idea from Nous Research. What if you could speed up long-context pretraining with a subquadratic wrapper that you …

X AI KOLs Following · 2026-05-12 Cached

Nous Research introduces Lighthouse Attention, a training-only subquadratic wrapper for scaled dot-product attention that accelerates long-context pretraining and can be removed before deployment to preserve vanilla inference efficiency.

Toeplitz MLP Mixers are Low Complexity, Information-Rich Sequence Models

arXiv cs.LG · 2026-05-11 Cached

This paper introduces Toeplitz MLP Mixers (TMM), a novel architecture that replaces attention with Toeplitz matrix multiplication to achieve lower computational complexity while maintaining high information retention and training efficiency.
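As a rough illustration of sequence mixing by Toeplitz matrix multiplication, the sketch below builds a lower-triangular Toeplitz matrix from a coefficient vector and applies it along the sequence axis. The causal (lower-triangular) form, the function names, and the dense construction are assumptions for clarity and are not taken from the paper; a practical version would typically avoid materializing the full matrix.

```python
def toeplitz_matrix(coeffs, n):
    """Causal n x n Toeplitz matrix with T[i][j] = coeffs[i - j] for 0 <= i - j < len(coeffs)."""
    return [
        [coeffs[i - j] if 0 <= i - j < len(coeffs) else 0.0 for j in range(n)]
        for i in range(n)
    ]


def toeplitz_mix(coeffs, tokens):
    """Mix a sequence of feature vectors: y_i = sum_j T[i][j] * x_j.

    Every position uses the same coefficients, indexed only by the offset i - j,
    so the mixing weights do not depend on token content.
    """
    n, dim = len(tokens), len(tokens[0])
    matrix = toeplitz_matrix(coeffs, n)
    return [
        [sum(matrix[i][j] * tokens[j][d] for j in range(n)) for d in range(dim)]
        for i in range(n)
    ]


mixed = toeplitz_mix([1.0, 0.5], [[1.0], [2.0], [4.0]])
```

Unlike attention, the weights here depend only on relative position, so the matrix is fully described by one coefficient per offset rather than one score per query-key pair.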
