linear-attention

#linear-attention

Erase-then-Delta Attention: Decoupling Erase and Write Addresses in Delta-Rule Linear Attention

arXiv cs.CL · yesterday

Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
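As a rough illustration of what decoupling the two addresses could mean, the sketch below contrasts the classic delta-rule update with a hypothetical variant that erases along one vector and writes along another. The summary does not give the paper's actual formulation, so the function names, the shared step size `beta`, and the exact update form are all assumptions.

```python
# Illustrative sketch only: the update forms below are assumptions,
# not the rule defined in the EDA paper.

def matvec(S, x):
    """Read from memory: returns S @ x for a list-of-lists matrix S."""
    return [sum(s * xj for s, xj in zip(row, x)) for row in S]

def delta_update(S, k, v, beta):
    """Classic delta rule: S <- S - beta * (S k - v) k^T."""
    err = [p - vi for p, vi in zip(matvec(S, k), v)]
    return [[s - beta * e * kj for s, kj in zip(row, k)]
            for row, e in zip(S, err)]

def decoupled_update(S, erase_key, write_key, v, beta):
    """Hypothetical variant: erase along erase_key, then write v at write_key."""
    stale = matvec(S, erase_key)
    erased = [[s - beta * st * ej for s, ej in zip(row, erase_key)]
              for row, st in zip(S, stale)]
    return [[s + beta * vi * wj for s, wj in zip(row, write_key)]
            for row, vi in zip(erased, v)]

# Store (2, 3) at address [1, 0] with the classic rule.
S = [[0.0, 0.0], [0.0, 0.0]]
S = delta_update(S, k=[1.0, 0.0], v=[2.0, 3.0], beta=1.0)

# Clear address [1, 0] while storing (5, 7) at address [0, 1].
S = decoupled_update(S, erase_key=[1.0, 0.0], write_key=[0.0, 1.0],
                     v=[5.0, 7.0], beta=1.0)
```

With unit-norm keys and `beta = 1`, the old association at `[1, 0]` is cleared while the new value is stored at `[0, 1]`. When `erase_key` equals `write_key`, this variant reduces to the classic delta rule.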

SSM Adapters via Hankel Reduced-order Modeling: Injection Site Determines Task Suitability in Long-Context Fine-Tuning

arXiv cs.LG · yesterday

Introduces the Hankel Reduced-order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation for parameter-efficient fine-tuning, which outperforms LoRA on long-context tasks.

HydraHead: From Head-Level Functional Heterogeneity to Specialized Attention Hybridization

Hugging Face Daily Papers · 2026-06-18

HydraHead is a novel attention hybridization architecture that combines Full and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead via interpretability-driven selection and scale-normalized fusion.

SinkRec: Mitigating Semantic State Sink in Long Sequence Recommendation with Memory-Conditioned Gated Delta Networks

arXiv cs.LG · 2026-06-10

SinkRec introduces a hybrid memory-transition architecture to mitigate semantic state sink in long sequence recommendation, using memory-conditioned gated delta networks to decouple pattern storage from dynamic modeling, achieving linear-time efficiency.

Blurry Window Attention

arXiv cs.LG · 2026-06-10

Introduces Blurry Window Attention (BLA), a novel attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associative Recall task.

Dynamic Linear Attention

arXiv cs.CL · 2026-06-10

This paper proposes DLA, a dynamic memory modeling framework for multi-state linear attention that adaptively merges states based on token information variation and maintains a fixed-size state cache, enabling better long-context representation without the quadratic complexity of standard attention.

Dynamic Linear Attention

Hugging Face Daily Papers · 2026-06-09

DLA introduces adaptive state merging and capacity-bounded memory modeling for multi-state linear attention, improving long-context LLM performance.

Unlocking Feature Learning in Gated Delta Networks at Scale

arXiv cs.LG · 2026-06-04

This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD, whereas standard parametrization fails.

@zhaoran_wang: for me, the coolest finding is that you can connect/interpolate all softmax/linear variants and give a promising direct…

X AI KOLs Timeline · 2026-05-30

Discussion of a finding that all softmax/linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes link to paper and code.

Linearizing Vision Transformer with Test-Time Training

Hugging Face Daily Papers · 2026-05-28

This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving comparable text-to-image quality to fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, resulting in SD3.5-T^5 with 1.32x speedup at 1K resolution.

Interdomain Attention: Beyond Token-Level Key-Value Memory

arXiv cs.LG · 2026-05-26

Proposes Interdomain Attention, a new method that integrates state space models into attention via kernel methods, achieving efficient long-context modeling with a fixed-size state and outperforming SSMs and softmax attention in language modeling experiments up to 1.3B parameters.

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

arXiv cs.LG · 2026-05-25

Tensor Cache introduces a two-level caching mechanism that compresses evicted key-value pairs from sliding-window attention into a fixed-size associative memory, improving long-context language modeling without unbounded memory growth.

@jiqizhixin: New from NVIDIA! You can edit a model’s compressed memory without scrambling what it already knows! Enter Gated DeltaNe…

X AI KOLs Timeline · 2026-05-22

NVIDIA introduces Gated DeltaNet-2, a method for editing compressed model memory without catastrophic forgetting, using independent gates for erase and write operations. It outperforms existing models like Mamba-2 and Mamba-3 on language modeling and long-context tasks.

@BlinkDL_AI: Gated DeltaNet-2 is almost exactly RWKV-7's DPLR recurrence, not acknowledging the elephant in the room

X AI KOLs Following · 2026-05-22

Ali Hatamizadeh announces Gated DeltaNet-2, a new linear attention model that outperforms KDA and Mamba-3 at 1.3B scale; @BlinkDL_AI notes its recurrence is nearly identical to RWKV-7's DPLR.

Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Hugging Face Daily Papers · 2026-05-21

Gated DeltaNet-2 introduces separate erase and write gates for linear attention, achieving superior performance in long-context language modeling and retrieval tasks.

Exact Linear Attention

arXiv cs.LG · 2026-05-20

This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention without approximation error by leveraging kernel decomposition, and addresses gradient explosion and token dilution through constrained kernel functions. It also presents engineering innovations including Hyper Link, Memory Lobe, and a routing bias for Mixture of Experts.

The Routing and Filtering Structure of Attention

arXiv cs.LG · 2026-05-20

The paper decomposes the attention interaction matrix into routing (skew-symmetric) and filtering (symmetric) components, introducing S-D attention to disentangle them. It reveals a spectral cascade in routing that predicts where attention can be simplified, achieving significant parameter reduction with minimal perplexity loss.
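The split underlying this decomposition is the standard identity that any square matrix is the sum of a symmetric part and a skew-symmetric part. A minimal sketch follows; the matrix `M` is a placeholder, since the summary does not say how the paper builds the interaction matrix, and the routing/filtering labels in the comments simply follow the summary's wording.

```python
def split_sym_skew(M):
    """Return (sym, skew) with M = sym + skew, sym = sym^T, skew = -skew^T."""
    n = len(M)
    sym = [[(M[i][j] + M[j][i]) / 2 for j in range(n)] for i in range(n)]
    skew = [[(M[i][j] - M[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, skew

# Placeholder 2x2 "interaction matrix" (an assumption for illustration).
M = [[1.0, 2.0],
     [4.0, 3.0]]

sym, skew = split_sym_skew(M)   # symmetric ("filtering") and skew ("routing") parts
# sym  is [[1.0, 3.0], [3.0, 3.0]]
# skew is [[0.0, -1.0], [1.0, 0.0]]
```

The split is unique, so the two parts can be analyzed or parameterized separately without losing information about `M`.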

Federated Nested Learning: Collaborative Training of Self-Referential Memories for Test-Time Adaptation

arXiv cs.LG · 2026-05-19

Proposes Federated Nested Learning (FedNL), a framework that reformulates federated learning as a three-level nested optimization system, enabling collaborative training of self-referential memories for test-time adaptation to handle Non-IID data and long-tail distributions.

@berryxia: Moonshot AI founder Yang Zhilin recently released a 40-minute video. Born in 1992, valedictorian of Tsinghua CS undergrad, PhD from CMU, co-author of Transformer-XL and XLNet, former researcher at Google Brain and Meta, he calmly deconstructs Kimi K2 in front of the camera...

X AI KOLs Timeline · 2026-05-14

Moonshot AI founder Yang Zhilin released a 40-minute video detailing the training process of the Kimi K2 model, which cost only $4.6 million. In an 8-model real-time programming competition, Kimi K2 took first place, defeating GPT-5.5 and others, demonstrating how a small team can overturn the traditional compute-stacking paradigm through architecture optimization.

Variational Linear Attention: Stable Associative Memory for Long-Context Transformers

arXiv cs.LG · 2026-05-13

This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
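To make the "online regularized least-squares" framing concrete, the sketch below uses textbook recursive least squares (RLS) with a ridge penalty `lam`. It keeps a memory `S` that minimizes the sum of squared read errors plus `lam` times the squared norm of `S`. This is a generic stand-in chosen for illustration, not VLA's actual update; the function names and the choice of RLS are assumptions.

```python
# Illustrative sketch only: textbook recursive least squares,
# not the update rule from the VLA paper.

def init_memory(d_v, d_k, lam):
    """Zero memory S (d_v x d_k) and inverse-covariance P = I / lam (d_k x d_k)."""
    S = [[0.0] * d_k for _ in range(d_v)]
    P = [[(1.0 / lam if i == j else 0.0) for j in range(d_k)] for i in range(d_k)]
    return S, P

def rls_step(S, P, k, v):
    """One online update for the pair (k, v) via the Sherman-Morrison identity."""
    Pk = [sum(p * kj for p, kj in zip(row, k)) for row in P]
    denom = 1.0 + sum(kj * pkj for kj, pkj in zip(k, Pk))
    gain = [x / denom for x in Pk]
    # Prediction error of the current memory on this key.
    err = [vi - sum(s * kj for s, kj in zip(row, k)) for row, vi in zip(S, v)]
    S = [[s + e * g for s, g in zip(row, gain)] for row, e in zip(S, err)]
    # P <- P - gain (P k)^T, which is valid because P stays symmetric.
    P = [[p - gi * pkj for p, pkj in zip(row, Pk)] for row, gi in zip(P, gain)]
    return S, P

S, P = init_memory(d_v=1, d_k=2, lam=1.0)
for _ in range(2):
    # Write the same association twice.
    S, P = rls_step(S, P, k=[1.0, 0.0], v=[2.0])
# S[0][0] is now 4/3, moving toward the target value 2.0.
```

In this sketch, repeatedly writing the same pair moves the stored value toward `v` instead of growing with every write, as a purely additive linear-attention update would. That is one intuition for why a regularized least-squares view can keep state norms bounded.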
