Tag
Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
Introduces the Hankel Reduced order Model (HRM) adapter, an SSM-based residual module initialized via Balanced Truncation for parameter-efficient fine-tuning, which outperforms LoRA on long-context tasks.
HydraHead is a novel attention hybridization architecture that combines Full and Linear Attention at the head level, achieving superior long-context performance with reduced training overhead via interpretability-driven selection and scale-normalized fusion.
SinkRec introduces a hybrid memory-transition architecture to mitigate semantic state sink in long-sequence recommendation. It uses memory-conditioned gated delta networks to decouple pattern storage from dynamic modeling while retaining linear-time efficiency.
Introduces Blurry Window Attention (BLA), a novel attention method with bounded-memory control that reconstructs a blurry KV history via Dirichlet kernel interpolation, achieving 8x state efficiency over Sliding Window Attention on the Multi-Query Associative Recall task.
This paper proposes DLA, a dynamic memory modeling framework for multi-state linear attention that adaptively merges states based on token information variation and maintains a fixed-size state cache, enabling better long-context representation without the quadratic complexity of standard attention.
DLA introduces adaptive state merging and capacity-bounded memory modeling for multi-state linear attention, improving long-context LLM performance.
This paper derives scaling rules for Gated Delta Networks using Maximal Update Parametrization (μP), enabling zero-shot hyperparameter transfer across model widths for efficient sub-quadratic LLM architectures. Experiments confirm stable learning-rate transfer under both AdamW and SGD, whereas standard parametrization fails.
Discusses the finding that all softmax and linear attention variants can be interpolated, and that the Muon optimizer is crucial for Parallax to move beyond Softmax Attention. Includes a link to the paper and code.
This paper proposes a method to convert pretrained Softmax attention models into linear-complexity Test-Time Training (TTT) architectures, achieving comparable text-to-image quality to fine-tuned Softmax models while significantly accelerating inference. The approach is validated by linearizing Stable Diffusion 3.5, resulting in SD3.5-T^5 with 1.32x speedup at 1K resolution.
Proposes Interdomain Attention, a new method that integrates state space models into attention via kernel methods, achieving efficient long-context modeling with a fixed-size state and outperforming SSMs and softmax attention in language modeling experiments up to 1.3B parameters.
Tensor Cache introduces a two-level caching mechanism that compresses evicted key-value pairs from sliding-window attention into a fixed-size associative memory, improving long-context language modeling without unbounded memory growth.
NVIDIA introduces Gated DeltaNet-2, a method for editing compressed model memory without catastrophic forgetting, using independent gates for erase and write operations. It outperforms existing models such as Mamba-2 and Mamba-3 on language modeling and long-context tasks.
Ali Hatamizadeh announces Gated DeltaNet-2, a new linear attention model that outperforms KDA and Mamba-3 at 1.3B scale; @BlinkDL_AI notes its recurrence is nearly identical to RWKV-7's DPLR.
Gated DeltaNet-2 introduces separate erase and write gates for linear attention, achieving superior performance in long-context language modeling and retrieval tasks.
This paper introduces Exact Linear Attention (ELA), a mechanism that achieves linear computational complexity for Transformer attention without approximation error by leveraging kernel decomposition, and addresses gradient explosion and token dilution through constrained kernel functions. It also presents engineering innovations including Hyper Link, Memory Lobe, and a routing bias for Mixture of Experts.
The paper decomposes the attention interaction matrix into routing (skew-symmetric) and filtering (symmetric) components, introducing S-D attention to disentangle them. It reveals a spectral cascade in routing that predicts where attention can be simplified, achieving significant parameter reduction with minimal perplexity loss.
Proposes Federated Nested Learning (FedNL), a framework that reformulates federated learning as a three-level nested optimization system, enabling collaborative training of self-referential memories for test-time adaptation to handle Non-IID data and long-tail distributions.
Moonshot AI founder Yang Zhilin released a 40-minute video detailing the training process of the Kimi K2 model, which cost only $4.6 million. In an 8-model real-time programming competition, Kimi K2 took first place, defeating GPT-5.5 and others, demonstrating how a small team can overturn the traditional compute-stacking paradigm through architecture optimization.
This paper introduces Variational Linear Attention (VLA), a method that stabilizes memory states in linear attention mechanisms for long-context transformers. VLA reframes memory updates as an online regularized least-squares problem, proving bounded state norms and demonstrating significant speedups and improved retrieval accuracy over standard linear attention and DeltaNet.
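As a small illustration of the symmetric/skew-symmetric split mentioned in the S-D attention entry above, here is a minimal sketch. It relies only on the standard identity that any square matrix is the sum of a symmetric and a skew-symmetric part. Treating the interaction matrix as `w_q @ w_k.T`, and the names `split_interaction`, `w_q`, and `w_k`, are assumptions made for illustration, not details taken from the paper.

```python
import numpy as np


def split_interaction(m):
    """Split a square matrix into symmetric and skew-symmetric parts.

    Hypothetical helper: the "filtering" and "routing" labels follow the
    summary's wording, not a verified implementation of S-D attention.
    """
    sym = 0.5 * (m + m.T)   # symmetric part ("filtering" in the summary)
    skew = 0.5 * (m - m.T)  # skew-symmetric part ("routing" in the summary)
    return sym, skew


rng = np.random.default_rng(0)
d = 4
w_q = rng.standard_normal((d, d))  # assumed query projection
w_k = rng.standard_normal((d, d))  # assumed key projection
m = w_q @ w_k.T                    # assumed form of the interaction matrix

sym, skew = split_interaction(m)

# The skew-symmetric part contributes nothing to a token's score with itself:
# x^T A x = 0 whenever A^T = -A.
x = rng.standard_normal(d)
self_score_from_skew = x @ skew @ x
```

Under these assumptions the two parts sum back to `m` exactly, and only the symmetric part affects the score `x @ m @ x` of a vector with itself.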