@BlinkDL_AI: Gated DeltaNet-2 is almost exactly RWKV-7's DPLR recurrence, not acknowledging the elephant in the room
Summary
Ali Hatamizadeh announces Gated DeltaNet-2, a new linear attention model that outperforms KDA and Mamba-3 at 1.3B scale; @BlinkDL_AI notes its recurrence is nearly identical to RWKV-7's DPLR.
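For readers unfamiliar with the term, "DPLR" means a diagonal-plus-low-rank state transition. The sketch below is a generic NumPy illustration, assuming a state matrix `S` of shape `(d_v, d_k)` updated as `S @ (diag(w) + a b^T) + v k^T`. It is not the exact parameterization of either RWKV-7 or Gated DeltaNet-2, and the function names are hypothetical. It only shows that the well-known gated delta rule, `alpha * S (I - beta k k^T) + beta v k^T`, is one special case of that generic form.

```python
import numpy as np


def dplr_step(S, w, a, b, v, k):
    """One step of a generic diagonal-plus-low-rank (DPLR) recurrence.

    Computes S @ (diag(w) + outer(a, b)) + outer(v, k) without
    materializing the (d_k, d_k) transition matrix.
    S: (d_v, d_k) state; w, a, b, k: (d_k,) vectors; v: (d_v,) vector.
    """
    return S * w + np.outer(S @ a, b) + np.outer(v, k)


def gated_delta_step(S, alpha, beta, v, k):
    """Gated delta rule with scalar decay alpha and scalar write strength beta."""
    d_k = k.shape[0]
    transition = alpha * (np.eye(d_k) - beta * np.outer(k, k))
    return S @ transition + beta * np.outer(v, k)


def gated_delta_as_dplr(S, alpha, beta, v, k):
    """The same update, expressed through the generic DPLR step.

    diag part:     w = alpha * ones
    low-rank part: a = -alpha * beta * k,  b = k
    write part:    beta * v (value), k (key)
    """
    w = np.full(k.shape[0], alpha)
    return dplr_step(S, w, -alpha * beta * k, k, beta * v, k)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    S = rng.standard_normal((3, 4))
    k = rng.standard_normal(4)
    k /= np.linalg.norm(k)  # unit-norm key, a common convention
    v = rng.standard_normal(3)
    same = np.allclose(
        gated_delta_step(S, 0.9, 0.5, v, k),
        gated_delta_as_dplr(S, 0.9, 0.5, v, k),
    )
    print("gated delta rule matches DPLR form:", same)
```

In the generic form, the vectors `a` and `b` and the write term can be chosen independently of each other. That is the degree of freedom in which a tied erase/write update and a decoupled one would differ, under the assumptions of this sketch.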
Cached at: 05/23/26, 12:05 PM
Gated DeltaNet-2 is almost exactly RWKV-7's DPLR recurrence, not acknowledging the elephant in the room
Ali Hatamizadeh (@ahatamiz1): Gated DeltaNet-2 is here.
New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention
Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B.
Here's the idea behind it:
Linear attention
Similar Articles
@BlinkDL_AI: RWKV-7 G1g is here: the world's best pure RNN LLM, and a competitive LLM in general. Try https://huggingface.co/spaces/…
BlinkDL announces RWKV-7 G1g, a pure RNN LLM that claims to be the best in its class and competitive with general LLMs, with high-speed inference on a single RTX 5090.
MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
The paper introduces Momentum DeltaNet (MDN), a linear attention model that uses stepwise momentum and parallel algorithms to improve training efficiency and performance over models like Mamba2.
@jiqizhixin: New from NVIDIA! You can edit a model's compressed memory without scrambling what it already knows! Enter Gated DeltaNe…
NVIDIA introduces Gated DeltaNet-2, a method for editing compressed model memory without catastrophic forgetting, using independent gates for erase and write operations. It outperforms existing models like Mamba-2 and Mamba-3 on language modeling and long-context tasks.
DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts
Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.
Delta Attention Residuals [R]
Delta Attention Residuals is a drop-in upgrade to residual connections that routes over deltas instead of cumulative hidden states, achieving sharper cross-layer routing and 1.7-8.2% lower perplexity at scales up to 7.6B parameters, and enabling fine-tuning of pretrained models like Qwen3-0.6B with negligible overhead.