@BlinkDL_AI: Gated DeltaNet-2 is almost exactly RWKV-7's DPLR recurrence, not acknowledging the elephant in the room

X AI KOLs Following | 05/22/26, 05:20 AM | Papers

Summary

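Ali Hatamizadeh announces Gated DeltaNet-2, a new linear attention model that he reports outperforms KDA and Mamba-3 head to head at the 1.3B scale. Quoting the announcement, @BlinkDL_AI argues that its recurrence is almost exactly RWKV-7's DPLR (diagonal-plus-low-rank) recurrence and that this similarity goes unacknowledged.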



Cached at: 05/23/26, 12:05 PM

Gated DeltaNet-2 is almost exactly RWKV-7’s DPLR recurrence, not acknowledging the elephant in the room 🙂
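
For readers unfamiliar with the term: a DPLR recurrence updates a matrix-valued state by multiplying it with a diagonal matrix plus a low-rank correction, then adding a new key/value write. The sketch below is a generic rank-1 version in NumPy, offered only to illustrate the family of updates being compared. The function name `dplr_step` and the argument names `w`, `a`, `b` are assumptions made for this example; it is not the exact parameterization of RWKV-7 or of Gated DeltaNet-2, neither of which is spelled out in the posts quoted here.

```python
import numpy as np

def dplr_step(S, w, a, b, v, k):
    """One step of a generic diagonal-plus-low-rank (DPLR) state update.

    Illustrative only; names and shapes are assumptions, not any paper's API.
    S: (d_v, d_k) state matrix
    w: (d_k,) per-channel decay (the diagonal part)
    a, b: (d_k,) vectors forming the rank-1 correction a b^T
    v: (d_v,) value and k: (d_k,) key written into the state
    """
    # S @ (diag(w) + a b^T), computed without building the d_k x d_k matrix
    S = S * w + np.outer(S @ a, b)
    # Rank-1 write of the new key/value association
    return S + np.outer(v, k)

# Special case: w = 1, a = -k, b = k with a unit-norm key gives the classic
# delta rule, which erases the old content stored along k and writes v there.
rng = np.random.default_rng(0)
d_v, d_k = 4, 8
S = rng.standard_normal((d_v, d_k))
k = rng.standard_normal(d_k)
k /= np.linalg.norm(k)
v = rng.standard_normal(d_v)

S_new = dplr_step(S, w=np.ones(d_k), a=-k, b=k, v=v, k=k)
print(np.allclose(S_new @ k, v))  # True: reading with k now returns v
```

In this generic form, the diagonal `w` controls how much of the state decays, while the low-rank term decides what is removed along a chosen direction. Models in this family differ mainly in how `w`, `a`, and `b` are produced from the input and in how they are gated.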

Ali Hatamizadeh (@ahatamiz1): Gated DeltaNet-2 is here. 🚀

🔥 New paper: Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Gated DeltaNet-2 outperforms KDA and Mamba-3, the latest and best recurrent architectures, head to head at 1.3B. 🏆

💡 Here’s the idea behind it:

Linear attention

Similar Articles

@BlinkDL_AI: RWKV-7 G1g is here: the world's best pure RNN LLM, and a competitive LLM in general. Try https://huggingface.co/spaces/…

X AI KOLs Following

BlinkDL announces RWKV-7 G1g, a pure RNN LLM that he describes as the world's best pure RNN LLM and a competitive LLM in general, with high-speed inference on a single RTX 5090.

MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

Hugging Face Daily Papers

The paper introduces Momentum DeltaNet (MDN), a linear attention model that uses stepwise momentum and parallel algorithms to improve training efficiency and performance over models like Mamba2.

@jiqizhixin: New from NVIDIA! You can edit a model’s compressed memory without scrambling what it already knows! Enter Gated DeltaNe…

X AI KOLs Timeline

NVIDIA introduces Gated DeltaNet-2, a method for editing compressed model memory without catastrophic forgetting, using independent gates for erase and write operations. It outperforms existing models like Mamba-2 and Mamba-3 on language modeling and long-context tasks.

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

arXiv cs.LG

Introduces DualKV, a FlashAttention kernel variant that eliminates redundant prompt token computation in RL post-training (GRPO/DAPO), achieving up to 3.82x speedup on 30B MoE models.

𝐃𝐞𝐥𝐭𝐚 𝐀𝐭𝐭𝐞𝐧𝐭𝐢𝐨𝐧 𝐑𝐞𝐬𝐢𝐝𝐮𝐚𝐥𝐬 [R]

Reddit r/MachineLearning

Delta Attention Residuals is a drop-in upgrade to residual connections that routes over deltas instead of cumulative hidden states, achieving sharper cross-layer routing and 1.7-8.2% lower perplexity at scales up to 7.6B parameters, and enabling fine-tuning of pretrained models like Qwen3-0.6B with negligible overhead.