Tag
Proposes Erase-then-Delta Attention (EDA), a memory update rule for linear attention that decouples erase and write addresses to selectively suppress stale information before writing new content. Experiments on 2.5B dense and 25B MoE models demonstrate consistent gains in standard and long-context evaluations.
Gated DeltaNet-2 introduces separate erase and write gates for linear attention, achieving superior performance in long-context language modeling and retrieval tasks.
Proposes delta-Mem, a lightweight online memory mechanism that uses a compact state matrix updated by delta-rule learning to improve long-context performance of frozen LLMs without full fine-tuning or context extension.
The paper introduces δ-mem, a lightweight online memory mechanism that augments frozen LLMs with a compact associative memory state updated by delta-rule learning, achieving significant improvements on memory-heavy benchmarks without fine-tuning or context extension.