Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

May 21, 2026
Authors: Ali Hatamizadeh, Yejin Choi, Jan Kautz
cs.AI

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t; it reduces to KDA when both gates collapse to the same scalar, and to Gated DeltaNet when the decay also collapses to a scalar. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves results in the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
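The abstract does not spell out the update equation, so the following is only a minimal sketch of one plausible reading of the decoupled erase/write idea, not the authors' implementation. It assumes a state S of shape (d_k, d_v), a channel-wise decay alpha and erase gate b over the key channels, and a write gate w over the value channels. The function names `decoupled_delta_step` and `tied_delta_step`, the tensor shapes, and the exact placement of the gates are all hypothetical; the linked repository is the authority on the real formulation. The sketch is chosen only so that tying both gates to one scalar recovers a delta-rule step with channel-wise decay, matching the reduction the abstract describes.

```python
import numpy as np


def decoupled_delta_step(S, k, v, alpha, b, w):
    """One recurrent step with separate erase and write gates (illustrative).

    Assumed shapes (not taken from the paper):
      S (d_k, d_v) state, k (d_k,) key, v (d_v,) value,
      alpha (d_k,) channel-wise decay, b (d_k,) erase gate, w (d_v,) write gate.
    Computes (I - (b*k) k^T) Diag(alpha) S + k (w*v)^T.
    """
    S = alpha[:, None] * S           # channel-wise decay of the old state
    read = k @ S                     # what the memory currently returns for k
    S = S - np.outer(b * k, read)    # erase: gated per key channel
    S = S + np.outer(k, w * v)       # write: gated per value channel
    return S


def tied_delta_step(S, k, v, alpha, beta):
    """Delta-rule step with channel-wise decay and one scalar gate beta."""
    S = alpha[:, None] * S
    return S + beta * np.outer(k, v - k @ S)


# Small check of the reduction: identical scalar gates give the tied update.
rng = np.random.default_rng(0)
d_k, d_v = 4, 3
S = rng.normal(size=(d_k, d_v))
k = rng.normal(size=d_k)
k /= np.linalg.norm(k)               # unit-norm key, an assumption of this sketch
v = rng.normal(size=d_v)
alpha = rng.uniform(0.8, 1.0, size=d_k)
beta = 0.7

tied = decoupled_delta_step(S, k, v, alpha, np.full(d_k, beta), np.full(d_v, beta))
assert np.allclose(tied, tied_delta_step(S, k, v, alpha, beta))
```

Under these assumptions, setting b to all ones and w to all zeros clears whatever the memory stored for a unit-norm key without committing anything new, which is the kind of independent control that a single shared scalar gate cannot express.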