Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
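To make the erase/write decoupling concrete, here is a minimal single-step sketch in plain Python. The abstract does not state the recurrence, so the form used here is an assumption chosen only to match the description above: S_t = (I - (b_t ⊙ k_t) k_tᵀ) Diag(a_t) S_{t-1} + k_t (w_t ⊙ v_t)ᵀ, with the state S of shape d_k × d_v, a channel-wise decay a_t and erase gate b_t on the key side, and a write gate w_t on the value side. The function names `gdn2_step` and `kda_step` and the symbol `a` for the decay are hypothetical and not taken from the released code; the paper's actual parameterization and its chunkwise WY kernel may differ.

```python
from typing import List

Vector = List[float]
Matrix = List[List[float]]  # state S: d_k rows (key channels) x d_v columns (value channels)


def gdn2_step(S: Matrix, k: Vector, v: Vector,
              a: Vector, b: Vector, w: Vector) -> Matrix:
    """One recurrent step under the ASSUMED form
    S_t = (I - (b * k) k^T) Diag(a) S_{t-1} + k (w * v)^T.

    a: channel-wise decay, length d_k
    b: channel-wise erase gate, length d_k (key side)
    w: channel-wise write gate, length d_v (value side)
    """
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise forgetting: scale each key channel (row) of the state.
    decayed = [[a[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed memory currently returns for key k: k^T S.
    read = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    # 3) Erase a b-gated share of that read, then commit a w-gated new value.
    return [[decayed[i][j] - b[i] * k[i] * read[j] + k[i] * w[j] * v[j]
             for j in range(d_v)]
            for i in range(d_k)]


def kda_step(S: Matrix, k: Vector, v: Vector,
             a: Vector, beta: float) -> Matrix:
    """Reference step with a single scalar gate beta tying erase and write:
    S_t = (I - beta k k^T) Diag(a) S_{t-1} + beta k v^T (assumed KDA-style form).
    """
    d_k, d_v = len(k), len(v)
    decayed = [[a[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    read = [sum(k[i] * decayed[i][j] for i in range(d_k)) for j in range(d_v)]
    return [[decayed[i][j] + beta * k[i] * (v[j] - read[j])
             for j in range(d_v)]
            for i in range(d_k)]
```

Under this assumed form, passing `b = [beta] * d_k` and `w = [beta] * d_v` to `gdn2_step` reproduces `kda_step`, mirroring the reduction to KDA described in the abstract; additionally setting every entry of `a` to one shared scalar gives a Gated DeltaNet-style scalar-decay update. Setting `w` to zeros while keeping `b` nonzero erases without writing, which a single tied scalar gate cannot express.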