Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
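
The decoupling described above can be illustrated with a single recurrent step. The following is a minimal NumPy sketch, not the released implementation: the abstract does not give the update equation, so the form used here, S_t = (I - (b_t ⊙ k_t) k_t^T) Diag(a_t) S_{t-1} + k_t (w_t ⊙ v_t)^T, is an assumption chosen only because it satisfies the two reductions the abstract states. The state layout (d_k × d_v), the function names, the symbol a_t for the decay, and the exact placement of b_t are all hypothetical. The KDA and Gated DeltaNet steps are written in their commonly cited forms for comparison.

```python
import numpy as np

def gated_delta_rule2_step(S, k, v, a, b, w):
    """One assumed Gated Delta Rule-2 step (hypothetical form, see lead-in).

    S: (d_k, d_v) state; k: (d_k,) key; v: (d_v,) value;
    a: (d_k,) channel-wise decay; b: (d_k,) erase gate; w: (d_v,) write gate.
    """
    S_dec = a[:, None] * S                   # channel-wise decay over key channels
    read = S_dec.T @ k                       # value currently stored under k
    S_new = S_dec - np.outer(b * k, read)    # erase old content, gated per key channel
    return S_new + np.outer(k, w * v)        # write new content, gated per value channel

def kda_step(S, k, v, a, beta):
    """KDA-style step: channel-wise decay, one scalar beta for erase and write."""
    S_dec = a[:, None] * S
    return S_dec - beta * np.outer(k, S_dec.T @ k) + beta * np.outer(k, v)

def gated_deltanet_step(S, k, v, alpha, beta):
    """Gated DeltaNet-style step: scalar decay alpha and scalar beta."""
    S_dec = alpha * S
    return S_dec - beta * np.outer(k, S_dec.T @ k) + beta * np.outer(k, v)

def read_out(S, q):
    """Query the memory: o = S^T q."""
    return S.T @ q
```

Under this assumed form, setting b_t and w_t to the same scalar times a vector of ones reproduces the KDA step, and additionally setting the decay to a scalar times a vector of ones reproduces the Gated DeltaNet step. With a unit-norm key, no decay, and both gates equal to one, writing a second value under the same key fully replaces the first, which is the delta-rule behavior of subtracting the current read before writing.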