Gated DeltaNet-2: Decoupling Erase and Write in Linear Attention

Abstract

Linear attention replaces the unbounded cache of softmax attention with a fixed-size recurrent state, reducing sequence mixing to linear time and decoding to constant memory. The hard part is not just what to forget, but how to edit this compressed memory without scrambling existing associations. Delta-rule models subtract the current read before writing a new value, and Kimi Delta Attention (KDA) sharpens forgetting with channel-wise decay. But the active edit still uses a single scalar gate to control two different things: how much old content to erase on the key side and how much new content to commit on the value side. We introduce Gated DeltaNet-2, which generalizes both Gated DeltaNet and KDA by inheriting adaptive forgetting and channel-wise decay while addressing their shared limitation, the scalar tie between erasing and writing. Gated Delta Rule-2 separates these roles with a channel-wise erase gate b_t and a channel-wise write gate w_t, reducing to KDA when both gates collapse to the same scalar and to Gated DeltaNet when the decay also collapses. We derive a fast-weight update view, a chunkwise WY algorithm with channel-wise decay absorbed into asymmetric erase factors, and a gate-aware backward pass that preserves efficient parallel training. At 1.3B parameters trained on 100B FineWeb-Edu tokens, Gated DeltaNet-2 achieves the strongest overall results among Mamba-2, Gated DeltaNet, KDA, and Mamba-3 variants across language modeling, commonsense reasoning, and retrieval. Its advantage is most pronounced on long-context RULER needle-in-a-haystack benchmarks, where it improves performance in the evaluated multi-key retrieval setting and remains strong in both recurrent and hybrid settings. Code is available at https://github.com/NVlabs/GatedDeltaNet-2.
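
To make the erase/write split concrete, the sketch below shows one plausible reading of a single recurrent step. It is not taken from the paper or its repository: the update form, the state layout (a d_k x d_v matrix stored as nested lists), and the names gated_delta2_step, kda_style_step, and read_state are all assumptions made for illustration, and the paper's exact parameterization may differ. The assumed step applies channel-wise decay, reads the decayed memory with the current key, erases that read on the key side through b_t, and writes the new value on the value side through w_t.

```python
# Illustrative sketch only: the update form below is an assumption, not the
# paper's definition. State S has shape d_k x d_v (list of rows).

def read_state(S, q):
    """Return q^T S, the value the memory associates with query q."""
    d_k, d_v = len(S), len(S[0])
    return [sum(q[i] * S[i][j] for i in range(d_k)) for j in range(d_v)]


def gated_delta2_step(S, k, v, alpha, b, w):
    """Assumed decoupled step: S <- (I - (b*k) k^T) diag(alpha) S + k (w*v)^T."""
    d_k, d_v = len(k), len(v)
    # 1) Channel-wise decay along the key axis.
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    # 2) Read what the decayed memory currently returns for key k.
    read = read_state(S, k)
    # 3) Erase the old read (gate b, key side), write the new value (gate w, value side).
    return [[S[i][j] - b[i] * k[i] * read[j] + k[i] * w[j] * v[j]
             for j in range(d_v)] for i in range(d_k)]


def kda_style_step(S, k, v, alpha, beta):
    """Reference step with channel-wise decay and one scalar beta for both roles."""
    d_k, d_v = len(k), len(v)
    S = [[alpha[i] * S[i][j] for j in range(d_v)] for i in range(d_k)]
    read = read_state(S, k)
    return [[S[i][j] + beta * k[i] * (v[j] - read[j]) for j in range(d_v)]
            for i in range(d_k)]


# Hypothetical toy inputs (d_k = d_v = 2); k has unit norm.
S0 = [[0.5, -1.0], [2.0, 0.25]]
k, v = [0.6, 0.8], [1.0, -2.0]
alpha, beta = [0.9, 0.5], 0.7

# Tying both gates to the same scalar reproduces the single-gate update.
tied = gated_delta2_step(S0, k, v, alpha, [beta, beta], [beta, beta])
single = kda_style_step(S0, k, v, alpha, beta)

# Full erase with a partial write: the old read is removed in every value
# channel, but only the first value channel receives new content.
partial = gated_delta2_step(S0, k, v, alpha, [1.0, 1.0], [1.0, 0.0])
print(read_state(partial, k))
```

Under this assumed form, setting every entry of b_t and w_t to one scalar gives the single-gate update with channel-wise decay, and additionally making alpha constant across channels gives a scalar-decay update, which mirrors the reductions to KDA and Gated DeltaNet described in the abstract. The last call shows what a single scalar gate cannot express: the memory is fully erased along k while the write is suppressed in one value channel.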