MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
May 7, 2026
Authors: Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng
cs.AI
Abstract
Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence. While momentum-based optimizers provide a natural remedy, they make it challenging to achieve training efficiency and effectiveness simultaneously. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical-systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex-conjugate eigenvalues; this analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve training throughput comparable to competitive linear models such as Mamba2 and KDA. Extensive experiments on 400M- and 1.3B-parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2, and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet
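To make the "linear recurrence as online SGD" view concrete, the sketch below shows a sequential (non-parallel) reference recurrence: the plain delta rule updates a state matrix by one SGD step on a per-token regression loss, and a heavy-ball momentum buffer is layered on top. This is only an illustration of the stepwise-momentum idea under assumed notation (the scalar gate `mu`, the step sizes `beta`, and the function name are hypothetical); the paper's actual contribution is the chunkwise-parallel Triton formulation of such a recurrence, which this loop does not implement.

```python
import numpy as np

def delta_rule_momentum(q, k, v, beta, mu):
    """Sequential sketch of a momentum-augmented delta-rule recurrence.

    Plain DeltaNet performs one online-SGD step per token on the loss
    0.5 * ||S k_t - v_t||^2, i.e. S_t = S_{t-1} - beta_t * grad_t.
    Here a heavy-ball buffer M (hypothetical gate mu) accumulates past
    gradients, turning the first-order update into a second-order one.
    With mu = 0 this reduces exactly to the plain delta rule.
    """
    T, d = q.shape
    S = np.zeros((d, d))  # associative-memory state
    M = np.zeros((d, d))  # momentum buffer
    outs = []
    for t in range(T):
        kt, vt = k[t], v[t]
        # gradient of 0.5 * ||S k_t - v_t||^2 with respect to S
        grad = np.outer(S @ kt - vt, kt)
        M = mu * M - beta[t] * grad  # heavy-ball accumulation
        S = S + M                    # state update
        outs.append(S @ q[t])        # read-out with the query
    return np.stack(outs), S
```

The nonzero-`mu` case gives the recurrence complex-conjugate eigenvalues (oscillatory modes), which is why the abstract emphasizes stability constraints on the gates.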