MDN: Parallelizing Stepwise Momentum for Delta Linear Attention
May 7, 2026
Authors: Yulong Huang, Xiang Liu, Hongxiang Huang, Xiaopeng Lin, Zunchang Liu, Xiaowen Chu, Zeke Xie, Bojun Cheng
cs.AI
Abstract
Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence. While momentum-based optimizers provide a natural remedy, they make it challenging to achieve training efficiency and effectiveness simultaneously. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve training throughput comparable to competitive linear models such as Mamba2 and KDA. Extensive experiments on 400M and 1.3B parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2, and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet.
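To make the "delta rule as online SGD plus momentum" idea concrete, below is a minimal sequential sketch (not the paper's chunkwise parallel Triton kernel). It treats the fast-weight state S as parameters trained on the pair (k_t, v_t) and adds a momentum buffer M decayed by a gate eta. The function name momentum_delta_recurrence, the tensor shapes, and the exact form of the gated update are illustrative assumptions based on the abstract, not the released implementation.

```python
import torch

def momentum_delta_recurrence(q, k, v, beta, eta):
    """Sequential reference for a momentum-augmented delta rule (single head).

    Assumed shapes: q, k, v are (T, d); beta, eta are (T,) with values in [0, 1].
    S is the fast-weight state matrix; M is its stepwise momentum buffer.
    """
    T, d = k.shape
    S = k.new_zeros(d, d)          # state mapping keys to values
    M = torch.zeros_like(S)        # momentum accumulator
    outputs = []
    for t in range(T):
        k_t, v_t = k[t], v[t]
        # Delta-rule "gradient" signal: prediction error of the current state on (k_t, v_t)
        err = v_t - S @ k_t
        # Stepwise momentum: decay the previous momentum, add the new gated correction
        M = eta[t] * M + beta[t] * torch.outer(err, k_t)
        S = S + M
        outputs.append(S @ q[t])   # read out the state with the query
    return torch.stack(outputs)

# Tiny usage example
T, d = 8, 16
q, k, v = (torch.randn(T, d) for _ in range(3))
beta = torch.full((T,), 0.5)
eta = torch.full((T,), 0.9)
y = momentum_delta_recurrence(q, k, v, beta, eta)
print(y.shape)  # torch.Size([8, 16])
```

Note that substituting M_t into S_t = S_{t-1} + M_t yields a second-order recurrence in S, which is where the complex conjugate eigenvalues discussed in the abstract arise and why the gates beta and eta must be constrained for stability.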