MDN: Parallelizing Stepwise Momentum for Delta Linear Attention

Abstract

Linear Attention (LA) offers a promising paradigm for scaling large language models (LLMs) to long sequences by avoiding the quadratic complexity of self-attention. Recent LA models such as Mamba2 and GDN interpret linear recurrences as closed-form online stochastic gradient descent (SGD), but naive SGD updates suffer from rapid information decay and suboptimal convergence. Momentum-based optimizers are a natural remedy, yet they make it difficult to achieve training efficiency and effectiveness at the same time. To address this, we develop a chunkwise parallel algorithm for LA with a stepwise momentum rule by geometrically reordering the update coefficients. Further, from a dynamical systems perspective, we analyze the momentum-based recurrence as a second-order system that introduces complex conjugate eigenvalues. This analysis guides the design of stable gating constraints. The resulting model, Momentum DeltaNet (MDN), leverages Triton kernels to achieve training throughput comparable to that of competitive linear models such as Mamba2 and KDA. Extensive experiments on 400M- and 1.3B-parameter models demonstrate consistent performance improvements over strong baselines, including Transformers, Mamba2, and GDN, across diverse downstream evaluation benchmarks. Code: https://github.com/HuuYuLong/MomentumDeltaNet
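The remark about complex conjugate eigenvalues can be illustrated with a toy example. The sketch below is not the MDN recurrence. It assumes a scalar heavy-ball momentum update on a quadratic loss, with the hypothetical parameters `beta` (momentum coefficient) and `eta_lam` (step size times curvature). Stacking the state and the momentum buffer gives a 2x2 transition matrix with trace 1 + beta - eta_lam and determinant beta, so its eigenvalues become a complex conjugate pair of modulus sqrt(beta) whenever the discriminant is negative.

```python
import cmath


def momentum_eigenvalues(beta: float, eta_lam: float) -> tuple[complex, complex]:
    """Eigenvalues of a toy second-order momentum recurrence.

    Assumed (hypothetical) scalar update, not the paper's exact rule:
        m_t = beta * m_{t-1} + lam * s_{t-1}
        s_t = s_{t-1} - eta * m_t
    On the stacked state (s, m) the transition matrix is
        [[1 - eta*lam, -eta*beta],
         [lam,          beta    ]]
    with trace 1 + beta - eta_lam and determinant beta.
    """
    trace = 1.0 + beta - eta_lam
    det = beta
    # Roots of z^2 - trace*z + det = 0; cmath.sqrt handles a negative discriminant.
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    return (trace + disc) / 2.0, (trace - disc) / 2.0


def is_stable(beta: float, eta_lam: float) -> bool:
    """The linear recurrence decays iff both eigenvalues lie strictly inside the unit circle."""
    return all(abs(z) < 1.0 for z in momentum_eigenvalues(beta, eta_lam))


# With momentum: a complex conjugate pair (oscillatory but decaying).
z1, z2 = momentum_eigenvalues(beta=0.9, eta_lam=0.5)

# Without momentum (beta = 0) the system is first order and the eigenvalues are real.
r1, r2 = momentum_eigenvalues(beta=0.0, eta_lam=0.5)
```

In this toy setting, keeping `beta` below 1 bounds the modulus of the complex pair below 1, which is one simple way to see why constraining the gates can keep a momentum recurrence stable. The constraints actually used by MDN may differ.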
