Enabling Feature Learning in Gated Delta Networks at Scale

Abstract

Training and scaling large language models demands enormous computational resources, which motivates both efficient sub-quadratic architectures and principled hyperparameter tuning methods. The Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, but its extension to linear models, particularly those with structured state transitions and complex architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, the gating mechanisms, and the recurrent state dynamics, we derive scaling rules for the Gated Delta Network. Language-model pre-training experiments show that the resulting configurations give stable learning-rate transfer across model widths under both AdamW and SGD, whereas the standard parametrization fails to transfer. These results support the correctness and practical utility of our analysis.
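
To make the notion of a "coordinate-size estimate" concrete, the following is a minimal sketch of the kind of check such an analysis rests on. It is not the derivation for the Gated Delta Network and does not reproduce any rule from this work: it uses a single dense layer, and the helper names `coord_rms` and `forward_rms`, the widths, and the seed are all illustrative assumptions. It shows only the well-known fact that a Gaussian weight matrix with standard deviation `1/sqrt(width)` keeps output coordinates at order one as width grows, while unit standard deviation makes them grow like `sqrt(width)`.

```python
import numpy as np


def coord_rms(v):
    """Root-mean-square size of the coordinates of a vector."""
    return float(np.sqrt(np.mean(v ** 2)))


def forward_rms(width, init_std, seed=0):
    """Coordinate size of W @ x for a square Gaussian W (illustrative helper).

    x has order-one coordinates; W has i.i.d. entries with std `init_std`.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    W = rng.standard_normal((width, width)) * init_std
    return coord_rms(W @ x)


widths = [256, 512, 1024]

# Width-aware init: std = 1/sqrt(width) keeps output coordinates near 1.
scaled = {n: forward_rms(n, n ** -0.5) for n in widths}

# Width-agnostic init: std = 1 makes output coordinates grow like sqrt(width).
unscaled = {n: forward_rms(n, 1.0) for n in widths}

for n in widths:
    print(f"width={n:5d}  scaled_rms={scaled[n]:.3f}  unscaled_rms={unscaled[n]:.3f}")
```

A full analysis would repeat this kind of estimate for every intermediate quantity, including gates and recurrent states, and for the updates made by the optimizer; the sketch covers only one forward matrix multiplication at initialization.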