Unlocking Feature Learning in Gated Delta Networks at Scale

Abstract

Training and scaling large language models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complex architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive scaling rules for the Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.