Unlocking Feature Learning in Gated Delta Networks at Scale
June 2, 2026
Authors: Yifeng Liu, Quanquan Gu
cs.AI
Abstract
Training and scaling large language models demands enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive scaling rules for the Gated Delta Network. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas the standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.
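For orientation, the "recurrent state dynamics" mentioned in the abstract can be illustrated with a single recurrent step. The abstract does not state the recurrence, so this is only a minimal sketch assuming the commonly cited gated delta rule S_t = α_t S_{t-1} (I − β_t k_t k_tᵀ) + β_t v_t k_tᵀ with readout o_t = S_t q_t. The function name gated_delta_step and the plain-list matrix representation are hypothetical choices made for illustration, and the sketch does not implement the scaling rules the paper derives.

```python
from typing import List, Tuple

Matrix = List[List[float]]
Vector = List[float]


def gated_delta_step(
    S: Matrix, q: Vector, k: Vector, v: Vector, alpha: float, beta: float
) -> Tuple[Matrix, Vector]:
    """One step of an assumed gated delta rule (illustrative, not the paper's code).

    S     : state matrix of shape (d_v, d_k), stored as a list of rows
    q, k  : query and key vectors of length d_k
    v     : value vector of length d_v
    alpha : scalar decay gate in [0, 1]
    beta  : scalar write strength in [0, 1]
    """
    d_v, d_k = len(S), len(k)

    # S k: the value currently stored under key k.
    Sk = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]

    # S_new = alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * Sk[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]

    # Readout: o = S_new q
    o = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, o
```

With a unit-norm key and beta = 1, writing a new value under the same key replaces the old one rather than adding to it, while beta = 0 leaves only the decay by alpha. The quantities whose coordinate sizes would have to be tracked across model width in an analysis of this kind are the entries of S, S k, and o.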