Unlocking Feature Learning in Gated Delta Networks at Scale
June 2, 2026
Authors: Yifeng Liu, Quanquan Gu
cs.AI
Abstract
Training and scaling large language models demand enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complicated architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive the scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas standard parametrization fails to transfer, validating the correctness and practical utility of our analysis.