Unlocking Feature Learning in Large-Scale Gated Delta Networks

Abstract

Training and scaling large language models demands enormous computational resources, motivating both efficient sub-quadratic architectures and principled hyperparameter tuning methods. While the Maximal Update Parametrization (μP) has enabled zero-shot hyperparameter transfer for standard Transformers, its extension to linear models, particularly those with structured state transitions and complex architectures, remains largely unexplored. By rigorously propagating coordinate-size estimates through the forward pass, gating mechanisms, and recurrent state dynamics, we derive scaling rules for Gated Delta Networks. Experiments on language-model pre-training confirm that our configurations enable stable learning-rate transfer across model widths under both AdamW and SGD, whereas the standard parametrization fails to transfer, validating both the correctness and the practical utility of our analysis.