Optimal Scaling Needs Optimal Norm
October 4, 2025
Authors: Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
cs.AI
Abstract
Despite recent progress in optimal hyperparameter transfer under model and
dataset scaling, no unifying explanatory principle has been established. Using
the Scion optimizer, we discover that joint optimal scaling across model and
dataset sizes is governed by a single invariant: the operator norm of the
output layer. Across models with up to 1.3B parameters trained on up to 138B
tokens, the optimal learning-rate/batch-size pair (η*, B*) consistently has the
same operator-norm value, a phenomenon we term norm transfer. This constant-norm
condition is necessary but not sufficient: for each dataset size, multiple (η, B)
pairs reach the optimal norm, but only one (η*, B*) achieves the best loss. As a
sufficient condition, we provide the first measurement of (η*, B*)
scaling with dataset size for Scion, and find that the scaling rules are
consistent with those of the Adam optimizer. Tuning per-layer-group learning
rates also improves model performance, with the output layer being the most
sensitive and hidden layers benefiting from lower learning rates. We provide
practical insights on norm-guided optimal scaling and release our Distributed
Scion (Disco) implementation with logs from over two thousand runs to support
research on LLM training dynamics at scale.
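
The abstract's two actionable ingredients, monitoring the output layer's operator norm and tuning per-layer-group learning rates, can be illustrated with a short PyTorch sketch. This is not the authors' Disco implementation: the module names (`lm_head`, `hidden`, `embed`), the choice of the spectral norm as the operator norm, the learning-rate values, and the use of SGD in place of Scion are all assumptions made purely for illustration.

```python
"""Minimal sketch (not the authors' Disco release) of two ideas from the abstract:
(1) tracking the operator norm of the output layer, and
(2) assigning per-layer-group learning rates."""
import torch
import torch.nn as nn


class TinyLM(nn.Module):
    """Stand-in model: embedding -> hidden block -> output (unembedding) layer."""
    def __init__(self, vocab: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.hidden = nn.Linear(dim, dim)
        self.lm_head = nn.Linear(dim, vocab, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.lm_head(torch.relu(self.hidden(self.embed(x))))


@torch.no_grad()
def output_operator_norm(model: TinyLM) -> float:
    # Largest singular value of the output weight, i.e. the operator norm
    # induced by the Euclidean norm. The paper may use a different induced norm.
    return torch.linalg.matrix_norm(model.lm_head.weight.float(), ord=2).item()


model = TinyLM()
base_lr = 3e-3  # illustrative value only, not a recommendation from the paper

# Per-layer-group learning rates: the abstract reports the output layer is the
# most sensitive and hidden layers benefit from lower learning rates, so the
# hidden/embedding group gets a reduced rate here (factor chosen arbitrarily).
param_groups = [
    {"params": model.lm_head.parameters(), "lr": base_lr},
    {"params": list(model.hidden.parameters()) + list(model.embed.parameters()),
     "lr": 0.5 * base_lr},
]
optimizer = torch.optim.SGD(param_groups, lr=base_lr)  # stand-in; the paper uses Scion

print(f"output-layer operator norm: {output_operator_norm(model):.3f}")
```

Logging a quantity like `output_operator_norm` across a (η, B) sweep is one way to observe the norm-transfer effect the abstract describes: configurations that reach the same output-layer norm can then be compared on loss to locate the unique optimum.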