Optimal Scaling Needs Optimal Norm

Abstract

Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: for each dataset size, multiple $(\eta, B)$ pairs reach the optimal norm, but only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of how $(\eta^{\ast}, B^{\ast})$ scales with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
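
The necessary-but-not-sufficient condition can be illustrated with a minimal sketch. The abstract does not state which operator norm is measured, so the code below assumes an induced RMS-to-$\ell_\infty$ norm for the output layer; that choice, the function names `rms_to_inf_norm` and `select_run`, the record fields, and the toy sweep values are all hypothetical and not taken from the Disco release. The idea is to first keep only the runs whose output-layer norm sits near a target value, and then pick the lowest loss among them.

```python
import math


def rms_to_inf_norm(weight):
    """Induced RMS -> l_inf operator norm of a matrix given as a list of rows.

    Assumed norm choice (not stated in the abstract). For W with rows w_i of
    length d_in, the maximum of ||W x||_inf over ||x||_RMS <= 1 equals
    sqrt(d_in) * max_i ||w_i||_2.
    """
    d_in = len(weight[0])
    max_row_l2 = max(math.sqrt(sum(v * v for v in row)) for row in weight)
    return math.sqrt(d_in) * max_row_l2


def select_run(runs, target_norm, rel_tol=0.05):
    """Keep runs whose output-layer norm is near the target, then take the lowest loss."""
    near = [
        r for r in runs
        if abs(r["out_norm"] - target_norm) <= rel_tol * target_norm
    ]
    return min(near, key=lambda r: r["loss"]) if near else None


# Hypothetical sweep records; every value is invented purely for illustration.
runs = [
    {"lr": 1e-3, "batch_size": 64, "out_norm": 128.0, "loss": 3.10},
    {"lr": 2e-3, "batch_size": 256, "out_norm": 127.0, "loss": 3.02},
    {"lr": 4e-3, "batch_size": 128, "out_norm": 190.0, "loss": 3.20},
]
best = select_run(runs, target_norm=128.0)
```

In this toy sweep two runs reach the target norm but only one has the lowest loss, mirroring the claim that matching the norm narrows the search without by itself identifying $(\eta^{\ast}, B^{\ast})$.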
