

Optimal Scaling Needs Optimal Norm

October 4, 2025
Authors: Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim
cs.AI

Abstract

Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair (η*, B*) consistently has the same operator norm value - a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: while for each dataset size, multiple (η, B) reach the optimal norm, only a unique (η*, B*) achieves the best loss. As a sufficient condition, we provide the first measurement of (η*, B*) scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
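
The quantity at the center of the abstract is the operator norm of the output layer, tracked as a function of the chosen (η, B). Below is a minimal sketch of how such a measurement could be logged during training, assuming a PyTorch-style model whose output layer is a plain 2-D weight matrix (the name `model.lm_head.weight` is hypothetical) and using the spectral norm as the operator norm; the paper's exact norm definition follows Scion's layer-wise norm choices and may include additional dimensional scaling, so this is illustrative rather than a reproduction of the authors' measurement.

```python
import torch


@torch.no_grad()
def output_layer_operator_norm(weight: torch.Tensor) -> float:
    """Largest singular value (spectral norm) of a 2-D weight matrix.

    Illustrative stand-in for the output-layer operator norm tracked in the
    paper; Scion's layer-wise norm choices may apply extra scaling factors.
    """
    # torch.linalg.svdvals returns singular values in descending order,
    # so the first entry is the spectral norm.
    return torch.linalg.svdvals(weight.float())[0].item()


# Hypothetical monitoring step: after each optimizer update, log the norm and
# check whether the chosen (learning rate, batch size) pair drives it to the
# same value across model and dataset scales ("norm transfer").
# norm = output_layer_operator_norm(model.lm_head.weight)
# logger.log({"output_operator_norm": norm})
```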