Optimal Scaling Needs Optimal Norm

Abstract

Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: for each dataset size, multiple pairs $(\eta, B)$ reach the optimal norm, but only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of how $(\eta^{\ast}, B^{\ast})$ scales with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.
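To make the tracked quantity concrete, the sketch below computes one induced operator norm of a weight matrix. The abstract does not say which induced norm is used for the output layer, so the RMS-to-RMS norm here is an assumption chosen for illustration, and the function name `rms_to_rms_norm` and the layer shape are hypothetical rather than taken from the Disco code.

```python
import numpy as np


def rms_to_rms_norm(weight: np.ndarray) -> float:
    """Induced RMS->RMS operator norm of a (d_out, d_in) weight matrix.

    Assumption: RMS->RMS is used purely for illustration; the norm
    actually tracked for the output layer may be a different one.
    With ||x||_rms = ||x||_2 / sqrt(d), the induced norm equals
    sqrt(d_in / d_out) times the largest singular value.
    """
    d_out, d_in = weight.shape
    spectral = np.linalg.norm(weight, ord=2)  # largest singular value
    return float(np.sqrt(d_in / d_out) * spectral)


# Hypothetical output layer mapping 64 hidden features to 512 logits.
rng = np.random.default_rng(0)
w_out = rng.standard_normal((512, 64)) / np.sqrt(64)
print(f"output-layer norm: {rms_to_rms_norm(w_out):.4f}")
```

Logging a scalar like this during training for several $(\eta, B)$ settings is one way to check whether the best-performing runs settle at a common norm value.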
