Practical Efficiency of Muon for Pretraining

Abstract

We demonstrate that Muon, the simplest instantiation of a second-order optimizer, explicitly expands the Pareto frontier over AdamW on the compute-time tradeoff. We find that Muon is more effective than AdamW in retaining data efficiency at large batch sizes, far beyond the so-called critical batch size, while remaining computationally efficient, thus enabling more economical training. We study the combination of Muon and the maximal update parameterization (muP) for efficient hyperparameter transfer and present a simple telescoping algorithm that accounts for all sources of error in muP while introducing only a modest overhead in resources. We validate our findings through extensive experiments with model sizes up to four billion parameters and ablations on the data distribution and architecture.
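For readers unfamiliar with Muon, the following is a minimal sketch of one update step for a single 2-D weight matrix, written in dependency-free Python. It assumes the publicly known formulation of Muon (heavy-ball momentum followed by a quintic Newton-Schulz iteration that approximately orthogonalizes the update). The abstract does not spell out the update rule, so the coefficients, the default `lr` and `beta`, and the helper names `matmul`, `transpose`, `combine`, `newton_schulz`, and `muon_step` are illustrative assumptions rather than the authors' implementation. Nesterov momentum and shape-dependent scaling are omitted for brevity.

```python
import math

# Quintic Newton-Schulz coefficients from the public Muon reference
# implementation; an assumption here, not taken from this paper.
NS_COEFFS = (3.4445, -4.7750, 2.0315)


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols_b = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols_b] for row in a]


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def combine(terms):
    """Return the elementwise sum of coeff * matrix over (coeff, matrix) pairs."""
    rows, cols = len(terms[0][1]), len(terms[0][1][0])
    return [[sum(c * m[i][j] for c, m in terms) for j in range(cols)]
            for i in range(rows)]


def newton_schulz(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g by pushing its singular values toward 1."""
    a, b, c = NS_COEFFS
    flipped = len(g) > len(g[0])          # iterate on the wide orientation
    x = transpose(g) if flipped else [row[:] for row in g]
    norm = math.sqrt(sum(v * v for row in x for v in row)) + eps
    x = [[v / norm for v in row] for row in x]   # Frobenius normalization
    for _ in range(steps):
        gram = matmul(x, transpose(x))           # A = X X^T
        gram2 = matmul(gram, gram)               # A^2
        poly = combine([(b, gram), (c, gram2)])  # b*A + c*A^2
        x = combine([(a, x), (1.0, matmul(poly, x))])  # a*X + (b*A + c*A^2) X
    return transpose(x) if flipped else x


def muon_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    """One hypothetical Muon update; returns (new_weight, new_momentum_buf)."""
    buf = combine([(beta, momentum_buf), (1.0, grad)])    # momentum accumulation
    update = newton_schulz(buf)                           # orthogonalized direction
    new_weight = combine([(1.0, weight), (-lr, update)])  # descent step
    return new_weight, buf
```

With these coefficients the iteration does not converge to an exactly orthogonal matrix. After five steps the singular values of the update settle in a band around 1, which is the intended trade of precision for speed in this formulation.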
