Tune As You Scale: Hyperparameter Optimization For Compute Efficient Training

Abstract

Hyperparameter tuning of deep learning models can lead to order-of-magnitude performance gains for the same amount of compute. Despite this, systematic tuning is uncommon, particularly for large models, which are expensive to evaluate and tend to have many hyperparameters, necessitating difficult judgment calls about tradeoffs, budgets, and search bounds. To address these issues and propose a practical method for robustly tuning large models, we present Cost-Aware Pareto Region Bayesian Search (CARBS), a Bayesian optimization algorithm that performs local search around the performance-cost Pareto frontier. CARBS performs well even in unbounded search spaces with many hyperparameters, learns scaling relationships so that it can tune models even as they are scaled up, and automates much of the "black magic" of tuning. Among our results, we effectively solve the entire ProcGen benchmark just by tuning a simple baseline (PPO, as provided in the original ProcGen paper). We also reproduce the model size vs. training tokens scaling result from the Chinchilla project (Hoffmann et al. 2022), while simultaneously discovering scaling laws for every other hyperparameter, via an easy automated process that uses significantly less compute and is applicable to any deep learning problem (not just language models).
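To make the phrase "performance-cost Pareto frontier" concrete, the following is a minimal sketch of how such a frontier could be extracted from a set of completed tuning runs. It is not the CARBS algorithm and does not use any CARBS API; the function name `pareto_front`, the `runs` list, and all numeric values are hypothetical and chosen only for illustration. It assumes that each run is summarized as a (cost, performance) pair and that higher performance is better.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (cost, performance); both fields are illustrative


def pareto_front(points: List[Point]) -> List[Point]:
    """Return the points that no cheaper-or-equal point outperforms.

    A point is kept only if its performance is strictly higher than that
    of every point with lower or equal cost.
    """
    # Sort by cost ascending; among equal costs, put the best performer first.
    ordered = sorted(points, key=lambda p: (p[0], -p[1]))
    front: List[Point] = []
    best_perf = float("-inf")
    for cost, perf in ordered:
        # Spending more is only justified if performance strictly improves.
        if perf > best_perf:
            front.append((cost, perf))
            best_perf = perf
    return front


# Hypothetical observations: (compute cost, score) for six tuning runs.
runs = [(1.0, 0.50), (2.0, 0.48), (2.0, 0.60), (4.0, 0.75), (8.0, 0.74), (8.0, 0.80)]
print(pareto_front(runs))
```

In this toy setting, a search that stays local to the frontier would propose new configurations near the surviving points rather than near dominated runs such as `(2.0, 0.48)`.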

