Scaling Laws for Downstream Task Performance of Large Language Models

Abstract

Scaling laws provide important insights that can guide the design of large language models (LLMs). Existing work has focused primarily on scaling laws for pretraining (upstream) loss. However, in transfer learning settings, where LLMs are pretrained on an unsupervised dataset and then finetuned on a downstream task, downstream performance often matters as well. In this work, we study scaling behavior in a transfer learning setting where LLMs are finetuned for machine translation. Specifically, we investigate how the choice of pretraining data and its size affect downstream performance (translation quality) as measured by two metrics: downstream cross-entropy and BLEU score. Our experiments indicate that the size of the finetuning dataset and the distribution alignment between the pretraining and downstream data significantly influence the scaling behavior. With sufficient alignment, both downstream cross-entropy and BLEU score improve monotonically with more pretraining data. In such cases, we show that the downstream BLEU score can be predicted with good accuracy using a log-law. However, there are also cases where moderate misalignment causes the BLEU score to fluctuate or degrade with more pretraining data, whereas downstream cross-entropy still improves monotonically. By analyzing these observations, we provide new practical insights for choosing appropriate pretraining data.
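To make the idea of predicting BLEU from pretraining data size concrete, the following is a minimal sketch of fitting a log-law and extrapolating it. The abstract does not state the functional form, so the parameterization used here, BLEU ≈ (log(A · D_p^α))^β with D_p the pretraining data size, is an assumption. The function names `log_law` and `fit_log_law`, the grid over β, and all numbers are hypothetical. The data points are synthetic, generated from the assumed law itself, and are not experimental results.

```python
import math


def log_law(d_p, log_a, alpha, beta):
    """Assumed form: BLEU ~ (log(A * d_p**alpha))**beta, with log_a = log(A)."""
    return (log_a + alpha * math.log(d_p)) ** beta


def fit_log_law(sizes, scores, betas=None):
    """Fit (log_a, alpha, beta) to positive scores.

    For a fixed beta, score**(1/beta) = log_a + alpha * log(d_p) is linear
    in log(d_p), so each candidate beta reduces to ordinary least squares.
    The beta with the smallest squared error in score space is kept.
    """
    if betas is None:
        betas = [0.5 + 0.05 * i for i in range(51)]  # hypothetical grid, 0.5 to 3.0
    xs = [math.log(d) for d in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    var_x = sum((x - mean_x) ** 2 for x in xs)
    best = None
    for beta in betas:
        ys = [s ** (1.0 / beta) for s in scores]
        mean_y = sum(ys) / n
        alpha = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / var_x
        log_a = mean_y - alpha * mean_x
        bases = [log_a + alpha * x for x in xs]
        if min(bases) <= 0:
            continue  # the law is undefined for a non-positive log term
        err = sum((b ** beta - s) ** 2 for b, s in zip(bases, scores))
        if best is None or err < best[0]:
            best = (err, log_a, alpha, beta)
    if best is None:
        raise ValueError("no valid fit found")
    return best[1], best[2], best[3]


# Synthetic illustration only: points generated from the assumed law.
TRUE_PARAMS = (-10.0, 1.0, 1.5)
sizes = [1e8, 3e8, 1e9, 3e9, 1e10]  # hypothetical pretraining sizes
scores = [log_law(d, *TRUE_PARAMS) for d in sizes]

log_a, alpha, beta = fit_log_law(sizes, scores)
predicted = log_law(3e10, log_a, alpha, beta)  # extrapolate beyond the fitted range
```

As the abstract notes, a fit of this kind is only expected to be useful when the pretraining and downstream data are sufficiently aligned. When BLEU fluctuates or degrades with more pretraining data, a monotone law like this one would not describe it.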
