Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling

Abstract

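Large language models are trained on data collected at massive scale from the web, which is often unstructured, noisy, and poorly phrased. According to current scaling laws, learning from such data requires large amounts of compute and data, both of which grow with the size of the model. This is infeasible because of the high compute cost and time associated with pre-training and the growing scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP), which uses an off-the-shelf instruction-tuned model to rephrase web documents in specific styles such as "Wikipedia style" or "question-answer format", and then jointly pre-trains on real data and synthetic rephrased data. First, we show that applying WRAP to the naturally noisy C4 dataset speeds up pre-training by about 3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile and improves zero-shot question-answering accuracy across 13 tasks by more than 2%. Second, we investigate how the rephrasing style affects model performance, offering insight into how the composition of the training data affects LLM performance in out-of-distribution (OOD) settings. These gains arise because synthetic rephrased data has higher utility than real data alone: it (i) incorporates style diversity that closely reflects downstream evaluation styles, and (ii) has higher 'quality' than web-scraped data.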

English

Large language models are trained on massive scrapes of the web, which are often unstructured, noisy, and poorly phrased. Current scaling laws show that learning from such data requires an abundance of both compute and data, which grows with the size of the model being trained. This is infeasible both because of the large compute costs and duration associated with pre-training, and the impending scarcity of high-quality data on the web. In this work, we propose Web Rephrase Augmented Pre-training (WRAP) that uses an off-the-shelf instruction-tuned model prompted to paraphrase documents on the web in specific styles such as "like Wikipedia" or in "question-answer format" to jointly pre-train LLMs on real and synthetic rephrases. First, we show that using WRAP on the C4 dataset, which is naturally noisy, speeds up pre-training by ~3x. At the same pre-training compute budget, it improves perplexity by more than 10% on average across different subsets of the Pile, and improves zero-shot question answer accuracy across 13 tasks by more than 2%. Second, we investigate the impact of the re-phrasing style on the performance of the model, offering insights into how the composition of the training data can impact the performance of LLMs in OOD settings. Our gains are attributed to the fact that re-phrased synthetic data has higher utility than just real data because it (i) incorporates style diversity that closely reflects downstream evaluation style, and (ii) has higher 'quality' than web-scraped data.
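
The data-construction step described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The prompt wording in `STYLE_PROMPTS`, the function names, the one-rephrase-per-document mixing, and the `rephrase` callable (a stand-in for whatever instruction-tuned model is used) are all assumptions made for illustration; the abstract specifies none of them.

```python
import random
from typing import Callable, Dict, List, Sequence

# Hypothetical style prompts: the abstract names the styles ("like Wikipedia",
# "question-answer format") but not the exact prompt text.
STYLE_PROMPTS: Dict[str, str] = {
    "wikipedia": "Paraphrase the following text in the style of Wikipedia:\n\n{doc}",
    "qa": "Rewrite the following text in question-answer format:\n\n{doc}",
}


def build_prompt(doc: str, style: str) -> str:
    """Fill the style template with a web document."""
    return STYLE_PROMPTS[style].format(doc=doc)


def wrap_mixture(
    real_docs: Sequence[str],
    rephrase: Callable[[str], str],
    styles: Sequence[str] = ("wikipedia", "qa"),
    seed: int = 0,
) -> List[str]:
    """Return a shuffled corpus of real documents plus synthetic rephrases.

    Assumption: one rephrase per real document, with the style chosen
    uniformly at random. `rephrase` maps a prompt to generated text.
    """
    rng = random.Random(seed)
    mixed: List[str] = []
    for doc in real_docs:
        mixed.append(doc)  # keep the original web text
        style = rng.choice(list(styles))
        mixed.append(rephrase(build_prompt(doc, style)))  # add its rephrase
    rng.shuffle(mixed)
    return mixed


if __name__ == "__main__":
    # Stub model used only to make the sketch runnable without an LLM.
    def stub_model(prompt: str) -> str:
        return "SYNTHETIC: " + prompt

    corpus = wrap_mixture(["noisy web page one", "noisy web page two"], stub_model)
    print(len(corpus))
```

In practice the stub would be replaced by a call to an actual instruction-tuned model, and the ratio of real to synthetic text would be treated as a tunable choice rather than fixed at one rephrase per document.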

Rephrasing the Web: A Recipe for Compute and Data-Efficient Language Modeling
