DataDecide: How to Predict Best Pretraining Data with Small Experiments

Abstract

Because large language models are expensive to pretrain on different datasets, using smaller-scale experiments to decide on data is crucial for reducing costs. Which benchmarks and methods of making decisions from observed performance at small scale most accurately predict the datasets that yield the best large models? To empower open exploration of this question, we release models, data, and evaluations in DataDecide, the most extensive open suite of models over differences in data and scale. We conduct controlled pretraining experiments across 25 corpora with differing sources, deduplication, and filtering up to 100B tokens, model sizes up to 1B parameters, and 3 random seeds. We find that the ranking of models at a single, small size (e.g., 150M parameters) is a strong baseline for predicting best models at our larger target scale (1B) (~80% of comparisons correct). No scaling law methods among 8 baselines exceed the compute-decision frontier of single-scale predictions, but DataDecide can measure improvement in future scaling laws. We also identify that using continuous likelihood metrics as proxies in small experiments makes benchmarks including MMLU, ARC, HellaSwag, MBPP, and HumanEval >80% predictable at the target 1B scale with just 0.01% of the compute.
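The "~80% of comparisons correct" figure can be read as a pairwise decision accuracy: for every pair of candidate corpora, check whether the small-scale experiment prefers the same corpus as the target-scale run. The sketch below illustrates that reading only; it is not the authors' code, and the function name `decision_accuracy`, the corpus names, and all scores are hypothetical values made up for illustration.

```python
from itertools import combinations


def decision_accuracy(small_scores, target_scores):
    """Fraction of corpus pairs ordered the same way at both scales.

    Both arguments map a corpus name to a benchmark score (higher is better).
    This is an assumed formulation of "comparisons correct", not a published API.
    """
    names = sorted(small_scores)
    pairs = list(combinations(names, 2))
    correct = 0
    for a, b in pairs:
        # Does the small-scale run prefer corpus a over corpus b?
        small_prefers_a = small_scores[a] > small_scores[b]
        # Does the target-scale run make the same choice?
        target_prefers_a = target_scores[a] > target_scores[b]
        if small_prefers_a == target_prefers_a:
            correct += 1
    return correct / len(pairs)


# Hypothetical scores for four corpora at a small scale and at the target scale.
small = {"corpus_a": 0.31, "corpus_b": 0.35, "corpus_c": 0.33, "corpus_d": 0.30}
target = {"corpus_a": 0.42, "corpus_b": 0.47, "corpus_c": 0.41, "corpus_d": 0.40}

print(f"decision accuracy: {decision_accuracy(small, target):.2f}")
```

In this toy data only the pair (corpus_a, corpus_c) flips between scales, so five of the six comparisons agree. With 25 corpora there would be 300 such pairs.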
