How to Train Data-Efficient LLMs

Abstract

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training, even when we reject 90% of the original dataset, and converge up to 70% faster.
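
The two families of selection routines named in the abstract, quality-based and coverage-based, can be illustrated with a small sketch. This is not the paper's implementation: `score_fn` is a hypothetical stand-in for the judgment of an instruction-tuned LLM, the Gaussian kernel density estimate and the inverse-density weighting are assumptions chosen to illustrate "modeling the data distribution to select a diverse sample", and all function names and parameters are invented for this example.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def quality_select(examples: Sequence[str],
                   score_fn: Callable[[str], float],
                   keep_fraction: float) -> List[str]:
    """Keep the top `keep_fraction` of examples ranked by a quality score.

    `score_fn` is a hypothetical placeholder for an LLM-based quality
    estimate (higher means more useful for pre-training).
    """
    k = max(1, int(len(examples) * keep_fraction))
    return sorted(examples, key=score_fn, reverse=True)[:k]


def kernel_density(features: Sequence[Tuple[float, ...]],
                   bandwidth: float = 1.0) -> List[float]:
    """Unnormalized Gaussian kernel density score for every point (O(n^2))."""
    scores = []
    for x in features:
        total = 0.0
        for y in features:
            sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
            total += math.exp(-sq_dist / (2 * bandwidth ** 2))
        scores.append(total / len(features))
    return scores


def density_select(features: Sequence[Tuple[float, ...]],
                   keep_fraction: float,
                   bandwidth: float = 1.0,
                   seed: int = 0) -> List[int]:
    """Pick indices without replacement, favoring low-density regions.

    Assumption: weight each point by 1 / density so that sparse regions
    of the feature space are over-sampled relative to dense ones.
    """
    k = max(1, int(len(features) * keep_fraction))
    weights = [1.0 / d for d in kernel_density(features, bandwidth)]
    rng = random.Random(seed)
    # Weighted sampling without replacement: draw key u**(1/w) per item
    # and keep the k largest keys (Efraimidis-Spirakis scheme).
    keys = [(rng.random() ** (1.0 / w), i) for i, w in enumerate(weights)]
    keys.sort(reverse=True)
    return sorted(i for _, i in keys[:k])
```

In this sketch, `quality_select(docs, score_fn, 0.1)` would correspond to rejecting 90% of a dataset by quality score, while `density_select(embeddings, 0.1)` would keep a 10% sample spread across the feature space. The quadratic-time density estimate is only practical for small inputs; a corpus-scale sampler would need an approximate estimator.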
