How to Train Data-Efficient LLMs
February 15, 2024
Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
cs.AI
Abstract
The training of large language models (LLMs) is expensive. In this paper, we
study data-efficient approaches for pre-training LLMs, i.e., techniques that
aim to optimize the Pareto frontier of model quality and training resource/data
consumption. We seek to understand the tradeoffs associated with data selection
routines based on (i) expensive-to-compute data-quality estimates, and (ii)
maximization of coverage and diversity-based measures in the feature space. Our
first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of
instruction-tuned LLMs to directly assess the quality of a training example. To
target coverage, we propose Density sampling, which models the data
distribution to select a diverse sample. In our comparison of 19 samplers,
involving hundreds of evaluation tasks and pre-training runs, we find that
Ask-LLM and Density are the best methods in their respective categories.
Coverage sampling can recover the performance of the full data, while models
trained on Ask-LLM data consistently outperform full-data training -- even when
we reject 90% of the original dataset, while converging up to 70% faster.
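
As a concrete illustration of the Ask-LLM idea, the sketch below scores a training example by asking an instruction-tuned model whether the example is worth pre-training on, and uses the probability assigned to "yes" as a quality score. This is a minimal sketch, not the authors' code: the prompt wording, the choice of `google/flan-t5-large`, and the 0.5 keep-threshold are assumptions for illustration.

```python
# Minimal Ask-LLM-style quality scorer (prompt, model, and threshold are
# illustrative assumptions, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"  # any instruction-tuned LLM could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative content that should be "
    "used to pre-train a language model? Answer yes or no."
)

@torch.no_grad()
def ask_llm_score(example: str) -> float:
    """Return the model's probability of answering 'yes' for this example."""
    inputs = tokenizer(PROMPT.format(example=example), return_tensors="pt", truncation=True)
    # Score only the first decoder step and read off the probability of "yes".
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[yes_id].item()

# Keep only examples the scorer rates highly; the cutoff (or top-k quantile)
# controls how much of the corpus is rejected.
corpus = ["Gradient descent minimizes a loss function by ...", "click here buy now !!!"]
kept = [doc for doc in corpus if ask_llm_score(doc) > 0.5]
```

Similarly, one way to realize a coverage-oriented Density sampler is to estimate how crowded each example's neighborhood is in an embedding space and then sample with weights that favor sparse regions, so the selected subset spreads over the feature space. The embedding model, Gaussian kernel density estimate, and inverse-propensity weighting below are illustrative assumptions rather than the paper's exact construction.

```python
# Sketch of a coverage/diversity-oriented density sampler (embedder, Gaussian
# KDE, and inverse-propensity weights are assumptions made for illustration).
import numpy as np
from sklearn.neighbors import KernelDensity
from sentence_transformers import SentenceTransformer

def density_sample(corpus: list[str], keep_fraction: float,
                   bandwidth: float = 0.5, seed: int = 0) -> list[str]:
    """Select a diverse subset by down-weighting examples in dense regions."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedder works
    X = embedder.encode(corpus, normalize_embeddings=True)

    # Estimate the local density of each example in embedding space.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    density = np.exp(kde.score_samples(X))

    # Inverse-propensity weights: low-density (rare) examples are favored,
    # which spreads the sample across the feature space for better coverage.
    weights = 1.0 / np.clip(density, 1e-12, None)
    weights /= weights.sum()

    rng = np.random.default_rng(seed)
    n_keep = max(1, int(keep_fraction * len(corpus)))
    idx = rng.choice(len(corpus), size=n_keep, replace=False, p=weights)
    return [corpus[i] for i in idx]
```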
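In both sketches the key design choice mirrors the abstract's framing: Ask-LLM spends compute on a per-example quality judgment, while Density sampling only needs embeddings and a density model, trading judgment quality for coverage of the data distribution.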