How to Train Data-Efficient LLMs
February 15, 2024
Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
cs.AI
Abstract
The training of large language models (LLMs) is expensive. In this paper, we
study data-efficient approaches for pre-training LLMs, i.e., techniques that
aim to optimize the Pareto frontier of model quality and training resource/data
consumption. We seek to understand the tradeoffs associated with data selection
routines based on (i) expensive-to-compute data-quality estimates, and (ii)
maximization of coverage and diversity-based measures in the feature space. Our
first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of
instruction-tuned LLMs to directly assess the quality of a training example. To
target coverage, we propose Density sampling, which models the data
distribution to select a diverse sample. In our comparison of 19 samplers,
involving hundreds of evaluation tasks and pre-training runs, we find that
Ask-LLM and Density are the best methods in their respective categories.
Coverage sampling can recover the performance of the full data, while models
trained on Ask-LLM data consistently outperform full-data training -- even when
we reject 90% of the original dataset, while converging up to 70% faster.
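
As a concrete illustration of the Ask-LLM idea, the sketch below scores a training example by asking an instruction-tuned model whether the example is worth pre-training on, and uses the probability assigned to "yes" as a quality score. This is a minimal sketch, not the authors' code: the prompt wording, the choice of `google/flan-t5-large`, and the 0.5 keep-threshold are assumptions for illustration.

```python
# Minimal Ask-LLM-style quality scorer (prompt, model, and threshold are
# illustrative assumptions, not the paper's exact setup).
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_name = "google/flan-t5-large"  # any instruction-tuned LLM could be used
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous paragraph contain informative content that should be "
    "used to pre-train a language model? Answer yes or no."
)

@torch.no_grad()
def ask_llm_score(example: str) -> float:
    """Return the model's probability of answering 'yes' for this example."""
    inputs = tokenizer(PROMPT.format(example=example), return_tensors="pt", truncation=True)
    # Score only the first decoder step and read off the probability of "yes".
    decoder_input_ids = torch.tensor([[model.config.decoder_start_token_id]])
    logits = model(**inputs, decoder_input_ids=decoder_input_ids).logits[0, -1]
    yes_id = tokenizer("yes", add_special_tokens=False).input_ids[0]
    return torch.softmax(logits, dim=-1)[yes_id].item()

# Keep only examples the scorer rates highly; the cutoff (or top-k quantile)
# controls how much of the corpus is rejected.
corpus = ["Gradient descent minimizes a loss function by ...", "click here buy now !!!"]
kept = [doc for doc in corpus if ask_llm_score(doc) > 0.5]
```

Similarly, one way to realize a coverage-oriented Density sampler is to estimate how crowded each example's neighborhood is in an embedding space and then sample with weights that favor sparse regions, so the selected subset spreads over the feature space. The embedding model, Gaussian kernel density estimate, and inverse-propensity weighting below are illustrative assumptions rather than the paper's exact construction.

```python
# Sketch of a coverage/diversity-oriented density sampler (embedder, Gaussian
# KDE, and inverse-propensity weights are assumptions made for illustration).
import numpy as np
from sklearn.neighbors import KernelDensity
from sentence_transformers import SentenceTransformer

def density_sample(corpus: list[str], keep_fraction: float,
                   bandwidth: float = 0.5, seed: int = 0) -> list[str]:
    """Select a diverse subset by down-weighting examples in dense regions."""
    embedder = SentenceTransformer("all-MiniLM-L6-v2")  # any text embedder works
    X = embedder.encode(corpus, normalize_embeddings=True)

    # Estimate the local density of each example in embedding space.
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X)
    density = np.exp(kde.score_samples(X))

    # Inverse-propensity weights: low-density (rare) examples are favored,
    # which spreads the sample across the feature space for better coverage.
    weights = 1.0 / np.clip(density, 1e-12, None)
    weights /= weights.sum()

    rng = np.random.default_rng(seed)
    n_keep = max(1, int(keep_fraction * len(corpus)))
    idx = rng.choice(len(corpus), size=n_keep, replace=False, p=weights)
    return [corpus[i] for i in idx]
```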
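In both sketches the key design choice mirrors the abstract's framing: Ask-LLM spends compute on a per-example quality judgment, while Density sampling only needs embeddings and a density model, trading judgment quality for coverage of the data distribution.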