
How to Train Data-Efficient LLMs

February 15, 2024
Authors: Noveen Sachdeva, Benjamin Coleman, Wang-Cheng Kang, Jianmo Ni, Lichan Hong, Ed H. Chi, James Caverlee, Julian McAuley, Derek Zhiyuan Cheng
cs.AI

Abstract

The training of large language models (LLMs) is expensive. In this paper, we study data-efficient approaches for pre-training LLMs, i.e., techniques that aim to optimize the Pareto frontier of model quality and training resource/data consumption. We seek to understand the tradeoffs associated with data selection routines based on (i) expensive-to-compute data-quality estimates, and (ii) maximization of coverage and diversity-based measures in the feature space. Our first technique, Ask-LLM, leverages the zero-shot reasoning capabilities of instruction-tuned LLMs to directly assess the quality of a training example. To target coverage, we propose Density sampling, which models the data distribution to select a diverse sample. In our comparison of 19 samplers, involving hundreds of evaluation tasks and pre-training runs, we find that Ask-LLM and Density are the best methods in their respective categories. Coverage sampling can recover the performance of the full data, while models trained on Ask-LLM data consistently outperform full-data training -- even when we reject 90% of the original dataset, while converging up to 70% faster.
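
To make the two samplers concrete, the sketch below illustrates both ideas in plain Python: an Ask-LLM-style quality score obtained by asking an instruction-tuned LLM whether an example is useful for pre-training, and a coverage-oriented Density sampler that selects examples inversely to a kernel density estimate in embedding space. The prompt wording, the `yes_probability` callable, and the brute-force Gaussian kernel density are simplifying assumptions for illustration, not the paper's exact prompt or its scalable implementation.

```python
import numpy as np

# --- Ask-LLM-style quality scoring (sketch) --------------------------------
# Idea from the abstract: prompt an instruction-tuned LLM about each training
# example and use its confidence in a "yes" answer as a quality score.
# `yes_probability` is a placeholder callable you would back with your own
# LLM inference stack; the prompt text below is paraphrased, not verbatim.
ASK_LLM_PROMPT = (
    "###\n{example}\n###\n"
    "Does the previous text contain informative signal for pre-training "
    "a large language model? Answer yes or no."
)

def ask_llm_scores(examples, yes_probability):
    """Score each example by the LLM's probability of answering 'yes'."""
    return np.array(
        [yes_probability(ASK_LLM_PROMPT.format(example=ex)) for ex in examples]
    )

# --- Density sampling (sketch) ----------------------------------------------
# Coverage idea: estimate how densely populated each example's neighborhood
# is in embedding space, then sample inversely to that density so sparse
# regions of the data distribution remain represented.
def kernel_density(embeddings, bandwidth=1.0):
    """Unnormalized Gaussian-kernel density for each row (O(n^2) toy version)."""
    sq_dists = np.sum(
        (embeddings[:, None, :] - embeddings[None, :, :]) ** 2, axis=-1
    )
    return np.exp(-sq_dists / (2.0 * bandwidth ** 2)).sum(axis=1)

def density_sample(embeddings, keep_fraction=0.1, bandwidth=1.0, seed=0):
    """Pick indices with probability inversely proportional to local density."""
    density = kernel_density(embeddings, bandwidth)
    probs = 1.0 / density
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    n_keep = max(1, int(keep_fraction * len(embeddings)))
    return rng.choice(len(embeddings), size=n_keep, replace=False, p=probs)
```

In this reading, a quality-based pipeline would rank the corpus by `ask_llm_scores` and keep only the top fraction (e.g., 10% when rejecting 90% of the data), while a coverage-based pipeline would draw the training subset with `density_sample`; the abstract compares these two families across 19 samplers in total.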

