LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
March 22, 2024
Authors: Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI
Abstract
Pretrained large language models (LLMs) are currently state-of-the-art for
solving the vast majority of natural language processing tasks. While many
real-world applications still require fine-tuning to reach satisfactory levels
of performance, many of them are in the low-data regime, making fine-tuning
challenging. To address this, we propose LLM2LLM, a targeted and iterative data
augmentation strategy that uses a teacher LLM to enhance a small seed dataset
by augmenting additional data that can be used for fine-tuning on a specific
task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data,
(2) evaluates and extracts data points that the model gets wrong, and (3) uses
a teacher LLM to generate synthetic data based on these incorrect data points,
which are then added back into the training data. This approach amplifies the
signal from incorrectly predicted data points by the LLM during training and
reintegrates them into the dataset to focus on more challenging examples for
the LLM. Our results show that LLM2LLM significantly enhances the performance
of LLMs in the low-data regime, outperforming both traditional fine-tuning and
other data augmentation baselines. LLM2LLM reduces the dependence on
labor-intensive data curation and paves the way for more scalable and
performant LLM solutions, allowing us to tackle data-constrained domains and
tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on
CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular
fine-tuning in the low-data regime using a LLaMA2-7B student model.
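
The abstract describes LLM2LLM as an iterative loop: fine-tune a student on the current data, find the seed examples it still gets wrong, and have a teacher LLM synthesize new training examples targeted at those errors. The Python sketch below illustrates that loop under stated assumptions; the callables `fine_tune_student`, `is_correct`, and `generate_from_errors` are hypothetical placeholders standing in for the training, evaluation, and teacher-prompting steps, not the paper's released implementation.

```python
from typing import Any, Callable, List

def llm2llm_loop(
    seed_data: List[Any],
    fine_tune_student: Callable[[List[Any]], Any],            # returns a fine-tuned student model
    is_correct: Callable[[Any, Any], bool],                    # (student, example) -> answered correctly?
    generate_from_errors: Callable[[List[Any]], List[Any]],    # teacher LLM makes new examples from errors
    num_iterations: int = 3,
):
    """Sketch of the targeted, iterative augmentation loop described in the abstract."""
    train_data = list(seed_data)
    student = None
    for _ in range(num_iterations):
        # (1) Fine-tune a baseline student on the current (seed + synthetic) data.
        student = fine_tune_student(train_data)
        # (2) Evaluate on the seed set and keep only the examples the student gets wrong.
        errors = [ex for ex in seed_data if not is_correct(student, ex)]
        if not errors:
            break  # nothing left to target; stop early
        # (3) Ask the teacher LLM for synthetic examples modeled on those errors,
        #     then fold them back into the training data for the next round.
        train_data.extend(generate_from_errors(errors))
    return student, train_data
```

Per the abstract, the point of this structure is that the teacher only sees the data points the student predicted incorrectly, so augmentation amplifies the signal from the harder examples rather than expanding the whole dataset indiscriminately.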