LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement
March 22, 2024
作者: Nicholas Lee, Thanakul Wattanawong, Sehoon Kim, Karttikeya Mangalam, Sheng Shen, Gopala Anumanchipali, Michael W. Mahoney, Kurt Keutzer, Amir Gholami
cs.AI
Abstract
Pretrained large language models (LLMs) are currently state-of-the-art for
solving the vast majority of natural language processing tasks. While many
real-world applications still require fine-tuning to reach satisfactory levels
of performance, many of them are in the low-data regime, making fine-tuning
challenging. To address this, we propose LLM2LLM, a targeted and iterative data
augmentation strategy that uses a teacher LLM to enhance a small seed dataset
by augmenting additional data that can be used for fine-tuning on a specific
task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data,
(2) evaluates and extracts data points that the model gets wrong, and (3) uses
a teacher LLM to generate synthetic data based on these incorrect data points,
which are then added back into the training data. This approach amplifies the
signal from incorrectly predicted data points by the LLM during training and
reintegrates them into the dataset to focus on more challenging examples for
the LLM. Our results show that LLM2LLM significantly enhances the performance
of LLMs in the low-data regime, outperforming both traditional fine-tuning and
other data augmentation baselines. LLM2LLM reduces the dependence on
labor-intensive data curation and paves the way for more scalable and
performant LLM solutions, allowing us to tackle data-constrained domains and
tasks. We achieve improvements up to 24.2% on the GSM8K dataset, 32.6% on
CaseHOLD, 32.0% on SNIPS, 52.6% on TREC and 39.8% on SST-2 over regular
fine-tuning in the low-data regime using a LLaMA2-7B student model.
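To make the three-step loop in the abstract concrete, here is a minimal sketch of the iterative augmentation procedure. The helper callables (fine_tune, evaluate, generate_synthetic) and the iteration count are hypothetical stand-ins for a real training, evaluation, and teacher-prompting stack, not the authors' released code; the sketch only illustrates the control flow described above.

```python
# Minimal sketch of the LLM2LLM-style loop: fine-tune a student on seed data,
# collect the seed examples it gets wrong, ask a teacher LLM for targeted
# synthetic examples, add them back, and repeat. All callables are assumptions.
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (input, expected answer) pair


def llm2llm_loop(
    seed_data: List[Example],
    fine_tune: Callable[[List[Example]], object],                 # trains a student, returns the model
    evaluate: Callable[[object, List[Example]], List[Example]],   # returns examples the student answered incorrectly
    generate_synthetic: Callable[[List[Example]], List[Example]], # teacher LLM writes new examples from the errors
    num_iterations: int = 3,
) -> object:
    """Iteratively augment seed_data with teacher-generated examples
    targeted at the student's mistakes; return the final student."""
    train_data = list(seed_data)
    student = fine_tune(train_data)                  # (1) fine-tune on the initial seed data
    for _ in range(num_iterations):
        wrong = evaluate(student, seed_data)         # (2) extract seed points the student gets wrong
        if not wrong:
            break                                    # nothing left to target
        train_data += generate_synthetic(wrong)      # (3) teacher generates data from the errors; add it back
        student = fine_tune(train_data)              # re-train on the augmented dataset
    return student
```

The key design point illustrated here is that new data is generated only from the examples the student currently fails on, so each iteration amplifies the signal from the hardest cases rather than adding unconditioned synthetic data.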