LLM2LLM: Boosting LLMs with Novel Iterative Data Enhancement

Abstract

Pretrained large language models (LLMs) are currently the state of the art for the vast majority of natural language processing tasks. Many real-world applications still require fine-tuning to reach satisfactory performance, yet many of them operate in the low-data regime, which makes fine-tuning challenging. To address this, we propose LLM2LLM, a targeted and iterative data augmentation strategy that uses a teacher LLM to expand a small seed dataset with additional data that can be used for fine-tuning on a specific task. LLM2LLM (1) fine-tunes a baseline student LLM on the initial seed data, (2) evaluates the student and extracts the data points it gets wrong, and (3) uses a teacher LLM to generate synthetic data based on these incorrect data points, which are then added back into the training data. This approach amplifies the signal from the data points the student predicts incorrectly during training and reintegrates them into the dataset, so that training focuses on the examples the student finds most challenging. Our results show that LLM2LLM significantly enhances the performance of LLMs in the low-data regime, outperforming both traditional fine-tuning and other data augmentation baselines. LLM2LLM reduces the dependence on labor-intensive data curation and paves the way for more scalable and performant LLM solutions, allowing us to tackle data-constrained domains and tasks. Using a LLaMA2-7B student model in the low-data regime, we achieve improvements over regular fine-tuning of up to 24.2% on the GSM8K dataset, 32.6% on CaseHOLD, 32.0% on SNIPS, 52.6% on TREC, and 39.8% on SST-2.
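The three steps in the abstract can be sketched as a short loop. This is only an illustrative outline, not the authors' implementation: `fine_tune` and `augment` are hypothetical callables standing in for student fine-tuning and the teacher-LLM call. Measuring errors on the seed set by exact match, and stopping early when nothing is wrong, are assumptions the abstract does not specify.

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]        # (prompt, reference answer)
Student = Callable[[str], str]   # maps a prompt to a predicted answer


def llm2llm(
    seed: List[Example],
    fine_tune: Callable[[List[Example]], Student],  # hypothetical trainer
    augment: Callable[[Example], List[Example]],    # hypothetical teacher-LLM call
    n_rounds: int = 3,
) -> List[Example]:
    """Return the training set after up to n_rounds of targeted augmentation."""
    train = list(seed)
    for _ in range(n_rounds):
        # (1) Fine-tune a student on the current training data.
        student = fine_tune(train)
        # (2) Collect the seed examples the student still gets wrong.
        #     Assumption: errors are measured on the seed set by exact match.
        wrong = [(x, y) for x, y in seed if student(x) != y]
        if not wrong:
            break  # assumption: stop once the student makes no errors
        # (3) Ask the teacher for new examples modeled on each failure
        #     and add them back into the training data.
        for example in wrong:
            train.extend(augment(example))
    return train
```

In practice `fine_tune` would wrap a real training run and `augment` would prompt a teacher model. Passing them in as parameters keeps the loop itself independent of any particular framework.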

