Unleashing the Power of Data Tsunami: A Comprehensive Survey on Data Assessment and Selection for Instruction Tuning of Language Models

Abstract

Instruction tuning plays a critical role in aligning large language models (LLMs) with human preferences. Despite the vast number of open instruction datasets, naively training an LLM on all existing instructions may be neither optimal nor practical. To pinpoint the most beneficial datapoints, data assessment and selection methods have been proposed in the fields of natural language processing (NLP) and deep learning. However, in the context of instruction tuning, there is still a gap in knowledge about which data evaluation metrics can be employed and how they can be integrated into the selection mechanism. To bridge this gap, we present a comprehensive review of the existing literature on data assessment and selection, specifically for instruction tuning of LLMs. We systematically categorize all applicable methods into quality-based, diversity-based, and importance-based ones, and structure them in a unified, fine-grained taxonomy. For each category, representative methods are elaborated to describe the landscape of relevant research. In addition, we compare the latest methods on their officially reported results to provide in-depth discussion of their limitations. Finally, we summarize the open challenges and propose promising avenues for future studies. All related content is available at https://github.com/yuleiqin/fantastic-data-engineering.
