Evaluating Language Models as Synthetic Data Generators

Abstract

Given the increasing use of synthetic data in language model (LM) post-training, an LM's ability to generate high-quality data has become nearly as crucial as its ability to solve problems directly. While prior work has focused on developing effective data generation methods, it lacks a systematic comparison of different LMs as data generators in a unified setting. To address this gap, we propose AgoraBench, a benchmark that provides standardized settings and metrics for evaluating LMs' data generation abilities. By synthesizing 1.26 million training instances with 6 LMs and training 99 student models, we uncover key insights about LMs' data generation capabilities. First, we observe that LMs exhibit distinct strengths. For instance, GPT-4o excels at generating new problems, while Claude-3.5-Sonnet performs better at enhancing existing ones. Furthermore, our analysis reveals that an LM's data generation ability does not necessarily correlate with its problem-solving ability. Instead, multiple intrinsic features of data quality, including response quality, perplexity, and instruction difficulty, collectively serve as better indicators. Finally, we demonstrate that strategic choices in output format and cost-conscious model selection significantly impact data generation effectiveness.
