Surveying the Effects of Quality, Diversity, and Complexity in Synthetic Data From Large Language Models

Abstract

Synthetic data generation with Large Language Models is a promising paradigm for augmenting natural data over a nearly infinite range of tasks. Given this variety, direct comparisons among synthetic data generation algorithms are scarce, making it difficult to understand where improvement comes from and what bottlenecks exist. We propose to evaluate algorithms via the makeup of the synthetic data each one generates, in terms of data quality, diversity, and complexity (QDC). We choose these three characteristics for their significance in open-ended processes and for the impact each has on the capabilities of downstream models. We find quality to be essential for in-distribution model generalization, diversity to be essential for out-of-distribution generalization, and complexity to be beneficial for both. Further, we emphasize the existence of Quality-Diversity (QD) trade-offs in training data and their downstream effects on model performance. We then examine the effect of various components in the synthetic data pipeline on each data characteristic. This examination allows us to taxonomize and compare synthetic data generation algorithms through the components they utilize and the resulting effects on data QDC composition. This analysis extends into a discussion of the importance of balancing QDC in synthetic data for efficient reinforcement learning and self-improvement algorithms. Analogous to the QD trade-offs in training data, there often exist trade-offs between model output quality and output diversity, which impact the composition of synthetic data. We observe that many models are currently evaluated and optimized only for output quality, thereby limiting output diversity and the potential for self-improvement. We argue that balancing these trade-offs is essential to the development of future self-improvement algorithms and highlight a number of works making progress in this direction.
