Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Abstract

Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
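The idea of treating text attributes as genes can be illustrated with a minimal sketch. It is not the authors' implementation: the attribute names in `GENES`, the `mutation_rate` value, the helper functions, and the prompt wording are all hypothetical, chosen only to show how a crossover-and-mutation instruction for an LLM might be assembled from two parent samples. The LLM call itself and the active learning step for parent selection are left out.

```python
import random

# Hypothetical attribute names ("genes"); the abstract does not list the actual ones.
GENES = ["length", "topic", "style", "sentence_structure"]


def assign_genes(genes, rng, mutation_rate=0.25):
    """Decide, per attribute, whether it is inherited from a parent or mutated."""
    plan = {}
    for gene in genes:
        if rng.random() < mutation_rate:
            plan[gene] = "mutate"
        else:
            plan[gene] = rng.choice(["parent_1", "parent_2"])
    return plan


def build_prompt(parent_1, parent_2, label, plan):
    """Render the crossover/mutation plan as an instruction an LLM could follow."""
    lines = [
        f"Write a new example for the class '{label}'.",
        f"Parent 1: {parent_1}",
        f"Parent 2: {parent_2}",
        "Combine the parents attribute by attribute:",
    ]
    for gene, source in plan.items():
        if source == "mutate":
            lines.append(f"- {gene}: use a value that differs from both parents")
        else:
            lines.append(f"- {gene}: inherit from {source.replace('_', ' ')}")
    return "\n".join(lines)


rng = random.Random(0)  # seeded so the sketch is reproducible
plan = assign_genes(GENES, rng)
prompt = build_prompt(
    "The battery lasts two days.",
    "Shipping was slow but support helped.",
    "positive review",
    plan,
)
print(prompt)
```

In this sketch the genetic operators are only described in the prompt text. The LLM, rather than a string-level operator, would be responsible for realising the recombined attributes in fluent text, which is the sense in which the abstract speaks of the LLM simulating crossover and mutation.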
