
Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

September 2, 2025
Authors: Guangzeng Han, Weisi Liu, Xiaolei Huang
cs.AI

Abstract

Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.
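The abstract's core loop (attributes as genes, LLM-simulated crossover and mutation, parent selection) can be sketched as a toy evolutionary loop. This is an illustrative assumption, not the authors' implementation: the real framework prompts an LLM to perform the genetic operators and uses an active learning scheme for parent selection, whereas this sketch uses a deterministic stand-in (`llm_operator`) and random parent selection so it runs offline. All function names, prompts, and attribute strings below are hypothetical.

```python
import random

def llm_operator(prompt: str, attrs: list[str]) -> list[str]:
    # Stand-in for an LLM call: simply echoes the attributes it was
    # asked to recombine. A real system would send `prompt` plus the
    # attributes to a generator model and parse its response.
    return attrs

def crossover(parent_a: list[str], parent_b: list[str]) -> list[str]:
    # Mix attribute "genes" from two parents at a random cut point,
    # then (conceptually) let the LLM smooth the combination.
    cut = random.randint(1, min(len(parent_a), len(parent_b)) - 1)
    child = parent_a[:cut] + parent_b[cut:]
    return llm_operator("Recombine these attributes coherently:", child)

def mutate(attrs: list[str], pool: list[str], rate: float = 0.2) -> list[str]:
    # Randomly swap individual genes for alternatives from an attribute pool.
    out = [random.choice(pool) if random.random() < rate else a for a in attrs]
    return llm_operator("Perturb one attribute:", out)

def evolve(population: list[list[str]], pool: list[str], generations: int = 3):
    for _ in range(generations):
        a, b = random.sample(population, 2)    # parent selection (random here;
        child = mutate(crossover(a, b), pool)  # the paper uses active learning)
        population.append(child)               # new attribute combination
    return population

pop = [["formal tone", "medical topic"], ["casual tone", "sports topic"]]
pool = ["angry tone", "finance topic", "long sentences"]
result = evolve(pop, pool)
```

Each surviving attribute list would then condition a final LLM generation prompt, yielding synthetic examples whose attribute mix differs from any single seed example.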