VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
December 12, 2025
Authors: Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi
cs.AI
Abstract
Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel, principled approach for generating diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that captures the diversity of the dataset, using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for our method, we demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches, yielding a 1.5-3x improvement in diversity.
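The abstract does not give implementation details, but the kind of quantity it describes, a determinantal-point-process-style diversity score over a set of generated samples, can be sketched as follows. This is a minimal illustration only: it assumes candidate generations have already been embedded into vectors, measures diversity as the log-determinant of a cosine-similarity kernel, and uses a simple greedy selection loop. The function names (`dpp_diversity`, `greedy_select`) and the greedy procedure are hypothetical and are not the paper's actual algorithm.

```python
import numpy as np

def dpp_diversity(embeddings: np.ndarray, eps: float = 1e-6) -> float:
    """Log-determinant of a similarity kernel over the set; higher = more diverse."""
    # Normalize rows so the Gram matrix is a cosine-similarity kernel.
    X = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    K = X @ X.T
    # Small ridge term keeps the kernel positive definite for near-duplicate rows.
    return float(np.linalg.slogdet(K + eps * np.eye(len(K)))[1])

def greedy_select(candidates: np.ndarray, k: int) -> list[int]:
    """Greedily pick k candidate indices that maximize the DPP diversity score."""
    selected: list[int] = []
    for _ in range(k):
        best_idx, best_score = -1, -np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            score = dpp_diversity(candidates[selected + [i]])
            if score > best_score:
                best_idx, best_score = i, score
        selected.append(best_idx)
    return selected
```

In an iterative, training-free setup such as the one the abstract describes, a score of this kind could be computed over embeddings of LLM outputs at each round and used to decide which samples to keep or how to steer subsequent generation prompts; how Voyager actually uses the DPP machinery is detailed in the paper itself.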