VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
December 12, 2025
Authors: Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi
cs.AI
Abstract
Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach for generating diverse datasets. Our approach is iterative and directly optimizes a mathematical measure of dataset diversity using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for our method, we demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches, yielding a 1.5-3x improvement in diversity.
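As context for the determinantal-point-process machinery the abstract refers to, the sketch below is a minimal, hypothetical illustration (not the authors' implementation) of a DPP-style diversity objective: a set of generated samples is scored by the log-determinant of a similarity kernel over their embeddings, and items are greedily selected to maximize that score. All names (`rbf_kernel`, `logdet_diversity`, `greedy_dpp_select`) and parameter choices are assumptions for illustration only.

```python
import numpy as np


def rbf_kernel(X: np.ndarray, gamma: float = 1.0) -> np.ndarray:
    """Similarity kernel L with L[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq_dists = (
        np.sum(X**2, axis=1, keepdims=True)
        + np.sum(X**2, axis=1)
        - 2.0 * X @ X.T
    )
    return np.exp(-gamma * np.clip(sq_dists, 0.0, None))


def logdet_diversity(L: np.ndarray, subset: list) -> float:
    """DPP-style diversity score: log det of the kernel restricted to the subset."""
    sub = L[np.ix_(subset, subset)]
    # A small ridge term keeps the determinant numerically well conditioned.
    sign, logdet = np.linalg.slogdet(sub + 1e-6 * np.eye(len(subset)))
    return logdet if sign > 0 else -np.inf


def greedy_dpp_select(L: np.ndarray, k: int) -> list:
    """Greedily add the item that most increases the log-det diversity score."""
    selected, remaining = [], set(range(L.shape[0]))
    for _ in range(k):
        best_i, best_score = None, -np.inf
        for i in remaining:
            score = logdet_diversity(L, selected + [i])
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
        remaining.remove(best_i)
    return selected


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-in for embeddings of LLM-generated samples.
    embeddings = rng.normal(size=(50, 8))
    L = rbf_kernel(embeddings, gamma=0.5)
    picked = greedy_dpp_select(L, k=10)
    print("selected indices:", picked)
    print("diversity (log det):", logdet_diversity(L, picked))
```

In this illustrative setup, a larger log-determinant corresponds to a more "spread out" (less redundant) set of embeddings, which is the general sense in which DPPs quantify set diversity; how Voyager actually defines and optimizes its diversity objective is detailed in the paper itself.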