Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
June 12, 2025
Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
cs.AI
Abstract
We introduce Domain2Vec, a novel approach that decomposes any
dataset into a linear combination of several meta-domains, a new concept
designed to capture the key underlying features of datasets.
Domain2Vec maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the Distribution Alignment
Assumption (DA²), which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, Domain2Vec can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
Domain2Vec helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
Domain2Vec achieves the same validation loss on Pile-CC using only
51.5% of the computation required when training on the original mixture of
The Pile dataset. Under an equivalent compute budget, Domain2Vec improves
downstream performance by an average of 2.83%.
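To make the training-free pipeline concrete, the sketch below illustrates the two steps the abstract describes: averaging a classifier's per-document meta-domain probabilities into a dataset-level domain vector, and then searching for mixture weights whose combined domain vector aligns with the validation set's vector. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and classifier interface are hypothetical, and KL divergence is used as one plausible alignment measure, which may differ from the paper's exact objective.

```python
import numpy as np
from scipy.optimize import minimize

def dataset_to_domain_vector(documents, meta_domain_classifier):
    """Average per-document meta-domain probabilities into one
    dataset-level domain vector (a distribution over the meta-domain
    vocabulary). `meta_domain_classifier` is a hypothetical callable
    that returns a probability vector for a single document."""
    probs = np.stack([meta_domain_classifier(doc) for doc in documents])
    return probs.mean(axis=0)

def find_mixture_da2(domain_vectors, val_vector):
    """Training-free mixture search under the Distribution Alignment
    Assumption: choose simplex weights w so the mixed training
    distribution w @ V aligns with the validation domain vector.
    KL(val || mixture) serves here as one possible alignment measure."""
    V = np.asarray(domain_vectors, dtype=float)  # (num_datasets, num_meta_domains)
    p = np.asarray(val_vector, dtype=float)      # (num_meta_domains,), sums to 1
    k = V.shape[0]

    def kl_to_val(w):
        q = w @ V + 1e-12                        # mixed training distribution
        return float(np.sum(p * (np.log(p + 1e-12) - np.log(q))))

    result = minimize(
        kl_to_val,
        x0=np.full(k, 1.0 / k),                  # start from the uniform mixture
        bounds=[(0.0, 1.0)] * k,                 # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return result.x

# Toy example: three candidate datasets over four meta-domains.
V = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.6, 0.2, 0.1],
     [0.1, 0.1, 0.2, 0.6]]
val = [0.3, 0.3, 0.2, 0.2]
print(find_mixture_da2(V, val))  # one weight per candidate dataset
```

Because the search operates only on low-dimensional domain vectors, no LM training runs are needed to evaluate a candidate mixture, which is what makes the approach cheap relative to methods that fit proxy models per mixture.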