

Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

June 12, 2025
Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
cs.AI

Abstract

We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector, a distribution over this vocabulary. These domain vectors enable the identification of the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which posits that validation loss decreases as the data distributions of the training and validation sets become more closely aligned. Moreover, Domain2Vec can be seamlessly integrated into prior work that models the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of those methods. Extensive experiments demonstrate that Domain2Vec finds data mixtures that improve downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
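
To make the two-step recipe concrete, below is a minimal sketch of how domain vectors and the DA² mixture search could fit together. It is an illustration under stated assumptions, not the paper's implementation: `classifier(text)` is a hypothetical callable returning a probability vector over the K meta-domains, and KL divergence is used as one plausible measure of alignment between the mixed training domain vector and the validation domain vector.

```python
# A minimal sketch of the DA^2 idea. All names here (domain_vector,
# find_mixture, classifier) are illustrative, not the paper's API.
import numpy as np
from scipy.optimize import minimize


def domain_vector(texts, classifier):
    """Average the classifier's meta-domain distribution over samples.

    `classifier(text)` is assumed to return a probability vector over
    the K meta-domains in the vocabulary (hypothetical interface).
    """
    probs = np.stack([classifier(t) for t in texts])
    return probs.mean(axis=0)


def find_mixture(train_vecs, val_vec):
    """Training-free mixture search under the Distribution Alignment
    Assumption: choose weights w on the simplex so that the mixed
    training domain vector sum_i w_i * d_i is closest to the validation
    domain vector. KL divergence is one plausible alignment measure.
    """
    D = np.stack(train_vecs)  # shape (n_datasets, K)
    n = D.shape[0]
    eps = 1e-12

    def kl_to_val(w):
        mix = w @ D  # domain vector of the mixed training set
        return float(np.sum(val_vec * np.log((val_vec + eps) / (mix + eps))))

    res = minimize(
        kl_to_val,
        x0=np.full(n, 1.0 / n),  # start from the uniform mixture
        bounds=[(0.0, 1.0)] * n,
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return res.x  # mixture weights, one per candidate dataset
```

Because the search operates only on K-dimensional domain vectors, its cost is negligible next to even a single pretraining run, which is what makes the mixture selection training-free.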