Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
June 12, 2025
Authors: Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
cs.AI
Abstract
We introduce Domain2Vec, a novel approach that decomposes any
dataset into a linear combination of several meta-domains, a new concept
designed to capture the key underlying features of datasets.
Domain2Vec maintains a vocabulary of meta-domains and uses a
classifier to decompose any given dataset into a domain vector that corresponds
to a distribution over this vocabulary. These domain vectors enable the
identification of the optimal data mixture for language model (LM) pretraining
in a training-free manner under the Distribution Alignment
Assumption (DA²), which suggests that when
the data distributions of the training set and the validation set are better
aligned, a lower validation loss is achieved. Moreover, Domain2Vec can
be seamlessly integrated into previous works to model the relationship between
domain vectors and LM performance, greatly enhancing the efficiency and
scalability of previous methods. Extensive experiments demonstrate that
Domain2Vec helps find the data mixture that enhances downstream task
performance with minimal computational overhead. Specifically,
Domain2Vec achieves the same validation loss on Pile-CC using only
51.5% of the computation required when training on the original mixture of
The Pile dataset. Under an equivalent compute budget, Domain2Vec improves
downstream performance by an average of 2.83%.
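To make the training-free pipeline concrete, the sketch below illustrates the two steps the abstract describes: averaging a classifier's per-document meta-domain probabilities into a dataset-level domain vector, and then searching for mixture weights whose combined domain vector aligns with the validation set's vector. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names and classifier interface are hypothetical, and KL divergence is used as one plausible alignment measure, which may differ from the paper's exact objective.

```python
import numpy as np
from scipy.optimize import minimize

def dataset_to_domain_vector(documents, meta_domain_classifier):
    """Average per-document meta-domain probabilities into one
    dataset-level domain vector (a distribution over the meta-domain
    vocabulary). `meta_domain_classifier` is a hypothetical callable
    that returns a probability vector for a single document."""
    probs = np.stack([meta_domain_classifier(doc) for doc in documents])
    return probs.mean(axis=0)

def find_mixture_da2(domain_vectors, val_vector):
    """Training-free mixture search under the Distribution Alignment
    Assumption: choose simplex weights w so the mixed training
    distribution w @ V aligns with the validation domain vector.
    KL(val || mixture) serves here as one possible alignment measure."""
    V = np.asarray(domain_vectors, dtype=float)  # (num_datasets, num_meta_domains)
    p = np.asarray(val_vector, dtype=float)      # (num_meta_domains,), sums to 1
    k = V.shape[0]

    def kl_to_val(w):
        q = w @ V + 1e-12                        # mixed training distribution
        return float(np.sum(p * (np.log(p + 1e-12) - np.log(q))))

    result = minimize(
        kl_to_val,
        x0=np.full(k, 1.0 / k),                  # start from the uniform mixture
        bounds=[(0.0, 1.0)] * k,                 # each weight in [0, 1]
        constraints=[{"type": "eq", "fun": lambda w: w.sum() - 1.0}],
        method="SLSQP",
    )
    return result.x

# Toy example: three candidate datasets over four meta-domains.
V = [[0.7, 0.1, 0.1, 0.1],
     [0.1, 0.6, 0.2, 0.1],
     [0.1, 0.1, 0.2, 0.6]]
val = [0.3, 0.3, 0.2, 0.2]
print(find_mixture_da2(V, val))  # one weight per candidate dataset
```

Because the search operates only on low-dimensional domain vectors, no LM training runs are needed to evaluate a candidate mixture, which is what makes the approach cheap relative to methods that fit proxy models per mixture.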