Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training

Abstract

We introduce Domain2Vec, a novel approach that decomposes any dataset into a linear combination of several meta-domains, a new concept designed to capture the key underlying features of datasets. Domain2Vec maintains a vocabulary of meta-domains and uses a classifier to decompose any given dataset into a domain vector that corresponds to a distribution over this vocabulary. These domain vectors make it possible to identify the optimal data mixture for language model (LM) pretraining in a training-free manner under the Distribution Alignment Assumption (DA²), which suggests that the better the data distributions of the training set and the validation set are aligned, the lower the validation loss. Moreover, Domain2Vec can be seamlessly integrated into previous works to model the relationship between domain vectors and LM performance, greatly enhancing the efficiency and scalability of those methods. Extensive experiments demonstrate that Domain2Vec helps find the data mixture that enhances downstream task performance with minimal computational overhead. Specifically, Domain2Vec achieves the same validation loss on Pile-CC using only 51.5% of the computation required when training on the original mixture of The Pile dataset. Under an equivalent compute budget, Domain2Vec improves downstream performance by an average of 2.83%.
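As a rough illustration of the idea, the sketch below builds a domain vector by averaging per-document classifier probabilities over the meta-domains, then searches for mixture weights whose combined domain vector lies closest to that of the validation set. The abstract does not specify the distance measure or the search procedure, so the squared Euclidean distance and the random search over the simplex used here are assumptions, and all function names are hypothetical.

```python
import random

def domain_vector(doc_probs):
    """Average per-document meta-domain probabilities into one domain vector."""
    n = len(doc_probs)
    k = len(doc_probs[0])
    return [sum(p[j] for p in doc_probs) / n for j in range(k)]

def mix(weights, vectors):
    """Domain vector of a mixture: weighted sum of the dataset domain vectors."""
    k = len(vectors[0])
    return [sum(w * v[j] for w, v in zip(weights, vectors)) for j in range(k)]

def sq_dist(a, b):
    """Squared Euclidean distance (an assumed stand-in for the alignment measure)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def best_mixture(train_vectors, val_vector, n_samples=5000, seed=0):
    """Random search over the simplex for the weights that best match val_vector."""
    rng = random.Random(seed)
    best_w, best_d = None, float("inf")
    for _ in range(n_samples):
        # Normalized exponential draws are uniformly distributed on the simplex.
        raw = [rng.expovariate(1.0) for _ in train_vectors]
        total = sum(raw)
        w = [r / total for r in raw]
        d = sq_dist(mix(w, train_vectors), val_vector)
        if d < best_d:
            best_w, best_d = w, d
    return best_w, best_d

# Toy setup: three meta-domains, two training datasets, one validation set.
train_vectors = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
val_vector = [0.45, 0.1, 0.45]
weights, distance = best_mixture(train_vectors, val_vector)
```

In this toy setup the validation vector is the exact midpoint of the two training vectors, so the search should settle near equal weights. No language model is trained at any point, which is what "training-free" refers to here.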
