CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training

Abstract

Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
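
The iterative search described above can be illustrated with a toy sketch. It assumes the corpus has already been embedded and clustered, so a mixture is just a weight vector over k clusters. Everything here is hypothetical and only stands in for the real components: `proxy_score` replaces training and evaluating a small proxy model with a synthetic score, and `predict` is a simple nearest-neighbor average, not necessarily the predictor used by the authors.

```python
import random


def sample_mixture(k, rng):
    """Draw random mixture weights over k clusters (non-negative, summing to 1)."""
    raw = [rng.expovariate(1.0) for _ in range(k)]
    total = sum(raw)
    return [x / total for x in raw]


def proxy_score(weights, target):
    """Toy stand-in for 'train a small proxy model on this mixture and evaluate it'.

    Assumption: the score is the negative squared distance to a hidden target mixture.
    """
    return -sum((w - t) ** 2 for w, t in zip(weights, target))


def predict(weights, history, n_neighbors=3):
    """Hypothetical predictor: mean score of the nearest already-evaluated mixtures."""
    ranked = sorted(
        history,
        key=lambda item: sum((a - b) ** 2 for a, b in zip(weights, item[0])),
    )
    nearest = ranked[:n_neighbors]
    return sum(score for _, score in nearest) / len(nearest)


def iterative_search(k=5, rounds=3, candidates=200, evals_per_round=8, seed=0):
    """Alternate between fitting on evaluated mixtures and evaluating promising ones."""
    rng = random.Random(seed)
    target = sample_mixture(k, rng)  # hidden optimum, used only by the toy scorer
    history = []

    # Bootstrap: evaluate a few random mixtures with the (toy) proxy model.
    for _ in range(evals_per_round):
        w = sample_mixture(k, rng)
        history.append((w, proxy_score(w, target)))

    # Refine: rank a candidate pool with the predictor, evaluate only the top ones.
    for _ in range(rounds):
        pool = [sample_mixture(k, rng) for _ in range(candidates)]
        pool.sort(key=lambda w: predict(w, history), reverse=True)
        for w in pool[:evals_per_round]:
            history.append((w, proxy_score(w, target)))

    best_w, best_s = max(history, key=lambda item: item[1])
    return best_w, best_s, history


if __name__ == "__main__":
    best_w, best_s, history = iterative_search()
    print("evaluated mixtures:", len(history))
    print("best weights:", [round(w, 3) for w in best_w])
```

The point of the design is that the expensive step (the proxy evaluation) runs only on the few candidates the cheap predictor ranks highest, and each round's results feed back into the predictor for the next round.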
