SlimPajama-DC: Understanding Data Combinations for LLM Training

Abstract

This paper aims to understand how various data combinations (e.g., web text, Wikipedia, GitHub, books) affect the training of large language models, using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, refined and further deduplicated to 627B tokens from the 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices for using SlimPajama to train large language models. Two pivotal observations emerged during this research: (1) Global deduplication vs. local deduplication. We analyze and discuss how global deduplication (across different data sources) and local deduplication (within a single data source) affect the performance of trained models. (2) Proportions of high-quality, highly deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train each one using a 1.3B Cerebras-GPT model with ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama with the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our findings (for example, that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
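The distinction between local and global deduplication can be illustrated with a minimal Python sketch. It relies on exact matching of a normalized hash, which is an assumption made here for brevity and not necessarily the deduplication method used to build SlimPajama. The function names (`local_dedup`, `global_dedup`) and the toy documents are hypothetical.

```python
import hashlib
from typing import Dict, List


def _fingerprint(text: str) -> str:
    """Hash of lowercased, whitespace-normalized text (exact-match proxy)."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def local_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove duplicates within each source independently."""
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        seen = set()  # fresh set per source: cross-source copies survive
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Remove duplicates across all sources using one shared set."""
    seen = set()  # shared across sources: only the first copy survives
    result: Dict[str, List[str]] = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            fp = _fingerprint(doc)
            if fp not in seen:
                seen.add(fp)
                kept.append(doc)
        result[name] = kept
    return result


# Hypothetical toy corpus: "Hello world" appears in both sources.
toy_sources = {
    "web": ["The cat sat.", "the  cat sat.", "Hello world"],
    "wiki": ["Hello world", "Paris is in France."],
}

local_result = local_dedup(toy_sources)
global_result = global_dedup(toy_sources)
local_total = sum(len(docs) for docs in local_result.values())
global_total = sum(len(docs) for docs in global_result.values())
```

In this sketch, local deduplication keeps the copy of "Hello world" in both sources, while global deduplication keeps only the copy from whichever source is processed first. That order dependence is a property of this simplified version, not a claim about the paper's pipeline.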

