

SlimPajama-DC: Understanding Data Combinations for LLM Training

September 19, 2023
作者: Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing
cs.AI

Abstract

This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affects the performance of trained models. (2) Proportions of high-quality, highly deduplicated multi-source data in the combination. To study this, we construct six configurations of the SlimPajama dataset and train each with a 1.3B Cerebras-GPT model using ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama with the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our findings (e.g., that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
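To make observation (1) concrete, below is a minimal, hypothetical sketch contrasting local deduplication (within a single source) with global deduplication (across all sources), using exact document hashes for brevity. The actual SlimPajama pipeline performs fuzzy, MinHash-style near-duplicate detection at far larger scale; the source names and helper functions here are illustrative assumptions, not the paper's code.

```python
# Illustrative sketch only: exact-hash deduplication at two scopes.
# Real pipelines (including SlimPajama's) use fuzzy near-duplicate detection.
import hashlib


def doc_hash(text: str) -> str:
    """Hash a whitespace-normalized, lowercased document so duplicates collide."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode()).hexdigest()


def local_dedup(sources: dict) -> dict:
    """Remove duplicates independently within each source (e.g., only within GitHub)."""
    out = {}
    for name, docs in sources.items():
        seen, kept = set(), []
        for d in docs:
            h = doc_hash(d)
            if h not in seen:
                seen.add(h)
                kept.append(d)
        out[name] = kept
    return out


def global_dedup(sources: dict) -> dict:
    """Remove duplicates across all sources: a document appearing in two
    sources is kept only in the first source where it is seen."""
    seen, out = set(), {}
    for name, docs in sources.items():
        kept = []
        for d in docs:
            h = doc_hash(d)
            if h not in seen:
                seen.add(h)
                kept.append(d)
        out[name] = kept
    return out


if __name__ == "__main__":
    # Toy corpus with one document shared across two hypothetical sources.
    corpus = {
        "commoncrawl": ["the cat sat on the mat", "a unique web page"],
        "c4": ["the cat sat on the mat", "another unique page"],
    }
    print(local_dedup(corpus))   # overlap survives in both sources
    print(global_dedup(corpus))  # overlap kept only in the first source
```

On this toy corpus, local deduplication keeps the overlapping document in both sources, while global deduplication keeps only its first occurrence; that scope difference, and how it interacts with the mixture proportions of the remaining data, is what SlimPajama-DC studies.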