

SlimPajama-DC: Understanding Data Combinations for LLM Training

September 19, 2023
Authors: Zhiqiang Shen, Tianhua Tao, Liqun Ma, Willie Neiswanger, Joel Hestness, Natalia Vassilieva, Daria Soboleva, Eric Xing
cs.AI

Abstract

This paper aims to understand the impacts of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset, which has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our research SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices associated with employing SlimPajama in the training of large language models. During our research with SlimPajama, two pivotal observations emerged: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affects the performance of trained models. (2) Proportions of high-quality, highly deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train each one using a 1.3B Cerebras-GPT model with ALiBi and SwiGLU. Our best configuration outperforms the 1.3B model trained on RedPajama using the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our discoveries (such as that increasing data diversity is crucial after global deduplication) to a 7B model with large-batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
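
For readers unfamiliar with the distinction between the two deduplication modes discussed in the abstract, the sketch below contrasts local deduplication (removing repeats only within each source) with global deduplication (removing repeats across all sources). It is a minimal illustration only, not the SlimPajama pipeline: it uses exact hashing on toy placeholder corpora, whereas the actual pipeline performs fuzzy deduplication at scale.

```python
# Minimal sketch of local vs. global deduplication over multi-source corpora.
# Exact hashing and the tiny example corpus are assumptions made purely to
# keep the illustration short; they do not reflect the real pipeline.
import hashlib


def doc_key(text: str) -> str:
    """Hash a whitespace/case-normalized document so duplicates share a key."""
    return hashlib.sha256(" ".join(text.lower().split()).encode()).hexdigest()


def local_dedup(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Deduplicate within each source independently (separate 'seen' sets)."""
    result = {}
    for name, docs in sources.items():
        seen, kept = set(), []
        for doc in docs:
            key = doc_key(doc)
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources: dict[str, list[str]]) -> dict[str, list[str]]:
    """Deduplicate across all sources using a single shared 'seen' set."""
    seen, result = set(), {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            key = doc_key(doc)
            if key not in seen:
                seen.add(key)
                kept.append(doc)
        result[name] = kept
    return result


if __name__ == "__main__":
    corpus = {
        "web": ["the cat sat", "the cat sat", "a dog ran"],
        "books": ["the cat sat", "chapter one"],
    }
    print(local_dedup(corpus))   # "the cat sat" survives in both web and books
    print(global_dedup(corpus))  # "the cat sat" survives only in the first source seen
```

The practical difference shown here mirrors the paper's observation: local deduplication still leaves cross-source overlap in the training mix, while global deduplication removes it, which in turn changes how much weight each source effectively contributes to the final combination.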