SlimPajama-DC: Understanding Data Combinations for LLM Training

Abstract

This paper aims to understand the impact of various data combinations (e.g., web text, Wikipedia, GitHub, books) on the training of large language models using SlimPajama. SlimPajama is a rigorously deduplicated, multi-source dataset that has been refined and further deduplicated to 627B tokens from the extensive 1.2T-token RedPajama dataset contributed by Together. We term our study SlimPajama-DC, an empirical analysis designed to uncover fundamental characteristics and best practices for using SlimPajama to train large language models. Two pivotal observations emerged during this research: (1) Global deduplication vs. local deduplication. We analyze and discuss how global (across different dataset sources) and local (within a single dataset source) deduplication affect the performance of trained models. (2) Proportions of high-quality/highly deduplicated multi-source datasets in the combination. To study this, we construct six configurations of the SlimPajama dataset and train a separate 1.3B Cerebras-GPT model with ALiBi and SwiGLU on each. Our best configuration outperforms the 1.3B model trained on RedPajama with the same number of training tokens by a significant margin. All our 1.3B models are trained on a Cerebras 16× CS-2 cluster with a total of 80 PFLOP/s in bf16 mixed precision. We further extend our findings (for example, that increasing data diversity is crucial after global deduplication) to a 7B model with large batch-size training. Our models and the separate SlimPajama-DC datasets are available at: https://huggingface.co/MBZUAI-LLM and https://huggingface.co/datasets/cerebras/SlimPajama-627B.
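
The distinction between local and global deduplication can be illustrated with a small Python sketch. It is not the pipeline used to build SlimPajama: it substitutes exact matching on whitespace- and case-normalized text for whatever near-duplicate detection a production pipeline would use, and the function names (`local_dedup`, `global_dedup`), the helper `_key`, and the toy corpus are all hypothetical.

```python
import hashlib


def _key(text):
    # Hypothetical normalization: lowercase and collapse whitespace, then hash.
    norm = " ".join(text.lower().split())
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()


def local_dedup(sources):
    """Drop duplicates within each source independently."""
    result = {}
    for name, docs in sources.items():
        seen = set()  # fresh set per source, so cross-source copies survive
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


def global_dedup(sources):
    """Drop duplicates across all sources using one shared set."""
    seen = set()  # shared across sources; the first source seen keeps the copy
    result = {}
    for name, docs in sources.items():
        kept = []
        for doc in docs:
            k = _key(doc)
            if k not in seen:
                seen.add(k)
                kept.append(doc)
        result[name] = kept
    return result


# Toy corpus: "Hello world" appears twice in "web" and once in "wiki".
toy = {
    "web": ["Hello world", "hello   world", "A web page"],
    "wiki": ["Hello world", "An encyclopedia entry"],
}
local = local_dedup(toy)
global_ = global_dedup(toy)
print(sum(len(v) for v in local.values()), "documents after local dedup")
print(sum(len(v) for v in global_.values()), "documents after global dedup")
```

In this toy setting, local deduplication leaves the copy of "Hello world" in both sources, while global deduplication keeps it only in the first source processed. That also means global deduplication changes the relative proportions of the sources, which is why the two questions studied here are connected.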
