CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Abstract

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources covering math, wiki, arXiv, and code. Although these data are mostly sourced from well-processed datasets, quality standards vary across domains and change over time, and applying them by hand requires extensive expert experience and labor. We therefore propose a novel, primarily model-based pipeline that verifies data quality through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We also extract 4.5 billion Chain-of-Thought (CoT) templates, released as CCI4.0-M2-CoT. In contrast to distilling CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that large language models (LLMs) pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements on downstream tasks, especially math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, and offer insight into automatically processing pre-training corpora.
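As a rough illustration of the three pipeline stages named in the abstract (two-stage deduplication, multi-classifier quality scoring, domain-aware fluency filtering), the following is a minimal Python sketch. It is not the authors' implementation. The abstract does not specify the stages in detail, so everything here is an assumption made for illustration: exact hashing followed by word-shingle Jaccard similarity for the two deduplication stages, a plain average over classifier scores, a per-domain threshold on a loss-like fluency value, and all function names, parameters, and thresholds.

```python
import hashlib
from typing import Callable, Dict, List, Set, Tuple

Doc = Dict[str, str]  # hypothetical record layout: {"text": ..., "domain": ...}


def shingles(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Word n-grams used for near-duplicate comparison."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set, b: Set) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(docs: List[Doc], near_dup_threshold: float = 0.8) -> List[Doc]:
    # Stage 1 (assumed): exact deduplication on whitespace/case-normalized text.
    seen_hashes = set()
    stage1 = []
    for doc in docs:
        normalized = " ".join(doc["text"].lower().split())
        digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
        if digest not in seen_hashes:
            seen_hashes.add(digest)
            stage1.append(doc)

    # Stage 2 (assumed): near-duplicate removal by Jaccard similarity.
    # This pairwise loop is quadratic; a corpus-scale pipeline would need
    # an approximate method such as MinHash with LSH instead.
    kept, kept_shingles = [], []
    for doc in stage1:
        s = shingles(doc["text"])
        if all(jaccard(s, other) < near_dup_threshold for other in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def quality_score(text: str, classifiers: List[Callable[[str], float]]) -> float:
    """Average the scores of several quality classifiers (assumed aggregation)."""
    if not classifiers:
        raise ValueError("at least one classifier is required")
    return sum(clf(text) for clf in classifiers) / len(classifiers)


def filter_corpus(
    docs: List[Doc],
    classifiers: List[Callable[[str], float]],
    fluency_loss: Callable[[str], float],
    min_quality: float,
    max_loss_by_domain: Dict[str, float],
) -> List[Doc]:
    """Deduplicate, then keep documents that pass quality and fluency checks."""
    result = []
    for doc in deduplicate(docs):
        if quality_score(doc["text"], classifiers) < min_quality:
            continue
        # Domain-aware: each domain has its own fluency threshold;
        # domains without a threshold are not fluency-filtered.
        limit = max_loss_by_domain.get(doc["domain"], float("inf"))
        if fluency_loss(doc["text"]) > limit:
            continue
        result.append(doc)
    return result
```

In practice, `classifiers` and `fluency_loss` would wrap trained models (for example, a language-model loss for fluency); here they are left as plain callables so that the control flow of the stages stays visible.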
