

CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

June 9, 2025
作者: Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin
cs.AI

Abstract

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse, human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a carefully curated 5.2 TB Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources spanning math, wiki, arXiv, and code. Although these data are mostly drawn from well-processed datasets, quality standards vary across domains and evolve over time, so maintaining them requires extensive expert experience and manual labor. We therefore propose a novel, largely model-based pipeline for verifying data quality, built on two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion CoT (Chain-of-Thought) templates, collectively named CCI4.0-M2-CoT. Unlike distilling CoT from larger models, our staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the likelihood of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements on downstream tasks, especially math and code-reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, and shed light on how to automatically process pretraining corpora.
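To make the filtering stages named in the abstract concrete, the minimal sketch below shows how a two-stage deduplication pass, multi-classifier quality scoring, and a domain-aware fluency threshold could be composed. The scorer callables (`quality_scorers`, `fluency_scorer`), the threshold values, and the MinHash-style signature are hypothetical illustrations for this summary, not the paper's actual components.

```python
import hashlib
import re

def exact_dedup_key(text: str) -> str:
    """Stage-1 deduplication: exact matching via a normalized content hash."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def minhash_signature(text: str, num_perm: int = 64) -> list[int]:
    """Stage-2 deduplication: a simple MinHash-style signature over 3-word
    shingles for near-duplicate detection (illustrative only)."""
    words = text.split()
    shingles = {" ".join(words[i:i + 3]) for i in range(max(len(words) - 2, 1))}
    return [
        min((hash((seed, s)) & 0xFFFFFFFF for s in shingles), default=0)
        for seed in range(num_perm)
    ]

def keep_document(doc: str, domain: str,
                  quality_scorers, fluency_scorer,
                  fluency_thresholds: dict[str, float]) -> bool:
    """Combine multi-classifier quality scoring with a domain-aware fluency
    threshold; scorers are assumed to be callables returning scores in [0, 1]."""
    quality = sum(scorer(doc) for scorer in quality_scorers) / len(quality_scorers)
    fluent = fluency_scorer(doc) >= fluency_thresholds.get(domain, 0.5)
    return quality >= 0.6 and fluent
```

In this reading, documents sharing an exact hash or a highly similar signature are dropped first, and only the survivors are scored; the per-domain fluency threshold reflects the abstract's point that quality standards differ across domains such as web text, math, and code.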