CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models
June 9, 2025
Authors: Guang Liu, Liangdong Wang, Jijie Li, Yang Yu, Yao Xu, Jiabei Chen, Yu Bai, Feng Liao, Yonghua Lin
cs.AI
Abstract
We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a carefully curated 5.2 TB Chinese web corpus, a 22.5 TB English subset of Nemotron-CC, and diverse sources spanning math, wiki, arXiv, and code. Although most of these data come from well-processed datasets, quality standards vary across domains and evolve over time, demanding extensive expert experience and manual labor to uphold. We therefore propose a novel, primarily model-based pipeline for validating data quality, consisting of two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We extract 4.5 billion Chain-of-Thought (CoT) templates, collectively named CCI4.0-M2-CoT. Unlike distilling CoT from larger models, our staged CoT extraction captures diverse reasoning patterns and significantly reduces the likelihood of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements on downstream tasks, especially math and code-reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, shedding some light on the automatic processing of pre-training corpora.
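The abstract only names the three pipeline stages, so the following is a minimal, hypothetical Python sketch of that flow, not the authors' implementation: exact-hash plus shingle-overlap deduplication (standing in for MinHash/LSH at scale), averaged classifier scores standing in for the multi-classifier quality scoring, and per-domain perplexity cutoffs for domain-aware fluency filtering. All function names, thresholds, and stand-in scores are assumptions.

```python
"""Illustrative sketch only: mimics the three stages named in the abstract
(two-stage deduplication, multi-classifier quality scoring, domain-aware
fluency filtering) with stand-in scoring functions and hypothetical thresholds."""
import hashlib
from typing import Iterable


def exact_dedup(docs: Iterable[dict]) -> list[dict]:
    """Stage 1: drop byte-identical documents via content hashing."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha256(doc["text"].encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept


def shingles(text: str, n: int = 5) -> set[str]:
    tokens = text.split()
    return {" ".join(tokens[i:i + n]) for i in range(max(len(tokens) - n + 1, 1))}


def near_dedup(docs: list[dict], threshold: float = 0.8) -> list[dict]:
    """Stage 2: greedy near-duplicate removal via shingle Jaccard overlap
    (a small-scale stand-in for MinHash/LSH)."""
    kept, kept_shingles = [], []
    for doc in docs:
        s = shingles(doc["text"])
        if all(len(s & t) / max(len(s | t), 1) < threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def quality_score(doc: dict) -> float:
    """Placeholder for averaging several learned quality classifiers."""
    classifier_scores = [0.9, 0.8, 0.85]  # stand-ins for real model outputs
    return sum(classifier_scores) / len(classifier_scores)


def fluency_ok(doc: dict, domain_thresholds: dict[str, float]) -> bool:
    """Domain-aware fluency gate: each domain gets its own perplexity cutoff."""
    perplexity = 25.0  # stand-in for a language-model perplexity score
    return perplexity <= domain_thresholds.get(doc["domain"], 50.0)


def build_corpus(docs: Iterable[dict]) -> list[dict]:
    thresholds = {"web": 60.0, "code": 120.0, "math": 80.0}  # hypothetical cutoffs
    stage2 = near_dedup(exact_dedup(docs))
    return [d for d in stage2 if quality_score(d) >= 0.7 and fluency_ok(d, thresholds)]


if __name__ == "__main__":
    corpus = [
        {"text": "The quick brown fox jumps over the lazy dog", "domain": "web"},
        {"text": "The quick brown fox jumps over the lazy dog", "domain": "web"},
        {"text": "def add(a, b): return a + b", "domain": "code"},
    ]
    print(len(build_corpus(corpus)))  # exact dedup removes one duplicate -> 2
```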
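Similarly, a minimal sketch of what a staged CoT-template extraction could look like, under the assumption (not stated in the abstract) that candidate documents with explicit step markers are mined from the corpus, segmented into steps, and kept only when their stated answer verifies, which is one way to reduce hallucination risk. The regexes, stage names, and the toy arithmetic check are hypothetical stand-ins for whatever verification the paper actually uses.

```python
"""Illustrative sketch only: a hypothetical three-stage CoT template extraction."""
import re

STEP_PATTERN = re.compile(r"(step\s*\d+[:.)]|first,|then,|therefore)", re.IGNORECASE)
ANSWER_PATTERN = re.compile(r"answer\s*[:=]\s*(-?\d+(?:\.\d+)?)", re.IGNORECASE)


def stage1_candidates(docs):
    """Stage 1: keep documents that already contain explicit reasoning steps."""
    return [d for d in docs if len(STEP_PATTERN.findall(d["text"])) >= 2]


def stage2_segment(doc):
    """Stage 2: split a candidate into an ordered list of reasoning steps."""
    parts = [p.strip() for p in re.split(r"step\s*\d+[:.)]", doc["text"], flags=re.IGNORECASE)]
    return [p for p in parts if p]


def stage3_verify(doc):
    """Stage 3: keep only templates whose stated answer checks out
    (a toy arithmetic check standing in for a real verifier)."""
    match = ANSWER_PATTERN.search(doc["text"])
    return match is not None and float(match.group(1)) == doc.get("expected")


def extract_cot_templates(docs):
    templates = []
    for doc in stage1_candidates(docs):
        if stage3_verify(doc):
            templates.append({"steps": stage2_segment(doc), "source": doc.get("id")})
    return templates


if __name__ == "__main__":
    sample = {
        "id": "doc-1",
        "expected": 12.0,
        "text": "Step 1: 3 * 4 is the product. Step 2: compute it. Answer: 12",
    }
    print(extract_cot_templates([sample]))
```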