CCI4.0: A Bilingual Pretraining Dataset for Enhancing Reasoning in Large Language Models

Abstract

We introduce CCI4.0, a large-scale bilingual pre-training dataset engineered for superior data quality and diverse, human-like reasoning trajectories. CCI4.0 occupies roughly 35 TB of disk space and comprises two sub-datasets: CCI4.0-M2-Base and CCI4.0-M2-CoT. CCI4.0-M2-Base combines a 5.2 TB carefully curated Chinese web corpus, a 22.5 TB English subset from Nemotron-CC, and diverse sources covering math, wiki, arXiv, and code. Although these data are mostly drawn from well-processed datasets, quality standards vary across domains and shift over time, so applying them by hand requires extensive expert experience and labor. We therefore propose a novel, largely model-based pipeline that assesses data quality through two-stage deduplication, multi-classifier quality scoring, and domain-aware fluency filtering. We also extract 4.5 billion Chain-of-Thought (CoT) templates, named CCI4.0-M2-CoT. Unlike distilling CoT from larger models, our proposed staged CoT extraction exemplifies diverse reasoning patterns and significantly decreases the possibility of hallucination. Empirical evaluations demonstrate that LLMs pre-trained on CCI4.0 benefit from cleaner, more reliable training signals, yielding consistent improvements in downstream tasks, especially in math and code reflection tasks. Our results underscore the critical role of rigorous data curation and human thinking templates in advancing LLM performance, and shed some light on automatically processing pre-training corpora.
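The abstract names the pipeline stages without describing how they work. As a rough illustration only, and not the authors' implementation, the sketch below chains the stages in Python: exact deduplication by hash, a near-duplicate pass using Jaccard similarity over word shingles, an averaged score from several classifiers, and a per-domain fluency threshold. Every function name, threshold, and scorer interface here is a hypothetical assumption; the abstract does not specify the models, similarity measure, or cutoffs actually used.

```python
import hashlib
from typing import Callable, Dict, List, Sequence, Set


def shingles(text: str, n: int = 3) -> Set[str]:
    """Return the set of word n-grams in text (the whole text if shorter than n words)."""
    words = text.lower().split()
    if len(words) < n:
        return {" ".join(words)}
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def two_stage_dedup(docs: List[str], near_threshold: float = 0.8) -> List[str]:
    """Stage 1 drops exact duplicates; stage 2 drops near-duplicates (assumed design)."""
    # Stage 1: exact match on a hash of the whitespace-normalized text.
    seen_hashes = set()
    unique = []
    for doc in docs:
        key = hashlib.sha256(" ".join(doc.split()).encode("utf-8")).hexdigest()
        if key not in seen_hashes:
            seen_hashes.add(key)
            unique.append(doc)

    # Stage 2: pairwise shingle comparison against documents already kept.
    kept: List[str] = []
    kept_shingles: List[Set[str]] = []
    for doc in unique:
        s = shingles(doc)
        if all(jaccard(s, t) < near_threshold for t in kept_shingles):
            kept.append(doc)
            kept_shingles.append(s)
    return kept


def quality_score(doc: str, classifiers: Sequence[Callable[[str], float]]) -> float:
    """Average the scores of several classifiers (hypothetical aggregation rule)."""
    return sum(c(doc) for c in classifiers) / len(classifiers)


def filter_corpus(
    docs: List[str],
    classifiers: Sequence[Callable[[str], float]],
    fluency: Callable[[str], float],
    domain_of: Callable[[str], str],
    fluency_min: Dict[str, float],
    quality_min: float = 0.5,
) -> List[str]:
    """Deduplicate, then keep documents passing the quality and per-domain fluency cutoffs."""
    kept = []
    for doc in two_stage_dedup(docs):
        if quality_score(doc, classifiers) < quality_min:
            continue
        if fluency(doc) < fluency_min[domain_of(doc)]:
            continue
        kept.append(doc)
    return kept
```

The pairwise comparison in stage 2 is quadratic in the number of documents and is only workable for a toy corpus; at the multi-terabyte scale described above, an approximate method such as MinHash with locality-sensitive hashing would be the usual substitute. The per-domain `fluency_min` dictionary is one simple way to express that, for example, code and web text need different fluency cutoffs.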

