Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

February 8, 2026
Authors: Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu
cs.AI

Abstract

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to explicate reasoning and terminology. To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a +1.36-point total gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
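
The abstract names L4 (Generative Refinement) and L5 (Cognitive Completion) but does not spell out their mechanics. A minimal Python sketch of how such stages could be chained over raw documents is shown below; `call_frontier_llm`, both prompt templates, and the two-pass structure are illustrative assumptions, not the released Darwin-Science pipeline.

```python
# Hypothetical sketch of L4 (Generative Refinement) and L5 (Cognitive
# Completion) as two prompted rewriting passes over raw scientific text.
# `call_frontier_llm` is a placeholder for any chat-completion client;
# the prompt wording and stage split are illustrative assumptions only.

L4_PROMPT = (
    "Rewrite the following scientific passage so that implicit reasoning "
    "steps are stated explicitly, preserving all technical claims:\n\n{text}"
)

L5_PROMPT = (
    "Augment the following passage with brief definitions of specialist "
    "terminology so a strong generalist reader can follow it:\n\n{text}"
)


def call_frontier_llm(prompt: str) -> str:
    """Placeholder for a frontier-LLM completion call.

    Replace with a real client (any chat-completion endpoint); this stub
    only marks where the model is invoked in the sketch.
    """
    raise NotImplementedError("wire up an LLM client here")


def refine_document(raw_text: str) -> str:
    """Apply L4 then L5 to one raw document, yielding training-ready text."""
    l4_text = call_frontier_llm(L4_PROMPT.format(text=raw_text))  # explicate reasoning
    l5_text = call_frontier_llm(L5_PROMPT.format(text=l4_text))   # explain terminology
    return l5_text
```

At the corpus scale the paper reports (900B tokens), the practical challenge of such a design would be batching and cost control around the two LLM passes; the sketch above shows only the per-document logic.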