Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training
February 8, 2026
Authors: Yiwei Qin, Zhen Huang, Tiantian Mi, Weiye Si, Chenyang Zhou, Qipeng Guo, Siyuan Feng, Pengfei Liu
cs.AI
Abstract
Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this framework on scientific literature by constructing Darwin-Science, a 900B-token corpus spanning levels L0-L5. We identify a learnability gap in raw scientific text and bridge it via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to make implicit reasoning and specialist terminology explicit.
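To make the taxonomy and the L4/L5 refinement step concrete, here is a minimal Python sketch. Only the level names L4 (Generative Refinement) and L5 (Cognitive Completion) come from the abstract; the other level names, the prompt wording, and the `llm` callable are illustrative assumptions, not the paper's actual pipeline.

```python
from enum import IntEnum
from typing import Callable

class DataLevel(IntEnum):
    # Only L4 and L5 are named in the abstract; L0 is shown as the
    # raw-text starting point and the remaining levels are omitted.
    L0_RAW = 0
    L4_GENERATIVE_REFINEMENT = 4
    L5_COGNITIVE_COMPLETION = 5

# Illustrative prompts -- the authors' actual instructions are not given
# in the abstract, so these only approximate the stated intent.
REFINE_PROMPTS = {
    DataLevel.L4_GENERATIVE_REFINEMENT: (
        "Rewrite the following scientific passage so that it is clear, "
        "self-contained, and well-structured, preserving all facts:"
    ),
    DataLevel.L5_COGNITIVE_COMPLETION: (
        "Explicate the implicit reasoning steps and define the specialist "
        "terminology in the following scientific passage:"
    ),
}

def refine(document: str, level: DataLevel,
           llm: Callable[[str], str]) -> str:
    """Apply one refinement pass with a frontier LLM.

    `llm` is any prompt -> completion callable (e.g., a thin wrapper
    around a hosted model API); it is a placeholder, not the authors'
    implementation.
    """
    return llm(f"{REFINE_PROMPTS[level]}\n\n{document}")
```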
To ensure rigorous attribution, we pre-train daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, the Darwin-Science models outperform these baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, with the margin rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a cumulative +1.36-point gain, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and the daVinci-origin models to enable principled, co-evolutionary development.
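For readers who want to see how aggregate figures such as +2.12/+2.95 are typically obtained, the sketch below macro-averages per-benchmark score deltas. The aggregation rule, benchmark names, and toy scores are assumptions for illustration only; the paper's exact benchmark list and weighting are not given in the abstract.

```python
def average_gain(model_scores: dict[str, float],
                 baseline_scores: dict[str, float]) -> float:
    """Macro-average score delta over a shared set of benchmarks."""
    deltas = [model_scores[b] - baseline_scores[b] for b in baseline_scores]
    return sum(deltas) / len(deltas)

# Toy illustration with made-up scores (not the paper's results):
baseline = {"MMLU-science": 52.0, "GPQA": 28.0, "SciQ": 90.0}
darwin   = {"MMLU-science": 55.1, "GPQA": 31.2, "SciQ": 92.3}
print(f"avg gain: {average_gain(darwin, baseline):+.2f}")  # avg gain: +2.87
```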