Data Darwinism Part I: Unlocking the Value of Scientific Data for Pre-training

Abstract

Data quality determines foundation model performance, yet systematic processing frameworks are lacking. We introduce Data Darwinism, a ten-level taxonomy (L0-L9) that conceptualizes data-model co-evolution: advanced models produce superior data for next-generation systems. We validate this on scientific literature by constructing Darwin-Science, a 900B-token corpus (L0-L5). We identify a learnability gap in raw scientific text, which we bridge via L4 (Generative Refinement) and L5 (Cognitive Completion), using frontier LLMs to make reasoning and terminology explicit. To ensure rigorous attribution, we pre-trained daVinci-origin-3B/7B models from scratch, excluding scientific content to create contamination-free baselines. After 600B tokens of continued pre-training, Darwin-Science outperforms the baselines by +2.12 (3B) and +2.95 (7B) points across 20+ benchmarks, rising to +5.60 and +8.40 points on domain-aligned tasks. Systematic progression to L5 yields a total gain of +1.36 points, confirming that higher-level processing unlocks latent data value. We release the Darwin-Science corpus and daVinci-origin models to enable principled, co-evolutionary development.
