

MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

September 29, 2025
Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev
cs.AI

Abstract

We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while providing strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy that combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), alongside targeted instruction, reasoning and synthetic data with documented provenance. We detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform other permissive datasets across a suite of standard benchmarks, and at the 1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical and legally mitigated foundation for training capable LLMs, reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae
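The abstract describes the curation pipeline only at a high level (license-aware filtering, safety and quality screening, domain-aware mixing); the released dataset and curation recipes at the linked repository are the authoritative reference. Purely as an illustrative sketch, the Python below shows what a permissive-first filter combined with domain-weighted sampling could look like. All license tags, source labels, domain names, thresholds, and mixing weights here are hypothetical placeholders, not values taken from the paper.

```python
# Illustrative sketch only: a minimal permissive-first license filter,
# quality screen, and domain-aware sampler. Tags, thresholds, and weights
# are hypothetical examples, not the MixtureVitae curation recipe.
import random
from dataclasses import dataclass

PERMISSIVE_LICENSES = {"public-domain", "cc-by", "apache-2.0", "mit"}
LOW_RISK_SOURCES = {"government-work", "eu-tdm-eligible"}

@dataclass
class Document:
    text: str
    license: str    # normalized license tag
    source: str     # provenance tag
    domain: str     # e.g. "web", "code", "math", "instruction"
    quality: float  # score from an upstream quality classifier

def license_ok(doc: Document) -> bool:
    """Keep permissively licensed text or justified low-risk additions."""
    return doc.license in PERMISSIVE_LICENSES or doc.source in LOW_RISK_SOURCES

def quality_ok(doc: Document, threshold: float = 0.5) -> bool:
    """Drop documents below a quality-classifier threshold."""
    return doc.quality >= threshold

def mix_by_domain(docs, weights, n_samples, seed=0):
    """Sample documents according to per-domain mixing weights."""
    rng = random.Random(seed)
    by_domain = {}
    for d in docs:
        by_domain.setdefault(d.domain, []).append(d)
    domains = [d for d in weights if by_domain.get(d)]
    probs = [weights[d] for d in domains]
    total = sum(probs)
    probs = [p / total for p in probs]
    return [rng.choice(by_domain[rng.choices(domains, weights=probs)[0]])
            for _ in range(n_samples)]

if __name__ == "__main__":
    # Toy corpus; in practice documents would be streamed from the raw sources.
    raw_docs = [
        Document("A public-domain statute...", "public-domain", "government-work", "web", 0.9),
        Document("def add(a, b): return a + b", "apache-2.0", "code-host", "code", 0.8),
        Document("Proof that sqrt(2) is irrational...", "cc-by", "open-textbook", "math", 0.95),
        Document("All rights reserved blog post...", "unknown", "web-crawl", "web", 0.7),  # filtered out
    ]
    corpus = [d for d in raw_docs if license_ok(d) and quality_ok(d)]
    mixture = mix_by_domain(corpus, {"web": 0.5, "code": 0.25, "math": 0.25}, n_samples=10)
    print(f"kept {len(corpus)} of {len(raw_docs)} docs; sampled {len(mixture)} for the mixture")
```

The point of the sketch is only the gating order suggested by the abstract: license and provenance checks come before quality screening, and domain-aware mixing operates on the surviving pool. A real pipeline would also include a separate safety-screening stage and run over streamed shards rather than in-memory lists.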