MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources
September 29, 2025
Authors: Huu Nguyen, Victor May, Harsh Raj, Marianna Nezhurina, Yishan Wang, Yanqi Luo, Minh Chien Vu, Taishi Nakamura, Ken Tsui, Van Khue Nguyen, David Salinas, Aleksandra Krasnodębska, Christoph Schuhmann, Mats Leon Richter, Xuan-Son Vu, Jenia Jitsev
cs.AI
Abstract
We present MixtureVitae, an open-access pretraining corpus built to minimize
legal risk while providing strong model performance. MixtureVitae follows a
risk-mitigated sourcing strategy that combines public-domain and permissively
licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions
(e.g., government works and EU TDM-eligible sources), alongside targeted
instruction, reasoning and synthetic data with documented provenance. We detail
a transparent, multi-stage pipeline for license-aware filtering, safety and
quality screening, and domain-aware mixing, and we release the dataset and
curation recipes to support reproducible research. In controlled experiments
using the open-sci-ref training protocol (fixed architectures at
130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens),
models trained on MixtureVitae consistently outperform models trained on other
permissively licensed datasets across a suite of standard benchmarks, and at the
1.7B/300B setting they surpass FineWeb-Edu and approach DCLM in the later stages
of training.
Performance is particularly strong on math/code and competitive on QA tasks.
These results demonstrate that permissive-first, risk-mitigated data provides a
practical foundation with reduced legal risk for training capable LLMs, lessening
reliance on indiscriminate web scraping without sacrificing competitiveness.
Code: https://github.com/ontocord/mixturevitae
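
The abstract only names the pipeline stages (license-aware filtering, safety and quality screening, domain-aware mixing); the sketch below is a minimal illustration of what the first and last of these could look like, not the authors' released pipeline. The license sets, provenance tags, domain labels, and mixing weights are hypothetical placeholders; consult the repository above for the actual curation recipes.

```python
# Minimal sketch of license-aware filtering and domain-aware mixing.
# All constants, tags, and weights below are hypothetical, not from MixtureVitae.
import random
from dataclasses import dataclass

PERMISSIVE_LICENSES = {"public-domain", "cc-by", "cc-by-sa", "apache-2.0", "mit"}  # assumed set
LOW_RISK_SOURCES = {"government-works", "eu-tdm-eligible"}                          # assumed set


@dataclass
class Doc:
    text: str
    license: str   # normalized license tag attached upstream
    source: str    # provenance tag, e.g. "government-works"
    domain: str    # e.g. "web", "code", "math"


def license_filter(doc: Doc) -> bool:
    """Keep a document only if its license is permissive or its source is low-risk."""
    return doc.license in PERMISSIVE_LICENSES or doc.source in LOW_RISK_SOURCES


def mix_domains(docs: list[Doc], weights: dict[str, float], n: int, seed: int = 0) -> list[Doc]:
    """Sample n documents, choosing each document's domain with probability
    proportional to the given (hypothetical) mixing weights."""
    rng = random.Random(seed)
    by_domain: dict[str, list[Doc]] = {}
    for doc in docs:
        by_domain.setdefault(doc.domain, []).append(doc)
    domains = [dom for dom in weights if by_domain.get(dom)]
    probs = [weights[dom] for dom in domains]
    return [rng.choice(by_domain[rng.choices(domains, probs)[0]]) for _ in range(n)]


if __name__ == "__main__":
    corpus = [
        Doc("def f(x): return x", "apache-2.0", "github", "code"),
        Doc("The statute provides...", "public-domain", "government-works", "web"),
        Doc("All rights reserved text", "proprietary", "news-crawl", "web"),
    ]
    kept = [d for d in corpus if license_filter(d)]          # drops the proprietary doc
    sample = mix_domains(kept, {"web": 0.6, "code": 0.4}, n=4)
    print(len(kept), len(sample))
```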