MixtureVitae: Open Web-Scale Pretraining Dataset With High Quality Instruction and Reasoning Data Built from Permissive-First Text Sources

Abstract

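We present MixtureVitae, an open-access pretraining corpus built to minimize legal risk while delivering strong model performance. MixtureVitae follows a risk-mitigated sourcing strategy: it combines public-domain and permissively licensed text (e.g., CC-BY/Apache) with carefully justified low-risk additions (e.g., government works and EU TDM-eligible sources), and integrates targeted instruction, reasoning, and synthetic data with documented provenance. We describe in detail a transparent, multi-stage pipeline for license-aware filtering, safety and quality screening, and domain-aware mixing, and we release the dataset and curation recipes to support reproducible research. In controlled experiments using the open-sci-ref training protocol (fixed architectures at 130M/400M/1.3B/1.7B parameters; training budgets of 50B and 300B tokens), models trained on MixtureVitae consistently outperform models trained on other permissive datasets across a suite of standard benchmarks. At the 1.7B/300B setting, they surpass FineWeb-Edu and approach DCLM in the later stages of training. Performance is particularly strong on math/code tasks and competitive on QA tasks. These results demonstrate that permissive-first, risk-mitigated data provides a practical, lower-legal-risk foundation for training capable large language models (LLMs), reducing reliance on indiscriminate web scraping without sacrificing competitiveness. Code: https://github.com/ontocord/mixturevitae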

