MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning
July 22, 2025
Authors: Run-Ze Fan, Zengzhi Wang, Pengfei Liu
cs.AI
Abstract
Scientific reasoning is critical for developing AI scientists and supporting
human researchers in advancing the frontiers of natural science discovery.
However, the open-source community has primarily focused on mathematics and
coding while neglecting the scientific domain, largely due to the absence of
open, large-scale, high-quality, verifiable scientific reasoning datasets. To
bridge this gap, we first present TextbookReasoning, an open dataset featuring
truthful reference answers extracted from 12k university-level scientific
textbooks, comprising 650k reasoning questions spanning 7 scientific
disciplines. We further introduce MegaScience, a large-scale mixture of
high-quality open-source datasets totaling 1.25 million instances, developed
through systematic ablation studies that evaluate various data selection
methodologies to identify the optimal subset for each publicly available
scientific dataset. Meanwhile, we build a comprehensive evaluation system
covering diverse subjects and question types across 15 benchmarks,
incorporating robust answer extraction strategies to ensure accurate
evaluation metrics. Our experiments demonstrate that models trained on our
datasets achieve superior performance and training efficiency, with more
concise responses, than those trained on existing open-source scientific
datasets. Furthermore, we train
Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which
significantly outperform the corresponding official instruct models in average
performance. In addition, MegaScience exhibits greater effectiveness for larger
and stronger models, suggesting a scaling benefit for scientific tuning. We
release our data curation pipeline, evaluation system, datasets, and seven
trained models to the community to advance scientific reasoning research.
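
As a concrete illustration of the answer extraction strategies the abstract refers to, below is a minimal rule-based extractor sketch. It is an assumption about how such a component could work, not the released MegaScience evaluation code; the function name `extract_answer` and all regex patterns here are hypothetical.

```python
import re

# Illustrative sketch only: the actual MegaScience evaluation system is
# released by the authors; these patterns are assumed, not taken from it.
BOXED = re.compile(r"\\boxed\{([^{}]*)\}")          # LaTeX \boxed{...} answers
FINAL = re.compile(r"(?:final answer|answer)\s*(?:is|:)\s*(.+)", re.IGNORECASE)
CHOICE = re.compile(r"\b([A-D])\b")                 # multiple-choice letters

def extract_answer(response: str, question_type: str) -> str | None:
    """Try several extraction strategies in order of reliability."""
    m = BOXED.search(response)
    if m:
        return m.group(1).strip()
    m = FINAL.search(response)
    if m:
        candidate = m.group(1).strip().rstrip(".")
        if question_type == "multiple_choice":
            c = CHOICE.search(candidate)
            return c.group(1) if c else candidate
        return candidate
    if question_type == "multiple_choice":
        # Fall back to the last standalone option letter in the response.
        letters = CHOICE.findall(response)
        return letters[-1] if letters else None
    return None

if __name__ == "__main__":
    print(extract_answer(r"... so the result is \boxed{42}.", "open_ended"))  # 42
    print(extract_answer("The final answer is B.", "multiple_choice"))        # B
```

Layering patterns from most to least reliable in this way is one plausible reading of "comprehensive answer extraction": a single regex would miss many valid response formats, which is exactly the metric inaccuracy the abstract says the evaluation system is designed to avoid.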