

MegaScience: Pushing the Frontiers of Post-Training Datasets for Science Reasoning

July 22, 2025
Authors: Run-Ze Fan, Zengzhi Wang, Pengfei Liu
cs.AI

Abstract

Scientific reasoning is critical for developing AI scientists and supporting human researchers in advancing the frontiers of natural science discovery. However, the open-source community has primarily focused on mathematics and coding while neglecting the scientific domain, largely due to the absence of open, large-scale, high-quality, verifiable scientific reasoning datasets. To bridge this gap, we first present TextbookReasoning, an open dataset featuring truthful reference answers extracted from 12k university-level scientific textbooks, comprising 650k reasoning questions spanning 7 scientific disciplines. We further introduce MegaScience, a large-scale mixture of high-quality open-source datasets totaling 1.25 million instances, developed through systematic ablation studies that evaluate various data selection methodologies to identify the optimal subset for each publicly available scientific dataset. Meanwhile, we build a comprehensive evaluation system covering diverse subjects and question types across 15 benchmarks, incorporating comprehensive answer extraction strategies to ensure accurate evaluation metrics. Our experiments demonstrate that our datasets achieve superior performance and training efficiency with more concise response lengths compared to existing open-source scientific datasets. Furthermore, we train Llama3.1, Qwen2.5, and Qwen3 series base models on MegaScience, which significantly outperform the corresponding official instruct models in average performance. In addition, MegaScience exhibits greater effectiveness for larger and stronger models, suggesting a scaling benefit for scientific tuning. We release our data curation pipeline, evaluation system, datasets, and seven trained models to the community to advance scientific reasoning research.
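The abstract notes that the evaluation system incorporates "comprehensive answer extraction strategies to ensure accurate evaluation metrics." As a rough illustration of what such a strategy can look like for multiple-choice benchmarks, here is a minimal rule-based sketch; the function name, regex patterns, and fallback logic are assumptions for illustration, not the paper's actual implementation.

```python
import re

# Hypothetical sketch of a rule-based answer extractor for multiple-choice
# benchmarks; the actual MegaScience extraction strategies are not shown here.
def extract_choice(response: str) -> str | None:
    """Return the predicted option letter (A-D) from a model response, or None."""
    # 1. Prefer an explicit final-answer statement, e.g. "The answer is (B)".
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-D])\)?", response, re.IGNORECASE)
    if m:
        return m.group(1).upper()
    # 2. Otherwise fall back to the last standalone option letter mentioned.
    letters = re.findall(r"\b([A-D])\b", response)
    return letters[-1] if letters else None

# Usage example:
print(extract_choice("Let's reason step by step... so the answer is (C)."))  # -> C
```

In practice, evaluation harnesses typically combine several such extraction rules (boxed answers, final-line heuristics, option-letter matching) so that a correct answer phrased in an unexpected format is not scored as wrong.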