Improving Data and Reward Design for Scientific Reasoning in Large Language Models
February 9, 2026
Authors: Zijie Chen, Zhenghao Lin, Xiao Liu, Zhenzhong Lan, Yeyun Gong, Peng Cheng
cs.AI
Abstract
Solving open-ended science questions remains challenging for large language models, largely because supervision and evaluation for such questions are inherently unreliable. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation of open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
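To make the idea of rubric-based evaluation with explicit answer correctness more concrete, the following is a minimal Python sketch of how such a reward might be computed. The abstract does not give the actual reward formula, so everything here is an assumption for illustration: the `RubricItem` type, the `rubric_reward` function, the linear mix of a correctness term with a weighted rubric score, and the default `correctness_weight` of 0.5 are all hypothetical. The per-item judgments in `satisfied` are assumed to come from some external grader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One fine-grained criterion for grading an open-ended answer (hypothetical type)."""
    description: str
    weight: float


def rubric_reward(
    answer_correct: bool,
    satisfied: list[bool],
    rubric: list[RubricItem],
    correctness_weight: float = 0.5,  # assumed mixing coefficient, not from the paper
) -> float:
    """Combine explicit answer correctness with a weighted rubric score.

    Returns a value in [0, 1]. The linear combination is an illustrative
    assumption, not the method described in the paper.
    """
    if len(satisfied) != len(rubric):
        raise ValueError("need exactly one judgment per rubric item")
    if not 0.0 <= correctness_weight <= 1.0:
        raise ValueError("correctness_weight must lie in [0, 1]")
    total_weight = sum(item.weight for item in rubric)
    if total_weight <= 0:
        raise ValueError("rubric weights must sum to a positive value")

    # Weighted fraction of rubric items the answer satisfies.
    earned = sum(item.weight for item, ok in zip(rubric, satisfied) if ok)
    rubric_score = earned / total_weight

    # Explicit correctness term plus the rubric term.
    return correctness_weight * float(answer_correct) + (1.0 - correctness_weight) * rubric_score


rubric = [
    RubricItem("states the governing principle", 2.0),
    RubricItem("justifies each intermediate step", 1.0),
    RubricItem("reports units consistently", 1.0),
]
reward_if_correct = rubric_reward(True, [True, False, True], rubric)
reward_if_wrong = rubric_reward(False, [True, False, True], rubric)
```

In this sketch an answer that satisfies most rubric items but reaches the wrong final answer still receives a lower reward than one that is also correct, which is one plausible reading of "explicit answer correctness"; a multiplicative gate would be another.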