Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Abstract

Solving open-ended science questions remains challenging for large language models, particularly because supervision and evaluation are inherently unreliable. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
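To make the idea of "rubric-based evaluation with explicit answer correctness" concrete, the following is a minimal sketch of one way such a reward could be computed. It is not the Dr. SCI implementation: the `RubricItem` fields, the `rubric_reward` function, the linear blend, and the default `correctness_weight` of 0.5 are all illustrative assumptions that do not come from the abstract.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RubricItem:
    """One rubric criterion for an open-ended answer (hypothetical structure)."""
    description: str   # what the grader checks for
    weight: float      # relative importance of this criterion (assumed field)
    satisfied: bool    # judge verdict for the response being scored


def rubric_reward(
    items: Sequence[RubricItem],
    answer_correct: bool,
    correctness_weight: float = 0.5,  # assumed value, not from the paper
) -> float:
    """Blend an explicit correctness signal with a weighted rubric score.

    Returns a reward in [0, 1]. The linear blend is an assumption chosen
    for simplicity; the paper may combine the two signals differently.
    """
    if not 0.0 <= correctness_weight <= 1.0:
        raise ValueError("correctness_weight must be in [0, 1]")

    total_weight = sum(item.weight for item in items)
    # Fraction of rubric weight satisfied; 0.0 when no rubric is available.
    if total_weight > 0:
        satisfied_weight = sum(item.weight for item in items if item.satisfied)
        rubric_score = satisfied_weight / total_weight
    else:
        rubric_score = 0.0

    correctness = 1.0 if answer_correct else 0.0
    return correctness_weight * correctness + (1.0 - correctness_weight) * rubric_score
```

Under these assumptions, a response that reaches the correct final answer but satisfies only part of the rubric earns a reward strictly between the fully wrong and fully satisfactory cases, so the policy still receives a graded signal on open-ended questions.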
