Improving Data and Reward Design for Scientific Reasoning in Large Language Models

Abstract

Solving open-ended science questions remains challenging for large language models, largely because supervision and evaluation for such questions are inherently unreliable. The bottleneck lies in data construction and reward design for scientific post-training. We develop a large-scale, systematic data processing pipeline that transforms heterogeneous open-source science data into the Dr. SCI dataset, which comprises 1M questions across eight STEM subjects, with explicit verifiable/open-ended splits, scalable difficulty annotation, and fine-grained rubrics that operationalize evaluation for open-ended answers. Building on this dataset, we propose the Dr. SCI post-training pipeline, which redesigns the standard SFT -> RL workflow through three components: (i) Exploration-Expanding SFT, which broadens the model's reasoning pattern coverage prior to RL; (ii) Dynamic Difficulty Curriculum, which adapts training data to the model's evolving scientific capability; and (iii) SciRubric-Guided RL, which enables stable reinforcement learning on open-ended scientific questions via rubric-based evaluation with explicit answer correctness. Qwen3-4B-Base trained with the Dr. SCI pipeline achieves 63.2 on GPQA-diamond and 32.4 on GPQA-general, consistently improving over strong post-trained baselines such as o1-mini and GPT-4o and demonstrating substantial gains in scientific reasoning, especially in open-ended settings.
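As an illustration only, rubric-based evaluation with an explicit correctness term could be sketched as below. The abstract does not specify the reward formula, so the blending rule, the `correctness_weight` parameter, the `RubricItem` fields, and the function name are all assumptions made for this sketch, not the Dr. SCI implementation.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class RubricItem:
    """One grading criterion. Hypothetical structure, not taken from the paper."""
    description: str
    weight: float


def rubric_reward(answer_correct: bool,
                  satisfied: list[bool],
                  rubric: list[RubricItem],
                  correctness_weight: float = 0.5) -> float:
    """Blend an explicit correctness signal with a weighted rubric score.

    Assumed form: reward = w * correct + (1 - w) * rubric_score, in [0, 1].
    `satisfied[i]` is a judge's verdict on whether rubric[i] is met.
    """
    if len(satisfied) != len(rubric):
        raise ValueError("need exactly one judgment per rubric item")
    total = sum(item.weight for item in rubric)
    if total <= 0:
        raise ValueError("rubric weights must sum to a positive value")
    # Fraction of rubric weight that the response earns.
    earned = sum(item.weight for item, ok in zip(rubric, satisfied) if ok)
    rubric_score = earned / total
    return (correctness_weight * float(answer_correct)
            + (1.0 - correctness_weight) * rubric_score)


# Example with a made-up two-item rubric.
rubric = [
    RubricItem("applies conservation of energy", 2.0),
    RubricItem("reports the result with units", 1.0),
]
reward = rubric_reward(True, [True, False], rubric)
```

Keeping the correctness term separate from the rubric score means a response cannot earn the full reward through well-structured but wrong reasoning, which is one plausible reading of "explicit answer correctness" in the abstract.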
