Training AI Co-Scientists Using Rubric Rewards
December 29, 2025
Authors: Shashwat Goel, Rishi Hazra, Dulhan Jayalath, Timon Willi, Parag Jain, William F. Shen, Ilias Leontiadis, Francesco Barbieri, Yoram Bachrach, Jonas Geiping, Chenxi Whitehouse
cs.AI
Abstract
AI co-scientists are emerging as a tool to assist human researchers in achieving their research goals. A crucial feature of these AI co-scientists is the ability to generate a research plan given a set of aims and constraints. The plan may be used by researchers for brainstorming, or may even be implemented after further refinement. However, language models currently struggle to generate research plans that follow all constraints and implicit requirements. In this work, we study how to leverage the vast corpus of existing research papers to train language models that generate better research plans. We build a scalable, diverse training corpus by automatically extracting research goals and goal-specific grading rubrics from papers across several domains. We then train models for research plan generation via reinforcement learning with self-grading. A frozen copy of the initial policy acts as the grader during training, with the rubrics creating a generator-verifier gap that enables improvements without external human supervision. To validate this approach, we conduct a study with human experts for machine learning research goals, spanning 225 hours. The experts prefer plans generated by our finetuned Qwen3-30B-A3B model over the initial model for 70% of research goals, and approve 84% of the automatically extracted goal-specific grading rubrics. To assess generality, we also extend our approach to research goals from medical papers, and new arXiv preprints, evaluating with a jury of frontier models. Our finetuning yields 12-22% relative improvements and significant cross-domain generalization, proving effective even in problem settings like medical research where execution feedback is infeasible. Together, these findings demonstrate the potential of a scalable, automated training recipe as a step towards improving general AI co-scientists.
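To make the self-grading reward concrete, below is a minimal sketch of how a rubric-based reward could be computed during RL training. It assumes a frozen grader (a copy of the initial policy) exposed as a plain text-in, text-out callable; the names `RubricItem`, `ResearchGoal`, `rubric_reward`, and `grade_fn`, along with the YES/NO grading prompt, are illustrative assumptions and are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class RubricItem:
    """One auto-extracted, goal-specific grading criterion (hypothetical schema)."""
    criterion: str          # e.g. "The plan specifies which datasets are used for evaluation."
    weight: float = 1.0


@dataclass
class ResearchGoal:
    """A research goal paired with its extracted rubric (hypothetical schema)."""
    goal: str
    rubric: List[RubricItem]


def rubric_reward(
    plan: str,
    goal: ResearchGoal,
    grade_fn: Callable[[str], str],
) -> float:
    """Score a generated research plan against the goal's rubric.

    `grade_fn` wraps the frozen copy of the initial policy acting as grader:
    it takes a grading prompt and returns the grader's raw text response.
    The reward is the weighted fraction of rubric criteria the grader judges
    as satisfied, which is then fed to the RL objective as a scalar reward.
    """
    total, earned = 0.0, 0.0
    for item in goal.rubric:
        prompt = (
            f"Research goal:\n{goal.goal}\n\n"
            f"Proposed research plan:\n{plan}\n\n"
            f"Criterion: {item.criterion}\n"
            "Does the plan satisfy this criterion? Answer YES or NO."
        )
        verdict = grade_fn(prompt).strip().upper()
        total += item.weight
        if verdict.startswith("YES"):
            earned += item.weight
    return earned / total if total > 0 else 0.0
```

In this sketch the generator-verifier gap comes from the task asymmetry: the same frozen model that struggles to produce a plan satisfying every criterion can still check one criterion at a time reliably, so per-criterion YES/NO grading yields a usable training signal without external human supervision.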