

Learning Query-Specific Rubrics from Human Preferences for DeepResearch Report Generation

February 3, 2026
Authors: Changze Lv, Jie Zhou, Wentao Zhao, Jingwen Xu, Zisu Huang, Muzhao Tian, Shihan Dou, Tao Gui, Le Tian, Xiao Zhou, Xiaoqing Zheng, Xuanjing Huang, Jie Zhou
cs.AI

Abstract

Training and evaluating DeepResearch-generated reports remains challenging due to the lack of verifiable reward signals. Accordingly, rubric-based evaluation has become a common practice. However, existing approaches either rely on coarse, pre-defined rubrics that lack sufficient granularity, or depend on manually constructed query-specific rubrics that are costly and difficult to scale. In this paper, we propose a pipeline to train human-preference-aligned, query-specific rubric generators tailored for DeepResearch report generation. We first construct a dataset of DeepResearch-style queries annotated with human preferences over paired reports, and train rubric generators via reinforcement learning with a hybrid reward combining human preference supervision and LLM-based rubric evaluation. To better handle long-horizon reasoning, we further introduce a Multi-agent Markov-state (MaMs) workflow for report generation. We empirically show that our proposed rubric generators deliver supervision that is more discriminative and better aligned with human preferences than existing rubric design strategies. Moreover, when integrated into the MaMs training framework, DeepResearch systems equipped with our rubric generators consistently outperform all open-source baselines on DeepResearch Bench and achieve performance comparable to that of leading closed-source models.
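The hybrid reward described above can be sketched as follows. This is a minimal illustrative example, not the authors' implementation: the weighting scheme, the function names, and the assumption that the LLM judge returns a scalar rubric-quality score in [0, 1] are all ours.

```python
def preference_agreement(score_a: float, score_b: float,
                         human_prefers_a: bool) -> float:
    """Return 1.0 if the ranking induced by the generated rubric's scores
    matches the annotated human preference over the paired reports, else 0.0."""
    rubric_prefers_a = score_a > score_b
    return 1.0 if rubric_prefers_a == human_prefers_a else 0.0

def hybrid_reward(score_a: float, score_b: float, human_prefers_a: bool,
                  llm_rubric_quality: float, alpha: float = 0.7) -> float:
    """Blend human-preference agreement with an LLM judge's assessment of
    rubric quality; alpha is an assumed mixing weight."""
    agree = preference_agreement(score_a, score_b, human_prefers_a)
    return alpha * agree + (1.0 - alpha) * llm_rubric_quality
```

In a reinforcement-learning loop, this scalar would serve as the reward for the rubric generator's policy update: a rubric earns credit both for ranking the paired reports the way human annotators did and for being judged well-formed by the LLM evaluator.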