Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

January 22, 2026
Authors: Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu
cs.AI

Abstract

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
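The test-time verification loop the abstract describes can be outlined as follows. This is a minimal sketch based only on the abstract's description; the function names (`generate_answer`, `verify_with_rubrics`, `refine_answer`) and the toy rubric check are hypothetical stand-ins, not the paper's actual API.

```python
# Sketch of the iterative test-time rubric-guided verification loop.
# All components here are illustrative placeholders: in the paper, the
# policy is a Deep Research Agent and the verifier is DeepVerifier.

def generate_answer(question: str) -> str:
    # Placeholder policy model: produces a draft answer.
    return f"draft answer to: {question}"

def verify_with_rubrics(answer: str, rubrics: list[str]) -> tuple[bool, list[str]]:
    # Placeholder outcome verifier: flags each unmet rubric item as feedback.
    # (A toy substring check stands in for rubric-based LLM judging.)
    feedback = [r for r in rubrics if r not in answer]
    return len(feedback) == 0, feedback

def refine_answer(answer: str, feedback: list[str]) -> str:
    # Placeholder refinement step: the agent revises using verifier feedback.
    return answer + " | addressed: " + "; ".join(feedback)

def self_evolve(question: str, rubrics: list[str], max_rounds: int = 3) -> str:
    """Iteratively verify and refine the answer at inference time,
    with no additional training of the underlying model."""
    answer = generate_answer(question)
    for _ in range(max_rounds):
        passed, feedback = verify_with_rubrics(answer, rubrics)
        if passed:
            break
        answer = refine_answer(answer, feedback)
    return answer
```

The key design point mirrored here is the plug-and-play placement of the verifier: it wraps an existing agent's outputs rather than modifying its weights, which is why the abstract frames the gains as inference-time scaling of verification.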