
Inference-Time Scaling of Verification: Self-Evolving Deep Research Agents via Test-Time Rubric-Guided Verification

January 22, 2026
Authors: Yuxuan Wan, Tianqing Fang, Zaitang Li, Yintong Huo, Wenxuan Wang, Haitao Mi, Dong Yu, Michael R. Lyu
cs.AI

Abstract

Recent advances in Deep Research Agents (DRAs) are transforming automated knowledge discovery and problem-solving. While the majority of existing efforts focus on enhancing policy capabilities via post-training, we propose an alternative paradigm: self-evolving the agent's ability by iteratively verifying the policy model's outputs, guided by meticulously crafted rubrics. This approach gives rise to the inference-time scaling of verification, wherein an agent self-improves by evaluating its generated answers to produce iterative feedback and refinements. We derive the rubrics based on an automatically constructed DRA Failure Taxonomy, which systematically classifies agent failures into five major categories and thirteen sub-categories. We present DeepVerifier, a rubrics-based outcome reward verifier that leverages the asymmetry of verification and outperforms vanilla agent-as-judge and LLM judge baselines by 12%-48% in meta-evaluation F1 score. To enable practical self-evolution, DeepVerifier integrates as a plug-and-play module during test-time inference. The verifier produces detailed rubric-based feedback, which is fed back to the agent for iterative bootstrapping, refining responses without additional training. This test-time scaling delivers 8%-11% accuracy gains on challenging subsets of GAIA and XBench-DeepResearch when powered by capable closed-source LLMs. Finally, to support open-source advancement, we release DeepVerifier-4K, a curated supervised fine-tuning dataset of 4,646 high-quality agent steps focused on DRA verification. These examples emphasize reflection and self-critique, enabling open models to develop robust verification capabilities.
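
To make the test-time loop concrete, below is a minimal Python sketch of the verify-and-refine cycle the abstract describes: the verifier scores a candidate answer against rubrics derived from the failure taxonomy and, on failure, its critique is fed back to the agent for another attempt, with no additional training. The names `agent`, `verifier`, `Verdict`, and `research_with_verification` are hypothetical illustrations, not the paper's actual API.

```python
# Minimal sketch of test-time rubric-guided verification. The `agent` and
# `verifier` interfaces below are assumptions for illustration; the paper's
# abstract does not specify its implementation.

from dataclasses import dataclass


@dataclass
class Verdict:
    passed: bool   # did the answer satisfy the rubrics?
    feedback: str  # fine-grained, rubric-level critique


def research_with_verification(agent, verifier, task: str,
                               max_rounds: int = 3) -> str:
    """Iteratively refine a deep-research answer using verifier feedback.

    No training is involved: the verifier evaluates each candidate answer
    against failure-taxonomy-derived rubrics, and its critique is appended
    to the agent's context for the next attempt (iterative bootstrapping).
    """
    answer = agent.solve(task)
    for _ in range(max_rounds):
        verdict: Verdict = verifier.evaluate(task, answer)
        if verdict.passed:
            break  # rubrics satisfied; stop scaling inference
        # Feed rubric-level feedback back to the agent and retry.
        answer = agent.solve(task, feedback=verdict.feedback)
    return answer
```

The `max_rounds` cap reflects the inference-time scaling knob: more verification rounds trade extra compute for the accuracy gains the paper reports on GAIA and XBench-DeepResearch.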