DREAM: Deep Research Evaluation with Agentic Metrics
February 21, 2026
Authors: Elad Ben Avraham, Changhao Li, Ron Dorfman, Roy Ganz, Oren Nuriel, Amir Dudai, Aviad Aberdam, Noah Flynn, Elman Mansimov, Adi Kalyanpur, Ron Litman
cs.AI
Abstract
Deep Research Agents generate analyst-grade reports, yet evaluating them remains challenging due to the absence of a single ground truth and the multidimensional nature of research quality. Recent benchmarks propose distinct methodologies, yet they suffer from the Mirage of Synthesis, where strong surface-level fluency and citation alignment can obscure underlying factual and reasoning defects. We characterize this gap by introducing a taxonomy across four verticals that exposes a critical capability mismatch: static evaluators inherently lack the tool-use capabilities required to assess temporal validity and factual correctness. To address this, we propose DREAM (Deep Research Evaluation with Agentic Metrics), a framework that instantiates the principle of capability parity by making evaluation itself agentic. DREAM structures assessment through an evaluation protocol combining query-agnostic metrics with adaptive metrics generated by a tool-calling agent, enabling temporally aware coverage, grounded verification, and systematic reasoning probes. Controlled evaluations demonstrate DREAM is significantly more sensitive to factual and temporal decay than existing benchmarks, offering a scalable, reference-free evaluation paradigm.
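The protocol described above combines fixed, query-agnostic metrics with adaptive metrics produced by a tool-calling evaluator agent. The sketch below illustrates that structure only; all names (`Metric`, `adaptive_metrics`, the placeholder scorers) are hypothetical illustrations, not the paper's actual API or metric definitions.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Metric:
    name: str
    score: Callable[[str], float]  # report text -> score in [0, 1]

# Query-agnostic metrics apply uniformly to every report.
# The lambda scorers are placeholders, not real implementations.
QUERY_AGNOSTIC = [
    Metric("citation_alignment", lambda report: 0.0),
    Metric("structural_coherence", lambda report: 0.0),
]

def adaptive_metrics(query: str) -> list[Metric]:
    """Stand-in for the tool-calling agent, which would derive
    query-specific probes (e.g. temporal-validity checks grounded
    in live sources) rather than this fixed placeholder."""
    return [Metric(f"temporal_validity[{query}]", lambda report: 0.0)]

def evaluate(query: str, report: str) -> dict[str, float]:
    """Score a report under both metric families, reference-free."""
    metrics = QUERY_AGNOSTIC + adaptive_metrics(query)
    return {m.name: m.score(report) for m in metrics}
```

The key design point the sketch mirrors is capability parity: the adaptive side of the evaluation is itself agentic, so it can use tools the static metrics cannot.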