MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome
March 30, 2026
Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing
cs.AI
Abstract
Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports against fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving evaluation setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while exposing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
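To make the three-dimensional evaluation concrete, the sketch below shows one way per-task scores along the three dimensions could be recorded and aggregated. This is a minimal illustrative sketch, not MiroEval's actual implementation: the class name `TaskScores`, the 0-100 scale, and the equal weighting are all assumptions made here for illustration.

```python
from dataclasses import dataclass

# Hypothetical record of one task's scores along MiroEval's three
# dimensions, assuming each is graded on a 0-100 scale. Names and
# weighting are illustrative assumptions, not the paper's API.
@dataclass
class TaskScores:
    synthesis: float   # rubric-based quality of the final report
    factuality: float  # agentic verification against web/multimodal sources
    process: float     # quality of the search/reason/refine trajectory

def overall(scores: TaskScores, weights=(1/3, 1/3, 1/3)) -> float:
    """Equal-weight aggregate; the real framework may combine dimensions differently."""
    w_s, w_f, w_p = weights
    return w_s * scores.synthesis + w_f * scores.factuality + w_p * scores.process

# Example: a system strong on output quality but weak on process --
# the kind of gap the abstract says output-level metrics alone miss.
print(overall(TaskScores(synthesis=85.0, factuality=78.0, process=52.0)))
```

Reporting the three scores separately, rather than only an aggregate, is what lets the framework surface process-level weaknesses that a single outcome score would hide.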