

MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

March 30, 2026
Authors: Fangda Ye, Yuxin Hu, Pengxiang Zhu, Yibo Li, Ziqi Jin, Yao Xiao, Yibo Wang, Lei Wang, Zhen Zhang, Lu Wang, Yue Deng, Bin Wang, Yifan Zhang, Liangcai Su, Xinyu Wang, He Zhao, Chen Wei, Qiang Ren, Bryan Hooi, An Bo, Shuicheng Yan, Lidong Bing
cs.AI

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics, failing to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live, evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis-quality evaluation with task-specific rubrics, agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments, and a process-centric behavioral audit of how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems dropping by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness tests confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
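To make the three-dimensional scoring concrete, the following is a minimal illustrative sketch of how per-dimension scores (synthesis quality, factuality, process audit) could be combined into a single overall number for comparing systems. The 0-100 scale, the equal weights, and all names below are assumptions for illustration only; the abstract does not specify MiroEval's actual scoring or aggregation scheme.

    from dataclasses import dataclass

    @dataclass
    class DimensionScores:
        """Scores for one system on one task, assumed to share a 0-100 scale."""
        synthesis_quality: float  # rubric-based quality of the final report
        factuality: float         # agentic verification of claims, scaled to 0-100
        process_audit: float      # trace-level quality of search/reason/refine steps

    def overall_score(s: DimensionScores,
                      weights: tuple[float, float, float] = (1.0, 1.0, 1.0)) -> float:
        """Weighted mean of the three dimensions (illustrative placeholder weights)."""
        w_syn, w_fact, w_proc = weights
        total = w_syn + w_fact + w_proc
        return (w_syn * s.synthesis_quality
                + w_fact * s.factuality
                + w_proc * s.process_audit) / total

    if __name__ == "__main__":
        example = DimensionScores(synthesis_quality=72.0, factuality=81.5, process_audit=64.0)
        print(f"Overall: {overall_score(example):.1f}")

A weighted mean is only one plausible choice; the paper's finding that process quality predicts overall outcome suggests the dimensions are correlated but not interchangeable, so reporting them separately (as the benchmark does) preserves diagnostic detail that a single aggregate would hide.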