MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics and fail to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics; agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments; and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
