MiroEval: Benchmarking Multimodal Deep Research Agents in Process and Outcome

Abstract

Recent progress in deep research systems has been impressive, but evaluation still lags behind real user needs. Existing benchmarks predominantly assess final reports using fixed rubrics and fail to evaluate the underlying research process. Most also offer limited multimodal coverage, rely on synthetic tasks that do not reflect real-world query complexity, and cannot be refreshed as knowledge evolves. To address these gaps, we introduce MiroEval, a benchmark and evaluation framework for deep research systems. The benchmark comprises 100 tasks (70 text-only, 30 multimodal), all grounded in real user needs and constructed via a dual-path pipeline that supports periodic updates, enabling a live and evolving setting. The proposed evaluation suite assesses deep research systems along three complementary dimensions: adaptive synthesis quality evaluation with task-specific rubrics; agentic factuality verification via active retrieval and reasoning over both web sources and multimodal attachments; and process-centric evaluation that audits how the system searches, reasons, and refines throughout its investigation. Evaluation across 13 systems yields three principal findings: the three evaluation dimensions capture complementary aspects of system capability, with each revealing distinct strengths and weaknesses across systems; process quality serves as a reliable predictor of overall outcome while revealing weaknesses invisible to output-level metrics; and multimodal tasks pose substantially greater challenges, with most systems declining by 3 to 10 points. The MiroThinker series achieves the most balanced performance, with MiroThinker-H1 ranking highest overall in both settings. Human verification and robustness results confirm the reliability of the benchmark and evaluation framework. MiroEval provides a holistic diagnostic tool for the next generation of deep research agents.
