MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents
January 18, 2026
Authors: Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang
cs.AI
Abstract
Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
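The abstract names the three evaluation modules (FLAE, TRACE, MOSAIC) without specifying how they are computed. As a purely illustrative sketch, and not the paper's actual TRACE implementation, the snippet below shows the general shape of a citation-grounding check: resolve each inline citation marker in a report against the agent's retrieved sources and emit a per-citation signal rather than a single overall score. All names here (`citation_alignment`, the `sources` mapping, the lexical-overlap score) are assumptions for illustration.

```python
import re

def citation_alignment(report: str, sources: dict[int, str]) -> dict:
    """Toy TRACE-style check (illustrative only, not the paper's method):
    for each sentence carrying an inline marker like [3], verify the ID
    resolves to a retrieved source and score lexical overlap between the
    citing sentence and that source as a weak per-citation support signal.
    """
    results = []
    for sentence in re.split(r"(?<=[.!?])\s+", report):
        for cid in {int(m) for m in re.findall(r"\[(\d+)\]", sentence)}:
            if cid not in sources:
                # Dangling citation: the marker points at no retrieved evidence.
                results.append({"id": cid, "resolved": False, "overlap": 0.0})
                continue
            claim = set(re.findall(r"[a-z]+", sentence.lower()))
            src = set(re.findall(r"[a-z]+", sources[cid].lower()))
            overlap = len(claim & src) / max(len(claim), 1)
            results.append({"id": cid, "resolved": True, "overlap": round(overlap, 2)})
    return {
        "resolution_rate": sum(r["resolved"] for r in results) / max(len(results), 1),
        "citations": results,
    }

report = "The model improves recall by 12% [1]. Latency drops under load [2]."
sources = {1: "We observe a 12% recall improvement on the held-out set."}
print(citation_alignment(report, sources))
```

In this sketch, `sources` stands in for whatever evidence the agent retrieved, and lexical overlap stands in for a proper entailment or LLM-judge score; what carries over to the benchmark's design is the output shape, fine-grained per-citation signals that support the error diagnosis the abstract describes.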