MMDeepResearch-Bench: A Benchmark for Multimodal Deep Research Agents

January 18, 2026
Authors: Peizhou Huang, Zixuan Zhong, Zhongwei Wan, Donghao Zhou, Samiul Alam, Xin Wang, Zexin Li, Zhihao Dou, Li Zhu, Jing Xiong, Chaofan Tao, Yan Xu, Dimitrios Dimitriadis, Tuo Zhang, Mi Zhang
cs.AI

Abstract

Deep Research Agents (DRAs) generate citation-rich reports via multi-step search and synthesis, yet existing benchmarks mainly target text-only settings or short-form multimodal QA, missing end-to-end multimodal evidence use. We introduce MMDeepResearch-Bench (MMDR-Bench), a benchmark of 140 expert-crafted tasks across 21 domains, where each task provides an image-text bundle to evaluate multimodal understanding and citation-grounded report generation. Compared to prior setups, MMDR-Bench emphasizes report-style synthesis with explicit evidence use, where models must connect visual artifacts to sourced claims and maintain consistency across narrative, citations, and visual references. We further propose a unified, interpretable evaluation pipeline: Formula-LLM Adaptive Evaluation (FLAE) for report quality, Trustworthy Retrieval-Aligned Citation Evaluation (TRACE) for citation-grounded evidence alignment, and Multimodal Support-Aligned Integrity Check (MOSAIC) for text-visual integrity, each producing fine-grained signals that support error diagnosis beyond a single overall score. Experiments across 25 state-of-the-art models reveal systematic trade-offs between generation quality, citation discipline, and multimodal grounding, highlighting that strong prose alone does not guarantee faithful evidence use and that multimodal integrity remains a key bottleneck for deep research agents.
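
The abstract does not spell out the scoring formulas behind FLAE, TRACE, and MOSAIC, but the diagnostic use of the three signals can be illustrated with a minimal sketch. The Python below assumes each component yields a per-task score in [0, 1]; the class, field names, and example values are hypothetical illustrations, not the paper's actual interface.

from dataclasses import dataclass

# Hypothetical container for one MMDR-Bench task result. The benchmark's real
# output format is not described in the abstract; scores assumed in [0, 1].
@dataclass
class TaskEvaluation:
    task_id: str
    domain: str        # one of the 21 benchmark domains
    flae: float        # FLAE: report quality
    trace: float       # TRACE: citation-grounded evidence alignment
    mosaic: float      # MOSAIC: text-visual integrity

    def overall(self) -> float:
        # A single headline number (unweighted mean), which hides trade-offs.
        return (self.flae + self.trace + self.mosaic) / 3

    def weakest_axis(self) -> str:
        # Per-axis view: the fine-grained diagnosis the pipeline is built for.
        scores = {
            "report_quality": self.flae,
            "citation_alignment": self.trace,
            "multimodal_integrity": self.mosaic,
        }
        return min(scores, key=scores.get)

# Illustrative case of the trade-off the abstract highlights: fluent prose
# with weak visual grounding (values are made up for the example).
ev = TaskEvaluation("econ-017", "economics", flae=0.91, trace=0.79, mosaic=0.41)
print(round(ev.overall(), 2))   # -> 0.7
print(ev.weakest_axis())        # -> multimodal_integrity

Reading the axes separately is what lets the benchmark flag a model whose averaged score looks healthy while its citation or visual grounding quietly fails.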