Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

July 14, 2025
Authors: Yilun Zhao, Chengye Wang, Chuhan Li, Arman Cohan
cs.AI

Abstract

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples over 465 scientific papers. In this benchmark, models are tasked with interpreting schematic diagrams that illustrate research overviews and answering corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL. We reveal a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions and our detailed error analysis further highlight the strengths and limitations of current models, offering key insights to enhance models in comprehending multimodal scientific literature.