Can Multimodal Foundation Models Understand Schematic Diagrams? An Empirical Study on Information-Seeking QA over Scientific Papers

Abstract

This paper introduces MISS-QA, the first benchmark specifically designed to evaluate the ability of models to interpret schematic diagrams within scientific literature. MISS-QA comprises 1,500 expert-annotated examples across 465 scientific papers. In this benchmark, models must interpret schematic diagrams that illustrate research overviews and answer corresponding information-seeking questions based on the broader context of the paper. We assess the performance of 18 frontier multimodal foundation models, including o4-mini, Gemini-2.5-Flash, and Qwen2.5-VL, and find a significant performance gap between these models and human experts on MISS-QA. Our analysis of model performance on unanswerable questions, together with a detailed error analysis, further highlights the strengths and limitations of current models and offers key insights for improving how models comprehend multimodal scientific literature.
