M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models
November 6, 2024
Authors: Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
cs.AI
Abstract
Existing benchmarks for evaluating foundation models mainly focus on
single-document, text-only tasks. However, they often fail to fully capture the
complexity of research workflows, which typically involve interpreting
non-textual data and gathering information across multiple documents. To
address this gap, we introduce M3SciQA, a multi-modal, multi-document
scientific question answering benchmark designed for a more comprehensive
evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated
questions spanning 70 natural language processing paper clusters, where each
cluster represents a primary paper along with all its cited documents,
mirroring the workflow of comprehending a single paper by requiring multi-modal
and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of
18 foundation models. Our results indicate that current foundation models still
significantly underperform compared to human experts in multi-modal information
retrieval and in reasoning across multiple scientific documents. Additionally,
we explore the implications of these findings for the future advancement of
applying foundation models in multi-modal scientific literature analysis.