

M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

November 6, 2024
Authors: Chuhan Li, Ziyao Shangguan, Yilun Zhao, Deyuan Li, Yixin Liu, Arman Cohan
cs.AI

Abstract

Existing benchmarks for evaluating foundation models mainly focus on single-document, text-only tasks. However, they often fail to fully capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents, mirroring the workflow of comprehending a single paper by requiring multi-modal and multi-document data. With M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform compared to human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. Additionally, we explore the implications of these findings for the future advancement of applying foundation models in multi-modal scientific literature analysis.