SPIQA: A Dataset for Multimodal Question Answering on Scientific Papers
July 12, 2024
Authors: Shraman Pramanick, Rama Chellappa, Subhashini Venugopalan
cs.AI
Abstract
Seeking answers to questions within long scientific research articles is a
crucial area of study that aids readers in quickly addressing their inquiries.
However, existing question-answering (QA) datasets based on scientific papers
are limited in scale and focus solely on textual content. To address this
limitation, we introduce SPIQA (Scientific Paper Image Question Answering), the
first large-scale QA dataset specifically designed to interpret complex figures
and tables within the context of scientific research articles across various
domains of computer science. Leveraging the breadth of expertise and ability of
multimodal large language models (MLLMs) to understand figures, we employ
automatic and manual curation to create the dataset. We craft an
information-seeking task involving multiple images that cover a wide variety of
plots, charts, tables, schematic diagrams, and result visualizations. SPIQA
comprises 270K questions divided into training, validation, and three different
evaluation splits. Through extensive experiments with 12 prominent foundational
models, we evaluate the ability of current multimodal systems to comprehend the
nuanced aspects of research articles. Additionally, we propose a
Chain-of-Thought (CoT) evaluation strategy with in-context retrieval that
allows fine-grained, step-by-step assessment and improves model performance. We
further explore the upper bounds of performance enhancement with additional
textual information, highlighting its promising potential for future research
and the dataset's impact on revolutionizing how we interact with scientific
literature.