M3SciQA: A Multi-Modal Multi-Document Scientific QA Benchmark for Evaluating Foundation Models

Abstract

Existing benchmarks for evaluating foundation models focus mainly on single-document, text-only tasks. As a result, they often fail to capture the complexity of research workflows, which typically involve interpreting non-textual data and gathering information across multiple documents. To address this gap, we introduce M3SciQA, a multi-modal, multi-document scientific question answering benchmark designed for a more comprehensive evaluation of foundation models. M3SciQA consists of 1,452 expert-annotated questions spanning 70 natural language processing paper clusters, where each cluster represents a primary paper along with all its cited documents. This structure mirrors the workflow of comprehending a single paper, which requires both multi-modal and multi-document data. Using M3SciQA, we conduct a comprehensive evaluation of 18 foundation models. Our results indicate that current foundation models still significantly underperform human experts in multi-modal information retrieval and in reasoning across multiple scientific documents. We also explore the implications of these findings for future progress in applying foundation models to multi-modal scientific literature analysis.

