Scientists' First Exam: Probing Cognitive Abilities of MLLM via Perception, Understanding, and Reasoning

Abstract

Scientific discovery increasingly relies on complex multimodal reasoning over information-intensive scientific data and domain-specific expertise. Empowered by expert-level scientific benchmarks, scientific Multimodal Large Language Models (MLLMs) have the potential to significantly enhance this discovery process in realistic workflows. However, current scientific benchmarks mostly evaluate the knowledge understanding capabilities of MLLMs, leaving their perception and reasoning abilities inadequately assessed. To address this gap, we present the Scientists' First Exam (SFE) benchmark, designed to evaluate the scientific cognitive capacities of MLLMs at three interconnected levels: scientific signal perception, scientific attribute understanding, and scientific comparative reasoning. Specifically, SFE comprises 830 expert-verified visual question answering (VQA) pairs in three question types, spanning 66 multimodal tasks in five high-value disciplines. Extensive experiments reveal that the current state-of-the-art GPT-o3 and InternVL-3 achieve only 34.08% and 26.52% on SFE, respectively, highlighting significant room for MLLMs to improve in scientific domains. We hope the insights obtained from SFE will facilitate further developments in AI-enhanced scientific discovery.
