
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

October 13, 2025
Authors: Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
PDF · October 15, 2025