ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
October 13, 2025
Authors: Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating
scientific discovery by interpreting complex experimental procedures. However,
their true capabilities are poorly understood, as existing benchmarks neglect
the fine-grained and long-horizon nature of authentic laboratory work,
especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the
first benchmark designed to systematically evaluate MLLMs on scientific
experiment videos. Curated from peer-reviewed video publications, ExpVid
features a new three-level task hierarchy that mirrors the scientific process:
(1) Fine-grained Perception of tools, materials, and actions; (2) Procedural
Understanding of step order and completeness; and (3) Scientific Reasoning that
connects the full experiment to its published conclusions. Our vision-centric
annotation pipeline, combining automated generation with multi-disciplinary
expert validation, ensures that tasks require visual grounding. We evaluate 19
leading MLLMs on ExpVid and find that while they excel at coarse-grained
recognition, they struggle with disambiguating fine details, tracking state
changes over time, and linking experimental procedures to scientific outcomes.
Our results reveal a notable performance gap between proprietary and
open-source models, particularly in higher-order reasoning. ExpVid not only
provides a diagnostic tool but also charts a roadmap for developing MLLMs
capable of becoming trustworthy partners in scientific experimentation.
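The three-level task hierarchy described above lends itself to per-level evaluation. As a minimal sketch of how such a benchmark might be scored, the snippet below models the hierarchy and computes accuracy per level; all class names, fields, and the multiple-choice format are illustrative assumptions, not the authors' actual data schema.

```python
from dataclasses import dataclass
from enum import Enum

# The three levels named in the abstract; comments summarize each level.
class TaskLevel(Enum):
    PERCEPTION = 1   # fine-grained perception of tools, materials, and actions
    PROCEDURAL = 2   # procedural understanding: step order and completeness
    REASONING = 3    # scientific reasoning linking the experiment to conclusions

@dataclass
class BenchmarkItem:
    # Hypothetical multiple-choice item; the real ExpVid format may differ.
    video_id: str
    level: TaskLevel
    question: str
    choices: list[str]
    answer_idx: int

def accuracy_by_level(items: list[BenchmarkItem], predictions: list[int]) -> dict[str, float]:
    """Compute accuracy per task level, as one might report MLLM results."""
    stats = {lvl: [0, 0] for lvl in TaskLevel}  # level -> [correct, total]
    for item, pred in zip(items, predictions):
        stats[item.level][1] += 1
        stats[item.level][0] += int(pred == item.answer_idx)
    return {lvl.name: (c / n if n else 0.0) for lvl, (c, n) in stats.items()}
```

Reporting accuracy separately per level is what makes the diagnostic claim in the abstract possible: a model can score well on PERCEPTION while failing on REASONING, exposing exactly the gap the authors describe.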