
ExpVid: A Benchmark for Experiment Video Understanding & Reasoning

October 13, 2025
Authors: Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
cs.AI

Abstract

Multimodal Large Language Models (MLLMs) hold promise for accelerating scientific discovery by interpreting complex experimental procedures. However, their true capabilities are poorly understood, as existing benchmarks neglect the fine-grained and long-horizon nature of authentic laboratory work, especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the first benchmark designed to systematically evaluate MLLMs on scientific experiment videos. Curated from peer-reviewed video publications, ExpVid features a new three-level task hierarchy that mirrors the scientific process: (1) Fine-grained Perception of tools, materials, and actions; (2) Procedural Understanding of step order and completeness; and (3) Scientific Reasoning that connects the full experiment to its published conclusions. Our vision-centric annotation pipeline, combining automated generation with multi-disciplinary expert validation, ensures that tasks require visual grounding. We evaluate 19 leading MLLMs on ExpVid and find that while they excel at coarse-grained recognition, they struggle with disambiguating fine details, tracking state changes over time, and linking experimental procedures to scientific outcomes. Our results reveal a notable performance gap between proprietary and open-source models, particularly in high-order reasoning. ExpVid not only provides a diagnostic tool but also charts a roadmap for developing MLLMs capable of becoming trustworthy partners in scientific experimentation.
PDF · October 15, 2025