ExpVid: A Benchmark for Experiment Video Understanding & Reasoning
October 13, 2025
Authors: Yicheng Xu, Yue Wu, Jiashuo Yu, Ziang Yan, Tianxiang Jiang, Yinan He, Qingsong Zhao, Kai Chen, Yu Qiao, Limin Wang, Manabu Okumura, Yi Wang
cs.AI
Abstract
Multimodal Large Language Models (MLLMs) hold promise for accelerating
scientific discovery by interpreting complex experimental procedures. However,
their true capabilities are poorly understood, as existing benchmarks neglect
the fine-grained and long-horizon nature of authentic laboratory work,
especially in wet-lab settings. To bridge this gap, we introduce ExpVid, the
first benchmark designed to systematically evaluate MLLMs on scientific
experiment videos. Curated from peer-reviewed video publications, ExpVid
features a new three-level task hierarchy that mirrors the scientific process:
(1) Fine-grained Perception of tools, materials, and actions; (2) Procedural
Understanding of step order and completeness; and (3) Scientific Reasoning that
connects the full experiment to its published conclusions. Our vision-centric
annotation pipeline, combining automated generation with multi-disciplinary
expert validation, ensures that tasks require visual grounding. We evaluate 19
leading MLLMs on ExpVid and find that while they excel at coarse-grained
recognition, they struggle with disambiguating fine details, tracking state
changes over time, and linking experimental procedures to scientific outcomes.
Our results reveal a notable performance gap between proprietary and
open-source models, particularly in high-order reasoning. ExpVid not only
provides a diagnostic tool but also charts a roadmap for developing MLLMs
capable of becoming trustworthy partners in scientific experimentation.
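Since the abstract only outlines the three-level task hierarchy, the sketch below illustrates how per-level accuracy might be computed for multiple-choice items drawn from such a benchmark. The item schema, field names, example questions, and the `predict` interface are hypothetical assumptions for illustration; they are not taken from the paper or any released ExpVid code.

```python
# Minimal illustrative sketch: per-level accuracy for a three-level video benchmark.
# The item schema ("level", "question", "options", "answer") and predict() interface
# are hypothetical and not the paper's released format.
from collections import defaultdict
from typing import Callable

# Hypothetical benchmark items, one per task level.
items = [
    {"level": "perception", "video": "exp_001.mp4",
     "question": "Which tool is used in this clip?",
     "options": ["pipette", "forceps", "scalpel", "stirring rod"], "answer": "pipette"},
    {"level": "procedure", "video": "exp_001.mp4",
     "question": "Which step is missing from the shown sequence?",
     "options": ["centrifugation", "incubation", "washing", "none"], "answer": "washing"},
    {"level": "reasoning", "video": "exp_001.mp4",
     "question": "Which conclusion does the full experiment support?",
     "options": ["A", "B", "C", "D"], "answer": "B"},
]

def evaluate(predict: Callable[[dict], str], items: list[dict]) -> dict[str, float]:
    """Return accuracy per task level for a model's predict(item) -> chosen option."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["level"]] += 1
        if predict(item) == item["answer"]:
            correct[item["level"]] += 1
    return {level: correct[level] / total[level] for level in total}

if __name__ == "__main__":
    # Trivial baseline: always pick the first option.
    scores = evaluate(lambda item: item["options"][0], items)
    print(scores)  # {'perception': 1.0, 'procedure': 0.0, 'reasoning': 0.0}
```

Reporting scores per level, rather than a single aggregate, matches the diagnostic framing above: a model can score well on coarse perception while failing procedural and reasoning questions.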