Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?
May 27, 2025
Authors: Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
cs.AI
Abstract
Recent advances in CoT reasoning and RL post-training have been reported to
enhance video reasoning capabilities of MLLMs. This progress naturally raises a
question: can these models perform complex video reasoning in a manner
comparable to human experts? However, existing video benchmarks primarily
evaluate visual perception and grounding abilities, with questions that can be
answered based on explicit prompts or isolated visual cues. Such benchmarks do
not fully capture the intricacies of real-world reasoning, where humans must
actively search for, integrate, and analyze multiple clues before reaching a
conclusion. To address this issue, we present Video-Holmes, a benchmark
inspired by the reasoning process of Sherlock Holmes, designed to evaluate the
complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837
questions derived from 270 manually annotated suspense short films, spanning
seven carefully designed tasks. Each task is constructed by first identifying
key events and causal relationships within films, and then designing questions
that require models to actively locate and connect multiple relevant visual
clues scattered across different video segments. Our comprehensive evaluation
of state-of-the-art MLLMs reveals that, while these models generally excel at
visual perception, they encounter substantial difficulties with integrating
information and often miss critical clues. For example, the best-performing
model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models
scoring below 40%. We hope that Video-Holmes can serve as a "Holmes test" for
multimodal reasoning, motivating models to reason more like humans and
emphasizing the ongoing challenges in this field. The benchmark is released at
https://github.com/TencentARC/Video-Holmes.
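
The reported scores are accuracies over the 1,837 questions. Below is a minimal sketch, assuming a multiple-choice format with a simple JSON layout, of how overall and per-task accuracy could be computed. The file names and fields (`question_id`, `task`, `answer`) are hypothetical; refer to the official repository for the actual data format and evaluation scripts.

```python
# Hypothetical accuracy computation for a Video-Holmes-style benchmark.
# The JSON schema shown here is an illustrative assumption, not the
# benchmark's real format.
import json
from collections import defaultdict


def compute_accuracy(questions_path: str, predictions_path: str) -> dict:
    """Return overall and per-task accuracy for multiple-choice predictions."""
    with open(questions_path, encoding="utf-8") as f:
        # Assumed: list of {"question_id": str, "task": str, "answer": "A"|"B"|...}
        questions = json.load(f)
    with open(predictions_path, encoding="utf-8") as f:
        # Assumed: mapping {"question_id": predicted option letter}
        predictions = json.load(f)

    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        task = q["task"]
        total[task] += 1
        predicted = predictions.get(q["question_id"], "").strip().upper()
        if predicted == q["answer"].strip().upper():
            correct[task] += 1

    per_task = {task: correct[task] / total[task] for task in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return {"overall": overall, "per_task": per_task}


if __name__ == "__main__":
    scores = compute_accuracy("video_holmes_questions.json", "model_predictions.json")
    print(f"Overall accuracy: {scores['overall']:.1%}")
    for task, acc in sorted(scores["per_task"].items()):
        print(f"  {task}: {acc:.1%}")
```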