Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

May 27, 2025
Authors: Junhao Cheng, Yuying Ge, Teng Wang, Yixiao Ge, Jing Liao, Ying Shan
cs.AI

Abstract

Recent advances in CoT reasoning and RL post-training have been reported to enhance the video reasoning capabilities of MLLMs. This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? However, existing video benchmarks primarily evaluate visual perception and grounding abilities, with questions that can be answered from explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes, designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties in integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and highlighting the ongoing challenges in this field. The benchmark is released at https://github.com/TencentARC/Video-Holmes.
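
As a rough illustration of the evaluation setup described in the abstract (multiple-choice questions grouped into seven tasks and scored by accuracy), the sketch below shows how per-task and overall accuracy could be computed. The file name, the record fields (question_id, task, options, answer), and the predict stub are hypothetical placeholders, not the repository's actual interface; consult the GitHub repo for the official data format and evaluation scripts.

```python
import json
from collections import defaultdict

# Hypothetical record layout -- the real Video-Holmes schema may differ;
# see https://github.com/TencentARC/Video-Holmes for the official format.
# Each record is assumed to look like:
#   {"question_id": "...", "task": "...", "video": "...",
#    "question": "...", "options": {"A": "...", ...}, "answer": "A"}

def evaluate(questions, predict):
    """Score a model on multiple-choice questions, overall and per task.

    `predict` is any callable mapping a question record to an option
    letter such as "A"; here it stands in for a call to an MLLM.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for q in questions:
        total[q["task"]] += 1
        if predict(q) == q["answer"]:
            correct[q["task"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_task

if __name__ == "__main__":
    # Hypothetical file name for the exported question set.
    with open("video_holmes_questions.json") as f:
        questions = json.load(f)
    # Placeholder model: always answers "A". Replace with a real MLLM call.
    overall, per_task = evaluate(questions, lambda q: "A")
    print(f"overall accuracy: {overall:.1%}")
    for task, acc in sorted(per_task.items()):
        print(f"  {task}: {acc:.1%}")
```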
