Video-Holmes: Can MLLM Think Like Holmes for Complex Video Reasoning?

Abstract

Recent advances in chain-of-thought (CoT) reasoning and reinforcement learning (RL) post-training have been reported to enhance the video reasoning capabilities of multimodal large language models (MLLMs). This progress naturally raises a question: can these models perform complex video reasoning in a manner comparable to human experts? Existing video benchmarks, however, primarily evaluate visual perception and grounding abilities, with questions that can be answered from explicit prompts or isolated visual cues. Such benchmarks do not fully capture the intricacies of real-world reasoning, where humans must actively search for, integrate, and analyze multiple clues before reaching a conclusion. To address this issue, we present Video-Holmes, a benchmark inspired by the reasoning process of Sherlock Holmes and designed to evaluate the complex video reasoning capabilities of MLLMs. Video-Holmes consists of 1,837 questions derived from 270 manually annotated suspense short films, spanning seven carefully designed tasks. Each task is constructed by first identifying key events and causal relationships within the films, and then designing questions that require models to actively locate and connect multiple relevant visual clues scattered across different video segments. Our comprehensive evaluation of state-of-the-art MLLMs reveals that, while these models generally excel at visual perception, they encounter substantial difficulties with integrating information and often miss critical clues. For example, the best-performing model, Gemini-2.5-Pro, achieves an accuracy of only 45%, with most models scoring below 40%. We hope that Video-Holmes can serve as a "Holmes-test" for multimodal reasoning, motivating models to reason more like humans and emphasizing the ongoing challenges in this field. The benchmark is released at https://github.com/TencentARC/Video-Holmes.
