VideoReasonBench: Can MLLMs Perform Vision-Centric Complex Video Reasoning?

Abstract

Recent studies have shown that long chain-of-thought (CoT) reasoning can significantly enhance the performance of large language models (LLMs) on complex tasks. However, this benefit has yet to be demonstrated in video understanding, since most existing benchmarks lack the reasoning depth needed to reveal the advantages of extended CoT chains. While recent efforts have proposed benchmarks aimed at video reasoning, their tasks are often knowledge-driven and do not rely heavily on visual content. To bridge this gap, we introduce VideoReasonBench, a benchmark designed to evaluate vision-centric, complex video reasoning. To ensure visual richness and high reasoning complexity, each video in VideoReasonBench depicts a sequence of fine-grained operations on a latent state that is visible in only part of the video. The questions evaluate three escalating levels of video reasoning skill: recalling observed visual information, inferring the content of latent states, and predicting information beyond the video. Under this task setting, models must precisely recall multiple operations in the video and perform step-by-step reasoning to arrive at the correct final answers. Using VideoReasonBench, we comprehensively evaluate 18 state-of-the-art multimodal LLMs (MLLMs) and find that most perform poorly on complex video reasoning: for example, GPT-4o achieves only 6.9% accuracy, while the thinking-enhanced Gemini-2.5-Pro significantly outperforms the others with 56.0% accuracy. Our investigation of "test-time scaling" further reveals that an extended thinking budget, while offering little or no benefit on existing video benchmarks, is essential for improving performance on VideoReasonBench.
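
The task setting described above can be illustrated with a small toy model. The sketch below is purely hypothetical: the abstract does not specify the benchmark's actual videos, operation types, or question formats, so the `Op` type, the "add" and "swap" operations, and the helper functions are all assumptions made for illustration. It treats the latent state as a list of integers that is visible only at the end of the "video", and shows what the three question levels (recall, infer, predict) might require.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Op:
    """One observed operation (hypothetical; not the benchmark's real schema)."""
    kind: str  # "add" or "swap"
    a: int     # index for "add"; first index for "swap"
    b: int     # amount for "add"; second index for "swap"


def apply(state, op):
    """Return a new state with one operation applied."""
    s = list(state)
    if op.kind == "add":
        s[op.a] += op.b
    elif op.kind == "swap":
        s[op.a], s[op.b] = s[op.b], s[op.a]
    else:
        raise ValueError(f"unknown operation kind: {op.kind}")
    return s


def undo(state, op):
    """Return the state as it was before the operation was applied."""
    if op.kind == "add":
        return apply(state, Op("add", op.a, -op.b))
    return apply(state, op)  # a swap is its own inverse


def recall_count(ops, kind):
    """Level 1 (recall): count the observed operations of a given kind."""
    return sum(1 for op in ops if op.kind == kind)


def infer_initial(final_state, ops):
    """Level 2 (infer): recover the hidden initial state by undoing every step."""
    state = list(final_state)
    for op in reversed(ops):
        state = undo(state, op)
    return state


def predict(final_state, future_ops):
    """Level 3 (predict): apply operations that happen after the video ends."""
    state = list(final_state)
    for op in future_ops:
        state = apply(state, op)
    return state


# Toy "video": three observed operations, with the state revealed only at the end.
observed_ops = [Op("add", 0, 2), Op("swap", 0, 2), Op("add", 1, -1)]
visible_final_state = [5, 0, 3]

print(recall_count(observed_ops, "add"))
print(infer_initial(visible_final_state, observed_ops))
print(predict(visible_final_state, [Op("swap", 1, 2)]))
```

Even in this toy form, a single misremembered operation corrupts the inferred and predicted states, which may be one reason such tasks reward careful step-by-step reasoning over the whole video.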

