V-ReasonBench: A Unified Reasoning Benchmark Suite for Video Generation Models

Abstract

Recent progress in generative video models such as Veo-3 has revealed surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with performance varying strongly across structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.