V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Abstract

Recent progress in generative video models, such as Veo-3, has revealed surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning along four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluating six state-of-the-art video models reveals clear dimension-wise differences, with performance varying strongly across structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
