V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

Abstract

Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.
