V-ReasonBench: Toward Unified Reasoning Benchmark Suite for Video Generation Models

November 20, 2025
Authors: Yang Luo, Xuanlei Zhao, Baijiong Lin, Lingting Zhu, Liyao Tang, Yuqi Liu, Ying-Cong Chen, Shengju Qian, Xin Wang, Yang You
cs.AI

Abstract

Recent progress in generative video models, such as Veo-3, has shown surprising zero-shot reasoning abilities, creating a growing need for systematic and reliable evaluation. We introduce V-ReasonBench, a benchmark designed to assess video reasoning across four key dimensions: structured problem-solving, spatial cognition, pattern-based inference, and physical dynamics. The benchmark is built from both synthetic and real-world image sequences and provides a diverse set of answer-verifiable tasks that are reproducible, scalable, and unambiguous. Evaluations of six state-of-the-art video models reveal clear dimension-wise differences, with strong variation in structured, spatial, pattern-based, and physical reasoning. We further compare video models with strong image models, analyze common hallucination behaviors, and study how video duration affects Chain-of-Frames reasoning. Overall, V-ReasonBench offers a unified and reproducible framework for measuring video reasoning and aims to support the development of models with more reliable, human-aligned reasoning skills.