Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark
October 30, 2025
Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
cs.AI
Abstract
Recent video generation models can produce high-fidelity, temporally coherent
videos, indicating that they may encode substantial world knowledge. Beyond
realistic synthesis, they also exhibit emerging behaviors indicative of visual
perception, modeling, and manipulation. Yet an important question remains:
Are video models ready to serve as zero-shot reasoners in challenging
visual reasoning scenarios? In this work, we conduct an empirical study to
comprehensively investigate this question, focusing on the leading and popular
Veo-3. We evaluate its reasoning behavior across 12 dimensions, including
spatial, geometric, physical, temporal, and embodied logic, systematically
characterizing both its strengths and failure modes. To standardize this study,
we curate the evaluation data into MME-CoF, a compact benchmark that enables
in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our
findings reveal that while current video models demonstrate promising reasoning
patterns on short-horizon spatial coherence, fine-grained grounding, and
locally consistent dynamics, they remain limited in long-horizon causal
reasoning, strict geometric constraints, and abstract logic. Overall, they are
not yet reliable as standalone zero-shot reasoners, but exhibit encouraging
signs as complementary visual engines alongside dedicated reasoning models.
Project page: https://video-cof.github.io