Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, which suggests that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emerging behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to investigate this question, focusing on the leading and widely used Veo-3. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, and systematically characterize both its strengths and its failure modes. To standardize the study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth assessment of Chain-of-Frame (CoF) reasoning. Our findings show that current video models demonstrate promising reasoning patterns in short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, but remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io