

Are Video Models Ready as Zero-Shot Reasoners? An Empirical Study with the MME-CoF Benchmark

October 30, 2025
Authors: Ziyu Guo, Xinyan Chen, Renrui Zhang, Ruichuan An, Yu Qi, Dongzhi Jiang, Xiangtai Li, Manyuan Zhang, Hongsheng Li, Pheng-Ann Heng
cs.AI

Abstract

Recent video generation models can produce high-fidelity, temporally coherent videos, indicating that they may encode substantial world knowledge. Beyond realistic synthesis, they also exhibit emergent behaviors indicative of visual perception, modeling, and manipulation. Yet an important question remains: Are video models ready to serve as zero-shot reasoners in challenging visual reasoning scenarios? In this work, we conduct an empirical study to comprehensively investigate this question, focusing on Veo-3, a leading and popular model. We evaluate its reasoning behavior across 12 dimensions, including spatial, geometric, physical, temporal, and embodied logic, systematically characterizing both its strengths and failure modes. To standardize this study, we curate the evaluation data into MME-CoF, a compact benchmark that enables in-depth and thorough assessment of Chain-of-Frame (CoF) reasoning. Our findings reveal that while current video models demonstrate promising reasoning patterns on short-horizon spatial coherence, fine-grained grounding, and locally consistent dynamics, they remain limited in long-horizon causal reasoning, strict geometric constraints, and abstract logic. Overall, they are not yet reliable as standalone zero-shot reasoners, but they show encouraging signs as complementary visual engines alongside dedicated reasoning models. Project page: https://video-cof.github.io
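To make the evaluation protocol described in the abstract concrete, the following is a minimal sketch of how per-dimension pass rates might be aggregated when probing a video model's Chain-of-Frame reasoning across categories such as spatial, geometric, physical, temporal, and embodied logic. The names here (`CoFSample`, `generate_fn`, `judge_fn`) are illustrative assumptions only; they do not correspond to the actual MME-CoF tooling or the Veo-3 API.

```python
# Hypothetical sketch of a Chain-of-Frame (CoF) evaluation loop.
# All identifiers below are assumptions for illustration, not the
# real MME-CoF benchmark interface or the Veo-3 model API.

from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class CoFSample:
    task_id: str
    dimension: str   # e.g. "spatial", "geometric", "physical", "temporal"
    prompt: str      # task description given to the video model
    reference: str   # expected outcome used by the judge


def evaluate_cof(
    samples: Iterable[CoFSample],
    generate_fn: Callable[[str], List[str]],     # prompt -> frame sequence
    judge_fn: Callable[[List[str], str], bool],  # frames, reference -> pass?
) -> dict:
    """Return per-dimension pass rates for zero-shot CoF reasoning."""
    total = defaultdict(int)
    passed = defaultdict(int)
    for s in samples:
        frames = generate_fn(s.prompt)
        total[s.dimension] += 1
        passed[s.dimension] += int(judge_fn(frames, s.reference))
    return {d: passed[d] / total[d] for d in total}


if __name__ == "__main__":
    # Toy stand-ins for a video model and a judge, only to show the flow.
    demo = [
        CoFSample("t1", "spatial", "roll the ball to the left", "ball ends left"),
        CoFSample("t2", "physical", "drop the cup", "cup falls down"),
    ]
    fake_generate = lambda prompt: [f"frame describing: {prompt}"]
    fake_judge = lambda frames, ref: ref.split()[0] in frames[0]
    print(evaluate_cof(demo, fake_generate, fake_judge))
```

In practice, `judge_fn` would be a human rater or a vision-language judge scoring the generated frame sequence against the reference outcome; the sketch only shows how scores could be bucketed by reasoning dimension.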