Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
November 17, 2025
Authors: Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang
cs.AI
Abstract
While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.