Can World Simulators Reason? Gen-ViRe: A Generative Visual Reasoning Benchmark
November 17, 2025
Authors: Xinxin Liu, Zhaopan Xu, Kai Wang, Yong Jae Lee, Yuzhang Shang
cs.AI
Abstract
While Chain-of-Thought (CoT) prompting enables sophisticated symbolic reasoning in LLMs, it remains confined to discrete text and cannot simulate the continuous, physics-governed dynamics of the real world. Recent video generation models have emerged as potential world simulators through Chain-of-Frames (CoF) reasoning -- materializing thought as frame-by-frame visual sequences, with each frame representing a physically-grounded reasoning step. Despite compelling demonstrations, a challenge persists: existing benchmarks, focusing on fidelity or alignment, do not assess CoF reasoning and thus cannot measure core cognitive abilities in multi-step planning, algorithmic logic, or abstract pattern extrapolation. This evaluation void prevents systematic understanding of model capabilities and principled guidance for improvement. We introduce Gen-ViRe (Generative Visual Reasoning Benchmark), a framework grounded in cognitive science and real-world AI applications, which decomposes CoF reasoning into six cognitive dimensions -- from perceptual logic to abstract planning -- and 24 subtasks. Through multi-source data curation, minimal prompting protocols, and hybrid VLM-assisted evaluation with detailed criteria, Gen-ViRe delivers the first quantitative assessment of video models as reasoners. Our experiments on SOTA systems reveal substantial discrepancies between impressive visual quality and actual reasoning depth, establishing baselines and diagnostic tools to advance genuine world simulators.