MMGR: Multi-Modal Generative Reasoning

December 16, 2025
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
cs.AI

Abstract

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing large performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10 percent accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and training objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
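The abstract does not spell out how "holistic correctness" is scored. One plausible reading, offered here only as an assumption and not as the authors' implementation, is that a generated sample counts as correct only when every fine-grained check on it passes. The minimal sketch below illustrates that reading; the class name, function names, and check names ("physics", "temporal") are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SampleResult:
    """Outcomes of fine-grained checks for one generated sample.

    The check names are hypothetical; they are not taken from MMGR.
    """
    checks: Dict[str, bool]

    def holistic_pass(self) -> bool:
        # Assumed rule: correct only if there is at least one check
        # and every check passed.
        return bool(self.checks) and all(self.checks.values())


def holistic_accuracy(results: List[SampleResult]) -> float:
    """Fraction of samples that pass all of their checks."""
    if not results:
        return 0.0
    return sum(r.holistic_pass() for r in results) / len(results)


def per_check_accuracy(results: List[SampleResult]) -> Dict[str, float]:
    """Pass rate of each check taken on its own, for comparison."""
    totals: Dict[str, int] = {}
    passes: Dict[str, int] = {}
    for r in results:
        for name, ok in r.checks.items():
            totals[name] = totals.get(name, 0) + 1
            passes[name] = passes.get(name, 0) + int(ok)
    return {name: passes[name] / totals[name] for name in totals}


# Illustrative data only; these are not results from the paper.
results = [
    SampleResult({"physics": True, "temporal": True}),
    SampleResult({"physics": True, "temporal": False}),
    SampleResult({"physics": False, "temporal": True}),
    SampleResult({"physics": True, "temporal": True}),
]
print(holistic_accuracy(results))
print(per_check_accuracy(results))
```

In this toy data each individual check passes on three of four samples, yet only two of four samples pass every check. Under the assumed rule, a strict all-checks-pass score can therefore sit well below the per-check pass rates.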