

MMGR: Multi-Modal Generative Reasoning

December 16, 2025
Authors: Zefan Cai, Haoyi Qiu, Tianyi Ma, Haozhe Zhao, Gengze Zhou, Kung-Hsiang Huang, Parisa Kordjamshidi, Minjia Zhang, Xiao Wen, Jiuxiang Gu, Nanyun Peng, Junjie Hu
cs.AI

Abstract

Video foundation models generate visually realistic and temporally coherent content, but their reliability as world simulators depends on whether they capture physical, logical, and spatial constraints. Existing metrics such as Fréchet Video Distance (FVD) emphasize perceptual quality and overlook reasoning failures, including violations of causality, physics, and global consistency. We introduce MMGR (Multi-Modal Generative Reasoning Evaluation and Benchmark), a principled evaluation framework based on five reasoning abilities: Physical, Logical, 3D Spatial, 2D Spatial, and Temporal. MMGR evaluates generative reasoning across three domains: Abstract Reasoning (ARC-AGI, Sudoku), Embodied Navigation (real-world 3D navigation and localization), and Physical Commonsense (sports and compositional interactions). MMGR applies fine-grained metrics that require holistic correctness across both video and image generation. We benchmark leading video models (Veo-3, Sora-2, Wan-2.2) and image models (Nano-banana, Nano-banana Pro, GPT-4o-image, Qwen-image), revealing large performance gaps across domains. Models show moderate success on Physical Commonsense tasks but perform poorly on Abstract Reasoning (below 10% accuracy on ARC-AGI) and struggle with long-horizon spatial planning in embodied settings. Our analysis highlights key limitations in current models, including overreliance on perceptual data, weak global state consistency, and objectives that reward visual plausibility over causal correctness. MMGR offers a unified diagnostic benchmark and a path toward reasoning-aware generative world models.
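The abstract does not spell out how "holistic correctness" is scored, so the following is only a minimal sketch of one plausible reading: a generated sample counts as correct only when every fine-grained check passes. The check names (`physics`, `causality`, `final_state`) and both function names are hypothetical and are not taken from the paper.

```python
from typing import Dict, List


def holistic_accuracy(samples: List[Dict[str, bool]]) -> float:
    """Fraction of samples for which every sub-check passed (all-or-nothing)."""
    if not samples:
        return 0.0
    return sum(all(checks.values()) for checks in samples) / len(samples)


def per_check_accuracy(samples: List[Dict[str, bool]]) -> Dict[str, float]:
    """Pass rate of each sub-check, assuming all samples share the same keys."""
    if not samples:
        return {}
    names = list(samples[0].keys())
    return {
        name: sum(checks[name] for checks in samples) / len(samples)
        for name in names
    }


# Hypothetical per-sample judgments; the check names are illustrative only.
samples = [
    {"physics": True, "causality": True, "final_state": False},
    {"physics": True, "causality": False, "final_state": True},
    {"physics": True, "causality": True, "final_state": True},
    {"physics": False, "causality": True, "final_state": True},
]

print(per_check_accuracy(samples))  # each check passes on 3 of 4 samples
print(holistic_accuracy(samples))   # only 1 of 4 samples passes every check
```

In this toy data each individual check passes 75% of the time, yet only 25% of samples are correct on all checks at once, which illustrates why an all-or-nothing criterion can be much stricter than averaged per-aspect scores.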