WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do these models inherently possess 3D world modeling capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their potential implicit 3D capability, we propose an agentic framework for 3D world generation. Our approach employs a multi-agent architecture with three components: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3D reconstruction space. Crucially, we show that this agentic approach yields coherent and robust 3D reconstructions, producing scenes that can be explored by rendering novel views. Extensive experiments across various foundation models show that 2D models do encapsulate an understanding of 3D worlds. By exploiting this understanding, our method synthesizes expansive, realistic, and 3D-consistent worlds.
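The director, generator, and two-step verifier described in the abstract could be wired together roughly as follows. This is only a minimal sketch under stated assumptions: the abstract does not specify the control flow, so the retry loop, the `max_attempts` parameter, the `Frame` type, and the names `build_world`, `verify_2d`, and `verify_3d` are all hypothetical, and each agent is represented by a plain callable rather than a real VLM or image model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Frame:
    """One generated view (hypothetical container, not from the paper)."""
    prompt: str
    image: str  # stand-in for real image data


def build_world(
    director: Callable[[List[Frame]], str],
    generator: Callable[[str, List[Frame]], str],
    verify_2d: Callable[[Frame, List[Frame]], bool],
    verify_3d: Callable[[Frame, List[Frame]], bool],
    seed: Frame,
    num_views: int,
    max_attempts: int = 3,  # assumed retry budget per view
) -> List[Frame]:
    """Collect frames that pass both verification steps.

    Assumed flow: the director writes a prompt from the frames accepted
    so far, the generator produces a candidate view, and the candidate
    is kept only if the 2D check and then the 3D check both accept it.
    """
    accepted = [seed]
    for _ in range(num_views):
        for _ in range(max_attempts):
            prompt = director(accepted)
            candidate = Frame(prompt, generator(prompt, accepted))
            # Two-step verification: image-space check, then 3D check.
            if verify_2d(candidate, accepted) and verify_3d(candidate, accepted):
                accepted.append(candidate)
                break  # move on to the next view
    return accepted
```

In this sketch a view that fails verification `max_attempts` times is simply skipped, so the returned list may hold fewer than `num_views + 1` frames. The accepted frames would then be passed to a separate 3D reconstruction step, which is not modeled here.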
