World Agents: Can Foundation Image Models Be Agents for 3D World Models?

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world modeling capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their implicit 3D capability, we propose an agentic framework for 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes new image views, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3D reconstruction space. Crucially, we demonstrate that our agentic approach yields coherent and robust 3D reconstructions, producing scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we show that 2D models do indeed encode an understanding of 3D worlds. By exploiting this understanding, our method synthesizes expansive, realistic, and 3D-consistent worlds.
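
The director, generator, and two-step verifier described above can be read as a propose, generate, and filter loop. The following is a minimal sketch of that control flow only, not the authors' implementation: every name (`Frame`, `build_world`, `direct`, `generate`, `verify_2d`, `verify_3d`, `max_attempts`) is hypothetical, the retry policy is an assumption, and the model calls are passed in as plain callables so that no particular VLM or image-model API is implied.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Frame:
    """One generated view together with the prompt that produced it."""
    prompt: str
    image: Any  # placeholder for image data; the real type is an assumption


def build_world(
    direct: Callable[[List[Frame]], str],
    generate: Callable[[str, List[Frame]], Any],
    verify_2d: Callable[[Frame, List[Frame]], bool],
    verify_3d: Callable[[Frame, List[Frame]], bool],
    num_views: int,
    max_attempts: int = 3,  # hypothetical retry budget per view
) -> List[Frame]:
    """Collect up to `num_views` frames that pass both verification steps."""
    accepted: List[Frame] = []
    for _ in range(num_views):
        for _ in range(max_attempts):
            # Director: write a prompt conditioned on the frames kept so far.
            prompt = direct(accepted)
            # Generator: synthesize a candidate new view from that prompt.
            candidate = Frame(prompt, generate(prompt, accepted))
            # Verifier, step 1 (2D image space), then step 2 (3D
            # reconstruction space); step 2 runs only if step 1 passes.
            if verify_2d(candidate, accepted) and verify_3d(candidate, accepted):
                accepted.append(candidate)
                break
        # If every attempt is rejected, this view is skipped.
    return accepted
```

In this sketch a rejected candidate triggers a fresh prompt from the director, and the accepted frames would then be handed to a separate 3D reconstruction step that is not shown here.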