WorldAgents: Can Foundation Image Models be Agents for 3D World Models?

Abstract

Given the remarkable ability of 2D foundation image models to generate high-fidelity outputs, we investigate a fundamental question: do 2D foundation image models inherently possess 3D world model capabilities? To answer this, we systematically evaluate multiple state-of-the-art image generation models and Vision-Language Models (VLMs) on the task of 3D world synthesis. To harness and benchmark their latent 3D capability, we propose an agentic framework that facilitates 3D world generation. Our approach employs a multi-agent architecture: a VLM-based director that formulates prompts to guide image synthesis, a generator that synthesizes novel-view images, and a VLM-backed two-step verifier that evaluates and selectively curates generated frames in both 2D image space and 3D reconstruction space. Crucially, we demonstrate that our agentic approach yields coherent and robust 3D reconstructions, producing output scenes that can be explored by rendering novel views. Through extensive experiments across various foundation models, we show that 2D models do indeed encapsulate an understanding of 3D worlds. By exploiting this understanding, our method successfully synthesizes expansive, realistic, and 3D-consistent worlds.
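The director, generator, and two-step verifier described in the abstract can be pictured as a simple control loop. The following is only a minimal sketch under stated assumptions, not the authors' implementation: the names `Frame`, `World`, and `build_world`, the callable signatures, and the retry-on-rejection policy (`max_attempts`) are all hypothetical and do not come from the paper. The actual VLM and image-model calls are left as pluggable functions.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, List


@dataclass
class Frame:
    """One generated view (hypothetical container, not from the paper)."""
    prompt: str
    image: Any  # placeholder for pixel data


@dataclass
class World:
    """Frames accepted so far; a real system would also hold a 3D reconstruction."""
    frames: List[Frame] = field(default_factory=list)


def build_world(
    direct: Callable[[World], str],            # director: proposes the next prompt
    generate: Callable[[str, World], Frame],   # generator: synthesizes a new view
    verify_2d: Callable[[Frame, World], bool], # verifier step 1: image-space check
    verify_3d: Callable[[Frame, World], bool], # verifier step 2: reconstruction-space check
    num_views: int,
    max_attempts: int = 3,                     # assumption: retry a rejected view a few times
) -> World:
    world = World()
    for _ in range(num_views):
        for _ in range(max_attempts):
            prompt = direct(world)
            frame = generate(prompt, world)
            # Keep the frame only if it passes both verification steps.
            if verify_2d(frame, world) and verify_3d(frame, world):
                world.frames.append(frame)
                break
    return world


if __name__ == "__main__":
    # Toy stand-ins for the agents, purely to show the call pattern.
    world = build_world(
        direct=lambda w: f"view {len(w.frames)}",
        generate=lambda prompt, w: Frame(prompt, image=None),
        verify_2d=lambda f, w: True,
        verify_3d=lambda f, w: True,
        num_views=4,
    )
    print(len(world.frames))
```

Passing the agents in as plain callables keeps the sketch independent of any particular VLM or image-generation API; a view that fails verification on every attempt is simply skipped here, which is one possible policy among several.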
