Code-as-Room: 3D Room Generation from Top-Down View Images via Agentic Code Synthesis

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. Recent approaches based on multimodal large language models (MLLMs) have shown great potential for synthesizing 3D rooms from textual descriptions or reference images. However, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite loops when asked to generate an entire room from a top-down view. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework that represents 3D rooms as Blender code and is equipped with a structured execution harness. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, then synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate the context forgetting inherent to existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis that encompasses various evaluation protocols. Using this benchmark, we conduct comprehensive comparisons against existing agent-based methods to validate the effectiveness of the proposed execution harness.
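
To make the idea of a multi-stage pipeline with cross-stage memory concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes the image-parsing step has already produced a list of scene elements (here hard-coded rather than extracted by an MLLM), and every name in it (`SceneElement`, `StageMemory`, `geometry_stage`, `material_stage`, `lighting_stage`, `synthesize`) is hypothetical. Each stage emits lines of a Blender Python script as text and records facts in a shared memory object that later stages read. The placeholder cubes and the 2.5 m light height are likewise assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class SceneElement:
    """One object parsed from the top-down image (hypothetical schema)."""
    name: str
    x: float     # position in room coordinates
    y: float
    size: float  # edge length of a placeholder cube


@dataclass
class StageMemory:
    """Cross-stage memory: facts recorded by one stage for later stages."""
    notes: dict = field(default_factory=dict)


def geometry_stage(elements, memory):
    """Emit one placeholder cube per element and remember what was created."""
    lines = ["import bpy"]
    for e in elements:
        # Rest the cube on the floor: its centre sits at half its height.
        lines.append(
            f"bpy.ops.mesh.primitive_cube_add(size={e.size}, "
            f"location=({e.x}, {e.y}, {e.size / 2}))"
        )
        lines.append(f"bpy.context.object.name = {e.name!r}")
    memory.notes["object_names"] = [e.name for e in elements]
    if elements:
        cx = sum(e.x for e in elements) / len(elements)
        cy = sum(e.y for e in elements) / len(elements)
    else:
        cx, cy = 0.0, 0.0
    memory.notes["center"] = (cx, cy)
    return lines


def material_stage(memory):
    """Attach a fresh material to every object the geometry stage recorded."""
    lines = []
    for name in memory.notes["object_names"]:
        mat_name = name + "_mat"
        lines.append(f"mat = bpy.data.materials.new(name={mat_name!r})")
        lines.append(f"bpy.data.objects[{name!r}].data.materials.append(mat)")
    return lines


def lighting_stage(memory, height=2.5):
    """Place one point light above the remembered layout centre."""
    cx, cy = memory.notes["center"]
    return [
        f"bpy.ops.object.light_add(type='POINT', location=({cx}, {cy}, {height}))"
    ]


def synthesize(elements):
    """Run the stages in order, threading one memory object through them."""
    memory = StageMemory()
    script = geometry_stage(elements, memory)
    script += material_stage(memory)
    script += lighting_stage(memory)
    return "\n".join(script)


if __name__ == "__main__":
    demo = [SceneElement("bed", 1.0, 2.0, 2.0), SceneElement("desk", 3.0, 0.0, 1.0)]
    print(synthesize(demo))
```

The generated text is intended to be run inside Blender's Python environment. In this sketch the material and lighting stages never see the scene elements directly, only what the geometry stage wrote to memory, which is one simple way to keep later stages consistent with earlier decisions.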