Code-as-Room: Generating 3D Rooms from Top-Down View Images via Agentic Code Synthesis

Abstract

Designing realistic and functional 3D indoor rooms is essential for a wide range of applications, including interior design, virtual reality, gaming, and embodied AI. Recent approaches based on multimodal large language models (MLLMs) have shown great potential for synthesizing 3D rooms from textual descriptions or reference images. However, text-based methods struggle to capture precise spatial information, and existing image-conditioned agents suffer from instability and infinite loops when asked to generate a whole room from a top-down view. To address these limitations, we propose Code-as-Room, an MLLM-based agentic framework equipped with a structured execution harness that represents 3D rooms as Blender code. Given a top-down room image, the framework parses the reference image to extract scene elements and their spatial relationships, then synthesizes executable Blender code for geometry, materials, and lighting in a principled, multi-stage pipeline. A cross-stage memory module is maintained throughout to mitigate the context forgetting inherent in existing agent-based frameworks. We further introduce a dedicated benchmark for code-based 3D room synthesis that covers multiple evaluation protocols. Using this benchmark, we conduct comprehensive comparisons against existing agent-based methods, which validate the effectiveness of the proposed execution harness.
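To make the staged design concrete, the following is a minimal Python sketch of how a geometry, materials, and lighting pipeline with a shared memory and a bounded retry budget might be organized. It is an illustration under stated assumptions, not the Code-as-Room implementation: the names `Memory`, `run_pipeline`, the three stage functions, and the `max_attempts` parameter are all hypothetical, the image-parsing step is replaced by a hand-written dictionary, and the validation step only checks syntax with Python's built-in `compile()` rather than executing anything inside Blender. The emitted strings use standard `bpy` calls, but `bpy` itself is never imported by the sketch.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Memory:
    """Hypothetical cross-stage memory shared by every stage."""
    notes: Dict[str, object] = field(default_factory=dict)  # facts for later stages
    scripts: List[str] = field(default_factory=list)        # accepted Blender code


# A stage maps (memory, attempt index) to (Blender source, notes to remember).
Stage = Callable[[Memory, int], Tuple[str, Dict[str, object]]]


def geometry_stage(mem: Memory, attempt: int):
    # Assumption: an upstream parser stored objects as {"name", "xy"} dicts.
    lines = ["import bpy"]
    for obj in mem.notes["objects"]:
        x, y = obj["xy"]
        lines.append(
            f"bpy.ops.mesh.primitive_cube_add(size=1.0, location=({x}, {y}, 0.5))"
        )
        lines.append(f"bpy.context.active_object.name = {obj['name']!r}")
    placed = [obj["name"] for obj in mem.notes["objects"]]
    return "\n".join(lines), {"placed": placed}


def material_stage(mem: Memory, attempt: int):
    # Reads what the geometry stage recorded instead of re-deriving it.
    lines = ["import bpy"]
    for name in mem.notes["placed"]:
        mat_name = f"{name}_mat"
        lines.append(f"mat = bpy.data.materials.new(name={mat_name!r})")
        lines.append(f"bpy.data.objects[{name!r}].data.materials.append(mat)")
    return "\n".join(lines), {}


def lighting_stage(mem: Memory, attempt: int):
    height = mem.notes.get("ceiling_height", 2.5)  # assumed default, in metres
    lines = [
        "import bpy",
        "light = bpy.data.lights.new(name='ceiling', type='POINT')",
        "light_obj = bpy.data.objects.new('ceiling', light)",
        "bpy.context.collection.objects.link(light_obj)",
        f"light_obj.location = (0.0, 0.0, {height})",
    ]
    return "\n".join(lines), {}


DEFAULT_STAGES: List[Tuple[str, Stage]] = [
    ("geometry", geometry_stage),
    ("materials", material_stage),
    ("lighting", lighting_stage),
]


def run_pipeline(stages, mem: Memory, max_attempts: int = 3) -> str:
    """Run stages in order; each stage gets a fixed retry budget, so it cannot loop forever."""
    for name, stage in stages:
        for attempt in range(max_attempts):
            source, updates = stage(mem, attempt)
            try:
                compile(source, f"<{name}>", "exec")  # syntax check only
            except SyntaxError:
                continue  # retry this stage; memory is left untouched
            mem.scripts.append(source)
            mem.notes.update(updates)  # commit notes only for accepted code
            break
        else:
            raise RuntimeError(f"stage {name!r} failed after {max_attempts} attempts")
    return "\n\n".join(mem.scripts)
```

A call such as `run_pipeline(DEFAULT_STAGES, Memory(notes={"objects": [{"name": "bed", "xy": (1.0, 2.0)}]}))` returns one script that could be pasted into Blender's scripting workspace. Two design choices in the sketch mirror the ideas named in the abstract: later stages read from the shared memory rather than from a growing conversation history, and every stage has a hard attempt limit, so a failing stage raises an error instead of retrying indefinitely.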