Global-Local Tree Search for Language Guided 3D Scene Generation

Abstract

Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper treats the task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places objects one at a time and explores multiple candidate placements for each object, so that the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically into the room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose each sub-task, the placement of a single object, into multiple steps. The algorithm searches the tree of the problem space. To have the VLM produce object positions, we discretize the top-down view space into a dense grid and fill each cell with a different emoji so that the cells are distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with emoji names. Quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen.
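The emoji-grid idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `build_emoji_grid` and `cell_center`, the choice of the animal emoji block starting at U+1F400, and the use of Unicode character names as the "emoji names" are all assumptions made for illustration. The sketch only shows how a top-down room view might be discretized into uniquely labelled cells and how a named cell could be mapped back to room coordinates.

```python
import unicodedata


def build_emoji_grid(rows, cols, first_codepoint=0x1F400):
    """Assign one distinct emoji to every cell of a rows x cols top-down grid.

    Assumption: consecutive code points starting at U+1F400 (animal emoji)
    are used; the paper does not specify which emojis fill the grid.
    """
    grid = [[chr(first_codepoint + r * cols + c) for c in range(cols)]
            for r in range(rows)]
    # Map each emoji's Unicode name to its (row, col) cell for later lookup.
    index = {unicodedata.name(e): (r, c)
             for r, row in enumerate(grid) for c, e in enumerate(row)}
    if len(index) != rows * cols:
        raise ValueError("every cell needs a unique emoji name")
    return grid, index


def cell_center(name, index, room_width, room_depth, rows, cols):
    """Convert an emoji name (as a VLM might reply) to the cell-center (x, y)."""
    r, c = index[name.strip().upper()]
    return ((c + 0.5) * room_width / cols, (r + 0.5) * room_depth / rows)


# Build a small 4 x 4 grid and render it as text that could go into a prompt.
grid, index = build_emoji_grid(rows=4, cols=4)
prompt_grid = "\n".join("".join(row) for row in grid)

# Hypothetical VLM answer naming the top-left cell of a 4 m x 4 m room.
x, y = cell_center("rat", index, room_width=4.0, room_depth=4.0, rows=4, cols=4)
```

In a full pipeline, the text in `prompt_grid` would be sent to the VLM together with the object description, and the emoji name in the reply would be resolved with `cell_center`. A denser grid needs a larger pool of emojis with unique names.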
