Global-Local Tree Search for Language Guided 3D Scene Generation

Abstract

Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, there are few studies on 3D indoor scene generation with VLMs. This paper treats the task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places objects one at a time and explores multiple candidate placements for each object, so that the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically into the room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps, and the algorithm searches the resulting tree of the problem space. To leverage the VLM to produce object positions, we discretize the top-down view space as a dense grid and fill each cell with a different emoji to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of emojis. Quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen.
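The emoji-grid idea can be illustrated with a minimal sketch. This is not the authors' implementation: the function names (build_emoji_grid, render_prompt_grid, locate), the choice of the Unicode animal block starting at U+1F400 as the emoji source, and the use of Unicode character names as the "emoji names" are all assumptions made for illustration. The sketch fills each cell of a top-down grid with a distinct emoji, renders the grid as prompt text, and maps an emoji name (as a VLM might return it) back to the center of the corresponding cell in room coordinates.

```python
import unicodedata


def build_emoji_grid(rows, cols, room_width, room_depth, start=0x1F400):
    """Fill a rows x cols top-down grid with distinct emojis.

    Returns (grid, name_to_center), where grid is a list of rows of emoji
    characters and name_to_center maps each emoji's Unicode name to the
    (x, y) center of its cell in room units.
    The starting code point is an illustrative assumption.
    """
    cell_w = room_width / cols
    cell_d = room_depth / rows
    grid = []
    name_to_center = {}
    code = start
    for r in range(rows):
        row = []
        for c in range(cols):
            # Skip code points that have no assigned Unicode name.
            while unicodedata.name(chr(code), None) is None:
                code += 1
            ch = chr(code)
            code += 1
            row.append(ch)
            center = ((c + 0.5) * cell_w, (r + 0.5) * cell_d)
            name_to_center[unicodedata.name(ch)] = center
        grid.append(row)
    return grid, name_to_center


def render_prompt_grid(grid):
    """Render the grid as plain text that could be embedded in a prompt."""
    return "\n".join(" ".join(row) for row in grid)


def locate(emoji_name, name_to_center):
    """Map an emoji name from a (hypothetical) VLM reply to a cell center."""
    return name_to_center[emoji_name.strip().upper()]


if __name__ == "__main__":
    grid, lookup = build_emoji_grid(rows=4, cols=4, room_width=4.0, room_depth=4.0)
    print(render_prompt_grid(grid))
    # Pretend the VLM answered with the name of the top-left emoji.
    print(locate("rat", lookup))
```

Because every cell carries a unique symbol with a unique name, a textual answer identifies exactly one cell, which sidesteps asking the model for raw numeric coordinates. A denser grid gives finer placement at the cost of a longer prompt.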
