Global-Local Tree Search for Language Guided 3D Scene Generation

Abstract

Large Vision-Language Models (VLMs), such as GPT-4, have achieved remarkable success across various fields. However, few studies have addressed 3D indoor scene generation with VLMs. This paper treats the task as a planning problem subject to spatial and layout common-sense constraints. To solve the problem with a VLM, we propose a new global-local tree search algorithm. Globally, the method places the objects sequentially and explores multiple candidate placements at each placement step, so the problem space is represented as a tree. To reduce the depth of the tree, we decompose the scene structure hierarchically into room level, region level, floor object level, and supported object level. The algorithm independently generates the floor objects in different regions and the supported objects placed on different floor objects. Locally, we also decompose the sub-task, the placement of each object, into multiple steps. The algorithm then searches the tree of the problem space. To leverage the VLM to produce object positions, we discretize the top-down view space as a dense grid and fill the cells with diverse emojis to make the cells distinct. We prompt the VLM with the emoji grid, and the VLM produces a reasonable location for the object by describing the position with the names of emojis. Quantitative and qualitative experimental results show that our approach generates more plausible 3D scenes than state-of-the-art approaches. Our source code is available at https://github.com/dw-dengwei/TreeSearchGen.
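To make the emoji-grid idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes one distinct emoji per cell, drawn from the Unicode emoticon range and named with Python's `unicodedata` module. The function names (`build_emoji_palette`, `make_emoji_grid`, `render_grid`, `locate`), the grid size, and the cell size are all hypothetical, and the VLM call itself is omitted: `locate` only shows how a textual reply that mentions an emoji name could be mapped back to a position in the top-down view.

```python
import unicodedata


def build_emoji_palette(count, start=0x1F600):
    """Collect `count` distinct emojis with lowercase Unicode names.

    Assumption: any set of visually distinct, nameable symbols would do;
    the emoticon range is used here only because it is easy to enumerate.
    """
    palette = []
    code = start
    while len(palette) < count:
        ch = chr(code)
        name = unicodedata.name(ch, "")  # "" if the code point is unnamed
        if name:
            palette.append((ch, name.lower()))
        code += 1
    return palette


def make_emoji_grid(rows, cols):
    """Fill a rows x cols top-down grid so that every cell holds a unique emoji.

    Returns the grid and a lookup from emoji name to (row, col).
    """
    palette = build_emoji_palette(rows * cols)
    grid, lookup = [], {}
    for r in range(rows):
        row = []
        for c in range(cols):
            emoji, name = palette[r * cols + c]
            row.append(emoji)
            lookup[name] = (r, c)
        grid.append(row)
    return grid, lookup


def render_grid(grid):
    """Render the grid as text that could be embedded in a prompt."""
    return "\n".join("".join(row) for row in grid)


def locate(answer, lookup, cell_size):
    """Map a reply that mentions an emoji name to the cell centre (x, y).

    Longer names are tried first because some emoji names are prefixes
    of others (e.g. "grinning face" vs. "grinning face with smiling eyes").
    """
    text = answer.lower()
    for name in sorted(lookup, key=len, reverse=True):
        if name in text:
            row, col = lookup[name]
            return ((col + 0.5) * cell_size, (row + 0.5) * cell_size)
    return None


if __name__ == "__main__":
    grid, lookup = make_emoji_grid(rows=3, cols=4)
    print(render_grid(grid))
    some_name = next(iter(lookup))
    print(some_name, "->", locate(f"Place it at '{some_name}'.", lookup, 0.5))
```

Naming cells by emoji rather than by numeric coordinates means the model's reply only has to mention a symbol it can see in the prompt; the conversion back to metric coordinates stays in ordinary code.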
