PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Abstract

We introduce the novel task of language-guided object placement in real 3D scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes, such as grounding, this task poses specific challenges: it is ambiguous because it admits multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLMs.
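
To make the free-space requirement concrete, the following is a minimal sketch of one possible validity check for a candidate placement. It is an illustration under stated assumptions, not the benchmark's evaluation protocol or the baseline method: the function name `is_free_placement`, the axis-aligned bounding-box approximation of the asset, and the `margin` tolerance are all hypothetical choices made for this example. It ignores rotation and the language prompt entirely and only tests whether any scene point falls inside the asset's box at the candidate position.

```python
import numpy as np

def is_free_placement(scene_points, asset_points, position, margin=0.0):
    """Return True if no scene point lies inside the asset's axis-aligned box.

    Hypothetical helper for illustration only. The asset is approximated by
    its axis-aligned bounding box, centered at `position`; `margin` shrinks
    the box on every side so that contact with a supporting surface is
    tolerated.
    """
    scene_points = np.asarray(scene_points, dtype=float)
    asset_points = np.asarray(asset_points, dtype=float)
    position = np.asarray(position, dtype=float)

    # Bounding box of the asset in its own coordinate frame.
    lo = asset_points.min(axis=0)
    hi = asset_points.max(axis=0)
    center = (lo + hi) / 2.0

    # Move the box so its center sits at the candidate position, then shrink it.
    box_min = lo - center + position + margin
    box_max = hi - center + position - margin

    # A scene point collides if it is inside the box along all three axes.
    inside = np.all((scene_points >= box_min) & (scene_points <= box_max), axis=1)
    return not inside.any()

# Toy scene: a flat floor at z = 0 plus a single obstacle point.
floor = np.array([[x, y, 0.0] for x in range(-3, 4) for y in range(-3, 4)])
obstacle = np.array([[2.0, 2.0, 0.5]])
scene = np.vstack([floor, obstacle])

# Toy asset: the eight corners of a unit cube.
cube = np.array([[x, y, z] for x in (0.0, 1.0) for y in (0.0, 1.0) for z in (0.0, 1.0)])

resting_on_floor = is_free_placement(scene, cube, position=(0.0, 0.0, 0.5), margin=0.01)
on_obstacle = is_free_placement(scene, cube, position=(2.0, 2.0, 0.5), margin=0.01)
```

A check of this kind covers only the geometric half of the task. Because several placements can be valid for the same prompt, a complete evaluation would also need to test whether the placement satisfies the spatial relations the prompt describes.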