PlaceIt3D: Language-Guided Object Placement in Real 3D Scenes

Abstract

We introduce the novel task of Language-Guided Object Placement in Real 3D Scenes. Our model is given a 3D scene's point cloud, a 3D asset, and a textual prompt broadly describing where the 3D asset should be placed. The task is to find a valid placement for the 3D asset that respects the prompt. Compared with other language-guided localization tasks in 3D scenes, such as grounding, this task poses specific challenges: it is ambiguous because it admits multiple valid solutions, and it requires reasoning about 3D geometric relationships and free space. We inaugurate this task by proposing a new benchmark and evaluation protocol. We also introduce a new dataset for training 3D LLMs on this task, as well as the first method to serve as a non-trivial baseline. We believe that this challenging task and our new benchmark could become part of the suite of benchmarks used to evaluate and compare generalist 3D LLMs.
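To make the notion of "free space" concrete, the following is a minimal sketch of one possible validity check. It is not the benchmark's evaluation protocol. It assumes the asset is approximated by an axis-aligned bounding box and the scene by a list of 3D points. The names `Placement`, `points_inside`, and `is_free_space` are hypothetical, and the language constraints expressed in the prompt are not modeled at all.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Point = Tuple[float, float, float]


@dataclass
class Placement:
    """Hypothetical placement: axis-aligned box given by its centre and full size."""
    center: Point
    size: Point


def points_inside(scene_points: Iterable[Point], placement: Placement) -> int:
    """Count scene points lying strictly inside the asset's bounding box."""
    cx, cy, cz = placement.center
    hx, hy, hz = (s / 2.0 for s in placement.size)  # half extents
    count = 0
    for x, y, z in scene_points:
        # Strict inequalities: points on the box surface (e.g. the floor the
        # asset rests on) are not treated as collisions.
        if abs(x - cx) < hx and abs(y - cy) < hy and abs(z - cz) < hz:
            count += 1
    return count


def is_free_space(scene_points: Iterable[Point], placement: Placement,
                  max_points_inside: int = 0) -> bool:
    """Assumed criterion: a placement is free if (almost) no scene point is inside it."""
    return points_inside(scene_points, placement) <= max_points_inside


if __name__ == "__main__":
    # Toy scene: a flat floor sampled on a grid, plus a thin vertical obstacle.
    floor = [(0.5 * i, 0.5 * j, 0.0) for i in range(9) for j in range(9)]
    obstacle = [(2.0, 2.0, 0.25), (2.0, 2.0, 0.5), (2.0, 2.0, 0.75)]
    scene = floor + obstacle

    clear = Placement(center=(0.75, 0.75, 0.25), size=(0.5, 0.5, 0.5))
    blocked = Placement(center=(2.0, 2.0, 0.25), size=(0.5, 0.5, 0.5))
    print(is_free_space(scene, clear))
    print(is_free_space(scene, blocked))
```

Because several placements can pass such a check for the same prompt, a check of this kind can only reject invalid candidates. It cannot single out one correct answer, which reflects the ambiguity the abstract describes.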