
3D-LLM: Injecting the 3D World into Large Language Models

July 24, 2023
Authors: Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
cs.AI

Abstract

Large language models (LLMs) and Vision-Language Models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data pairs covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/.
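
To make the pipeline described in the abstract concrete, below is a minimal sketch, not the authors' released code: per-view 2D features from rendered images are averaged onto 3D points, then projected into the token space of a 2D VLM backbone together with embeddings of discretized location tokens (one plausible reading of the "3D localization mechanism"). The function and class names, feature dimensions, mean-pooling aggregation, and the location-token scheme are all illustrative assumptions.

```python
import torch
import torch.nn as nn


def lift_2d_features_to_3d(view_feats, point_to_pixel):
    """Average per-view 2D features onto 3D points (assumed aggregation scheme).

    view_feats:     (V, H*W, C) dense features from a frozen 2D encoder, one row per view.
    point_to_pixel: (V, N) long tensor; pixel index each 3D point projects to in each view.
    Returns:        (N, C) aggregated 3D point features.
    """
    per_view = torch.stack(
        [view_feats[v][point_to_pixel[v]] for v in range(view_feats.shape[0])]
    )  # (V, N, C)
    return per_view.mean(dim=0)


class PointToVLMAdapter(nn.Module):
    """Projects 3D point features into a 2D VLM backbone's embedding space and adds
    learned embeddings for discretized x/y/z location tokens; the actual 3D-LLM
    localization mechanism may differ from this toy version."""

    def __init__(self, feat_dim=1408, llm_dim=768, num_loc_bins=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_dim)
        self.loc_embed = nn.Embedding(3 * num_loc_bins, llm_dim)

    def forward(self, point_feats, loc_token_ids):
        # point_feats:   (N, feat_dim) output of lift_2d_features_to_3d
        # loc_token_ids: (N, 3) discretized x/y/z bin indices, offset per axis
        return self.proj(point_feats) + self.loc_embed(loc_token_ids).sum(dim=1)


if __name__ == "__main__":
    V, HW, C, N = 4, 32 * 32, 1408, 2000
    view_feats = torch.randn(V, HW, C)
    point_to_pixel = torch.randint(0, HW, (V, N))
    point_feats = lift_2d_features_to_3d(view_feats, point_to_pixel)

    adapter = PointToVLMAdapter()
    loc_tokens = torch.randint(0, 3 * 256, (N, 3))
    tokens_for_vlm = adapter(point_feats, loc_tokens)  # (N, 768) soft prompts for the VLM
    print(tokens_for_vlm.shape)
```

In this sketch the resulting point tokens would be prepended or cross-attended to the language tokens of the 2D VLM backbone; how the released model consumes them is not specified in the abstract.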