
3D-LLM: Injecting the 3D World into Large Language Models

July 24, 2023
Authors: Yining Hong, Haoyu Zhen, Peihao Chen, Shuhong Zheng, Yilun Du, Zhenfang Chen, Chuang Gan
cs.AI

Abstract

Large language models (LLMs) and vision-language models (VLMs) have been proven to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, layout, and so on. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, navigation, and so on. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialogue show that our model outperforms 2D VLMs. Qualitative examples also show that our model could perform more tasks beyond the scope of existing LLMs and VLMs. Project Page: https://vis-www.cs.umass.edu/3dllm/.
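The abstract outlines a two-stage pipeline: a 3D feature extractor lifts features from rendered multi-view images onto 3D points, and a 2D VLM backbone then consumes those aggregated point features. The sketch below is a minimal, hypothetical PyTorch illustration of that idea, assuming simple multi-view back-projection with averaging and a toy query-based adapter; the names `project_points`, `aggregate_3d_features`, and `ToyQFormer` are illustrative assumptions, not the authors' released code, and the paper's 3D localization mechanism is omitted.

```python
# Hypothetical sketch (not the 3D-LLM release): back-project per-pixel 2D
# features from rendered multi-view images onto 3D points, average them per
# point, and pass the result through a small query-based adapter that could
# feed a frozen 2D VLM backbone.
import torch
import torch.nn as nn


def project_points(points, intrinsics, extrinsics, hw):
    """Project Nx3 world-space points into pixel coordinates for one view."""
    h, w = hw
    ones = torch.ones(points.shape[0], 1)
    # world -> camera coordinates (extrinsics assumed 4x4 world-to-camera)
    cam = (extrinsics @ torch.cat([points, ones], dim=1).T).T[:, :3]
    # camera -> pixel coordinates (intrinsics assumed 3x3)
    uv = (intrinsics @ cam.T).T
    uv = uv[:, :2] / uv[:, 2:3].clamp(min=1e-6)
    valid = (cam[:, 2] > 0) & (uv[:, 0] >= 0) & (uv[:, 0] < w) \
            & (uv[:, 1] >= 0) & (uv[:, 1] < h)
    return uv.long(), valid


def aggregate_3d_features(points, views):
    """Average per-pixel 2D features over all views where a point is visible.

    Each view is a dict with 'features' (D, H, W), 'intrinsics', 'extrinsics'.
    """
    n, d = points.shape[0], views[0]["features"].shape[0]
    feat_sum = torch.zeros(n, d)
    count = torch.zeros(n, 1)
    for v in views:
        uv, valid = project_points(points, v["intrinsics"], v["extrinsics"],
                                   v["features"].shape[1:])
        idx = valid.nonzero(as_tuple=True)[0]
        feat_sum[idx] += v["features"][:, uv[idx, 1], uv[idx, 0]].T
        count[idx] += 1
    return feat_sum / count.clamp(min=1)


class ToyQFormer(nn.Module):
    """Stand-in adapter: learned queries cross-attend to 3D point features,
    producing a fixed number of tokens that an LLM could consume."""

    def __init__(self, dim=256, n_queries=32):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(n_queries, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)

    def forward(self, point_features):
        q = self.queries.unsqueeze(0)        # (1, Q, D)
        kv = point_features.unsqueeze(0)     # (1, N, D)
        out, _ = self.attn(q, kv, kv)
        return out                           # (1, Q, D) tokens for the LLM
```

In this reading, the heavy lifting is done by a pretrained 2D VLM whose image features are reused for 3D scenes, which is why the abstract emphasizes training on top of 2D VLM backbones rather than learning 3D representations from scratch.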