Grounded 3D-LLM with Referent Tokens

May 16, 2024
Authors: Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang
cs.AI

Abstract

Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling the handling of sequences that interleave 3D and textual data. It offers a natural approach for translating 3D vision tasks into language formats using task-specific instruction templates. To facilitate the use of referent tokens in subsequent language modeling, we have curated large-scale grounded language datasets that offer finer scene-text correspondence at the phrase level by bootstrapping existing object labels. Subsequently, we introduced Contrastive LAnguage-Scene Pre-training (CLASP) to effectively leverage this data, thereby integrating 3D vision with language models. Our comprehensive evaluation covers open-ended tasks like dense captioning and 3D QA, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks reveal the leading performance and the broad applicability of Grounded 3D-LLM. Code and datasets will be released on the project page: https://groundedscenellm.github.io/grounded_3d-llm.github.io.
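
The abstract does not spell out the CLASP objective, so the following is only a minimal sketch of the general idea of contrastive phrase-to-object alignment. It assumes a CLIP-style symmetric InfoNCE loss as a stand-in. The function name `symmetric_info_nce`, the temperature value, and the one-phrase-to-one-object pairing are illustrative assumptions and are not taken from the paper.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _cross_entropy(logits, target):
    """Numerically stable cross-entropy of one row of logits against a target index."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(l - peak) for l in logits))
    return log_sum_exp - logits[target]


def symmetric_info_nce(phrase_embs, object_embs, temperature=0.07):
    """CLIP-style symmetric contrastive loss (an assumed stand-in, not the paper's loss).

    Assumption: phrase_embs[i] and object_embs[i] form a matched pair, and every
    other combination in the batch is treated as a negative.
    """
    assert len(phrase_embs) == len(object_embs)
    phrases = [_normalize(v) for v in phrase_embs]
    objects = [_normalize(v) for v in object_embs]
    n = len(phrases)

    # Temperature-scaled cosine similarity between every phrase and every object.
    sim = [
        [sum(a * b for a, b in zip(phrases[i], objects[j])) / temperature
         for j in range(n)]
        for i in range(n)
    ]

    # Phrase-to-object direction: each row should peak on its own index.
    loss_p2o = sum(_cross_entropy(sim[i], i) for i in range(n)) / n
    # Object-to-phrase direction: each column should peak on its own index.
    loss_o2p = sum(
        _cross_entropy([sim[i][j] for i in range(n)], j) for j in range(n)
    ) / n
    return 0.5 * (loss_p2o + loss_o2p)


# Toy 2-D embeddings: matched pairs point the same way in `aligned`,
# and are deliberately crossed in `swapped`.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.9, 0.1], [0.1, 0.9]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.1, 0.9], [0.9, 0.1]])
```

In this toy setup the loss is lower when each phrase embedding points toward its own object embedding than when the pairs are crossed, which is the behavior a contrastive alignment objective is meant to reward.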
