Grounded 3D-LLM with Referent Tokens
May 16, 2024
Authors: Yilun Chen, Shuai Yang, Haifeng Huang, Tai Wang, Ruiyuan Lyu, Runsen Xu, Dahua Lin, Jiangmiao Pang
cs.AI
Abstract
Prior studies on 3D scene understanding have primarily developed specialized
models for specific tasks or required task-specific fine-tuning. In this study,
we propose Grounded 3D-LLM, which explores the potential of 3D large
multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a
unified generative framework. The model uses scene referent tokens as special
noun phrases to reference 3D scenes, enabling the handling of sequences that
interleave 3D and textual data. It offers a natural approach for translating 3D
vision tasks into language formats using task-specific instruction templates.
To facilitate the use of referent tokens in subsequent language modeling, we
have curated large-scale grounded language datasets that offer finer scene-text
correspondence at the phrase level by bootstrapping existing object labels.
We then introduce Contrastive LAnguage-Scene Pre-training (CLASP) to
effectively leverage this data, thereby integrating 3D vision with language
models. Our comprehensive evaluation covers open-ended tasks like dense
captioning and 3D QA, alongside closed-ended tasks such as object detection and
language grounding. Experiments across multiple 3D benchmarks reveal the
leading performance and the broad applicability of Grounded 3D-LLM. Code and
datasets will be released on the project page:
https://groundedscenellm.github.io/grounded_3d-llm.github.io.
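
To make the idea of phrase-level contrastive alignment more concrete, the sketch below computes a generic symmetric InfoNCE-style loss between text-phrase embeddings and scene-object embeddings. It is not taken from the paper and may differ from the actual CLASP objective: the function name `symmetric_contrastive_loss`, the temperature value, and the assumption that phrase i matches exactly object i are all illustrative choices made here.

```python
import math


def _normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def _log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    lse = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - lse for x in row]


def symmetric_contrastive_loss(phrase_embs, object_embs, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not the paper's code).

    Assumes phrase_embs[i] and object_embs[i] form the only positive pair;
    every other pairing in the batch is treated as a negative.
    """
    if not phrase_embs or len(phrase_embs) != len(object_embs):
        raise ValueError("need equally sized, non-empty embedding lists")

    phrases = [_normalize(v) for v in phrase_embs]
    objects = [_normalize(v) for v in object_embs]
    n = len(phrases)

    # Cosine similarity matrix scaled by the temperature.
    sim = [
        [sum(a * b for a, b in zip(p, o)) / temperature for o in objects]
        for p in phrases
    ]

    # Phrase-to-object direction: each row should peak on the diagonal.
    p2o = -sum(_log_softmax(sim[i])[i] for i in range(n)) / n

    # Object-to-phrase direction: each column should peak on the diagonal.
    columns = [[sim[i][j] for i in range(n)] for j in range(n)]
    o2p = -sum(_log_softmax(columns[j])[j] for j in range(n)) / n

    return 0.5 * (p2o + o2p)


if __name__ == "__main__":
    aligned = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[1.0, 0.0], [0.0, 1.0]])
    swapped = symmetric_contrastive_loss([[1.0, 0.0], [0.0, 1.0]],
                                         [[0.0, 1.0], [1.0, 0.0]])
    print(aligned < swapped)
```

In this toy setting, correctly paired embeddings yield a much smaller loss than swapped ones. A real grounded dataset would likely need to handle phrases that refer to several objects at once, which this one-to-one sketch does not cover.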