3D-LLM: Injecting the 3D World into Large Language Models

Abstract

Large language models (LLMs) and vision-language models (VLMs) have been shown to excel at multiple tasks, such as commonsense reasoning. Powerful as these models can be, they are not grounded in the 3D physical world, which involves richer concepts such as spatial relationships, affordances, physics, and layout. In this work, we propose to inject the 3D world into large language models and introduce a whole new family of 3D-LLMs. Specifically, 3D-LLMs can take 3D point clouds and their features as input and perform a diverse set of 3D-related tasks, including captioning, dense captioning, 3D question answering, task decomposition, 3D grounding, 3D-assisted dialog, and navigation. Using three types of prompting mechanisms that we design, we are able to collect over 300k 3D-language data covering these tasks. To efficiently train 3D-LLMs, we first utilize a 3D feature extractor that obtains 3D features from rendered multi-view images. Then, we use 2D VLMs as our backbones to train our 3D-LLMs. By introducing a 3D localization mechanism, 3D-LLMs can better capture 3D spatial information. Experiments on ScanQA show that our model outperforms state-of-the-art baselines by a large margin (e.g., the BLEU-1 score surpasses the state-of-the-art score by 9%). Furthermore, experiments on our held-in datasets for 3D captioning, task composition, and 3D-assisted dialog show that our model outperforms 2D VLMs. Qualitative examples also show that our model can perform tasks beyond the scope of existing LLMs and VLMs. Project page: https://vis-www.cs.umass.edu/3dllm/.
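As a rough illustration of what "obtaining 3D features from rendered multi-view images" could involve, the sketch below averages per-pixel 2D features onto the 3D points they correspond to. This is only one plausible aggregation scheme and is not confirmed by the abstract; the function name `aggregate_view_features`, its arguments, and the assumption that pixel-to-point correspondences are already known are all hypothetical.

```python
def aggregate_view_features(num_points, view_features, pixel_to_point):
    """Average 2D pixel features onto 3D points (hypothetical helper).

    view_features:  one list per rendered view; each holds one feature
                    vector (list of floats) per pixel.
    pixel_to_point: one list per view; each entry is the index of the 3D
                    point seen at that pixel, or None if no point is hit.
    Returns one averaged feature vector per point, or None for points
    that are not visible in any view.
    """
    feat_dim = len(view_features[0][0])
    sums = [[0.0] * feat_dim for _ in range(num_points)]
    counts = [0] * num_points

    for feats, mapping in zip(view_features, pixel_to_point):
        for feat, point_idx in zip(feats, mapping):
            if point_idx is None:
                continue  # pixel does not correspond to any 3D point
            counts[point_idx] += 1
            for d, value in enumerate(feat):
                sums[point_idx][d] += value

    return [
        [s / counts[i] for s in sums[i]] if counts[i] else None
        for i in range(num_points)
    ]


# Two toy views with two pixels each and 2-dimensional features.
views = [[[1.0, 0.0], [0.0, 2.0]], [[3.0, 0.0], [5.0, 5.0]]]
mapping = [[0, 1], [0, None]]
point_features = aggregate_view_features(3, views, mapping)
```

In this toy setup, point 0 is seen in both views and receives the mean of its two pixel features, point 1 is seen once, and point 2 is never seen. The resulting per-point features are the kind of input a 2D VLM backbone could then consume, under the assumptions stated above.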
