Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

Abstract

We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, which enables many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we further present Point-LLM, the first 3D large language model (LLM) to follow 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA. It requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light for the community on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
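
To make the idea of 3D embedding arithmetic in a joint embedding space concrete, the following is a minimal sketch and not the released implementation. It assumes toy 4-dimensional vectors standing in for real encoder outputs; the function names `l2_normalize` and `compose_and_retrieve`, and all vectors, are hypothetical illustrations. Under these assumptions, a point-cloud embedding and an audio embedding are summed, renormalized, and used to rank a small gallery of image embeddings by cosine similarity.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit L2 norm along the given axis."""
    norm = np.linalg.norm(x, axis=axis, keepdims=True)
    return x / np.maximum(norm, eps)


def compose_and_retrieve(point_emb, other_emb, gallery, top_k=1):
    """Sum two unit-norm embeddings, renormalize the result, and return
    the indices and cosine scores of the top_k gallery entries."""
    query = l2_normalize(l2_normalize(point_emb) + l2_normalize(other_emb))
    scores = l2_normalize(gallery) @ query
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]


# Toy stand-ins for embeddings in a shared space (hypothetical values).
point_emb = np.array([1.0, 0.0, 0.0, 0.0])  # e.g. a 3D shape of a car
audio_emb = np.array([0.0, 1.0, 0.0, 0.0])  # e.g. the sound of sea waves
gallery = np.array([
    [1.0, 0.0, 0.0, 0.0],  # image matching only the 3D shape
    [0.0, 1.0, 0.0, 0.0],  # image matching only the audio
    [1.0, 1.0, 0.0, 0.0],  # image matching both
    [0.0, 0.0, 1.0, 0.0],  # unrelated image
])

idx, scores = compose_and_retrieve(point_emb, audio_emb, gallery, top_k=1)
print(idx, scores)
```

In this toy setup the gallery entry that combines both concepts ranks first, which is the behavior composed cross-modal retrieval relies on. With real models, the vectors would come from the 3D encoder and the other modality encoders that share the joint space.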
