Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following
September 1, 2023
Authors: Ziyu Guo, Renrui Zhang, Xiangyang Zhu, Yiwen Tang, Xianzheng Ma, Jiaming Han, Kexin Chen, Peng Gao, Xianzhi Li, Hongsheng Li, Pheng-Ann Heng
cs.AI
Abstract
We introduce Point-Bind, a 3D multi-modality model aligning point clouds with
2D images, language, audio, and video. Guided by ImageBind, we construct a joint
embedding space between 3D and multi-modalities, enabling many promising
applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D
open-world understanding. On top of this, we further present Point-LLM, the
first 3D large language model (LLM) to follow 3D multi-modal instructions. Via
parameter-efficient fine-tuning, Point-LLM injects the semantics of Point-Bind
into pre-trained LLMs, e.g., LLaMA; it requires no 3D instruction data, yet
exhibits superior 3D and multi-modal question-answering capacity. We hope our
work can shed light for the community on extending 3D point clouds to
multi-modality applications. Code is available at
https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
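
The sketch below illustrates the idea of a joint embedding space enabling 3D open-world understanding and 3D embedding arithmetic, as described in the abstract. It is a minimal, self-contained assumption: the encoders (`point_encoder`, `text_encoder`, `audio_encoder`) are placeholder modules standing in for Point-Bind's 3D encoder and ImageBind's text/audio encoders, not the released Point-Bind API; only the normalize-then-retrieve logic reflects the described approach.

```python
# Hypothetical sketch: zero-shot 3D classification and embedding arithmetic
# in a shared embedding space. The encoders below are toy stand-ins, not the
# actual Point-Bind / ImageBind models.
import torch
import torch.nn as nn
import torch.nn.functional as F

dim = 512  # assumed shared embedding dimension

# Placeholder encoders mapping each modality into the same space.
point_encoder = nn.Linear(3 * 1024, dim)   # flattened 1024-point cloud -> embedding
text_encoder  = nn.Embedding(100, dim)     # toy "text" encoder: one vector per class id
audio_encoder = nn.Linear(16000, dim)      # 1-second 16 kHz waveform -> embedding

def embed(x, encoder):
    """Encode and L2-normalize so cosine similarity is a plain dot product."""
    return F.normalize(encoder(x), dim=-1)

# Zero-shot 3D open-world classification: the nearest text embedding wins.
points = torch.randn(1, 3 * 1024)          # a dummy point cloud
class_ids = torch.arange(10)                # 10 candidate class prompts
z_pc   = embed(points, point_encoder)       # (1, dim)
z_text = embed(class_ids, text_encoder)     # (10, dim)
logits = z_pc @ z_text.t()                  # cosine similarities
print("predicted class id:", logits.argmax(dim=-1).item())

# 3D embedding arithmetic: compose the point-cloud embedding with an audio
# embedding, then retrieve against the same text gallery.
wave = torch.randn(1, 16000)
z_audio = embed(wave, audio_encoder)
z_mix = F.normalize(z_pc + z_audio, dim=-1)  # composed query
print("composed retrieval:", (z_mix @ z_text.t()).argmax(dim=-1).item())
```

Because all modalities share one normalized space, composing queries is simple vector addition followed by re-normalization, and classification reduces to nearest-neighbor retrieval over text embeddings.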