Point-Bind & Point-LLM: Aligning Point Cloud with Multi-modality for 3D Understanding, Generation, and Instruction Following

Abstract

We introduce Point-Bind, a 3D multi-modality model that aligns point clouds with 2D images, language, audio, and video. Guided by ImageBind, we construct a joint embedding space between 3D and the other modalities, enabling many promising applications, e.g., any-to-3D generation, 3D embedding arithmetic, and 3D open-world understanding. On top of this, we present Point-LLM, the first 3D large language model (LLM) that follows 3D multi-modal instructions. Using parameter-efficient fine-tuning techniques, Point-LLM injects the semantics of Point-Bind into pre-trained LLMs, e.g., LLaMA. It requires no 3D instruction data, yet exhibits superior 3D and multi-modal question-answering capacity. We hope our work sheds light for the community on extending 3D point clouds to multi-modality applications. Code is available at https://github.com/ZiyuGuo99/Point-Bind_Point-LLM.
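
To make the idea of a joint embedding space concrete, the following is a minimal sketch, not the authors' implementation. It assumes that encoders for each modality already map their inputs into one shared vector space; the hand-written 3-dimensional vectors, the `retrieve` helper, and all labels are hypothetical stand-ins for real encoder outputs. Under those assumptions, cross-modal retrieval reduces to a cosine-similarity search, and embedding arithmetic reduces to adding normalized embeddings before searching.

```python
import numpy as np

def l2_normalize(x: np.ndarray) -> np.ndarray:
    """Scale vectors along the last axis to unit length."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def retrieve(query: np.ndarray, gallery: np.ndarray) -> int:
    """Return the index of the gallery row with the highest cosine similarity."""
    sims = l2_normalize(gallery) @ l2_normalize(query)
    return int(np.argmax(sims))

# Hypothetical embeddings (assumption: real ones would come from trained encoders).
# Axes loosely stand for the concepts "car", "piano", and "sea".
gallery_3d = np.array([
    [0.9, 0.1, 0.0],   # point cloud of a car
    [0.1, 0.8, 0.1],   # point cloud of a piano
    [0.0, 0.1, 0.9],   # point cloud of a boat at sea
])
text_query = np.array([0.1, 0.9, 0.1])   # text embedding for "a piano"

# Cross-modal retrieval: find the 3D shape closest to the text query.
best_shape = retrieve(text_query, gallery_3d)

# Embedding arithmetic: combine a 3D embedding with an audio embedding,
# then search an image gallery for the composed semantics.
car_3d = gallery_3d[0]
sea_audio = np.array([0.0, 0.1, 0.9])    # audio embedding of sea waves
composed = l2_normalize(car_3d) + l2_normalize(sea_audio)

image_names = ["car", "piano", "sea", "car by the sea"]
gallery_img = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
    [0.7, 0.0, 0.7],
])
best_image = image_names[retrieve(composed, gallery_img)]
```

In this toy setup the text query selects the piano point cloud, and the composed query selects the image that mixes both concepts. Normalizing before adding keeps either modality from dominating the sum, which is a design choice of this sketch rather than something stated in the abstract.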
