ImageBind-LLM: Multi-modality Instruction Tuning
September 7, 2023
Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
cs.AI
Abstract
We present ImageBind-LLM, a multi-modality instruction tuning method for large
language models (LLMs) via ImageBind. Unlike existing works, which mainly focus
on language and image instruction tuning, our ImageBind-LLM can respond to
multi-modality conditions, including audio, 3D point clouds, video, and their
embedding-space arithmetic, using only image-text alignment training.
During training, we adopt a learnable bind network to align the embedding
spaces of LLaMA and ImageBind's image encoder. The image features transformed
by the bind network are then added to the word tokens at every layer of LLaMA,
progressively injecting visual instructions through an attention-free,
zero-initialized gating mechanism. Aided by ImageBind's joint embedding, this
simple image-text training enables our model to exhibit superior
multi-modality instruction-following capabilities. During inference,
multi-modality inputs are fed into their corresponding ImageBind encoders and
processed by the proposed visual cache model for further cross-modal embedding
enhancement. This training-free cache model retrieves from three million image
features extracted by ImageBind, effectively mitigating the training-inference
modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to
instructions of diverse modalities and demonstrates strong language generation
quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
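
To make the injection mechanism described above more concrete, here is a minimal PyTorch sketch of how bind-network features might be added to LLaMA word tokens through an attention-free, zero-initialized gate. The class names, layer sizes, and exact gate placement are illustrative assumptions, not the released implementation (see the repository linked above for the real code).

```python
# Minimal sketch (assumptions throughout): project an ImageBind image embedding
# into the LLaMA hidden space, then add it to word tokens via a gate that starts at zero.
import torch
import torch.nn as nn

class BindNetwork(nn.Module):
    """Hypothetical bind network: maps ImageBind image embeddings to LLaMA's hidden size."""
    def __init__(self, imagebind_dim=1024, llama_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(imagebind_dim, llama_dim),
            nn.SiLU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, image_embed):           # (B, imagebind_dim)
        return self.proj(image_embed)         # (B, llama_dim)

class GatedInjection(nn.Module):
    """Attention-free injection: add visual features to word tokens, scaled by a
    zero-initialized learnable gate so visual signal is introduced progressively."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero at start of training

    def forward(self, word_tokens, visual_feat):
        # word_tokens: (B, T, llama_dim), visual_feat: (B, llama_dim)
        return word_tokens + self.gate * visual_feat.unsqueeze(1)

# Usage with dummy tensors (one such gate would sit at each LLaMA layer):
bind = BindNetwork()
gate = GatedInjection()
tokens = torch.randn(2, 16, 4096)      # word tokens at some LLaMA layer
image_embed = torch.randn(2, 1024)     # ImageBind image embedding
out = gate(tokens, bind(image_embed))  # same shape as tokens
```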
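The training-free visual cache can likewise be sketched as a nearest-neighbor lookup over pre-extracted ImageBind image features. The cosine-similarity retrieval, softmax weighting, and blending ratio below are assumptions for illustration; only the idea of retrieving from a large cache of image features to reduce the training-inference modality gap comes from the abstract.

```python
# Minimal sketch of cache-based embedding enhancement (assumed mechanics).
import torch

def cache_enhance(query, cache, k=4, beta=0.5, temperature=0.1):
    """Blend a non-image ImageBind embedding with its nearest cached image features.

    query: (B, D) ImageBind embedding of e.g. audio or a point cloud
    cache: (N, D) pre-extracted ImageBind image features (the paper uses ~3 million)
    """
    q = torch.nn.functional.normalize(query, dim=-1)
    c = torch.nn.functional.normalize(cache, dim=-1)
    sims = q @ c.t()                               # (B, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=-1)     # (B, k) nearest image features
    weights = torch.softmax(topk_sims / temperature, dim=-1)       # (B, k)
    retrieved = (weights.unsqueeze(-1) * cache[topk_idx]).sum(1)   # (B, D)
    return beta * retrieved + (1 - beta) * query   # more image-like embedding for the LLM
```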