ImageBind-LLM: Multi-modality Instruction Tuning
September 7, 2023
Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
cs.AI
Abstract
We present ImageBind-LLM, a multi-modality instruction tuning method for large
language models (LLMs) via ImageBind. Unlike existing works, which mainly focus
on language and image instruction tuning, our ImageBind-LLM can respond to
multi-modality conditions, including audio, 3D point clouds, video, and their
embedding-space arithmetic, using only image-text alignment training.
During training, we adopt a learnable bind network to align the embedding
spaces of LLaMA and ImageBind's image encoder. The image features transformed
by the bind network are then added to the word tokens at every layer of LLaMA,
progressively injecting visual instructions through an attention-free,
zero-initialized gating mechanism. Aided by ImageBind's joint embedding, this
simple image-text training enables our model to exhibit superior
multi-modality instruction-following capabilities. During inference,
multi-modality inputs are fed into their corresponding ImageBind encoders and
processed by the proposed visual cache model for further cross-modal embedding
enhancement. This training-free cache model retrieves from three million image
features extracted by ImageBind, effectively mitigating the training-inference
modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to
instructions of diverse modalities and demonstrates strong language generation
quality. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
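
To make the injection mechanism described above more concrete, here is a minimal PyTorch sketch of how bind-network features might be added to LLaMA word tokens through an attention-free, zero-initialized gate. The class names, layer sizes, and exact gate placement are illustrative assumptions, not the released implementation (see the repository linked above for the real code).

```python
# Minimal sketch (assumptions throughout): project an ImageBind image embedding
# into the LLaMA hidden space, then add it to word tokens via a gate that starts at zero.
import torch
import torch.nn as nn

class BindNetwork(nn.Module):
    """Hypothetical bind network: maps ImageBind image embeddings to LLaMA's hidden size."""
    def __init__(self, imagebind_dim=1024, llama_dim=4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(imagebind_dim, llama_dim),
            nn.SiLU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, image_embed):           # (B, imagebind_dim)
        return self.proj(image_embed)         # (B, llama_dim)

class GatedInjection(nn.Module):
    """Attention-free injection: add visual features to word tokens, scaled by a
    zero-initialized learnable gate so visual signal is introduced progressively."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero at start of training

    def forward(self, word_tokens, visual_feat):
        # word_tokens: (B, T, llama_dim), visual_feat: (B, llama_dim)
        return word_tokens + self.gate * visual_feat.unsqueeze(1)

# Usage with dummy tensors (one such gate would sit at each LLaMA layer):
bind = BindNetwork()
gate = GatedInjection()
tokens = torch.randn(2, 16, 4096)      # word tokens at some LLaMA layer
image_embed = torch.randn(2, 1024)     # ImageBind image embedding
out = gate(tokens, bind(image_embed))  # same shape as tokens
```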
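The training-free visual cache can likewise be sketched as a nearest-neighbor lookup over pre-extracted ImageBind image features. The cosine-similarity retrieval, softmax weighting, and blending ratio below are assumptions for illustration; only the idea of retrieving from a large cache of image features to reduce the training-inference modality gap comes from the abstract.

```python
# Minimal sketch of cache-based embedding enhancement (assumed mechanics).
import torch

def cache_enhance(query, cache, k=4, beta=0.5, temperature=0.1):
    """Blend a non-image ImageBind embedding with its nearest cached image features.

    query: (B, D) ImageBind embedding of e.g. audio or a point cloud
    cache: (N, D) pre-extracted ImageBind image features (the paper uses ~3 million)
    """
    q = torch.nn.functional.normalize(query, dim=-1)
    c = torch.nn.functional.normalize(cache, dim=-1)
    sims = q @ c.t()                               # (B, N) cosine similarities
    topk_sims, topk_idx = sims.topk(k, dim=-1)     # (B, k) nearest image features
    weights = torch.softmax(topk_sims / temperature, dim=-1)       # (B, k)
    retrieved = (weights.unsqueeze(-1) * cache[topk_idx]).sum(1)   # (B, D)
    return beta * retrieved + (1 - beta) * query   # more image-like embedding for the LLM
```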