ImageBind-LLM: Multi-modality Instruction Tuning
September 7, 2023
Authors: Jiaming Han, Renrui Zhang, Wenqi Shao, Peng Gao, Peng Xu, Han Xiao, Kaipeng Zhang, Chris Liu, Song Wen, Ziyu Guo, Xudong Lu, Shuai Ren, Yafei Wen, Xiaoxin Chen, Xiangyu Yue, Hongsheng Li, Yu Qiao
cs.AI
Abstract
We present ImageBind-LLM, a multi-modality instruction tuning method for large language models (LLMs) via ImageBind. Existing works mainly focus on language and image instruction tuning; in contrast, our ImageBind-LLM can respond to multi-modality conditions, including audio, 3D point clouds, video, and their embedding-space arithmetic, using only image-text alignment training. During training, we adopt a learnable bind network to align the embedding spaces of LLaMA and ImageBind's image encoder. The image features transformed by the bind network are then added to the word tokens at all layers of LLaMA, progressively injecting visual instructions via an attention-free, zero-initialized gating mechanism. Aided by ImageBind's joint embedding, this simple image-text training enables our model to exhibit superior multi-modality instruction-following capabilities. During inference, multi-modality inputs are fed into the corresponding ImageBind encoders and processed by a proposed visual cache model for further cross-modal embedding enhancement. The training-free cache model retrieves from three million image features extracted by ImageBind, which effectively mitigates the training-inference modality discrepancy. Notably, with our approach, ImageBind-LLM can respond to instructions of diverse modalities and demonstrates high-quality language generation. Code is released at https://github.com/OpenGVLab/LLaMA-Adapter.
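For illustration, the PyTorch sketch below shows one way the described injection could look: a bind network projects an ImageBind image embedding into LLaMA's hidden dimension, and a zero-initialized gate scales the visual feature before adding it to every word token, so training starts from the unmodified language model. The module names, dimensions, and internal layer structure here are assumptions made for this sketch, not the authors' implementation.

# Minimal sketch of the attention-free, zero-initialized gating described above.
# Shapes and module names (BindNetwork, GatedVisualInjection) are illustrative.
import torch
import torch.nn as nn

class BindNetwork(nn.Module):
    """Projects an ImageBind image embedding into LLaMA's hidden dimension."""
    def __init__(self, imagebind_dim: int = 1024, llama_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(imagebind_dim, llama_dim),
            nn.SiLU(),
            nn.Linear(llama_dim, llama_dim),
        )

    def forward(self, image_embed: torch.Tensor) -> torch.Tensor:
        # image_embed: (batch, imagebind_dim) -> (batch, llama_dim)
        return self.proj(image_embed)

class GatedVisualInjection(nn.Module):
    """Adds the bound visual feature to every word token of a LLaMA layer,
    scaled by a gate initialized to zero so training begins from the
    unmodified language model."""
    def __init__(self):
        super().__init__()
        self.gate = nn.Parameter(torch.zeros(1))  # zero-initialized gating factor

    def forward(self, tokens: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, llama_dim); visual: (batch, llama_dim)
        return tokens + self.gate * visual.unsqueeze(1)

if __name__ == "__main__":
    bind = BindNetwork()
    inject = GatedVisualInjection()
    image_embed = torch.randn(2, 1024)   # stand-in for an ImageBind image embedding
    tokens = torch.randn(2, 16, 4096)    # stand-in for LLaMA hidden states
    out = inject(tokens, bind(image_embed))
    assert torch.allclose(out, tokens)   # gate is zero at initialization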
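Similarly, the training-free cache enhancement can be pictured as a nearest-neighbour lookup: a query embedding from any ImageBind encoder is pulled toward its most similar cached image features, narrowing the gap between the non-image modalities seen at inference and the image features seen in training. The cosine-similarity retrieval, top-k size, and mixing weight below are illustrative assumptions; the released code should be consulted for the actual procedure.

# Minimal sketch of a training-free cache lookup consistent with the abstract.
import torch
import torch.nn.functional as F

def cache_enhance(query: torch.Tensor,
                  image_cache: torch.Tensor,
                  top_k: int = 4,
                  alpha: float = 0.5) -> torch.Tensor:
    """query: (batch, dim) embedding from any modality's ImageBind encoder.
    image_cache: (num_cached, dim) pre-extracted ImageBind image features.
    Returns an embedding blended toward its nearest cached image features."""
    q = F.normalize(query, dim=-1)
    cache = F.normalize(image_cache, dim=-1)
    sim = q @ cache.t()                      # cosine similarities (batch, num_cached)
    weights, idx = sim.topk(top_k, dim=-1)   # nearest cached image features
    weights = weights.softmax(dim=-1)        # similarity-weighted average
    retrieved = (weights.unsqueeze(-1) * cache[idx]).sum(dim=1)
    return alpha * q + (1.0 - alpha) * retrieved

if __name__ == "__main__":
    cache = torch.randn(10_000, 1024)        # small stand-in for the 3M-feature bank
    audio_embed = torch.randn(2, 1024)       # e.g., an ImageBind audio embedding
    enhanced = cache_enhance(audio_embed, cache)
    print(enhanced.shape)                    # torch.Size([2, 1024])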