

MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

May 8, 2023
作者: Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen
cs.AI

Abstract

We present a vision and language model named MultiModal-GPT for conducting multi-round dialogue with humans. MultiModal-GPT can follow a variety of human instructions, such as generating a detailed caption, counting objects of interest, and answering general questions from users. MultiModal-GPT is parameter-efficiently fine-tuned from OpenFlamingo, with Low-rank Adapters (LoRA) added to both the cross-attention and self-attention parts of the language model. We first construct instruction templates from vision and language data for multi-modality instruction tuning, so that the model learns to understand and follow human instructions. We find that the quality of the training data is vital for dialogue performance: even a small amount of data with short answers can lead the model to respond tersely to any instruction. To further enhance MultiModal-GPT's ability to chat with humans, we also train it jointly on language-only instruction-following data. Joint training on language-only and vision-language instructions with the same instruction template effectively improves dialogue performance. Various demos show MultiModal-GPT's ability to hold continuous dialogues with humans. Code and demos are available at https://github.com/open-mmlab/Multimodal-GPT.
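The abstract describes two concrete techniques: injecting LoRA into the attention layers of the frozen OpenFlamingo language model, and formatting language-only and vision-language data with one shared instruction template. The sketches below illustrate these ideas in plain PyTorch/Python; the module-name substrings (`q_proj`, `v_proj`), rank and scaling values, and the prompt wording are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Minimal sketch of LoRA injection into attention projections (illustrative only,
# not the authors' code). Assumes a PyTorch model whose self-attention and
# cross-attention blocks expose nn.Linear projections with names like "q_proj".
import torch.nn as nn


class LoRALinear(nn.Module):
    """Wrap a frozen nn.Linear with a trainable low-rank update: W x + (alpha/r) * B(A(x))."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False          # keep the pretrained weights frozen
        self.lora_a = nn.Linear(base.in_features, r, bias=False)
        self.lora_b = nn.Linear(r, base.out_features, bias=False)
        nn.init.zeros_(self.lora_b.weight)   # zero init: the wrapped layer starts identical to base
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + self.scale * self.lora_b(self.lora_a(x))


def add_lora(model: nn.Module, target_substrings=("q_proj", "v_proj")):
    """Replace matching nn.Linear children (e.g. query/value projections in both
    self-attention and cross-attention blocks) with LoRA-wrapped copies."""
    replacements = []
    for _, module in model.named_modules():
        for child_name, child in module.named_children():
            if isinstance(child, nn.Linear) and any(s in child_name for s in target_substrings):
                replacements.append((module, child_name, child))
    for parent, child_name, child in replacements:   # mutate after collecting, not while iterating
        setattr(parent, child_name, LoRALinear(child))
    return model
```

Freezing the base weights and training only the low-rank adapters is what makes the fine-tuning parameter-efficient; the same wrapper can be applied to both the self-attention and the gated cross-attention layers mentioned in the abstract. The second idea, a single instruction template shared by vision-language and language-only samples, could look roughly like the following (field names and wording are hypothetical):

```python
# Illustrative sketch of a unified instruction template. Vision-language samples
# include an image placeholder token; language-only samples omit it, so both
# data sources can be mixed in the same training batches.
def build_prompt(instruction: str, response: str = "", has_image: bool = False) -> str:
    image_block = "### Image:\n<image>\n" if has_image else ""
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"{image_block}"
        f"### Instruction:\n{instruction}\n\n"
        f"### Response:\n{response}"
    )
```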