MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
May 8, 2023
作者: Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen
cs.AI
Abstract
We present a vision and language model named MultiModal-GPT that conducts multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of objects of interest, and answering general questions from users. MultiModal-GPT is fine-tuned from OpenFlamingo in a parameter-efficient manner, with Low-rank Adapters (LoRA) added to both the cross-attention and self-attention parts of the language model. We first construct instruction templates from vision and language data for multi-modality instruction tuning, so that the model learns to understand and follow human instructions. We find that the quality of the training data is vital for dialogue performance: even a small amount of data with short answers can lead the model to respond tersely to any instruction. To further enhance MultiModal-GPT's ability to chat with humans, we additionally use language-only instruction-following data to train the model jointly. Joint training on language-only and vision-language instructions with the same instruction template effectively improves dialogue performance. Various demos show MultiModal-GPT's ability to hold continuous dialogues with humans. Code and demos are available at https://github.com/open-mmlab/Multimodal-GPT
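
The abstract describes adding LoRA to both the cross-attention and the self-attention parts of the language model. Below is a minimal sketch of how such adapters might be attached using HuggingFace PEFT; it is not the authors' implementation, and the rank, scaling factor, and target-module name patterns are assumptions that would have to be matched to the actual OpenFlamingo layer names.

```python
# Minimal sketch, not the authors' code: attach LoRA adapters to the attention
# projections of a Flamingo-style language model with HuggingFace PEFT.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # low-rank dimension (assumed value)
    lora_alpha=32,     # LoRA scaling factor (assumed value)
    lora_dropout=0.05,
    bias="none",
    # Hypothetical module-name patterns covering both the self-attention
    # projections and the inserted gated cross-attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "to_q", "to_kv", "to_out"],
)

def add_lora(flamingo_lm):
    """Inject LoRA matrices so that only the adapter weights are trainable."""
    peft_lm = get_peft_model(flamingo_lm, lora_config)
    peft_lm.print_trainable_parameters()  # expect only a small trainable fraction
    return peft_lm
```

The joint training on language-only and vision-language data relies on a shared instruction template. The sketch below is a hypothetical illustration of such a template, where language-only samples simply omit the image slot; the exact prompt text used in the repository may differ.

```python
# Hypothetical shared instruction template; the released prompt text may differ.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "{image}### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

def format_sample(instruction, response, has_image):
    # Vision-language samples insert an image placeholder token; language-only
    # samples use the identical template with the slot left empty.
    image_field = "### Image:\n<image>\n\n" if has_image else ""
    return TEMPLATE.format(image=image_field,
                           instruction=instruction,
                           response=response)
```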