MultiModal-GPT: A Vision and Language Model for Dialogue with Humans
May 8, 2023
作者: Tao Gong, Chengqi Lyu, Shilong Zhang, Yudong Wang, Miao Zheng, Qian Zhao, Kuikun Liu, Wenwei Zhang, Ping Luo, Kai Chen
cs.AI
Abstract
We present a vision and language model named MultiModal-GPT that conducts multi-round dialogue with humans. MultiModal-GPT can follow various instructions from humans, such as generating a detailed caption, counting the number of objects of interest, and answering general questions from users. MultiModal-GPT is fine-tuned from OpenFlamingo in a parameter-efficient manner, with Low-rank Adapters (LoRA) added to both the cross-attention and self-attention parts of the language model. We first construct instruction templates from vision and language data for multi-modality instruction tuning, so that the model learns to understand and follow human instructions. We find that the quality of the training data is vital for dialogue performance: even a small amount of data with short answers can lead the model to respond tersely to any instruction. To further enhance MultiModal-GPT's ability to chat with humans, we additionally use language-only instruction-following data to train the model jointly. Joint training on language-only and vision-language instructions with the same instruction template effectively improves dialogue performance. Various demos show MultiModal-GPT's ability to hold continuous dialogues with humans. Code and demos are available at https://github.com/open-mmlab/Multimodal-GPT
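
The abstract describes adding LoRA to both the cross-attention and the self-attention parts of the language model. Below is a minimal sketch of how such adapters might be attached using HuggingFace PEFT; it is not the authors' implementation, and the rank, scaling factor, and target-module name patterns are assumptions that would have to be matched to the actual OpenFlamingo layer names.

```python
# Minimal sketch, not the authors' code: attach LoRA adapters to the attention
# projections of a Flamingo-style language model with HuggingFace PEFT.
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,              # low-rank dimension (assumed value)
    lora_alpha=32,     # LoRA scaling factor (assumed value)
    lora_dropout=0.05,
    bias="none",
    # Hypothetical module-name patterns covering both the self-attention
    # projections and the inserted gated cross-attention projections.
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "to_q", "to_kv", "to_out"],
)

def add_lora(flamingo_lm):
    """Inject LoRA matrices so that only the adapter weights are trainable."""
    peft_lm = get_peft_model(flamingo_lm, lora_config)
    peft_lm.print_trainable_parameters()  # expect only a small trainable fraction
    return peft_lm
```

The joint training on language-only and vision-language data relies on a shared instruction template. The sketch below is a hypothetical illustration of such a template, where language-only samples simply omit the image slot; the exact prompt text used in the repository may differ.

```python
# Hypothetical shared instruction template; the released prompt text may differ.
TEMPLATE = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "{image}### Instruction:\n{instruction}\n\n### Response:\n{response}"
)

def format_sample(instruction, response, has_image):
    # Vision-language samples insert an image placeholder token; language-only
    # samples use the identical template with the slot left empty.
    image_field = "### Image:\n<image>\n\n" if has_image else ""
    return TEMPLATE.format(image=image_field,
                           instruction=instruction,
                           response=response)
```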