MultiModal-GPT: A Vision and Language Model for Dialogue with Humans

Abstract

We present MultiModal-GPT, a vision and language model for multi-round dialogue with humans. MultiModal-GPT can follow a variety of human instructions, such as generating a detailed caption, counting objects of interest, and answering general questions from users. The model is fine-tuned from OpenFlamingo in a parameter-efficient manner, with Low-rank Adapter (LoRA) added to both the cross-attention and self-attention parts of the language model. We first construct instruction templates from vision and language data for multi-modal instruction tuning, so that the model understands and follows human instructions. We find that the quality of the training data is vital for dialogue performance: even a small amount of data containing short answers can lead the model to respond tersely to any instruction. To further improve MultiModal-GPT's ability to chat with humans, we also train it jointly on language-only instruction-following data. Joint training on language-only and visual-language instructions with the same instruction template effectively improves dialogue performance. Various demos show MultiModal-GPT's ability to hold a continuous dialogue with humans. Code and demo are available at https://github.com/open-mmlab/Multimodal-GPT
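The idea of rendering language-only and visual-language instructions with one shared template can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the abstract does not give the template wording, so the header text, the `### Image:` / `### Instruction:` / `### Response:` markers, the `<image>` placeholder, and the `Sample` and `build_prompt` names are hypothetical and should not be read as the authors' actual format.

```python
from dataclasses import dataclass

# Hypothetical header and section markers (Alpaca-style).
# The abstract does not specify the real template wording.
HEADER = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request."
)
IMAGE_TOKEN = "<image>"  # assumed placeholder where visual features would go


@dataclass
class Sample:
    instruction: str
    response: str
    has_image: bool = False  # True for visual-language samples


def build_prompt(sample: Sample, include_response: bool = True) -> str:
    """Render a sample with one shared template.

    Visual-language samples get an extra image section; language-only
    samples omit it, so both kinds keep the same overall layout.
    """
    parts = [HEADER]
    if sample.has_image:
        parts.append(f"### Image:\n{IMAGE_TOKEN}")
    parts.append(f"### Instruction:\n{sample.instruction}")
    # At inference time the response is left empty for the model to fill in.
    response = sample.response if include_response else ""
    parts.append(f"### Response:\n{response}")
    return "\n\n".join(parts)


if __name__ == "__main__":
    visual = Sample("How many dogs are in the picture?",
                    "There are two dogs.", has_image=True)
    text_only = Sample("Name three primary colors.",
                       "Red, yellow, and blue.")
    # Joint training would draw from both kinds of samples in one stream.
    for prompt in (build_prompt(visual), build_prompt(text_only)):
        print(prompt)
        print("-" * 40)
```

Because both sample types pass through the same function, the only difference the model sees between them is the presence of the image section, which is one plausible way to make language-only data reusable for dialogue training.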
