Aligning Large Multi-Modal Model with Robust Instruction Tuning

Abstract

Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucination: they generate descriptions that are inconsistent with the associated image and the human instructions. This paper addresses the issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies, which focus primarily on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and that adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we mitigate hallucination while improving performance on public datasets, using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. The project page is available at https://fuxiaoliu.github.io/LRV/.
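The observation about a balanced ratio of positive and negative instances can be illustrated with a minimal sketch. This is not the authors' data pipeline: the function name `balanced_mix`, the `negative_ratio` parameter, and the record layout (a dict with a boolean `neg` field) are all hypothetical, chosen only to show one way of subsampling two instruction pools to a target ratio.

```python
import random


def balanced_mix(positives, negatives, negative_ratio=0.5, seed=0):
    """Return a shuffled list whose share of negatives is close to negative_ratio.

    Hypothetical helper for illustration; not part of LRV-Instruction.
    """
    if not 0.0 < negative_ratio < 1.0:
        raise ValueError("negative_ratio must be strictly between 0 and 1")
    rng = random.Random(seed)
    # Largest total size that both pools can support at the requested ratio.
    total = int(min(len(negatives) / negative_ratio,
                    len(positives) / (1.0 - negative_ratio)))
    n_neg = round(total * negative_ratio)
    n_pos = total - n_neg
    # Subsample each pool without replacement, then shuffle the combined list.
    mix = rng.sample(positives, n_pos) + rng.sample(negatives, n_neg)
    rng.shuffle(mix)
    return mix


# Assumed record layout: one dict per instruction, flagged as negative or not.
positives = [{"instruction": f"pos-{i}", "neg": False} for i in range(100)]
negatives = [{"instruction": f"neg-{i}", "neg": True} for i in range(30)]
train = balanced_mix(positives, negatives, negative_ratio=0.5)
```

With these toy pools the smaller negative pool limits the mix, so the larger positive pool is subsampled to match it. Varying `negative_ratio` is one simple way to set up the kind of ratio comparison the abstract mentions.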
