Aligning Large Multi-Modal Model with Robust Instruction Tuning
June 26, 2023
Authors: Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang
cs.AI
Abstract
Despite the promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper
addresses this issue by introducing the first large and diverse visual
instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset consists of 120k visual instructions generated by GPT4, covering 16
vision-and-language tasks with open-ended instructions and answers. Unlike
existing studies that primarily focus on positive instruction samples, we
design LRV-Instruction to include both positive and negative instructions for
more robust visual instruction tuning. Our negative instructions are designed
at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent
Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination
of LMMs. Our results demonstrate that existing LMMs exhibit significant
hallucination when presented with our negative instructions, particularly with
Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets, using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. Our project is available at https://fuxiaoliu.github.io/LRV/.
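The abstract does not spell out GAVIE's exact scoring protocol, so the following Python sketch only illustrates the general idea of GPT4-assisted, ground-truth-free evaluation: a text-only GPT-4 judge receives a textual description of the image together with the instruction and the model's answer, and returns scores. The function name gavie_style_score, the prompt wording, the 0-10 scale, and the example data are all hypothetical, not the authors' protocol.

```python
# A minimal sketch of a GPT4-assisted evaluation in the spirit of GAVIE.
# Assumptions not stated in the abstract: the prompt wording, the 0-10
# scale, the accuracy/relevancy criteria, and the use of a textual scene
# description as a stand-in for the image are all illustrative choices.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def gavie_style_score(scene_description: str, instruction: str, answer: str) -> str:
    """Ask a GPT-4 judge to grade an LMM answer without ground-truth labels."""
    prompt = (
        "You are grading a vision-language model's answer.\n"
        f"Image content (text description): {scene_description}\n"
        f"Instruction: {instruction}\n"
        f"Model answer: {answer}\n"
        "Rate the answer from 0 to 10 on (a) accuracy against the image "
        "content, penalizing hallucinated objects or attributes, and "
        "(b) relevancy to the instruction. Briefly justify both scores."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content


# Example with a negative instruction about a nonexistent element:
# a robust model should deny the false premise rather than hallucinate a cat.
print(gavie_style_score(
    scene_description="A brown dog lying on green grass next to a red ball.",
    instruction="Describe the cat in the image.",
    answer="There is no cat in the image; it shows a dog with a red ball.",
))
```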