Aligning Large Multi-Modal Model with Robust Instruction Tuning
June 26, 2023
作者: Fuxiao Liu, Kevin Lin, Linjie Li, Jianfeng Wang, Yaser Yacoob, Lijuan Wang
cs.AI
Abstract
Despite the promising progress in multi-modal tasks, current large
multi-modal models (LMMs) are prone to hallucinating descriptions that are
inconsistent with the associated image and human instructions. This paper
addresses this issue by introducing the first large and diverse visual
instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction.
Our dataset consists of 120k visual instructions generated by GPT4, covering 16
vision-and-language tasks with open-ended instructions and answers. Unlike
existing studies that primarily focus on positive instruction samples, we
design LRV-Instruction to include both positive and negative instructions for
more robust visual instruction tuning. Our negative instructions are designed
at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent
Element Manipulation. To efficiently measure the hallucination generated by
LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel
approach that evaluates visual instruction tuning without the need for
human-annotated ground-truth answers and adapts to diverse instruction
formats. We conduct comprehensive experiments to investigate the hallucination
of LMMs. Our results demonstrate that existing LMMs exhibit significant
hallucination when presented with our negative instructions, particularly with
Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on
LRV-Instruction, we successfully mitigate hallucination while improving
performance on public datasets with less training data than
state-of-the-art methods. Additionally, we observe that a balanced ratio of
positive and negative instances in the training data leads to a more robust
model. Our project link is available at https://fuxiaoliu.github.io/LRV/.
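
To make the two levels of negative instructions concrete, below is a small, hypothetical set of samples in the spirit of LRV-Instruction. The images, instructions, and answers are invented for illustration and are not taken from the dataset itself.

```python
# Hypothetical instruction-tuning samples illustrating the positive/negative
# design the abstract describes; not actual LRV-Instruction data.
samples = [
    {   # positive: the instruction is consistent with the image content
        "image": "dog_on_sofa.jpg",
        "instruction": "What color is the dog on the sofa?",
        "answer": "The dog is brown.",
    },
    {   # negative, Nonexistent Element Manipulation: asks about an object
        # that does not appear in the image; a robust model should say so
        "image": "dog_on_sofa.jpg",
        "instruction": "What is the cat on the sofa playing with?",
        "answer": "There is no cat in the image.",
    },
    {   # negative, Existent Element Manipulation: the premise misstates
        # attributes of objects that are present; the model should correct it
        "image": "dog_on_sofa.jpg",
        "instruction": "Why is the black dog sleeping on the floor?",
        "answer": "The dog is brown and sitting on the sofa, not the floor.",
    },
]
```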
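GAVIE's exact prompts and inputs are defined in the paper, not in this abstract. The following is a minimal sketch of how GPT4-assisted scoring without ground-truth answers could look, assuming the pre-1.0 `openai` Python client and an `OPENAI_API_KEY` in the environment; the prompt wording, the textual image context, and the 0-10 scales are assumptions for illustration.

```python
# A minimal sketch (not the authors' released code) of GPT4-assisted
# evaluation in the spirit of GAVIE: GPT-4 rates an LMM's response against
# the instruction and a textual description of the image, so no
# human-annotated ground-truth answer is needed.
import openai  # assumes openai<1.0 and OPENAI_API_KEY set in the environment

def gavie_style_score(image_context: str, instruction: str, response: str) -> str:
    """Ask GPT-4 to rate relevancy and accuracy of an LMM response (0-10)."""
    prompt = (
        "You are given textual context describing an image, an instruction, "
        "and a model's response.\n"
        f"Image context: {image_context}\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Rate the response on two axes from 0 to 10 and explain briefly:\n"
        "1) Relevancy: does it directly follow the instruction?\n"
        "2) Accuracy: is it consistent with the image context, with no "
        "hallucinated objects, attributes, or facts?"
    )
    result = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic scoring
    )
    return result["choices"][0]["message"]["content"]
```

Because the judge sees only the instruction, the response, and a textual stand-in for the image, this style of evaluation can score open-ended outputs in diverse instruction formats without a fixed answer key.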