Aligning Large Multi-Modal Model with Robust Instruction Tuning

Abstract

Despite promising progress in multi-modal tasks, current large multi-modal models (LMMs) are prone to hallucinating descriptions that are inconsistent with the associated image and human instructions. This paper addresses the issue by introducing the first large and diverse visual instruction tuning dataset, named Large-scale Robust Visual (LRV)-Instruction. Our dataset consists of 120k visual instructions generated by GPT4, covering 16 vision-and-language tasks with open-ended instructions and answers. Unlike existing studies that primarily focus on positive instruction samples, we design LRV-Instruction to include both positive and negative instructions for more robust visual instruction tuning. Our negative instructions are designed at two semantic levels: (i) Nonexistent Element Manipulation and (ii) Existent Element Manipulation. To efficiently measure the hallucination generated by LMMs, we propose GPT4-Assisted Visual Instruction Evaluation (GAVIE), a novel approach that evaluates visual instruction tuning without requiring human-annotated ground-truth answers and that adapts to diverse instruction formats. We conduct comprehensive experiments to investigate the hallucination of LMMs. Our results demonstrate that existing LMMs exhibit significant hallucination when presented with our negative instructions, particularly Existent Element Manipulation instructions. Moreover, by finetuning MiniGPT4 on LRV-Instruction, we successfully mitigate hallucination while improving performance on public datasets, using less training data than state-of-the-art methods. Additionally, we observe that a balanced ratio of positive and negative instances in the training data leads to a more robust model. The project page is available at https://fuxiaoliu.github.io/LRV/.
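As a rough illustration of the balanced positive/negative ratio mentioned above, the sketch below downsamples whichever group is larger so that the training mix is 1:1. It is not the authors' pipeline: the `Instruction` record, its field names, the `kind` labels, and the `balanced_mix` helper are all hypothetical, and the example instructions are invented toy data rather than entries from LRV-Instruction.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Instruction:
    """Hypothetical record for one visual instruction (not the dataset's schema)."""
    image_id: str
    text: str
    answer: str
    kind: str  # assumed labels: "positive", "nonexistent", or "existent"


def balanced_mix(samples, seed=0):
    """Downsample the larger group so positives and negatives appear 1:1."""
    rng = random.Random(seed)
    positives = [s for s in samples if s.kind == "positive"]
    negatives = [s for s in samples if s.kind != "positive"]
    n = min(len(positives), len(negatives))
    mixed = rng.sample(positives, n) + rng.sample(negatives, n)
    rng.shuffle(mixed)
    return mixed


# Invented toy data: five positive instructions and two negative ones.
toy = [Instruction(f"img{i}", "Describe the scene.", "...", "positive") for i in range(5)]
toy.append(Instruction("img0", "What color is the dog's collar?",
                       "There is no dog in the image.", "nonexistent"))
toy.append(Instruction("img1", "Is the woman in the blue dress holding an umbrella?",
                       "The woman is wearing a red dress.", "existent"))

mixed = balanced_mix(toy)
```

Downsampling is only one way to reach a balanced ratio; oversampling the smaller group or weighting the loss would be alternatives, and the abstract does not say which the authors used.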
