HallusionBench: You See What You Think? Or You Think What You See? An Image-Context Reasoning Benchmark Challenging for GPT-4V(ision), LLaVA-1.5, and Other Multi-modality Models

Abstract

Large language models (LLMs), once aligned with vision models and integrated into vision-language models (VLMs), can bring impressive improvements in image reasoning tasks, as shown by the recently released GPT-4V(ision), LLaVA-1.5, and others. However, the strong language prior in these state-of-the-art large vision-language models (LVLMs) can be a double-edged sword: they may ignore the image context and rely solely on the language prior for reasoning, even when that prior contradicts the image. Conversely, the vision modules in VLMs are weaker than the LLMs and may produce misleading visual representations, which the LLMs then translate into confident mistakes. To study these two types of VLM mistakes, namely language hallucination and visual illusion, we curated HallusionBench, an image-context reasoning benchmark that remains challenging even for GPT-4V and LLaVA-1.5. We provide a detailed analysis of examples in HallusionBench, which offers new insights into the illusions and hallucinations of VLMs and how to improve these models in the future. The benchmark and codebase will be released at https://github.com/tianyi-lab/HallusionBench.

