NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

Abstract

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs): the model's output includes objects that do not appear in the input image. This raises a natural question: which component of the LVLM pipeline contributes most to object hallucination? Is it the vision encoder, which perceives visual information, or the language decoder, which generates the text response? In this work, we address this question with a systematic experiment that analyzes the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with the strong priors of the language decoder. Based on this finding, we propose No-Language-Hallucination Decoding (NoLan), a simple, training-free framework that refines the output distribution by dynamically suppressing language priors. The suppression strength is modulated by the difference between the output distributions obtained from multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs and tasks. For instance, on POPE, NoLan improves the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21, respectively. The code is publicly available at https://github.com/lingfengren/NoLan.
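
The abstract describes the decoding idea only in words, so the following Python sketch shows one way it could be read. It is not the authors' implementation. The function names, the use of KL divergence as the "distribution difference", the rule alpha = scale * KL, and the contrastive log-probability update are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def log_softmax(logits):
    """Numerically stable log-softmax over a list of floats."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def suppress_language_prior(mm_logits, text_logits, scale=1.0):
    """Hypothetical next-token adjustment (an assumption, not the paper's rule).

    mm_logits:   logits from the model given image + text.
    text_logits: logits from the same model given the text only.
    Returns the adjusted distribution and the suppression strength alpha.
    """
    p_mm = softmax(mm_logits)
    p_text = softmax(text_logits)
    # ASSUMPTION: suppression strength grows with the divergence between
    # the multimodal and the text-only distributions.
    alpha = scale * kl_divergence(p_mm, p_text)
    lp_mm = log_softmax(mm_logits)
    lp_text = log_softmax(text_logits)
    # ASSUMPTION: contrastive-style update that subtracts the text-only
    # (language prior) log-probabilities from the multimodal ones.
    adjusted = [(1.0 + alpha) * a - alpha * b for a, b in zip(lp_mm, lp_text)]
    return softmax(adjusted), alpha


if __name__ == "__main__":
    # Toy 3-token vocabulary: the image favors token 0, the prior favors token 2.
    probs, alpha = suppress_language_prior([2.0, 1.0, 0.0], [0.0, 1.0, 2.0])
    print("alpha:", alpha)
    print("adjusted distribution:", probs)
```

In this toy setup, a token that the text-only pass favors but the multimodal pass does not loses probability mass. When the two passes agree exactly, alpha is zero and the multimodal distribution is returned unchanged.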
