NoLan: Mitigating Object Hallucinations in Large Vision-Language Models via Dynamic Suppression of Language Priors

Abstract

Object hallucination is a critical issue in Large Vision-Language Models (LVLMs): the output mentions objects that do not appear in the input image. This raises a natural question: which component of the LVLM pipeline contributes most to object hallucinations, the vision encoder that perceives visual information or the language decoder that generates the text response? In this work, we address this question with a systematic experiment that analyzes the roles of the vision encoder and the language decoder in hallucination generation. Our observations reveal that object hallucinations are predominantly associated with strong priors from the language decoder. Based on this finding, we propose No-Language-Hallucination Decoding (NoLan), a simple, training-free framework that refines the output distribution by dynamically suppressing language priors, with the suppression modulated by the difference between the output distributions for multimodal and text-only inputs. Experimental results demonstrate that NoLan effectively reduces object hallucinations across various LVLMs and tasks. For instance, on the POPE benchmark, NoLan improves the accuracy of LLaVA-1.5 7B and Qwen-VL 7B by up to 6.45 and 7.21 points, respectively. The code is publicly available at https://github.com/lingfengren/NoLan.
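The abstract describes the decoding idea only in words, so the following is a minimal sketch of what "suppressing language priors, modulated by the difference between multimodal and text-only output distributions" could look like for a single decoding step. It is not the authors' implementation. Everything specific here is an assumption made for illustration: the contrastive combination of logits, the use of KL divergence as the distribution difference, the exponential mapping from divergence to the suppression weight (including its direction), and the names `suppress_language_prior` and `max_alpha`.

```python
import math


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))


def suppress_language_prior(mm_logits, text_logits, max_alpha=1.0):
    """Hypothetical single-step adjustment of next-token logits.

    mm_logits:   logits from the model given image + text (multimodal input)
    text_logits: logits from the same model given the text alone
    max_alpha:   assumed upper bound on the suppression weight
    """
    p_mm = softmax(mm_logits)
    p_text = softmax(text_logits)

    # Assumption: measure the distribution difference with KL divergence.
    divergence = max(0.0, kl_divergence(p_mm, p_text))

    # Assumption: the closer the two distributions are, the more the
    # multimodal output is treated as prior-driven, so the weight is larger.
    alpha = max_alpha * math.exp(-divergence)

    # Assumption: contrastive combination that subtracts the text-only logits.
    adjusted = [(1.0 + alpha) * m - alpha * t
                for m, t in zip(mm_logits, text_logits)]
    return softmax(adjusted), alpha


# Toy three-token vocabulary: the text-only pass strongly favors token 1.
probs, alpha = suppress_language_prior([2.0, 1.0, 0.0], [0.0, 2.0, 0.0])
print(alpha, probs)
```

In this toy call, the token that the text-only pass favors loses probability mass after the adjustment, which is the qualitative behavior the abstract describes. The actual weighting rule used by NoLan should be taken from the paper or the released code rather than from this sketch.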
