Multi-Object Hallucination in Vision-Language Models

Abstract

Large vision-language models (LVLMs) often suffer from object hallucination: they produce objects that are not present in the given image. Current object hallucination benchmarks concentrate primarily on the presence of a single object class rather than on individual entities. This work instead systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. Through comprehensive empirical studies and an analysis of potential factors leading to multi-object hallucination, we find that (1) LVLMs hallucinate more when focusing on multiple objects than when focusing on a single object; (2) the distribution of tested object classes affects hallucination behavior, indicating that LVLMs may follow shortcuts and spurious correlations; and (3) hallucinatory behavior is influenced by data-specific factors, salience and frequency, and intrinsic model behaviors. We hope to enable LVLMs to recognize and reason about the multiple objects that often occur in realistic visual scenes, to provide insights, and to quantify progress toward mitigating these issues.

