Multi-Object Hallucination in Vision-Language Models

Abstract

Large vision language models (LVLMs) often suffer from object hallucination, producing objects that are not present in the given images. Current benchmarks for object hallucination primarily concentrate on the presence of a single object class and evaluate at the class level rather than on individual entities. This work systematically investigates multi-object hallucination, examining how models misperceive (e.g., invent nonexistent objects or become distracted) when tasked with focusing on multiple objects simultaneously. We introduce Recognition-based Object Probing Evaluation (ROPE), an automated evaluation protocol that considers the distribution of object classes within a single image during testing and uses visual referring prompts to eliminate ambiguity. Through comprehensive empirical studies and analysis of potential factors leading to multi-object hallucination, we find that (1) LVLMs hallucinate more when focusing on multiple objects than when focusing on a single object; (2) the tested object class distribution affects hallucination behaviors, indicating that LVLMs may follow shortcuts and spurious correlations; and (3) hallucinatory behaviors are influenced by data-specific factors, salience and frequency, and model intrinsic behaviors. We hope to enable LVLMs to recognize and reason about multiple objects that often occur in realistic visual scenes, provide insights, and quantify our progress towards mitigating the issues.
