When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Abstract

Large Multimodal Models (LMMs) have achieved impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to accurately spot and understand the content, frequently generating semantically plausible yet visually incorrect answers, which we refer to as semantic hallucination. In this work, we investigate the underlying causes of semantic hallucination and identify a key finding: Transformer layers in the LLM that attend more strongly to scene text regions are less prone to producing semantic hallucinations. Based on this finding, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only effectively mitigates semantic hallucination but also achieves strong performance on public benchmarks for scene text spotting and understanding.
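
The key finding above (layers that attend more strongly to scene text regions hallucinate less) can be illustrated with a toy sketch. This is not the paper's implementation: the abstract does not specify how layers are scored or how their representations guide decoding. The function names, the attention-share score, and the blending weight `alpha` are all hypothetical choices made for illustration only.

```python
# Illustrative sketch only; every name and formula here is an assumption,
# not the method described in the paper.

def text_attention_share(attn, text_mask):
    """Fraction of one layer's attention mass that falls on text-region tokens."""
    total = sum(attn)
    on_text = sum(a for a, is_text in zip(attn, text_mask) if is_text)
    return on_text / total if total > 0 else 0.0

def pick_grounded_layer(layer_attn, text_mask):
    """Index of the layer whose attention is most concentrated on text tokens."""
    shares = [text_attention_share(a, text_mask) for a in layer_attn]
    return max(range(len(shares)), key=shares.__getitem__)

def blend_logits(final_logits, grounded_logits, alpha=0.5):
    """Convex combination of final-layer and grounded-layer logits.

    alpha is a hypothetical mixing weight; alpha=0 keeps the original output.
    """
    return [(1 - alpha) * f + alpha * g
            for f, g in zip(final_logits, grounded_logits)]

# Toy data: three layers attending over four image tokens,
# where the last two tokens are assumed to cover scene text.
layer_attn = [
    [0.4, 0.3, 0.2, 0.1],
    [0.1, 0.1, 0.4, 0.4],
    [0.25, 0.25, 0.25, 0.25],
]
text_mask = [False, False, True, True]

best_layer = pick_grounded_layer(layer_attn, text_mask)
blended = blend_logits([2.0, 0.0], [0.0, 2.0], alpha=0.5)
```

In this toy setup the second layer is selected because most of its attention falls on the assumed text tokens. In a real model the text mask would come from a region-identification step such as ZoomText, and the correction would operate on the model's actual hidden states rather than hand-written lists.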
