When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding

Abstract

Large Multimodal Models (LMMs) have made impressive progress in visual perception and reasoning. However, when confronted with visually ambiguous or non-semantic scene text, they often struggle to spot and understand the content accurately, and frequently generate answers that are semantically plausible yet visually incorrect. We refer to this phenomenon as semantic hallucination. In this work, we investigate its underlying causes and identify a key finding: Transformer layers in the LLM that attend more strongly to scene text regions are less prone to producing semantic hallucinations. Based on this finding, we propose a training-free semantic hallucination mitigation framework comprising two key components: (1) ZoomText, a coarse-to-fine strategy that identifies potential text regions without external detectors; and (2) Grounded Layer Correction, which adaptively leverages the internal representations from layers less prone to hallucination to guide decoding, correcting hallucinated outputs for non-semantic samples while preserving the semantics of meaningful ones. To enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over 1,730 samples spanning both semantic and non-semantic cases, with manually curated question-answer pairs designed to probe model hallucinations. Extensive experiments demonstrate that our method not only mitigates semantic hallucination effectively but also achieves strong performance on public benchmarks for scene text spotting and understanding.
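The key finding above (layers that attend more strongly to scene text regions hallucinate less) can be illustrated with a small sketch. This is not the authors' implementation: the function names, the per-layer attention array, the text-region mask, and the linear blending rule are all hypothetical, chosen only to show how one might score layers by attention mass on text tokens and then mix the best-scoring layer's representation into the final one.

```python
import numpy as np

def text_attention_scores(attn, text_mask):
    """Fraction of each layer's attention that falls on text-region tokens.

    attn:      (num_layers, num_image_tokens) attention weights, assumed to be
               already averaged over heads (hypothetical input format).
    text_mask: (num_image_tokens,) boolean mask marking text-region tokens.
    """
    attn = np.asarray(attn, dtype=float)
    mask = np.asarray(text_mask, dtype=bool)
    on_text = attn[:, mask].sum(axis=1)
    total = attn.sum(axis=1)
    # Guard against division by zero for layers with no attention mass.
    return on_text / np.maximum(total, 1e-12)

def most_grounded_layer(attn, text_mask):
    """Index of the layer with the highest share of attention on text tokens."""
    return int(np.argmax(text_attention_scores(attn, text_mask)))

def blend_hidden_states(h_final, h_grounded, weight):
    """Assumed correction rule: a plain linear blend of two hidden states.

    The abstract only says the grounded layer 'guides decoding'; the actual
    rule may differ, so treat this as a placeholder.
    """
    return (1.0 - weight) * np.asarray(h_final) + weight * np.asarray(h_grounded)

# Toy example: three layers, four image tokens, only the last token is text.
attn = [[0.10, 0.10, 0.40, 0.40],
        [0.05, 0.05, 0.10, 0.80],
        [0.25, 0.25, 0.25, 0.25]]
text_mask = [False, False, False, True]
best = most_grounded_layer(attn, text_mask)
```

In this toy input the second layer (index 1) puts the largest share of its attention on the text token, so it would be the one selected as the grounded layer.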
