When Semantics Mislead Vision: Mitigating Large Multimodal Models Hallucinations in Scene Text Spotting and Understanding
June 5, 2025
Authors: Yan Shu, Hangui Lin, Yexin Liu, Yan Zhang, Gangyan Zeng, Yan Li, Yu Zhou, Ser-Nam Lim, Harry Yang, Nicu Sebe
cs.AI
Abstract
Large Multimodal Models (LMMs) have achieved impressive progress in visual
perception and reasoning. However, when confronted with visually ambiguous or
non-semantic scene text, they often struggle to accurately spot and understand
the content, frequently generating semantically plausible yet visually
incorrect answers, which we refer to as semantic hallucination. In this work,
we investigate the underlying causes of semantic hallucination and identify a
key finding: Transformer layers in the LLM that attend more strongly to scene
text regions are less prone to producing semantic hallucinations. Building on
this insight, we
propose a training-free semantic hallucination mitigation framework comprising
two key components: (1) ZoomText, a coarse-to-fine strategy that identifies
potential text regions without external detectors; and (2) Grounded Layer
Correction, which adaptively leverages the internal representations from layers
less prone to hallucination to guide decoding, correcting hallucinated outputs
for non-semantic samples while preserving the semantics of meaningful ones. To
enable rigorous evaluation, we introduce TextHalu-Bench, a benchmark of over
1,730 samples spanning both semantic and non-semantic cases, with manually
curated question-answer pairs designed to probe model hallucinations. Extensive
experiments demonstrate that our method not only effectively mitigates semantic
hallucination but also achieves strong performance on public benchmarks for
scene text spotting and understanding.
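To make the key finding concrete, here is a minimal sketch (not the paper's code) of the attention diagnostic it implies: scoring each Transformer layer by how much attention mass its queries place on the visual tokens that cover scene-text regions. The tensor layout follows the common Hugging Face convention of per-layer attention maps, and `text_token_mask` is a hypothetical mask marking text-region image tokens.

```python
# Hedged sketch: rank Transformer layers by attention mass on scene-text
# tokens, per the paper's finding that text-focused layers hallucinate less.
import torch

def text_region_attention_per_layer(attentions, text_token_mask):
    """attentions: list of L tensors, each [batch, heads, Q, K]
    text_token_mask: bool tensor [K], True where a key position is a visual
                     token inside a (hypothetical) detected text region
    returns: tensor [L] of mean attention mass on text regions per layer
    """
    scores = []
    for layer_attn in attentions:
        # Sum attention over text-region key positions, then average over
        # batch, heads, and query positions.
        mass = layer_attn[..., text_token_mask].sum(dim=-1)  # [batch, heads, Q]
        scores.append(mass.mean())
    return torch.stack(scores)

# Toy usage with random tensors standing in for a real LMM forward pass.
L, B, H, Q, K = 4, 1, 8, 16, 32
attns = [torch.softmax(torch.randn(B, H, Q, K), dim=-1) for _ in range(L)]
mask = torch.zeros(K, dtype=torch.bool)
mask[10:18] = True  # pretend these key positions are text-region image tokens
per_layer = text_region_attention_per_layer(attns, mask)
reliable_layer = int(per_layer.argmax())  # layer attending most to the text
```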
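ZoomText itself is only described as a detector-free, coarse-to-fine localization strategy; one plausible reading is that attention over image patches from a coarse pass selects a region to re-encode at higher resolution. The sketch below assumes exactly that; the patch grid, `patch_attn` signal, and crop geometry are illustrative, not the paper's specification.

```python
# Hedged sketch of a ZoomText-style coarse-to-fine pass: crop around the
# highest-attention image patch and re-encode the crop at finer resolution.
import torch

def zoom_crop(image, patch_attn, grid=(24, 24), zoom=2.0):
    """image: [3, H, W] tensor; patch_attn: [grid_h * grid_w] attention mass
    over image patches from a coarse forward pass. Returns a crop centered on
    the highest-attention patch, to be re-encoded at higher resolution."""
    gh, gw = grid
    _, H, W = image.shape
    idx = int(patch_attn.argmax())
    cy = (idx // gw + 0.5) * H / gh      # patch center, in pixels
    cx = (idx % gw + 0.5) * W / gw
    half_h, half_w = H / (2 * zoom), W / (2 * zoom)
    # Clamp the crop window so it stays inside the image.
    top = int(max(0, min(H - 2 * half_h, cy - half_h)))
    left = int(max(0, min(W - 2 * half_w, cx - half_w)))
    return image[:, top:top + int(2 * half_h), left:left + int(2 * half_w)]

img = torch.rand(3, 336, 336)
attn = torch.rand(24 * 24)
crop = zoom_crop(img, attn)  # feed this crop to the fine-grained pass
```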
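Grounded Layer Correction is described only as adaptively using representations from less hallucination-prone layers to guide decoding. The abstract does not give the correction rule, so the sketch below makes one DoLa-style assumption: project the hidden state of the attention-selected layer through the LM head and blend its next-token logits with the final layer's. The mixing weight `alpha` and the early-exit projection are assumptions, not the paper's method.

```python
# Hedged sketch: steer next-token logits toward a text-grounded layer.
import torch

def grounded_layer_logits(hidden_states, lm_head, grounded_layer, alpha=0.5):
    """hidden_states: list of per-layer hidden states, each [batch, seq, dim]
    lm_head: projection module mapping dim -> vocab size
    grounded_layer: layer index selected by the attention diagnostic above
    """
    final_logits = lm_head(hidden_states[-1][:, -1])            # [batch, vocab]
    grounded_logits = lm_head(hidden_states[grounded_layer][:, -1])
    # Convex blend: pull the next-token distribution toward the grounded layer.
    return (1 - alpha) * final_logits + alpha * grounded_logits

# Toy usage with random hidden states in place of a real decoder forward pass.
D, V = 64, 1000
lm_head = torch.nn.Linear(D, V, bias=False)
hs = [torch.randn(1, 5, D) for _ in range(8)]
logits = grounded_layer_logits(hs, lm_head, grounded_layer=5)
next_token = int(logits.argmax(dim=-1))
```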