How Do Large Vision-Language Models See Text in Image? Unveiling the Distinctive Role of OCR Heads
May 21, 2025
Authors: Ingeol Baek, Hwan Chang, Sunghyun Ryu, Hwanhee Lee
cs.AI
Abstract
Despite significant advancements in Large Vision Language Models (LVLMs), a
gap remains, particularly regarding their interpretability and how they locate
and interpret textual information within images. In this paper, we explore
various LVLMs to identify the specific heads responsible for recognizing text
from images, which we term the Optical Character Recognition Head (OCR Head).
Our findings regarding these heads are as follows: (1) Less Sparse: Unlike
previous retrieval heads, a large number of heads are activated to extract
textual information from images. (2) Qualitatively Distinct: OCR heads possess
properties that differ significantly from general retrieval heads, exhibiting
low similarity in their characteristics. (3) Statically Activated: The
frequency of activation for these heads closely aligns with their OCR scores.
We validate our findings in downstream tasks by applying Chain-of-Thought (CoT)
to both OCR and conventional retrieval heads and by masking these heads. We
also demonstrate that redistributing sink-token values within the OCR heads
improves performance. These insights provide a deeper understanding of the
internal mechanisms LVLMs employ in processing embedded textual information in
images.
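To make the two interventions concrete, here is a minimal illustrative sketch (not the paper's code) of masking selected attention heads and redistributing sink-token attention mass. It assumes an attention tensor of shape `[heads, queries, keys]` whose rows sum to 1; the function names and the convention that the sink token sits at key index 0 are assumptions for illustration:

```python
import numpy as np

def mask_heads(attn, head_ids):
    """Zero out the attention maps of the selected heads.

    attn: array of shape [heads, queries, keys]; head_ids: heads to ablate.
    """
    out = attn.copy()
    out[head_ids] = 0.0
    return out

def redistribute_sink(attn, sink_idx=0):
    """Remove attention mass from the sink token and renormalize each row,
    spreading that mass proportionally over the remaining keys."""
    out = attn.copy()
    out[..., sink_idx] = 0.0
    denom = out.sum(axis=-1, keepdims=True)
    denom[denom == 0] = 1.0  # guard rows that attended only to the sink
    return out / denom
```

For example, with uniform attention over 3 keys, `redistribute_sink` zeroes the sink column and rescales the other two weights to 0.5 each, so each row still sums to 1, while `mask_heads` leaves the ablated heads' maps entirely zero.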