Parrot Captions Teach CLIP to Spot Text
December 21, 2023
Authors: Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
cs.AI
Abstract
Despite CLIP being the foundation model in numerous vision-language
applications, CLIP suffers from a severe text spotting bias. Such bias
causes CLIP models to 'Parrot' the visual text embedded within images while
disregarding the authentic visual semantics. We uncover that in the most
popular image-text dataset LAION-2B, the captions also densely parrot (spell)
the text embedded in images. Our analysis shows that around 50% of
images are embedded with visual text content, and around 90% of their
captions more or less parrot the visual text. Based on such observation, we
thoroughly inspect the different released versions of CLIP models and verify
that the visual text is the dominant factor in measuring the LAION-style
image-text similarity for these models. To examine whether these parrot
captions shape the text spotting bias, we train a series of CLIP models with
LAION subsets curated by different parrot-caption-oriented criteria. We show
that training with parrot captions easily shapes such bias but harms the
expected vision-language representation learning in CLIP models. This suggests
that it is urgent to revisit either the design of CLIP-like models or the
existing image-text dataset curation pipeline built on CLIP score filtering.
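
To make the two measurements above concrete, the following is a minimal Python sketch of (1) LAION-style CLIP image-text similarity and (2) a simple check of how much a caption parrots the text spotted in an image. It assumes the open_clip_torch and easyocr packages; the checkpoint name, OCR engine, and word-overlap metric are illustrative stand-ins rather than the paper's exact pipeline.

# A rough sketch of the two measurements described in the abstract (illustrative only):
# (1) LAION-style CLIP image-text similarity, and
# (2) the fraction of OCR-spotted words that a caption "parrots".
# Assumes open_clip_torch and easyocr; the checkpoint, OCR engine, and overlap
# metric are stand-ins, not necessarily the paper's exact setup.
import torch
import open_clip
import easyocr
from PIL import Image

# Open-source CLIP checkpoint trained on LAION-2B.
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
ocr_reader = easyocr.Reader(["en"], gpu=False)

def clip_score(image_path: str, caption: str) -> float:
    # Cosine similarity between the CLIP image and text embeddings.
    image = preprocess(Image.open(image_path)).unsqueeze(0)
    text = tokenizer([caption])
    with torch.no_grad():
        img_feat = model.encode_image(image)
        txt_feat = model.encode_text(text)
        img_feat = img_feat / img_feat.norm(dim=-1, keepdim=True)
        txt_feat = txt_feat / txt_feat.norm(dim=-1, keepdim=True)
    return (img_feat @ txt_feat.T).item()

def parrot_ratio(image_path: str, caption: str) -> float:
    # Fraction of OCR-detected words that also appear in the caption:
    # 1.0 means every spotted word is parroted, 0.0 means none are.
    ocr_words = set()
    for _, text, _ in ocr_reader.readtext(image_path):
        ocr_words.update(text.lower().split())
    if not ocr_words:
        return 0.0
    caption_words = set(caption.lower().split())
    return len(ocr_words & caption_words) / len(ocr_words)

# A pair with both a high CLIP score and a high parrot ratio suggests the
# similarity is driven by spelled-out visual text rather than visual semantics:
# clip_score("poster.jpg", "big summer sale 50% off")
# parrot_ratio("poster.jpg", "big summer sale 50% off")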