

Parrot Captions Teach CLIP to Spot Text

December 21, 2023
Authors: Yiqi Lin, Conghui He, Alex Jinpeng Wang, Bin Wang, Weijia Li, Mike Zheng Shou
cs.AI

Abstract

Despite CLIP being the foundation model in numerous vision-language applications, CLIP suffers from a severe text spotting bias. Such bias causes CLIP models to 'parrot' the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell out) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and 90% of their captions more or less parrot the visual text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected vision-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
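The "LAION-style image-text similarity" referenced above is the cosine similarity between CLIP's image and text embeddings, the same score used to filter pairs when curating datasets such as LAION-2B. Below is a minimal sketch of how such a score is typically computed, assuming the open_clip library and a LAION-2B-pretrained ViT-B/32 checkpoint; the image path and caption are placeholders.

```python
import torch
import open_clip
from PIL import Image

# Load a CLIP model pretrained on LAION-2B (assumed checkpoint tag).
model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k"
)
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Placeholder inputs: an image that may contain embedded (rendered) text,
# and a caption that "parrots" that text.
image = preprocess(Image.open("example.jpg")).unsqueeze(0)
text = tokenizer(["50% OFF SALE TODAY"])

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize, then take the dot product = cosine similarity (CLIP score).
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)
    clip_score = (image_features @ text_features.T).item()

print(f"CLIP image-text similarity: {clip_score:.4f}")
```

A high score for a caption that merely spells out the rendered text in the image, rather than describing the scene, is the symptom the paper attributes to the text spotting bias.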