Parrot Captions Teach CLIP to Spot Text

Abstract

Despite CLIP being the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and 90% of their captions more or less parrot the visual text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that the visual text is the dominant factor in measuring the LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models with LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
