Parrot Captions Teach CLIP to Spot Text

Abstract

Although CLIP is the foundation model in numerous vision-language applications, it suffers from a severe text spotting bias. This bias causes CLIP models to "parrot" the visual text embedded within images while disregarding the authentic visual semantics. We uncover that in the most popular image-text dataset, LAION-2B, the captions also densely parrot (spell) the text embedded in images. Our analysis shows that around 50% of images are embedded with visual text content, and 90% of their captions more or less parrot the visual text. Based on this observation, we thoroughly inspect the different released versions of CLIP models and verify that visual text is the dominant factor in measuring LAION-style image-text similarity for these models. To examine whether these parrot captions shape the text spotting bias, we train a series of CLIP models on LAION subsets curated by different parrot-caption-oriented criteria. We show that training with parrot captions easily shapes such bias but harms the expected visual-language representation learning in CLIP models. This suggests that it is urgent to revisit either the design of CLIP-like models or the existing image-text dataset curation pipeline built on CLIP score filtering.
