Vision Language Models are Biased

Abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but can also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to the 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains ranging from animals, logos, chess, board games, and optical illusions to patterned grids. Inserting text (e.g., "Adidas") that names the subject into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results, or to rely exclusively on image details when answering, improves counting accuracy by only +2 points on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
