
Vision Language Models are Biased

May 29, 2025
Authors: An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim
cs.AI

Abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet that helps them on downstream tasks but may also notoriously sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy in counting (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains spanning animals, logos, chess, board games, optical illusions, and patterned grids. Inserting text (e.g., "Adidas") describing the subject name into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results or rely exclusively on image details to answer improves counting accuracy by only +2 points, on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
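
To make the counterfactual counting probe described in the abstract concrete, here is a minimal sketch, assuming an OpenAI-compatible vision chat API. The model name, image file, and prompt wording are illustrative assumptions and do not reproduce the authors' framework.

```python
import base64
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def ask_count(image_path: str, question: str, model: str = "gpt-4o") -> str:
    """Send a counting question about a (possibly counterfactual) image to a VLM."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return response.choices[0].message.content


# Hypothetical counterfactual probe: a logo edited to show 4 stripes instead of 3.
# A biased VLM will often still answer "3", echoing its prior knowledge of the
# Adidas logo rather than the image details.
print(ask_count("adidas_4_stripes.png",
                "How many stripes are in this logo? Answer with a number."))
```

The same loop can be repeated with the biasing text (e.g., the word "Adidas") rendered into the image, or with an added "double-check your answer" instruction, to observe the accuracy drops reported in the abstract.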
