Vision Language Models are Biased
May 29, 2025
Authors: An Vo, Khai-Nguyen Nguyen, Mohammad Reza Taesiri, Vy Tuong Dang, Anh Totti Nguyen, Daeyoung Kim
cs.AI
Abstract
Large language models (LLMs) memorize a vast amount of prior knowledge from
the Internet that helps them on downstream tasks but may also notoriously sway
their outputs toward wrong or biased answers. In this work, we test how
knowledge about popular subjects hurts the accuracy of vision language models
(VLMs) on standard, objective visual tasks of counting and identification. We
find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize
that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an
average of 17.05% accuracy on counting (e.g., counting stripes in an
Adidas-like logo) across 7 diverse domains, from animals, logos, chess, and
board games to optical illusions and patterned grids. Inserting text (e.g.,
"Adidas") describing the subject's name into the counterfactual image further
decreases VLM accuracy. The biases in VLMs are so strong that instructing them
to double-check their results or to rely exclusively on image details improves
counting accuracy by only +2 points on average. Our work presents an
interesting failure mode in VLMs and an automated framework for testing VLM
biases. Code and data are available at: vlmsarebiased.github.io.
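To make the evaluation concrete, below is a minimal sketch of how such a counterfactual counting test could be scripted. It is written under stated assumptions: make_striped_logo is an illustrative stand-in for the paper's image generation, and query_vlm is a hypothetical placeholder for a real VLM API call; neither is the authors' actual code, which is linked at vlmsarebiased.github.io.

```python
# Minimal sketch of a counterfactual counting test, in the spirit of the
# paper's automated framework. Image generation and query_vlm() are
# illustrative assumptions, not the authors' implementation.
from PIL import Image, ImageDraw

def make_striped_logo(n_stripes: int, size: int = 256) -> Image.Image:
    """Draw a plain Adidas-like logo with n_stripes slanted black bars."""
    img = Image.new("RGB", (size, size), "white")
    draw = ImageDraw.Draw(img)
    bar_w = size // (3 * n_stripes)
    slant = size // 4
    for i in range(n_stripes):
        x0 = bar_w * (2 * i + 1)
        # One stripe as a slanted parallelogram.
        draw.polygon(
            [(x0, size), (x0 + bar_w, size),
             (x0 + bar_w + slant, size // 3), (x0 + slant, size // 3)],
            fill="black",
        )
    return img

def query_vlm(image: Image.Image, prompt: str) -> str:
    """Hypothetical VLM call; replace with a real vision-model client
    that returns the model's text answer."""
    raise NotImplementedError

# Two prompt conditions from the abstract: a plain question, and a
# "double-check / rely only on image details" instruction, which the
# paper reports buys only about +2 accuracy points on average.
PROMPTS = [
    "How many stripes are in this logo? Answer with a number only.",
    "Count the stripes in this logo. Rely only on what is visible in the "
    "image and double-check your count. Answer with a number only.",
]

def counting_accuracy(n_stripes: int = 4, trials: int = 20) -> list[float]:
    """Accuracy per prompt on a counterfactual (e.g., 4-stripe) logo."""
    scores = []
    for prompt in PROMPTS:
        correct = 0
        for _ in range(trials):
            answer = query_vlm(make_striped_logo(n_stripes), prompt)
            correct += answer.strip() == str(n_stripes)
        scores.append(correct / trials)
    return scores
```

Because the ground-truth count is known by construction, a harness like this can grade any VLM automatically. The paper's second manipulation, overlaying the subject's name (e.g., "Adidas") on the counterfactual image, could be added with ImageDraw.text before querying.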