Vision Language Models are Biased

Abstract

Large language models (LLMs) memorize a vast amount of prior knowledge from the Internet. This knowledge helps them on downstream tasks, but it is also known to sway their outputs towards wrong or biased answers. In this work, we test how knowledge about popular subjects hurts the accuracy of vision language models (VLMs) on standard, objective visual tasks of counting and identification. We find that state-of-the-art VLMs are strongly biased (e.g., unable to recognize that a fourth stripe has been added to a 3-stripe Adidas logo), scoring an average of 17.05% accuracy on counting tasks (e.g., counting stripes in an Adidas-like logo) across 7 diverse domains, including animals, logos, chess, board games, optical illusions, and patterned grids. Inserting text that names the subject (e.g., "Adidas") into the counterfactual image further decreases VLM accuracy. The biases in VLMs are so strong that instructing them to double-check their results, or to rely exclusively on image details when answering, improves counting accuracy by only +2 points on average. Our work presents an interesting failure mode in VLMs and an automated framework for testing VLM biases. Code and data are available at: vlmsarebiased.github.io.
