Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

Abstract

Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening an exciting array of new applications from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision: specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, or the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language, a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
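
The abstract reports shape bias as a percentage without restating how it is computed. A minimal sketch follows, assuming the definition commonly used for cue-conflict images (images whose shape and texture belong to different classes): the fraction of shape decisions among all trials in which the model chose either the shape class or the texture class. The names `CueConflictTrial`, `shape_bias`, and `example_trials` are hypothetical and are not taken from the paper, and the trial data is made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CueConflictTrial:
    """One cue-conflict image and the model's answer (hypothetical structure)."""
    shape_label: str    # class suggested by the object's shape
    texture_label: str  # class suggested by the object's texture
    prediction: str     # class the model actually answered


def shape_bias(trials):
    """Return shape decisions / (shape decisions + texture decisions).

    Assumed definition: trials where the prediction matches neither cue are
    ignored, and trials whose two cues agree are skipped because they carry
    no conflict.
    """
    shape_hits = 0
    texture_hits = 0
    for trial in trials:
        if trial.shape_label == trial.texture_label:
            continue  # not a cue-conflict image
        if trial.prediction == trial.shape_label:
            shape_hits += 1
        elif trial.prediction == trial.texture_label:
            texture_hits += 1
    decided = shape_hits + texture_hits
    if decided == 0:
        raise ValueError("no trial matched either the shape or the texture label")
    return shape_hits / decided


# Illustrative data only: three shape decisions, one texture decision,
# and one answer that matches neither cue.
example_trials = [
    CueConflictTrial("cat", "elephant", "cat"),
    CueConflictTrial("cat", "elephant", "elephant"),
    CueConflictTrial("car", "clock", "car"),
    CueConflictTrial("bear", "bottle", "dog"),
    CueConflictTrial("boat", "knife", "boat"),
]

print(f"shape bias: {shape_bias(example_trials):.0%}")  # prints "shape bias: 75%"
```

Under this assumed definition, a value of 50% means the model follows shape and texture equally often, so the reported range of 49% to 72% spans roughly balanced to clearly shape-leaning behavior.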
