ChatPaper.ai

Are Vision Language Models Texture or Shape Biased and Can We Steer Them?

March 14, 2024
作者: Paul Gavrikov, Jovita Lukasik, Steffen Jung, Robert Geirhos, Bianca Lamm, Muhammad Jehanzeb Mirza, Margret Keuper, Janis Keuper
cs.AI

Abstract
Vision language models (VLMs) have drastically changed the computer vision model landscape in only a few years, opening up an exciting array of new applications, from zero-shot image classification to image captioning and visual question answering. Unlike pure vision models, they offer an intuitive way to access visual content through language prompting. The wide applicability of such models encourages us to ask whether they also align with human vision - specifically, how far they adopt human-induced visual biases through multimodal fusion, or whether they simply inherit biases from pure vision models. One important visual bias is the texture vs. shape bias, i.e., the dominance of local over global information. In this paper, we study this bias in a wide range of popular VLMs. Interestingly, we find that VLMs are often more shape-biased than their vision encoders, indicating that visual biases are modulated to some extent through text in multimodal models. If text does indeed influence visual biases, this suggests that we may be able to steer visual biases not just through visual input but also through language: a hypothesis that we confirm through extensive experiments. For instance, we are able to steer shape bias from as low as 49% to as high as 72% through prompting alone. For now, the strong human bias towards shape (96%) remains out of reach for all tested VLMs.
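The shape-bias percentages quoted above follow the standard cue-conflict definition (shape decisions divided by shape-plus-texture decisions on images whose shape and texture labels disagree). A minimal sketch of that metric, with a purely illustrative toy decision list in place of real VLM outputs:

```python
def shape_bias(decisions):
    """Fraction of cue-following answers that chose the shape category.

    decisions: one entry per cue-conflict image, each 'shape' (answer matched
    the shape label), 'texture' (matched the texture label), or 'other'
    (matched neither and is excluded from the ratio).
    """
    shape = sum(d == "shape" for d in decisions)
    texture = sum(d == "texture" for d in decisions)
    total = shape + texture
    return shape / total if total else float("nan")

# Toy example (not real data): 7 shape answers, 3 texture answers, 2 misses
decisions = ["shape"] * 7 + ["texture"] * 3 + ["other"] * 2
print(round(shape_bias(decisions), 2))  # 0.7, i.e. 70% shape bias
```

Under this metric, 49% means roughly balanced cue use, while the human reference value of 96% means shape almost always wins.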
