On the Limitations of Vision-Language Models in Understanding Image Transforms

March 12, 2025
Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
cs.AI

Abstract

Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
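
The probe the abstract describes is simple in spirit: apply a basic transform to an image, then check whether a contrastive VLM can match the result against text naming that transform. Below is a minimal sketch of such a check using the Hugging Face `transformers` CLIP interface; it is not the authors' released code, and the image path and caption wordings are illustrative assumptions.

```python
# Minimal sketch: probe whether CLIP recognizes a simple image transform.
# Not the paper's official code; the file path and captions are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("flickr8k/667626_18933d713e.jpg")  # hypothetical Flickr8k image
rotated = image.rotate(90, expand=True)  # one of the "simple transformations"

# Candidate descriptions: only the first names the applied transform.
captions = [
    "a photo rotated by 90 degrees",
    "a horizontally flipped photo",
    "a blurred photo",
    "an unmodified photo",
]

inputs = processor(text=captions, images=rotated, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1).squeeze()

for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
# If the model understood the rotation, the first caption should dominate;
# the paper's finding is that CLIP and SigLIP often fail such checks.
```

SigLIP can be probed the same way by swapping in `SiglipModel`/`SiglipProcessor` with a `google/siglip-*` checkpoint, applying a sigmoid to the per-pair logits instead of a softmax over captions, since SigLIP is trained with a pairwise sigmoid loss.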

