On the Limitations of Vision-Language Models in Understanding Image Transforms
March 12, 2025
Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
cs.AI
Abstract
Vision Language Models (VLMs) have demonstrated significant potential in
various downstream tasks, including Image/Video Generation, Visual Question
Answering, Multimodal Chatbots, and Video Understanding. However, these models
often struggle with basic image transformations. This paper investigates the
image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by
Google. Our findings reveal that these models lack comprehension of multiple
image-level augmentations. To facilitate this study, we created an augmented
version of the Flickr8k dataset, pairing each image with a detailed description
of the applied transformation. We further explore how this deficiency impacts
downstream tasks, particularly in image editing, and evaluate the performance
of state-of-the-art Image2Image models on simple transformations.
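The evaluation described above can be illustrated with a small probe: present a VLM with an augmented image and several candidate captions, one of which names the applied transform, and check whether the model assigns it the highest score. The sketch below is not the authors' released code; the checkpoint name, the chosen transform, the image path, and the candidate captions are illustrative assumptions, and it uses the Hugging Face CLIP interface rather than any dataset or protocol from the paper.

```python
# Minimal sketch: can CLIP match an augmented image to a text description
# of the transform that was applied? Checkpoint, transform, file path, and
# captions below are hypothetical placeholders, not the paper's setup.
from PIL import Image
import torch
import torchvision.transforms.functional as TF
from transformers import CLIPModel, CLIPProcessor

model_name = "openai/clip-vit-base-patch32"  # assumed checkpoint
model = CLIPModel.from_pretrained(model_name)
processor = CLIPProcessor.from_pretrained(model_name)

image = Image.open("example.jpg")            # hypothetical Flickr8k-style image
augmented = TF.rotate(image, angle=90)       # one simple image-level transform

# Candidate captions: one describes the applied transform, the rest are distractors.
captions = [
    "a photo rotated by 90 degrees",
    "a horizontally flipped photo",
    "a blurred photo",
    "an unmodified photo",
]

with torch.no_grad():
    inputs = processor(text=captions, images=augmented,
                       return_tensors="pt", padding=True)
    logits = model(**inputs).logits_per_image   # shape: (1, num_captions)
    probs = logits.softmax(dim=-1).squeeze(0)

for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
# A model that understands the augmentation should rank the rotation caption
# first; the paper's finding is that CLIP and SigLIP often fail such checks.
```

A SigLIP checkpoint could be probed the same way by swapping in the corresponding model and processor classes; the comparison of interest is whether the transform-describing caption is preferred over the distractors, not the absolute score.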