On the Limitations of Vision-Language Models in Understanding Image Transforms

March 12, 2025
Authors: Ahmad Mustafa Anis, Hasnain Ali, Saquib Sarfraz
cs.AI

Abstract

Vision Language Models (VLMs) have demonstrated significant potential in various downstream tasks, including Image/Video Generation, Visual Question Answering, Multimodal Chatbots, and Video Understanding. However, these models often struggle with basic image transformations. This paper investigates the image-level understanding of VLMs, specifically CLIP by OpenAI and SigLIP by Google. Our findings reveal that these models lack comprehension of multiple image-level augmentations. To facilitate this study, we created an augmented version of the Flickr8k dataset, pairing each image with a detailed description of the applied transformation. We further explore how this deficiency impacts downstream tasks, particularly in image editing, and evaluate the performance of state-of-the-art Image2Image models on simple transformations.
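
The probe the abstract describes is simple in spirit: apply a basic transform to an image, then check whether a contrastive VLM can match the result against text naming that transform. Below is a minimal sketch of such a check using the Hugging Face `transformers` CLIP interface; it is not the authors' released code, and the image path and caption wordings are illustrative assumptions.

```python
# Minimal sketch: probe whether CLIP recognizes a simple image transform.
# Not the paper's official code; the file path and captions are assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("flickr8k/667626_18933d713e.jpg")  # hypothetical Flickr8k image
rotated = image.rotate(90, expand=True)  # one of the "simple transformations"

# Candidate descriptions: only the first names the applied transform.
captions = [
    "a photo rotated by 90 degrees",
    "a horizontally flipped photo",
    "a blurred photo",
    "an unmodified photo",
]

inputs = processor(text=captions, images=rotated, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # image-text similarity scores
probs = logits.softmax(dim=-1).squeeze()

for caption, p in zip(captions, probs.tolist()):
    print(f"{p:.3f}  {caption}")
# If the model understood the rotation, the first caption should dominate;
# the paper's finding is that CLIP and SigLIP often fail such checks.
```

SigLIP can be probed the same way by swapping in `SiglipModel`/`SiglipProcessor` with a `google/siglip-*` checkpoint, applying a sigmoid to the per-pair logits instead of a softmax over captions, since SigLIP is trained with a pairwise sigmoid loss.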

