

Lost in Embeddings: Information Loss in Vision-Language Models

September 15, 2025
作者: Wenyan Li, Raphael Tang, Chengzu Li, Caiqi Zhang, Ivan Vulić, Anders Søgaard
cs.AI

Abstract

Vision-language models (VLMs) often process visual inputs through a pretrained vision encoder, followed by a projection into the language model's embedding space via a connector component. While crucial for modality fusion, the potential information loss induced by this projection step and its direct impact on model capabilities remain understudied. We introduce two complementary approaches to examine and quantify this loss by analyzing the latent representation space. First, we evaluate semantic information preservation by analyzing changes in k-nearest neighbor relationships between image representations, before and after projection. Second, we directly measure information loss by reconstructing visual embeddings from the projected representation, localizing loss at an image patch level. Experiments reveal that connectors substantially distort the local geometry of visual representations, with k-nearest neighbors diverging by 40-60% post-projection, correlating with degradation in retrieval performance. The patch-level embedding reconstruction provides interpretable insights for model behavior on visually grounded question-answering tasks, finding that areas of high information loss reliably predict instances where models struggle.
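The first method described in the abstract, measuring how much of an image's k-nearest-neighbor set survives the connector projection, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the random Gaussian embeddings and the single linear map standing in for the connector are assumptions for the sake of a runnable example.

```python
import numpy as np

def knn_indices(X, k):
    # Pairwise Euclidean distances; return each row's k nearest
    # neighbors, excluding the point itself.
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    return np.argsort(d, axis=1)[:, :k]

def knn_overlap(pre, post, k=5):
    # Fraction of each point's k-NN set preserved after projection,
    # averaged over all points; divergence = 1 - overlap.
    a, b = knn_indices(pre, k), knn_indices(post, k)
    return float(np.mean([len(set(x) & set(y)) / k for x, y in zip(a, b)]))

rng = np.random.default_rng(0)
pre = rng.normal(size=(100, 64))              # stand-in for vision-encoder embeddings
W = rng.normal(size=(64, 32)) / np.sqrt(64)   # stand-in linear "connector"
post = pre @ W                                # projected representations
print(f"k-NN overlap after projection: {knn_overlap(pre, post):.2f}")
```

A 40-60% k-NN divergence as reported in the abstract corresponds to an overlap of 0.4-0.6 under this metric; an identity projection would yield an overlap of exactly 1.0.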