To Trust Or Not To Trust Your Vision-Language Model's Prediction

May 29, 2025
作者: Hao Dong, Moru Liu, Jian Liang, Eleni Chatzi, Olga Fink
cs.AI

Abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities in aligning visual and textual modalities, enabling a wide range of applications in multimodal understanding and generation. While they excel in zero-shot and transfer learning scenarios, VLMs remain susceptible to misclassification, often yielding confident yet incorrect predictions. This limitation poses a significant risk in safety-critical domains, where erroneous predictions can lead to severe consequences. In this work, we introduce TrustVLM, a training-free framework designed to address the critical challenge of estimating when VLM's predictions can be trusted. Motivated by the observed modality gap in VLMs and the insight that certain concepts are more distinctly represented in the image embedding space, we propose a novel confidence-scoring function that leverages this space to improve misclassification detection. We rigorously evaluate our approach across 17 diverse datasets, employing 4 architectures and 2 VLMs, and demonstrate state-of-the-art performance, with improvements of up to 51.87% in AURC, 9.14% in AUROC, and 32.42% in FPR95 compared to existing baselines. By improving the reliability of the model without requiring retraining, TrustVLM paves the way for safer deployment of VLMs in real-world applications. The code will be available at https://github.com/EPFL-IMOS/TrustVLM.
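The core idea (scoring confidence in the image embedding space rather than the joint image-text space) can be illustrated with a minimal sketch. This is not the authors' implementation: the prototype construction, function names, and scoring rule here are illustrative assumptions; the paper's actual confidence function should be taken from the released code.

```python
import numpy as np

def _normalize(x):
    # L2-normalize along the last axis
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def predict_with_image_space_confidence(img_emb, text_class_embs, image_class_protos):
    """Classify via text embeddings (standard CLIP-style zero-shot), but
    score confidence by similarity to class prototypes in the IMAGE
    embedding space, where some concepts are more distinctly represented.

    img_emb:            (d,)   embedding of the query image
    text_class_embs:    (C, d) text embeddings of the class prompts
    image_class_protos: (C, d) hypothetical per-class image prototypes
                               (e.g., mean embedding of reference images)
    """
    img = _normalize(img_emb)
    txt = _normalize(text_class_embs)
    proto = _normalize(image_class_protos)

    # Prediction: nearest class prompt in the joint image-text space.
    pred = int(np.argmax(txt @ img))

    # Confidence: cosine similarity to the predicted class's image-space
    # prototype; a low value flags a likely misclassification.
    confidence = float(proto[pred] @ img)
    return pred, confidence
```

A downstream system could compare `confidence` against a threshold tuned on a validation set and defer low-confidence predictions to a human, which is the selective-prediction setting the AURC/AUROC/FPR95 metrics evaluate.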
