ChatPaper.ai

Enhanced OoD Detection through Cross-Modal Alignment of Multi-Modal Representations

March 24, 2025
作者: Jeonghyeon Kim, Sangheum Hwang
cs.AI

Abstract

Prior research on out-of-distribution detection (OoDD) has primarily focused on single-modality models. Recently, with the advent of large-scale pretrained vision-language models such as CLIP, OoDD methods utilizing such multi-modal representations through zero-shot and prompt learning strategies have emerged. However, these methods typically involve either freezing the pretrained weights or only partially tuning them, which can be suboptimal for downstream datasets. In this paper, we highlight that multi-modal fine-tuning (MMFT) can achieve notable OoDD performance. Despite some recent works demonstrating the impact of fine-tuning methods for OoDD, there remains significant potential for performance improvement. We investigate the limitation of naïve fine-tuning methods, examining why they fail to fully leverage the pretrained knowledge. Our empirical analysis suggests that this issue could stem from the modality gap within in-distribution (ID) embeddings. To address this, we propose a training objective that enhances cross-modal alignment by regularizing the distances between image and text embeddings of ID data. This adjustment helps in better utilizing pretrained textual information by aligning similar semantics from different modalities (i.e., text and image) more closely in the hyperspherical representation space. We theoretically demonstrate that the proposed regularization corresponds to the maximum likelihood estimation of an energy-based model on a hypersphere. Utilizing ImageNet-1k OoD benchmark datasets, we show that our method, combined with post-hoc OoDD approaches leveraging pretrained knowledge (e.g., NegLabel), significantly outperforms existing methods, achieving state-of-the-art OoDD performance and leading ID accuracy.
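The core idea described in the abstract — pulling paired ID image and text embeddings closer together on the unit hypersphere — can be sketched as a simple regularizer. This is an illustrative reconstruction only, not the paper's implementation: the function names and the exact form of the penalty (here, one minus the cosine similarity of each image-text pair) are assumptions.

```python
import numpy as np


def l2_normalize(x, axis=-1):
    # Project embeddings onto the unit hypersphere, as in CLIP-style models.
    return x / np.linalg.norm(x, axis=axis, keepdims=True)


def alignment_regularizer(img_emb, txt_emb):
    # Hypothetical cross-modal alignment penalty: for each paired ID
    # image/text embedding, penalize the angular gap on the hypersphere
    # via 1 - cos(theta), averaged over pairs. Zero iff every pair is
    # perfectly aligned after normalization.
    img = l2_normalize(np.asarray(img_emb, dtype=np.float64))
    txt = l2_normalize(np.asarray(txt_emb, dtype=np.float64))
    return float(np.mean(1.0 - np.sum(img * txt, axis=-1)))


# Toy example: 4 image/text pairs in an 8-dim embedding space,
# with the text embeddings slightly perturbed copies of the images.
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.1 * rng.normal(size=(4, 8))
reg = alignment_regularizer(img, txt)
```

In training, such a term would be added (with some weight) to the usual fine-tuning objective, shrinking the modality gap that the abstract identifies as the obstacle to exploiting pretrained textual knowledge.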
