

AdaptCLIP: Adapting CLIP for Universal Visual Anomaly Detection

May 15, 2025
作者: Bin-Bin Gao, Yue Zhu, Jiangtao Yan, Yuezhi Cai, Weixi Zhang, Meng Wang, Jun Liu, Yong Liu, Lei Wang, Chengjie Wang
cs.AI

Abstract

Universal visual anomaly detection aims to identify anomalies from novel or unseen vision domains without additional fine-tuning, which is critical in open scenarios. Recent studies have demonstrated that pre-trained vision-language models like CLIP exhibit strong generalization with just zero or a few normal images. However, existing methods struggle with prompt template design, complex token interactions, or the need for additional fine-tuning, resulting in limited flexibility. In this work, we present a simple yet effective method called AdaptCLIP based on two key insights. First, adaptive visual and textual representations should be learned alternately rather than jointly. Second, comparative learning between the query and the normal image prompt should incorporate both contextual and aligned residual features, rather than relying solely on residual features. AdaptCLIP treats CLIP models as a foundational service, adding only three simple adapters (a visual adapter, a textual adapter, and a prompt-query adapter) at its input or output ends. AdaptCLIP supports zero-/few-shot generalization across domains and operates in a training-free manner on target domains once trained on a base dataset. AdaptCLIP achieves state-of-the-art performance on 12 anomaly detection benchmarks from industrial and medical domains, significantly outperforming existing competitive methods. We will make the code and model of AdaptCLIP available at https://github.com/gaobb/AdaptCLIP.
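
The abstract gives no implementation details, but the adapter layout it describes can be illustrated with a minimal PyTorch sketch. Everything below is an assumption made for illustration, not the authors' implementation: the bottleneck MLPAdapter design, the feature dimensions, and the nearest-patch alignment inside PromptQueryAdapter are hypothetical stand-ins, and random tensors replace frozen CLIP features so the snippet runs with only torch installed.

```python
# Hypothetical sketch of the adapter layout described in the AdaptCLIP abstract:
# a frozen CLIP backbone (simulated here by random feature tensors) plus three
# lightweight adapters. All module designs and dimensions are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MLPAdapter(nn.Module):
    """Small bottleneck MLP used here for both the visual and textual adapters."""

    def __init__(self, dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual adaptation: the frozen CLIP representation is only corrected, not replaced.
        return x + self.net(x)


class PromptQueryAdapter(nn.Module):
    """Compares query patch features against a normal image prompt.

    Following the abstract, the comparison fuses contextual query features with
    aligned residual features (query minus its nearest normal patch), rather
    than using the residual alone.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(), nn.Linear(dim, 1))

    def forward(self, query: torch.Tensor, prompt: torch.Tensor) -> torch.Tensor:
        # query:  (B, N, D) patch features of the test image
        # prompt: (B, M, D) patch features of the normal image prompt
        q = F.normalize(query, dim=-1)
        p = F.normalize(prompt, dim=-1)
        sim = q @ p.transpose(1, 2)                      # (B, N, M) cosine similarities
        batch_idx = torch.arange(prompt.size(0)).unsqueeze(1)
        aligned = prompt[batch_idx, sim.argmax(-1)]      # nearest normal patch per query patch
        residual = query - aligned                       # aligned residual features
        fused = torch.cat([query, residual], dim=-1)     # contextual + residual features
        return self.head(fused).squeeze(-1)              # (B, N) patch-level anomaly scores


if __name__ == "__main__":
    B, N, M, D = 2, 196, 196, 512               # assumed ViT patch count and CLIP width
    query_feats = torch.randn(B, N, D)          # stand-in for frozen CLIP visual features
    prompt_feats = torch.randn(B, M, D)         # stand-in for the normal image prompt features
    text_feats = torch.randn(B, D)              # stand-in for CLIP text embeddings

    visual_adapter = MLPAdapter(D)
    textual_adapter = MLPAdapter(D)
    pq_adapter = PromptQueryAdapter(D)

    scores = pq_adapter(visual_adapter(query_feats), visual_adapter(prompt_feats))
    print(scores.shape)                         # torch.Size([2, 196])
    print(textual_adapter(text_feats).shape)    # torch.Size([2, 512])
```

The residual form of MLPAdapter mirrors the abstract's framing of CLIP as a foundational service: the adapters sit only at the input or output ends and add a learned correction on top of the pre-trained representation, leaving the backbone untouched.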