AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Abstract

Zero-shot anomaly detection aims to detect and localise abnormal regions in an image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on pure vision foundation models (VFMs), such as DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both issues, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism that uses low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. Specifically, with RADIO as the backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
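For readers unfamiliar with the headline metric, the following is a minimal sketch of how an average image-level AUROC over several datasets could be computed. It is an illustration only, not the authors' evaluation code: the function names, the dataset names, and the toy scores are all hypothetical, and the pairwise formulation is chosen for clarity rather than speed.

```python
from typing import Dict, List, Sequence, Tuple


def image_level_auroc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random anomalous image (label 1) receives a higher
    anomaly score than a random normal image (label 0); ties count as half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one normal and one anomalous image")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def mean_auroc(per_dataset: Dict[str, Tuple[List[int], List[float]]]) -> float:
    """Unweighted mean of the per-dataset image-level AUROC values."""
    values = [image_level_auroc(y, s) for y, s in per_dataset.values()]
    return sum(values) / len(values)


# Hypothetical toy data: (labels, anomaly scores) per dataset.
toy = {
    "dataset_a": ([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]),
    "dataset_b": ([0, 1, 1, 0], [0.2, 0.9, 0.7, 0.3]),
}
print(f"mean image-level AUROC: {100 * mean_auroc(toy):.1f}%")
```

The sketch assumes each dataset contributes equally to the average, which is one common reading of "average AUROC across datasets"; the abstract does not state the weighting.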
