Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
March 9, 2026
Authors: Yehonatan Elisha, Oren Barkan, Noam Koenigstein
cs.AI
Abstract
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel fine-tuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method and then segmented using a VLM. The fine-tuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
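The core fine-tuning objective described above (rewarding relevance mass inside concept regions while suppressing it on spurious background) could be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function name `concept_alignment_loss`, the normalization scheme, and the weighting term `lam` are all assumptions for the sake of the example.

```python
import torch


def concept_alignment_loss(relevance, concept_masks, background_mask, lam=1.0):
    """Hypothetical sketch of a concept-guided alignment objective.

    relevance:       (H, W) non-negative relevance map from the model
    concept_masks:   (K, H, W) binary masks, one per spatially grounded concept
    background_mask: (H, W) binary mask of the spurious background region
    lam:             weight of the background-suppression term (assumed)
    """
    # Normalize the relevance map into a spatial distribution.
    rel = relevance / (relevance.sum() + 1e-8)

    # Union of all concept regions (e.g., "long beak" and "wings" masks).
    concept_union = concept_masks.amax(dim=0)

    # Reward relevance mass falling inside concept regions ...
    inside = (rel * concept_union).sum()
    # ... and penalize relevance mass on the background.
    outside = (rel * background_mask).sum()

    return -inside + lam * outside
```

Under this sketch, a relevance map concentrated on concept regions yields a lower loss than one concentrated on the background, which is the behavior the fine-tuning objective is meant to induce.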