Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness
March 9, 2026
Authors: Yehonatan Elisha, Oren Barkan, Noam Koenigstein
cs.AI
Abstract
Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel fine-tuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The fine-tuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a minimal set of images and uses half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps exhibit stronger alignment with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
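The abstract describes an objective that concentrates the model's relevance on concept regions while suppressing it on spurious background. The following is a minimal sketch of one plausible form of such a loss; the function name, the particular normalization, and the additive penalty structure are illustrative assumptions, not the paper's actual formulation:

```python
import numpy as np

def concept_alignment_loss(relevance, concept_mask, background_mask, lam=1.0):
    """Hypothetical sketch of a concept-guided alignment objective.

    relevance:       (H, W) non-negative relevance map from the ViT
    concept_mask:    (H, W) binary mask of LLM-proposed, VLM-segmented concepts
    background_mask: (H, W) binary mask of spurious background regions
    lam:             weight of the background-suppression term (an assumption)
    """
    # Normalize relevance to a distribution over spatial locations.
    r = relevance / (relevance.sum() + 1e-8)
    # Reward relevance mass that falls inside concept regions...
    inside = (r * concept_mask).sum()
    # ...and penalize mass that falls on the background.
    outside = (r * background_mask).sum()
    return -inside + lam * outside

# Toy example: a 4x4 relevance map fully concentrated on the concept region.
rel = np.zeros((4, 4)); rel[1:3, 1:3] = 1.0
concept = np.zeros((4, 4)); concept[1:3, 1:3] = 1.0
background = 1.0 - concept
print(concept_alignment_loss(rel, concept, background))  # -1.0: all mass on concepts
```

In this toy case the loss reaches its minimum of -1.0 because every unit of relevance lies inside the concept mask; relevance leaking onto the background would raise the loss through the second term.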