Concept-Guided Fine-Tuning: Steering ViTs away from Spurious Correlations to Improve Robustness

Abstract

Vision Transformers (ViTs) often degrade under distribution shifts because they rely on spurious correlations, such as background cues, rather than semantically meaningful features. Existing regularization methods typically rely on simple foreground-background masks, which fail to capture the fine-grained semantic concepts that define an object (e.g., "long beak" and "wings" for a "bird"). As a result, these methods provide limited robustness to distribution shifts. To address this limitation, we introduce a novel fine-tuning framework that steers model reasoning toward concept-level semantics. Our approach optimizes the model's internal relevance maps to align with spatially grounded concept masks. These masks are generated automatically, without manual annotation: class-relevant concepts are first proposed using an LLM-based, label-free method, and then segmented using a VLM. The fine-tuning objective aligns relevance with these concept regions while simultaneously suppressing focus on spurious background areas. Notably, this process requires only a small set of images and uses only half of the dataset classes. Extensive experiments on five out-of-distribution benchmarks demonstrate that our method improves robustness across multiple ViT-based models. Furthermore, we show that the resulting relevance maps align more strongly with semantic object parts, offering a scalable path toward more robust and interpretable vision models. Finally, we confirm that concept-guided masks provide more effective supervision for model robustness than conventional segmentation maps, supporting our central hypothesis.
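The abstract does not give the fine-tuning objective in closed form. As a rough illustration only, the idea of rewarding relevance inside concept regions while penalizing relevance on background can be sketched as a two-term penalty. Everything below is an assumption made for illustration and not the authors' loss: the function name `concept_alignment_loss`, the squared-error form, the weights `lambda_concept` and `lambda_background`, and the representation of the relevance map as a per-patch grid of scores in [0, 1].

```python
from typing import Sequence

Grid = Sequence[Sequence[float]]


def concept_alignment_loss(
    relevance: Grid,
    concept_mask: Grid,
    lambda_concept: float = 1.0,
    lambda_background: float = 1.0,
) -> float:
    """Hypothetical relevance-alignment penalty (NOT the paper's exact loss).

    relevance: per-patch relevance scores, assumed to lie in [0, 1].
    concept_mask: 1.0 where a patch belongs to any concept region, else 0.0.
    """
    concept_term = 0.0     # grows when relevance is low inside concept regions
    background_term = 0.0  # grows when relevance is high outside them
    n_concept = 0
    n_background = 0
    for rel_row, mask_row in zip(relevance, concept_mask):
        for r, m in zip(rel_row, mask_row):
            if m > 0.5:
                concept_term += (1.0 - r) ** 2
                n_concept += 1
            else:
                background_term += r ** 2
                n_background += 1
    # Average each term over its own region so the two stay comparable
    # even when concept regions cover only a few patches.
    concept_term /= max(n_concept, 1)
    background_term /= max(n_background, 1)
    return lambda_concept * concept_term + lambda_background * background_term


# Toy 2x2 patch grid: the left column is a concept region, the right is background.
toy_mask = [[1.0, 0.0], [1.0, 0.0]]
toy_relevance = [[0.9, 0.1], [0.8, 0.2]]
toy_loss = concept_alignment_loss(toy_relevance, toy_mask)
```

In a real fine-tuning loop a term like this would presumably be added to the usual classification loss and computed on differentiable tensors rather than Python lists; the list-based version here is only meant to make the two competing terms explicit.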
