Sparse Autoencoders Enable Robust and Interpretable Fine-Tuning of CLIP Models

Abstract

Large-scale pre-trained vision-language models such as CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades their robustness to distribution shifts. Recent approaches have attempted to mitigate this trade-off, but they often rely on computationally expensive text guidance. We propose SAE-FT, a novel method for robust fine-tuning that operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, and it matches or exceeds state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.
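
To make the idea of penalizing added and removed features concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a toy ReLU sparse-autoencoder encoder and an L1-style penalty on the difference between the feature activations of the pre-trained and fine-tuned visual embeddings. The function names (`sae_encode`, `feature_change_penalty`), the weights `lam_add` and `lam_remove`, and the exact form of the penalty are all hypothetical; the released code at the URL above is the reference for the actual method.

```python
import numpy as np


def sae_encode(x, w_enc, b_enc):
    """Toy SAE encoder (assumed form): one non-negative activation per feature."""
    return np.maximum(x @ w_enc + b_enc, 0.0)


def feature_change_penalty(z_pre, z_ft, w_enc, b_enc, lam_add=1.0, lam_remove=1.0):
    """Hypothetical regularizer on SAE feature changes between two embeddings.

    z_pre: embeddings from the frozen pre-trained encoder, shape (batch, dim)
    z_ft:  embeddings from the encoder being fine-tuned, shape (batch, dim)
    """
    f_pre = sae_encode(z_pre, w_enc, b_enc)
    f_ft = sae_encode(z_ft, w_enc, b_enc)
    # Activation that grew during fine-tuning counts as an "added" feature.
    added = np.maximum(f_ft - f_pre, 0.0)
    # Activation that shrank during fine-tuning counts as a "removed" feature.
    removed = np.maximum(f_pre - f_ft, 0.0)
    per_example = lam_add * added.sum(axis=1) + lam_remove * removed.sum(axis=1)
    return float(per_example.mean())


# Random stand-ins for real CLIP embeddings and trained SAE weights.
rng = np.random.default_rng(0)
dim, n_features, batch = 8, 32, 4
w_enc = rng.normal(size=(dim, n_features))
b_enc = np.zeros(n_features)
z_pre = rng.normal(size=(batch, dim))
z_ft = z_pre + 0.1 * rng.normal(size=(batch, dim))

penalty_same = feature_change_penalty(z_pre, z_pre, w_enc, b_enc)
penalty_shifted = feature_change_penalty(z_pre, z_ft, w_enc, b_enc)
print(penalty_same, penalty_shifted)
```

In a training loop, a term of this kind would be added to the downstream task loss, so that the fine-tuned encoder is discouraged from drifting in the SAE feature space. Because each SAE feature is meant to be semantically meaningful, the `added` and `removed` arrays could also be inspected per feature to see which concepts changed during fine-tuning.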