Sparse Autoencoders Enable Robust and Interpretable Fine-Tuning of CLIP Models

Abstract

Large-scale pre-trained vision-language models such as CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades their robustness to distribution shifts. Recent approaches attempt to mitigate this trade-off, but typically rely on computationally expensive text guidance. We propose SAE-FT, a novel method for robust fine-tuning that operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a Sparse Autoencoder trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, and it matches or exceeds state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.
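
To make the idea of penalizing added and removed features concrete, the following is a minimal NumPy sketch of one way such a regularizer could look. It is not the authors' implementation: the abstract does not specify the loss, so the ReLU encoder form, the function names `sae_encode` and `feature_change_penalty`, the activity threshold `eps`, and the L1-style aggregation are all assumptions made for illustration.

```python
import numpy as np

def sae_encode(x, W_enc, b_enc):
    """Encoder of a hypothetical sparse autoencoder: relu(x @ W_enc + b_enc)."""
    return np.maximum(x @ W_enc + b_enc, 0.0)

def feature_change_penalty(z_pre, z_ft, W_enc, b_enc, eps=1e-6):
    """Illustrative penalty on SAE features gained or lost during fine-tuning.

    z_pre: visual embeddings from the frozen pre-trained model, shape (batch, d)
    z_ft:  embeddings of the same images from the fine-tuned model, shape (batch, d)
    The SAE weights are assumed to be frozen and trained on the pre-trained model.
    """
    f_pre = sae_encode(z_pre, W_enc, b_enc)
    f_ft = sae_encode(z_ft, W_enc, b_enc)
    was_active = f_pre > eps
    # "Added": activation appearing on features that were off before fine-tuning.
    added = np.where(~was_active, f_ft, 0.0)
    # "Removed": activation lost on features that were on before fine-tuning.
    removed = np.where(was_active, np.maximum(f_pre - f_ft, 0.0), 0.0)
    # Sum over features, average over the batch (an assumed aggregation).
    return float((added.sum(axis=1) + removed.sum(axis=1)).mean())
```

Under these assumptions, the penalty would be added to the task loss with a weighting coefficient, and the per-feature `added` and `removed` terms could be inspected to see which semantic features a fine-tuning run changes.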