Sparse Autoencoders enable Robust and Interpretable Fine-tuning of CLIP models
May 15, 2026
Authors: Fabian Morelli, Arnas Uselis, Ankit Sonthalia, Seong Joon Oh
cs.AI
Abstract
Large-scale pre-trained vision-language models like CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades robustness to distribution shifts. Recent approaches have attempted to mitigate this trade-off, but often rely on computationally expensive text guidance. We propose a novel method for robust fine-tuning, SAE-FT, which operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a sparse autoencoder (SAE) trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, matching or exceeding state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at: https://github.com/Fabian-Mor/sae-ft.
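The abstract does not spell out the regularizer, so the following is only an illustrative sketch of what "penalizing the addition and removal of features" could look like, not the authors' implementation. It assumes a simple ReLU encoder for the sparse autoencoder and treats a feature as "added" when it is active only for the fine-tuned representation and "removed" when it is active only for the pre-trained one. The names `sae_encode`, `feature_change_penalty`, and `threshold` are hypothetical.

```python
from typing import List

Vector = List[float]
Matrix = List[Vector]


def sae_encode(x: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Assumed SAE encoder: ReLU(W_enc @ x + b_enc), one activation per feature."""
    return [
        max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def feature_change_penalty(
    z_pre: Vector,
    z_ft: Vector,
    w_enc: Matrix,
    b_enc: Vector,
    threshold: float = 0.0,
) -> float:
    """Hypothetical penalty on features that switch on or off during fine-tuning.

    z_pre: visual representation from the frozen pre-trained model.
    z_ft:  visual representation of the same image from the fine-tuned model.
    """
    f_pre = sae_encode(z_pre, w_enc, b_enc)
    f_ft = sae_encode(z_ft, w_enc, b_enc)

    # "Added": inactive before fine-tuning, active afterwards.
    added = sum(a_ft for a_pre, a_ft in zip(f_pre, f_ft)
                if a_pre <= threshold < a_ft)
    # "Removed": active before fine-tuning, inactive afterwards.
    removed = sum(a_pre for a_pre, a_ft in zip(f_pre, f_ft)
                  if a_ft <= threshold < a_pre)
    return added + removed


# Toy example with an identity encoder over three features.
w_enc = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
b_enc = [0.0, 0.0, 0.0]
z_pre = [1.0, 0.0, 2.0]   # features 0 and 2 active
z_ft = [1.5, 0.5, -1.0]   # feature 1 added, feature 2 removed
penalty = feature_change_penalty(z_pre, z_ft, w_enc, b_enc)
```

Under these assumptions, such a term would be added to the task loss with a weighting coefficient, and the indices of the added and removed features could be inspected directly, which is the kind of semantic analysis the abstract alludes to.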