Sparse Autoencoders Enable Robust and Interpretable Fine-Tuning of CLIP Models

Abstract

Large-scale pre-trained vision-language models such as CLIP demonstrate remarkable zero-shot performance across diverse tasks. However, fine-tuning these models to improve downstream performance often degrades their robustness to distribution shifts. Recent approaches attempt to mitigate this trade-off, but often rely on computationally expensive text guidance. We propose SAE-FT, a novel method for robust fine-tuning that operates only on the model's visual representations. SAE-FT regularizes changes to these representations by penalizing the addition and removal of semantically meaningful features identified by a sparse autoencoder (SAE) trained on the pre-trained model. This constraint prevents catastrophic forgetting and makes the fine-tuning process interpretable, enabling direct analysis of semantic changes. SAE-FT is both mechanistically transparent and computationally efficient, and it matches or exceeds state-of-the-art performance on ImageNet and its associated distribution shift benchmarks. Code is publicly available at https://github.com/Fabian-Mor/sae-ft.
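
The abstract does not give the exact form of the regularizer, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a one-layer ReLU encoder for the SAE and an L1-style penalty that separately counts activation gained ("added") and lost ("removed") between the pre-trained and fine-tuned visual embeddings. The names `sae_encode`, `feature_change_penalty`, `add_weight`, and `remove_weight` are hypothetical and chosen for illustration.

```python
from typing import List, Sequence

Vector = Sequence[float]


def sae_encode(x: Vector, weights: Sequence[Vector], bias: Vector) -> List[float]:
    """Assumed SAE encoder: ReLU(W x + b), giving non-negative feature activations."""
    return [
        max(0.0, sum(w * v for w, v in zip(row, x)) + b)
        for row, b in zip(weights, bias)
    ]


def feature_change_penalty(
    z_pre: Vector,
    z_ft: Vector,
    weights: Sequence[Vector],
    bias: Vector,
    add_weight: float = 1.0,      # hypothetical coefficient for added features
    remove_weight: float = 1.0,   # hypothetical coefficient for removed features
) -> float:
    """Penalize SAE feature activation that the fine-tuned embedding gains or loses.

    z_pre: visual embedding from the frozen pre-trained model.
    z_ft:  visual embedding of the same image from the model being fine-tuned.
    """
    f_pre = sae_encode(z_pre, weights, bias)
    f_ft = sae_encode(z_ft, weights, bias)
    # Activation present after fine-tuning but not before.
    added = sum(max(0.0, after - before) for before, after in zip(f_pre, f_ft))
    # Activation present before fine-tuning but lost afterwards.
    removed = sum(max(0.0, before - after) for before, after in zip(f_pre, f_ft))
    return add_weight * added + remove_weight * removed


# Toy example: a 2-feature SAE whose encoder is the identity map.
W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
penalty = feature_change_penalty([1.0, 0.0], [0.5, 2.0], W, b)
print(penalty)  # 2.5: feature 1 gained 2.0, feature 0 lost 0.5
```

In a training loop under these assumptions, such a term would be added to the downstream task loss, and the per-feature `added` and `removed` contributions could be inspected to see which semantic features fine-tuning changes.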