Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models
February 10, 2025
Authors: Samuel Stevens, Wei-Lun Chao, Tanya Berger-Wolf, Yu Su
cs.AI
Abstract
To truly understand vision models, we must not only interpret their learned
features but also validate these interpretations through controlled
experiments. Current approaches either provide interpretable features without
the ability to test their causal influence, or enable model editing without
interpretable controls. We present a unified framework using sparse
autoencoders (SAEs) that bridges this gap, allowing us to discover
human-interpretable visual features and precisely manipulate them to test
hypotheses about model behavior. By applying our method to state-of-the-art
vision models, we reveal key differences in the semantic abstractions learned
by models with different pre-training objectives. We then demonstrate the
practical usage of our framework through controlled interventions across
multiple vision tasks. We show that SAEs can reliably identify and manipulate
interpretable visual features without model re-training, providing a powerful
tool for understanding and controlling vision model behavior. We provide code,
demos and models on our project website: https://osu-nlp-group.github.io/SAE-V.
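As a concrete illustration of the workflow the abstract describes, the following is a minimal sketch (not the authors' released code) of a sparse autoencoder over frozen vision-model activations, plus a feature-level intervention that scales a single learned feature before reconstructing the activation. All names, dimensions, and the ReLU/L1 sparsity choice are illustrative assumptions.

```python
import torch
import torch.nn as nn


class SparseAutoencoder(nn.Module):
    """Minimal SAE over activations from a frozen vision model.

    Dimensions and details are illustrative assumptions, not the
    paper's actual architecture or training recipe.
    """

    def __init__(self, d_model: int, d_sae: int):
        super().__init__()
        self.encoder = nn.Linear(d_model, d_sae)
        self.decoder = nn.Linear(d_sae, d_model)

    def encode(self, acts: torch.Tensor) -> torch.Tensor:
        # ReLU keeps feature activations non-negative; sparsity is
        # typically encouraged with an L1 penalty during training.
        return torch.relu(self.encoder(acts))

    def decode(self, feats: torch.Tensor) -> torch.Tensor:
        return self.decoder(feats)

    def forward(self, acts: torch.Tensor) -> torch.Tensor:
        return self.decode(self.encode(acts))


@torch.no_grad()
def intervene(sae: SparseAutoencoder, acts: torch.Tensor,
              feature_idx: int, scale: float = 0.0) -> torch.Tensor:
    """Scale one interpretable feature (0.0 ablates it), then reconstruct
    the activation that is fed back into the unmodified vision model."""
    feats = sae.encode(acts).clone()
    feats[..., feature_idx] *= scale
    return sae.decode(feats)


# Hypothetical usage: ablate feature 123 in a batch of ViT patch activations.
sae = SparseAutoencoder(d_model=768, d_sae=16384)
edited_acts = intervene(sae, torch.randn(8, 196, 768), feature_idx=123, scale=0.0)
```

Because the intervention happens in the SAE's feature space and the decoder maps back to the model's activation space, hypotheses about a feature's causal influence can be tested without retraining the vision model, which is the controlled-intervention workflow the abstract summarizes.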