Sparse Autoencoders for Scientifically Rigorous Interpretation of Vision Models

Abstract

To truly understand vision models, we must not only interpret their learned features but also validate these interpretations through controlled experiments. Current approaches either provide interpretable features without the ability to test their causal influence, or enable model editing without interpretable controls. We present a unified framework using sparse autoencoders (SAEs) that bridges this gap, allowing us to discover human-interpretable visual features and precisely manipulate them to test hypotheses about model behavior. By applying our method to state-of-the-art vision models, we reveal key differences in the semantic abstractions learned by models with different pre-training objectives. We then demonstrate the practical use of our framework through controlled interventions across multiple vision tasks. We show that SAEs can reliably identify and manipulate interpretable visual features without model re-training, providing a powerful tool for understanding and controlling vision model behavior. We provide code, demos, and models on our project website: https://osu-nlp-group.github.io/SAE-V.
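
The "discover, then manipulate" idea can be illustrated with a toy sketch. This is not the authors' implementation: the class `SparseAutoencoder`, the helper `intervene`, the ReLU encoder form, the random (untrained) weights, and the choice to add the reconstruction error back after editing are all assumptions made for illustration. In practice the SAE would be trained on activations from a frozen vision model.

```python
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


class SparseAutoencoder:
    """Toy ReLU SAE (hypothetical): f = ReLU(W_enc (x - b_dec) + b_enc)."""

    def __init__(self, d_model, d_sae, seed=0):
        rng = random.Random(seed)
        # Random weights stand in for a trained dictionary (assumption).
        self.w_enc = [[rng.gauss(0, 0.1) for _ in range(d_model)]
                      for _ in range(d_sae)]
        self.b_enc = [0.0] * d_sae
        self.w_dec = [[rng.gauss(0, 0.1) for _ in range(d_sae)]
                      for _ in range(d_model)]
        self.b_dec = [0.0] * d_model

    def encode(self, x):
        """Map an activation vector to non-negative feature activations."""
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        pre = matvec(self.w_enc, centered)
        return [max(0.0, p + b) for p, b in zip(pre, self.b_enc)]

    def decode(self, f):
        """Reconstruct an activation vector from feature activations."""
        return [r + b for r, b in zip(matvec(self.w_dec, f), self.b_dec)]


def intervene(sae, x, feature, value):
    """Set one feature to `value` and return the edited activation.

    The reconstruction error is added back so that only the chosen
    feature's contribution changes (an assumed design choice).
    """
    f = sae.encode(x)
    error = [xi - hi for xi, hi in zip(x, sae.decode(f))]
    f_edit = list(f)
    f_edit[feature] = value
    return [hi + ei for hi, ei in zip(sae.decode(f_edit), error)]


sae = SparseAutoencoder(d_model=4, d_sae=8)
x = [0.5, -1.0, 0.25, 2.0]          # stand-in for a model activation
f = sae.encode(x)
edited = intervene(sae, x, feature=3, value=f[3] + 2.0)
```

Under these assumptions, the edited activation differs from `x` only along the decoder direction of feature 3, and it would be passed to the rest of the frozen model to observe how the prediction changes.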
