Image Reconstruction as a Tool for Feature Analysis

Abstract

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders and rank them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in the reconstructed images, revealing that orthogonal rotations, rather than spatial transformations, control color encoding. Our approach can be applied to any vision encoder and sheds light on the inner structure of its feature space. The code and model weights needed to reproduce the experiments are available on GitHub.
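To make the notion of an orthogonal rotation in feature space concrete, the following minimal sketch applies a planar (Givens) rotation to two small feature vectors. It is not the authors' implementation: the feature values, the dimensionality, the choice of plane and angle, and the helper name `rotate_in_plane` are all illustrative assumptions. The sketch shows only the defining property of such a transformation, namely that it changes feature coordinates while preserving norms and inner products.

```python
import math


def rotate_in_plane(feature, i, j, theta):
    """Apply a Givens rotation by angle theta in the (i, j) coordinate plane.

    All other coordinates are left untouched, so the full transformation
    is an orthogonal matrix acting on the feature vector.
    """
    c, s = math.cos(theta), math.sin(theta)
    out = list(feature)
    out[i] = c * feature[i] - s * feature[j]
    out[j] = s * feature[i] + c * feature[j]
    return out


def norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Two hypothetical 4-dimensional patch features (made-up numbers).
f1 = [0.5, -1.0, 2.0, 0.25]
f2 = [1.5, 0.0, -0.5, 1.0]

# Rotate both features by the same angle in the same plane.
theta = math.pi / 6
r1 = rotate_in_plane(f1, 0, 2, theta)
r2 = rotate_in_plane(f2, 0, 2, theta)

# Coordinates change, but lengths and pairwise inner products do not.
print("norm before/after:", norm(f1), norm(r1))
print("dot before/after: ", dot(f1, f2), dot(r1, r2))
```

In a reconstruction setting, one would pass the rotated features to a trained decoder and inspect the resulting image; that decoder is outside the scope of this sketch.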
