Image Reconstruction as a Tool for Feature Analysis

Abstract

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in the reconstructed images, revealing that orthogonal rotations, rather than spatial transformations, control color encoding. Our approach can be applied to any vision encoder, shedding light on the inner structure of its feature space. The code and model weights needed to reproduce the experiments are available on GitHub.
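To make the phrase "orthogonal rotations rather than spatial transformations" concrete, the following is a minimal sketch of what an orthogonal rotation in feature space looks like. It is an illustration only, not the method from the paper: the toy feature values, the helper names `rotate_features` and `norm`, and the choice of a single-plane (Givens) rotation are all assumptions made here. The abstract does not say how the rotation is obtained or which channels it mixes.

```python
import math


def rotate_features(features, i, j, angle):
    """Rotate every feature vector by `angle` radians in the plane of channels i and j.

    This is a Givens rotation, one simple kind of orthogonal transform.
    It mixes feature channels and leaves token (spatial) positions untouched.
    """
    c, s = math.cos(angle), math.sin(angle)
    rotated = []
    for vec in features:
        out = list(vec)
        out[i] = c * vec[i] - s * vec[j]
        out[j] = s * vec[i] + c * vec[j]
        rotated.append(out)
    return rotated


def norm(vec):
    """Euclidean length of a feature vector."""
    return math.sqrt(sum(x * x for x in vec))


# Hypothetical stand-in for encoder output: 3 patch tokens, 4 channels each.
features = [
    [1.0, 0.0, 2.0, -1.0],
    [0.5, 0.5, 0.0, 3.0],
    [-2.0, 1.0, 1.0, 0.0],
]

# Rotate by 90 degrees in the plane of channels 0 and 1.
rotated = rotate_features(features, 0, 1, math.pi / 2)
```

Two properties are worth noting. The number and order of tokens is unchanged, so nothing moves spatially; and each vector keeps its length, which is the defining property of an orthogonal transform. In a reconstruction setting, the rotated features would then be passed to the image decoder to observe the effect on the output.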
