Image Reconstruction as a Tool for Feature Analysis

Abstract

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. Here, we propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders and rank them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in the reconstructed images, revealing that orthogonal rotations (rather than spatial transformations) control color encoding. Our approach can be applied to any vision encoder and sheds light on the inner structure of its feature space. The code and model weights needed to reproduce the experiments are available on GitHub.
