Image Reconstruction as a Tool for Feature Analysis

Abstract

Vision encoders are increasingly used in modern applications, from vision-only models to multimodal systems such as vision-language models. Despite their remarkable success, it remains unclear how these architectures represent features internally. We propose a novel approach for interpreting vision features via image reconstruction. We compare two related model families, SigLIP and SigLIP2, which differ only in their training objective, and show that encoders pre-trained on image-based tasks retain significantly more image information than those trained on non-image tasks such as contrastive learning. We further apply our method to a range of vision encoders, ranking them by the informativeness of their feature representations. Finally, we demonstrate that manipulating the feature space yields predictable changes in reconstructed images, revealing that orthogonal rotations, rather than spatial transformations, control color encoding. Our approach can be applied to any vision encoder and sheds light on the inner structure of its feature space. The code and model weights needed to reproduce the experiments are available on GitHub.
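
To make the idea of a feature-space manipulation concrete, the following minimal sketch applies an orthogonal matrix to a batch of stand-in feature vectors and checks that norms and pairwise dot products are unchanged. The random features, the dimensions, and the helper names `random_orthogonal` and `rotate_features` are illustrative assumptions, not the paper's code; in the approach described above, the transformed features would then be passed to a reconstruction decoder, which is omitted here.

```python
import numpy as np


def random_orthogonal(dim, seed=0):
    """Draw a random orthogonal matrix via QR decomposition (hypothetical helper)."""
    rng = np.random.default_rng(seed)
    q, r = np.linalg.qr(rng.standard_normal((dim, dim)))
    # Flip column signs so the diagonal of R is positive; Q stays orthogonal.
    return q * np.sign(np.diag(r))


def rotate_features(features, q):
    """Apply the orthogonal map q to each row (one feature vector per row)."""
    return features @ q.T


# Stand-in for encoder output: 16 tokens with 8-dimensional features (assumed sizes).
features = np.random.default_rng(1).standard_normal((16, 8))
q = random_orthogonal(8)
rotated = rotate_features(features, q)

# An orthogonal map preserves vector norms and all pairwise dot products.
norms_preserved = np.allclose(
    np.linalg.norm(rotated, axis=1), np.linalg.norm(features, axis=1)
)
gram_preserved = np.allclose(rotated @ rotated.T, features @ features.T)
print(norms_preserved, gram_preserved)
```

Because such a map leaves the geometry of the feature set intact, any change it produces in a reconstructed image reflects how the decoder reads directions in feature space rather than a loss of information.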
