A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
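
As a rough illustration of the dual objective described above, the following sketch combines an alignment term and a distillation term on plain embedding vectors. The abstract does not specify the exact loss functions or their weighting, so the cosine-based form of both terms, the function names, and the `distill_weight` parameter are all assumptions made for illustration only.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Hypothetical combined loss for one scene (lower is better).

    student_rgb / student_depth: student embeddings of the same scene
    in two modalities. teacher_rgb: frozen teacher embedding of the
    RGB image. distill_weight is an assumed balancing factor.
    """
    # Alignment term: small when the two modalities map to similar embeddings.
    alignment = 1.0 - cosine_similarity(student_rgb, student_depth)
    # Distillation term: small when the student stays close to the frozen teacher.
    distillation = 1.0 - cosine_similarity(student_rgb, teacher_rgb)
    return alignment + distill_weight * distillation
```

In this sketch the loss reaches zero only when the student's RGB and depth embeddings point in the same direction and the RGB embedding also matches the teacher's. In an actual training setup the teacher embedding would be computed without gradients, since the teacher is fully frozen.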
