A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
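The dual objective described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the exact loss functions, the weighting between the two terms, or which student output is matched to the teacher, so everything below is an assumption made for illustration: both terms are written as one minus cosine similarity, the distillation term compares the student's RGB embedding with the teacher's RGB embedding, and the names `dual_objective` and `distill_weight` are hypothetical.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_unit = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_unit = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_unit * b_unit, axis=-1)


def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Illustrative two-term loss (assumed form, not the paper's definition).

    student_rgb, student_depth: student embeddings of the same scenes in two
        modalities, each of shape (batch, dim).
    teacher_rgb: frozen-teacher embeddings of the RGB inputs, shape (batch, dim).
    distill_weight: hypothetical scalar balancing the two terms.
    """
    # Alignment term: pull embeddings of the same scene together across modalities.
    align_loss = np.mean(1.0 - cosine_similarity(student_rgb, student_depth))
    # Distillation term: keep the student close to the frozen teacher's output.
    distill_loss = np.mean(1.0 - cosine_similarity(student_rgb, teacher_rgb))
    total = align_loss + distill_weight * distill_loss
    return total, align_loss, distill_loss
```

In this sketch, the alignment term alone could be minimized by collapsing every input to the same vector; the distillation term is what ties the student to the teacher's feature space, which matches the abstract's description of the teacher as an anchor. In an actual training setup the teacher outputs would be treated as constants so that no gradient flows into the teacher.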
