A Mixed Diet Makes DINO An Omnivorous Vision Encoder

Abstract

Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embeddings of an RGB image and the corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature alignment between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous": it produces a consistent, powerful embedding for a given scene regardless of the input modality (RGB, depth, segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
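As a rough illustration of the dual objective described above, the following is a minimal NumPy sketch, not the authors' implementation. The abstract does not specify the form of either loss, so the cosine-based alignment and distillation terms, the `distill_weight` parameter, and the function names `cosine_similarity` and `dual_objective` are all assumptions made for this example. Random arrays stand in for encoder outputs.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Row-wise cosine similarity between two (batch, dim) arrays."""
    a_n = a / (np.linalg.norm(a, axis=-1, keepdims=True) + eps)
    b_n = b / (np.linalg.norm(b, axis=-1, keepdims=True) + eps)
    return np.sum(a_n * b_n, axis=-1)


def dual_objective(student_rgb, student_depth, teacher_rgb, distill_weight=1.0):
    """Hypothetical loss: cross-modal alignment plus distillation to a frozen teacher.

    All three inputs are (batch, dim) embeddings of the same scenes.
    The loss forms and the weighting are assumptions, not taken from the paper.
    """
    # Alignment term: pull student embeddings of the same scene together
    # across modalities (here RGB vs. depth).
    align = np.mean(1.0 - cosine_similarity(student_rgb, student_depth))
    # Distillation term: keep the student's RGB embedding close to the
    # frozen teacher's embedding of the same RGB image.
    distill = np.mean(1.0 - cosine_similarity(student_rgb, teacher_rgb))
    return align + distill_weight * distill, align, distill


# Toy stand-ins for encoder outputs (4 scenes, 16-dimensional embeddings).
rng = np.random.default_rng(0)
teacher_rgb = rng.normal(size=(4, 16))
student_rgb = teacher_rgb + 0.01 * rng.normal(size=(4, 16))  # close to the teacher
student_depth = rng.normal(size=(4, 16))                     # unaligned modality

total, align, distill = dual_objective(student_rgb, student_depth, teacher_rgb)
print(f"alignment={align:.3f} distillation={distill:.5f} total={total:.3f}")
```

In this toy setup the distillation term is close to zero because the student's RGB embeddings nearly match the teacher's, while the alignment term is large because the depth embeddings are unrelated. This mirrors the misalignment the abstract reports for off-the-shelf encoders.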
