A Mixed Diet Makes DINO An Omnivorous Vision Encoder
February 27, 2026
Authors: Rishabh Kabra, Maks Ovsjanikov, Drew A. Hudson, Ye Xia, Skanda Koppula, Andre Araujo, Joao Carreira, Niloy J. Mitra
cs.AI
Abstract
Pre-trained vision encoders like DINOv2 have demonstrated exceptional performance on unimodal tasks. However, we observe that their feature representations are poorly aligned across different modalities. For instance, the feature embedding for an RGB image and its corresponding depth map of the same scene exhibit a cosine similarity that is nearly identical to that of two random, unrelated images. To address this, we propose the Omnivorous Vision Encoder, a novel framework that learns a modality-agnostic feature space. We train the encoder with a dual objective: first, an alignment objective that maximizes the feature similarity between different modalities of the same scene; and second, a distillation objective that anchors the learned representations to the output of a fully frozen teacher such as DINOv2. The resulting student encoder becomes "omnivorous" by producing a consistent, powerful embedding for a given scene, regardless of the input modality (RGB, Depth, Segmentation, etc.). This approach enables robust cross-modal understanding while retaining the discriminative semantics of the original foundation model.
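The dual objective described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the encoders are stand-in linear maps, the batch pairing is assumed (one depth map per RGB image of the same scene), and the two loss terms are simply summed, since the abstract does not specify a weighting.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode(weights, x):
    """Toy 'encoder': a linear projection followed by L2 normalization.
    Stands in for the student/teacher vision encoders."""
    z = x @ weights
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

def cosine(a, b):
    """Row-wise cosine similarity of already-normalized embeddings."""
    return np.sum(a * b, axis=-1)

# Toy paired inputs: two modalities (e.g. RGB and depth) of the same
# scenes, flattened to feature vectors for illustration.
rgb   = rng.normal(size=(4, 16))   # batch of 4 "RGB" views
depth = rng.normal(size=(4, 16))   # matching "depth" views

W_student = rng.normal(size=(16, 8))
W_teacher = rng.normal(size=(16, 8))  # frozen: never updated

z_rgb   = encode(W_student, rgb)
z_depth = encode(W_student, depth)
z_teach = encode(W_teacher, rgb)      # teacher sees only RGB

# 1) Alignment objective: pull embeddings of the same scene together
#    across modalities (maximize cosine similarity).
loss_align = np.mean(1.0 - cosine(z_rgb, z_depth))

# 2) Distillation objective: anchor the student's embedding to the
#    frozen teacher's output to retain the teacher's semantics.
loss_distill = np.mean(1.0 - cosine(z_rgb, z_teach))

loss = loss_align + loss_distill
```

In practice both terms would drive gradient updates of the student only (the teacher stays frozen), and richer alignment losses such as a contrastive objective over the batch could replace the plain cosine term.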