CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
April 3, 2026
Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art detection performance on RefCOCO, improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
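The fusion steps named in the abstract — entropy-guided multi-layer aggregation with an orthogonality constraint, and RoPE-enhanced cross-attention over heterogeneous token grids — can be sketched roughly as follows. This is a minimal numpy illustration, not the paper's implementation: the specific entropy weighting, the 1D RoPE formulation, the orthogonality penalty, and all function names are assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_guided_aggregate(layer_feats):
    """Mix multi-layer encoder features with entropy-derived weights.

    layer_feats: list of (num_tokens, dim) arrays, one per encoder layer.
    Layers whose token features are more peaked (lower entropy) receive
    larger mixing weights; a learned projection would follow in practice.
    """
    ents = []
    for f in layer_feats:
        p = softmax(f, axis=-1)                       # per-token distribution
        ents.append(-(p * np.log(p + 1e-9)).sum(-1).mean())
    w = softmax(-np.array(ents))                      # low entropy -> high weight
    fused = sum(wi * f for wi, f in zip(w, layer_feats))
    return fused, w

def orthogonality_penalty(W):
    """Regularizer pushing a projection toward orthonormal columns,
    one plausible reading of 'orthogonality-constrained projections'."""
    g = W.T @ W
    return float(((g - np.eye(W.shape[1])) ** 2).sum())

def rope(x, base=10000.0):
    """Rotary position embedding over a 1D token sequence (even dim)."""
    n, d = x.shape
    half = d // 2
    freqs = base ** (-np.arange(half) / half)
    ang = np.arange(n)[:, None] * freqs[None, :]
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, :half], x[:, half:]
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos], axis=-1)

def rope_cross_attention(clip_tokens, dino_tokens):
    """Let one encoder's tokens attend to the other's token grid.

    Both inputs are (tokens, dim); RoPE injects position information so
    the two grids can be aligned even when their token counts differ.
    """
    q, k = rope(clip_tokens), rope(dino_tokens)
    att = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return att @ dino_tokens
```

In this reading, the aggregated CLIP-branch and DINO-branch features would be cross-attended and then compressed into the compact fused visual tokens handed to the decoder-only LLM; the compression step is omitted here.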