CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning
April 3, 2026
Authors: Ankan Deria, Komal Kumar, Xilin He, Imran Razzak, Hisham Cholakkal, Fahad Shahbaz Khan, Salman Khan
cs.AI
Abstract
Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL: Complementary Multi-Encoder Vision-Language, a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion by (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection while improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
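The abstract names two fusion mechanisms but does not specify them in detail. The sketch below is one plausible reading, not the paper's implementation: entropy-guided layer aggregation is interpreted as weighting each encoder layer by the (negated) mean entropy of its token activations, and the orthogonality constraint as a Frobenius-norm penalty on the cross-Gram of the two encoders' projection matrices. All function names and the exact weighting scheme are assumptions for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def entropy_weights(layer_feats):
    """Assumed form of entropy-guided aggregation: score each encoder layer
    by the mean entropy of its per-token channel distribution, so that
    lower-entropy (more peaked) layers receive larger weights.
    layer_feats: list of (num_tokens, dim) arrays, one per layer."""
    scores = []
    for f in layer_feats:
        p = softmax(f, axis=-1)                      # per-token distribution over channels
        h = -(p * np.log(p + 1e-9)).sum(-1).mean()   # mean token entropy for this layer
        scores.append(-h)                            # lower entropy -> higher score
    return softmax(np.array(scores))

def aggregate_layers(layer_feats):
    # Entropy-weighted sum of multi-layer features (all layers share a shape here).
    w = entropy_weights(layer_feats)
    return sum(wi * f for wi, f in zip(w, layer_feats))

def orthogonality_penalty(W_clip, W_dino):
    """Assumed redundancy-reduction term: squared Frobenius norm of the
    cross-Gram between the two projection matrices; zero when the projected
    subspaces are mutually orthogonal."""
    return float(np.sum((W_clip.T @ W_dino) ** 2))
```

A usage sketch: collect hidden states from several layers of each encoder, fuse them with `aggregate_layers`, and add `orthogonality_penalty` (scaled by a small coefficient) to the training loss so the contrastive and self-supervised branches are pushed toward non-redundant features.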