CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Abstract

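Recent vision-language models (VLMs) typically rely on a single vision encoder trained with a contrastive image-text objective, such as CLIP-style pretraining. Contrastive encoders are effective for cross-modal alignment and retrieval, but self-supervised vision encoders often capture richer dense semantics and show stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion through (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention that aligns heterogeneous token grids and produces compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks show that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. Our method achieves state-of-the-art performance on RefCOCO for detection, improving over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.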
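The two ingredients named in step (i) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify how entropy is computed or how it maps to layer weights, so the functions below (`layer_weights_from_entropy`, `orthogonality_penalty`) are hypothetical, and the choice to give higher-entropy layers larger weights is an assumption made purely for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def layer_weights_from_entropy(layer_scores):
    """Hypothetical rule: turn per-layer raw scores into aggregation weights.

    Each layer's scores are normalized with softmax, their entropy is
    measured, and a softmax over the entropies gives the layer weights.
    Assumption: higher entropy means larger weight; the paper may differ.
    """
    entropies = [entropy(softmax(scores)) for scores in layer_scores]
    return softmax(entropies), entropies


def orthogonality_penalty(W):
    """Squared Frobenius norm of W W^T - I for a matrix given as a list of rows.

    A penalty of zero means the rows of W are orthonormal, which is one
    common way to discourage redundant projection directions.
    """
    penalty = 0.0
    for i, row_i in enumerate(W):
        for j, row_j in enumerate(W):
            dot = sum(a * b for a, b in zip(row_i, row_j))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty


# Toy usage: one layer with flat scores, one with a sharply peaked score.
weights, entropies = layer_weights_from_entropy([[0.0, 0.0, 0.0, 0.0],
                                                 [10.0, 0.0, 0.0, 0.0]])
```

In this toy setup the flat layer has the larger entropy and therefore the larger weight, and the penalty vanishes for an identity projection while growing when two projection rows coincide. In a real model the penalty would be added to the training loss and the weights would multiply per-layer token features before fusion.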

