CoME-VL: Scaling Complementary Multi-Encoder Vision-Language Learning

Abstract

Recent vision-language models (VLMs) typically rely on a single vision encoder trained with contrastive image-text objectives, such as CLIP-style pretraining. While contrastive encoders are effective for cross-modal alignment and retrieval, self-supervised visual encoders often capture richer dense semantics and exhibit stronger robustness on recognition and understanding tasks. In this work, we investigate how to scale the fusion of these complementary visual representations for vision-language modeling. We propose CoME-VL (Complementary Multi-Encoder Vision-Language), a modular fusion framework that integrates a contrastively trained vision encoder with a self-supervised DINO encoder. Our approach performs representation-level fusion through (i) entropy-guided multi-layer aggregation with orthogonality-constrained projections to reduce redundancy, and (ii) RoPE-enhanced cross-attention to align heterogeneous token grids and produce compact fused visual tokens. The fused tokens can be injected into a decoder-only LLM with minimal changes to standard VLM pipelines. Extensive experiments across diverse vision-language benchmarks demonstrate that CoME-VL consistently outperforms single-encoder baselines. In particular, we observe an average improvement of 4.9% on visual understanding tasks and 5.4% on grounding tasks. On RefCOCO detection, our method achieves state-of-the-art performance and improves over the baseline by a large margin. Finally, we conduct ablation studies on layer merging, non-redundant feature mixing, and fusion capacity to evaluate how complementary contrastive and self-supervised signals affect VLM performance.
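To make component (i) more concrete, the following is a minimal sketch of one plausible reading of entropy-guided multi-layer aggregation combined with a soft orthogonality penalty. The abstract does not specify the actual formulation, so everything here is an assumption for illustration: the function names (`aggregate_layers`, `orthogonality_penalty`), the use of a per-layer probability vector as the entropy source, the choice to give higher-entropy layers larger weights, and the squared Frobenius form of the penalty are all hypothetical and not taken from the paper.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def aggregate_layers(layer_features, layer_distributions):
    """Fuse per-layer feature vectors with entropy-derived weights.

    ASSUMPTION: each layer comes with a probability vector (for example,
    an attention distribution), and layers with higher entropy receive
    larger weights. The paper may define this differently.
    """
    weights = softmax([entropy(d) for d in layer_distributions])
    dim = len(layer_features[0])
    fused = [
        sum(w * feat[i] for w, feat in zip(weights, layer_features))
        for i in range(dim)
    ]
    return fused, weights


def orthogonality_penalty(rows):
    """Soft orthogonality penalty: squared Frobenius norm of (W W^T - I).

    ASSUMPTION: `rows` are the rows of a projection matrix W; the penalty
    is zero exactly when the rows are orthonormal.
    """
    penalty = 0.0
    for i, r_i in enumerate(rows):
        for j, r_j in enumerate(rows):
            dot = sum(a * b for a, b in zip(r_i, r_j))
            target = 1.0 if i == j else 0.0
            penalty += (dot - target) ** 2
    return penalty
```

In this sketch, two layers with identical entropy receive equal weights, so the fused vector is their plain average, while a projection with duplicated rows is penalized and one with orthonormal rows is not. In a training setup the penalty would typically be added to the task loss with a small coefficient; that coefficient is likewise not stated in the abstract.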
