Unified Vision-Language Modeling via Concept Space Alignment

Abstract

We introduce V-SONAR, a vision-language embedding space that extends the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM team et al., 2024), which operates in the SONAR space and is trained on English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in the LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: it matches state-of-the-art vision-language models on image/video captioning and question answering tasks, while significantly outperforming them on 61 of the 62 tested languages, which range from high- to low-resource.
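The idea of post-hoc alignment, mapping the outputs of a frozen vision encoder into an existing text embedding space, can be illustrated with a small sketch. The abstract does not specify the projector architecture or the training objective, so everything below is an assumption made for illustration: a linear map fit by ridge regression on paired embeddings, random arrays standing in for vision-encoder features and SONAR text embeddings, and hypothetical helper names (`fit_linear_alignment`, `project`, `retrieve`). It is not the pipeline used for V-SONAR.

```python
import numpy as np


def fit_linear_alignment(vision_feats, text_embs, ridge=1e-3):
    """Fit a linear map (with bias) from vision features to text embeddings.

    Solves the ridge-regularized least-squares problem in closed form.
    The ridge term is applied to the bias row as well, for simplicity.
    """
    n, d_in = vision_feats.shape
    X = np.hstack([vision_feats, np.ones((n, 1))])  # append a bias column
    A = X.T @ X + ridge * np.eye(d_in + 1)
    return np.linalg.solve(A, X.T @ text_embs)      # shape: (d_in + 1, d_out)


def project(vision_feats, W):
    """Map vision features into the target (text) embedding space."""
    X = np.hstack([vision_feats, np.ones((vision_feats.shape[0], 1))])
    return X @ W


def retrieve(query_embs, candidate_embs):
    """Return, for each query, the index of the most cosine-similar candidate."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    c = candidate_embs / np.linalg.norm(candidate_embs, axis=1, keepdims=True)
    return np.argmax(q @ c.T, axis=1)


# Synthetic stand-ins (assumption): 200 paired items, 32-d "vision" features,
# 16-d "text" embeddings related by an unknown linear map plus small noise.
rng = np.random.default_rng(0)
vision_feats = rng.normal(size=(200, 32))
true_map = rng.normal(size=(32, 16))
text_embs = vision_feats @ true_map + 0.01 * rng.normal(size=(200, 16))

W = fit_linear_alignment(vision_feats, text_embs)
aligned = project(vision_feats, W)

# Text-to-video retrieval in the shared space: each text embedding should
# retrieve its own paired (projected) vision embedding.
predicted = retrieve(text_embs, aligned)
accuracy = float(np.mean(predicted == np.arange(len(text_embs))))
print(f"retrieval accuracy on synthetic pairs: {accuracy:.2f}")
```

Because only the projector is trained while both the vision encoder and the text space stay fixed, anything that already operates in the text space (a decoder, or a model such as the LCM) can consume the projected vision embeddings without retraining, which is the property the abstract relies on for zero-shot use.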
