

Unified Vision-Language Modeling via Concept Space Alignment

March 1, 2026
作者: Yifu Qiu, Paul-Ambroise Duquenne, Holger Schwenk
cs.AI

Abstract

We introduce V-SONAR, a vision-language embedding space extended from the text-only embedding space SONAR (Omnilingual Embeddings Team et al., 2026), which supports 1500 text languages and 177 speech languages. To construct V-SONAR, we propose a post-hoc alignment pipeline that maps the representations of an existing vision encoder into the SONAR space. We thoroughly evaluate V-SONAR and show that its embeddings achieve competitive performance on text-to-video retrieval. Equipped with the OMNISONAR text decoder, V-SONAR further surpasses state-of-the-art vision-language models on video captioning tasks, including DREAM-1K (BLEU 23.9 vs. 19.6) and PE-VIDEO (BLEU 39.0 vs. 30.0). Leveraging V-SONAR, we first demonstrate that the Large Concept Model (LCM; LCM Team et al., 2024), operating in SONAR and trained with English text only, can perform both single- and multi-visual concept understanding in a zero-shot manner. Finally, we introduce V-LCM, which extends the LCM with vision-language instruction tuning. V-LCM encodes vision and language inputs into a unified sequence of latent embeddings via V-SONAR and SONAR, and it is trained with the same latent diffusion objective for next-embedding prediction as in LCM's text-only pre-training. Experiments on a large-scale multilingual and multimodal instruction-tuning data mixture highlight the potential of V-LCM: V-LCM matches state-of-the-art vision-language models on tasks covering image/video captioning and question answering, while significantly outperforming them on 61 of the 62 tested languages, spanning rich- to low-resource settings.
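As a rough illustration of the post-hoc alignment idea described in the abstract, the sketch below trains a small projection head that maps frozen vision-encoder features onto the SONAR embeddings of paired captions. All names, dimensions, and the cosine-regression loss here are assumptions for illustration only; the abstract does not specify the actual alignment architecture or training objective.

```python
# Hypothetical sketch of a post-hoc vision-to-SONAR alignment head.
# Names, dimensions, and the loss are illustrative assumptions, not the paper's recipe.
import torch
import torch.nn as nn

class VisionToSonarProjector(nn.Module):
    """Maps frozen vision-encoder features into a (frozen) SONAR-like text embedding space."""
    def __init__(self, vision_dim: int = 1024, sonar_dim: int = 1024, hidden_dim: int = 2048):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vision_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, sonar_dim),
        )

    def forward(self, vision_features: torch.Tensor) -> torch.Tensor:
        # vision_features: (batch, vision_dim) pooled outputs of a frozen vision encoder
        return self.proj(vision_features)

def alignment_loss(pred: torch.Tensor, sonar_target: torch.Tensor) -> torch.Tensor:
    # Regress the projected vision embedding onto the caption's SONAR embedding.
    # The paper may use a different objective (e.g. contrastive); this is an assumption.
    return 1.0 - nn.functional.cosine_similarity(pred, sonar_target, dim=-1).mean()

# Toy training step with random tensors standing in for (image, caption) pairs.
projector = VisionToSonarProjector()
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)
vision_feats = torch.randn(8, 1024)   # stand-in for frozen vision-encoder outputs
sonar_embeds = torch.randn(8, 1024)   # stand-in for SONAR embeddings of the paired captions
loss = alignment_loss(projector(vision_feats), sonar_embeds)
loss.backward()
optimizer.step()
```

Under this reading, only the projector is trained while both the vision encoder and the SONAR space stay fixed, which is what makes the alignment "post-hoc"; the aligned embeddings can then be consumed by downstream components such as the OMNISONAR decoder or the LCM.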