

Scaling Language-Centric Omnimodal Representation Learning

October 13, 2025
Authors: Chenghao Xiao, Hou Pong Chan, Hao Zhang, Weiwen Xu, Mahani Aljunied, Yu Rong
cs.AI

Abstract

Recent multimodal embedding approaches leveraging multimodal large language models (MLLMs) fine-tuned with contrastive learning (CL) have shown promising results, yet the underlying reasons behind their superiority remain underexplored. This work argues that a crucial advantage of MLLM-based approaches stems from implicit cross-modal alignment achieved during generative pretraining, where the language decoder learns to exploit multimodal signals within a shared representation space for generating unimodal outputs. Through analysis of anisotropy and kernel similarity structure, we empirically confirm that latent alignment emerges within MLLM representations, allowing CL to serve as a lightweight refinement stage. Leveraging this insight, we propose a Language-Centric Omnimodal Embedding framework, termed LCO-Emb. Extensive experiments across diverse backbones and benchmarks demonstrate its effectiveness, achieving state-of-the-art performance across modalities. Furthermore, we identify a Generation-Representation Scaling Law (GRSL), showing that the representational capabilities gained through contrastive refinement scale positively with the MLLM's generative capabilities. This suggests that improving generative abilities is emerging as an effective paradigm for enhancing representation quality. We provide a theoretical explanation of the GRSL, which formally links the MLLM's generative quality to the upper bound on its representation performance, and validate it on a challenging, low-resource visual-document retrieval task, showing that continual generative pretraining before CL can further enhance the potential of a model's embedding capabilities. Code, models, and resources are available at https://github.com/LCO-Embedding/LCO-Embedding.
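To make the analysis the abstract refers to more concrete, the sketch below shows the kind of diagnostics it mentions: representation anisotropy (mean pairwise cosine similarity of embeddings) and cross-modal kernel similarity, here measured with linear CKA. This is not the authors' code; the use of linear CKA, the function names, and the synthetic paired data are illustrative assumptions only.

```python
# Minimal sketch (assumptions, not the paper's implementation) of two
# representation diagnostics: anisotropy and cross-modal kernel similarity.
import numpy as np


def anisotropy(embeddings: np.ndarray) -> float:
    """Mean pairwise cosine similarity; values near 1 indicate a highly
    anisotropic (narrow-cone) embedding space."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    n = sims.shape[0]
    # Exclude the diagonal (self-similarity) from the average.
    return float((sims.sum() - n) / (n * (n - 1)))


def linear_cka(x: np.ndarray, y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two sets of paired
    representations (e.g., image-side vs. text-side MLLM hidden states)."""
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(x.T @ y, "fro") ** 2
    return float(cross / (np.linalg.norm(x.T @ x, "fro") * np.linalg.norm(y.T @ y, "fro")))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Synthetic stand-ins for paired image/text representations from an MLLM.
    image_reps = rng.normal(size=(256, 768))
    text_reps = image_reps + 0.5 * rng.normal(size=(256, 768))  # partially aligned
    print("anisotropy(text):", anisotropy(text_reps))
    print("linear CKA(image, text):", linear_cka(image_reps, text_reps))
```

In the paper's framing, observing substantial cross-modal kernel similarity before any contrastive training is what allows CL to act as a lightweight refinement stage rather than having to learn alignment from scratch.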