mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
February 12, 2025
Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
cs.AI
Abstract
Multimodal embedding models have gained significant attention for their
ability to map data from different modalities, such as text and images, into a
unified representation space. However, the scarcity of labeled multimodal data
often hinders embedding performance. Recent approaches have leveraged data
synthesis to address this problem, yet the quality of synthetic data remains a
critical bottleneck. In this work, we identify three criteria for high-quality
synthetic multimodal data. First, broad scope ensures that the generated data
covers diverse tasks and modalities, making it applicable to various downstream
scenarios. Second, robust cross-modal alignment makes different modalities
semantically consistent. Third, high fidelity ensures that the synthetic data
maintains realistic details to enhance its reliability. Guided by these
principles, we synthesize datasets that: (1) cover a wide range of tasks,
modality combinations, and languages, (2) are generated via a deep thinking
process within a single pass of a multimodal large language model, and (3)
incorporate real-world images with accurate and relevant texts, ensuring
fidelity through self-evaluation and refinement. Leveraging these high-quality
synthetic and labeled datasets, we train a multimodal multilingual E5 model
mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art
performance on the MMEB Benchmark and superior multilingual performance on the
XTD benchmark. Our code, datasets, and models are released at
https://github.com/haon-chen/mmE5.
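The abstract describes mapping text and images into a unified representation space, where cross-modal retrieval reduces to comparing embedding vectors by similarity. A minimal sketch of that idea, using cosine similarity over toy hand-written vectors (not the actual mmE5 embeddings or API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings in a shared space:
# a well-aligned image/text pair should score higher than a
# mismatched pair.
img_emb = [0.9, 0.1, 0.2]
txt_match = [0.85, 0.15, 0.25]
txt_mismatch = [0.1, 0.9, 0.3]

score_match = cosine_similarity(img_emb, txt_match)
score_mismatch = cosine_similarity(img_emb, txt_mismatch)
```

In a real pipeline the vectors would come from the released model's image and text encoders; the "robust cross-modal alignment" criterion in the abstract corresponds to matched pairs scoring consistently higher than mismatched ones under exactly this kind of similarity comparison.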