mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

February 12, 2025
Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
cs.AI

Abstract

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited availability of labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages; (2) are generated via a deep thinking process within a single pass of a multimodal large language model; and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model, mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.
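The core idea the abstract relies on is a unified representation space: text and images are embedded into the same vector space, so cross-modal relevance reduces to vector similarity. As a minimal sketch of that idea, the snippet below uses an off-the-shelf CLIP model from Hugging Face Transformers as a stand-in (not mmE5 itself; the released mmE5 checkpoints and their loading code are in the linked repository, and the image path here is a hypothetical placeholder) to embed one image and two candidate texts into a shared space and rank the texts by cosine similarity.

```python
# Sketch of a unified text-image embedding space, using CLIP as a stand-in
# for a multimodal embedding model. mmE5's own interface may differ.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical local image path
texts = ["a photo of a cat", "a photo of a dog"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Both modalities land in the same space; L2-normalize before comparing.
img_emb = outputs.image_embeds / outputs.image_embeds.norm(dim=-1, keepdim=True)
txt_emb = outputs.text_embeds / outputs.text_embeds.norm(dim=-1, keepdim=True)

# Cosine similarity between the image and each candidate text.
print(img_emb @ txt_emb.T)
```

Retrieval over a corpus then amounts to nearest-neighbor search in this shared space, which is the kind of setting benchmarks such as MMEB evaluate.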
