mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data
February 12, 2025
Authors: Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou
cs.AI
Abstract
Multimodal embedding models have gained significant attention for their
ability to map data from different modalities, such as text and images, into a
unified representation space. However, the scarcity of labeled multimodal data
often hinders embedding performance. Recent approaches have leveraged data
synthesis to address this problem, yet the quality of synthetic data remains a
critical bottleneck. In this work, we identify three criteria for high-quality
synthetic multimodal data. First, broad scope ensures that the generated data
covers diverse tasks and modalities, making it applicable to various downstream
scenarios. Second, robust cross-modal alignment makes different modalities
semantically consistent. Third, high fidelity ensures that the synthetic data
maintains realistic details to enhance its reliability. Guided by these
principles, we synthesize datasets that: (1) cover a wide range of tasks,
modality combinations, and languages, (2) are generated via a deep thinking
process within a single pass of a multimodal large language model, and (3)
incorporate real-world images with accurate and relevant texts, ensuring
fidelity through self-evaluation and refinement. Leveraging these high-quality
synthetic and labeled datasets, we train a multimodal multilingual E5 model
mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art
performance on the MMEB Benchmark and superior multilingual performance on the
XTD benchmark. Our code, datasets, and models are released at
https://github.com/haon-chen/mmE5.
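The abstract describes mapping text and images into a unified representation space, where cross-modal retrieval reduces to comparing embedding vectors by similarity. A minimal sketch of that idea, using cosine similarity over toy hand-written vectors (not the actual mmE5 embeddings or API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for embeddings in a shared space:
# a well-aligned image/text pair should score higher than a
# mismatched pair.
img_emb = [0.9, 0.1, 0.2]
txt_match = [0.85, 0.15, 0.25]
txt_mismatch = [0.1, 0.9, 0.3]

score_match = cosine_similarity(img_emb, txt_match)
score_mismatch = cosine_similarity(img_emb, txt_mismatch)
```

In a real pipeline the vectors would come from the released model's image and text encoders; the "robust cross-modal alignment" criterion in the abstract corresponds to matched pairs scoring consistently higher than mismatched ones under exactly this kind of similarity comparison.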