SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
October 14, 2025
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
cs.AI
Abstract
Multimodal embedding models aim to yield informative unified representations
that empower diverse cross-modal tasks. Despite promising developments in the
evolution from CLIP-based dual-tower architectures to large vision-language
models, prior works still face persistent challenges in real-world
applications and business scenarios, such as limited modality support,
unstable training mechanisms, and industrial domain gaps. In this work, we
introduce SAIL-Embedding, an omni-modal embedding foundation model that
addresses these issues through tailored training strategies and architectural
design. In the optimization procedure, we propose a multi-stage training scheme
to boost the multifaceted effectiveness of representation learning.
Specifically, the content-aware progressive training aims to enhance the
model's adaptability to diverse downstream tasks and to master rich
cross-modal capabilities. The collaboration-aware recommendation enhancement
training further adapts multimodal representations for recommendation scenarios
by distilling knowledge from sequence-to-item and ID-to-item embeddings while
mining users' historical interests. Concurrently, we develop stochastic
specialization and dataset-driven pattern matching to strengthen model training
flexibility and generalizability. Experimental results show that SAIL-Embedding
achieves state-of-the-art (SOTA) performance across diverse retrieval tasks.
In online experiments across various real-world scenarios integrated
with our model, we observe a significant increase in Lifetime (LT), which is a
crucial indicator for the recommendation experience. For instance, the model
delivers a 7-day LT gain of +0.158% and a 14-day LT gain of +0.144% in the
Douyin-Selected scenario. For the Douyin feed rank model, the match features
produced by SAIL-Embedding yield a +0.08% AUC gain.
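The abstract names two loss families without giving formulas: in-batch contrastive training for cross-modal retrieval, and distillation of knowledge from sequence-to-item and ID-to-item embeddings. The sketch below illustrates both under assumed formulations — an InfoNCE objective with in-batch negatives and a cosine-alignment distillation term. The function names, temperature value, and exact objectives are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(query_emb, item_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE-style) loss, an assumed formulation.

    Row i of query_emb and row i of item_emb are a matched cross-modal
    pair; all other rows in the batch serve as negatives.
    """
    q = l2_normalize(query_emb)
    k = l2_normalize(item_emb)
    logits = q @ k.T / temperature                      # (B, B) similarities
    # log-softmax over each row; the diagonal entries are the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()

def distill_loss(student_emb, teacher_emb):
    """Cosine-alignment distillation, an assumed formulation.

    Pulls the student's multimodal embedding toward a frozen teacher
    embedding (e.g., a sequence-to-item or ID-to-item model).
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return (1.0 - (s * t).sum(axis=1)).mean()
```

A combined objective would then be a weighted sum, e.g. `info_nce_loss(q, k) + lam * distill_loss(q, teacher)`, with the weight `lam` tuned per stage; how the report actually balances the two terms is not specified in the abstract.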