

SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model

October 14, 2025
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
cs.AI

Abstract

Multimodal embedding models aim to yield informative unified representations that power diverse cross-modal tasks. Despite promising progress in the evolution from CLIP-based dual-tower architectures to large vision-language models, prior works still face unavoidable challenges in real-world applications and business scenarios, such as limited modality support, unstable training mechanisms, and industrial domain gaps. In this work, we introduce SAIL-Embedding, an omni-modal embedding foundation model that addresses these issues through tailored training strategies and architectural design. For optimization, we propose a multi-stage training scheme to boost the multifaceted effectiveness of representation learning. Specifically, content-aware progressive training enhances the model's adaptability to diverse downstream tasks and helps it master enriched cross-modal proficiency. Collaboration-aware recommendation enhancement training further adapts multimodal representations to recommendation scenarios by distilling knowledge from sequence-to-item and ID-to-item embeddings while mining users' historical interests. Concurrently, we develop stochastic specialization and dataset-driven pattern matching to strengthen the flexibility and generalizability of model training. Experimental results show that SAIL-Embedding achieves state-of-the-art (SOTA) performance across different retrieval tasks compared with other methods. In online experiments across the various real-world scenarios integrated with our model, we observe a significant increase in Lifetime (LT), a crucial indicator of recommendation experience. For instance, the model delivers a 7-day LT gain of +0.158% and a 14-day LT gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed ranking model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
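The abstract says the collaboration-aware stage distills knowledge from sequence-to-item and ID-to-item embeddings into the omni-modal representation, but does not spell out the objective. The snippet below is only a minimal sketch of what such a distillation loss could look like, assuming an in-batch contrastive (InfoNCE-style) formulation with frozen teachers; every function name, tensor shape, temperature, and loss weight is a hypothetical placeholder, not SAIL-Embedding's actual design.

```python
# Hypothetical sketch of collaboration-aware distillation: align the omni-modal
# "student" item embedding with two "teacher" embeddings (sequence-to-item and
# ID-to-item). All names, shapes, and weights are illustrative assumptions.
import torch
import torch.nn.functional as F


def distill_loss(
    omni_emb: torch.Tensor,      # (B, D) omni-modal student embeddings
    seq2item_emb: torch.Tensor,  # (B, D) teacher: sequence-to-item embeddings
    id2item_emb: torch.Tensor,   # (B, D) teacher: ID-to-item embeddings
    temperature: float = 0.05,   # assumed softmax temperature
    w_seq: float = 1.0,          # assumed weight for the seq-to-item term
    w_id: float = 1.0,           # assumed weight for the ID-to-item term
) -> torch.Tensor:
    """In-batch contrastive distillation: each student embedding should be
    closest to its own item's teacher embedding among all items in the batch."""
    def info_nce(student: torch.Tensor, teacher: torch.Tensor) -> torch.Tensor:
        student = F.normalize(student, dim=-1)
        teacher = F.normalize(teacher.detach(), dim=-1)  # teachers stay frozen
        logits = student @ teacher.t() / temperature     # (B, B) similarities
        targets = torch.arange(student.size(0), device=student.device)
        return F.cross_entropy(logits, targets)

    return w_seq * info_nce(omni_emb, seq2item_emb) + w_id * info_nce(omni_emb, id2item_emb)
```

A simpler alternative would regress cosine similarity between student and teacher pairs directly; the contrastive form sketched here additionally treats other in-batch items as negatives, which matches the retrieval-style training the abstract emphasizes.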