SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
October 14, 2025
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
cs.AI
Abstract
Multimodal embedding models aim to yield informative unified representations
that empower diverse cross-modal tasks. Despite promising developments in the
evolution from CLIP-based dual-tower architectures to large vision-language
models, prior works still face persistent challenges in real-world
applications and business scenarios, such as limited modality support,
unstable training mechanisms, and industrial domain gaps. In this work, we
introduce SAIL-Embedding, an omni-modal embedding foundation model that
addresses these issues through tailored training strategies and architectural
design. In the optimization procedure, we propose a multi-stage training scheme
to boost the multifaceted effectiveness of representation learning.
Specifically, the content-aware progressive training aims to enhance the
model's adaptability to diverse downstream tasks and to master rich
cross-modal capabilities. The collaboration-aware recommendation enhancement
training further adapts multimodal representations for recommendation scenarios
by distilling knowledge from sequence-to-item and ID-to-item embeddings while
mining users' historical interests. Concurrently, we develop stochastic
specialization and dataset-driven pattern matching to strengthen model training
flexibility and generalizability. Experimental results show that SAIL-Embedding
achieves state-of-the-art (SOTA) performance across diverse retrieval tasks.
In online experiments across various real-world scenarios integrated
with our model, we observe a significant increase in Lifetime (LT), which is a
crucial indicator for the recommendation experience. For instance, the model
delivers a 7-day LT gain of +0.158% and a 14-day LT gain of +0.144% in the
Douyin-Selected scenario. For the Douyin feed rank model, the match features
produced by SAIL-Embedding yield a +0.08% AUC gain.
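The abstract names two loss families without giving formulas: in-batch contrastive training for cross-modal retrieval, and distillation of knowledge from sequence-to-item and ID-to-item embeddings. The sketch below illustrates both under assumed formulations — an InfoNCE objective with in-batch negatives and a cosine-alignment distillation term. The function names, temperature value, and exact objectives are illustrative assumptions, not the report's actual implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize embeddings to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def info_nce_loss(query_emb, item_emb, temperature=0.07):
    """In-batch contrastive (InfoNCE-style) loss, an assumed formulation.

    Row i of query_emb and row i of item_emb are a matched cross-modal
    pair; all other rows in the batch serve as negatives.
    """
    q = l2_normalize(query_emb)
    k = l2_normalize(item_emb)
    logits = q @ k.T / temperature                      # (B, B) similarities
    # log-softmax over each row; the diagonal entries are the positives
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    idx = np.arange(len(q))
    return -log_probs[idx, idx].mean()

def distill_loss(student_emb, teacher_emb):
    """Cosine-alignment distillation, an assumed formulation.

    Pulls the student's multimodal embedding toward a frozen teacher
    embedding (e.g., a sequence-to-item or ID-to-item model).
    """
    s = l2_normalize(student_emb)
    t = l2_normalize(teacher_emb)
    return (1.0 - (s * t).sum(axis=1)).mean()
```

A combined objective would then be a weighted sum, e.g. `info_nce_loss(q, k) + lam * distill_loss(q, teacher)`, with the weight `lam` tuned per stage; how the report actually balances the two terms is not specified in the abstract.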