SAIL-Embedding Technical Report: Omni-modal Embedding Foundation Model
October 14, 2025
Authors: Lin Lin, Jiefeng Long, Zhihe Wan, Yuchi Wang, Dingkang Yang, Shuang Yang, Yueyang Yao, Xu Chen, Zirui Guo, Shengqiang Li, Weiran Li, Hanyu Li, Yaling Mou, Yan Qiu, Haiyang Yu, Xiao Liang, Hongsheng Li, Chao Feng
cs.AI
Abstract
Multimodal embedding models aim to yield informative unified representations
that empower diverse cross-modal tasks. Despite promising developments in the
evolution from CLIP-based dual-tower architectures to large vision-language
models, prior works still face unavoidable challenges in real-world
applications and business scenarios, such as limited modality support,
unstable training mechanisms, and industrial domain gaps. In this work, we
introduce SAIL-Embedding, an omni-modal embedding foundation model that
addresses these issues through tailored training strategies and architectural
design. In the optimization procedure, we propose a multi-stage training scheme
to boost the multifaceted effectiveness of representation learning.
Specifically, the content-aware progressive training aims to enhance the
model's adaptability to diverse downstream tasks and to build rich
cross-modal proficiency. The collaboration-aware recommendation enhancement
training further adapts multimodal representations for recommendation scenarios
by distilling knowledge from sequence-to-item and ID-to-item embeddings while
mining users' historical interests. Concurrently, we develop stochastic
specialization and dataset-driven pattern matching to strengthen model training
flexibility and generalizability. Experimental results show that SAIL-Embedding
achieves state-of-the-art (SOTA) performance across different retrieval tasks,
outperforming competing methods. In online experiments across various
real-world scenarios that integrate our model, we observe a significant
increase in Lifetime (LT), a crucial indicator of recommendation experience.
For instance, the model delivers a 7-day LT gain of +0.158% and a 14-day LT
gain of +0.144% in the Douyin-Selected scenario. For the Douyin feed ranking
model, the match features produced by SAIL-Embedding yield a +0.08% AUC gain.
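
To make the collaboration-aware distillation idea concrete, the sketch below shows one plausible way to align a multimodal embedding with frozen sequence-to-item and ID-to-item teacher embeddings via an in-batch contrastive loss. This is not the report's implementation: the function name `distill_loss`, the tensor shapes, the temperature value, and the equal weighting of the two teachers are all illustrative assumptions.

```python
# A minimal sketch (assumption, not the authors' code) of distilling knowledge
# from sequence-to-item and ID-to-item embeddings into a multimodal "student".
import torch
import torch.nn.functional as F


def distill_loss(student: torch.Tensor,
                 seq2item_teacher: torch.Tensor,
                 id2item_teacher: torch.Tensor,
                 temperature: float = 0.05) -> torch.Tensor:
    """Align student embeddings with two frozen teacher embedding spaces.

    student:          (B, D) multimodal embeddings for a batch of items
    seq2item_teacher: (B, D) embeddings derived from user behavior sequences
    id2item_teacher:  (B, D) collaborative ID-based item embeddings
    """
    s = F.normalize(student, dim=-1)
    losses = []
    for teacher in (seq2item_teacher, id2item_teacher):
        t = F.normalize(teacher.detach(), dim=-1)  # teachers stay frozen
        # In-batch contrastive alignment: item i's student embedding should
        # match teacher embedding i against all other items in the batch.
        logits = s @ t.T / temperature
        labels = torch.arange(s.size(0), device=s.device)
        losses.append(F.cross_entropy(logits, labels))
    return torch.stack(losses).mean()


# Toy usage with random tensors standing in for real embeddings.
B, D = 32, 256
loss = distill_loss(torch.randn(B, D, requires_grad=True),
                    torch.randn(B, D),
                    torch.randn(B, D))
loss.backward()
```

Detaching the teacher embeddings ensures gradients update only the student, reflecting the usual distillation setup in which collaborative signals are treated as fixed targets.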