RecGOAT: Graph Optimal Adaptive Transport for LLM-Enhanced Multimodal Recommendation with Dual Semantic Alignment
January 31, 2026
Authors: Yuecheng Li, Hengwei Ju, Zeyu Song, Wei Yang, Chi Lu, Peng Jiang, Kun Gai
cs.AI
Abstract
Multimodal recommendation systems typically integrate user behavior with multimodal item data, thereby capturing user preferences more accurately. With the rise of large models (LMs), multimodal recommendation increasingly leverages their strengths in semantic understanding and contextual reasoning. However, LM representations are inherently optimized for general semantic tasks, while recommendation models rely heavily on sparse user/item unique identity (ID) features. Existing works overlook this fundamental representational divergence between large models and recommendation systems, resulting in incompatible multimodal representations and suboptimal recommendation performance. To bridge this gap, we propose RecGOAT, a novel yet simple dual semantic alignment framework for LLM-enhanced multimodal recommendation, which offers theoretically guaranteed alignment capability. RecGOAT first employs graph attention networks to enrich collaborative semantics by modeling item-item, user-item, and user-user relationships, leveraging user/item LM representations and interaction history. Furthermore, we design a dual-granularity progressive multimodality-ID alignment framework, which achieves instance-level and distribution-level semantic alignment via cross-modal contrastive learning (CMCL) and optimal adaptive transport (OAT), respectively. Theoretically, we demonstrate that the unified representations derived from our alignment framework exhibit superior semantic consistency and comprehensiveness. Extensive experiments on three public benchmarks show that RecGOAT achieves state-of-the-art performance, empirically validating our theoretical insights. Additionally, deployment on a large-scale online advertising platform confirms the model's effectiveness and scalability in industrial recommendation scenarios. Code available at https://github.com/6lyc/RecGOAT-LLM4Rec.
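The dual-granularity alignment described above can be illustrated with a rough sketch. This is not the authors' implementation: the InfoNCE-style contrastive loss for the instance level and the entropy-regularized Sinkhorn solver for the distribution level are common choices assumed here, and all function names and hyperparameters are illustrative.

```python
import numpy as np

def info_nce(id_emb, lm_emb, tau=0.1):
    """Instance-level cross-modal contrastive loss: each ID embedding's
    positive is the LM embedding of the same item (matrix diagonal)."""
    a = id_emb / np.linalg.norm(id_emb, axis=1, keepdims=True)
    b = lm_emb / np.linalg.norm(lm_emb, axis=1, keepdims=True)
    logits = a @ b.T / tau                       # (n, n) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))           # diagonal = matched pairs

def sinkhorn_ot_cost(id_emb, lm_emb, eps=0.05, n_iter=100):
    """Distribution-level alignment: entropy-regularized optimal transport
    cost between the two embedding sets, with uniform marginals."""
    n, m = len(id_emb), len(lm_emb)
    C = ((id_emb[:, None, :] - lm_emb[None, :, :]) ** 2).sum(-1)  # sq. dists
    K = np.exp(-C / eps)                         # Gibbs kernel
    u, v = np.ones(n) / n, np.ones(m) / m
    for _ in range(n_iter):                      # Sinkhorn fixed-point updates
        u = (1.0 / n) / (K @ v)
        v = (1.0 / m) / (K.T @ u)
    P = u[:, None] * K * v[None, :]              # transport plan
    return (P * C).sum()                         # expected transport cost

# Toy data: LM embeddings as a noisy view of the same items' ID embeddings.
rng = np.random.default_rng(0)
id_emb = rng.normal(size=(8, 16))
lm_emb = id_emb + 0.1 * rng.normal(size=(8, 16))
total_alignment_loss = info_nce(id_emb, lm_emb) + sinkhorn_ot_cost(id_emb, lm_emb)
```

In a training loop both terms would be added (typically with weighting coefficients) to the recommendation loss, pulling each ID embedding toward its own LM embedding while the OT term matches the two embedding distributions as a whole.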