Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation
September 11, 2025
Authors: Kelin Ren, Chan-Yang Ju, Dong-Ho Lee
cs.AI
Abstract
Multimodal recommendation systems are increasingly becoming foundational
technologies for e-commerce and content platforms, enabling personalized
services by jointly modeling users' historical behaviors and the multimodal
features of items (e.g., visual and textual). However, most existing methods
rely on either static fusion strategies or graph-based local interaction
modeling, facing two critical limitations: (1) insufficient ability to model
fine-grained cross-modal associations, leading to suboptimal fusion quality;
and (2) a lack of global distribution-level consistency, causing
representational bias. To address these limitations, we propose MambaRec, a novel framework
that integrates local feature alignment and global distribution regularization
via attention-guided learning. At its core, we introduce the Dilated Refinement
Attention Module (DREAM), which uses multi-scale dilated convolutions with
channel-wise and spatial attention to align fine-grained semantic patterns
between visual and textual modalities. This module captures hierarchical
relationships and context-aware associations, improving cross-modal semantic
modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive
loss functions to constrain global modality alignment, enhancing semantic
consistency. This dual regularization reduces modality-specific bias and
boosts robustness. To improve scalability, MambaRec employs a dimensionality
reduction strategy to lower the computational cost of high-dimensional
multimodal features. Extensive experiments on real-world e-commerce datasets
show that MambaRec outperforms existing methods in fusion quality,
generalization, and efficiency. Our code has been made publicly available at
https://github.com/rkl71/MambaRec.
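
To make the global alignment terms concrete, the following is a minimal, dependency-free sketch of the two quantities the abstract names: a Maximum Mean Discrepancy estimate between visual and textual embeddings, and a contrastive loss over matched pairs. It is not taken from the MambaRec repository. The Gaussian (RBF) kernel, the biased MMD estimator, the InfoNCE form of the contrastive loss, and the names `mmd_squared`, `info_nce`, `sigma`, and `temperature` are all assumptions chosen for illustration.

```python
import math


def rbf_kernel(x, y, sigma=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def mean_kernel(xs, ys, sigma):
    """Average kernel value over all pairs (x, y) with x in xs and y in ys."""
    total = sum(rbf_kernel(x, y, sigma) for x in xs for y in ys)
    return total / (len(xs) * len(ys))


def mmd_squared(visual, textual, sigma=1.0):
    """Biased estimate of squared MMD between two sets of embeddings."""
    return (mean_kernel(visual, visual, sigma)
            + mean_kernel(textual, textual, sigma)
            - 2.0 * mean_kernel(visual, textual, sigma))


def cosine(x, y):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm


def info_nce(visual, textual, temperature=0.2):
    """InfoNCE loss (assumed form): visual[i] should match textual[i]."""
    losses = []
    for i, v in enumerate(visual):
        logits = [cosine(v, t) / temperature for t in textual]
        # Log-sum-exp with the maximum subtracted for numerical stability.
        top = max(logits)
        log_denominator = top + math.log(sum(math.exp(l - top) for l in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy embeddings for two items; the values are hypothetical.
visual = [[1.0, 0.0], [0.0, 1.0]]
textual_close = [[0.9, 0.1], [0.1, 0.9]]   # roughly aligned with `visual`
textual_far = [[5.0, 5.0], [6.0, 4.0]]     # drawn from a shifted distribution

print("MMD^2 (close):", mmd_squared(visual, textual_close))
print("MMD^2 (far):  ", mmd_squared(visual, textual_far))
print("InfoNCE (matched pairs):", info_nce(visual, textual_close))
```

In this sketch the MMD term compares the two modalities as whole distributions, while the InfoNCE term compares them item by item, which is one way to read the "dual regularization" described above. How the actual framework weights and combines the two terms is not stated in the abstract.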