Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation
September 11, 2025
Authors: Kelin Ren, Chan-Yang Ju, Dong-Ho Lee
cs.AI
Abstract
Multimodal recommendation systems are increasingly becoming foundational
technologies for e-commerce and content platforms, enabling personalized
services by jointly modeling users' historical behaviors and the multimodal
features of items (e.g., visual and textual). However, most existing methods
rely on either static fusion strategies or graph-based local interaction
modeling, facing two critical limitations: (1) insufficient ability to model
fine-grained cross-modal associations, leading to suboptimal fusion quality;
and (2) a lack of global distribution-level consistency, causing
representational bias. To address these limitations, we propose MambaRec, a novel framework
that integrates local feature alignment and global distribution regularization
via attention-guided learning. At its core, we introduce the Dilated Refinement
Attention Module (DREAM), which uses multi-scale dilated convolutions with
channel-wise and spatial attention to align fine-grained semantic patterns
between visual and textual modalities. This module captures hierarchical
relationships and context-aware associations, improving cross-modal semantic
modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive
loss functions to constrain global modality alignment, enhancing semantic
consistency. This dual regularization reduces modality-specific deviations and
boosts robustness. To improve scalability, MambaRec employs a dimensionality
reduction strategy to lower the computational cost of high-dimensional
multimodal features. Extensive experiments on real-world e-commerce datasets
show that MambaRec outperforms existing methods in fusion quality,
generalization, and efficiency. Our code has been made publicly available at
https://github.com/rkl71/MambaRec.
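
The abstract describes the DREAM module only at a high level. As a minimal illustration of the ingredients it names (multi-scale dilated convolutions combined with channel-wise and spatial attention over visual and textual item features), a PyTorch-style sketch is given below. All class names, tensor shapes, dilation rates, and the final fusion layer are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only -- not the authors' released DREAM code.
# Assumes each item's visual/textual features arrive as sequences of
# embeddings of shape (batch, length, dim).
import torch
import torch.nn as nn


class MultiScaleDilatedBlock(nn.Module):
    """Parallel 1-D convolutions with growing dilation rates, capturing
    multi-scale context over a modality's feature sequence."""

    def __init__(self, dim: int, dilations=(1, 2, 4)):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        ])
        self.proj = nn.Conv1d(dim * len(dilations), dim, kernel_size=1)

    def forward(self, x):                      # x: (batch, dim, length)
        return self.proj(torch.cat([b(x) for b in self.branches], dim=1))


class ChannelSpatialAttention(nn.Module):
    """Channel re-weighting (squeeze-and-excitation style) followed by a
    position-wise spatial attention gate."""

    def __init__(self, dim: int, reduction: int = 4):
        super().__init__()
        self.channel = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(dim, dim // reduction, 1), nn.ReLU(),
            nn.Conv1d(dim // reduction, dim, 1), nn.Sigmoid(),
        )
        self.spatial = nn.Sequential(
            nn.Conv1d(dim, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )

    def forward(self, x):                      # x: (batch, dim, length)
        x = x * self.channel(x)                # emphasize informative channels
        return x * self.spatial(x)             # emphasize informative positions


class DreamLikeAligner(nn.Module):
    """Refines each modality with dilated convolutions plus attention,
    then fuses the refined visual and textual representations."""

    def __init__(self, dim: int):
        super().__init__()
        self.visual = nn.Sequential(MultiScaleDilatedBlock(dim),
                                    ChannelSpatialAttention(dim))
        self.textual = nn.Sequential(MultiScaleDilatedBlock(dim),
                                     ChannelSpatialAttention(dim))
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, vis, txt):               # (batch, length, dim) each
        v = self.visual(vis.transpose(1, 2)).transpose(1, 2)
        t = self.textual(txt.transpose(1, 2)).transpose(1, 2)
        return self.fuse(torch.cat([v, t], dim=-1))
```

The parallel dilated branches give each position access to progressively wider context without pooling, which is one plausible way to capture the "hierarchical relationships and context-aware associations" the abstract refers to.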
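
Likewise, the global alignment objective is described only as a combination of MMD and a contrastive loss. A common realization, assuming a multi-bandwidth Gaussian kernel for MMD and a symmetric InfoNCE-style contrastive term over paired visual/textual item embeddings, is sketched below; the kernel bandwidths, temperature, and loss weights are assumptions rather than the paper's reported settings.

```python
# Illustrative sketch only -- a common MMD + InfoNCE combination, not
# necessarily the exact losses used in MambaRec.
import torch
import torch.nn.functional as F


def gaussian_mmd(x, y, bandwidths=(1.0, 2.0, 4.0)):
    """Squared MMD between two sets of embeddings under a
    multi-bandwidth Gaussian kernel. x, y: (n, dim)."""
    def kernel(a, b):
        d2 = torch.cdist(a, b).pow(2)          # pairwise squared distances
        return sum(torch.exp(-d2 / (2 * s ** 2)) for s in bandwidths)
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()


def info_nce(x, y, temperature=0.07):
    """Symmetric InfoNCE: matching (visual, textual) pairs are positives,
    all other in-batch pairs serve as negatives."""
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature            # (n, n) similarity matrix
    labels = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))


def global_alignment_loss(vis_emb, txt_emb, mmd_weight=1.0, nce_weight=1.0):
    """Distribution-level (MMD) plus instance-level (contrastive)
    alignment over a batch of paired item embeddings."""
    return (mmd_weight * gaussian_mmd(vis_emb, txt_emb)
            + nce_weight * info_nce(vis_emb, txt_emb))
```

In this pairing, the MMD term pulls the overall visual and textual embedding distributions together while the contrastive term preserves instance-level discrimination, matching the "dual regularization" framing in the abstract.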