Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation

Abstract

Multimodal recommendation systems are becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, and face two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, which leads to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, which causes representational bias. To address these limitations, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which combines multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between the visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces modality-specific deviations and improves robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy that lowers the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code is publicly available at https://github.com/rkl71/MambaRec.
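
As a rough illustration of the global alignment term, the sketch below computes a biased estimate of squared MMD with a Gaussian kernel between two sets of toy embeddings. The kernel choice, the bandwidth, the synthetic data, and the function names (`rbf_kernel`, `mmd_squared`, `sample`) are assumptions made for illustration only; they are not taken from the MambaRec code.

```python
import math
import random


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a_rows, b_rows):
        total = sum(rbf_kernel(a, b, bandwidth) for a in a_rows for b in b_rows)
        return total / (len(a_rows) * len(b_rows))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


def sample(n, dim, mean, rng):
    """Draw n Gaussian vectors with the given mean (stand-ins for embeddings)."""
    return [[rng.gauss(mean, 1.0) for _ in range(dim)] for _ in range(n)]


rng = random.Random(0)
visual = sample(50, 4, 0.0, rng)        # toy "visual" embeddings
text_aligned = sample(50, 4, 0.0, rng)  # drawn from the same distribution
text_shifted = sample(50, 4, 2.0, rng)  # drawn from a shifted distribution

mmd_aligned = mmd_squared(visual, text_aligned)
mmd_shifted = mmd_squared(visual, text_shifted)
print(f"aligned: {mmd_aligned:.4f}, shifted: {mmd_shifted:.4f}")
```

In this toy setting the estimate is larger for the shifted sample than for the aligned one, which is the behavior a distribution-level regularizer relies on: minimizing it pulls the two modality distributions toward each other.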
