Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation

Abstract

Multimodal recommendation systems are becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, and therefore face two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these limitations, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces modality-specific deviations and boosts robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code is publicly available at https://github.com/rkl71/MambaRec.
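
As a hedged illustration of the global alignment term mentioned above, the following sketch computes a biased estimate of the squared MMD between two sets of embeddings using a Gaussian (RBF) kernel. The kernel choice, the bandwidth `sigma`, the function names, and the random stand-in features are all assumptions made for illustration; the abstract does not specify how MambaRec computes MMD, and the released code may differ.

```python
import numpy as np


def rbf_kernel(a, b, sigma=2.0):
    """Gaussian kernel matrix between the rows of a and the rows of b.

    sigma is a hypothetical bandwidth; the abstract does not state one.
    """
    # Pairwise squared Euclidean distances via broadcasting.
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * (a @ b.T)
    )
    sq_dist = np.maximum(sq_dist, 0.0)  # guard against tiny negative round-off
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


def mmd2_biased(x, y, sigma=2.0):
    """Biased estimate of squared MMD between samples x and y.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with each expectation replaced by a mean over all sample pairs.
    """
    k_xx = rbf_kernel(x, x, sigma).mean()
    k_yy = rbf_kernel(y, y, sigma).mean()
    k_xy = rbf_kernel(x, y, sigma).mean()
    return k_xx + k_yy - 2.0 * k_xy


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for visual and textual item embeddings (assumed shapes).
    visual = rng.normal(size=(64, 8))
    textual_aligned = rng.normal(size=(64, 8))        # same distribution
    textual_shifted = rng.normal(size=(64, 8)) + 3.0  # shifted distribution
    print("aligned:", mmd2_biased(visual, textual_aligned))
    print("shifted:", mmd2_biased(visual, textual_shifted))
```

Under these assumptions, the shifted pair yields a larger value than the aligned pair, which is the quantity a distribution-level regularizer of this kind would push toward zero during training.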
