CM^3: Calibrating Multimodal Recommendation
August 2, 2025
Authors: Xin Zhou, Yongjie Wang, Zhiqi Shen
cs.AI
Abstract
Alignment and uniformity are fundamental principles within the domain of
contrastive learning. In recommender systems, prior work has established that
optimizing the Bayesian Personalized Ranking (BPR) loss contributes to the
objectives of alignment and uniformity. Specifically, alignment aims to draw
together the representations of interacting users and items, while uniformity
mandates a uniform distribution of user and item embeddings across a unit
hypersphere. This study revisits the alignment and uniformity properties within
the context of multimodal recommender systems, revealing a proclivity among
extant models to prioritize uniformity to the detriment of alignment. Our
hypothesis challenges the conventional assumption of equitable item treatment
through a uniformity loss, proposing a more nuanced approach wherein items with
similar multimodal attributes converge toward proximal representations within
the hyperspherical manifold. Specifically, we leverage the inherent similarity
between items' multimodal data to calibrate their uniformity distribution,
thereby inducing a more pronounced repulsive force between dissimilar entities
within the embedding space. A theoretical analysis elucidates the relationship
between this calibrated uniformity loss and the conventional uniformity
function. Moreover, to enhance the fusion of multimodal features, we introduce
a Spherical Bézier method designed to integrate an arbitrary number of
modalities while ensuring that the resulting fused features are constrained to
the same hyperspherical manifold. Empirical evaluations conducted on five
real-world datasets substantiate the superiority of our approach over competing
baselines. We also show that the proposed methods can achieve up to a 5.4%
increase in NDCG@20 performance via the integration of MLLM-extracted features.
Source code is available at: https://github.com/enoche/CM3.
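
To make the quantities named in the abstract concrete, the following is a minimal sketch rather than the authors' implementation (see the repository above for that). It assumes the standard contrastive-learning definitions: alignment as the mean squared distance between paired user and item embeddings, and uniformity as the log of the mean Gaussian potential over all pairs on the unit hypersphere. The calibrated uniformity loss is not reproduced, since the abstract does not give its form. The `spherical_bezier` helper is one plausible reading of a "Spherical Bézier" fusion, namely de Casteljau recursion with spherical linear interpolation; all function names, the temperature `t`, and the weight `w` are illustrative assumptions.

```python
import math

def normalize(v):
    """Project a vector onto the unit hypersphere."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def alignment_loss(users, items):
    """Mean squared distance between paired (interacting) user/item embeddings."""
    return sum(sq_dist(u, i) for u, i in zip(users, items)) / len(users)

def uniformity_loss(points, t=2.0):
    """Log of the mean Gaussian potential over all distinct pairs.

    Lower values mean the points are spread more evenly on the sphere.
    """
    pairs = [(a, b) for k, a in enumerate(points) for b in points[k + 1:]]
    total = sum(math.exp(-t * sq_dist(a, b)) for a, b in pairs)
    return math.log(total / len(pairs))

def slerp(a, b, w):
    """Spherical linear interpolation between unit vectors a and b, w in [0, 1]."""
    dot = max(-1.0, min(1.0, sum(x * y for x, y in zip(a, b))))
    theta = math.acos(dot)
    s = math.sin(theta)
    if s < 1e-8:
        # Identical or antipodal inputs: the great-circle path is degenerate
        # or ambiguous, so this sketch simply returns the first vector.
        return list(a)
    ca = math.sin((1 - w) * theta) / s
    cb = math.sin(w * theta) / s
    return [ca * x + cb * y for x, y in zip(a, b)]

def spherical_bezier(modalities, w=0.5):
    """Assumed fusion: de Casteljau recursion using slerp instead of lerp.

    Accepts any number of modality vectors; the result stays on the unit sphere.
    """
    pts = [normalize(m) for m in modalities]
    while len(pts) > 1:
        pts = [slerp(p, q, w) for p, q in zip(pts, pts[1:])]
    return pts[0]

# Hypothetical modality features (e.g. image, text, audio) for one item.
fused = spherical_bezier([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]])
fused_norm = math.sqrt(sum(x * x for x in fused))
```

Because every intermediate point is produced by slerp between unit vectors, the fused feature lies on the same hypersphere as its inputs, which is the property the abstract attributes to the fusion step. A plain weighted average would instead fall inside the sphere and need renormalizing.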