OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment
February 26, 2025
Authors: Jiaxin Deng, Shiyao Wang, Kuo Cai, Lejian Ren, Qigen Hu, Weifeng Ding, Qiang Luo, Guorui Zhou
cs.AI
Abstract
Recently, generative retrieval-based recommendation systems have emerged as a
promising paradigm. However, most modern recommender systems adopt a
retrieve-and-rank strategy, where the generative model functions only as a
selector during the retrieval stage. In this paper, we propose OneRec, which
replaces the cascaded learning framework with a unified generative model. To
the best of our knowledge, this is the first end-to-end generative model that
significantly surpasses current complex and well-designed recommender systems
in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder
structure, which encodes the user's historical behavior sequences and gradually
decodes the videos that the user may be interested in. We adopt sparse
Mixture-of-Experts (MoE) to scale model capacity without proportionally
increasing computational FLOPs. 2) a session-wise generation approach. In
contrast to traditional next-item prediction, we propose session-wise
generation, which is more elegant and contextually coherent than point-by-point
generation that relies on hand-crafted rules to properly combine the generated
results. 3) an Iterative Preference Alignment module combined with Direct
Preference Optimization (DPO) to enhance the quality of the generated results.
Unlike DPO in NLP, a recommendation system typically has only one opportunity
to display results for each user's browsing request, making it impossible to
obtain positive and negative samples simultaneously. To address this
limitation, we design a reward model to simulate user generation and customize
the sampling strategy. Extensive experiments have demonstrated that a limited
number of DPO samples can align user interest preferences and significantly
improve the quality of generated results. We deployed OneRec in the main scene
of Kuaishou, achieving a 1.6% increase in watch-time, which is a substantial
improvement.
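
Reading the abstract's first component at face value, a minimal PyTorch-style sketch might look like the following: an encoder over the user's historical behavior sequence, a decoder that autoregressively emits tokens for videos the user may be interested in, and a sparse Mixture-of-Experts feed-forward block so capacity scales without a proportional increase in FLOPs. All layer counts, dimensions, routing details, and class names (SparseMoEFFN, OneRecLikeModel) are illustrative assumptions, not the paper's actual configuration.

    import torch
    import torch.nn as nn

    class SparseMoEFFN(nn.Module):
        # Sparse Mixture-of-Experts feed-forward block: each token is routed to its
        # top-k experts, so capacity grows with num_experts while per-token compute
        # stays roughly constant. Sizes and the top-k choice are illustrative.
        def __init__(self, d_model=512, d_ff=2048, num_experts=8, k=2):
            super().__init__()
            self.k = k
            self.router = nn.Linear(d_model, num_experts)
            self.experts = nn.ModuleList(
                nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
                for _ in range(num_experts)
            )

        def forward(self, x):                                  # x: (batch, seq, d_model)
            gates = self.router(x).softmax(-1)                 # routing probabilities
            topv, topi = gates.topk(self.k, dim=-1)
            out = torch.zeros_like(x)
            for slot in range(self.k):                         # dense loop for clarity; real systems dispatch sparsely
                idx, w = topi[..., slot], topv[..., slot].unsqueeze(-1)
                for e, expert in enumerate(self.experts):
                    out = out + (idx == e).unsqueeze(-1) * w * expert(x)
            return out

    class OneRecLikeModel(nn.Module):
        # Encoder reads the user's historical behavior tokens; the decoder
        # autoregressively emits tokens for the videos the user may be interested in.
        def __init__(self, vocab_size=8192, d_model=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, d_model)
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead=8, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers=4)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers=4)
            self.moe_ffn = SparseMoEFFN(d_model)               # MoE block applied to decoder states
            self.lm_head = nn.Linear(d_model, vocab_size)

        def forward(self, history_tokens, target_tokens):
            memory = self.encoder(self.embed(history_tokens))
            causal = nn.Transformer.generate_square_subsequent_mask(
                target_tokens.size(1)).to(target_tokens.device)
            h = self.decoder(self.embed(target_tokens), memory, tgt_mask=causal)
            return self.lm_head(self.moe_ffn(h))               # logits over the item-token vocabulary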
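
The second component, session-wise generation, can be illustrated with a decoding loop that produces an entire ordered session in one pass, rather than predicting a single next item and relying on hand-crafted rules to combine results. Representing each video as a fixed number of semantic-ID tokens (codes_per_item) and the greedy decoding shown here are assumptions of this sketch; the abstract does not specify them.

    import torch

    @torch.no_grad()
    def generate_session(model, history_tokens, codes_per_item=3, items_per_session=8, bos_id=0):
        # Decode a whole session (an ordered list of videos) as one token sequence,
        # assuming each video is represented by `codes_per_item` semantic-ID tokens.
        device = history_tokens.device
        tokens = torch.full((history_tokens.size(0), 1), bos_id, dtype=torch.long, device=device)
        for _ in range(codes_per_item * items_per_session):
            logits = model(history_tokens, tokens)                 # (batch, cur_len, vocab)
            next_tok = logits[:, -1].argmax(-1, keepdim=True)      # greedy pick; beam search or sampling also works
            tokens = torch.cat([tokens, next_tok], dim=1)
        session = tokens[:, 1:]                                    # drop the BOS token
        # regroup the flat token stream into per-item code tuples: (batch, items, codes)
        return session.view(session.size(0), items_per_session, codes_per_item)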
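
The third component addresses the constraint that each browsing request yields only one displayed result, so chosen/rejected pairs cannot be observed directly. One way to read the abstract's description is: sample several candidate sessions, score them with a learned reward model, pair the best against the worst, and optimize the standard DPO objective against a frozen reference policy. The reward_model interface, the best-vs-worst pairing rule, and the beta value below are assumptions of this sketch.

    import torch
    import torch.nn.functional as F

    def session_logprob(model, history, session_tokens):
        # Summed log-probability of a session under the model (teacher-forcing shift).
        logits = model(history, session_tokens)                       # (batch, len, vocab)
        logp = logits.log_softmax(-1)
        tgt = session_tokens[:, 1:]                                   # tokens predicted from the previous position
        return logp[:, :-1].gather(-1, tgt.unsqueeze(-1)).squeeze(-1).sum(-1)

    def build_preference_pair(reward_model, history, sampled_sessions):
        # With one display chance per request there is no observed negative, so score
        # several sampled candidate sessions with the reward model and treat the best
        # as "chosen" and the worst as "rejected". reward_model(history, session) is
        # assumed to return a (batch,) reward tensor.
        rewards = torch.stack([reward_model(history, s) for s in sampled_sessions])   # (num_samples, batch)
        best, worst = rewards.argmax(0), rewards.argmin(0)
        chosen = torch.stack([sampled_sessions[best[i].item()][i] for i in range(history.size(0))])
        rejected = torch.stack([sampled_sessions[worst[i].item()][i] for i in range(history.size(0))])
        return chosen, rejected

    def dpo_loss(policy, reference, history, chosen, rejected, beta=0.1):
        # Direct Preference Optimization on self-generated sessions against a frozen
        # reference policy; beta is an illustrative value.
        pi_c = session_logprob(policy, history, chosen)
        pi_r = session_logprob(policy, history, rejected)
        with torch.no_grad():
            ref_c = session_logprob(reference, history, chosen)
            ref_r = session_logprob(reference, history, rejected)
        margin = beta * ((pi_c - ref_c) - (pi_r - ref_r))
        return -F.logsigmoid(margin).mean()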