OneRec: Unifying Retrieve and Rank with Generative Recommender and Iterative Preference Alignment

Abstract

Recently, generative retrieval-based recommendation systems have emerged as a promising paradigm. However, most modern recommender systems adopt a retrieve-and-rank strategy, in which the generative model acts only as a selector during the retrieval stage. In this paper, we propose OneRec, which replaces the cascaded learning framework with a unified generative model. To the best of our knowledge, this is the first end-to-end generative model that significantly surpasses current complex and well-designed recommender systems in real-world scenarios. Specifically, OneRec includes: 1) an encoder-decoder structure, which encodes the user's historical behavior sequences and gradually decodes the videos that the user may be interested in. We adopt sparse Mixture-of-Experts (MoE) to scale model capacity without proportionally increasing computational FLOPs. 2) a session-wise generation approach. In contrast to traditional next-item prediction, we generate a whole session at once, which is more elegant and contextually coherent than point-by-point generation that relies on hand-crafted rules to combine the generated results. 3) an Iterative Preference Alignment module combined with Direct Preference Optimization (DPO) to enhance the quality of the generated results. Unlike DPO in NLP, a recommendation system typically has only one opportunity to display results for each user's browsing request, so positive and negative samples for the same request cannot be obtained simultaneously. To address this limitation, we design a reward model that simulates user responses to generated results, and we customize the sampling strategy. Extensive experiments demonstrate that a limited number of DPO samples can align the model with user interest preferences and significantly improve the quality of the generated results. We deployed OneRec in the main scene of Kuaishou, achieving a substantial 1.6% increase in watch time.
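The preference-alignment step described in the abstract can be illustrated with a minimal sketch. It assumes one plausible sampling strategy, which the abstract itself does not specify: several candidate sessions are generated for a single request, a reward model scores each of them, and the highest- and lowest-scoring candidates form the (chosen, rejected) pair that is fed to the standard DPO loss. The names `build_preference_pair`, `dpo_loss`, and the toy reward used in the usage note are hypothetical and are not taken from the OneRec implementation.

```python
import math
from typing import Callable, Sequence, Tuple

# A generated session: an ordered tuple of item ids (hypothetical representation).
Session = Tuple[int, ...]


def build_preference_pair(
    candidates: Sequence[Session],
    reward_fn: Callable[[Session], float],
) -> Tuple[Session, Session]:
    """Return (chosen, rejected) as the highest- and lowest-reward candidates.

    Assumption: a best-versus-worst selection rule. The reward function
    stands in for the real user response that can never be observed for
    two different result lists of the same request.
    """
    ranked = sorted(candidates, key=reward_fn)
    return ranked[-1], ranked[0]


def dpo_loss(
    policy_logp_chosen: float,
    policy_logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one pair: -log(sigmoid(beta * margin)).

    The margin compares how much more the policy prefers the chosen session
    over the rejected one, relative to a frozen reference model.
    """
    margin = (policy_logp_chosen - ref_logp_chosen) - (
        policy_logp_rejected - ref_logp_rejected
    )
    x = beta * margin
    # -log(sigmoid(x)) == softplus(-x), computed in a numerically stable form.
    return max(-x, 0.0) + math.log1p(math.exp(-abs(x)))
```

As a usage example, with a toy reward that sums made-up per-item scores `{1: 0.9, 2: 0.1, 3: 0.5}`, the candidates `(1, 3)`, `(2, 3)`, and `(1, 2)` receive rewards 1.4, 0.6, and 1.0, so `(1, 3)` is chosen and `(2, 3)` is rejected. When the policy and the reference model assign identical log-probabilities, the margin is zero and the loss equals log 2; the loss falls below that value as the policy shifts probability toward the chosen session.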

