ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

May 27, 2026
Authors: Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang
cs.AI

Abstract

Proactive Recommender Systems (PRSs) aim to steer user preferences toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, because path rewards naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRSs results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, which creates a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose ProRL, an effective RL framework with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize the length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRS methods. Our code is available at https://github.com/hongruhou89/ProRL.
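
To make the two deficiencies and their fixes concrete, the following is a minimal sketch in plain Python. It is not the authors' implementation, and the abstract does not give the exact formulas. The function names `naive_weights` and `rectified_weights`, the use of the batch-mean step reward as the "expected reward", and the use of a per-position mean of reward-to-go as the "step-dependent baseline" are all assumptions made for illustration only.

```python
from typing import List

# Each path is a list of step-level rewards; the batch is assumed to be
# non-empty and to contain at least one non-empty path.

def naive_weights(paths: List[List[float]]) -> List[List[float]]:
    """Weight every step by the full path-level reward (plain policy gradient)."""
    return [[sum(p)] * len(p) for p in paths]

def rectified_weights(paths: List[List[float]]) -> List[List[float]]:
    """Centered, position-specific step weights (illustrative assumption)."""
    # 1) Stepwise centering: subtract the batch-mean step reward, used here as
    #    an assumed estimate of the expected reward, so that an extra step
    #    contributes zero reward on average.
    all_steps = [r for p in paths for r in p]
    mean_step = sum(all_steps) / len(all_steps)
    centered = [[r - mean_step for r in p] for p in paths]

    # 2) Reward-to-go: credit each step only with the centered rewards from
    #    its own position onward, instead of the whole path-level reward.
    to_go = []
    for p in centered:
        acc, row = 0.0, []
        for r in reversed(p):
            acc += r
            row.append(acc)
        to_go.append(row[::-1])

    # 3) Position-specific baseline: mean reward-to-go over all paths that
    #    are long enough to have a step at position t.
    max_len = max(len(p) for p in paths)
    baselines = []
    for t in range(max_len):
        vals = [g[t] for g in to_go if len(g) > t]
        baselines.append(sum(vals) / len(vals))

    return [[g[t] - baselines[t] for t in range(len(g))] for g in to_go]

# Two paths with identical per-step quality but different lengths.
paths = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
print(naive_weights(paths))      # [[2.0, 2.0], [4.0, 4.0, 4.0, 4.0]]
print(rectified_weights(paths))  # [[0.0, 0.0], [0.0, 0.0, 0.0, 0.0]]
```

In this toy batch every step earns the same reward, so neither path is better than the other. The naive weights still favor the longer path purely because it has more steps, while the centered, position-specific weights are all zero, which matches the stated goal that path extension alone should carry no gradient signal.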