ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Abstract

Proactive Recommender Systems (PRSs) aim to shift user preferences toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, because path rewards naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRSs yields deficient gradient estimates. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with a positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting every step by the entire path-level reward ignores this decomposition structure, leading to high gradient variance. To rectify both deficiencies, we propose ProRL, an effective RL framework with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts the expected reward from each step-level reward to neutralize the length-dependent bias, ensuring that path extension alone yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
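The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not taken from the ProRL repository, and every function name in it is hypothetical. It assumes that the expected step reward is estimated by the batch-wide mean, and that the step-dependent baseline at position t is the batch mean of the centered reward-to-go at that position; the method in the paper may estimate both quantities differently.

```python
from statistics import mean


def center_step_rewards(paths):
    """Stepwise centering (assumed form): subtract the batch-wide mean
    step reward, used here as a stand-in for the expected reward."""
    mu = mean(r for path in paths for r in path)
    return [[r - mu for r in path] for path in paths]


def rewards_to_go(path):
    """Suffix sums: entry t is the sum of rewards from step t to the end."""
    out, running = [], 0.0
    for r in reversed(path):
        running += r
        out.append(running)
    return out[::-1]


def position_specific_advantages(paths):
    """Assumed form of a step-dependent baseline: for each position t,
    subtract the mean centered reward-to-go over all paths reaching t."""
    rtg = [rewards_to_go(p) for p in center_step_rewards(paths)]
    max_len = max(len(p) for p in rtg)
    baselines = [
        mean(p[t] for p in rtg if len(p) > t) for t in range(max_len)
    ]
    return [[g - baselines[t] for t, g in enumerate(p)] for p in rtg]


if __name__ == "__main__":
    # Two paths with identical per-step quality but different lengths.
    paths = [[1.0, 1.0], [1.0, 1.0, 1.0, 1.0]]
    print([sum(p) for p in paths])              # naive path-level weights
    print(position_specific_advantages(paths))  # per-step weights
```

In this toy batch, the naive path-level weights are 2.0 and 4.0, so the longer path receives a larger gradient weight purely because of its length. After centering and per-position baselining, every step weight is 0.0, which matches the abstract's claim that path extension alone should carry no gradient signal.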