ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

Abstract

Proactive Recommender Systems (PRSs) aim to shift user preferences toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, because path rewards naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRSs yields deficient gradient estimates. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, which creates a length-dependent bias that makes gradients favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores this decomposition structure, which leads to high gradient variance. To rectify these two deficiencies, we propose ProRL, an effective RL framework for proactive recommendation with two novel mechanisms. First, Stepwise Reward Centering subtracts expected rewards to neutralize the length-dependent bias, ensuring that path extension alone yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
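The two mechanisms named in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the abstract gives no formulas, so the code below assumes that the expected step reward is estimated by a batch mean, that each step is credited with the centered reward-to-go, and that the step-dependent baseline is the batch mean of that quantity at each position. The function names (`naive_weights`, `centered_rewards`, `position_specific_advantages`) and the toy rewards are hypothetical.

```python
import numpy as np

def naive_weights(step_rewards):
    """Naive policy-gradient weights: every step gets the full path reward."""
    return [np.full(len(r), float(np.sum(r))) for r in step_rewards]

def centered_rewards(step_rewards):
    """Subtract an estimate of the expected step reward.

    Assumption: the expectation is approximated by the mean over all
    steps in the batch; the paper may estimate it differently.
    """
    mean_r = np.mean(np.concatenate(step_rewards))
    return [np.asarray(r, dtype=float) - mean_r for r in step_rewards]

def position_specific_advantages(step_rewards):
    """Centered reward-to-go minus a per-position batch baseline (assumed form)."""
    centered = centered_rewards(step_rewards)
    # Reward-to-go: sum of centered rewards from step t to the end of the path.
    rtg = [np.cumsum(c[::-1])[::-1] for c in centered]
    max_len = max(len(g) for g in rtg)
    # Baseline b_t: mean reward-to-go over the paths that reach position t.
    baseline = np.array(
        [np.mean([g[t] for g in rtg if len(g) > t]) for t in range(max_len)]
    )
    return [g - baseline[: len(g)] for g in rtg]

# Toy batch: three paths of different lengths, all step rewards positive.
paths = [[0.6, 0.4], [0.5, 0.5, 0.5, 0.5], [0.7, 0.3, 0.5]]
naive = naive_weights(paths)
centered = centered_rewards(paths)
adv = position_specific_advantages(paths)
```

In this toy batch the naive weights grow with path length simply because every step reward is positive, which is the length-dependent bias the abstract describes. After centering, the total reward of each path no longer scales with its length, and the per-position baseline makes the advantages at each position average to zero across the batch.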