

Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

October 23, 2025
Authors: Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
cs.AI

Abstract

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful black-box LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit, instead of the token (too fine-grained) or the sequence (too coarse), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
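
To make the rank-level importance ratio concrete, the sketch below shows one way it could be computed: for each rank position in the generated list, the ratio is the geometric mean of the per-token probability ratios over that rank's tokens, and a PPO-style clipped surrogate is then applied at the rank level. This is a minimal PyTorch illustration, not the authors' released implementation; the function names (rank_importance_ratios, rank_grpo_surrogate), the clipping scheme, and how per-rank advantages are obtained are assumptions for illustration only.

```python
import torch


def rank_importance_ratios(logp_new, logp_old, rank_spans):
    """Per-rank importance ratios as geometric means of token-probability ratios.

    logp_new, logp_old: 1-D tensors of per-token log-probabilities of one
        generated recommendation list under the current and old policies.
    rank_spans: list of (start, end) token-index pairs, one per rank position
        in the list (end exclusive).
    """
    ratios = []
    for start, end in rank_spans:
        # Geometric mean of p_new / p_old over the rank's tokens
        # equals exp of the mean per-token log-ratio.
        log_ratio = (logp_new[start:end] - logp_old[start:end]).mean()
        ratios.append(torch.exp(log_ratio))
    return torch.stack(ratios)


def rank_grpo_surrogate(ratios, advantages, clip_eps=0.2):
    """Clipped surrogate objective applied per rank position (illustrative).

    advantages: per-rank advantages, e.g. rank-wise rewards normalized over a
        group of lists sampled for the same conversation (an assumption here,
        not necessarily the paper's exact normalization).
    """
    unclipped = ratios * advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # Negative sign: minimize the loss to maximize the surrogate objective.
    return -torch.min(unclipped, clipped).mean()
```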