Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning
October 23, 2025
Authors: Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
cs.AI
Abstract
Large language models (LLMs) are reshaping the recommender system paradigm by
enabling users to express preferences and receive recommendations through
conversations. Yet, aligning LLMs to the recommendation task remains
challenging: pretrained LLMs often generate out-of-catalog items, violate
required output formats, and exhibit ranking quality that degrades sharply
toward the end of the generated list. To address this, we propose ConvRec-R1, a two-stage
framework for end-to-end training of LLM-based conversational recommender
systems. In Stage 1, we construct a behavioral-cloning dataset with a
Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded
demonstrations from powerful black-box LLMs to warm-start the RL training. In
Stage 2, we propose Rank-GRPO, a principled extension of group relative policy
optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats
each rank in the recommendation list as the unit of optimization, rather than the
token (too fine-grained) or the full sequence (too coarse), redefining rewards to
remove non-causal credit assignment and introducing a rank-level importance ratio,
based on the geometric mean of rank-wise token probabilities, to stabilize policy updates.
Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges
faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and
datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
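Below is a minimal, illustrative sketch (not the authors' released code) of the rank-level importance ratio described in the abstract: each rank's ratio is taken as the geometric mean of the new-to-old token probability ratios over the tokens that form that rank's item, and a GRPO-style clipped surrogate is then applied per rank rather than per token or per sequence. The function names, tensor shapes, clipping constant, and toy advantage values are assumptions made for illustration only.

    # Hypothetical sketch of a rank-level importance ratio and clipped objective.
    # Not the authors' implementation; shapes and constants are illustrative.
    import torch

    def rank_importance_ratios(logp_new, logp_old, rank_ids, num_ranks):
        """Geometric-mean importance ratio per rank.

        logp_new, logp_old: (T,) log-probs of generated tokens under the current
            and old policies.
        rank_ids: (T,) integer rank index (0..num_ranks-1) of each token.
        """
        ratios = torch.zeros(num_ranks)
        for r in range(num_ranks):
            mask = rank_ids == r
            n = mask.sum().clamp(min=1)
            # Geometric mean of token-level ratios = exp(mean log-ratio).
            ratios[r] = torch.exp((logp_new[mask] - logp_old[mask]).sum() / n)
        return ratios

    def rank_grpo_loss(ratios, advantages, clip_eps=0.2):
        """PPO/GRPO-style clipped surrogate applied at the rank level."""
        clipped = torch.clamp(ratios, 1 - clip_eps, 1 + clip_eps)
        return -torch.mean(torch.minimum(ratios * advantages, clipped * advantages))

    # Toy usage: a 3-item recommendation list whose items span 2, 3, and 1 tokens.
    logp_old = torch.log(torch.tensor([0.5, 0.4, 0.6, 0.3, 0.5, 0.7]))
    logp_new = torch.log(torch.tensor([0.6, 0.5, 0.5, 0.3, 0.6, 0.6]))
    rank_ids = torch.tensor([0, 0, 1, 1, 1, 2])
    adv = torch.tensor([1.0, -0.5, 0.2])  # per-rank advantages (illustrative)

    ratios = rank_importance_ratios(logp_new, logp_old, rank_ids, num_ranks=3)
    print(rank_grpo_loss(ratios, adv))

Compared with sequence-level updates, this keeps the policy update at the same granularity at which the recommendation reward is naturally defined, namely whether the item placed at a given rank is relevant.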