Rank-GRPO: Training LLM-based Conversational Recommender Systems with Reinforcement Learning

October 23, 2025
Authors: Yaochen Zhu, Harald Steck, Dawen Liang, Yinhan He, Jundong Li, Nathan Kallus
cs.AI

Abstract

Large language models (LLMs) are reshaping the recommender system paradigm by enabling users to express preferences and receive recommendations through conversations. Yet, aligning LLMs to the recommendation task remains challenging: pretrained LLMs often generate out-of-catalog items, violate required output formats, and their ranking quality degrades sharply toward the end of the generated list. To this end, we propose ConvRec-R1, a two-stage framework for end-to-end training of LLM-based conversational recommender systems. In Stage 1, we construct a behavioral-cloning dataset with a Remap-Reflect-Adjust pipeline, which produces high-quality, catalog-grounded demonstrations from powerful black-box LLMs to warm-start the RL training. In Stage 2, we propose Rank-GRPO, a principled extension of group relative policy optimization (GRPO) tailored to tasks with rank-style outputs. Rank-GRPO treats each rank in the recommendation list as the unit, instead of the token (too fine-grained) or the sequence (too coarse-grained), redefining rewards to remove non-causal credit assignment and introducing a rank-level importance ratio based on the geometric mean of rank-wise token probabilities to stabilize policy updates. Experiments on the public Reddit-v2 dataset show that ConvRec-R1 converges faster and achieves higher Recall and NDCG than GRPO-style baselines. Code and datasets are released at https://github.com/yaochenzhu/Rank-GRPO.
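To make the rank-level importance ratio concrete, below is a minimal PyTorch sketch based only on the abstract's description: it segments the generated list by rank position, takes the geometric mean of token-level probability ratios within each rank, and plugs the result into a GRPO-style clipped objective with per-rank advantages. The function names, tensor shapes, rank segmentation, and clipping constant are illustrative assumptions, not the authors' released implementation (see the repository linked above for the official code).

```python
# Illustrative sketch of a rank-level importance ratio, assuming per-token
# log-probabilities and a token-to-rank mapping are already available.
# Shapes, helper names, and the clip constant are assumptions for exposition.
import torch


def rank_importance_ratios(logp_new, logp_old, rank_ids, num_ranks):
    """Geometric-mean importance ratio per rank position.

    logp_new, logp_old: (T,) per-token log-probs under the current / old policy.
    rank_ids: (T,) long tensor mapping each token to the rank (0..num_ranks-1)
              of the recommended item it belongs to.
    Returns: (num_ranks,) ratios r_k = exp(mean over tokens in rank k of
             (logp_new - logp_old)), i.e. the geometric mean of the token-level
             probability ratios within each rank.
    """
    delta = logp_new - logp_old                                   # per-token log-ratio
    sums = torch.zeros(num_ranks).index_add_(0, rank_ids, delta)  # sum log-ratios per rank
    counts = torch.zeros(num_ranks).index_add_(0, rank_ids, torch.ones_like(delta))
    return torch.exp(sums / counts.clamp(min=1.0))                # geometric mean per rank


def rank_grpo_loss(logp_new, logp_old, rank_ids, rank_advantages, clip_eps=0.2):
    """GRPO-style clipped surrogate applied at the rank level (sketch only)."""
    ratios = rank_importance_ratios(logp_new, logp_old, rank_ids, len(rank_advantages))
    unclipped = ratios * rank_advantages
    clipped = torch.clamp(ratios, 1.0 - clip_eps, 1.0 + clip_eps) * rank_advantages
    return -torch.min(unclipped, clipped).mean()


if __name__ == "__main__":
    # Toy usage: 6 tokens forming 3 ranks (2 tokens each), with rank-level advantages.
    logp_new = torch.log(torch.tensor([0.5, 0.4, 0.3, 0.6, 0.2, 0.7]))
    logp_old = torch.log(torch.tensor([0.4, 0.4, 0.3, 0.5, 0.3, 0.6]))
    rank_ids = torch.tensor([0, 0, 1, 1, 2, 2])
    rank_adv = torch.tensor([1.0, 0.2, -0.5])
    print(rank_grpo_loss(logp_new, logp_old, rank_ids, rank_adv))
```

Using the rank, rather than the token or the whole sequence, as the unit means each recommended item receives its own advantage and its own (clipped) update weight, which is how the abstract motivates removing non-causal credit assignment across list positions.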