UserRL: Training Interactive User-Centric Agent via Reinforcement Learning
September 24, 2025
作者: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI
Abstract
Reinforcement learning (RL) has shown promise in training agentic models that
move beyond static benchmarks to engage in dynamic, multi-turn interactions.
Yet, the ultimate value of such agents lies in their ability to assist users, a
setting where diversity and dynamics of user interaction pose challenges. In
this work, we propose UserRL, a unified framework for training and evaluating
user-centric abilities through standardized gym environments paired with
simulated users. We systematically vary turn-level reward assignment and
trajectory-level score calculation to analyze how different formulations affect
learning under the GRPO algorithm. Our experiments across Qwen3 models reveal
three key findings: (i) SFT cold start is critical for unlocking initial
interaction ability and enabling sustained RL improvements; (ii) deliberate
trajectory scoring yields more efficient and effective multi-turn interactions;
and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training,
open-source simulators (e.g., Qwen3-32B) remain a cost-effective and
transferable option. Together, these results highlight that careful design of
reward shaping and user simulation choice is as crucial as model scale, and
establish UserRL as a practical pathway for developing robust user-centric
agentic models. All code and data are publicly available for future research.
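
The abstract describes gym-style environments paired with simulated users, where turn-level rewards are aggregated into a trajectory-level score that drives GRPO-style training. The sketch below illustrates that general loop under stated assumptions: the names (`Turn`, `Trajectory`, `rollout`, `grpo_advantages`), the discounted-sum scoring, and the mean/std group normalization are illustrative placeholders, not the paper's actual UserRL implementation.

```python
# Minimal sketch of a user-centric multi-turn RL loop in the spirit of the
# setup the abstract describes: an environment paired with a simulated user,
# turn-level reward assignment, trajectory-level score calculation, and
# group-relative (GRPO-style) advantage normalization. All names and the
# specific scoring choices here are assumptions for illustration only.
from dataclasses import dataclass, field
from typing import Callable, List
import statistics

@dataclass
class Turn:
    agent_msg: str
    user_msg: str
    reward: float  # turn-level reward assigned by the environment

@dataclass
class Trajectory:
    turns: List[Turn] = field(default_factory=list)

    def score(self, gamma: float = 1.0) -> float:
        # Trajectory-level score: here a (optionally discounted) sum of
        # turn-level rewards. The paper systematically varies this
        # formulation; a plain sum is just one of the simplest choices.
        return sum((gamma ** t) * turn.reward
                   for t, turn in enumerate(self.turns))

def rollout(policy: Callable[[List[Turn]], str],
            simulated_user: Callable[[str], str],
            judge: Callable[[str, str], float],
            max_turns: int = 8) -> Trajectory:
    """Run one episode: the policy speaks, the simulated user (e.g., a
    GPT-4o- or Qwen3-32B-backed simulator) replies, and a judge assigns
    a turn-level reward for that exchange."""
    traj = Trajectory()
    for _ in range(max_turns):
        agent_msg = policy(traj.turns)
        user_msg = simulated_user(agent_msg)
        reward = judge(agent_msg, user_msg)
        traj.turns.append(Turn(agent_msg, user_msg, reward))
    return traj

def grpo_advantages(scores: List[float]) -> List[float]:
    # GRPO computes group-relative advantages: each sampled trajectory's
    # score is normalized against the mean and std of its group.
    mean = statistics.fmean(scores)
    std = statistics.pstdev(scores) or 1.0
    return [(s - mean) / std for s in scores]
```

As a usage sketch, one would sample a group of trajectories per prompt with `rollout`, compute each `Trajectory.score()`, and feed the resulting `grpo_advantages` back into the policy update; the paper's findings concern precisely how the `judge` (turn-level rewards), the `score` formulation, and the choice of `simulated_user` shape that training signal.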