UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

September 24, 2025
作者: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet the ultimate value of such agents lies in their ability to assist users, a setting where the diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) an SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user-simulator choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are publicly available for future research.
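To make the two reward granularities concrete, the sketch below shows one plausible way to aggregate turn-level rewards into a trajectory-level score and then into GRPO-style group-relative advantages. This is a minimal illustration under assumed formulas (discounted-sum scoring, mean/std normalization over a rollout group); the names `trajectory_score` and `grpo_advantages` and all numbers are hypothetical, not UserRL's actual implementation.

```python
# Illustrative sketch of turn-level vs. trajectory-level rewards under GRPO.
# Formulas and names are assumptions, not the paper's implementation.
from statistics import mean, pstdev
from typing import List

def trajectory_score(turn_rewards: List[float], gamma: float = 0.9) -> float:
    """One possible trajectory-level score: a discounted sum of turn-level
    rewards, so progress made in earlier turns is weighted more heavily."""
    return sum(r * gamma**t for t, r in enumerate(turn_rewards))

def grpo_advantages(group_scores: List[float]) -> List[float]:
    """GRPO computes advantages relative to a group of rollouts sampled for
    the same task: (score - group mean) / group std."""
    mu = mean(group_scores)
    sigma = pstdev(group_scores) or 1.0  # guard against a zero-variance group
    return [(s - mu) / sigma for s in group_scores]

# Example: four sampled multi-turn rollouts for one simulated-user task,
# each list holding per-turn rewards from the environment.
rollouts = [
    [0.0, 0.5, 1.0],  # slow start, strong finish
    [1.0, 0.0, 0.0],  # early success only
    [0.2, 0.2, 0.2],  # flat progress
    [0.0, 0.0, 0.0],  # no progress
]
scores = [trajectory_score(r) for r in rollouts]
print(grpo_advantages(scores))
```

Varying `gamma` (or replacing the discounted sum with, say, a final-turn or max-turn score) changes which interaction behaviors the group-relative advantages reward, which is the kind of trajectory-scoring design choice the abstract's finding (ii) refers to.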