UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

September 24, 2025
作者: Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, Huan Wang
cs.AI

Abstract

Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet the ultimate value of such agents lies in their ability to assist users, a setting where the diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) an SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that careful design of reward shaping and user-simulator choice is as crucial as model scale, and establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are publicly available for future research.
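To make the two reward granularities concrete, the sketch below shows one plausible way to aggregate turn-level rewards into a trajectory-level score and then into GRPO-style group-relative advantages. This is a minimal illustration under assumed formulas (discounted-sum scoring, mean/std normalization over a rollout group); the names `trajectory_score` and `grpo_advantages` and all numbers are hypothetical, not UserRL's actual implementation.

```python
# Illustrative sketch of turn-level vs. trajectory-level rewards under GRPO.
# Formulas and names are assumptions, not the paper's implementation.
from statistics import mean, pstdev
from typing import List

def trajectory_score(turn_rewards: List[float], gamma: float = 0.9) -> float:
    """One possible trajectory-level score: a discounted sum of turn-level
    rewards, so progress made in earlier turns is weighted more heavily."""
    return sum(r * gamma**t for t, r in enumerate(turn_rewards))

def grpo_advantages(group_scores: List[float]) -> List[float]:
    """GRPO computes advantages relative to a group of rollouts sampled for
    the same task: (score - group mean) / group std."""
    mu = mean(group_scores)
    sigma = pstdev(group_scores) or 1.0  # guard against a zero-variance group
    return [(s - mu) / sigma for s in group_scores]

# Example: four sampled multi-turn rollouts for one simulated-user task,
# each list holding per-turn rewards from the environment.
rollouts = [
    [0.0, 0.5, 1.0],  # slow start, strong finish
    [1.0, 0.0, 0.0],  # early success only
    [0.2, 0.2, 0.2],  # flat progress
    [0.0, 0.0, 0.0],  # no progress
]
scores = [trajectory_score(r) for r in rollouts]
print(grpo_advantages(scores))
```

Varying `gamma` (or replacing the discounted sum with, say, a final-turn or max-turn score) changes which interaction behaviors the group-relative advantages reward, which is the kind of trajectory-scoring design choice the abstract's finding (ii) refers to.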