UserRL: Training Interactive User-Centric Agent via Reinforcement Learning

Abstract

Reinforcement learning (RL) has shown promise in training agentic models that move beyond static benchmarks to engage in dynamic, multi-turn interactions. Yet the ultimate value of such agents lies in their ability to assist users, a setting where the diversity and dynamics of user interaction pose challenges. In this work, we propose UserRL, a unified framework for training and evaluating user-centric abilities through standardized gym environments paired with simulated users. We systematically vary turn-level reward assignment and trajectory-level score calculation to analyze how different formulations affect learning under the GRPO algorithm. Our experiments across Qwen3 models reveal three key findings: (i) SFT cold start is critical for unlocking initial interaction ability and enabling sustained RL improvements; (ii) deliberate trajectory scoring yields more efficient and effective multi-turn interactions; and (iii) while stronger simulated users (e.g., GPT-4o) facilitate training, open-source simulators (e.g., Qwen3-32B) remain a cost-effective and transferable option. Together, these results highlight that the careful design of reward shaping and the choice of user simulator are as crucial as model scale, and they establish UserRL as a practical pathway for developing robust user-centric agentic models. All code and data are publicly available for future research.
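To make the distinction between turn-level rewards and trajectory-level scores concrete, the following is a minimal sketch and not the authors' implementation. It assumes one possible trajectory score (a discounted sum of per-turn rewards) and applies the group-relative normalization that GRPO is known for. The function names, the discount value, and the example rewards are all hypothetical; the abstract does not specify which formulations UserRL actually compares.

```python
from statistics import mean, pstdev


def trajectory_score(turn_rewards, gamma=1.0):
    """Collapse per-turn rewards into one trajectory-level score.

    Assumption: a discounted sum, so that reward earned in an earlier
    turn counts more when gamma < 1. Other formulations are possible.
    """
    return sum((gamma ** t) * r for t, r in enumerate(turn_rewards))


def group_relative_advantages(scores, eps=1e-8):
    """GRPO-style advantage: standardize each score within its rollout group.

    Population standard deviation is used here as an illustrative choice;
    implementations differ on this detail.
    """
    mu = mean(scores)
    sigma = pstdev(scores)
    return [(s - mu) / (sigma + eps) for s in scores]


# Hypothetical group of three rollouts for the same task.
# Each inner list holds the reward assigned at each turn.
group = [
    [0.0, 0.0, 1.0],  # succeeds on the third turn
    [1.0],            # succeeds on the first turn
    [0.0, 0.0, 0.0],  # never succeeds
]

scores = [trajectory_score(rewards, gamma=0.9) for rewards in group]
advantages = group_relative_advantages(scores)
```

With a discount below 1, the rollout that succeeds in one turn scores higher than the one that needs three, which is one simple way a trajectory-level score can favor more efficient interactions.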
