RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Abstract

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of Popoviciu's upper bound on variance. Consequently, samples near the agent's capability boundary, where successes and failures are roughly balanced, contribute disproportionately large policy gradients. As training progresses, this boundary shifts continuously, gradually depleting the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to resolve this depletion. RODS closes the loop between RL training and data generation by repurposing the progress reward variance as a practical, zero-cost boundary detector that requires no inference beyond the rollouts already computed for training. It continuously identifies such boundary samples, synthesizes new multi-turn variants that match their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS achieves performance comparable to that of a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and it improves over fixed-data RL and environment augmentation in our controlled setting.
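To make the variance-based boundary detector concrete, the following is a minimal sketch of the idea, not the RODS implementation. It assumes that each task has a group of per-rollout rewards in [0, 1] and that a task counts as a boundary sample when its reward variance reaches a chosen fraction of the Popoviciu bound (high - low)^2 / 4. The function names, the `min_fraction` threshold, and the toy reward values are all hypothetical and do not come from the abstract.

```python
from statistics import pvariance


def popoviciu_bound(low: float, high: float) -> float:
    """Largest possible variance of any variable confined to [low, high]."""
    return (high - low) ** 2 / 4.0


def reward_variance(rewards: list[float]) -> float:
    """Population variance of the rewards from one task's rollout group."""
    return pvariance(rewards)


def select_boundary_tasks(
    rollout_rewards: dict[str, list[float]],
    low: float = 0.0,
    high: float = 1.0,
    min_fraction: float = 0.5,  # hypothetical threshold, not from the paper
) -> list[str]:
    """Return tasks whose variance is at least min_fraction of the bound,
    sorted from highest to lowest variance."""
    threshold = min_fraction * popoviciu_bound(low, high)
    scored = {task: reward_variance(r) for task, r in rollout_rewards.items()}
    selected = [task for task, var in scored.items() if var >= threshold]
    return sorted(selected, key=lambda task: -scored[task])


# Toy rollout groups (8 rollouts per task); the values are illustrative only.
rollouts = {
    "mastered": [1.0] * 8,                     # always solved: zero variance
    "boundary": [1.0, 0.0] * 4,                # half solved: variance hits the bound
    "mostly_solved": [1.0] * 7 + [0.0],        # rarely fails: low variance
    "too_hard": [0.0] * 8,                     # never solved: zero variance
}

boundary_tasks = select_boundary_tasks(rollouts)
print(boundary_tasks)
```

In this sketch the detector reuses rewards that training already produced, so it adds no rollouts of its own. Tasks that are always solved or never solved have zero variance and are skipped, while an evenly split task reaches the bound exactly.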