RODS: Reward-driven Online Data Synthesis for Multi-turn Tool-use Agents

Abstract

Reinforcement learning (RL) for multi-turn tool use is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, a consequence of Popoviciu's upper bound on the variance of a bounded reward. Consequently, samples near the agent's capability boundary, where successes and failures are roughly balanced, contribute disproportionately large policy gradients. As training progresses, this boundary shifts continuously and gradually depletes the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to address this depletion. RODS closes the loop between RL training and data generation by repurposing the variance of the progress reward as a practical, zero-cost boundary detector: it requires no inference beyond the rollouts already computed for training. RODS continuously identifies boundary samples, synthesizes new multi-turn variants that match their structural complexity (e.g., API topology and dependency depth) through a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human-written seed samples and maintaining an active training pool of about 800 samples, RODS achieves performance comparable to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and it outperforms fixed-data RL and environment augmentation in our controlled setting.
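To illustrate the boundary-detection idea, the following is a minimal sketch, not the authors' implementation. It assumes progress rewards lie in [0, 1], so Popoviciu's inequality caps the per-task variance at (1 - 0)^2 / 4 = 0.25, a cap reached when half the rollouts score 0 and half score 1. The function names, the example task names, and the 0.5 threshold are hypothetical choices made for illustration only.

```python
from statistics import pvariance


def popoviciu_bound(low: float, high: float) -> float:
    """Largest possible variance of a reward confined to [low, high]."""
    return (high - low) ** 2 / 4.0


def boundary_score(rewards, low=0.0, high=1.0):
    """Rollout reward variance, normalized to [0, 1] by the Popoviciu bound.

    A score near 1 means successes and failures are roughly balanced;
    a score of 0 means every rollout received the same reward.
    """
    if len(rewards) < 2:
        return 0.0
    return pvariance(rewards) / popoviciu_bound(low, high)


def select_boundary_tasks(rollout_rewards, threshold=0.5):
    """Return task ids whose normalized variance meets the threshold,
    highest score first. The threshold is a hypothetical value."""
    scores = {task: boundary_score(r) for task, r in rollout_rewards.items()}
    kept = [task for task, score in scores.items() if score >= threshold]
    return sorted(kept, key=lambda task: -scores[task])


# Rewards reused from rollouts that training already produced (toy values).
rollouts = {
    "mastered": [1.0] * 8,          # always solved: no variance
    "boundary": [1.0, 0.0] * 4,     # balanced: maximal variance
    "too_hard": [0.0] * 7 + [1.0],  # rarely solved: low variance
}

selected = select_boundary_tasks(rollouts)
```

In this toy input only the balanced task passes the threshold. Because the score is computed from rewards that the training rollouts already yield, it adds no inference cost, which is the property the abstract describes as zero-cost.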