RODS: Reward-Driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Abstract

Multi-turn tool-use reinforcement learning (RL) is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on the tasks with the highest rollout reward variance, and by Popoviciu's inequality this variance is largest when successes and failures are roughly balanced. Consequently, samples near the agent's capability boundary contribute disproportionately large policy gradients. As training progresses, this boundary shifts continuously, gradually depleting the pool of informative samples in a static dataset. To resolve this depletion, we propose RODS (Reward-driven Online Data Synthesis). RODS closes the loop between RL training and data generation by repurposing the variance of the progress reward as a practical, zero-cost boundary detector that requires no inference beyond the rollouts already computed for training. It continuously identifies boundary samples, synthesizes new multi-turn variants that match their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and manages a dynamic replay buffer that co-evolves with the policy. Starting from 400 human-written seeds and maintaining an active training pool of about 800 samples, RODS performs comparably to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and it improves over fixed-data RL and environment augmentation in our controlled setting.
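
To make the boundary-detection idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes each task has a group of rollout rewards in [0, 1], and it flags tasks whose reward variance is high. The function names, the threshold `min_variance`, and the toy rewards are hypothetical and chosen only for illustration. For rewards in [0, 1], Popoviciu's inequality caps the variance at (1 - 0)^2 / 4 = 0.25, which is reached when half the rollouts score 0 and half score 1.

```python
from statistics import pvariance

# Popoviciu's inequality: Var <= (max - min)^2 / 4, so 0.25 for rewards in [0, 1].
POPOVICIU_BOUND = (1.0 - 0.0) ** 2 / 4


def reward_variance(rewards):
    """Population variance of one task's rollout rewards."""
    return pvariance(rewards)


def select_boundary_tasks(rollout_rewards, min_variance=0.1):
    """Return task ids with variance >= min_variance, highest variance first.

    rollout_rewards: dict mapping task id -> list of rewards in [0, 1].
    min_variance: hypothetical threshold, not a value taken from the paper.
    """
    scored = {task: reward_variance(r) for task, r in rollout_rewards.items()}
    kept = [task for task, var in scored.items() if var >= min_variance]
    return sorted(kept, key=lambda task: -scored[task])


if __name__ == "__main__":
    # Toy rewards reused from rollouts a trainer would already have computed.
    rewards = {
        "always_solved": [1.0, 1.0, 1.0, 1.0],  # variance 0: no learning signal
        "never_solved": [0.0, 0.0, 0.0, 0.0],   # variance 0: no learning signal
        "balanced": [1.0, 0.0, 1.0, 0.0],       # variance at the 0.25 bound
        "mostly_solved": [1.0, 1.0, 1.0, 0.0],
    }
    print(select_boundary_tasks(rewards))
```

Because the variances are computed from rewards that training already produced, a selection step of this kind needs no additional rollouts, which is the sense in which the abstract calls the detector zero-cost.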