RODS: Reward-driven Online Data Synthesis for Multi-Turn Tool-Use Agents

Abstract

Multi-turn tool-use RL is bottlenecked by the rapid depletion of informative samples in static datasets. We observe that the gradient signal in GRPO concentrates on tasks with the highest rollout reward variance, and by Popoviciu's inequality the variance of a bounded reward is largest when outcomes are split evenly between the two extremes. Consequently, samples near the agent's capability boundary, where successes and failures are roughly balanced, contribute disproportionately large policy gradients. As training progresses, this boundary shifts continuously, gradually depleting the pool of informative samples in a static dataset. We propose RODS (Reward-driven Online Data Synthesis) to address this depletion. RODS closes the loop between RL training and data generation by repurposing the variance of the progress reward as a practical, zero-cost boundary detector: it requires no inference beyond the rollouts already computed for training. RODS continuously identifies boundary samples, synthesizes new multi-turn variants that match their structural complexity (e.g., API topology and dependency depth) via a skill-aligned resampling pipeline, and maintains a dynamic replay buffer that co-evolves with the policy. Starting from 400 human seeds and maintaining an active training pool of ~800 samples, RODS performs comparably to a 17K-sample offline pipeline while requiring roughly 20x fewer trajectories, and improves over fixed-data RL and environment augmentation in our controlled setting.
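To make the boundary-detection idea concrete, the following is a minimal sketch of selecting boundary samples by rollout reward variance. It is not the RODS implementation. It assumes rewards lie in [0, 1], and the names `popoviciu_bound`, `boundary_tasks`, and `min_fraction`, along with the threshold rule itself, are hypothetical choices made for illustration.

```python
from statistics import pvariance


def popoviciu_bound(low: float, high: float) -> float:
    """Popoviciu's inequality: Var(X) <= (high - low)^2 / 4 for low <= X <= high."""
    return (high - low) ** 2 / 4.0


def boundary_tasks(rollout_rewards, low=0.0, high=1.0, min_fraction=0.5):
    """Return task ids whose rollout reward variance reaches at least
    min_fraction of the Popoviciu bound, highest variance first.

    rollout_rewards maps a task id to the rewards of its rollout group.
    The threshold rule is an assumption of this sketch, not the paper's.
    """
    bound = popoviciu_bound(low, high)
    scored = []
    for task_id, rewards in rollout_rewards.items():
        if len(rewards) < 2:
            continue  # a single rollout carries no variance signal
        variance = pvariance(rewards)
        if variance >= min_fraction * bound:
            scored.append((variance, task_id))
    # Sort by variance descending, breaking ties by task id for determinism.
    scored.sort(key=lambda item: (-item[0], item[1]))
    return [task_id for _, task_id in scored]


# Hypothetical rollout groups: four rollouts per task, binary rewards.
rollouts = {
    "always_solved": [1.0, 1.0, 1.0, 1.0],
    "never_solved": [0.0, 0.0, 0.0, 0.0],
    "near_boundary": [1.0, 0.0, 1.0, 0.0],
    "mostly_solved": [1.0, 1.0, 1.0, 0.0],
}

print(boundary_tasks(rollouts))
print(boundary_tasks(rollouts, min_fraction=0.9))
```

In this toy input, the always-solved and never-solved tasks have zero reward variance, so their group-normalized advantages in GRPO are all zero and they contribute no gradient. The evenly split task attains the bound of 0.25 exactly. Because the rewards are those of rollouts already collected for training, the selection itself needs no additional inference.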