TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for improving reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast. This arises in two ways: overly simple or overly complex prompts yield low-variance feedback, and outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Prior work has focused on allocating the available rollout budget to promising prompts, but it exploits sample informativeness only at the prompt level and neglects how prefix-level informativeness varies across turns within the same rollout. This work targets multi-turn agentic RL. We model each ReAct-style thought-action-observation turn as a semantically distinct node, so that budget allocation extends from prompt roots to turn-level prefixes from which further continuations can be sampled, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to the prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared, generalizable predictor estimates the conditional success probability at these anchors from their prefix histories to guide the allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks; for example, it improves the average Multi-Hop QA accuracy of Qwen3-14B by 2.8 percentage points over competitive baselines at equal sampling cost.
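
To make the allocation idea concrete, the following is a minimal sketch and not the method as implemented in TRACE. It assumes that "most likely to yield mixed terminal rewards" can be scored by the Bernoulli variance p(1 - p) of a predicted success probability p, and that the budget is split in proportion to that score. The function name `allocate_rollouts`, the anchor labels, and the probabilities are hypothetical; a plain dictionary stands in for the learned predictor.

```python
from typing import Dict


def allocate_rollouts(success_prob: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` continuations across anchors in proportion to p * (1 - p).

    Assumption for illustration: p * (1 - p) is used as a proxy for how
    likely an anchor is to produce mixed (both success and failure) outcomes.
    """
    if budget < 0:
        raise ValueError("budget must be non-negative")
    if not success_prob:
        return {}

    # Bernoulli variance peaks at p = 0.5 and vanishes at p = 0 or p = 1.
    scores = {a: p * (1.0 - p) for a, p in success_prob.items()}
    total = sum(scores.values())
    if total == 0.0:
        # Every anchor is predicted to be deterministic: fall back to an even split.
        scores = {a: 1.0 for a in success_prob}
        total = float(len(scores))

    # Largest-remainder rounding so the integer counts sum exactly to `budget`.
    shares = {a: budget * s / total for a, s in scores.items()}
    alloc = {a: int(share) for a, share in shares.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda a: shares[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical predicted success probabilities for one prompt root and two prefixes.
    predicted = {"root": 0.5, "turn2": 0.9, "turn3": 1.0}
    print(allocate_rollouts(predicted, budget=8))
```

In this sketch an anchor whose outcome is predicted to be certain (such as `turn3`) receives no additional continuations, while the anchor closest to an even success rate receives most of the budget, which is the qualitative behavior the abstract describes.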