TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
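
The allocation idea described above can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not specify a scoring rule, so the sketch assumes that each anchor's terminal reward is a Bernoulli variable with the predicted success probability p, scores anchors by the Bernoulli variance p(1 - p), and splits a fixed budget in proportion to that score. The names `mixed_outcome_prob`, `allocate_budget`, and the anchor labels are hypothetical.

```python
from typing import Dict


def mixed_outcome_prob(p: float, n: int) -> float:
    """Probability that n i.i.d. Bernoulli(p) rollouts contain at least
    one success and at least one failure (i.e. a mixed reward group)."""
    if n < 2:
        return 0.0
    return 1.0 - p ** n - (1.0 - p) ** n


def allocate_budget(success_prob: Dict[str, float], budget: int) -> Dict[str, int]:
    """Split `budget` rollouts across anchors in proportion to p * (1 - p).

    ASSUMPTION: the p * (1 - p) score and proportional split are illustrative
    choices, not the rule used by TRACE. `success_prob` stands in for the
    output of a success predictor evaluated at prompt roots and prefixes.
    """
    scores = {a: p * (1.0 - p) for a, p in success_prob.items()}
    total = sum(scores.values())
    if total == 0.0:
        # Every anchor is predicted to be deterministic: fall back to uniform.
        scores = {a: 1.0 for a in success_prob}
        total = float(len(scores))

    # Fractional shares, then integer parts plus largest-remainder rounding
    # so that the allocation sums exactly to the budget.
    shares = {a: budget * s / total for a, s in scores.items()}
    alloc = {a: int(share) for a, share in shares.items()}
    leftover = budget - sum(alloc.values())
    by_remainder = sorted(shares, key=lambda a: shares[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1
    return alloc


if __name__ == "__main__":
    # Hypothetical predicted success probabilities at one root and three prefixes.
    anchors = {"root": 0.5, "prefix_a": 0.95, "prefix_b": 0.05, "prefix_c": 1.0}
    print(allocate_budget(anchors, budget=16))
    print(mixed_outcome_prob(0.5, 2))
```

Under these assumptions, anchors whose predicted success probability is near 0 or 1 receive few or no rollouts, because additional samples there are unlikely to produce the mixed terminal rewards that give the policy update a contrastive signal.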