TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning
June 9, 2026
Authors: Heming Zou, Qi Wang, Yun Qu, Yuhang Jiang, Lizhou Cai, Yixiu Mao, Ru Peng, Xin Xu, Weijie Liu, Kai Yang, Saiyong Yang, Xiangyang Ji
cs.AI
Abstract
Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast, arising when overly simple or complex prompts generate low-variance feedback and when outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Past efforts have focused on allocating available rollout resources to promising prompts, yet they only leverage sample informativeness at the prompt level and neglect variation in prefix-level informativeness across turns within the same rollout. This work targets multi-turn agentic RL by modeling each ReAct-style thought-action-observation turn as a semantically distinct node, allowing budget allocation to extend from prompt roots to turn-level prefixes with further continuations, which naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to both prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared generalizable predictor estimates conditional success probability at these anchors from prefix histories to guide this allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks, e.g., improving Qwen3-14B Multi-Hop QA average accuracy by 2.8 points over competitive baselines at equal sampling cost.
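To make the allocation idea concrete, the following is a minimal sketch and not the authors' implementation. It assumes that each anchor (a prompt root or a turn-level prefix) already carries a predicted success probability, that terminal rewards are binary, and that every funded anchor receives the same number k of continuations. Under those assumptions, the chance that k independent rollouts from an anchor produce mixed rewards is 1 - p^k - (1 - p)^k, and a simple greedy rule funds the anchors where that chance is highest. The names `Anchor`, `mixed_reward_prob`, and `allocate` are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Anchor:
    """A prompt root or turn-level prefix (hypothetical container)."""
    anchor_id: str
    p_success: float  # assumed output of the shared predictor for this prefix


def mixed_reward_prob(p: float, k: int) -> float:
    """Probability that k i.i.d. Bernoulli(p) rollouts are not all equal."""
    return 1.0 - p ** k - (1.0 - p) ** k


def allocate(anchors, budget, k=2):
    """Greedily give k rollouts to the anchors most likely to yield mixed rewards.

    This greedy top-n rule is an illustrative assumption, not TRACE's actual rule.
    """
    if k < 2:
        raise ValueError("need at least 2 rollouts per anchor to observe contrast")
    ranked = sorted(
        anchors,
        key=lambda a: mixed_reward_prob(a.p_success, k),
        reverse=True,
    )
    n_funded = min(budget // k, len(ranked))
    return {a.anchor_id: k for a in ranked[:n_funded]}


if __name__ == "__main__":
    candidates = [
        Anchor("easy-root", 0.95),
        Anchor("hard-root", 0.05),
        Anchor("mid-prefix", 0.50),
        Anchor("root", 0.70),
    ]
    # With a budget of 4 rollouts and k=2, only two anchors can be funded.
    print(allocate(candidates, budget=4, k=2))
```

In this toy setup the near-certain anchors (p close to 0 or 1) are skipped because their rollouts would almost always share the same reward, which is the low-contrast situation the abstract describes.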