TRACE: A Unified Rollout Budget Allocation Framework for Efficient Agentic Reinforcement Learning

Abstract

Reinforcement learning with verifiable rewards (RLVR) is a promising approach for enhancing reasoning and agentic behavior in large language models. However, rollout-intensive policy optimization is often limited by insufficient reward contrast. This arises in two ways: prompts that are too simple or too complex produce low-variance feedback, and outcome-only rewards assign the same terminal assessment to every decision in a multi-turn rollout. Prior work has focused on allocating rollout resources to promising prompts, but it exploits sample informativeness only at the prompt level and neglects how prefix-level informativeness varies across turns within the same rollout. This work targets multi-turn agentic RL. By modeling each ReAct-style thought-action-observation turn as a semantically distinct node, we extend budget allocation from prompt roots to turn-level prefixes, which are then continued further; this naturally forms tree-structured rollouts. We introduce Tree Rollout Allocation for Contrastive Exploration (TRACE), a unified rollout allocation framework that enhances reward contrast within a fixed sampling budget. Technically, TRACE allocates rollout budget to the prompt roots and intermediate prefixes that are most likely to yield mixed terminal rewards. A shared, generalizable predictor estimates the conditional success probability at these anchors from prefix histories to guide the allocation. The resulting adaptive tree structure enriches outcome-only feedback and amplifies the policy-update signal. Empirically, TRACE achieves competitive performance and efficiency gains on typical agentic benchmarks; for example, it improves the average Multi-Hop QA accuracy of Qwen3-14B by 2.8 points over competitive baselines at equal sampling cost.
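To make the idea of "anchors most likely to yield mixed terminal rewards" concrete, the following is a minimal illustrative sketch, not TRACE's actual allocation rule, which the abstract does not specify. It assumes binary terminal rewards and independent continuations, so that an anchor with predicted success probability p yields both a success and a failure among k continuations with probability 1 - p^k - (1 - p)^k. The function names, the proportional split, and the hard-coded probabilities (standing in for the predictor's outputs) are all hypothetical.

```python
from typing import Dict


def mixed_outcome_probability(p_success: float, k: int) -> float:
    """Chance that k independent Bernoulli(p_success) rollouts contain
    at least one success and at least one failure (assumed scoring rule)."""
    return 1.0 - p_success ** k - (1.0 - p_success) ** k


def allocate_budget(p_success: Dict[str, float], budget: int, k: int = 2) -> Dict[str, int]:
    """Split an integer rollout budget across anchors in proportion to
    their mixed-outcome score, using largest-remainder rounding."""
    scores = {a: mixed_outcome_probability(p, k) for a, p in p_success.items()}
    total = sum(scores.values())
    if total == 0.0:
        # Every anchor is predicted to be deterministic: fall back to a uniform split.
        scores = {a: 1.0 for a in p_success}
        total = float(len(scores))
    ideal = {a: budget * s / total for a, s in scores.items()}
    alloc = {a: int(v) for a, v in ideal.items()}
    leftover = budget - sum(alloc.values())
    # Give the remaining rollouts to the anchors with the largest fractional parts.
    by_remainder = sorted(ideal, key=lambda a: ideal[a] - alloc[a], reverse=True)
    for a in by_remainder[:leftover]:
        alloc[a] += 1
    return alloc


# Hypothetical predictor outputs for a prompt root and two turn-level prefixes.
predicted = {"root": 0.95, "turn2": 0.5, "turn4": 0.05}
print(allocate_budget(predicted, budget=16))
```

Under these assumptions, the prefix with predicted success near 0.5 receives most of the budget, while anchors that are almost certainly solved or almost certainly failed receive little, since further continuations from them would rarely produce contrasting rewards.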