Tree Search for LLM Agent Reinforcement Learning

Abstract

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search in which each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that tree-structured trajectories naturally allow step-wise process supervision signals to be constructed even when only the outcome reward is used. Building on this, Tree-GRPO estimates grouped relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we show that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL method over the chain-based RL method.
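The two-level advantage estimate described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation. It assumes GRPO-style normalization (reward minus group mean, divided by group standard deviation), treats all leaves of one tree as the intra-tree group, treats all leaves across trees as the inter-tree group, and combines the two levels by a simple sum. The names `group_relative` and `tree_grpo_advantages`, the `eps` constant, and the combination rule are hypothetical choices made for illustration only.

```python
from statistics import mean, pstdev


def group_relative(rewards, eps=1e-6):
    """GRPO-style normalization (assumed): (r - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_grpo_advantages(trees):
    """Combine intra-tree and inter-tree group-relative advantages.

    trees[i][j] is the outcome reward of leaf j in tree i. The result has
    the same nested shape, with one advantage value per leaf.
    """
    # Inter-tree level: compare every leaf against all leaves of all trees.
    flat = [r for tree in trees for r in tree]
    inter_flat = group_relative(flat)

    advantages, offset = [], 0
    for tree in trees:
        # Intra-tree level: compare leaves that share the same tree root.
        intra = group_relative(tree)
        inter = inter_flat[offset:offset + len(tree)]
        offset += len(tree)
        # Assumption: the two levels are combined by a simple sum.
        advantages.append([a + b for a, b in zip(intra, inter)])
    return advantages


if __name__ == "__main__":
    # Two hypothetical trees with two leaves each; rewards are 0 or 1.
    print(tree_grpo_advantages([[1.0, 0.0], [0.0, 0.0]]))
```

In this sketch, a tree whose leaves all receive the same reward contributes no intra-tree signal, so its leaves are ranked only by the inter-tree comparison.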
