Tree Search for LLM Agent Reinforcement Learning
September 25, 2025
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
cs.AI
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-horizon,
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from sparse supervision. To address this challenge, we propose
Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL
method based on tree search, where each tree node represents a complete agent
interaction step. By sharing common prefixes, tree-search sampling increases
the number of rollouts achievable within a fixed budget of tokens or tool
calls. Moreover, we find that the tree-structured trajectories naturally allow
step-wise process supervision signals to be constructed using only the outcome
reward. Building on this, Tree-GRPO estimates grouped relative advantages at
both the intra-tree and inter-tree levels. Through theoretical analysis, we
show that the intra-tree group relative policy optimization objective is
equivalent to that of step-level direct preference learning. Experiments
across 11 datasets and 3 types of QA tasks demonstrate the superiority of the
proposed tree-based RL over chain-based RL methods.
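
The abstract does not spell out how the two levels of grouped relative advantages are computed or combined. The Python sketch below is only an illustration of the general idea, assuming each rollout tree for a prompt yields several leaf trajectories with scalar outcome rewards, that each level uses a GRPO-style normalization (reward minus group mean, divided by group standard deviation), and that the two levels are blended with fixed weights. The names grouped_advantage, tree_grpo_advantages, weight_intra, and weight_inter are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of intra-tree and
# inter-tree grouped relative advantage estimation from outcome rewards.
from statistics import mean, pstdev


def grouped_advantage(rewards, eps=1e-6):
    """GRPO-style relative advantage: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_grpo_advantages(trees, weight_intra=0.5, weight_inter=0.5):
    """Blend intra-tree and inter-tree relative advantages per leaf.

    `trees` is a list of rollout trees for one prompt; each tree is given
    here simply as a list of outcome rewards, one per leaf trajectory.
    """
    # Intra-tree level: each leaf is contrasted only with leaves of the
    # same tree, so trajectories sharing a prefix but diverging at some
    # step receive a step-wise comparison signal.
    intra = [grouped_advantage(tree) for tree in trees]

    # Inter-tree level: each leaf is contrasted with all leaves sampled
    # for the prompt, recovering the usual chain-level GRPO baseline.
    flat = [r for tree in trees for r in tree]
    inter_flat = grouped_advantage(flat)

    # Re-nest the inter-tree advantages and blend the two levels.
    out, k = [], 0
    for tree_intra in intra:
        row = []
        for a_intra in tree_intra:
            row.append(weight_intra * a_intra + weight_inter * inter_flat[k])
            k += 1
        out.append(row)
    return out


if __name__ == "__main__":
    # Two rollout trees for one question, with binary outcome rewards.
    print(tree_grpo_advantages([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]))
```

In this reading, the intra-tree term supplies the process-level contrast that the abstract attributes to the tree structure, while the inter-tree term preserves the standard outcome-level comparison across independent rollouts; the actual weighting and grouping scheme should be taken from the paper itself.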