Tree Search for LLM Agent Reinforcement Learning
September 25, 2025
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
cs.AI
Abstract
Recent advances in reinforcement learning (RL) have significantly enhanced
the agentic capabilities of large language models (LLMs). In long-horizon,
multi-turn agent tasks, existing approaches driven solely by outcome rewards
often suffer from sparse supervision. To address this challenge, we propose
Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL
method based on tree search, where each tree node represents a complete agent
interaction step. By sharing common prefixes, tree-search sampling increases
the number of rollouts achievable within a fixed budget of tokens or tool
calls. Moreover, we find that the tree-structured trajectories naturally allow
step-wise process supervision signals to be constructed using only the outcome
reward. Building on this, Tree-GRPO estimates grouped relative advantages at
both the intra-tree and inter-tree levels. Through theoretical analysis, we
show that the intra-tree group relative policy optimization objective is
equivalent to that of step-level direct preference learning. Experiments
across 11 datasets and 3 types of QA tasks demonstrate the superiority of the
proposed tree-based RL over chain-based RL methods.
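
The abstract does not spell out how the two levels of grouped relative advantages are computed or combined. The Python sketch below is only an illustration of the general idea, assuming each rollout tree for a prompt yields several leaf trajectories with scalar outcome rewards, that each level uses a GRPO-style normalization (reward minus group mean, divided by group standard deviation), and that the two levels are blended with fixed weights. The names grouped_advantage, tree_grpo_advantages, weight_intra, and weight_inter are hypothetical and not taken from the paper.

```python
# Minimal sketch (not the authors' implementation) of intra-tree and
# inter-tree grouped relative advantage estimation from outcome rewards.
from statistics import mean, pstdev


def grouped_advantage(rewards, eps=1e-6):
    """GRPO-style relative advantage: (r - group mean) / group std."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def tree_grpo_advantages(trees, weight_intra=0.5, weight_inter=0.5):
    """Blend intra-tree and inter-tree relative advantages per leaf.

    `trees` is a list of rollout trees for one prompt; each tree is given
    here simply as a list of outcome rewards, one per leaf trajectory.
    """
    # Intra-tree level: each leaf is contrasted only with leaves of the
    # same tree, so trajectories sharing a prefix but diverging at some
    # step receive a step-wise comparison signal.
    intra = [grouped_advantage(tree) for tree in trees]

    # Inter-tree level: each leaf is contrasted with all leaves sampled
    # for the prompt, recovering the usual chain-level GRPO baseline.
    flat = [r for tree in trees for r in tree]
    inter_flat = grouped_advantage(flat)

    # Re-nest the inter-tree advantages and blend the two levels.
    out, k = [], 0
    for tree_intra in intra:
        row = []
        for a_intra in tree_intra:
            row.append(weight_intra * a_intra + weight_inter * inter_flat[k])
            k += 1
        out.append(row)
    return out


if __name__ == "__main__":
    # Two rollout trees for one question, with binary outcome rewards.
    print(tree_grpo_advantages([[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]))
```

In this reading, the intra-tree term supplies the process-level contrast that the abstract attributes to the tree structure, while the inter-tree term preserves the standard outcome-level comparison across independent rollouts; the actual weighting and grouping scheme should be taken from the paper itself.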