Tree Search for LLM Agent Reinforcement Learning

September 25, 2025
Authors: Yuxiang Ji, Ziyu Ma, Yong Wang, Guanhua Chen, Xiangxiang Chu, Liaoni Wu
cs.AI

Abstract

Recent advances in reinforcement learning (RL) have significantly enhanced the agentic capabilities of large language models (LLMs). In long-horizon, multi-turn agent tasks, existing approaches driven solely by outcome rewards often suffer from sparse supervision. To address this challenge, we propose Tree-based Group Relative Policy Optimization (Tree-GRPO), a grouped agent RL method based on tree search, where each tree node represents a complete agent interaction step. By sharing common prefixes, tree-search sampling increases the number of rollouts achievable within a fixed budget of tokens or tool calls. Moreover, we find that tree-structured trajectories naturally allow the construction of step-wise process supervision signals, even when using only the outcome reward. Based on this, Tree-GRPO estimates group-relative advantages at both the intra-tree and inter-tree levels. Through theoretical analysis, we show that the objective of intra-tree group relative policy optimization is equivalent to that of step-level direct preference learning. Experiments across 11 datasets and 3 types of QA tasks demonstrate the superiority of the proposed tree-based RL over chain-based RL methods.
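
To make the two advantage levels concrete, the following is a minimal, hypothetical Python sketch of how group-relative advantages could be estimated from outcome rewards on a rollout tree. The `Node` class, function names, and the simple mean-baseline normalization are illustrative assumptions, not the paper's actual implementation (which may, for example, also normalize by the group's standard deviation as in standard GRPO).

```python
# Illustrative sketch only (not the authors' code): estimating group-relative
# advantages at the intra-tree and inter-tree levels from outcome rewards.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One agent interaction step in a rollout tree (hypothetical structure)."""
    reward: float = 0.0                      # outcome reward; meaningful at leaves
    parent: Optional["Node"] = None
    children: List["Node"] = field(default_factory=list)


def leaf_rewards(node: Node) -> List[float]:
    """Collect the outcome rewards of all leaf trajectories under this node."""
    if not node.children:
        return [node.reward]
    rewards: List[float] = []
    for child in node.children:
        rewards.extend(leaf_rewards(child))
    return rewards


def intra_tree_advantage(node: Node) -> float:
    """Step-wise signal inside one tree: mean leaf reward under this node
    minus the mean over its parent's subtree, so sibling branches that share
    the same prefix are compared against each other using only outcome rewards."""
    own = leaf_rewards(node)
    group = leaf_rewards(node.parent) if node.parent else own
    return sum(own) / len(own) - sum(group) / len(group)


def inter_tree_advantage(tree_roots: List[Node], leaf: Node) -> float:
    """Trajectory-level signal across trees: this rollout's outcome reward
    minus the mean reward over all rollouts in the sampled group of trees."""
    all_rewards = [r for root in tree_roots for r in leaf_rewards(root)]
    return leaf.reward - sum(all_rewards) / len(all_rewards)
```

In this sketch, the intra-tree term plays the role of a process-level preference between sibling branches, while the inter-tree term recovers the usual trajectory-level group baseline; how the two are combined and scaled is left unspecified here.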