

AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

January 8, 2026
Authors: Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
cs.AI

Abstract

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline by up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
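
To make the two tree-related ideas in the abstract concrete, below is a minimal illustrative sketch in Python of (a) propagating a sparse terminal reward back to individual turns of a rollout tree and (b) picking the next turn to expand by policy entropy. The node fields, the discount factor gamma, and the backup rule are hypothetical stand-ins chosen for illustration; they are not the paper's actual Turn-wise Credit Assignment or Entropy-Guided Tree Expansion formulas.

```python
# Illustrative sketch only: a toy turn-level rollout tree. Field names,
# the discounted-max backup, and the entropy-based selection rule are
# assumptions for exposition, not AT^2PO's actual algorithm.
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class TurnNode:
    """One agent turn (reasoning step + tool call) in the rollout tree."""
    turn_text: str
    entropy: float                       # policy entropy at this turn (assumed given)
    children: List["TurnNode"] = field(default_factory=list)
    terminal_reward: Optional[float] = None  # sparse outcome reward, set only at leaves
    credit: float = 0.0                  # turn-level credit filled in by the backup below


def assign_turn_credit(node: TurnNode, gamma: float = 0.95) -> float:
    """Propagate sparse leaf rewards upward, discounting once per turn.

    Stores the best discounted outcome reachable from `node` as that turn's
    credit. This is a generic max-backup used for illustration only.
    """
    if not node.children:
        node.credit = node.terminal_reward or 0.0
        return node.credit
    node.credit = gamma * max(assign_turn_credit(c, gamma) for c in node.children)
    return node.credit


def pick_expansion_node(root: TurnNode) -> TurnNode:
    """Entropy-guided selection: expand the non-terminal turn where the
    policy is most uncertain (highest entropy)."""
    frontier, best = [root], root
    while frontier:
        node = frontier.pop()
        if node.terminal_reward is None and node.entropy > best.entropy:
            best = node
        frontier.extend(node.children)
    return best


if __name__ == "__main__":
    # Toy two-turn rollout: one branch succeeds (reward 1), one fails (reward 0).
    root = TurnNode("search('query')", entropy=1.2, children=[
        TurnNode("read(doc_3)", entropy=0.4, terminal_reward=1.0),
        TurnNode("read(doc_7)", entropy=0.9, terminal_reward=0.0),
    ])
    assign_turn_credit(root)
    print("root credit:", root.credit)                 # 0.95
    print("expand next:", pick_expansion_node(root).turn_text)
```

In a full pipeline, the per-turn credits computed this way would feed a turn-level objective (rather than a single trajectory-level return), which is the granularity the abstract's Agentic Turn-based Policy Optimization is aligned to.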