
AT^2PO: Agentic Turn-based Policy Optimization via Tree Search

January 8, 2026
Authors: Zefang Zong, Dingwei Chen, Yang Li, Qi Yi, Bo Zhou, Chengming Li, Bo Qian, Peng Chen, Jie Jiang
cs.AI

Abstract

LLM agents have emerged as powerful systems for tackling multi-turn tasks by interleaving internal reasoning and external tool interactions. Agentic Reinforcement Learning has recently drawn significant research attention as a critical post-training paradigm to further refine these capabilities. In this paper, we present AT^2PO (Agentic Turn-based Policy Optimization via Tree Search), a unified framework for multi-turn agentic RL that addresses three core challenges: limited exploration diversity, sparse credit assignment, and misaligned policy optimization. AT^2PO introduces a turn-level tree structure that jointly enables Entropy-Guided Tree Expansion for strategic exploration and Turn-wise Credit Assignment for fine-grained reward propagation from sparse outcomes. Complementing this, we propose Agentic Turn-based Policy Optimization, a turn-level learning objective that aligns policy updates with the natural decision granularity of agentic interactions. ATPO is orthogonal to tree search and can be readily integrated into any multi-turn RL pipeline. Experiments across seven benchmarks demonstrate consistent improvements over the state-of-the-art baseline of up to 1.84 percentage points on average, with ablation studies validating the effectiveness of each component. Our code is available at https://github.com/zzfoutofspace/ATPO.
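
The abstract names two tree-side mechanisms, Entropy-Guided Tree Expansion and Turn-wise Credit Assignment, without implementation detail. The sketch below is a minimal illustration of how a turn-level rollout tree with those two operations could be organized; the names (TurnNode, expand_highest_entropy, assign_turn_credit), the mean-entropy expansion signal, and the discounted back-propagation of the outcome reward are assumptions made here for illustration, not the authors' implementation (see the linked repository for that).

```python
import random
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple

@dataclass
class TurnNode:
    """One agent turn (reasoning step + tool interaction) in a rollout tree. Illustrative only."""
    text: str
    entropy: float                          # assumed expansion signal: mean token entropy of the turn
    parent: Optional["TurnNode"] = None
    children: List["TurnNode"] = field(default_factory=list)
    credit: float = 0.0                     # turn-level reward filled in after the episode ends

def expand_highest_entropy(
    frontier: List[TurnNode],
    sample_turn: Callable[[TurnNode], Tuple[str, float]],
    k: int = 2,
) -> List[TurnNode]:
    """Entropy-guided expansion: branch k alternative continuations from the most uncertain turn."""
    node = max(frontier, key=lambda n: n.entropy)
    for _ in range(k):
        text, ent = sample_turn(node)       # sample_turn stands in for the LLM policy
        child = TurnNode(text=text, entropy=ent, parent=node)
        node.children.append(child)
        frontier.append(child)
    return frontier

def assign_turn_credit(leaf: TurnNode, outcome: float, gamma: float = 0.95) -> None:
    """Turn-wise credit assignment: spread a sparse end-of-episode reward back along the path."""
    node, step = leaf, 0
    while node is not None:
        node.credit += (gamma ** step) * outcome
        node, step = node.parent, step + 1

if __name__ == "__main__":
    # Toy run with a dummy policy that emits random turns with random entropy.
    root = TurnNode(text="<task prompt>", entropy=1.0)
    frontier = [root]
    dummy_policy = lambda parent: (f"turn after: {parent.text[:16]}", random.random())
    for _ in range(3):
        expand_highest_entropy(frontier, dummy_policy, k=2)
    assign_turn_credit(frontier[-1], outcome=1.0)        # sparse success signal on one leaf
    print([round(n.credit, 3) for n in frontier])
```

In the framework described by the abstract, such per-turn credits would then drive a turn-level policy update (the ATPO objective); that optimization step is not shown in this sketch.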