

Agentic Entropy-Balanced Policy Optimization

October 16, 2025
Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
cs.AI

Abstract

Recently, Agentic Reinforcement Learning (Agentic RL) has made significant progress in incentivizing the multi-turn, long-horizon tool-use capabilities of web agents. While mainstream agentic RL algorithms autonomously explore high-uncertainty tool-call steps under the guidance of entropy, excessive reliance on entropy signals can impose further constraints, leading to training collapse. In this paper, we delve into the challenges caused by entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an agentic RL algorithm designed to balance entropy in both the rollout and policy-update phases. AEPO comprises two core components: (1) a dynamic entropy-balanced rollout mechanism that adaptively allocates the global and branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty on consecutive high-entropy tool-call steps to prevent over-branching; and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient operation into the high-entropy clipping term to preserve and properly rescale gradients on high-entropy tokens, while incorporating entropy-aware advantage estimation to prioritize learning on high-uncertainty tokens. Results across 14 challenging datasets show that AEPO consistently outperforms 7 mainstream RL algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive results: 47.6% Pass@1 on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker; and 65.0% Pass@5 on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on WebWalker. Further analysis reveals that AEPO improves rollout sampling diversity while maintaining stable policy entropy, facilitating scalable web agent training.
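A minimal PyTorch-style sketch of how the two components described in the abstract could look in code, assuming a PPO-style token-level objective. This is an illustration of the ideas only, not the authors' implementation: the function names (`entropy_balanced_ratio`, `entropy_aware_advantage`, `allocate_rollout_budget`), the proportional budget rule, and the `alpha` scaling knob are all assumptions.

```python
# Hypothetical illustration of AEPO's two components as described in the
# abstract; not the authors' released code.
import torch


def entropy_balanced_ratio(log_probs, old_log_probs, token_entropy,
                           entropy_threshold, clip_eps=0.2):
    """PPO-style importance ratio where, for high-entropy tokens, a
    stop-gradient (detach) is inserted into the clipping term so gradients
    are preserved and rescaled rather than zeroed out by hard clipping
    (one possible reading of the abstract)."""
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    high_entropy = token_entropy > entropy_threshold
    # Forward value equals the clipped ratio, but the gradient flows through
    # the unclipped ratio, rescaled by clipped / ratio.
    rescaled = clipped.detach() * ratio / ratio.detach()
    return torch.where(high_entropy, rescaled, clipped)


def entropy_aware_advantage(advantage, token_entropy, alpha=0.1):
    """Upweight advantages on high-uncertainty (high-entropy) tokens so the
    update prioritizes them; `alpha` is an assumed scaling hyperparameter."""
    weight = 1.0 + alpha * (token_entropy - token_entropy.mean())
    return advantage * weight.clamp(min=0.0)


def allocate_rollout_budget(pre_monitored_entropy, total_budget, min_global=1):
    """Split the total sampling budget between global rollouts and branch
    rollouts based on the pre-monitored entropy (a scalar float here); the
    proportional rule is an assumption, not the paper's exact formula."""
    branch_frac = pre_monitored_entropy / (pre_monitored_entropy + 1.0)
    branch_budget = int(round(total_budget * branch_frac))
    global_budget = max(min_global, total_budget - branch_budget)
    return global_budget, total_budget - global_budget
```

In a training loop, the rescaled ratio and reweighted advantage would stand in for the standard PPO terms in the token-level surrogate loss, e.g. `loss = -(entropy_balanced_ratio(...) * entropy_aware_advantage(...)).mean()`.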