Agentic Entropy-Balanced Policy Optimization
October 16, 2025
Authors: Guanting Dong, Licheng Bao, Zhongyuan Wang, Kangzhi Zhao, Xiaoxi Li, Jiajie Jin, Jinghan Yang, Hangyu Mao, Fuzheng Zhang, Kun Gai, Guorui Zhou, Yutao Zhu, Ji-Rong Wen, Zhicheng Dou
cs.AI
Abstract
Recently, Agentic Reinforcement Learning (Agentic RL) has made significant
progress in incentivizing the multi-turn, long-horizon tool-use capabilities of
web agents. While mainstream agentic RL algorithms autonomously explore
high-uncertainty tool-call steps under the guidance of entropy, excessive
reliance on entropy signals can introduce additional constraints, leading to
training collapse. In this paper, we delve into the challenges caused by
entropy and propose Agentic Entropy-Balanced Policy Optimization (AEPO), an
agentic RL algorithm designed to balance entropy in both the rollout and policy
update phases. AEPO comprises two core components: (1) a dynamic
entropy-balanced rollout mechanism that adaptively allocates the global and
branch sampling budgets through entropy pre-monitoring, while imposing a branch penalty
on consecutive high-entropy tool-call steps to prevent over-branching issues;
and (2) Entropy-Balanced Policy Optimization, which inserts a stop-gradient
operation into the high-entropy clipping term to preserve and properly rescale
gradients on high-entropy tokens, while incorporating entropy-aware advantage
estimation to prioritize learning on high-uncertainty tokens. Results across 14
challenging datasets show that AEPO consistently outperforms 7 mainstream RL
algorithms. With just 1K RL samples, Qwen3-14B with AEPO achieves impressive
results: 47.6% on GAIA, 11.2% on Humanity's Last Exam, and 43.0% on WebWalker
for Pass@1; 65.0% on GAIA, 26.0% on Humanity's Last Exam, and 70.0% on
WebWalker for Pass@5. Further analysis reveals that AEPO improves rollout
sampling diversity while maintaining stable policy entropy, facilitating
scalable web agent training.
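
To make the first component more concrete, the sketch below illustrates one plausible reading of the dynamic entropy-balanced rollout: entropy pre-monitoring splits a total sampling budget into a global part and a branch part, and a branch penalty damps branching on consecutive high-entropy tool-call steps. The function names, thresholds, and allocation formulas are illustrative assumptions, not the paper's exact procedure.

```python
# Hypothetical sketch of an entropy-balanced rollout budget allocation.
# All quantities (thresholds, penalty factor, allocation rule) are assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class RolloutBudget:
    global_samples: int          # full trajectories sampled from the initial prompt
    branch_samples: List[int]    # extra branches allocated to each tool-call step


def allocate_rollout_budget(
    step_entropies: List[float],      # pre-monitored entropy of each tool-call step
    total_budget: int = 16,           # total number of sampled trajectories
    entropy_threshold: float = 1.0,   # assumed cutoff for "high-entropy" steps
    penalty: float = 0.5,             # assumed damping for consecutive high-entropy steps
) -> RolloutBudget:
    """Split the sampling budget between global rollouts and step-level branches."""
    # The share of budget spent on branching grows with the mean pre-monitored entropy.
    mean_entropy = sum(step_entropies) / max(len(step_entropies), 1)
    branch_fraction = min(mean_entropy / (mean_entropy + 1.0), 0.75)
    branch_budget = int(total_budget * branch_fraction)
    global_samples = max(total_budget - branch_budget, 1)

    # Distribute the branch budget over high-entropy steps, penalizing runs of
    # consecutive high-entropy tool calls so one region cannot absorb everything.
    weights = []
    consecutive = 0
    for h in step_entropies:
        if h >= entropy_threshold:
            weights.append(h * (penalty ** consecutive))
            consecutive += 1
        else:
            weights.append(0.0)
            consecutive = 0

    total_weight = sum(weights)
    branch_samples = [
        int(round(branch_budget * w / total_weight)) if total_weight > 0 else 0
        for w in weights
    ]
    return RolloutBudget(global_samples=global_samples, branch_samples=branch_samples)


if __name__ == "__main__":
    # Example: four tool-call steps, the middle two being high-entropy.
    print(allocate_rollout_budget([0.4, 1.8, 1.6, 0.3]))
```

In this reading, the penalty term is what prevents the "over-branching" failure mode the abstract mentions: a long run of high-entropy tool calls sees its per-step weight decay geometrically instead of monopolizing the branch budget.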
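
The second component can likewise be sketched as a PPO-style clipped objective in which, for high-entropy tokens, the clipping bound is applied through a stop-gradient (detach) so their gradients are preserved and rescaled rather than zeroed, and advantages are reweighted by token entropy. This is a minimal sketch under assumed thresholds and an assumed reweighting form, not the paper's exact loss.

```python
# Hypothetical sketch of an entropy-balanced policy update with a stop-gradient
# clipping term and entropy-aware advantage reweighting. Shapes: all tensors [T].

import torch


def entropy_balanced_policy_loss(
    logprobs: torch.Tensor,          # log pi_theta(a_t | s_t)
    old_logprobs: torch.Tensor,      # log pi_old(a_t | s_t)
    advantages: torch.Tensor,        # per-token advantages
    token_entropy: torch.Tensor,     # per-token policy entropy
    clip_eps: float = 0.2,
    entropy_threshold: float = 1.0,  # assumed cutoff for "high-entropy" tokens
    alpha: float = 0.5,              # assumed strength of entropy-aware scaling
) -> torch.Tensor:
    ratio = torch.exp(logprobs - old_logprobs)

    # Entropy-aware advantage estimation: upweight high-uncertainty tokens.
    entropy_weight = 1.0 + alpha * (token_entropy / (token_entropy.mean() + 1e-8))
    weighted_adv = advantages * entropy_weight

    # Standard clipped ratio, used as-is for low-entropy tokens.
    clipped_ratio = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)

    # For high-entropy tokens, route the clipping bound through a stop-gradient:
    # the forward value matches the clipped ratio, but the backward pass keeps a
    # rescaled gradient of the unclipped ratio instead of zeroing it out.
    rescale = (clipped_ratio / (ratio + 1e-8)).detach()
    sg_clipped_ratio = ratio * rescale

    high_entropy = token_entropy >= entropy_threshold
    effective_clipped = torch.where(high_entropy, sg_clipped_ratio, clipped_ratio)

    surrogate = torch.minimum(ratio * weighted_adv, effective_clipped * weighted_adv)
    return -surrogate.mean()
```

The design intent, as far as the abstract states it, is that high-uncertainty (typically tool-call) tokens continue to receive gradient signal even when clipping is active, while the entropy-aware weights bias learning toward those tokens.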