AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
May 8, 2026
Authors: Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S.-T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu
cs.AI
Abstract
Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse, outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present Adaptive Entropy Modulation (AEM), a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since the environment in agentic RL is typically affected by a complete response rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between a sampled response's advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, exploiting the evolving balance between positive and negative samples to transition naturally from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B parameters demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.
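The entropy-drift claim can be made concrete under standard assumptions. For a softmax policy updated with a natural-gradient step of size η, known analyses of entropy dynamics give the per-step entropy change as approximately the negative covariance between log-probability and advantage; lifting this from tokens to whole responses, as the abstract proposes, yields the form below. This lifted expression is our reading of the claim, not a formula quoted from the paper; S(y) denotes the surprisal of response y given prompt x.

\[
\Delta \mathcal{H}(\pi_\theta)
  \approx -\eta\,\operatorname{Cov}_{y \sim \pi_\theta(\cdot\mid x)}\!\bigl(\log \pi_\theta(y \mid x),\, A(y)\bigr)
  = \eta\,\mathbb{E}_{y \sim \pi_\theta}\!\bigl[A(y)\,\bigl(S(y) - \mathbb{E}[S]\bigr)\bigr],
\qquad
S(y) = -\log \pi_\theta(y \mid x).
\]

Entropy thus rises when high-advantage responses are also more surprising than average, and falls as advantage concentrates on low-surprisal responses. The abstract does not give AEM's concrete rescaling rule, so the sketch below is only a minimal, hypothetical illustration of surprisal-based advantage rescaling for a GRPO-style group of responses sampled for one prompt; the function name aem_rescale, the tanh squashing, and the beta hyperparameter are assumptions made here for illustration, not the paper's method.

import numpy as np

def aem_rescale(advantages, response_logps, beta=1.0):
    """Hypothetical sketch of response-level, surprisal-based advantage
    rescaling in the spirit of AEM (not the paper's exact method).

    advantages:     (G,) group-relative advantages for G sampled responses
    response_logps: (G,) log pi_theta(y | x) per response, i.e. the sum of
                    token log-probs, so -logp is the response surprisal
    beta:           assumed hyperparameter controlling modulation strength
    """
    advantages = np.asarray(advantages, dtype=np.float64)
    # Response-level surprisal and its deviation from the group mean
    # (the abstract's "relative surprisal").
    surprisal = -np.asarray(response_logps, dtype=np.float64)
    rel_surprisal = surprisal - surprisal.mean()
    # Bounded modulation factor in (-1, 1): tanh keeps outlier responses
    # from dominating the rescaling.
    modulation = np.tanh(beta * rel_surprisal)
    # Amplify updates from responses more surprising than the group
    # average and damp the rest, preserving each advantage's sign.
    return advantages * (1.0 + modulation)

# Example: four sampled responses with binary outcome rewards.
adv = np.array([1.0, 1.0, -1.0, -1.0])            # group-relative advantages
logps = np.array([-120.0, -80.0, -95.0, -140.0])  # summed token log-probs
print(aem_rescale(adv, logps, beta=0.05))

Under this reading, the exploration-to-exploitation transition falls out automatically: early in training the sampled group spans widely varying surprisal, so the modulation is large, whereas a converged policy samples responses of near-equal surprisal, the modulation vanishes, and the rescaled advantages revert to their raw values.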