AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning
May 8, 2026
Authors: Haotian Zhao, Songlin Zhou, Yuxin Zhang, Stephen S. -T. Yau, Wenyu Zhang, Lun Tian, Tianshu Zhu, Yifeng Huang, Yucheng Zeng, Jingnan Gu, Daxiang Dong, Jianmin Wu
cs.AI
Abstract
Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse outcome-only rewards provide limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Since the environment in agentic RL is typically affected by a complete response rather than an individual token, our analysis lifts entropy dynamics from the token level to the response level, aligning uncertainty estimation with the effective action granularity of LLM agents and reducing sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the sampled-response advantage and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to naturally transition from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.
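The abstract describes rescaling response-level advantages with a response-level uncertainty proxy built from relative surprisal. The sketch below is a minimal illustration of that general idea only, not the paper's actual AEM formulation: the GRPO-style group baseline, the sigmoid modulation, and the `beta` parameter are all assumptions introduced here for illustration.

```python
import numpy as np

def relative_surprisal(resp_logprobs: np.ndarray) -> np.ndarray:
    """Surprisal of each sampled response relative to the group mean.

    resp_logprobs: log pi(response) for each response in the sampled group,
    i.e. summed token log-probabilities (a hypothetical input format).
    """
    surprisal = -resp_logprobs          # -log pi(response)
    return surprisal - surprisal.mean() # centered within the sampled group

def rescale_advantages(advantages: np.ndarray,
                       resp_logprobs: np.ndarray,
                       beta: float = 0.5) -> np.ndarray:
    """Rescale response-level advantages by a response-level uncertainty proxy.

    Responses that are surprising relative to the group (higher uncertainty)
    are up-weighted, while responses the policy already assigns high
    probability to are down-weighted. This is one illustrative choice of
    modulation, not the AEM rule itself.
    """
    rel_surprisal = relative_surprisal(resp_logprobs)
    # Map relative surprisal to a bounded factor in (0, 2) via a sigmoid.
    modulation = 2.0 / (1.0 + np.exp(-beta * rel_surprisal))
    return advantages * modulation

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Toy group of 8 sampled responses: binary outcome rewards and log-probs.
    rewards = rng.integers(0, 2, size=8).astype(float)
    resp_logprobs = rng.normal(loc=-40.0, scale=5.0, size=8)
    advantages = rewards - rewards.mean()  # group-relative (GRPO-style) baseline
    print(rescale_advantages(advantages, resp_logprobs))
```

Under these assumptions, as the policy concentrates probability on successful responses their relative surprisal drops and the modulation shrinks their effective advantage, which is one way a rescaling of this kind can shift updates from exploration toward exploitation without any extra supervision signal.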