AEM: Adaptive Entropy Modulation for Multi-Turn Agentic Reinforcement Learning

Abstract

Reinforcement learning (RL) has substantially improved the ability of large language model (LLM) agents to interact with environments and solve multi-turn tasks. However, effective agentic RL remains challenging: sparse, outcome-only rewards give limited guidance for assigning credit to individual steps within long interaction trajectories. Existing approaches often introduce dense intermediate supervision, such as process reward models or auxiliary self-supervised signals, which increases supervision and tuning complexity and may limit generalization across tasks and domains. We present AEM, a supervision-free credit assignment method that adaptively modulates entropy dynamics during RL training to improve the exploration-exploitation trade-off. Because the environment in agentic RL is typically affected by a complete response rather than by an individual token, our analysis lifts entropy dynamics from the token level to the response level. This aligns uncertainty estimation with the effective action granularity of LLM agents and reduces sensitivity to token-level sampling noise. We further show that entropy drift under natural-gradient updates is governed by the interaction between the advantage of a sampled response and its relative surprisal. Motivated by this result, AEM derives a practical response-level uncertainty proxy and uses it to rescale advantages, leveraging the evolving balance between positive and negative samples to transition naturally from exploration to exploitation. Extensive experiments on ALFWorld, WebShop, and SWE-bench-Verified with models ranging from 1.5B to 32B parameters demonstrate that AEM consistently improves strong RL baselines, including a +1.4% gain when integrated into a state-of-the-art software-engineering RL training framework.
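The claim that entropy drift depends on both the advantage and the relative surprisal of a sampled response can be illustrated with a toy calculation. The sketch below is not the AEM implementation and does not reproduce its response-level uncertainty proxy. It assumes a tabular softmax policy over a handful of candidate responses, for which a natural-gradient step is commonly written as shifting each logit by the step size times the advantage. Under that assumption, the first-order entropy change is eta * sum_y pi(y) * A(y) * (s(y) - H), where s(y) = -log pi(y) is the surprisal and H is the policy entropy. All function names and numbers here are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """Shannon entropy (in nats) of a categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def predicted_entropy_drift(logits, advantages, step_size):
    """First-order drift: eta * sum_y pi(y) * A(y) * (surprisal(y) - H)."""
    probs = softmax(logits)
    h = entropy(probs)
    return step_size * sum(
        p * a * (-math.log(p) - h) for p, a in zip(probs, advantages)
    )


def actual_entropy_drift(logits, advantages, step_size):
    """Entropy change after the assumed update: logit += eta * advantage."""
    updated = [x + step_size * a for x, a in zip(logits, advantages)]
    return entropy(softmax(updated)) - entropy(softmax(logits))


# Toy policy over four candidate responses; index 0 is likely, index 3 is rare.
logits = [2.0, 0.5, 0.0, -1.0]
step_size = 0.01

# Positive advantage on the likely (low-surprisal) response.
reward_likely = [1.0, 0.0, 0.0, 0.0]
# Positive advantage on the rare (high-surprisal) response.
reward_rare = [0.0, 0.0, 0.0, 1.0]

drift_likely = actual_entropy_drift(logits, reward_likely, step_size)
drift_rare = actual_entropy_drift(logits, reward_rare, step_size)

print("likely response rewarded:", drift_likely)
print("rare response rewarded:  ", drift_rare)
```

In this toy setting, reinforcing a response whose surprisal is below the policy entropy reduces entropy, while reinforcing one whose surprisal is above it increases entropy. A negative advantage flips both signs, which is consistent with the abstract's remark that the balance between positive and negative samples shapes the shift from exploration to exploitation.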
