EPO: Entropy-regularized Policy Optimization for LLM Agents Reinforcement Learning

Abstract

Training LLM agents in multi-turn environments with sparse rewards, where completing a single task requires 30 or more turns of interaction within an episode, presents a fundamental challenge for reinforcement learning. We identify a critical failure mode unique to this setting: the exploration-exploitation cascade failure. The cascade begins with premature policy convergence in the early stage of training, where sparse feedback causes agents to commit to flawed, low-entropy strategies. Agents then enter late-stage policy collapse, where conventional entropy regularization becomes counterproductive and promotes chaotic exploration that destabilizes training. We propose Entropy-regularized Policy Optimization (EPO), a general framework that breaks this failure cycle through three synergistic mechanisms: (1) adopting entropy regularization in multi-turn settings to enhance exploration, (2) an entropy smoothing regularizer that bounds policy entropy within historical averages to prevent abrupt fluctuations, and (3) adaptive phase-based weighting that balances exploration and exploitation across training. Our analysis shows that EPO guarantees monotonically decreasing entropy variance while maintaining convergence. EPO achieves up to a 152% performance improvement on ScienceWorld and up to 19.8% on ALFWorld. Our work demonstrates that multi-turn sparse-reward settings require fundamentally different entropy control than traditional RL, with broad implications for LLM agent training.
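
The three mechanisms named in the abstract can be illustrated with a minimal sketch. The abstract does not give the actual EPO objective, so everything below is an assumption made for illustration: the band multipliers `low` and `high`, the window size, the linear `phase_weight` schedule, the coefficients, and all function and class names are hypothetical and are not taken from the paper.

```python
import math
from collections import deque


def token_entropy(probs):
    """Shannon entropy (in nats) of one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


class EntropySmoother:
    """Hypothetical smoothing regularizer: penalizes entropy that leaves
    a band around the running average of past entropies."""

    def __init__(self, window=10, low=0.8, high=1.2):
        self.history = deque(maxlen=window)  # recent per-step mean entropies
        self.low, self.high = low, high      # assumed band multipliers

    def penalty(self, entropy):
        """Distance by which `entropy` falls outside the historical band."""
        if not self.history:
            return 0.0  # no history yet, so nothing to smooth against
        avg = sum(self.history) / len(self.history)
        lo, hi = self.low * avg, self.high * avg
        return max(lo - entropy, 0.0) + max(entropy - hi, 0.0)

    def update(self, entropy):
        self.history.append(entropy)


def phase_weight(step, total_steps, start=0.0, end=1.0):
    """Assumed linear schedule for the smoothing weight across training."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * frac


def regularized_loss(policy_loss, entropy, smoother, step, total_steps,
                     ent_coef=0.01, smooth_coef=0.1):
    """Policy loss minus an entropy bonus plus a phase-weighted smoothing penalty."""
    beta = phase_weight(step, total_steps)
    loss = (policy_loss
            - ent_coef * entropy
            + beta * smooth_coef * smoother.penalty(entropy))
    smoother.update(entropy)  # record this step's entropy for later steps
    return loss
```

In this sketch the entropy bonus encourages exploration, the band penalty discourages abrupt entropy swings relative to recent history, and the schedule shifts emphasis between the two as training progresses. Whether EPO uses a band, a linear schedule, or these particular coefficients is not stated in the abstract.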
