A Practitioner's Guide to Multi-turn Agentic Reinforcement Learning

Abstract

We study what actually works and what doesn't when training large language models as agents via multi-turn reinforcement learning. Despite rapid progress, existing frameworks and definitions are fragmented, and there is no systematic formulation or analysis of which design choices matter across tasks. We address this gap by first breaking down the design space into three interrelated pillars (environment, reward, and policy) and then empirically deriving a recipe for training LLM agents in situated textual domains. In particular, we test on TextWorld and ALFWorld, popular domains for testing situated embodied reasoning, as well as SWE-Gym for more software-engineering-style tasks. (i) For the environment, we analyze the impact of task complexity in terms of the sizes of the state and action spaces as well as optimal solution length, finding that even simple environments within a domain can provide signal on how well an agent can generalize to more complex tasks. (ii) For the reward, we ablate relative reward sparsity, observing that while dense turn-level rewards accelerate training, performance and stability are highly dependent on the choice of RL algorithm. (iii) For the agent's policy, we explore the interplay between reward sparsity and biased (PPO, GRPO) and unbiased (RLOO) policy gradient methods, and we show how to find the optimal Supervised Fine-tuning (SFT) to RL training ratio given a fixed budget. We distill these findings into a training recipe that guides co-design across the three pillars, facilitating research and practical efforts in multi-turn agentic RL. Code: https://github.com/pearls-lab/meow-tea-taro
