Demystifying Reinforcement Learning for Long-Horizon Tool-Using Agents: A Comprehensive Recipe

Abstract

Reinforcement Learning (RL) is essential for evolving Large Language Models (LLMs) into autonomous agents capable of long-horizon planning, yet a practical recipe for scaling RL in complex, multi-turn environments remains elusive. This paper presents a systematic empirical study using TravelPlanner, a challenging testbed requiring tool orchestration to satisfy multifaceted constraints. We decompose the agentic RL design space along 5 axes: reward shaping, model scaling, data composition, algorithm selection, and environmental stability. Our controlled experiments yield 7 key takeaways, e.g., (1) reward and algorithm choices are scale-dependent: smaller models benefit from staged rewards and enhanced exploration, whereas larger models converge efficiently with simpler dense rewards; (2) ~1K training samples with a balanced difficulty mixture mark a sweet spot for both in-domain and out-of-domain performance; and (3) environmental stability is critical to prevent policy degradation. Based on our distilled recipe, our RL-trained models achieve state-of-the-art performance on TravelPlanner, significantly outperforming leading LLMs.

