StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

Abstract

Large language models (LLMs) are increasingly used as interactive agents, but optimizing them for long-horizon decision making remains difficult: most current methods are purely reactive, which weakens both exploration and credit assignment over extended trajectories. In this work, we present Strategic Trajectory Abstraction (StraTA), a simple framework that introduces an explicit trajectory-level strategy into agentic reinforcement learning (RL). StraTA samples a compact strategy from the initial task state, conditions subsequent actions on that strategy, and trains strategy generation and action execution jointly with a hierarchical GRPO-style rollout design. Diverse strategy rollout and critical self-judgment further improve performance. Experiments on ALFWorld, WebShop, and SciWorld show that StraTA consistently improves both sample efficiency and final performance over strong baselines. StraTA reaches success rates of 93.1% on ALFWorld and 84.2% on WebShop. On SciWorld, it attains a 63.5% overall score, outperforming frontier closed-source models.
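To make the hierarchical GRPO-style rollout idea more concrete, here is a minimal sketch of one way group-relative advantages could be computed at two levels. The abstract does not specify the actual design, so everything here is an assumption for illustration: the function names `group_relative` and `hierarchical_advantages` are hypothetical, a strategy is scored by the mean return of its rollouts, and each action rollout is compared only with rollouts that share its strategy.

```python
import statistics


def group_relative(values, eps=1e-8):
    """GRPO-style normalization: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / (std + eps) for v in values]


def hierarchical_advantages(returns_by_strategy):
    """Compute two levels of advantages (an assumed layout, not the paper's).

    returns_by_strategy[i][j] is the return of rollout j executed under
    sampled strategy i for the same initial task state.
    """
    # Strategy level (assumption): score each strategy by the mean return
    # of its rollouts, then normalize across the sampled strategies.
    strategy_scores = [statistics.fmean(r) for r in returns_by_strategy]
    strategy_adv = group_relative(strategy_scores)

    # Action level (assumption): normalize each rollout's return within
    # the group of rollouts conditioned on the same strategy.
    action_adv = [group_relative(r) for r in returns_by_strategy]
    return strategy_adv, action_adv


# Toy example: two strategies, three rollouts each, binary success returns.
returns = [[1.0, 1.0, 0.0], [0.0, 0.0, 0.0]]
strategy_adv, action_adv = hierarchical_advantages(returns)
```

In this toy example the first strategy receives a positive strategy-level advantage and the second a negative one, while the rollouts under the second strategy all receive zero action-level advantage because their returns are identical. Whether StraTA combines the two signals this way is not stated in the abstract.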

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
