A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Abstract

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation is particularly challenging because it requires handling dynamic content and long sequences of actions. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they lack a clear and adaptive path toward the final goal and often lose track of it as new information arrives. The problem is exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism improves the success rate (SR) of proprietary models such as Gemini by approximately 10 percentage points (absolute) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model raises its success rate from 6.4% to 43.0%. This surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
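The abstract contrasts sparse, delayed rewards with dense, milestone-based ones but does not define the reward itself. The following is a minimal illustrative sketch of that general contrast, not MiRA's actual reward function. The `Milestone` type, the `sparse_rewards` and `milestone_rewards` helpers, the dictionary-based state, the fixed per-milestone bonus, and the shopping task are all hypothetical and were chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# Hypothetical state: a flat snapshot of what the agent currently observes.
State = Dict[str, object]


@dataclass
class Milestone:
    """A named intermediate condition on the way to the final goal (assumed form)."""
    name: str
    reached: Callable[[State], bool]


def sparse_rewards(trajectory: Sequence[State],
                   success: Callable[[State], bool]) -> List[float]:
    """Sparse baseline: a single reward at the last step, only on success."""
    rewards = [0.0] * len(trajectory)
    if trajectory and success(trajectory[-1]):
        rewards[-1] = 1.0
    return rewards


def milestone_rewards(trajectory: Sequence[State],
                      milestones: Sequence[Milestone],
                      bonus: float = 1.0) -> List[float]:
    """Dense variant: credit each milestone once, in order, at the first
    step where its condition holds (an assumption, not the paper's rule)."""
    rewards = [0.0] * len(trajectory)
    next_idx = 0  # index of the next milestone still to be reached
    for t, state in enumerate(trajectory):
        while next_idx < len(milestones) and milestones[next_idx].reached(state):
            rewards[t] += bonus
            next_idx += 1
    return rewards


# Hypothetical task: search for a product and add it to the cart.
milestones = [
    Milestone("search_opened", lambda s: s["page"] == "search"),
    Milestone("product_opened", lambda s: s["page"] == "product"),
    Milestone("item_in_cart", lambda s: s["cart"] >= 1),
]

full_run = [
    {"page": "home", "cart": 0},
    {"page": "search", "cart": 0},
    {"page": "search", "cart": 0},
    {"page": "product", "cart": 0},
    {"page": "product", "cart": 1},
]
failed_run = full_run[:4]  # the agent stops before adding the item


def in_cart(state: State) -> bool:
    return state["cart"] >= 1


print(sparse_rewards(full_run, in_cart))
print(milestone_rewards(full_run, milestones))
print(sparse_rewards(failed_run, in_cart))
print(milestone_rewards(failed_run, milestones))
```

In this toy setting the failed run receives no signal at all under the sparse scheme, while the milestone scheme still credits the two intermediate steps the agent completed. That is the kind of credit-assignment benefit the abstract attributes to dense rewards.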
