A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
March 20, 2026
Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
cs.AI
Abstract
Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track as new information arrives, lacking a clear and adaptive path toward the final goal. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we propose two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism yields an absolute success-rate (SR) improvement of approximately 10% for proprietary models such as Gemini on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open-source Gemma3-12B model increases its success rate from 6.4% to 43.0%. This performance surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
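To make the contrast between sparse terminal rewards and dense milestone-based rewards concrete, here is a minimal sketch. The milestone set, reward values, and trajectory below are illustrative assumptions for exposition only, not the actual MiRA formulation from the paper.

```python
# Illustrative contrast between a sparse terminal reward and a dense,
# milestone-based reward signal, in the spirit of the abstract.
# All milestone names and reward magnitudes here are hypothetical.

def sparse_reward(trajectory, goal):
    # Classic sparse signal: zero everywhere except a terminal reward
    # when the final state achieves the goal.
    terminal = 1.0 if trajectory[-1] == goal else 0.0
    return [0.0] * (len(trajectory) - 1) + [terminal]

def milestone_reward(trajectory, milestones, goal, bonus=0.25):
    # Dense signal: a partial bonus the first time the agent reaches each
    # milestone subgoal, plus the terminal reward for task completion.
    reached = set()
    rewards = []
    for i, state in enumerate(trajectory):
        r = 0.0
        if state in milestones and state not in reached:
            reached.add(state)
            r += bonus
        if i == len(trajectory) - 1 and state == goal:
            r += 1.0
        rewards.append(r)
    return rewards

# Hypothetical web-navigation trajectory: home -> search -> item -> cart.
traj = ["home", "search_results", "item_page", "cart"]
milestones = {"search_results", "item_page"}

print(sparse_reward(traj, "cart"))               # [0.0, 0.0, 0.0, 1.0]
print(milestone_reward(traj, milestones, "cart"))  # [0.0, 0.25, 0.25, 1.0]
```

Under the sparse scheme, an agent that reaches the item page but fails at checkout receives no learning signal at all; the milestone scheme still credits the useful intermediate steps, which is the intuition behind denser reward shaping for long-horizon tasks.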