A Subgoal-driven Framework for Improving Long-Horizon LLM Agents
March 20, 2026
Authors: Taiyi Wang, Sian Gooding, Florian Hartmann, Oriana Riva, Edward Grefenstette
cs.AI
Abstract
Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, requires handling dynamic content and long sequences of actions, making it particularly challenging. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track of the goal as new information arrives, lacking a clear and adaptive path toward task completion. This issue is further exacerbated during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it difficult for agents to identify which actions lead to success, preventing them from maintaining coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism yields an absolute improvement of approximately 10% in success rate (SR) for proprietary models such as Gemini on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model raises its success rate from 6.4% to 43.0%, surpassing proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
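To illustrate the intuition behind milestone-based dense rewards, the sketch below shows one simple way such a reward could be computed. This is a minimal illustration under assumed conventions, not the paper's actual MiRA implementation: the function name `milestone_reward`, the bonus values, and the example web task are all hypothetical.

```python
# Hypothetical sketch of milestone-based reward shaping (illustrative only;
# not the actual MiRA formulation). Instead of a single sparse reward at the
# end of an episode, the agent earns partial credit whenever it completes a
# subgoal ("milestone") on the way to the final task goal.

def milestone_reward(completed: set[str], milestones: list[str],
                     task_success: bool, milestone_bonus: float = 0.2,
                     success_bonus: float = 1.0) -> float:
    """Dense reward: per-milestone credit plus a terminal success bonus."""
    reward = milestone_bonus * sum(m in completed for m in milestones)
    if task_success:
        reward += success_bonus
    return reward

# Example: a web task such as "buy the cheapest red shoes" decomposed into
# milestones (names invented for illustration).
milestones = ["open_search", "filter_color_red", "sort_by_price", "add_to_cart"]

# Under a sparse terminal reward, a trajectory that fails at the last step
# receives zero signal. With milestone shaping, the same trajectory still
# earns credit for the three subgoals it did complete (reward of about 0.6),
# giving the RL learner a gradient toward the behaviors that matter.
partial = milestone_reward(
    {"open_search", "filter_color_red", "sort_by_price"},
    milestones, task_success=False)
```

The design choice here mirrors the abstract's argument: dense intermediate signals let the agent attribute credit to individual actions in a long horizon, rather than waiting for a single delayed success/failure outcome.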