A Subgoal-driven Framework for Improving Long-Horizon LLM Agents

Abstract

Large language model (LLM)-based agents have emerged as powerful autonomous controllers for digital environments, including mobile interfaces, operating systems, and web browsers. Web navigation, for example, is particularly challenging because it requires handling dynamic content and long sequences of actions. Existing LLM-based agents struggle with long-horizon planning in two main ways. During online execution, they often lose track of a clear, adaptive path toward the final goal as new information arrives. The problem worsens during reinforcement learning (RL) fine-tuning, where sparse and delayed rewards make it hard for an agent to identify which actions led to success, so it fails to maintain coherent reasoning over extended tasks. To address these challenges, we make two contributions. First, we introduce an agent framework that leverages proprietary models for online planning through subgoal decomposition. Second, we present MiRA (Milestoning your Reinforcement Learning Enhanced Agent), an RL training framework that uses dense, milestone-based reward signals. The real-time planning mechanism raises the success rate (SR) of proprietary models such as Gemini by approximately 10 percentage points (absolute) on the WebArena-Lite benchmark. Meanwhile, applying MiRA to the open Gemma3-12B model increases its success rate from 6.4% to 43.0%. This surpasses proprietary systems such as GPT-4-Turbo (17.6%) and GPT-4o (13.9%), as well as the previous open-model state of the art, WebRL (38.4%). Overall, our findings demonstrate that combining explicit inference-time planning with milestone-based rewards significantly improves an agent's long-horizon capabilities, paving the way for more robust and general-purpose autonomous systems.
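To make the contrast between sparse and milestone-based rewards concrete, the following is a minimal Python sketch. It is not the MiRA implementation: the `Milestone` class, the state dictionaries, the `bonus` value of 0.2, and the in-order milestone rule are all assumptions made purely for illustration. It shows how a failed trajectory that earns no signal under a terminal-only reward can still earn partial credit when intermediate milestones are rewarded.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical state representation: a plain dict snapshot of the environment.
State = dict


@dataclass
class Milestone:
    """Hypothetical milestone: a name plus a predicate over a state."""
    name: str
    reached: Callable[[State], bool]


def sparse_rewards(states: Sequence[State],
                   goal: Callable[[State], bool]) -> List[float]:
    """Terminal-only reward: 1.0 on the last step if the goal holds, else all zeros."""
    if not states:
        return []
    rewards = [0.0] * len(states)
    rewards[-1] = 1.0 if goal(states[-1]) else 0.0
    return rewards


def milestone_rewards(states: Sequence[State],
                      milestones: Sequence[Milestone],
                      goal: Callable[[State], bool],
                      bonus: float = 0.2) -> List[float]:
    """Terminal reward plus an assumed bonus the first time each milestone
    is reached, with milestones credited strictly in order."""
    rewards = sparse_rewards(states, goal)
    next_idx = 0  # index of the next milestone still to be reached
    for t, state in enumerate(states):
        # A single step may satisfy several consecutive milestones.
        while next_idx < len(milestones) and milestones[next_idx].reached(state):
            rewards[t] += bonus
            next_idx += 1
    return rewards


# Toy trajectory that makes progress but never completes the task.
trajectory = [{"page": "home"}, {"page": "search"},
              {"page": "product"}, {"page": "cart"}]
milestones = [
    Milestone("opened search", lambda s: s["page"] == "search"),
    Milestone("opened product", lambda s: s["page"] == "product"),
]
goal = lambda s: s["page"] == "order_confirmed"

sparse = sparse_rewards(trajectory, goal)
dense = milestone_rewards(trajectory, milestones, goal)
```

Under these assumptions the sparse reward gives the learner nothing to distinguish this trajectory from one that never left the home page, whereas the milestone variant places credit at the steps where progress occurred.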
