Look Before You Leap: Autonomous Exploration for LLM Agents

Abstract

Agents based on large language models (LLMs) often fail in unfamiliar environments due to premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information gathering from task execution: agents first use an interaction budget to acquire grounded environmental knowledge, then leverage it for task resolution. Our results demonstrate that learning to systematically explore is imperative for building generalizable and real-world-ready agents.
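As a rough illustration of what a coverage-style metric could look like, the sketch below scores a trajectory by the fraction of predefined checkpoints it reaches at least once. The abstract does not give the exact definition of Exploration Checkpoint Coverage, so the formula, the function name `checkpoint_coverage`, and the checkpoint labels are all assumptions made for illustration only.

```python
from typing import Iterable, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints reached at least once.

    Assumed definition for illustration; the paper's exact metric may differ.
    """
    if not checkpoints:
        raise ValueError("checkpoints must be non-empty")
    # Repeated visits and visits to non-checkpoint items do not raise the score.
    hit = checkpoints.intersection(visited)
    return len(hit) / len(checkpoints)


# Hypothetical checkpoints: key states, objects, and affordances of a toy environment.
CHECKPOINTS = {
    "state:kitchen",
    "object:fridge",
    "object:drawer",
    "affordance:open_fridge",
}

# A narrow, repetitive trajectory versus a broader one.
narrow = ["state:kitchen", "object:fridge", "object:fridge", "object:fridge"]
broad = [
    "state:kitchen",
    "object:fridge",
    "object:drawer",
    "affordance:open_fridge",
    "state:hallway",  # not a checkpoint, so it is ignored
]

narrow_score = checkpoint_coverage(narrow, CHECKPOINTS)  # 2 of 4 checkpoints
broad_score = checkpoint_coverage(broad, CHECKPOINTS)    # 4 of 4 checkpoints
```

Under this assumed definition, revisiting the same object adds nothing to the score, which is one simple way a verifiable reward could penalize the narrow, repetitive behavior the abstract describes.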