Look Before You Leap: Autonomous Exploration for LLM Agents

Abstract

Large language model (LLM) based agents often fail in unfamiliar environments because of premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its own verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information gathering from task execution: agents first spend an interaction budget to acquire grounded environmental knowledge, then use that knowledge to solve the task. Our results demonstrate that learning to explore systematically is essential for building generalizable, real-world-ready agents.
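
The abstract does not give a formula for Exploration Checkpoint Coverage or a schedule for interleaving the two rollout types. As a rough illustration only, the Python sketch below assumes that coverage is the fraction of a predefined checkpoint set that an exploration trajectory reaches, and that rollout types alternate on a fixed period. The names `checkpoint_coverage`, `rollout_schedule`, and `explore_every` are hypothetical and do not come from the paper.

```python
from typing import Iterable, List, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Assumed definition: share of predefined checkpoints found in a trajectory.

    `visited` holds the states, objects, or affordances the agent encountered;
    `checkpoints` is the reference set the environment designer marked as key.
    """
    if not checkpoints:
        return 0.0
    # Repeated visits and items outside the reference set do not raise the score.
    found = set(visited) & checkpoints
    return len(found) / len(checkpoints)


def rollout_schedule(num_steps: int, explore_every: int = 2) -> List[str]:
    """Assumed schedule: every `explore_every`-th rollout is an exploration rollout."""
    return [
        "explore" if (step + 1) % explore_every == 0 else "task"
        for step in range(num_steps)
    ]


# Hypothetical household environment with four key checkpoints.
trajectory = ["kitchen", "fridge", "kitchen", "sofa"]
key_checkpoints = {"kitchen", "fridge", "drawer", "stove"}
coverage = checkpoint_coverage(trajectory, key_checkpoints)
schedule = rollout_schedule(4)
```

Under these assumptions, a trajectory that revisits the same state gains nothing, which matches the abstract's concern with narrow and repetitive behavior. In an interleaved setup, a score like `coverage` could serve as the verifiable reward for the "explore" rollouts, while "task" rollouts would keep their usual task-success reward.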