Look Before You Leap: Autonomous Exploration for Large Language Model Agents

Abstract

Agents based on large language models often fail in unfamiliar environments because of premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream task performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts with exploration rollouts, optimizing each type of rollout with its own verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information gathering from task execution: agents first spend an interaction budget to acquire grounded environmental knowledge, then use that knowledge to solve the task. Our results demonstrate that learning to explore systematically is essential for building generalizable, real-world-ready agents.
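The abstract does not give the exact definition of Exploration Checkpoint Coverage, so the following is only a minimal sketch under an assumed definition: the fraction of a predefined checkpoint set (key states, objects, and affordances) that appears at least once in an agent's trajectory. The function name `checkpoint_coverage`, the string encoding of checkpoints, and the toy trajectories are all hypothetical and serve only to illustrate why narrow, repetitive behavior scores low.

```python
from typing import Iterable, Set


def checkpoint_coverage(discovered: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints that the agent discovered.

    Assumed definition, not taken from the paper:
    |discovered ∩ checkpoints| / |checkpoints|.
    """
    if not checkpoints:
        raise ValueError("checkpoints must be non-empty")
    # Repeated visits and items outside the checkpoint set do not add credit.
    hits = set(discovered) & checkpoints
    return len(hits) / len(checkpoints)


# Hypothetical checkpoint set: two states, one affordance, one object.
CHECKPOINTS = {"kitchen", "drawer", "open(drawer)", "key"}

# A narrow, repetitive trajectory revisits the same few checkpoints.
narrow = ["kitchen", "kitchen", "drawer", "kitchen"]

# A broader trajectory reaches every checkpoint (plus one irrelevant state).
broad = ["kitchen", "drawer", "open(drawer)", "key", "hallway"]

narrow_score = checkpoint_coverage(narrow, CHECKPOINTS)
broad_score = checkpoint_coverage(broad, CHECKPOINTS)
print(narrow_score, broad_score)
```

Under this assumed definition the score depends only on which checkpoints were reached, not on how often, so a reward built on it would give no extra credit for revisiting the same state.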