Look Before You Leap: Autonomous Exploration for LLM Agents

Abstract

Agents based on large language models (LLMs) often fail in unfamiliar environments because of premature exploitation: a tendency to act on prior knowledge before acquiring sufficient environment-specific information. We identify autonomous exploration as a critical yet underexplored capability for building adaptive agents. To formalize and quantify this capability, we introduce Exploration Checkpoint Coverage, a verifiable metric that measures how broadly an agent discovers key states, objects, and affordances. Our systematic evaluation reveals that agents trained with standard task-oriented reinforcement learning consistently exhibit narrow and repetitive behaviors that impede downstream performance. To address this limitation, we develop a training strategy that interleaves task-execution rollouts and exploration rollouts, with each type of rollout optimized by its corresponding verifiable reward. Building on this training strategy, we propose the Explore-then-Act paradigm, which decouples information gathering from task execution: agents first use an interaction budget to acquire grounded environmental knowledge, then leverage that knowledge to solve the task. Our results demonstrate that learning to explore systematically is essential for building generalizable, real-world-ready agents.
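The abstract describes Exploration Checkpoint Coverage only at a high level and gives no formula. As a rough illustration, one plausible reading is the fraction of a predefined checkpoint set that an agent's trajectory reaches. The sketch below assumes that reading; the function name `checkpoint_coverage`, the checkpoint labels, and the toy trajectories are all hypothetical and are not taken from the paper.

```python
from typing import Iterable, Set


def checkpoint_coverage(visited: Iterable[str], checkpoints: Set[str]) -> float:
    """Fraction of predefined checkpoints that appear in a trajectory.

    Assumption: coverage = |discovered checkpoints| / |all checkpoints|.
    The paper's actual definition may differ.
    """
    if not checkpoints:
        raise ValueError("checkpoints must be non-empty")
    # Repeated visits and non-checkpoint events do not add to the score.
    discovered = set(visited) & checkpoints
    return len(discovered) / len(checkpoints)


# Hypothetical checkpoints for a toy household environment.
checkpoints = {"open_drawer", "see_key", "enter_kitchen", "toggle_lamp"}

# A narrow, repetitive trajectory versus a broader one.
narrow = ["enter_kitchen", "enter_kitchen", "enter_kitchen"]
broad = ["enter_kitchen", "open_drawer", "see_key", "walk_hall"]

print(checkpoint_coverage(narrow, checkpoints))  # 0.25
print(checkpoint_coverage(broad, checkpoints))   # 0.75
```

Under this assumed definition, the repetitive trajectory scores low no matter how many steps it takes, which matches the narrow behavior the abstract attributes to task-oriented training.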