Exploration and Exploitation Errors Are Measurable for Language Model Agents

Abstract

Language model (LM) agents are increasingly used in complex, open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions, without access to the agent's internal policy, remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task directed acyclic graph (DAG). Map generation can be adjusted programmatically to emphasize either exploration difficulty or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric that quantifies exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively, and we show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code at https://github.com/jjj-madison/measurable-explore-exploit.

