Exploration and Exploitation Errors Are Measurable for Language Model Agents

Abstract

Language Model (LM) agents are increasingly used in complex, open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability both to explore the problem space and to exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions, without access to the agent's internal policy, remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). Map generation can be adjusted programmatically to emphasize either exploration difficulty or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric that quantifies exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively, and we show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code at https://github.com/jjj-madison/measurable-explore-exploit.
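To make the environment description concrete, the following is a minimal illustrative sketch of a partially observable 2D grid paired with a task DAG. It is not the released implementation: the class name `GridTaskEnv`, the square field of view, the task names, and every method shown are assumptions introduced here for illustration only, and the sketch deliberately does not attempt to reproduce the paper's error metric.

```python
from dataclasses import dataclass, field

Cell = tuple[int, int]


@dataclass
class GridTaskEnv:
    """Hypothetical toy environment (not the authors' code): a grid the agent
    sees only locally, plus a task DAG given as prerequisite sets."""

    width: int
    height: int
    task_cells: dict[str, Cell]   # task name -> cell where it is performed
    prereqs: dict[str, set[str]]  # task name -> tasks that must be done first
    view_radius: int = 1          # assumed square field of view
    pos: Cell = (0, 0)
    done: set[str] = field(default_factory=set)
    seen: set[Cell] = field(default_factory=set)

    def __post_init__(self) -> None:
        self._reveal()

    def _reveal(self) -> None:
        # Mark every in-bounds cell within view_radius of the agent as observed.
        x0, y0 = self.pos
        r = self.view_radius
        for x in range(max(0, x0 - r), min(self.width, x0 + r + 1)):
            for y in range(max(0, y0 - r), min(self.height, y0 + r + 1)):
                self.seen.add((x, y))

    def move(self, dx: int, dy: int) -> None:
        # Clamp to the map so the agent cannot leave the grid.
        x = min(max(self.pos[0] + dx, 0), self.width - 1)
        y = min(max(self.pos[1] + dy, 0), self.height - 1)
        self.pos = (x, y)
        self._reveal()

    def visible_tasks(self) -> set[str]:
        # The agent only learns about tasks located in cells it has observed.
        return {t for t, c in self.task_cells.items() if c in self.seen}

    def attempt(self, task: str) -> bool:
        # A task succeeds only at its own cell and after all prerequisites.
        ready = self.prereqs.get(task, set()) <= self.done
        if self.pos == self.task_cells[task] and ready:
            self.done.add(task)
            return True
        return False


env = GridTaskEnv(
    width=4,
    height=4,
    task_cells={"get_key": (0, 1), "open_door": (3, 3)},
    prereqs={"open_door": {"get_key"}},
)
```

In this toy setup, moving into unobserved cells stands in for exploration, while attempting tasks in an order that respects already discovered prerequisites stands in for exploitation. How the paper actually scores errors of each kind is defined by its own metric, which this sketch does not model.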

