Exploration and Exploitation Errors Are Measurable for Language Model Agents
April 14, 2026
Authors: Jaden Park, Jungtaek Kim, Jongwon Jeong, Robert D. Nowak, Kangwook Lee, Yong Jae Lee
cs.AI
Abstract
Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability to both explore the problem space and exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions, without access to the agent's internal policy, remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG), and map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design metrics that quantify exploration and exploitation errors from an agent's actions alone. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively, and that both exploration and exploitation can be significantly improved through minimal harness engineering. Our code is available at https://github.com/jjj-madison/measurable-explore-exploit.
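To make the environment design concrete, here is a minimal toy sketch of the kind of setup the abstract describes: a partially observable 2D grid containing objects, plus a task DAG whose edges encode prerequisites. All class and object names here are illustrative assumptions, not the authors' implementation.

```python
import random

class GridTaskEnv:
    """Toy partially observable grid world with a prerequisite task DAG."""

    def __init__(self, size=8, objects=("key", "door", "goal"), view_radius=1, seed=0):
        rng = random.Random(seed)
        self.size = size
        self.view_radius = view_radius
        # Place each object (and the agent) at a distinct random cell.
        cells = rng.sample(
            [(r, c) for r in range(size) for c in range(size)], len(objects) + 1
        )
        self.object_at = dict(zip(cells[:-1], objects))
        self.agent = cells[-1]
        # Task DAG, unknown to the agent: key -> door -> goal.
        self.dag = {"key": [], "door": ["key"], "goal": ["door"]}
        self.done = set()

    def observe(self):
        """Return only the objects within the agent's view radius."""
        r0, c0 = self.agent
        return {
            pos: obj
            for pos, obj in self.object_at.items()
            if abs(pos[0] - r0) <= self.view_radius
            and abs(pos[1] - c0) <= self.view_radius
        }

    def step(self, move):
        """Move the agent; complete an object only if its DAG prerequisites are done."""
        dr, dc = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}[move]
        r = min(max(self.agent[0] + dr, 0), self.size - 1)
        c = min(max(self.agent[1] + dc, 0), self.size - 1)
        self.agent = (r, c)
        obj = self.object_at.get(self.agent)
        if obj and all(p in self.done for p in self.dag[obj]):
            self.done.add(obj)
        return self.observe(), self.done == set(self.dag)
```

In a setup like this, visiting already-observed cells when unseen cells remain could be counted as an exploration error, while attempting a DAG node whose prerequisites are incomplete could be counted as an exploitation error; the paper's actual metrics may differ.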