Exploration and Exploitation Errors Are Measurable for Language Model Agents

Abstract

Language Model (LM) agents are increasingly used in complex open-ended decision-making tasks, from AI coding to physical AI. A core requirement in these settings is the ability both to explore the problem space and to exploit acquired knowledge effectively. However, systematically distinguishing and quantifying exploration and exploitation from observed actions, without access to the agent's internal policy, remains challenging. To address this, we design controllable environments inspired by practical embodied AI scenarios. Each environment consists of a partially observable 2D grid map and an unknown task Directed Acyclic Graph (DAG). The map generation can be programmatically adjusted to emphasize exploration or exploitation difficulty. To enable policy-agnostic evaluation, we design a metric that quantifies exploration and exploitation errors from an agent's actions. We evaluate a variety of frontier LM agents and find that even state-of-the-art models struggle on our task, with different models exhibiting distinct failure modes. We further observe that reasoning models solve the task more effectively, and we show that both exploration and exploitation can be significantly improved through minimal harness engineering. We release our code at https://github.com/jjj-madison/measurable-explore-exploit.

