Agentic CLEAR: Automated Multi-Level Evaluation of LLM Agents

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited: they either focus on observability with only basic evaluation capabilities, or impose static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into agent behavior at three levels of granularity: system, trace, and node. Agentic CLEAR operates on top of the observability layer, which enables seamless integration, and it provides an intuitive UI that makes agent evaluation highly accessible. In experiments spanning four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows that this feedback aligns strongly with human-annotated errors and can predict task success rate.