Agentic CLEAR: Automated Multi-Level Evaluation of LLM Agents

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited: they either focus on observability with only basic evaluation capabilities or impose static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into agent behavior at three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, which enables seamless integration, and it features an intuitive UI that makes agent evaluation highly accessible. In experiments covering four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows that its output aligns strongly with human-annotated errors and can predict task success rate.