Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with diverse environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited: they either focus on observability with only basic evaluation capabilities or impose static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into agent behavior at three levels of granularity: system, trace, and node. Agentic CLEAR operates above the observability layer, which enables seamless integration, and features an intuitive UI that makes agent evaluation highly accessible. In experiments spanning four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and an ability to predict task success rate.