Agentic CLEAR: Automating Multi-Level Evaluation of LLM Agents

Abstract

Agentic systems are becoming more capable: agents define strategies, take actions, and interact with different environments. This autonomy poses serious challenges for overseeing and assessing agent behavior. Most current tools are limited: they either focus on observability with only basic evaluation capabilities, or impose static, hand-crafted error taxonomies that cannot adapt to new domains. To address this gap, we present Agentic CLEAR, an automatic, dynamic, and easy-to-use evaluation framework. It produces textual insights into agent behavior at three levels of granularity: system, trace, and node. Agentic CLEAR operates on top of the observability layer, which enables seamless integration, and provides an intuitive UI that makes agent evaluation highly accessible. In experiments on four benchmarks, seven agentic settings, and tens of thousands of LLM calls, we show that Agentic CLEAR produces high-quality, data-driven, insightful feedback. Our analysis shows strong alignment with human-annotated errors and the ability to predict task success rate.