Towards a Science of AI Agent Reliability

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have yielded only small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.
