An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility make testing and quality assurance difficult. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to the canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. The Trigger component (prompts) is a critical blind spot: it remains neglected, appearing in only around 1% of all tests. Our findings offer the first empirical testing baseline for FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To close these gaps, framework developers should improve support for novel testing methods, application developers must adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
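As a rough illustration of the two traditional patterns named in the abstract, the sketch below shows a membership test and a negative test written against a stubbed model and a stubbed tool dispatcher. It is not taken from the study: `fake_model`, `run_tool`, and `ALLOWED_TOOLS` are hypothetical names invented for this example, and the reading of "membership" and "negative" testing follows common usage, which may differ in detail from the definitions used in the paper.

```python
import random

# Hypothetical tool registry, invented for this sketch.
ALLOWED_TOOLS = {"search", "calculator"}


def fake_model(prompt: str, seed: int) -> str:
    """Hypothetical stand-in for an FM call: the wording varies with the seed."""
    rng = random.Random(seed)
    opener = rng.choice(["Sure.", "Certainly.", "Here you go."])
    return f"{opener} The capital of France is Paris."


def run_tool(name: str, arg: str) -> str:
    """Hypothetical deterministic tool dispatcher."""
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return f"{name}({arg})"


def test_membership() -> None:
    # Membership test: check that the expected fact is contained in the
    # output instead of comparing the full, varying string for equality.
    for seed in range(5):
        answer = fake_model("What is the capital of France?", seed)
        assert "Paris" in answer


def test_negative() -> None:
    # Negative test: an invalid request must be rejected, not executed.
    try:
        run_tool("delete_everything", "/")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an unknown tool")


if __name__ == "__main__":
    test_membership()
    test_negative()
    print("ok")
```

The membership assertion tolerates variation in phrasing across seeds, which is one plausible reason such patterns suit non-deterministic output; the negative test targets the deterministic tool layer, where the abstract reports most testing effort is concentrated.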
