An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

Abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose challenges for testing and quality assurance. While recent benchmarks provide task-level evaluations, there is limited understanding of how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods like DeepEval are seldom used (around 1%), while traditional patterns like negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns to canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components like Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. This inversion exposes a critical blind spot: the Trigger component (prompts) remains neglected, appearing in around 1% of all tests. Our findings offer the first empirical testing baseline for FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To close this gap, framework developers should improve support for novel testing methods, application developers should adopt prompt regression testing, and researchers should explore barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
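To make the patterns named in the abstract concrete, the following is a minimal Python sketch of what membership testing, negative testing, and a prompt regression check might look like. It is not taken from the study or from any framework it analyzes: the ticket-classification task, `fake_model`, `build_prompt`, `classify`, and the label set are all hypothetical, and the FM call is replaced by a deterministic stub so the example runs offline.

```python
# Hypothetical example: every name below is an illustrative assumption.
ALLOWED_LABELS = {"billing", "technical", "other"}

PROMPT_TEMPLATE = (
    "Classify the support ticket into one of: billing, technical, other.\n"
    "Ticket: {ticket}\n"
    "Label:"
)


def build_prompt(ticket: str) -> str:
    """Render the prompt (the 'Trigger'); reject empty tickets."""
    if not ticket.strip():
        raise ValueError("ticket must not be empty")
    return PROMPT_TEMPLATE.format(ticket=ticket)


def fake_model(prompt: str) -> str:
    """Stand-in for an FM call; returns loosely formatted text like a real model might."""
    return " Billing.\n" if "invoice" in prompt.lower() else "Technical"


def classify(ticket: str, model=fake_model) -> str:
    """Call the model and normalize its free-form answer to a lowercase label."""
    raw = model(build_prompt(ticket))
    return raw.strip().strip(".").lower()


def test_membership() -> None:
    # Membership testing: assert the output falls in an accepted set
    # instead of asserting one exact string.
    assert classify("My invoice is wrong") in ALLOWED_LABELS


def test_negative() -> None:
    # Negative testing: invalid input must be rejected before any model call.
    try:
        build_prompt("   ")
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty ticket")


def test_prompt_snapshot() -> None:
    # Prompt regression testing: compare the rendered prompt to a stored
    # snapshot so that accidental prompt edits fail the test suite.
    expected = (
        "Classify the support ticket into one of: billing, technical, other.\n"
        "Ticket: x\n"
        "Label:"
    )
    assert build_prompt("x") == expected


if __name__ == "__main__":
    test_membership()
    test_negative()
    test_prompt_snapshot()
```

The membership assertion tolerates variation in the model's surface output, which is one plausible reason such traditional patterns adapt well to FM uncertainty, while the snapshot test targets the prompt itself, the component the abstract reports as rarely tested.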
