An Empirical Study of Testing Practices in Open Source AI Agent Frameworks and Agentic Applications

September 23, 2025
Authors: Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, Ahmed E. Hassan
cs.AI

Abstract

Foundation model (FM)-based AI agents are rapidly gaining adoption across diverse domains, but their inherent non-determinism and non-reproducibility pose testing and quality assurance challenges. While recent benchmarks provide task-level evaluations, little is known about how developers verify the internal correctness of these agents during development. To address this gap, we conduct the first large-scale empirical study of testing practices in the AI agent ecosystem, analyzing 39 open-source agent frameworks and 439 agentic applications. We identify ten distinct testing patterns and find that novel, agent-specific methods such as DeepEval are seldom used (around 1% of tests), while traditional patterns such as negative and membership testing are widely adapted to manage FM uncertainty. By mapping these patterns onto the canonical architectural components of agent frameworks and agentic applications, we uncover a fundamental inversion of testing effort: deterministic components such as Resource Artifacts (tools) and Coordination Artifacts (workflows) consume over 70% of testing effort, while the FM-based Plan Body receives less than 5%. Most strikingly, the Trigger component (prompts) is almost entirely neglected, appearing in around 1% of all tests, exposing a critical blind spot. Our findings provide the first empirical testing baseline for FM-based agent frameworks and agentic applications, revealing a rational but incomplete adaptation to non-determinism. To close this gap, framework developers should improve support for novel testing methods, application developers should adopt prompt regression testing, and researchers should investigate the barriers to adoption. Strengthening these practices is vital for building more robust and dependable AI agents.
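
To make the patterns named in the abstract concrete, here is a minimal pytest-style sketch of membership, negative, and prompt regression testing. The `run_agent` wrapper and `SYSTEM_PROMPT` constant are hypothetical stand-ins for an FM-backed agent call, not code from the study or any particular framework.

```python
# Minimal illustrative sketch only. `run_agent` and SYSTEM_PROMPT are
# hypothetical stand-ins, not APIs from the paper or from any framework.
SYSTEM_PROMPT = "You are a helpful geography assistant. Answer concisely."


def run_agent(prompt: str) -> str:
    """Stand-in for an FM-backed agent call; a canned reply keeps the sketch offline."""
    return "The capital of France is Paris."


def test_membership():
    # Membership testing: assert that the non-deterministic output CONTAINS
    # an expected fragment rather than matching an exact string.
    assert "Paris" in run_agent("What is the capital of France?")


def test_negative():
    # Negative testing: assert that known-bad content never appears,
    # while tolerating otherwise variable FM phrasing.
    assert "I cannot help" not in run_agent("What is the capital of France?")


def test_prompt_regression():
    # Prompt regression testing (the pattern the study finds in ~1% of tests):
    # pin the prompt text so that silent edits fail in CI.
    assert SYSTEM_PROMPT == "You are a helpful geography assistant. Answer concisely."
```

Agent-specific tools such as DeepEval instead score each input/output pair with FM-based metrics rather than plain string assertions; the abstract reports roughly 1% adoption for such methods.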