

TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

January 26, 2026
Authors: Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
cs.AI

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
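To make the reference-free protocol concrete, the sketch below shows one way a scenario's outcome could be summarized from the three signals the abstract names: test suite pass rate, code coverage, and mutation testing. This is an illustrative assumption, not TAM-Eval's actual implementation; the class and function names are hypothetical.

```python
from dataclasses import dataclass

# Hypothetical sketch of a reference-free scoring scheme in the spirit of
# TAM-Eval's protocol: a maintained test suite is judged by whether its tests
# pass (pass rate), how much of the target code it exercises (coverage), and
# how many injected faults it detects (mutation score). All names here are
# illustrative, not taken from the TAM-Eval codebase.

@dataclass
class SuiteResult:
    tests_passed: int      # tests that pass after the model's edit
    tests_total: int       # all tests in the maintained suite
    line_coverage: float   # fraction of target lines executed, in [0, 1]
    mutants_killed: int    # injected mutants detected by the suite
    mutants_total: int     # mutants generated for the target code

def reference_free_metrics(r: SuiteResult) -> dict:
    """Return the three reference-free metrics for one maintenance scenario."""
    pass_rate = r.tests_passed / r.tests_total if r.tests_total else 0.0
    mutation_score = r.mutants_killed / r.mutants_total if r.mutants_total else 0.0
    return {
        "pass_rate": pass_rate,
        "coverage": r.line_coverage,
        "mutation_score": mutation_score,
    }

if __name__ == "__main__":
    example = SuiteResult(tests_passed=18, tests_total=20,
                          line_coverage=0.74, mutants_killed=31, mutants_total=50)
    print(reference_free_metrics(example))
```

Because none of these metrics requires a ground-truth ("reference") test suite, they can be computed directly on the model's output against the repository, which is what makes the protocol system-agnostic across raw LLMs and agentic workflows.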