TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance
January 26, 2026
Authors: Elena Bruches, Vadim Alperovich, Dari Baturova, Roman Derunets, Daniil Grebenkin, Georgy Mkrtchyan, Oleg Sedukhin, Mikhail Klementev, Ivan Bondarenko, Nikolay Bushkov, Stanislav Moiseev
cs.AI
Abstract
While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level, while maintaining access to full repository context during isolated evaluation, better reflecting real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
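To make the reference-free protocol concrete, the sketch below shows how the three metrics named in the abstract (test suite pass rate, code coverage, and mutation score) could be collected per scenario. The `ScenarioResult` record and `score` function are hypothetical illustrations for a single scenario, not TAM-Eval's actual API or weighting; consult the repository for the real implementation.

```python
# Illustrative only: a minimal sketch of reference-free scoring for one
# maintenance scenario, assuming a hypothetical ScenarioResult record.
# TAM-Eval's actual interfaces and aggregation may differ.
from dataclasses import dataclass


@dataclass
class ScenarioResult:
    tests_passed: int    # tests in the maintained suite that pass
    tests_total: int     # all tests in the maintained suite
    coverage: float      # measured line/branch coverage in [0, 1]
    mutants_killed: int  # mutants detected by the suite
    mutants_total: int   # mutants injected by the mutation tool


def score(result: ScenarioResult) -> dict:
    """Compute the three reference-free metrics named in the abstract."""
    pass_rate = (result.tests_passed / result.tests_total
                 if result.tests_total else 0.0)
    mutation_score = (result.mutants_killed / result.mutants_total
                      if result.mutants_total else 0.0)
    return {
        "pass_rate": pass_rate,
        "coverage": result.coverage,
        "mutation_score": mutation_score,
    }


# Example: 18/20 tests pass, 72% coverage, 31/50 injected mutants killed.
print(score(ScenarioResult(18, 20, 0.72, 31, 50)))
# -> {'pass_rate': 0.9, 'coverage': 0.72, 'mutation_score': 0.62}
```

Reporting the three metrics separately (rather than as a single blended score) keeps the evaluation interpretable: a suite can compile and pass while still catching few mutants, which is exactly the gap the abstract attributes to current LLMs.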