TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level while retaining access to the full repository context during isolated evaluation, which better reflects real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
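The abstract names three reference-free signals (test suite pass rate, code coverage, and mutation testing) but does not say how they are computed or combined. The following is a minimal sketch of how such signals could be summarized for a test file before and after a model edits it. The `SuiteRun` record, its field names, and the `improvement` helper are hypothetical and are not taken from the TAM-Eval code base; line coverage and a killed/total mutation score are assumed here as the simplest common definitions.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class SuiteRun:
    """Hypothetical summary of one test-suite execution (not TAM-Eval's schema)."""
    tests_passed: int
    tests_total: int
    lines_covered: int
    lines_total: int
    mutants_killed: int
    mutants_total: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(run: SuiteRun) -> Dict[str, float]:
    """Compute the three reference-free signals for a single run."""
    return {
        "pass_rate": ratio(run.tests_passed, run.tests_total),
        "coverage": ratio(run.lines_covered, run.lines_total),
        "mutation_score": ratio(run.mutants_killed, run.mutants_total),
    }


def improvement(before: SuiteRun, after: SuiteRun) -> Dict[str, float]:
    """Per-metric difference (after minus before); positive means the edit helped."""
    b, a = summarize(before), summarize(after)
    return {name: a[name] - b[name] for name in a}


if __name__ == "__main__":
    # Illustrative numbers only; they are not results from the paper.
    before = SuiteRun(8, 10, 60, 100, 5, 20)
    after = SuiteRun(10, 10, 75, 100, 8, 20)
    for name, delta in improvement(before, after).items():
        print(f"{name}: {delta:+.2f}")
```

Because no reference test file is needed, a comparison of this kind works for any system that emits a test file, which is consistent with the system-agnostic evaluation the abstract describes.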
