TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test file level while maintaining access to the full repository context during isolated evaluation, which better reflects real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
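As a rough illustration of what a reference-free protocol built on pass rate, code coverage, and mutation testing might compute, the sketch below derives the three ratios from raw counts and compares a test suite before and after a model's edit. It assumes the textbook definitions of these metrics. The `SuiteRun` structure and the `metrics` and `delta` helpers are hypothetical names introduced here and are not taken from the TAM-Eval code base, whose exact formulas may differ.

```python
from dataclasses import dataclass


@dataclass
class SuiteRun:
    """Raw counts from one test-suite execution (hypothetical structure)."""
    tests_passed: int
    tests_total: int
    lines_covered: int
    lines_total: int
    mutants_killed: int
    mutants_total: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def metrics(run: SuiteRun) -> dict[str, float]:
    """Compute the three ratios using their standard definitions (an assumption)."""
    return {
        "pass_rate": ratio(run.tests_passed, run.tests_total),
        "line_coverage": ratio(run.lines_covered, run.lines_total),
        "mutation_score": ratio(run.mutants_killed, run.mutants_total),
    }


def delta(before: SuiteRun, after: SuiteRun) -> dict[str, float]:
    """Per-metric change between the original suite and the model-edited suite."""
    b, a = metrics(before), metrics(after)
    return {name: a[name] - b[name] for name in a}
```

No reference test file appears in this computation: the edited suite is judged only by how it behaves when executed, which is what makes such a protocol applicable to both raw LLMs and agentic workflows.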
