TAM-Eval: Evaluating LLMs for Automated Unit Test Maintenance

Abstract

While Large Language Models (LLMs) have shown promise in software engineering, their application to unit testing remains largely confined to isolated test generation or oracle prediction, neglecting the broader challenge of test suite maintenance. We introduce TAM-Eval (Test Automated Maintenance Evaluation), a framework and benchmark designed to evaluate model performance across three core test maintenance scenarios: creation, repair, and updating of test suites. Unlike prior work limited to function-level tasks, TAM-Eval operates at the test-file level while retaining access to the full repository context during isolated evaluation, which better reflects real-world maintenance workflows. Our benchmark comprises 1,539 automatically extracted and validated scenarios from Python, Java, and Go projects. TAM-Eval supports system-agnostic evaluation of both raw LLMs and agentic workflows, using a reference-free protocol based on test suite pass rate, code coverage, and mutation testing. Empirical results indicate that state-of-the-art LLMs have limited capabilities in realistic test maintenance processes and yield only marginal improvements in test effectiveness. We release TAM-Eval as an open-source framework to support future research in automated software testing. Our data and code are publicly available at https://github.com/trndcenter/TAM-Eval.
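As a rough illustration of the three reference-free signals named in the abstract (pass rate, code coverage, mutation testing), the sketch below computes each as a simple ratio from raw counts and compares a test file before and after a model's edit. This is not TAM-Eval's implementation: the `SuiteRun` record, its field names, and the `summarize` and `improvement` helpers are hypothetical, and the abstract does not say how the framework collects or aggregates these numbers.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class SuiteRun:
    """Hypothetical raw counts from executing one test file (not TAM-Eval's schema)."""
    tests_passed: int
    tests_total: int
    lines_covered: int
    lines_total: int
    mutants_killed: int
    mutants_total: int


def ratio(numerator: int, denominator: int) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def summarize(run: SuiteRun) -> Dict[str, float]:
    """Compute the three reference-free signals as plain ratios in [0, 1]."""
    return {
        "pass_rate": ratio(run.tests_passed, run.tests_total),
        "line_coverage": ratio(run.lines_covered, run.lines_total),
        # Mutation score: fraction of injected faults the tests detect.
        "mutation_score": ratio(run.mutants_killed, run.mutants_total),
    }


def improvement(before: SuiteRun, after: SuiteRun) -> Dict[str, float]:
    """Per-metric change between the original and the model-edited test file."""
    old, new = summarize(before), summarize(after)
    return {name: new[name] - old[name] for name in old}
```

Because none of these ratios needs a ground-truth test file to compare against, a scorer of this shape can be applied to the output of either a raw LLM or an agentic workflow.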
