Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, achieves performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget? To assess the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that although test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels: agents use value-revealing print statements significantly more often than formal assertion-based checks. Based on these insights, we perform a controlled experiment by revising the prompts of four agents to either increase or reduce test writing. The results suggest that changes in the volume of agent-written tests do not significantly change final outcomes. Taken together, our study suggests that current test-writing practices may provide only marginal utility in autonomous software engineering tasks.
