Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents
February 8, 2026
Authors: Zhi Chen, Zhensu Sun, Yuling Shi, Chao Peng, Xiaodong Gu, David Lo, Lingxiao Jiang
cs.AI
Abstract
Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a practice adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, nevertheless achieves performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget?
To reveal the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that although test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels: agents favor value-revealing print statements over formal assertion-based checks. Based on these insights, we perform a controlled experiment that revises the prompts of four agents to either increase or reduce test writing. The results suggest that varying the volume of agent-written tests does not significantly change final outcomes. Taken together, our study reveals that current test-writing practices may provide only marginal utility in autonomous software engineering tasks.
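
To make the contrast between the two test styles concrete, the following minimal, self-contained sketch illustrates what is meant by value-revealing print statements versus assertion-based checks; the parse_version function and its values are hypothetical stand-ins, not code from the paper or from SWE-bench.

# Toy stand-in for a repository function an agent is repairing (hypothetical).
def parse_version(s: str) -> tuple:
    core, _, pre = s.partition("-")
    return tuple(int(x) for x in core.split(".")) + ((pre,) if pre else ())

# Observational style: reveal the value and let the agent read the output.
print(parse_version("1.2.0-rc1"))  # e.g. (1, 2, 0, 'rc1')

# Assertion style: encode the expected behavior as a formal pass/fail check.
assert parse_version("1.2.0-rc1") == (1, 2, 0, "rc1"), "pre-release tag lost"

The observational style only surfaces values for the agent to interpret in its next turn, whereas the assertion style fails loudly when the patch is wrong; the study finds agents rely far more on the former.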