Rethinking the Value of Agent-Generated Tests for LLM-Based Software Engineering Agents

Abstract

Large Language Model (LLM) code agents increasingly resolve repository-level issues by iteratively editing code, invoking tools, and validating candidate patches. In these workflows, agents often write tests on the fly, a paradigm adopted by many high-ranking agents on the SWE-bench leaderboard. However, we observe that GPT-5.2, which writes almost no new tests, still achieves performance comparable to top-ranking agents. This raises a critical question: do such tests meaningfully improve issue resolution, or do they merely mimic human testing practices while consuming a substantial interaction budget? To reveal the impact of agent-written tests, we present an empirical study that analyzes agent trajectories across six state-of-the-art LLMs on SWE-bench Verified. Our results show that although test writing is commonly adopted, resolved and unresolved tasks within the same model exhibit similar test-writing frequencies. Furthermore, these tests typically serve as observational feedback channels: agents strongly prefer value-revealing print statements over formal assertion-based checks. Based on these insights, we perform a controlled experiment in which we revise the prompts of four agents to either increase or reduce test writing. The results suggest that changes in the volume of agent-written tests do not significantly change final outcomes. Taken together, our study indicates that current test-writing practices may provide only marginal utility in autonomous software engineering tasks.
