SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks

Abstract

Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite yet become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but they constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. Across 11 models, no agent solves any problem end-to-end, and the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Compared with 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved but that degradation is not halted. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline that iterative software development demands.
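To make the two quality signals concrete, here is a minimal sketch of how they might be computed for a single Python file. It is not the benchmark's implementation: the abstract does not give exact definitions, so the branch-counting proxy for complexity, the `HIGH_COMPLEXITY` threshold, the exact-duplicate-line proxy for verbosity, and all function names below are assumptions chosen for illustration.

```python
import ast
from collections import Counter

# Node types counted as decision points. This is an assumed, rough
# stand-in for cyclomatic complexity, not the benchmark's own metric.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.BoolOp, ast.IfExp, ast.comprehension)

HIGH_COMPLEXITY = 10  # hypothetical cutoff for a "high-complexity" function


def function_complexities(source: str) -> list[int]:
    """Return 1 + the number of branch nodes for every function in source.

    Nested functions are also counted inside their parent, which is a
    simplification of this sketch.
    """
    tree = ast.parse(source)
    scores = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            branches = sum(isinstance(n, BRANCH_NODES) for n in ast.walk(node))
            scores.append(1 + branches)
    return scores


def erosion(source: str, threshold: int = HIGH_COMPLEXITY) -> float:
    """Share of total complexity held by functions at or above threshold."""
    scores = function_complexities(source)
    total = sum(scores)
    if total == 0:
        return 0.0
    return sum(s for s in scores if s >= threshold) / total


def verbosity(source: str) -> float:
    """Fraction of non-blank lines that repeat an earlier identical line."""
    lines = [ln.strip() for ln in source.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    counts = Counter(lines)
    duplicated = sum(c - 1 for c in counts.values())
    return duplicated / len(lines)


SAMPLE = """
def small(x):
    return x + 1

def big(x):
    if x > 0:
        x += 1
    if x > 1:
        x += 1
    if x > 2:
        x += 1
    return x
"""

if __name__ == "__main__":
    # A low threshold is used only so the toy sample has a "complex" function.
    print(erosion(SAMPLE, threshold=3))  # 0.8
    print(verbosity(SAMPLE))             # 0.2
```

In the sample, `big` holds 4 of the 5 total complexity points, so erosion is 0.8 at a threshold of 3, and two of the ten non-blank lines repeat an earlier line, so verbosity is 0.2. Tracking these two numbers at each checkpoint of a trajectory would give the kind of per-iteration curve the abstract describes.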

