SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
March 25, 2026
Authors: Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi
cs.AI
Abstract
Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
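The two trajectory-level signals can be illustrated with a minimal sketch. The abstract defines verbosity as the fraction of redundant or duplicated code and structural erosion as the share of complexity mass concentrated in high-complexity functions; the exact complexity measure and cutoff are not specified here, so the `HIGH_COMPLEXITY_THRESHOLD` value and the input representation below are assumptions for illustration only:

```python
# Illustrative sketch of the two quality signals described in the abstract.
# The threshold and inputs are assumptions, not the paper's actual settings.

HIGH_COMPLEXITY_THRESHOLD = 10  # assumed cutoff for "high-complexity" functions


def structural_erosion(complexities):
    """Share of total complexity mass held by high-complexity functions.

    `complexities` is a per-function complexity score (e.g. cyclomatic
    complexity, one entry per function in the codebase).
    """
    total = sum(complexities)
    if total == 0:
        return 0.0
    high = sum(c for c in complexities if c > HIGH_COMPLEXITY_THRESHOLD)
    return high / total


def verbosity(redundant_lines, total_lines):
    """Fraction of the codebase flagged as redundant or duplicated."""
    return redundant_lines / total_lines if total_lines else 0.0


# Example: one function dominates the complexity budget.
print(structural_erosion([3, 4, 25]))  # 25 / 32 ≈ 0.78
print(verbosity(120, 1000))            # 0.12
```

Under this reading, a trajectory where erosion rises means successive checkpoints push an ever larger share of the complexity budget into a few oversized functions, even when the test suite still passes.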