SlopCodeBench: Benchmarking How Coding Agents Degrade Over Long-Horizon Iterative Tasks
March 25, 2026
Authors: Gabriel Orlanski, Devjeet Roy, Alexander Yun, Changho Shin, Alex Gu, Albert Ge, Dyah Adila, Frederic Sala, Aws Albarghouthi
cs.AI
Abstract
Software development is iterative, yet agentic coding benchmarks overwhelmingly evaluate single-shot solutions against complete specifications. Code can pass the test suite but become progressively harder to extend. Recent iterative benchmarks attempt to close this gap, but constrain the agent's design decisions too tightly to faithfully measure how code quality shapes future extensions. We introduce SlopCodeBench, a language-agnostic benchmark comprising 20 problems and 93 checkpoints, in which agents repeatedly extend their own prior solutions under evolving specifications that force architectural decisions without prescribing internal structure. We track two trajectory-level quality signals: verbosity, the fraction of redundant or duplicated code, and structural erosion, the share of complexity mass concentrated in high-complexity functions. No agent solves any problem end-to-end across 11 models; the highest checkpoint solve rate is 17.2%. Quality degrades steadily: erosion rises in 80% of trajectories and verbosity in 89.8%. Against 48 open-source Python repositories, agent code is 2.2x more verbose and markedly more eroded. Tracking 20 of those repositories over time shows that human code stays flat, while agent code deteriorates with each iteration. A prompt-intervention study shows that initial quality can be improved, but it does not halt degradation. These results demonstrate that pass-rate benchmarks systematically undermeasure extension robustness, and that current agents lack the design discipline iterative software development demands.
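The two trajectory-level signals can be made concrete with a small sketch. This is a hypothetical illustration only: the paper's exact definitions of verbosity and structural erosion, and any complexity threshold, are assumptions here, not the benchmark's actual implementation.

```python
def verbosity(duplicated_lines: int, total_lines: int) -> float:
    """Fraction of redundant or duplicated code in a snapshot
    (illustrative definition; the benchmark's measure may differ)."""
    return duplicated_lines / total_lines

def structural_erosion(func_complexities: list[int], threshold: int = 10) -> float:
    """Share of total complexity mass concentrated in high-complexity
    functions. The threshold of 10 is an assumed cutoff."""
    total = sum(func_complexities)
    heavy = sum(c for c in func_complexities if c > threshold)
    return heavy / total

# Toy trajectory: per-function complexity at two successive checkpoints.
checkpoint_1 = [3, 4, 5, 6]    # no function exceeds the threshold
checkpoint_2 = [3, 4, 12, 15]  # complexity concentrates in two functions

print(structural_erosion(checkpoint_1))  # 0.0
print(structural_erosion(checkpoint_2))  # 27/34, about 0.79
```

A rising erosion value across checkpoints, as in this toy trajectory, corresponds to the degradation pattern the abstract reports for 80% of agent trajectories.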