CodeClash: Benchmarking Goal-Oriented Software Engineering
November 2, 2025
Authors: John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Ofir Press, Ludwig Schmidt, Diyi Yang
cs.AI
Abstract
Current benchmarks for coding evaluate language models (LMs) on concrete,
well-specified tasks such as fixing specific bugs or writing targeted tests.
However, human programmers do not spend their days addressing isolated tasks.
Instead, real-world software development is grounded in the pursuit of
high-level goals, like improving user retention or reducing costs. Evaluating
whether LMs can also iteratively develop code to better accomplish open-ended
objectives without any explicit guidance remains an open challenge. To address
this, we introduce CodeClash, a benchmark where LMs compete in multi-round
tournaments to build the best codebase for achieving a competitive objective.
Each round proceeds in two phases: agents edit their code, then their codebases
compete head-to-head in a code arena that determines winners based on
objectives like score maximization, resource acquisition, or survival. Whether
it's writing notes, scrutinizing documentation, analyzing competition logs, or
creating test suites, models must decide for themselves how to improve their
codebases both absolutely and against their opponents. We run 1,680 tournaments
(25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal
that while models exhibit diverse development styles, they share fundamental
limitations in strategic reasoning. Models also struggle with long-term
codebase maintenance, as repositories grow progressively messier and more redundant.
These limitations are stark: top models lose every round against expert human
programmers. We open-source CodeClash to advance the study of autonomous,
goal-oriented code development.
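
To make the two-phase round structure concrete, here is a minimal Python sketch of
one tournament loop. Everything below is a hypothetical stand-in rather than the
actual CodeClash API: `propose_edits`, `arena.run`, and the `Codebase` container
are assumed interfaces, and the default of 15 rounds per tournament is inferred
from the abstract's figures (25,200 rounds across 1,680 tournaments).

```python
from dataclasses import dataclass, field


@dataclass
class Codebase:
    """A competitor's repository, edited by its LM agent each round
    (hypothetical container, not the actual CodeClash data model)."""
    name: str
    files: dict[str, str] = field(default_factory=dict)
    wins: int = 0


def edit_phase(agent, codebase: Codebase, arena_logs: list[str]) -> None:
    # Phase 1: the agent edits its codebase however it sees fit -- writing
    # notes, adding tests, or reacting to the previous round's arena logs.
    # `propose_edits` is an assumed interface returning {path: new contents}.
    for path, contents in agent.propose_edits(codebase.files, arena_logs).items():
        codebase.files[path] = contents


def arena_phase(arena, a: Codebase, b: Codebase) -> list[str]:
    # Phase 2: the two codebases compete head-to-head; the arena scores them
    # on its objective (score maximization, resource acquisition, survival).
    # `arena.run` is an assumed interface returning (score_a, score_b, logs).
    score_a, score_b, logs = arena.run(a.files, b.files)
    (a if score_a >= score_b else b).wins += 1
    return logs


def run_tournament(agents, arena, a: Codebase, b: Codebase,
                   rounds: int = 15) -> Codebase:
    """Alternate edit and arena phases; most round wins takes the tournament."""
    logs: list[str] = []
    for _ in range(rounds):
        edit_phase(agents[0], a, logs)
        edit_phase(agents[1], b, logs)
        logs = arena_phase(arena, a, b)
    return max((a, b), key=lambda cb: cb.wins)
```

Under this framing, the arena log is the agent's only feedback channel between
rounds, which matches the abstract's emphasis on models deciding for themselves
how to analyze competition logs and improve both absolutely and relative to
their opponents.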