CodeClash: Benchmarking Goal-Oriented Software Engineering
November 2, 2025
Authors: John Yang, Kilian Lieret, Joyce Yang, Carlos E. Jimenez, Ofir Press, Ludwig Schmidt, Diyi Yang
cs.AI
Abstract
Current benchmarks for coding evaluate language models (LMs) on concrete,
well-specified tasks such as fixing specific bugs or writing targeted tests.
However, human programmers do not spend all day incessantly addressing isolated
tasks. Instead, real-world software development is grounded in the pursuit of
high-level goals, like improving user retention or reducing costs. Evaluating
whether LMs can also iteratively develop code to better accomplish open-ended
objectives without any explicit guidance remains an open challenge. To address
this, we introduce CodeClash, a benchmark where LMs compete in multi-round
tournaments to build the best codebase for achieving a competitive objective.
Each round proceeds in two phases: agents edit their code, then their codebases
compete head-to-head in a code arena that determines winners based on
objectives like score maximization, resource acquisition, or survival. Whether
it's writing notes, scrutinizing documentation, analyzing competition logs, or
creating test suites, models must decide for themselves how to improve their
codebases both absolutely and against their opponents. We run 1,680 tournaments
(25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal
that while models exhibit diverse development styles, they share fundamental
limitations in strategic reasoning. Models also struggle with long-term
codebase maintenance, as repositories become progressively messy and redundant.
These limitations are stark: top models lose every round against expert human
programmers. We open-source CodeClash to advance the study of autonomous,
goal-oriented code development.
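The two-phase round structure described above (agents edit, then codebases compete) can be sketched as a simple tournament loop. This is an illustrative sketch only; the function names, agent interface, and arena signature are assumptions for exposition, not the actual CodeClash API.

```python
def run_tournament(agents, arena, num_rounds):
    """Run a multi-round tournament between coding agents (illustrative sketch).

    agents: dict mapping agent name -> callable that takes the agent's current
            codebase (a string here, for simplicity) and returns an edited one.
    arena:  callable mapping a codebase to a numeric score, standing in for a
            competitive objective like score maximization or survival time.
    """
    codebases = {name: "" for name in agents}  # each agent starts from scratch
    wins = {name: 0 for name in agents}
    for _ in range(num_rounds):
        # Phase 1: each agent revises its own codebase.
        for name, agent in agents.items():
            codebases[name] = agent(codebases[name])
        # Phase 2: codebases compete head-to-head; highest arena score wins the round.
        scores = {name: arena(cb) for name, cb in codebases.items()}
        winner = max(scores, key=scores.get)
        wins[winner] += 1
    return wins
```

A usage example with toy agents: one agent that grows its codebase faster than the other will win every round when the arena simply rewards codebase size.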