CodeClash: Benchmarking Goal-Oriented Software Engineering

Abstract

Current benchmarks for coding evaluate language models (LMs) on concrete, well-specified tasks such as fixing specific bugs or writing targeted tests. However, human programmers do not spend all day incessantly addressing isolated tasks. Instead, real-world software development is grounded in the pursuit of high-level goals, like improving user retention or reducing costs. Evaluating whether LMs can also iteratively develop code to better accomplish open-ended objectives without any explicit guidance remains an open challenge. To address this, we introduce CodeClash, a benchmark where LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, then their codebases compete head-to-head in a code arena that determines winners based on objectives like score maximization, resource acquisition, or survival. Whether it's writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases both absolutely and against their opponents. We run 1,680 tournaments (25,200 rounds total) to evaluate 8 LMs across 6 arenas. Our results reveal that while models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as repositories become progressively messy and redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
