CodeClash: Benchmarking Goal-Oriented Software Engineering

Abstract

Current coding benchmarks evaluate language models (LMs) on concrete, well-specified tasks such as fixing a specific bug or writing targeted tests. Human programmers, however, do not spend their days working through isolated tasks. Real-world software development is instead grounded in the pursuit of high-level goals, such as improving user retention or reducing costs. Evaluating whether LMs can likewise develop code iteratively to better accomplish open-ended objectives, without explicit guidance, remains an open challenge. To address this, we introduce CodeClash, a benchmark in which LMs compete in multi-round tournaments to build the best codebase for achieving a competitive objective. Each round proceeds in two phases: agents edit their code, and then their codebases compete head-to-head in a code arena that determines winners based on objectives such as score maximization, resource acquisition, or survival. Whether by writing notes, scrutinizing documentation, analyzing competition logs, or creating test suites, models must decide for themselves how to improve their codebases, both in absolute terms and relative to their opponents. We run 1,680 tournaments (25,200 rounds in total) to evaluate 8 LMs across 6 arenas. Our results show that although models exhibit diverse development styles, they share fundamental limitations in strategic reasoning. Models also struggle with long-term codebase maintenance, as their repositories become progressively messier and more redundant. These limitations are stark: top models lose every round against expert human programmers. We open-source CodeClash to advance the study of autonomous, goal-oriented code development.
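The two-phase round structure described above can be illustrated with a minimal Python sketch. This is not the CodeClash implementation: the names `Player`, `run_tournament`, `toy_edit`, and `toy_arena` are hypothetical, the edit step is a stand-in for an LM agent, and the arena's scoring rule is invented for illustration. The default of 15 rounds is only what the reported totals imply on average (25,200 / 1,680), assuming every tournament has the same length.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Player:
    """Hypothetical competitor: a name plus a codebase (filename -> source)."""
    name: str
    codebase: Dict[str, str] = field(default_factory=dict)
    wins: int = 0


def run_tournament(
    players: List[Player],
    edit: Callable[[Player, List[str]], None],
    arena: Callable[[List[Player]], Player],
    rounds: int = 15,  # assumption: 25,200 rounds / 1,680 tournaments
) -> List[str]:
    """Run a multi-round tournament and return the per-round log."""
    logs: List[str] = []
    for r in range(rounds):
        # Edit phase: each player revises its own codebase, seeing past logs.
        for p in players:
            edit(p, logs)
        # Competition phase: the arena pits the codebases against each other.
        winner = arena(players)
        winner.wins += 1
        logs.append(f"round {r + 1}: {winner.name} won")
    return logs


def toy_edit(player: Player, logs: List[str]) -> None:
    """Stand-in for an LM agent: append one comment line per round."""
    source = player.codebase.get("bot.py", "")
    player.codebase["bot.py"] = source + f"# revision after {len(logs)} rounds\n"


def toy_arena(players: List[Player]) -> Player:
    """Invented scoring rule: largest codebase wins, ties go to the first player."""
    return max(players, key=lambda p: sum(len(s) for s in p.codebase.values()))
```

In this sketch the edit function receives the accumulated logs, which mirrors the idea that models may analyze past competition results when deciding how to change their code; how the real benchmark exposes such information is not specified here.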
