CodeMonkeys: Scaling Test-Time Compute for Software Engineering

Abstract

Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
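The selection step described above, voting with model-generated tests before a final selection trajectory, can be illustrated with a minimal sketch. The abstract does not specify the scoring rule, so everything here is an assumption made for illustration: candidates are scored by how many generated tests they pass, ties are broken by candidate name, and the function name `shortlist_by_test_votes`, the `top_k` parameter, and the example data are all hypothetical. The shortlist it returns stands in for the input to the final multi-turn selection trajectory, which is not modeled.

```python
def shortlist_by_test_votes(results: dict[str, list[bool]], top_k: int = 3) -> list[str]:
    """Rank candidate edits by the number of model-generated tests they pass.

    `results` maps a candidate edit id to one pass/fail flag per generated test.
    Assumption: each passing test counts as one vote; the source does not
    state the actual voting rule.
    """
    votes = {candidate: sum(flags) for candidate, flags in results.items()}
    # Sort by descending vote count, then by name so ties resolve deterministically.
    ranked = sorted(votes, key=lambda candidate: (-votes[candidate], candidate))
    return ranked[:top_k]


# Hypothetical data: three candidate edits evaluated on three generated tests.
example_results = {
    "edit_a": [True, True, False],
    "edit_b": [True, True, True],
    "edit_c": [False, False, True],
}

shortlist = shortlist_by_test_votes(example_results, top_k=2)
print(shortlist)
```

In this sketch, a candidate that passes more generated tests ranks higher, and only the top few survive to the more expensive final selection step. This is one plausible reading of how voting and a dedicated selection trajectory could be combined, not a description of the released implementation.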
