

CodeMonkeys: Scaling Test-Time Compute for Software Engineering

January 24, 2025
Authors: Ryan Ehrlich, Bradley Brown, Jordan Juravsky, Ronald Clark, Christopher Ré, Azalia Mirhoseini
cs.AI

Abstract

Scaling test-time compute is a promising axis for improving LLM capabilities. However, test-time compute can be scaled in a variety of ways, and effectively combining different approaches remains an active area of research. Here, we explore this problem in the context of solving real-world GitHub issues from the SWE-bench dataset. Our system, named CodeMonkeys, allows models to iteratively edit a codebase by jointly generating and running a testing script alongside their draft edit. We sample many of these multi-turn trajectories for every issue to generate a collection of candidate edits. This approach lets us scale "serial" test-time compute by increasing the number of iterations per trajectory and "parallel" test-time compute by increasing the number of trajectories per problem. With parallel scaling, we can amortize up-front costs across multiple downstream samples, allowing us to identify relevant codebase context using the simple method of letting an LLM read every file. In order to select between candidate edits, we combine voting using model-generated tests with a final multi-turn trajectory dedicated to selection. Overall, CodeMonkeys resolves 57.4% of issues from SWE-bench Verified using a budget of approximately 2300 USD. Our selection method can also be used to combine candidates from different sources. Selecting over an ensemble of edits from existing top SWE-bench Verified submissions obtains a score of 66.2% and outperforms the best member of the ensemble on its own. We fully release our code and data at https://scalingintelligence.stanford.edu/pubs/codemonkeys.
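To make the serial/parallel distinction concrete, the sketch below mocks the sampling-and-voting loop described in the abstract: each trajectory iterates on an edit together with a model-generated test ("serial" compute), many trajectories are sampled per issue ("parallel" compute), and a final edit is chosen by voting with the generated tests. This is a minimal illustration, not the released CodeMonkeys implementation; all model and test calls are stubbed, and names such as `generate_edit_and_test` are hypothetical placeholders. The dedicated multi-turn selection trajectory mentioned in the abstract is omitted.

```python
"""Minimal sketch of a CodeMonkeys-style test-time scaling loop (assumed, simplified)."""

from dataclasses import dataclass
import random


@dataclass
class Candidate:
    edit: str          # proposed patch to the codebase
    test_script: str   # model-generated test accompanying the edit


def generate_edit_and_test(issue: str, history: list[str]) -> Candidate:
    # Placeholder for an LLM call that drafts an edit and a testing script,
    # conditioned on the issue and on feedback from earlier turns.
    return Candidate(edit=f"patch-{random.randint(0, 9)}", test_script="test.py")


def run_test(candidate: Candidate) -> bool:
    # Placeholder for executing the testing script against the edited codebase.
    return random.random() < 0.5


def run_trajectory(issue: str, num_iterations: int) -> Candidate:
    """One multi-turn trajectory: iterate edit + test until the test passes
    or the serial budget is exhausted (scales 'serial' compute)."""
    history: list[str] = []
    candidate = generate_edit_and_test(issue, history)
    for _ in range(num_iterations):
        if run_test(candidate):
            break
        history.append("test failed")  # feedback fed into the next turn
        candidate = generate_edit_and_test(issue, history)
    return candidate


def solve_issue(issue: str, num_trajectories: int, num_iterations: int) -> Candidate:
    """Sample many independent trajectories (scales 'parallel' compute),
    then pick a final edit by voting with the model-generated tests."""
    candidates = [run_trajectory(issue, num_iterations) for _ in range(num_trajectories)]

    # Voting: score each candidate edit by how many generated tests it passes.
    def votes(cand: Candidate) -> int:
        return sum(run_test(Candidate(cand.edit, other.test_script)) for other in candidates)

    return max(candidates, key=votes)


if __name__ == "__main__":
    best = solve_issue("example GitHub issue", num_trajectories=8, num_iterations=4)
    print("selected edit:", best.edit)
```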
