

Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

May 29, 2025
Authors: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan
cs.AI

Abstract

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially for models with fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them with a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a small number of samples. Code, data, and models will be fully open-sourced.
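The selection-and-mutation loop the abstract describes can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the population sizes, and the pluggable `generate`, `mutate`, and `score` callables are all hypothetical stand-ins (in EvoScale itself, generation and mutation are performed by the LM and, after RL training, no external scorer is needed at inference time).

```python
def evoscale_sketch(generate, mutate, score, n_init=4, n_keep=2, iterations=3):
    """Illustrative evolutionary test-time scaling loop (hypothetical API).

    generate: () -> candidate   draws an initial candidate (e.g. an LM sample)
    mutate:   candidate -> candidate   refines a candidate (e.g. LM revision)
    score:    candidate -> float       verifier score to maximize
    """
    # Sample an initial population of candidates.
    population = [generate() for _ in range(n_init)]
    for _ in range(iterations):
        # Selection: keep the top-scoring candidates.
        population.sort(key=score, reverse=True)
        survivors = population[:n_keep]
        # Mutation: refine survivors, shifting the distribution
        # toward higher-scoring regions.
        population = survivors + [mutate(s) for s in survivors]
    return max(population, key=score)
```

With toy callables (integers as "candidates", identity as the scorer, `+1` as mutation), each iteration visibly raises the best score, which is the sample-efficiency argument in miniature: fewer total samples are needed than with one-shot best-of-N sampling.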
