
Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

May 29, 2025
作者: Guangtao Zeng, Maohao Shen, Delin Chen, Zhenting Qi, Subhro Das, Dan Gutfreund, David Cox, Gregory Wornell, Wei Lu, Zhang-Wei Hong, Chuang Gan
cs.AI

Abstract

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially for models with fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them with a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a small number of samples. Code, data, and models will be fully open-sourced.
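The selection-and-mutation loop the abstract describes can be sketched in miniature. This is a hedged illustration, not the paper's implementation: `generate`, `mutate`, and `score` are hypothetical stand-ins for LM sampling, conditioning new samples on selected candidates, and verifier scoring respectively (in the paper's final method, the model is trained with RL to perform the mutation step without an external verifier at inference time). The toy objective below just demonstrates how elitist selection makes the best score monotonically non-decreasing across iterations.

```python
import random


def evo_scale(generate, mutate, score, iterations=5, pop_size=8, top_k=2, seed=0):
    """Toy evolutionary test-time scaling loop: sample, score, select, mutate.

    Returns the best candidate found and the per-iteration best-score history.
    """
    rng = random.Random(seed)
    # Initial generation: draw a population of candidate outputs.
    population = [generate(rng) for _ in range(pop_size)]
    history = [max(score(c) for c in population)]
    for _ in range(iterations):
        # Selection: keep the top-k candidates by verifier score (elitism).
        elites = sorted(population, key=score, reverse=True)[:top_k]
        # Mutation: refill the population with variants of the selected elites.
        population = elites + [
            mutate(rng.choice(elites), rng) for _ in range(pop_size - top_k)
        ]
        history.append(max(score(c) for c in population))
    best = max(population, key=score)
    return best, history


# Toy problem: the "verifier" rewards candidates near 0.9.
score = lambda x: -(x - 0.9) ** 2
generate = lambda rng: rng.uniform(0.0, 1.0)
mutate = lambda parent, rng: parent + rng.gauss(0.0, 0.05)

best, history = evo_scale(generate, mutate, score)
```

Because the elites are carried over unchanged each round, the best score can only improve or stay flat, which is the sense in which the loop shifts samples toward higher-scoring regions with far fewer total generations than one-shot best-of-n sampling.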
