Satori-SWE: Evolutionary Test-Time Scaling for Sample-Efficient Software Engineering

Abstract

Language models (LMs) perform well on standardized coding benchmarks but struggle with real-world software engineering tasks such as resolving GitHub issues in SWE-Bench, especially for models with fewer than 100B parameters. While smaller models are preferable in practice due to their lower computational cost, improving their performance remains challenging. Existing approaches primarily rely on supervised fine-tuning (SFT) with high-quality data, which is expensive to curate at scale. An alternative is test-time scaling: generating multiple outputs, scoring them with a verifier, and selecting the best one. Although effective, this strategy often requires excessive sampling and costly scoring, limiting its practical application. We propose Evolutionary Test-Time Scaling (EvoScale), a sample-efficient method that treats generation as an evolutionary process. By iteratively refining outputs via selection and mutation, EvoScale shifts the output distribution toward higher-scoring regions, reducing the number of samples needed to find correct solutions. To reduce the overhead of repeated sampling and selection, we train the model to self-evolve using reinforcement learning (RL). Rather than relying on external verifiers at inference time, the model learns to self-improve the scores of its own generations across iterations. Evaluated on SWE-Bench-Verified, EvoScale enables our 32B model, Satori-SWE-32B, to match or exceed the performance of models with over 100B parameters while using only a few samples. Code, data, and models will be fully open-sourced.
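The selection-and-mutation loop described above can be illustrated with a toy sketch. This is not the authors' implementation: the abstract does not specify how candidates are generated, scored, or mutated, so the function `evolve` and its parameters (`sample`, `mutate`, `score`, `population`, `keep`, `iterations`) are hypothetical names chosen for illustration. Integers stand in for candidate patches, and an explicit scoring function stands in for a verifier; the RL-trained self-evolution step, which removes the need for an external scorer at inference time, is not modeled here.

```python
import random
from typing import Callable, List


def evolve(
    sample: Callable[[], int],
    mutate: Callable[[int], int],
    score: Callable[[int], float],
    population: int = 8,
    keep: int = 2,
    iterations: int = 5,
) -> int:
    """Toy selection-and-mutation loop (hypothetical, for illustration only).

    Returns the highest-scoring candidate in the final population.
    """
    # Initial population: independent draws, as in plain best-of-N sampling.
    candidates: List[int] = [sample() for _ in range(population)]
    for _ in range(iterations):
        # Selection: keep the highest-scoring candidates.
        parents = sorted(candidates, key=score, reverse=True)[:keep]
        # Mutation: derive new candidates from the survivors.
        children = [
            mutate(random.choice(parents)) for _ in range(population - keep)
        ]
        # Parents are carried forward, so the best score never decreases.
        candidates = parents + children
    return max(candidates, key=score)


if __name__ == "__main__":
    random.seed(0)
    target = 42  # stand-in for "the correct solution"
    best = evolve(
        sample=lambda: random.randint(0, 100),
        mutate=lambda x: x + random.randint(-3, 3),
        score=lambda x: -abs(x - target),
    )
    print(best, -abs(best - target))
```

With `iterations=0` the function reduces to best-of-N over the initial draws, which makes the contrast with plain test-time scaling easy to see: the evolutionary variant spends its sampling budget near candidates that already score well instead of drawing every sample independently.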
