Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills

Abstract

LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.
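The closed loop described above (traces, then skills, then validated and scored tasks) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration rather than the authors' implementation: the `Trace` and `Task` types, the `failure_tag` field, treating a "skill" as a frequent failure tag, and modelling the solver-gradient alignment reward as the cosine similarity between a task's gradient and a target improvement direction. The abstract does not define the reward, so the cosine form is only a stand-in.

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import List


@dataclass
class Trace:
    """Hypothetical record of one solving attempt."""
    task_id: str
    failure_tag: str  # assumed label for the observed failure pattern
    solved: bool


@dataclass
class Task:
    """Hypothetical candidate repair task generated from a skill."""
    skill: str
    repo: str
    passes_validation: bool  # stand-in for execution-based validation
    grad: List[float]        # stand-in for the solver gradient on this task


def distill_skills(traces: List[Trace], top_k: int = 2) -> List[str]:
    """Summarize recurring failures as skills (here: most frequent failure tags)."""
    counts = Counter(t.failure_tag for t in traces if not t.solved)
    return [tag for tag, _ in counts.most_common(top_k)]


def cosine(u: List[float], v: List[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def select_tasks(
    candidates: List[Task],
    skills: List[str],
    target_grad: List[float],
    threshold: float = 0.0,
) -> List[Task]:
    """Keep tasks that target a distilled skill, pass validation,
    and whose assumed alignment score exceeds the threshold."""
    kept = []
    for task in candidates:
        if task.skill not in skills:
            continue  # not targeted at a current weakness
        if not task.passes_validation:
            continue  # not verifiable by execution
        if cosine(task.grad, target_grad) > threshold:
            kept.append(task)  # verifiable and (by assumption) useful
    return kept
```

In a full loop, the retained tasks would be used to update the Solver, whose new traces would feed `distill_skills` again in the next round. That update step is omitted here because the abstract does not specify it.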