Socratic-SWE: Self-Evolving Coding Agents via Trace-Derived Agent Skills
June 5, 2026
Authors: Chuan Xiao, Zhengbo Jiao, Shaobo Wang, Wei Wang, Bing Zhao, Hu Wei, Linfeng Zhang, Lin Qu
cs.AI
Abstract
LLM-driven software engineering agents have become a central testbed for real-world language-model capability, yet their training remains limited by the availability of high-quality SWE tasks. Existing synthetic data methods typically create tasks through fixed mutation or bug-injection procedures, making the resulting distributions largely independent of the agent's own weaknesses and training progress. We introduce Socratic-SWE, a closed-loop self-evolution framework that reuses the agent's historical solving traces as a source of training signal. Rather than treating traces only as evidence for reward computation, Socratic-SWE distills them into structured agent skills that summarize recurring failures and effective repair patterns. These skills then guide the generation of targeted repair tasks in real repositories. Candidate tasks are checked through execution-based validation and scored with a solver-gradient alignment reward, so that the retained tasks are both verifiable and useful for improving the Solver. The updated Solver produces new traces, enabling the task curriculum to adapt over successive rounds. Across SWE-bench Verified, SWE-bench Lite, SWE-bench Pro, and Terminal-Bench 2.0, Socratic-SWE consistently improves over self-evolving baselines under the same compute budget, reaching 50.40% on SWE-bench Verified after three iterations. These results suggest that solving traces can serve as a scalable substrate for self-evolving SWE agents.
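The closed loop described in the abstract can be sketched roughly as follows. This is only an illustrative skeleton, not the authors' implementation: every name here (`Trace`, `Skill`, `Task`, `self_evolve`, and the callables `solve`, `distill`, `generate`, `validates`, `reward`, `update`) is hypothetical, and the abstract does not say how skills are distilled, how the solver-gradient alignment reward is computed, or which tasks the new traces are collected on. Top-k retention and reusing the retained tasks as the next round's task pool are assumptions made to keep the sketch concrete.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple, TypeVar

S = TypeVar("S")  # opaque Solver state (e.g. model weights); hypothetical


@dataclass(frozen=True)
class Trace:
    task_id: str
    solved: bool
    log: str = ""  # raw solving trajectory; contents are an assumption


@dataclass(frozen=True)
class Skill:
    summary: str  # a recurring failure or an effective repair pattern


@dataclass(frozen=True)
class Task:
    repo: str
    description: str


def self_evolve(
    solver: S,
    seed_tasks: Sequence[Task],
    rounds: int,
    *,
    solve: Callable[[S, Task], Trace],
    distill: Callable[[List[Trace]], List[Skill]],
    generate: Callable[[List[Skill]], List[Task]],
    validates: Callable[[Task], bool],
    reward: Callable[[S, Task], float],
    update: Callable[[S, List[Task]], S],
    keep_top_k: int,
) -> Tuple[S, List[List[Task]]]:
    """Run `rounds` iterations of the trace -> skill -> task -> update loop."""
    tasks = list(seed_tasks)
    history: List[List[Task]] = []
    for _ in range(rounds):
        # 1. Collect solving traces from the current Solver.
        traces = [solve(solver, task) for task in tasks]
        # 2. Distill traces into structured skills.
        skills = distill(traces)
        # 3. Generate targeted repair tasks guided by those skills.
        candidates = generate(skills)
        # 4. Keep only tasks that pass execution-based validation.
        valid = [task for task in candidates if validates(task)]
        # 5. Score by the (assumed pluggable) alignment reward, keep the top k.
        ranked = sorted(valid, key=lambda t: reward(solver, t), reverse=True)
        retained = ranked[:keep_top_k]
        history.append(retained)
        # 6. Update the Solver; the next round's traces come from it.
        solver = update(solver, retained)
        tasks = retained or tasks  # assumption: fall back if nothing survived
    return solver, history
```

Passing each stage in as a callable keeps the sketch neutral about the details the abstract leaves unspecified; a real system would supply an LLM-backed distiller and generator, a sandboxed test runner for validation, and a training step for `update`.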